Time To Get Started on Data Wrangling

The word from the hiring grapevine is that professionals with data wrangling know-how are in high demand right now. For data professionals, this development validates the growing importance of data science, as organizations turn to the power of data to discover new business insights, inform key decisions, and train machine learning models.

But how exactly does data wrangling fit into the picture?

Transforming and mapping data

Also known as data munging, data wrangling is the process of transforming and mapping data from one format into another. The aim is to make existing data better suited for downstream use, on the premise that insights are only as good as the data that informs them.

“Any analyses a business performs will ultimately be constrained by the data that informs them. If data is incomplete, unreliable, or faulty, then analyses will be too — diminishing the value of any insights gleaned,” explained marketing specialist Tim Stobierski in an HBS Online blog.

Data wrangling is a time-consuming and taxing endeavor, but the result is better, cleaner data. In organizations with a full data team, it might be done by a data scientist or another data professional; in smaller outfits, it may even fall to non-data professionals.

“It is easy to underestimate the difficulty of producing these analytic data structures, especially when an organization has invested heavily in its data infrastructure over many years,” cautioned Mike Thurber, lead data scientist at analyst firm Elder Research, noting that even analytic team leaders are often “astonished” by the estimates from data scientists of how long it can take to wrangle the data before modeling.

And lest one think of plucking data from an existing database or favorite dashboard to do some modeling themselves, Thurber says it just doesn’t work that way: “OLAP query answers questions for aggregates of transactions, whereas an analytical model must make a unique prediction for each transaction.”
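To make Thurber's distinction concrete, here is a minimal Python sketch. The records, field names, and the "share of customer spend" feature are all hypothetical and chosen only for illustration; they do not come from Thurber or any real system. The first function returns an aggregate per customer, the kind of answer an OLAP query gives, while the second produces one row per transaction, which is the shape a predictive model would need.

```python
from collections import defaultdict

# Hypothetical transaction records; field names are illustrative only.
transactions = [
    {"id": 1, "customer": "A", "amount": 120.0},
    {"id": 2, "customer": "A", "amount": 30.0},
    {"id": 3, "customer": "B", "amount": 75.0},
]


def aggregate_by_customer(rows):
    """OLAP-style answer: one total per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def per_transaction_features(rows):
    """Modeling-style table: one feature row per transaction."""
    totals = aggregate_by_customer(rows)
    features = []
    for row in rows:
        features.append({
            "id": row["id"],
            "amount": row["amount"],
            # Assumed example feature: this transaction's share of the
            # customer's total spend.
            "share_of_customer_spend": row["amount"] / totals[row["customer"]],
        })
    return features


summary = aggregate_by_customer(transactions)          # one entry per customer
model_rows = per_transaction_features(transactions)    # one entry per transaction
```

Even in this toy case, the aggregate collapses three transactions into two totals, so it cannot be fed to a model that must score each transaction individually. Building the per-transaction table is part of the wrangling work.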

Bad data is a real problem

A related aspect of the data wrangling process (some would argue an integral one) is data cleaning. Examples include:

  • Identifying gaps such as missing values
  • Cleaning up erroneous entries
  • Filtering out records that are not relevant to a specific analysis
  • Removing data points that are extreme outliers
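The four cleaning steps above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the order records, the rule that a non-positive amount counts as erroneous, the choice of region as the relevance filter, and the use of the median absolute deviation with a cutoff of five deviations to flag outliers. A real project would pick its own rules.

```python
from statistics import median

# Hypothetical raw order records; None marks a missing value.
raw_orders = [
    {"order_id": 1, "region": "SG", "amount": 42.0},
    {"order_id": 2, "region": "SG", "amount": None},    # missing value
    {"order_id": 3, "region": "SG", "amount": -15.0},   # erroneous entry
    {"order_id": 4, "region": "US", "amount": 55.0},    # not relevant here
    {"order_id": 5, "region": "SG", "amount": 38.0},
    {"order_id": 6, "region": "SG", "amount": 47.0},
    {"order_id": 7, "region": "SG", "amount": 51.0},
    {"order_id": 8, "region": "SG", "amount": 9000.0},  # extreme outlier
]


def clean_orders(rows, region, max_deviations=5.0):
    """Return (cleaned rows, ids of rows with missing amounts)."""
    # 1. Identify gaps: record which rows have no amount, then set them aside.
    missing_ids = [r["order_id"] for r in rows if r["amount"] is None]
    present = [r for r in rows if r["amount"] is not None]

    # 2. Drop erroneous entries (assumed rule: amounts must be positive).
    valid = [r for r in present if r["amount"] > 0]

    # 3. Keep only records relevant to this analysis (assumed: one region).
    relevant = [r for r in valid if r["region"] == region]

    # 4. Remove extreme outliers using the median absolute deviation (MAD).
    amounts = [r["amount"] for r in relevant]
    if not amounts:
        return [], missing_ids
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        # All values are (nearly) identical; nothing can be called an outlier.
        return relevant, missing_ids
    cleaned = [
        r for r in relevant
        if abs(r["amount"] - center) <= max_deviations * mad
    ]
    return cleaned, missing_ids


cleaned, missing_ids = clean_orders(raw_orders, region="SG")
```

Of the eight hypothetical records, only orders 1, 5, 6, and 7 survive all four steps, and order 2 is reported as a gap rather than silently dropped.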

But surely a small handful of erroneous data records can’t possibly skew the results too much with the weight of thousands or tens of thousands of records to analyze? Then again, one can still get a bad bout of food poisoning at a sumptuous buffet from just one bad dish, right? The same goes for the impact of bad data.

If anything, bad data has proven to be a major problem, especially when one considers how the global ML community poured resources into addressing the ongoing pandemic – and came up with very little that was usable.

As I noted in August, many of the AI-based tools built specifically to fight COVID didn’t work as expected due to the poor quality of the underlying data or incorrect assumptions made about it.

Proper data wrangling can take up a daunting amount of effort and time. Though its importance is not always apparent to non-data professionals, it is essential to the process. After all, the adage that analytics professionals spend as much as 80% of their time on data wrangling, and only 20% on exploration and modeling, didn’t come out of thin air.

Understanding the business use case

Part of the reason it takes so long is the need to clarify the use case, which is much harder than it sounds. Even a simple directive such as “Reward the top 10% of our customers with a voucher” for an e-commerce marketplace is not as straightforward as it might initially appear.

Should we consider (or ignore) multiple buyers from the same household? Should we include those who bought a lot a year ago but have stopped buying in the last six months? Should we favor a higher quantity of purchases versus one or two big-ticket items?

In addition: Should a customer with separate business and personal accounts be treated as the same person? And should a customer who initiates multiple returns be included – and if yes, should their returned purchases be excluded from consideration?
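To see why these questions matter, consider a minimal Python sketch with ten hypothetical customers. The purchase records, the cutoff date, and the three definitions of "top" are all assumptions made up for this illustration. The same ranking function is run three times, and each definition rewards a different customer.

```python
import math
from datetime import date

# Hypothetical purchases: (customer, purchase date, amount, returned?)
purchases = [
    ("c01", date(2023, 1, 10), 900.0, False),   # big spender, inactive since
    ("c02", date(2023, 11, 5), 400.0, False),
    ("c02", date(2023, 12, 1), 300.0, True),    # later returned
    ("c03", date(2023, 10, 20), 250.0, False),
    ("c03", date(2023, 12, 15), 250.0, False),
]
# Seven more customers with one modest recent purchase each.
purchases += [
    (f"c{i:02d}", date(2023, 11, 1), 100.0, False) for i in range(4, 11)
]


def top_customers(rows, fraction, since=None, net_of_returns=False):
    """Rank customers by spend and return the top `fraction` of them."""
    totals = {}
    for customer, when, amount, returned in rows:
        # Every customer stays in the population, even with no counted spend.
        totals.setdefault(customer, 0.0)
        if since is not None and when < since:
            continue  # purchase falls outside the recency window
        if net_of_returns and returned:
            continue  # returned purchases do not count
        totals[customer] += amount
    count = max(1, math.ceil(len(totals) * fraction))
    # Highest spend first; ties broken by customer id for a stable result.
    ranked = sorted(totals, key=lambda c: (-totals[c], c))
    return ranked[:count]


cutoff = date(2023, 7, 1)  # assumed start of the "last six months"
by_lifetime_spend = top_customers(purchases, 0.10)
by_recent_spend = top_customers(purchases, 0.10, since=cutoff)
by_recent_net_spend = top_customers(
    purchases, 0.10, since=cutoff, net_of_returns=True
)
```

With this made-up data, lifetime spend picks c01, recent spend picks c02, and recent spend net of returns picks c03. None of the three is wrong in itself; the business has to decide which one it means before any data is pulled.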

The list goes on. Clarifying the use case calls for understanding the business objectives, identifying relevant considerations, and getting the agreement of the team. Only then can our intrepid data professional start pulling the data from the right data repositories, wrangle that data using the agreed-on metrics, and perform the needed analysis.

More data wrangling ahead

Ultimately, the heightened demand for data wrangling know-how is proportionate to the surge in activity around data. Given the choice between assigning one of a limited number of data scientists to this task and having a much less expensive data professional perform it, it is easy to imagine the latter being the preferred choice.

Of course, not everyone can do data wrangling yet. For now, it still requires someone with the right combination of data skills, business acumen, and enough programming chops to write the scripts or work the tools to clean up the data. But as businesses continue to establish a data culture and democratize data access, expect more people to step into these shoes soon.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/UroshPetrovic