Dirty data is costing you: How to solve common data preparation issues
If you’ve ever analyzed data, you know the pain of digging into your data only to find that it is "dirty"—poorly structured, full of inaccuracies, or just plain incomplete. You’re stuck fixing the data in Excel or writing complex calculations before you can answer a simple question. Data preparation is the process of getting data ready for analysis, including data discovery, transformation, and cleaning tasks—and it’s a crucial part of the analytics workflow.
Even those who aren’t directly performing data preparation tasks feel the impact of dirty data. The amount of time and energy it takes to go from disjointed data to actionable insights leads to inefficient ad-hoc analyses and declining trust in organizational data. These slower processes can ultimately lead to missed opportunities and lost revenue. In fact, Gartner research indicates that the “average financial impact of poor data quality on organizations is $9.7 million per year.”1
Why dirty data happens
Enterprises are taking steps to overcome dirty data by establishing data catalogs and glossaries. But even with these practices in place, some dirty data is likely to seep through the cracks of day-to-day operations. Dirty data commonly stems from:
1. Human error
Human error is the most common cause of dirty data, according to Experian. From variability in data entry practices to employees manually typing values into spreadsheets, even a simple spelling error can pose challenges down the line when someone goes to analyze the data.
2. Disparate systems
Organizations often store data in several disparate systems with different structures, requirements, and aggregations. When it comes time to integrate this data, analysts are left with duplicate or missing fields and inconsistent labels. Fields or values might also share the same meaning but go by different names across systems.
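To make the integration problem concrete, here is a minimal sketch in pandas of aligning two hypothetical system extracts: renaming mismatched columns, mapping variant labels to one canonical value, and dropping rows duplicated across systems. All table names, columns, and values are illustrative, not from the original article.

```python
import pandas as pd

# Hypothetical extracts from two systems that label the same field differently
crm = pd.DataFrame({"customer_id": [1, 2], "region": ["N. America", "EMEA"]})
erp = pd.DataFrame({"cust_id": [2, 3], "sales_region": ["Europe/Middle East", "APAC"]})

# Align column names before combining the two sources
erp = erp.rename(columns={"cust_id": "customer_id", "sales_region": "region"})

# Map variant labels to one canonical value
label_map = {"N. America": "North America", "Europe/Middle East": "EMEA"}
combined = pd.concat([crm, erp], ignore_index=True)
combined["region"] = combined["region"].replace(label_map)

# Drop rows duplicated across systems, keeping the first occurrence
combined = combined.drop_duplicates(subset=["customer_id", "region"])
```

Real integrations involve far more fields and edge cases, but the pattern—rename, standardize labels, deduplicate—is the core of reconciling disparate systems.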
3. Changing requirements
As businesses evolve, data administrators and engineers need to make changes to the data—adjusting its granularity, deprecating fields, or introducing new ones as needed. Often analysts don't learn about these changes until they bring the data into a self-service BI or data prep tool.
Issue: Data preparation requires deep knowledge of organizational data
Before preparing data, it is crucial to understand its location, structure, and composition, along with granular details like field definitions. Some people refer to this process as “data discovery” and it is a fundamental element of data preparation. You wouldn’t start a long journey without a basic understanding of where you’re going, and the same logic applies to data prep.
The emergence of self-service BI has made data discovery easier for business users, providing them with a deeper knowledge of the existing structure and contents of their data sets. But because of information silos, these users often have less insight into the entire data landscape of their organization—what data exists, where it lives, and how it is defined. Confusion around data definitions, for example, can hinder analysis or worse, lead to inaccurate analyses across the company.
Visual, self-service data prep tools allow analysts to dig deeper into the data to understand its structure and see relationships between tables. Because they can understand the profile of their data, analysts can easily spot unexpected values that need cleaning. Although this technology brings clarity to the data, people will still need support from others in their company to understand details like field definitions.
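Spotting unexpected values by profiling a column, as described above, can be sketched in a few lines of pandas. The example data below is hypothetical—a messy status field with casing variants, a typo, and a missing value.

```python
import pandas as pd

# Hypothetical export with a messy status column
df = pd.DataFrame({"status": ["active", "Active", "inactive", "actve", None]})

# A quick frequency profile surfaces casing variants, typos, and nulls
profile = df["status"].value_counts(dropna=False)

# Null and distinct-value counts flag columns that need cleaning
null_count = df["status"].isna().sum()
distinct = df["status"].nunique(dropna=True)
```

Visual data prep tools perform this kind of profiling automatically, but the underlying idea is the same: summarize each field's values so anomalies stand out before analysis begins.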
Create a data dictionary: One way to standardize data definitions across a company is to create a data dictionary. A data dictionary helps analysts understand how terms are used within each business application, showing which fields are relevant for analysis versus the ones that are strictly system-based. Brian Davis, Project Engineer at an energy company, calls data dictionaries "invaluable."
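A data dictionary can start as something very simple. The sketch below shows one possible shape—field name, definition, source system, and an analysis-relevant flag—with entirely hypothetical entries; real dictionaries often live in dedicated catalog tools or shared documents instead of code.

```python
# A minimal data dictionary: each entry records a field's meaning and whether
# it is analysis-relevant or strictly system-based (all entries hypothetical)
data_dictionary = [
    {"field": "order_id", "definition": "Unique identifier for a sales order",
     "source_system": "ERP", "analysis_relevant": True},
    {"field": "row_version", "definition": "Internal concurrency token",
     "source_system": "ERP", "analysis_relevant": False},
    {"field": "net_revenue", "definition": "Order total after discounts, pre-tax, USD",
     "source_system": "Finance DW", "analysis_relevant": True},
]

# Analysts can filter down to the fields that matter for their work
analysis_fields = [e["field"] for e in data_dictionary if e["analysis_relevant"]]
```

Whatever the format, the value comes from the definitions being agreed upon and kept current—not from the tooling itself.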
Continue to iterate and innovate: Developing a data dictionary is no small task. Data stewards and subject matter experts need to commit to ongoing iteration, checking in as requirements change.2 If a dictionary is out of date, it can actually do harm to your organization's data strategy. Communication and ownership should be built into the process from the beginning to determine where the glossary should live and how often it should be updated and refined.
Issue: “Clean data” is a matter of perspective
Different teams have different requirements and preferences regarding what makes for “well-structured” data. For example, database administrators and data engineers prioritize how data is stored and accessed—and they may add columns that exist strictly for the database to use, not for humans. When an engineer builds a data warehouse specifically for analysis, they prioritize the core business metrics that answer the majority of questions. If the information that data analysts need isn’t already in the data set, they may need to adjust aggregations or bring in outside sources. This can lead to silos or inaccuracies in the data.
Cathy Bridges, Tableau Developer at SCAN Health Plan, explained how analysts often have to go back and update a data set that has already been cleaned by another team. “Bringing in additional columns can be a long and arduous process. For example, if I need totals versus breakout, I need to duplicate the data source—and it can be a pain.”
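The “totals versus breakout” situation Cathy describes can often be handled without duplicating the data source: totals can be derived on demand from the breakout rows. A minimal pandas sketch, using hypothetical plan-level enrollment data rather than any real SCAN Health Plan data:

```python
import pandas as pd

# Hypothetical breakout data: enrollment by plan within each region
breakout = pd.DataFrame({
    "region": ["West", "West", "East", "East"],
    "plan": ["HMO", "PPO", "HMO", "PPO"],
    "members": [120, 80, 200, 50],
})

# Totals derived from the same source—no duplicated data set needed
totals = breakout.groupby("region", as_index=False)["members"].sum()
```

Keeping one granular source and aggregating as needed is exactly the kind of restructuring self-service prep tools make routine.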
Self-service data prep gives analysts the power to polish data sets in a way that matches their analysis, enabling faster ad-hoc analyses and allowing them to answer questions as they arise. It also reduces the burden on IT to restructure the data whenever an unanticipated question comes up. And it cuts down on duplicated effort, because other analysts can reuse these models. If the data sets prove valuable on a wide scale, they can be combined into a canonical set in the future.