Data Quality Management for Machine Learning

Data is the fuel for our future and the lifeblood of an organization. However, to make decisions based on data, you need to be able to rely on its accuracy. Flaws in data can quickly lead to disastrous results. As a rule of thumb, poor-quality data will degrade a machine learning model’s performance (garbage in, garbage out – with no middle ground).

In this article, we explore the main aspects of data quality management, a critical component of data governance, for building successful machine learning models. Because models depend so heavily on their data, monitoring and managing incoming data is a high-priority, recurrent task. We also suggest some best practices for keeping data quality high throughout the whole process.

The Cost of Bad Data

Machine learning algorithms allow you to predict future situations based on historical data. Artificial intelligence models are steadily gaining trust and leading to great business outcomes in many domains. For example, machine learning models built for financial companies can identify fraudulent transactions, saving card issuers and banks up to an estimated $12 billion on fraud detection alone. However, such outcomes are only achievable if you have high-quality data to start with. Otherwise, they remain a far-fetched dream.

Operating on poor data is costly for companies and can have damaging side effects for their customers as well. Financially, MIT Sloan Management Review reported that companies lose between 15% and 25% of their revenue due to poor data. Moreover, IBM estimated the impact on the US economy alone at a staggering $3.1 trillion a year.

In fact, the impact of poor-quality data is not only financial. Highly paid professionals spend ~80% of their time on data quality (exploring, finding, analysing, cleansing, and organizing data). This leaves very little time for actually performing the analysis and building the models that provide the real value. Worse still, the work of cleansing data (often in Excel) is frequently repeated again and again as different workers deal with poor data quality in their own way – often leading to inconsistent results.

Apart from the financial and time losses, making decisions based on flawed data can have severe consequences. Governments may base policy decisions on the wrong data, with impacts affecting millions of people. Commercially, companies risk not only losing customers and revenue, but also significant damage to their brand.

A wider view from the data governance perspective can be found in our previous article, Data Governance for Advanced Analytics, Machine Learning & Automation.

Input Data Validation

The first axis of data quality is validating the input data. For each dataset, you should have a data schema: in simple terms, a collection of rules describing the expected values, properties and distributions of all features. Here are some best practices for developing a proper data schema:

  • Learn all possible values for categorical features. For example, in text analysis, get an idea of the out-of-vocabulary (OOV) percentage expected for your problem in production.
  • For numerical data, understand the acceptable range of values and the distribution of each feature. Also consider how likely these are to change in production, and allow a small margin for that.
  • Then, translate the above values and acceptable ranges into a set of explicit rules such as:
    • Check that values of the user rating feature are discrete integers between 1 and 5.
    • Check that the hours per week feature is in the range from 0 up to the maximum allowable work hours per week.
    • Categorical features, such as Sales Region, should have well-defined values, agreed across the organisation.
    • The shoe size feature probably follows a Gaussian distribution, so values should fall within an expected range around the mean, measured in standard deviations.
    • The price feature probably follows a Poisson distribution, so the values for a specific product should conform to it.

After setting up the data schema as suggested, you should be able to catch many types of errors and so maintain high-quality data: for example, anomalies, unexpected categorical values and unexpected data distributions.
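As an illustration, the rules above could be expressed as a small schema of per-feature checks. This is only a minimal sketch using the Python standard library: the feature names, the set of sales regions, the `MAX_WEEKLY_HOURS` bound and the three-standard-deviation threshold are all assumptions made for the example, not values prescribed by this article.

```python
# Hypothetical schema: each feature maps to a rule that returns True for a valid value.
MAX_WEEKLY_HOURS = 168  # assumed upper bound (hours in a week); use your own policy
SALES_REGIONS = {"north", "south", "east", "west"}  # assumed agreed-upon values

SCHEMA = {
    "user_rating": lambda v: isinstance(v, int) and 1 <= v <= 5,
    "hours_per_week": lambda v: isinstance(v, (int, float)) and 0 <= v <= MAX_WEEKLY_HOURS,
    "sales_region": lambda v: v in SALES_REGIONS,
}


def validate_row(row):
    """Return a list of human-readable schema violations for one record."""
    errors = []
    for feature, rule in SCHEMA.items():
        if feature not in row:
            errors.append(f"{feature}: missing")
        elif not rule(row[feature]):
            errors.append(f"{feature}: unexpected value {row[feature]!r}")
    return errors


def outliers_beyond_k_std(values, mean, std, k=3.0):
    """Distribution check: return values further than k standard deviations
    from the expected mean (mean and std come from the training data)."""
    return [v for v in values if abs(v - mean) > k * std]
```

For example, `validate_row({"user_rating": 7, "sales_region": "mars"})` reports three violations: an out-of-range rating, a missing `hours_per_week`, and an unknown region. In practice a dedicated validation library may be a better fit; the point here is only that every rule in the schema becomes an explicit, testable check.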

Engineered Data Validation

Engineering the data refers to the transportation, transformation, and storage of data. This data flow is commonly achieved by creating and maintaining data pipelines. Data is ingested, cleaned to fill in important gaps, transformed, and normalized to sensible values to feed the machine learning models.

The second axis of data quality is verifying that engineered data is of the expected quality. The validity of raw data does not necessarily imply the validity of engineered data. Engineered data is most likely different from the original data, and the pre-processing and processing stages can easily introduce errors. A good practice is to write unit tests for each of your engineered features. Make sure to write a high-coverage list of test cases that addresses not only valid inputs, but edge cases as well. Here are a few examples:

  • Check that scaled and normalized numeric features have only values in the range [0, 1].
  • Make sure that one-hot encoded categorical features have exactly one value set to 1 in the feature vector. All other values in the vector should be exactly zero.
  • Check that missing data values are correctly identified and replaced with the expected value (mean or default).
  • Test OOV handling for categorical values.
  • Check data distribution conformity for numeric features, e.g. features normalized using z-score should always have a mean of zero.
  • Test outlier handling according to the chosen procedure (clipping, scaling or quantization).
  • Feature crosses are synthetic features that encode nonlinearity in the feature space by multiplying two or more input features together. Test the values of these features using the same techniques as for numeric features.
  • Make sure to cover all features of every type with an acceptable number of test cases.
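A few of the checks above can be sketched as plain unit tests. The transformation functions below (`min_max_scale`, `one_hot`, `z_score`, `impute_mean`) are hypothetical stand-ins for whatever your pipeline actually uses, written with the standard library only. The trailing OOV slot in `one_hot` is one possible design assumed for the example, not the only way to handle unknown categories.

```python
import math
import statistics


def min_max_scale(values):
    """Scale values linearly into [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, vocabulary):
    """One-hot encode a value; unknown values go to a trailing OOV slot."""
    vec = [0] * (len(vocabulary) + 1)
    idx = vocabulary.index(value) if value in vocabulary else len(vocabulary)
    vec[idx] = 1
    return vec


def z_score(values):
    """Standardize values; assumes the input is not constant (std > 0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def impute_mean(values):
    """Replace missing values (None) with the mean of the present values."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]


def test_min_max_stays_in_unit_interval():
    scaled = min_max_scale([3, 10, -4, 7])
    assert all(0.0 <= v <= 1.0 for v in scaled)
    assert min_max_scale([5, 5, 5]) == [0.0, 0.0, 0.0]  # edge case: constant feature


def test_one_hot_has_exactly_one_hot_value():
    vocabulary = ["red", "green", "blue"]
    for value in ["red", "blue", "purple"]:  # "purple" is out of vocabulary
        vec = one_hot(value, vocabulary)
        assert sum(vec) == 1 and set(vec) <= {0, 1}
    assert one_hot("purple", vocabulary)[-1] == 1  # OOV lands in the last slot


def test_z_score_has_zero_mean():
    standardized = z_score([1.0, 2.0, 3.0, 10.0])
    assert math.isclose(statistics.fmean(standardized), 0.0, abs_tol=1e-9)


def test_missing_values_are_imputed_with_mean():
    assert impute_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]
```

The test functions follow the `test_` naming convention, so a runner such as pytest can collect them, and they can also be called directly.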

Data Splitting

A valid data split is the third axis of retaining high-quality data. It needs to provide near-complete coverage of all data slices. As a rule of thumb, your train and test sets should be as statistically similar to each other, and as representative of the full dataset, as possible. Here are some data splitting best practices:

  • Choose a suitable splitting ratio according to your problem. It does not necessarily have to be 80/20 or 70/30. Once chosen, fix it for any future re-training [1,2,3].
  • Ensure that your validation set mirrors the lag between training and serving (e.g. a model predicting stock prices that can only be trained at the end of the day, shouldn’t be validated against data from the training day).
  • Beware of low data volume. The data distributions may end up quite different between training, validation, testing and production.
  • If you have grouped data, whether as time series or clustered by other criteria, avoid random sampling when splitting. It can let the model learn from information that it would not necessarily have at prediction time. In this case, use domain knowledge to decide how to split your data.
  • Make sure that the test set is large enough to yield statistically meaningful results. This also makes your evaluation of the model less prone to high variance.
  • Be very careful when splitting imbalanced datasets. Assess, based on your problem, whether you need stratified sampling, or over- or under-sampling, to obtain representative splits.

Each time you re-train your model with new data, monitor your split properties. A good practice is to keep a versioned log of the split properties each time you perform a split. This way you can easily detect any divergence in the statistical and distributional properties of the data and raise a flag.
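One way to sketch both ideas, a split that keeps grouped records together and a log of split properties, is shown below. Hashing the group identifier is just one possible technique, chosen here because it keeps the assignment stable across re-training runs. The `customer`, `amount` and `label` field names and the logged properties are assumptions made for the example.

```python
import hashlib
import json
import statistics


def assign_split(group_id, test_fraction=0.2):
    """Deterministically assign a whole group to 'train' or 'test'
    by hashing its identifier into a bucket in [0, 1)."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


def split_by_group(rows, group_key, test_fraction=0.2):
    """Split rows so that all rows of a group end up in the same split."""
    splits = {"train": [], "test": []}
    for row in rows:
        splits[assign_split(str(row[group_key]), test_fraction)].append(row)
    return splits


def split_properties(rows, label_key, feature_key):
    """Summarize a split so that successive versions can be compared."""
    labels = [r[label_key] for r in rows]
    values = [r[feature_key] for r in rows]
    return {
        "n": len(rows),
        "positive_rate": sum(labels) / len(labels) if labels else None,
        "feature_mean": statistics.fmean(values) if values else None,
    }


if __name__ == "__main__":
    # Synthetic data: 50 hypothetical customers with 10 records each.
    rows = [{"customer": f"c{i % 50}", "amount": float(i), "label": i % 2}
            for i in range(500)]
    splits = split_by_group(rows, "customer")
    log_entry = {name: split_properties(part, "label", "amount")
                 for name, part in splits.items()}
    print(json.dumps(log_entry))  # append this line to a versioned log
```

Note that with hash-based assignment the realized test fraction only approximates `test_fraction`, especially when there are few groups, which is itself a good reason to log and inspect the split properties. For time series, a cutoff on the timestamp would replace the hash.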

Data Quality Monitoring

Data quality monitoring is the process of monitoring, ensuring and maintaining data quality standards across the data management system. It is not a one-time effort, but rather an ongoing process. A successful monitoring process starts by setting your own data quality metrics (e.g. Consistency, Accuracy, Completeness, Orderliness, Timeliness) plus some predefined key performance indicators (KPIs). The process then consistently monitors these KPIs and metrics and evaluates them against the configured gold-standard data quality criteria.

For an effective monitoring process, you should couple it with either a manual or an automatic reporting and alerting system. You can enhance your machine learning pipelines to save the output of the data validation process. Later, you can use previous reports to assess data quality over time and to allow for further analytics. In addition, quality monitoring allows for timely intervention to remediate data quality issues as required. Having versioned data quality reports can also help in identifying the source of erroneous data or of drops in prediction quality. You can then more easily reason about errors, isolating the factors that contribute to an issue in order to find a solution.
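A very small version of such a report might look like the sketch below. It computes two of the metrics mentioned above, compares them against thresholds, and produces a versioned record that a pipeline could store. The threshold values, the `validity` metric definition and the report layout are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

# Hypothetical KPI targets; set these from your own quality criteria.
THRESHOLDS = {"completeness": 0.95, "validity": 0.98}


def completeness(rows, required):
    """Share of required fields that are present and not None."""
    total = len(rows) * len(required)
    filled = sum(1 for r in rows for f in required if r.get(f) is not None)
    return filled / total if total else 1.0


def validity(rows, rules):
    """Share of present values that pass their validation rule."""
    checked = passed = 0
    for r in rows:
        for feature, rule in rules.items():
            value = r.get(feature)
            if value is None:
                continue  # missing values are counted by completeness instead
            checked += 1
            passed += bool(rule(value))
    return passed / checked if checked else 1.0


def build_report(rows, required, rules, version):
    """Build a versioned data quality report with a list of breached metrics."""
    metrics = {
        "completeness": completeness(rows, required),
        "validity": validity(rows, rules),
    }
    alerts = [name for name, value in metrics.items() if value < THRESHOLDS[name]]
    return {
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "rows": len(rows),
        "metrics": metrics,
        "alerts": alerts,
    }


if __name__ == "__main__":
    batch = [{"rating": 5, "region": "north"}, {"rating": 9, "region": None}]
    report = build_report(batch, ["rating", "region"],
                          {"rating": lambda v: 1 <= v <= 5}, version="example-1")
    print(json.dumps(report, indent=2))  # store alongside the pipeline run
```

Storing one such report per pipeline run gives you the history needed to compare data quality over time and to trace a drop in prediction quality back to a specific batch.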


In this article, we drew the attention of ML practitioners and enthusiasts to the importance of data quality management for machine learning. Data validation is important not only for input data, but also for the data engineering stages of machine learning pipelines. Integrating these practices into your machine learning pipelines enables you to build a data quality monitoring mechanism, which supports tracking how data quality evolves over time. In summary, data quality management must be implemented as a complete and integrated strategy throughout model development, testing and production. Monitoring and maintaining high-quality data is not a luxury. It is a must for meaningful modelling and for successful solutions to business problems.