Testing and debugging machine learning (ML) systems differs significantly from testing and debugging traditional software. We cover the main steps, from debugging your machine learning model all the way to monitoring your pipelines and testing in production. This will give you a firmer grip on models that have to deal with real-life scenarios, some of which can become critical.
In this article, we are mostly interested in tracking machine learning models after they have been created and evaluated for the first time. Model debugging naturally echoes software debugging: in both cases, you want to make sure that the system, here the ML model, works as expected.
Once the model is working, a continuous process of optimization begins, enhancing the model’s quality until it is production-ready. However, reaching production is not the finish line. It is only the beginning of a more challenging and more important phase of the model’s life: monitoring and testing in production.
Model Development & Debugging Processes
In order to easily debug your ML model, you should follow the model development best practices. The most important ones can be summarized as follows:
- Keep your model simple: start with the smallest possible number of features and add more incrementally. This way you can easily track the performance gain contributed by each feature, step by step.
- Keep track of all hyperparameter value ranges used throughout the process. This saves the time you would otherwise spend retrying hyperparameter values that are known not to work.
- Iteratively optimize your model by trying the following (individually or in combination) until you reach acceptable model accuracy:
  - Add features (individual features or feature crosses)
  - Tune hyperparameters
  - Add model complexity (more layers, more nodes per layer, etc.) slowly and iteratively
- After each step, check your model’s accuracy against the chosen evaluation metrics. Then you can debug the model to better understand its behaviour.
Assuming you understand how to debug software programs, the question now is how ML model debugging is different. One main difference is that ML models consist of both data and code. Also, a model’s poor quality is not necessarily caused by a bug, as it would be in a software program. In fact, root cause analysis in such situations requires a broader range of investigations. Poor quality may be due to bad feature choices, non-optimal hyperparameter values, bad data processing, poor-quality data, missing feature columns or a bug in the model’s creation pipeline. The model debugging process is challenging and time-consuming, given the long modelling cycles (iterations) and the diversity of error sources.
In this section, we discuss data debugging best practices that help explain the model’s performance. As a rule of thumb, poor-quality data will degrade the model’s performance. Thus, monitoring incoming data is a priority task.
In our previous article “Data Quality Management for Machine Learning”, we defined data quality management in detail. We showed how to keep your data at the highest standard at all times. Here, we will just list the main activities involved in data debugging:
- Validate the incoming input data against your defined data schema, i.e. the rules describing the expected values (e.g. ranges, distributions, vocabularies).
- Validate the engineered data (processed features). Sometimes the error is not in the input data, but comes later from the feature engineering step.
- Ensure that the data splits are unbiased and representative of the data distribution, and keep monitoring your split properties at all times.
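The schema validation step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `SCHEMA` dictionary, its feature names (`age`, `country`) and the `validate_example` helper are all hypothetical, and a production pipeline would more likely rely on a dedicated data validation library.

```python
import math

# Hypothetical schema: expected type plus an allowed range or vocabulary per feature.
SCHEMA = {
    "age": {"type": float, "min": 0.0, "max": 120.0},
    "country": {"type": str, "vocab": {"DE", "FR", "US"}},
}


def validate_example(example, schema=SCHEMA):
    """Return a list of human-readable schema violations for one example."""
    problems = []
    for name, rule in schema.items():
        value = example.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
        elif isinstance(value, float) and math.isnan(value):
            problems.append(f"{name}: NaN")
        elif "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value} below minimum {rule['min']}")
        elif "max" in rule and value > rule["max"]:
            problems.append(f"{name}: {value} above maximum {rule['max']}")
        elif "vocab" in rule and value not in rule["vocab"]:
            problems.append(f"{name}: {value!r} not in vocabulary")
    return problems


print(validate_example({"age": 34.0, "country": "DE"}))   # no violations
print(validate_example({"age": 150.0, "country": "XX"}))  # two violations
```

The same function can be applied to engineered features by describing them in a second schema, which covers the case where the error is introduced by feature engineering rather than by the raw input.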
You start debugging your model once you establish that the data is ready and of a quality that meets all predefined expectations. We will try to take the model debugging step by step in a logical order as follows.
Assessing Features Predictability
First, we need to check whether the data as is can predict the label. To check the predictive signal of the features with respect to the label, you can calculate correlation coefficients between the label and each individual feature. Features with high correlations are likely to carry predictive signal. However, non-linear relationships cannot be detected this way; for those, you may need to train and test the model (using some form of cross-validation) with and without the feature. Another idea is to pick a very small set of examples that the model can easily learn from, which simplifies debugging by reducing the opportunities for other bugs.
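As a small illustration of why correlation alone is not enough, the sketch below computes the Pearson correlation by hand for two toy labels. The data and the `pearson` helper are made up for this example; the point is only that a perfectly predictable but non-linear relationship can have a correlation of zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        return 0.0  # a constant column carries no linear signal
    return covariance / math.sqrt(variance_x * variance_y)


feature = [-3, -2, -1, 0, 1, 2, 3]
linear_label = [2 * x + 1 for x in feature]   # linear in the feature
quadratic_label = [x * x for x in feature]    # fully determined, but non-linear

print(pearson(feature, linear_label))     # close to 1: strong linear signal
print(pearson(feature, quadratic_label))  # close to 0: the signal is missed
```

The second label is a deterministic function of the feature, yet the coefficient is near zero, which is the case where training with and without the feature is the more reliable check.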
Defining a Baseline Model
The second step in debugging your model is to establish a baseline (or a few, depending on your problem). This step is crucial, because a baseline is how you verify that your model is performing at a given level. You should always try to maintain a simple baseline alongside your main model development process. There are some heuristics for building simple baseline models, such as:
- Use the most predictive feature on its own to predict the label.
- Try a simple linear model with no non-linear features.
- Use a baseline that always predicts the most common label (or the mean value, in the case of regression).
A good practice is to have a set of baselines with different levels of complexity. This helps you build a staged validation plan that shows where your model fits and where it is failing, so that you can pinpoint the problematic area.
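The simplest of these baselines, always predicting the most common label or the mean target, can be sketched as follows. The helper names and the toy labels are assumptions made for illustration only.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common_label


def mean_baseline(train_targets):
    """Return a regression predictor that always outputs the training mean."""
    mean_target = sum(train_targets) / len(train_targets)
    return lambda features: mean_target


def accuracy(predict, examples, labels):
    return sum(predict(x) == y for x, y in zip(examples, labels)) / len(labels)


def mean_squared_error(predict, examples, targets):
    return sum((predict(x) - y) ** 2 for x, y in zip(examples, targets)) / len(targets)


# Toy data; the baselines ignore the features, so placeholders are enough.
classifier = majority_class_baseline(["ham", "ham", "ham", "spam"])
baseline_accuracy = accuracy(classifier, [None] * 4, ["ham", "spam", "ham", "ham"])

regressor = mean_baseline([1.0, 2.0, 3.0])
baseline_mse = mean_squared_error(regressor, [None] * 2, [2.0, 4.0])

print(baseline_accuracy, baseline_mse)
```

Any model under development should beat these numbers on the same test split; if it does not, that is a strong hint that something in the data or the pipeline is broken.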
Devising Model Test Cases
Third, as with data debugging, building and running model-based test cases comes into play at this stage. Take a neural model as an example: it has a set of layers, and each layer has a set of neurons or units. You can have a set of test cases that validate the model architecture, for example:
- Check that the number of layers and units matches the original model design
- Check the model inputs for NaN values
- Identify inputs that are missing compared to your feature dictionary
- Confirm that the expected error messages and exceptions are raised in cases of failure
- Check which layers and/or units are returning NaNs; you can install probes on each layer and check them every epoch
- Track how your model behaves if you add the label as one of the features
- Check the model’s convergence time (you should not have to wait for days to identify a problem)
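A few of these test cases can be sketched without any ML framework. The tiny fully connected network below is purely illustrative: `DESIGN`, `build_model`, `architecture_matches` and `forward_with_probes` are hypothetical names, and in a real project the same checks would be written against your framework’s model object.

```python
import math

DESIGN = [4, 3, 1]  # hypothetical design: input width, then units per layer


def build_model(layer_sizes, weight=0.1):
    """Return one weight matrix per layer; layer[j][i] links input i to unit j."""
    return [
        [[weight] * n_in for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def architecture_matches(model, layer_sizes):
    """Check the number of layers, units per layer, and inputs per unit."""
    if len(model) != len(layer_sizes) - 1:
        return False
    return all(
        len(layer) == n_out and all(len(unit) == n_in for unit in layer)
        for layer, n_in, n_out in zip(model, layer_sizes, layer_sizes[1:])
    )


def forward_with_probes(model, inputs):
    """Run a tanh forward pass; return outputs and indexes of layers with non-finite outputs."""
    if len(inputs) != len(model[0][0]):
        raise ValueError(f"expected {len(model[0][0])} inputs, got {len(inputs)}")
    activations, bad_layers = list(inputs), []
    for index, layer in enumerate(model):
        activations = [
            math.tanh(sum(w * a for w, a in zip(unit, activations))) for unit in layer
        ]
        if not all(math.isfinite(a) for a in activations):
            bad_layers.append(index)
    return activations, bad_layers


model = build_model(DESIGN)

# Test case: the built model matches the design.
assert architecture_matches(model, DESIGN)

# Test case: clean inputs produce finite values in every layer.
_, clean_bad_layers = forward_with_probes(model, [1.0, 2.0, 3.0, 4.0])
assert clean_bad_layers == []

# Test case: a NaN input is reported by the probes on every layer it reaches.
_, nan_bad_layers = forward_with_probes(model, [1.0, 2.0, float("nan"), 4.0])
assert nan_bad_layers == [0, 1]

# Test case: a wrong input width raises the expected exception.
try:
    forward_with_probes(model, [1.0])
    raised = False
except ValueError:
    raised = True
assert raised
```

The probes use tanh rather than ReLU on purpose: in plain Python, `max(0.0, nan)` returns `0.0`, so a hand-written ReLU would silently hide the NaN that the test is trying to surface.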
The fourth milestone on your way to testing the model itself is validating and adjusting the hyperparameters. These are the knobs that you tweak between successive training runs. They are set before a training run rather than learned during it, which distinguishes them from the model’s trainable parameters. Each model type has its own set of model-specific hyperparameters, but we will only discuss a few of them:
- Batch Size: It is the number of examples in a mini-batch, which is 1 in the case of pure stochastic gradient descent (SGD). Try starting with values between 10 and 1000. Bear in mind that your available memory puts an upper bound on the batch size. On the other hand, a small batch size lets your gradients update more often per epoch, which can result in a larger decrease in loss per epoch and often generalizes better.
- Number of Epochs: Technically speaking, the minimum number of epochs is 1. You can always train for more, as long as your model does not start to overfit. There is no preset optimal number of epochs; it depends entirely on the problem and the data. However, you can follow an “elbow method” similar to the one used in clustering. Let’s say your priority is the loss function: make a line plot of the loss versus the number of epochs. The point at which the slope of the curve flattens to almost zero is the elbow point. Beyond it, additional epochs make only a very slight difference in the loss.
- Regularization: The rule here is to start by training your model without any regularization. Then, once you observe overfitting, start adding regularization. The choice of regularization method mostly depends on your model type:
  - For Linear Models: You can use L1 regularization if you would like to decrease your model’s size, as it drives some of the weights to exactly zero. For more model reproducibility and stability, you can use L2 regularization. Start with a small regularization rate, e.g. 1e-5, and tune through experimentation.
  - For Non-Linear Models: By these we mean deep neural models, where regularization is typically done using the dropout technique. The idea is somewhat similar to L1 regularization: a fixed percentage of the neurons in a layer, chosen at random, is dropped in each gradient step. The most common dropout rates fall between 10% and 50%, so tune within this range. You can find more techniques for determining the optimum rate, and its relation to batch normalization, here.
- Learning Rate: The gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar value known as the learning rate (aka step size) to determine the next point. Large values may cause the update to overshoot and miss the minimum of the loss, while small values may take a very long time to converge. The learning rate is typically initialized differently by different ML libraries, depending on the optimizer’s type. For example, TensorFlow’s canned Estimators use Adagrad with a learning rate of 0.05 by default, while AdamOptimizer defaults to 0.001. These values usually work, or are at least a good starting point. If they do not help your model converge, you can apply a hyperparameter tuning process with values ranging from 0.0001 to 1.0, moving the value up or down on a logarithmic scale within this range until your model converges. Note that the more complex the problem, the more training time it needs to converge. Therefore, you first need to assess the complexity of your problem and your model to establish when you should expect to see convergence, and only then go down this path.
- Model Depth & Width: The depth refers to the number of layers, while the width refers to the number of neurons per layer. Increasing either of these two parameters naturally increases the model’s complexity. Start with a shallow model (1-2 layers), then linearly increase its depth. The widths of the input and output layers are problem dependent, and each hidden layer typically has fewer neurons than the layer before it. Both the width and the depth of the model need tuning to reach their optimum values for your problem.
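To make the learning rate advice concrete, here is a minimal sketch of a logarithmic sweep over the range 0.0001 to 1.0. It uses plain gradient descent on a one-dimensional quadratic loss, which is an assumption chosen so that the behaviour is easy to reason about; a real sweep would retrain your actual model for each value.

```python
def train(learning_rate, steps=100, start=0.0, target=3.0):
    """Minimize (w - target)^2 with plain gradient descent; return the final loss."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - target)
        w -= learning_rate * gradient
    return (w - target) ** 2


# Candidate values spaced on a logarithmic scale.
learning_rates = [0.0001, 0.001, 0.01, 0.1, 1.0]
final_losses = {lr: train(lr) for lr in learning_rates}
best_learning_rate = min(final_losses, key=final_losses.get)

for lr in learning_rates:
    print(f"lr={lr:<7} final loss={final_losses[lr]:.6f}")
print("best:", best_learning_rate)
```

On this toy problem the smallest rates barely move the loss within 100 steps, 0.1 converges, and 1.0 bounces between two points without ever reducing the loss, which mirrors the "too small" and "too large" failure modes described above.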
Model Performance Metrics
In this section, we discuss the main quantitative values that tell you how your model is behaving and whether something is wrong. These are the model metrics, the ones used in the model’s evaluation.
Evaluation Performance Metrics
In this article, we will only go through some of the most commonly used metrics:
- Loss: It is a number indicating how bad the model’s prediction was on a single example compared to the expected value (e.g. squared loss, cross-entropy, hinge loss). The perfect value is zero, and the larger the loss, the worse the prediction. The goal of training a model is then to find a set of weights that leads to the lowest loss on average across all examples. The loss is therefore the key value to monitor for model quality. In the section on debugging loss curves, we will go through some common curve shapes and explain what might have gone wrong.
- Accuracy: It is the metric reflecting the fraction of predictions that your model got right. This metric may not be informative enough, and can even be misleading, for example when you have class imbalance in your dataset.
- Recall/Precision/F1/AUC: In classification problems, you need more explainability for your results than accuracy alone provides, and thus more metrics. Precision attempts to answer “What proportion of positive identifications was actually correct?”, while recall (also called sensitivity) attempts to answer “What proportion of actual positives was identified correctly?”. These two metrics, along with the F1 score that summarizes them as their harmonic mean, complement the story told by accuracy. On the other hand, Area Under the Curve (AUC) provides an aggregate measure of performance across all possible classification thresholds. It is the probability that the model ranks a random positive example more highly than a random negative example. AUC is desirable for the following two reasons: 1) Scale invariance: it measures how well predictions are ranked rather than their absolute values. 2) Classification threshold invariance: it measures the quality of the model’s predictions irrespective of which classification threshold is chosen.
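The definitions above translate directly into code. The sketch below computes precision, recall, F1 and AUC from scratch on a small made-up set of labels and scores; AUC is computed literally as the fraction of positive/negative pairs that the model ranks correctly, counting ties as half. In practice you would normally use your ML library’s metric functions instead.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(labels, scores, threshold=0.5):
    tp, fp, fn, _ = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]

precision, recall, f1 = precision_recall_f1(labels, scores)
print(precision, recall, f1, auc(labels, scores))
```

Note that `auc` never looks at a threshold and only compares scores with each other, which is exactly what makes it threshold invariant and scale invariant.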
Unlike binary classification, multi-class classification poses a challenge in model quality tracking. If you have a small number of classes, you can track the per-class evaluation metrics. If you have a large number of classes, you can average the per-class evaluation metric values instead. If you have one or more important or priority classes, track them individually (e.g. in image classification, if your main goal is to identify people, then you specifically track the quality of predicting this class in comparison to other objects).
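A minimal sketch of these three views (per-class, averaged, and priority class) is shown below, using recall as the metric. The class names and predictions are invented for the example, and the unweighted average used here is commonly called a macro average.

```python
from collections import defaultdict


def per_class_recall(labels, predictions):
    """Recall for each class: correct predictions divided by actual occurrences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction in zip(labels, predictions):
        total[label] += 1
        correct[label] += int(label == prediction)
    return {name: correct[name] / total[name] for name in total}


labels = ["person", "person", "car", "car", "car", "dog"]
predictions = ["person", "car", "car", "car", "car", "dog"]

recalls = per_class_recall(labels, predictions)
macro_recall = sum(recalls.values()) / len(recalls)  # unweighted average over classes
priority_recall = recalls["person"]                  # the class we care most about

print(recalls, macro_recall, priority_recall)
```

Here the averaged recall looks healthy while the priority class is recognized only half of the time, which is why important classes deserve their own tracked metric.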
The overall model evaluation performance can be deceptive, masking problems underneath. These can surface in production as model degradation. Looking more closely, you may find that the model is not performing as well on a subset of your data. This can happen when the training set does not represent all subsets of the data, for example when you do not have enough data for a specific geographic location. To help address this issue, you need to separately monitor underrepresented data slices (either discovered or known in advance from the problem domain). You therefore have two sets of metrics to track and debug: the overall metrics and the sliced ones. Together they give you a holistic view of your performance, which makes debugging and testing your model easier.
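Tracking sliced metrics alongside the overall ones can be as simple as grouping evaluation records by a slice key. In this sketch the records, the `region` field and the `accuracy_by_slice` helper are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Accuracy computed separately for each value of the slice key."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for record in records:
        name = record[slice_key]
        counts[name] += 1
        hits[name] += int(record["label"] == record["prediction"])
    return {name: hits[name] / counts[name] for name in counts}


# Toy evaluation set: a well-represented slice and an underrepresented one.
records = (
    [{"region": "EU", "label": 1, "prediction": 1}] * 8
    + [{"region": "APAC", "label": 1, "prediction": 0}] * 2
)

overall_accuracy = sum(r["label"] == r["prediction"] for r in records) / len(records)
sliced_accuracy = accuracy_by_slice(records, "region")

print(overall_accuracy)
print(sliced_accuracy)
```

The overall accuracy of 0.8 hides the fact that every example in the smaller slice is misclassified, which only becomes visible in the sliced view.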
Real-World Performance Metrics
The model metrics discussed so far may not automatically relate to the real-world impact of the model on your audience. In other words, how does improving the accuracy improve your users’ experience? Sometimes, higher model accuracy does not improve business results. You need to clearly differentiate between model metrics and real-world metrics. Therefore, you need to define real-world metrics appropriate to your problem that, along with the model metrics, define your success. They can also help you debug your model’s performance across iterative runs or when you find degradation. You can then cross-reference the model metrics with the real-world ones and study the issue in depth.
Debugging Loss Curves
Loss curves can be quite challenging to interpret; you do not always get a smooth curve that decreases steadily with each iteration. In reality, there can be lots of ups and downs every time you train or re-train your model. So what is the deal with these fluctuations? What might have gone wrong? Can we fix it? We will try to answer some of these questions from a general perspective. We will present some example loss curve plots and propose a few probable causes to help you debug your model.
In Figure 1, your model is not converging, as you can see from the loss oscillating up and down. Try debugging the following:
- Check for data schema skew: you may be receiving data that does not comply with the expected schema.
- Check for data value skew: there may be some significant change in the statistical properties of the data.
- The learning rate may be too large, causing such erratic jumps.
- Simplify your model and make sure that it trains properly on a smaller scale, then compare it to a baseline, then incrementally add complexity.
In Figure 2, the model was behaving normally up to a certain point then surprisingly the loss started increasing. Try debugging the following:
- The most obvious reason is that something went wrong in the loss calculation, e.g.:
  - NaN values, either in the inputs or produced by intermediate operations
  - Exploding gradients
  - Division by zero
- Check for anomalous data in your batches, and whether the anomalies are concentrated in specific batches. You may need to remove them, or distribute them evenly between batches to minimize their effect.
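A lightweight guard that runs on every training step can catch most of these failure modes early. The sketch below is framework-agnostic and works on plain lists of floats; `check_training_step` and the `max_gradient_norm` threshold of 100.0 are assumptions for illustration, and the right threshold depends on your model.

```python
import math


def non_finite_indexes(values):
    """Positions of NaN or infinite entries."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def check_training_step(batch_values, loss, gradients, max_gradient_norm=100.0):
    """Return warnings for one training step; an empty list means nothing suspicious."""
    warnings = []
    bad_inputs = non_finite_indexes(batch_values)
    if bad_inputs:
        warnings.append(f"non-finite inputs at positions {bad_inputs}")
    if not math.isfinite(loss):
        warnings.append("non-finite loss")
    if non_finite_indexes(gradients):
        warnings.append("non-finite gradients")
    else:
        norm = math.sqrt(sum(g * g for g in gradients))
        if norm > max_gradient_norm:
            warnings.append(f"gradient norm {norm:.1f} exceeds {max_gradient_norm}")
    return warnings


print(check_training_step([0.1, 0.2], 0.5, [0.3, 0.4]))
print(check_training_step([0.1, float("nan")], float("nan"), [float("nan")]))
print(check_training_step([0.1], 0.5, [300.0, 400.0]))
```

Logging the batch index together with these warnings makes it easier to tell whether the problem comes from a few anomalous batches or from the model itself.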
In Figure 3, we can see the training loss values and the testing loss values diverging. Try debugging the following:
- Your model seems to be overfitting; you can try to add regularization and/or make the model less complex.
- You can check the training and testing data splits for bias or misrepresentation of different classes.
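One simple way to flag this pattern automatically is to compare the two loss histories after each epoch. The heuristic below is only a sketch: the `overfitting_report` helper, the `patience` of two epochs and the loss values are all made up, and other criteria are equally reasonable.

```python
def overfitting_report(train_losses, validation_losses, patience=2):
    """Flag overfitting when validation loss has not improved for `patience` epochs."""
    best_epoch = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    epochs_since_best = len(validation_losses) - 1 - best_epoch
    final_gap = validation_losses[-1] - train_losses[-1]
    return {
        "best_epoch": best_epoch,
        "final_gap": final_gap,
        "overfitting": epochs_since_best >= patience and final_gap > 0,
    }


train_losses = [1.0, 0.7, 0.5, 0.4, 0.3, 0.2]
validation_losses = [1.1, 0.8, 0.65, 0.7, 0.8, 0.95]

report = overfitting_report(train_losses, validation_losses)
print(report)
```

In this example the training loss keeps falling while the validation loss bottoms out at epoch 2 (counting from zero), so the model from that epoch is the natural candidate to keep.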
In Figure 4, we can see a training loss that does not agree with the evaluation and/or real-world metrics on the testing sets.
Since your training loss function is not the exact metric you evaluate on, you should not expect a good training loss to always translate into good evaluation quality. In Figure 4.b, we can see an example of an evaluation metric, recall, showing a very low value. This may be due to the choice of classification threshold (0.5 is a common default). It can also be due to class imbalance in your dataset. To deal with this case, you can either:
- Change the classification threshold (usually lower it).
- Use a threshold-invariant metric like AUC instead of precision/recall.
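The effect of lowering the threshold is easy to demonstrate on a toy imbalanced dataset. The labels, scores and the `precision_recall_at` helper below are invented for this illustration.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
    fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
    fn = sum(1 for l, s in zip(labels, scores) if s < threshold and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Imbalanced toy data: 3 positives, 7 negatives, and rather timid scores.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.35, 0.6, 0.3, 0.2, 0.1, 0.4, 0.05, 0.15, 0.25]

default_precision, default_recall = precision_recall_at(labels, scores, 0.5)
lowered_precision, lowered_recall = precision_recall_at(labels, scores, 0.3)

print(default_precision, default_recall)
print(lowered_precision, lowered_recall)
```

Lowering the threshold from 0.5 to 0.3 recovers all positives here, at the price of lower precision, so the right threshold depends on which kind of error is more costly for your problem.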
In Figure 5, we can see a recurring, fluctuating pattern in the training loss curve over time.
This may be due to data with specific statistical properties being clustered in a few batches, which means that the data are poorly distributed across batches. Ensure that your data are appropriately shuffled. Also, ensure that batch normalization is working correctly by debugging the data in the different batches.
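The sketch below shows how batches look with and without shuffling when the dataset happens to be sorted by label. The `make_batches` helper and the toy data are assumptions for illustration; most ML libraries provide their own shuffling options for input pipelines.

```python
import random


def make_batches(examples, batch_size, shuffle, seed=0):
    """Split examples into batches, optionally shuffling them first."""
    examples = list(examples)
    if shuffle:
        random.Random(seed).shuffle(examples)  # seeded for reproducible debugging
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def positive_rate(batch):
    """Fraction of positive labels in a batch of (id, label) pairs."""
    return sum(label for _, label in batch) / len(batch)


# A dataset sorted by label: all negatives first, then all positives.
examples = [(i, 0) for i in range(8)] + [(i, 1) for i in range(8, 16)]

sorted_batches = make_batches(examples, 4, shuffle=False)
shuffled_batches = make_batches(examples, 4, shuffle=True)

print([positive_rate(b) for b in sorted_batches])
print([positive_rate(b) for b in shuffled_batches])
```

Without shuffling, each batch contains a single class, so the loss swings every time the class changes. Printing a simple per-batch statistic like this is a quick way to confirm whether batch composition explains a periodic pattern in the loss curve.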
In this article, we wanted to draw the attention of ML practitioners and enthusiasts to one of the most important and yet least structured activities in the ML process: model debugging and testing before, during and after production. We started by discussing the importance of ML model debugging and testing and how the process differs from that of traditional software. We demonstrated some of the best practices for the different model development and debugging activities. Moreover, we explained the main differences between data debugging and model debugging and where each fits in the debugging activity. In addition, we covered baseline models, feature predictability, hyperparameters and loss curve analysis. We also gave special attention to the difference between evaluation metrics and real-world performance metrics. In our next article, we will discuss the very interesting topic of debugging and testing in production.