Validating and Optimizing Machine Learning Models
- Teaching
- January 31, 2024
Joel Kowalewski, PhD
In my classes, the distinction between Cross-Validation and a Training-Validation-Test split is a common topic. There is often confusion about why these different validation approaches exist and when each is applied. To help clarify, let's get into the philosophy and then examine a simplified case.
It may seem strange to talk about philosophy in the context of math. By philosophy I do not mean Socrates or Plato, but the assumptions that analysts make (often implicitly) when applying mathematical and statistical principles to solve problems. A statistician is interested in a set of data: reporting on the trends in that data and whether those trends are attributable to chance.
In Machine Learning (ML), the goal is not necessarily to interpret trends in data. If those trends exist, then the data can be used to make accurate predictions about the future or, to be more exact, about new data. Accordingly, the ML analyst assigns the data a special name, "training data," because the goal is to build models that predict new, arbitrary datasets; the goal is to make predictions. To make the distinction clearer, a traditional statistical analyst is interested in drawing conclusions by making observations in real life and estimating their rarity. The statistician might report: "Patients in Group A responded favorably to the drug treatment compared to the placebo or control, Group B."
The ML or AI analyst wants to say something about the data but does so by simulating how successful a predictive model will be. For us, the question is: how should we train a model so that we can accurately estimate its long-term predictive success? One option is to split the training data, 80/20 or 70/30. (1) We train an AI/ML model on the larger split. (2) We then predict the test data. (3) Finally, we evaluate the performance and report the results, much as in the statistical example above. Someone asks us, "How do we know the model will work?" We reply, "This is how," showing or reporting the results of the evaluation and arguing that the conclusion is supported by the data and our validation approach. In some sense, our validation approach is like an experiment, even if it is less familiar than experiments like the one above comparing drug treatments and placebos. Still, much like with faulty experimental designs, we may discover errors in our validation approach that make the model's predictions seem better or worse than its real-life performance. We will explore this topic further in this article.
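To make this concrete, here is a minimal sketch of that split-train-predict-evaluate sequence. It assumes scikit-learn and uses a small synthetic dataset; the k-Nearest Neighbors regressor is just a placeholder model (we will return to it in the case study below).

```python
# A minimal sketch of the workflow: split, train, predict, evaluate.
# Assumes scikit-learn; the data and the KNN model are purely illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# A stand-in for "training data": 100 rows, 5 features.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Split 80/20: 80 rows to learn from, 20 rows held out as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = KNeighborsRegressor(n_neighbors=3)        # (1) train an ML model
model.fit(X_train, y_train)
predictions = model.predict(X_test)               # (2) predict the test data
print("Test R2:", r2_score(y_test, predictions))  # (3) evaluate the performance
```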
Animation: In Machine Learning (ML), a subset of data is removed from the Training Data and is named “Test Data.” This helps simulate the ML model’s performance in real life because the model must ultimately predict data it has not seen before.
Let’s say we trained the model and the evaluation was strong. Despite the good evaluation performance, the model completely failed when we actually used it. What happened?
Here, we would revisit the problem. Logically, it seems that the metric we used to evaluate the performance must have been "biased." In this context, "bias" means there are influences that should temper our interpretation of the metric, and we can understand this as potentially poor experimental (validation) design. The metric, in other words, is not an accurate indicator of reality, and therefore there are caveats that must be acknowledged.
On the Hunt for Bias
Bias is introduced in many ways (e.g., improper processing before the train-test split, as well as reusing the data for multiple estimation problems such as tuning the parameters of an algorithm or selecting the type of algorithm for a given prediction task). These circumstances promote data leakage, a term that refers to violating the assumed independence between the training and testing sets such that the evaluation is no longer a realistic simulation, one where the model genuinely knows nothing about the data it must predict.
Common examples of leakage include: (1) computing transformations, like scaling the data, prior to splitting it into separate training and testing sets; here, all of the data is used to compute the transformations, violating the independence assumption when the data is finally split. And (2) using the test set instructively. For example, the test set evaluation could have been used to guide changes to our model, improving it. We keep changing the model, improving the test set performance, and we assume that we have objectively improved the model. This is a magic trick, however. Look carefully and you will recognize that we violated the independence assumption when we adjusted the model's parameters relative to the test data. That is leakage once again.
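To illustrate the first case, here is a small sketch, assuming scikit-learn and throwaway random data, of the leaky version of scaling versus the leak-free version.

```python
# Sketch of leakage source (1): scaling before vs. after the train-test split.
# Assumes scikit-learn; the data here is random and purely illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 5))

# Leaky: the scaler is fit on ALL rows, so the test rows influence the
# mean/std used to transform the training rows, and only then is the data split.
# X_scaled = StandardScaler().fit_transform(X)

# Leak-free: split first, fit the transformation on the training rows only,
# and apply that same fitted transformation to the test rows.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```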
How do we reduce bias caused by leakage? One common method is to introduce a third dataset into our evaluation strategy. Here, we ensure that there is another set of data, the validation set, which is specifically for instruction; that is, it is there to help guide or improve the model. Now, what we were calling the test set previously is set aside for later, as it should never be instructive. Once we are confident in the model, we report the test performance, resisting the temptation to make any further modifications. Whatever the model is, that's what it's going to be. If it fails to meet expectations, the analyst must assume there are more fundamental issues at work, such as noisy data that does not accurately represent reality. Data is, after all, just a snapshot. Some snapshots are deceptive.
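A minimal sketch of carving out that third set, assuming scikit-learn and a synthetic 100-row dataset; the particular split sizes are an assumption for illustration, not a rule.

```python
# Sketch of a Train-Validation-Test split. Assumes scikit-learn; the 60/20/20
# proportions are only an example.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Remove the test set first; it is never instructive and is only used once.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training and validation sets; the validation set
# is the instructive one that guides changes to the model.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60, 20, 20
```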
Cross-Validation (CV): Where does that fit in?
Throughout the training phase (or within it), we also use a strategy called Cross-Validation (CV). CV is a method that helps us simulate splitting the data without using our actual, initial evaluative splits. In other words, we use CV to approximate performance on the actual validation/test sets. Importantly, we pretend that we don't have access to these sets, so we must do the next best thing: use the training data alone, repeatedly splitting it. We use CV to identify the importance of some features over others in improving predictions, which helps reduce the size and complexity of the final model. We also use CV to select the values for the hyperparameters, which adjust the model's fit on the data. We typically never use CV to report that our model is successful (or will be), unless we have very limited data and the goal is to use that model to predict new data, verify those predictions in reality, and then add that data to the original training set to build a better model.
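As a sketch, and assuming scikit-learn, this is what CV on the training data alone looks like; the held-out test set never enters the picture.

```python
# Sketch of 5-fold CV computed only on the training split. Assumes scikit-learn.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Repeatedly split the training data; each score simulates predicting unseen data.
scores = cross_val_score(KNeighborsRegressor(n_neighbors=3),
                         X_train, y_train, cv=5, scoring="r2")
print("Per-fold R2:", scores, "| Mean:", scores.mean())
```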
Case Study:
Let's put it all together in a simple, illustrative case of how validation works. Here, we will use the two validation approaches (Train-Test split and CV) I discussed above to build an optimally predictive model. I will simplify this by not using 3 datasets, just a single train-test split (80/20) of our original dataset. We will keep the dataset simple, with 100 rows or instances. As such, the first thing we'll do is remove 20% for testing. Now, we have 80 instances or rows that we would like to divide or partition using, for simplicity, 5-fold cross-validation. That means we have 80/5 = 16 instances in each of folds 1-5.
They are as follows:
Fold 1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
Fold 2: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
Fold 3: 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48
Fold 4: 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64
Fold 5: 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80
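As an aside, the same contiguous folds can be produced programmatically; here is a short sketch, assuming scikit-learn's KFold (which the case study below relies on), with the rows numbered 1 through 80.

```python
# Sketch: 5-fold partition of 80 training rows into the contiguous folds above.
# Assumes scikit-learn; shuffle=False keeps the rows in their original order.
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(1, 81)  # the 80 training instances, numbered 1..80
for fold_number, (_, fold_idx) in enumerate(KFold(n_splits=5, shuffle=False).split(rows), start=1):
    print(f"Fold {fold_number}: {rows[fold_idx]}")
```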
Our algorithm is k-Nearest Neighbors (KNN). We will assume that the implementation is from Scikit-Learn and the programming language is Python. We will tune n_neighbors with a list containing the values [3, 5], and our code is fundamentally pseudo or merely Python-like (a runnable sketch follows the pseudocode).
Now, create a loop that cycles through the list:
OUTER LOOP
FOR n in [3,5]:
(1) Instantiate the model with the neighbors value. Assuming we are on iteration 1, this is 3.
(2) Fit the model on the folds using the strategy below:
INNER LOOP
FOR i in iteration 1,2,3,4,5:
Iteration 1. Training Set: Fold 2, Fold 3, Fold 4, Fold 5; Validation Set: Fold 1. Score the model (e.g., R2).
Iteration 2. Training Set: Fold 1, Fold 3, Fold 4, Fold 5; Validation Set: Fold 2. Score the model (e.g., R2).
Iteration 3. Training Set: Fold 1, Fold 2, Fold 4, Fold 5; Validation Set: Fold 3. Score the model (e.g., R2).
Iteration 4. Training Set: Fold 1, Fold 2, Fold 3, Fold 5; Validation Set: Fold 4. Score the model (e.g., R2).
Iteration 5. Training Set: Fold 1, Fold 2, Fold 3, Fold 4; Validation Set: Fold 5. Score the model (e.g., R2).
Exit the inner loop and average the scores for iterations 1-5.
NEXT (get the next value from the list in the loop. At this step, we are now ready to redo everything above for a KNN model with n_neighbors set at 5).
FINALLY, we have performance estimates for 2 KNN models (that is, we have two R2 values): one R2 from a KNN model when n_neighbors was set at 3 and another from when it was set at 5. Compare the R2 values. Refit the model on the 80 training instances with n_neighbors set to the best-performing value (3 or 5). Take that trained model and predict the test set, which contains the 20 instances we removed at the start of this exercise.
DONE
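Here is a runnable version of the pseudocode above: a sketch assuming scikit-learn, with a synthetic 100-row dataset standing in for our data.

```python
# Runnable sketch of the case study: 80/20 split, 5-fold CV to tune n_neighbors,
# refit on all training rows, then a single evaluation on the held-out test set.
# Assumes scikit-learn; the dataset is synthetic and purely illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Remove 20% for testing; 80 instances remain for cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

kfold = KFold(n_splits=5, shuffle=False)
mean_scores = {}

# OUTER LOOP: one pass per candidate n_neighbors value.
for n in [3, 5]:
    fold_scores = []
    # INNER LOOP: iterations 1-5, each holding out one fold for validation.
    for train_idx, val_idx in kfold.split(X_train):
        model = KNeighborsRegressor(n_neighbors=n)
        model.fit(X_train[train_idx], y_train[train_idx])
        # Score on the held-out fold (R2 is the default score for regressors).
        fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    # Exit the inner loop: average the scores for iterations 1-5.
    mean_scores[n] = np.mean(fold_scores)

# FINALLY: pick the best-performing value, refit on all 80 training instances,
# and predict the 20 test instances that were removed at the start.
best_n = max(mean_scores, key=mean_scores.get)
final_model = KNeighborsRegressor(n_neighbors=best_n).fit(X_train, y_train)
print("Mean CV R2 per n_neighbors:", mean_scores)
print("Best n_neighbors:", best_n, "| Test R2:", final_model.score(X_test, y_test))
```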
Scaling Up: Let's Think About a 3-Dataset Validation Case (Train-Validation-Test).
Had we created 3 sets at the start, we would not have 80/5 = 16 instances in each fold. The number would be smaller; we might only have 70 training instances, giving 70/5 = 14 instances in each fold. We would then reach the "FINALLY" step above and simply rename the test set a validation set. This gives us the freedom to revisit the model should we be unhappy with the performance. Let's say we, hypothetically, make some modifications. Did we improve the validation set performance? Maybe we did. Okay, then, let's move on to the third and final set, the test set. That score is our final expectation of the model's future success.
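To close, a sketch of this three-dataset workflow end to end, assuming scikit-learn, a synthetic 100-row dataset, and a 70/10/20 split (the validation-set size of 10 is my assumption here, chosen so that 70 training rows remain).

```python
# Sketch of the Train-Validation-Test workflow combined with CV-based tuning.
# Assumes scikit-learn; the data and the 70/10/20 split are illustrative.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Hold back the test set first; it is only touched once, at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=20, random_state=0)
# Carve a validation set out of what remains, leaving 70 rows for training.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=10, random_state=0)

# Tune with 5-fold CV on the 70 training rows (14 rows per fold).
cv_means = {n: cross_val_score(KNeighborsRegressor(n_neighbors=n),
                               X_train, y_train, cv=5, scoring="r2").mean()
            for n in [3, 5]}
best_n = max(cv_means, key=cv_means.get)

# The validation set is the instructive set: check it, revise if unhappy.
model = KNeighborsRegressor(n_neighbors=best_n).fit(X_train, y_train)
print("Validation R2:", model.score(X_val, y_val))

# Only once we stop modifying the model do we report the test performance.
print("Test R2:", model.score(X_test, y_test))
```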