Reading time: 20 minutes
Cross-validation is a procedure used to evaluate a machine learning model on a limited sample of data. With its help, we can estimate how well the model will perform on unseen data.
In this article we will explain cross-validation using the most commonly used method, known as "k-fold cross-validation".
Here is the procedure that is generally followed during cross-validation:
- Shuffle your dataset randomly and split it into k equal-sized groups (folds).
- For each fold, take that fold as the test set (also known as the validation set) and the remaining k-1 folds as the training set.
- Train a model on the training set and evaluate it on the test set.
k is an integer value chosen by you; usually k = 5 or k = 10 is used.
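The splitting step above can be sketched in a few lines. This assumes scikit-learn is available; its `KFold` class yields one train/test index pair per fold:

```python
# A minimal sketch of splitting a dataset into k folds,
# assuming scikit-learn is installed.
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # ten toy data points
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Each split holds out one fold (2 points) and trains on the rest (8 points).
splits = list(kf.split(data))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With k = 5 and 10 points, each test fold contains 10/5 = 2 points, and every point appears in exactly one test fold.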
So basically, using this method, you train k models, each evaluated on a different validation set.
To predict on a new data point, you can obtain a prediction from each of the k models and take the average of those predictions. Most of the time, this works better than relying on a single model.
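The prediction-averaging idea can be sketched as follows. This is a toy illustration, assuming scikit-learn; the regression data and the choice of `LinearRegression` are arbitrary:

```python
# Sketch: train one model per fold, then average the k models'
# predictions on a new point. Assumes scikit-learn; data is synthetic.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=30)  # y ~ 3x + noise

models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    models.append(LinearRegression().fit(X[train_idx], y[train_idx]))

x_new = np.array([[1.0]])
# Average the k models' predictions for the new point.
avg_pred = np.mean([m.predict(x_new)[0] for m in models])
```

Since the true relationship is y = 3x, the averaged prediction at x = 1 lands close to 3.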
We will see an example now.
Consider a small dataset of six data points. We first shuffle the dataset; after shuffling, imagine we get:
$$ [1,2,3,4,5,6] $$
Suppose you choose k = 3. Your groups (folds) would be:
$$ [1,2] = fold-1 $$
$$ [3,4] = fold-2 $$
$$ [5,6] = fold-3 $$
Now k = 3 models are trained:
$Model1: Trained\ on\ Fold1 + Fold2,\ Tested\ on\ Fold3$
$Model2: Trained\ on\ Fold2 + Fold3,\ Tested\ on\ Fold1$
$Model3: Trained\ on\ Fold1 + Fold3,\ Tested\ on\ Fold2$
The average of the k models' accuracies is reported as the cross-validation accuracy.
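The worked example above can be reproduced with scikit-learn's `KFold` (an assumption; any splitter producing the same folds would do). Without shuffling, the six points fall into the folds in order; the order in which folds serve as the test set may differ from the listing above, but each fold is held out exactly once:

```python
# Reproducing the 3-fold example on [1,2,3,4,5,6] with scikit-learn.
# Without shuffling, the folds come out in order: [1,2], [3,4], [5,6].
import numpy as np
from sklearn.model_selection import KFold

data = np.array([1, 2, 3, 4, 5, 6])
folds = [data[test_idx] for _, test_idx in KFold(n_splits=3).split(data)]

for i, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(data), 1):
    print(f"Model {i}: train on {data[train_idx]}, test on {data[test_idx]}")
```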
Variations on Cross-Validation
There are a number of variations on the k-fold cross validation procedure. Commonly used variations are as follows:
- Train/Test Split: Taken to one extreme, k may be set to 1 such that a single train/test split is created to evaluate the model.
- LOOCV: Taken to the other extreme, k may be set to the total number of observations in the dataset, so that each observation gets a chance to be held out as the test set. This is called leave-one-out cross-validation, or LOOCV for short.
- Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
- Repeated: This is where the k-fold cross-validation procedure is repeated n times, where importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
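Each variation above corresponds to a splitter in scikit-learn (assuming that library; the toy dataset here is made up). A quick sketch counting the splits each one produces:

```python
# The variations above, mapped to scikit-learn splitter classes.
import numpy as np
from sklearn.model_selection import (
    train_test_split, LeaveOneOut, StratifiedKFold, RepeatedKFold)

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)  # balanced binary class labels

# Single train/test split (the one-split extreme).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

loo_splits = list(LeaveOneOut().split(X))      # one split per observation
strat_splits = list(StratifiedKFold(n_splits=3).split(X, y))
rep_splits = list(RepeatedKFold(n_splits=3, n_repeats=2,
                                random_state=0).split(X))
```

Stratification guarantees each test fold keeps the 50/50 class balance of the full dataset; repetition simply reruns 3-fold CV twice with different shuffles, giving 3 × 2 = 6 splits.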
As k increases, each training set grows closer in size to the full dataset, so the bias of the performance estimate decreases while its variance tends to increase (and vice versa for small k). Choosing k therefore lets you control the bias-variance trade-off of the estimate.
As discussed above, variations such as stratified cross-validation and leave-one-out cross-validation can be chosen when they better suit your data and problem.