Note on cross-validation

Good morning everybody,

here's a small note on cross-validation (CV) and project 1. We discussed this during the lecture on Thursday September 17.

What we try to summarize here is a possible approach to the CV tasks in project 1.

 

1) It is sufficient to perform what many call the standard CV approach in order to obtain the estimated MSE test error (on the test data).

This means, with say five folds, splitting your data into five folds, training on four of them and testing on the remaining one. This is repeated for all possible choices of the test fold (five here). The final MSE is then the average of these five contributions.

This is what is often done, and it would be sufficient for project 1.
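If it helps, here is a minimal sketch of this standard 5-fold CV with scikit-learn. The synthetic data, the fixed polynomial degree and the plain linear regression are placeholders only, not the project's actual setup.

```python
# Minimal sketch of standard 5-fold cross-validation (placeholder data and model).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Synthetic example data; replace with your own project data.
rng = np.random.default_rng(2020)
x = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=100)

model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())

kfold = KFold(n_splits=5, shuffle=True, random_state=2020)
mse_folds = []
for train_idx, test_idx in kfold.split(x):
    model.fit(x[train_idx], y[train_idx])
    mse_folds.append(mean_squared_error(y[test_idx], model.predict(x[test_idx])))

# The estimated test MSE is the average over the five folds.
print("Estimated MSE:", np.mean(mse_folds))
```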

Many people have, however, criticized this approach; see for example https://www.sciencedirect.com/science/article/abs/pii/S1053811917305311

 

2) A more consistent approach would be to:

a) split the data into training and test sets and keep the test data in a vault, untouched.

b) then use cross-validation on the training data, with train and validation folds (instead of test). With five folds, you then have five different validation sets.

These are used to find the best models as functions of complexity and hyperparameters. For say a polynomial of degree ten, you pick the model with the best MSE on your validation sets for a given hyperparameter λ if you are doing Ridge or Lasso. Then you repeat for another polynomial degree. With say ten polynomial degrees, you end up with the best model (in terms of MSE) for each degree.

c) The best models are then applied to the untouched test data to compute the test error (MSE) only once.

This may, however, not be the best estimate of the test error.
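To make the recipe above concrete, here is a hedged sketch of steps a)-c) with scikit-learn and Ridge. The data, the degree range and the λ grid are arbitrary placeholders.

```python
# Sketch of approach 2: the test set is kept in a vault, and CV on the training
# data alone is used to select the polynomial degree and the Ridge penalty λ.
# All data and grid values below are placeholders.
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2020)
x = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=200)

# a) split once and keep the test data untouched
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2020)

# b) 5-fold CV on the training data to pick degree and λ
kfold = KFold(n_splits=5, shuffle=True, random_state=2020)
best = (np.inf, None, None)  # (validation MSE, degree, lambda)
for degree in range(1, 11):
    for lmbda in np.logspace(-4, 1, 6):
        mse_val = []
        for train_idx, val_idx in kfold.split(x_train):
            model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=lmbda))
            model.fit(x_train[train_idx], y_train[train_idx])
            mse_val.append(mean_squared_error(y_train[val_idx],
                                              model.predict(x_train[val_idx])))
        if np.mean(mse_val) < best[0]:
            best = (np.mean(mse_val), degree, lmbda)

# c) refit the best model on all training data and evaluate once on the test set
_, degree, lmbda = best
final_model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=lmbda))
final_model.fit(x_train, y_train)
print("Best degree:", degree, "best lambda:", lmbda)
print("Test MSE (computed once):", mean_squared_error(y_test, final_model.predict(x_test)))
```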

3) To improve on this, you can set up an outer loop where you produce new test data. This could be done with a random train-test split of the data combined with an inner CV loop over train and validation folds, or with two nested CV loops, in order to obtain a better estimate of the final test error.
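One way to sketch such a double loop is with scikit-learn's GridSearchCV (mentioned again below) as the inner loop and an outer CV loop for the final error estimate. Again, the data, grids and parameter choices are only illustrative.

```python
# Sketch of approach 3 (nested CV): an inner loop selects degree and λ, while an
# outer loop produces fresh test folds for the final MSE estimate. Placeholder data.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(2020)
x = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=200)

pipe = Pipeline([("poly", PolynomialFeatures()), ("ridge", Ridge())])
param_grid = {"poly__degree": list(range(1, 11)), "ridge__alpha": np.logspace(-4, 1, 6)}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="neg_mean_squared_error")
outer_scores = cross_val_score(search, x, y, cv=outer_cv, scoring="neg_mean_squared_error")
print("Nested CV estimate of the test MSE:", -outer_scores.mean())
```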

 

Point 1 is what we think is reasonable to do for the project. It is not perfect, and when you discuss the results and compare them with the bootstrap results, a critical assessment of what has been done is important. If you feel you have time to do so, venturing into points 2 and 3 is fully possible, but not required.

 

Also, for the bias-variance analysis we want you to use the bootstrap only. But again, if you wish to explore CV as well, feel free to do so.
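In case a reminder is useful, here is a rough sketch of a bootstrap-based bias-variance decomposition for a single model; the data, the number of bootstraps and the fixed degree are placeholders to adapt to your own code.

```python
# Rough sketch of a bootstrap bias-variance decomposition (placeholder data and model).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(2020)
x = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * x).ravel() + 0.1 * rng.normal(size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2020)

n_bootstraps = 100
model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
y_pred = np.empty((y_test.shape[0], n_bootstraps))

# Fit on resampled training data, predict on the fixed test set.
for b in range(n_bootstraps):
    x_, y_ = resample(x_train, y_train, random_state=b)
    y_pred[:, b] = model.fit(x_, y_).predict(x_test)

# Decomposition: error ≈ bias^2 + variance (+ irreducible noise).
error = np.mean((y_test[:, None] - y_pred) ** 2)
bias2 = np.mean((y_test - np.mean(y_pred, axis=1)) ** 2)
variance = np.mean(np.var(y_pred, axis=1))
print(f"Error: {error:.4f}  Bias^2: {bias2:.4f}  Variance: {variance:.4f}")
```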

 

Note also that scikit-learn has grid-search functionality for finding the best hyperparameters (including cross-validation); see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
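For completeness, here is a minimal example of how such a grid search could look for the Ridge penalty λ (called alpha in scikit-learn); the data and the grid values are again placeholders.

```python
# Minimal GridSearchCV example: 5-fold CV over a grid of Ridge penalties.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2020)
X = rng.uniform(0, 1, (100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

param_grid = {"alpha": np.logspace(-4, 2, 7)}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV MSE:", -search.best_score_)
```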

 

We end by sharing the Nature article on NumPy, pretty cool.

https://www.nature.com/articles/s41586-020-2649-2.pdf
