What’s the problem in Classification?
How to validate that our model is the right fit?
- Train/test split: Split data into train and test data
- Save a subset of data to make predictions
- Can verify predictions without having to collect new data (which may be difficult or expensive)
- Can help avoid overfitting
- Improve the quality of our predictions
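The points above can be sketched in code. This is a minimal, illustrative example assuming scikit-learn and its bundled iris dataset; the model choice (logistic regression) is arbitrary:

```python
# Minimal train/test split sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows as a test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Verify predictions on data the model never saw during training.
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Because the test rows were never seen during fitting, the score is an estimate of performance on new data, which is exactly what catches overfitting.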
What are some other types of validation techniques?
- k-fold validation
- hold-out validation
What is cross-validation?
- Define a second dataset used to “test” the model during the training phase, in order to avoid overfitting and give insight into how the model will behave on unseen data
What is k-fold validation?
- Split data into K folds (subsets)
- Use K-1 for training
- Use 1 for testing
- Repeat K times
- Mean results
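The steps above can be sketched as follows. This assumes scikit-learn (any library, or a manual loop, works equally well) and the iris dataset as a stand-in:

```python
# 5-fold cross-validation sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# K = 5: each fold is used once for testing while the other K-1 train the model.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# One accuracy per fold, then the mean of the K results.
print(scores)
print(f"mean accuracy: {scores.mean():.2f}")
```

`cross_val_score` handles the "repeat K times" loop internally; the final mean is the number usually reported.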
When to use k-fold?
- When a single split might accidentally give us two very different training and testing sets, due to a lack of randomization; averaging over K splits reduces that risk
What is hold-out validation?
- To omit a section of the data from the entire validation setup, so the model can be tested at the end on truly unseen data
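A sketch of how hold-out combines with the techniques above, again assuming scikit-learn and the iris dataset: carve off the hold-out set first, do all validation on the rest, and touch the hold-out data only once at the end.

```python
# Hold-out validation sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# 20% of the data is set aside and excluded from any validation setup.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)

# All model selection happens on the development portion only.
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)

# One final evaluation on data the whole process never saw.
final_score = model.fit(X_dev, y_dev).score(X_holdout, y_holdout)
print(f"cv mean: {cv_scores.mean():.2f}, hold-out: {final_score:.2f}")
```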
Does it matter where we split our data?
- Yes. For example, theme park attendance might be very different depending on the day of the week.
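A toy illustration of why the split location matters, using hypothetical weekday/weekend labels sorted by day type: without shuffling, the test set can end up containing only one class.

```python
# Toy example (hypothetical data): 0 = weekday attendance, 1 = weekend attendance,
# with the rows sorted rather than randomized (assumes scikit-learn is installed).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # stand-in feature, e.g. a day index
y = np.array([0] * 70 + [1] * 30)   # ordered labels, weekdays first

# shuffle=False takes the last 20% as-is: the test set sees only weekends.
_, _, _, y_test_ordered = train_test_split(X, y, test_size=0.2, shuffle=False)
print(set(y_test_ordered))          # {1}

# Shuffling first gives the test set a mix of both classes.
_, _, _, y_test_shuffled = train_test_split(X, y, test_size=0.2, random_state=0)
print(set(y_test_shuffled))
```

This is the lack-of-randomization problem k-fold guards against; for genuinely time-ordered data, a time-aware split is the better fix than shuffling.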