Train/test split with scikit-learn

What’s the problem in Classification?

How do we validate that our model is a good fit?

  • Train/test split: Split data into train and test data
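
As a minimal sketch of a train/test split, the snippet below uses scikit-learn's train_test_split. The iris dataset, the logistic regression model, the 25% test size and the random_state value are all assumptions chosen for illustration; the post does not name any of them.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example dataset (an assumption; any feature matrix X and labels y would do).
X, y = load_iris(return_X_y=True)

# Hold back 25% of the rows for testing.
# stratify=y keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score on both parts; a large gap between the two hints at overfitting.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The test rows play the role of "new" data: the model never sees them during fitting, so the test accuracy is an estimate of how it would do on data we have not collected yet.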

Why?

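  • Hold back a subset of data to check predictions against
  • Can verify predictions without having to collect new data (which may be difficult or expensive)
  • Can help detect overfitting, since a model that memorizes the training data scores poorly on data it has not seen
  • Gives a more honest estimate of the quality of our predictions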

What are some other types of validation techniques?

  • Cross-validation
  • k-fold validation
  • hold-out validation

What is cross-validation?

  • Set aside part of the dataset to “test” the model during the training phase (i.e., a second, validation dataset), in order to detect overfitting and give insight into how the model will behave on unseen data
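
One way to try this in scikit-learn is cross_val_score, which handles the repeated splitting for us. This is only a sketch: the dataset, the model and cv=5 are assumptions picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example dataset and model (assumptions for illustration).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 asks for five train/validation rounds.
# Each row is used for validation in exactly one round.
cv_scores = cross_val_score(model, X, y, cv=5)

print("score per round:", cv_scores)
print("mean score:", cv_scores.mean())
```

If the scores vary a lot from round to round, a single train/test split would have given a result that depends heavily on which rows happened to land in the test part.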

What is k-fold validation?

  • Split data into K folds (subsets)
  • Use K-1 for training
  • Use 1 for testing
  • Repeat K times
  • Average the results
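
The steps above can be written out by hand with scikit-learn's KFold, which only produces the row indices for each fold. The choice of K = 5, the shuffling, the dataset and the model are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Example dataset (an assumption for illustration).
X, y = load_iris(return_X_y=True)

# Split the data into K folds; shuffle so the folds do not follow row order.
k = 5
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):
    # Use K-1 folds for training ...
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # ... and the remaining fold for testing.
    scores.append(model.score(X[test_idx], y[test_idx]))

# After K rounds, average the results.
mean_score = float(np.mean(scores))
print("score per fold:", scores)
print("mean score:", mean_score)
```

Passing the same kfold object as cv=kfold to cross_val_score should run an equivalent loop in one call; the explicit version is shown here only to mirror the list.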

When to use k-fold?

  • When a single split might accidentally give us very different training and testing data, for example due to a lack of randomization

What is hold-out validation?

  • Omit a section of data from any validation setup, in order to truly test on unseen data
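
A hedged sketch of how this might look in practice: split off the hold-out rows first, run cross-validation only on what is left, and touch the hold-out rows once at the very end. The 20% hold-out size, the dataset and the model are assumptions, and the names X_work and X_holdout are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Example dataset (an assumption for illustration).
X, y = load_iris(return_X_y=True)

# Set the hold-out rows aside first; they take no part in validation.
X_work, X_holdout, y_work, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Cross-validate on the working data only.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_work, y_work, cv=5)
print("cross-validation mean:", cv_scores.mean())

# Final check: fit on all working data, score once on the hold-out rows.
model.fit(X_work, y_work)
holdout_accuracy = model.score(X_holdout, y_holdout)
print("hold-out accuracy:", holdout_accuracy)
```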

Does it matter where we split our data?

  • Yes. For example, theme park attendance might be very different depending on the day of the week, so a test set drawn only from weekends would not represent the rest of the data.
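
To make the theme park example concrete, here is a small sketch with made-up numbers: four weeks of daily attendance starting on a Monday, with 4000 visitors on weekdays and 9000 on weekends. The figures and variable names are hypothetical. Cutting off the last two rows in order puts only a weekend in the test set, while a shuffled, stratified split keeps the weekday/weekend mix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 28 days starting on a Monday (0 = Mon, ..., 6 = Sun).
day_of_week = np.arange(28) % 7
is_weekend = day_of_week >= 5
attendance = np.where(is_weekend, 9000, 4000)  # made-up visitor counts

# Split in row order: the test set is simply the last two days.
ordered_train, ordered_test, _, ordered_test_weekend = train_test_split(
    attendance, is_weekend, test_size=2, shuffle=False
)

# Shuffled split, stratified so both parts keep the weekday/weekend mix.
mixed_train, mixed_test, _, mixed_test_weekend = train_test_split(
    attendance, is_weekend, test_size=7, random_state=0, stratify=is_weekend
)

print("ordered split, train mean:", ordered_train.mean())
print("ordered split, test mean: ", ordered_test.mean())
print("mixed split, train mean:  ", mixed_train.mean())
print("mixed split, test mean:   ", mixed_test.mean())
```

For data with a real time order (for example, forecasting future attendance), shuffling is not always appropriate; the point here is only that the position of the split changes what the test set looks like.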

 
