How To Do Train Test Split Using Sklearn in Python – Definitive Guide

In machine learning, a train test split is used to measure how well a model performs when it predicts on new data that was not used to train it.

You can use the train_test_split() method available in the sklearn library to split the data into train test sets.

In this tutorial, you’ll learn how to split data into train and test sets for training and testing your machine learning models.

If You’re in a Hurry…

You can use the sklearn library method train_test_split() to split your data into train and test sets.

Snippet

from collections import Counter

import numpy as np

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

iris = load_iris()

X = iris.data
y = iris.target

#Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)

print(Counter(y_train))

print(Counter(y_test))

When you print the counts of the target variable, you’ll see how many samples of each class ended up in y_train and y_test. Because random_state is not set here, the exact counts vary from run to run.

Output

    Counter({0: 34, 2: 31, 1: 25})
    Counter({1: 25, 2: 19, 0: 16})

This is how you can split data into train and test sets.

If You Want to Understand Details, Read on…

In this tutorial, you’ll understand

  • What train and test sets are
  • The rule of thumb for configuring the train test split percentage
  • Loading the data from the sklearn datasets package for demonstration
  • Splitting the dataset using the sklearn library
  • Using the Random and Stratify option
  • Split without using the sklearn library

What Are Train and Test Sets

A train test split is the process of splitting the dataset into two different sets called the train set and the test set.

Train Set – Used to fit your machine learning model
Test Set – Used to evaluate the fitted machine learning model

The train set is used to teach the machine learning model. The test set is then passed to the trained model to predict the output, and the predictions are compared with the expected output to check whether your machine learning model is trained properly.

This lets you calculate how accurately your machine learning model behaves when you pass it new, unseen data.
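The fit-then-evaluate idea above can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: the choice of KNeighborsClassifier, the 80/20 split, and random_state=0 are assumptions made only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the train set only (the classifier is an arbitrary choice here).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict on the unseen test set and compare with the expected labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

The accuracy printed here is an estimate of how the model behaves on unseen data, because the test rows were never used for fitting.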

Configuring Test Train Split

Before splitting the data, you need to know how to configure the train test split percentage.

In most cases, the common split percentages are

  • Train: 80%, Test: 20%
  • Train: 67%, Test: 33%
  • Train: 50%, Test: 50%

However, the right split depends on your data. When choosing it, consider the computational cost of training and evaluating the model, and whether both the training set and the test set remain representative of the full dataset.
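To see what the common percentages mean in practice, the sketch below applies them to the 150-row iris dataset and records the resulting set sizes. The dictionary name split_sizes and random_state=0 are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Map each test_size to the (train rows, test rows) it produces.
split_sizes = {}
for test_size in (0.2, 0.33, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: train={len(X_train)}, test={len(X_test)}")
```

Whatever percentage you pick, the train and test sizes always add up to the total number of rows.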

Loading the Data

In this section, you’ll learn how to load the sample dataset from the sklearn datasets library.

You’ll load the iris dataset, which has four features: sepal length, sepal width, petal length, and petal width.

It has one output variable, which denotes the class of the iris flower. The class is one of the following.

— Iris Setosa
— Iris Versicolour
— Iris Virginica

Hence with this dataset, you can implement a multiclass classification machine learning program.

You can use the below snippet to load the iris dataset.

In machine learning programs, capital X is normally used to denote the features, and small y is used to denote the output variables of the dataset.

Once the dataset is loaded using the load_iris() method, you can assign the data to X using the iris.data and assign the target to y using the iris.target.

Snippet

import numpy as np

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data
y = iris.target

This is how you can load the iris dataset from the sklearn datasets library.
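After loading, it can help to check what you actually have. The short sketch below prints the shapes of X and y together with the feature and class names stored on the object returned by load_iris().

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

print(X.shape)             # (number of samples, number of features)
print(y.shape)             # one label per sample
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # class names behind the labels 0, 1, 2
```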

Next, you’ll learn how to split the dataset into train and test datasets.

Train Test Split Using Sklearn Library

You can split the dataset into train and test set using the train_test_split() method of the sklearn library.

It accepts one mandatory parameter.

Input Dataset – It is a sequence of array-like objects of the same size. Allowed inputs are lists, NumPy arrays, scipy-sparse matrices, or pandas data frames.

It also accepts a few other optional parameters.

  • test_size – Size of the test split. It accepts a float or an int. A float between 0.0 and 1.0 is the proportion of the dataset to use for testing, so pass test_size = 0.25 to hold out 25% of the data. An int is the absolute number of test samples. If it is None, it is set to the complement of train_size. If train_size is also None, it defaults to 0.25.
  • train_size – Size of the train split. It accepts a float or an int. A float between 0.0 and 1.0 is the proportion of the dataset to use for training, so pass train_size = 0.75 to train on 75% of the data. An int is the absolute number of train samples. If it is None, it is set to the complement of test_size, which gives 0.75 when test_size is also None.
  • random_state – An int, a RandomState instance, or None. It controls the shuffling applied to the dataset before it is split into two sets. Pass an int to get the same split on every run.
  • shuffle – A boolean, True by default. It denotes whether the data is shuffled before the split. If shuffle is False, then the next parameter, stratify, must be None.
  • stratify – An array-like object, None by default. If given, the data is split in a stratified fashion, using this array as the class labels.
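The sketch below exercises two of the parameters listed above: train_size on its own, and shuffle=False. The variable names ending in _ns ("no shuffle") are only for illustration. Because the iris rows are stored sorted by class, turning shuffling off puts only the last class into the test set, which is why shuffling is usually left on.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Only train_size is given: the test set receives the remaining rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)
print(len(X_train), len(X_test))

# shuffle=False keeps the original row order: the first rows become
# the train set and the last rows become the test set.
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print(Counter(y_test_ns.tolist()))
```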

You can use the below snippet to split the dataset into train and test sets.

For this demonstration, only the input datasets X and y are passed, along with test_size = 0.4. This means the data will be split into 60% for training and 40% for testing.

Snippet

from collections import Counter

from sklearn.model_selection import train_test_split

#Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)

print(Counter(y_train))

print(Counter(y_test))

When you print the count of target variables in each set (Train and Test sets), you’ll see the below output.

The train set contains 34 samples labeled 0, 25 samples labeled 1, and 31 samples labeled 2.

Output

    Counter({0: 34, 1: 25, 2: 31})
    Counter({0: 16, 1: 25, 2: 19})

Here the classes 0, 1, and 2 are not evenly distributed across the training and test datasets.

In the next section, you’ll see how to split so that each set keeps the class proportions of the full dataset.

Stratified Train Test Split

When training a classification model, it is advisable that the train and test sets have the same class distribution as the full dataset. Otherwise, a class may be under-represented in one of the sets, which makes the model harder to train and the evaluation less reliable. This applies only to classification problems.

To achieve this, each set needs the same class proportions as the original data. For example, the iris dataset has an equal number of samples for each class, so each set should also have an equal number of samples for each class.

You can achieve this by using the stratified train test split strategy. It is especially useful when splitting an imbalanced classification dataset.

You can do a stratified train test split of the dataset using the train_test_split() method by passing the stratify=y parameter.

Use the below snippet to perform the stratified Train and Test split.

Snippet

from collections import Counter

from sklearn.model_selection import train_test_split

# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

print(Counter(y_train))

print(Counter(y_test))

When you look at the counts of the output classes in the training and test sets, each class has 25 data points in each set.

Output

    Counter({2: 25, 1: 25, 0: 25})
    Counter({2: 25, 0: 25, 1: 25})

This is how you can use the stratified train test split to preserve the class proportions, which matters most when you have an imbalanced dataset.
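Since iris itself is balanced, the effect of stratification is easier to see on an imbalanced dataset. The sketch below builds a hypothetical imbalanced subset of iris (all rows of classes 0 and 1, but only 10 rows of class 2) and splits it with stratify. The names keep, X_imb, and y_imb are assumptions made for this example.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hypothetical imbalanced subset: 50 rows of class 0, 50 of class 1,
# and only the first 10 rows of class 2.
keep = np.concatenate([np.arange(0, 100), np.arange(100, 110)])
X_imb, y_imb = X[keep], y[keep]

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=1, stratify=y_imb
)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print(train_counts)
print(test_counts)
```

Both sets keep the 5:5:1 ratio of the source data, so the rare class is still present in the test set instead of being left out by chance.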

Random Train Test Split

In this section, you’ll learn how to do a reproducible random split into train and test sets.

The train_test_split() method shuffles the data randomly by default. Passing the parameter random_state = 42 fixes the seed of that shuffle, so you get the same random split every time you run the code.

You can pass any integer as the random state. 42 is a popular choice.

The random split ensures that the data is assigned to the train and test sets randomly, so that the subsets are representative samples of the main data.

You can use the below snippet to do the random train test split using the sklearn library.

Snippet

from collections import Counter

from sklearn.model_selection import train_test_split

#Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

print(Counter(y_train))

print(Counter(y_test))

When you print the count of the target variables, you can see that the train and test sets have different numbers for each class. This shows that the data is split randomly rather than evenly by class.

Output

    Counter({2: 32, 1: 31, 0: 27})
    Counter({0: 23, 1: 19, 2: 18})

This is how you can do a random train test split using sklearn for random sampling of data.
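To illustrate what random_state does, the sketch below splits the data three times: twice with the same seed and once with a different one. The seed value 7 and the variable names are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same seed and one with a different seed.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.4, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.4, random_state=42)
_, X_test_c, _, y_test_c = train_test_split(X, y, test_size=0.4, random_state=7)

same_seed_matches = np.array_equal(X_test_a, X_test_b)
different_seed_matches = np.array_equal(X_test_a, X_test_c)

print("Same seed, same test set:", same_seed_matches)
print("Different seed, same test set:", different_seed_matches)
```

The same seed reproduces the split exactly, which is useful when you want to compare models on identical train and test sets.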

Test Train Split Without Using Sklearn Library

In this section, you’ll learn how to split data into train and test sets without using the sklearn library.

You can do a train test split without using the sklearn library by shuffling the data frame and splitting it based on the defined train test size.

Follow the below steps to split manually.

  • Load the iris dataset using load_iris().
  • Create a dataframe using the features of the iris data.
  • Add the target variable column to the dataframe.
  • Shuffle the dataframe using the df.sample() method.
  • Define a train size of 70%. It can be calculated by multiplying the total length of the data frame by 0.7.
  • Slice the shuffled data frame up to train_size using [:train_size] and assign it to the train set.
  • Slice the shuffled data frame from train_size to the end using [train_size:] and assign it to the test set.

Snippet

from sklearn.datasets import load_iris

import pandas as pd

data = load_iris()

df = pd.DataFrame(data.data, columns=data.feature_names)

df["target"] = data.target 

# Shuffle the dataset 
shuffle_df = df.sample(frac=1)

# Define a size for your train set 
train_size = int(0.7 * len(df))

# Split your dataset 
train_set = shuffle_df[:train_size]

test_set = shuffle_df[train_size:]

Now when you print the count of a target in the train set, you’ll see the below data frame.

Use the below snippet to print the count of the classes in the train set.

Snippet

train_set.groupby(['target']).count()

Dataframe Will Look Like

            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    target
    0                      34                34                 34                34
    1                      39                39                 39                39
    2                      32                32                 32                32

Now when you print the count of the target in the test set, you’ll see the below data frame.

Use the below snippet to print the count of the classes in the test set.

Snippet

test_set.groupby(['target']).count()

Dataframe Will Look Like

            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    target
    0                      16                16                 16                16
    1                      11                11                 11                11
    2                      18                18                 18                18

This is how you can split the dataset into train and test sets without using the sklearn library.
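The same shuffle-and-slice steps also work on plain NumPy arrays. The sketch below is one possible version: the function name manual_train_test_split and its test_size and seed parameters are hypothetical, and sklearn is used only to load the sample data.

```python
import numpy as np
from sklearn.datasets import load_iris


def manual_train_test_split(X, y, test_size=0.3, seed=0):
    """Shuffle row indices, then slice them into test and train parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # shuffled row positions
    n_test = int(round(test_size * len(X)))    # number of test rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = manual_train_test_split(X, y, test_size=0.3, seed=0)
print(len(X_train), len(X_test))
```

Shuffling the indices instead of the data keeps X and y aligned, because the same index arrays are applied to both.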

Train Test Split With Groups

In this section, you’ll learn how to split train and test sets based on groups.

You can do a train test split with groups using the GroupShuffleSplit class from the sklearn library.

Use the below snippet to do a train test split with groups using GroupShuffleSplit. It splits the dataset so that all rows of a group land in the same set. Here the target column is used as the group label, so each class ends up entirely in either the train set or the test set.

Snippet

from sklearn.datasets import load_iris

from sklearn.model_selection import GroupShuffleSplit

import pandas as pd

data = load_iris()

df = pd.DataFrame(data.data, columns=data.feature_names)

df["target"] = data.target 

train_idx, test_idx = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state = 7).split(df, groups=df['target']))

train = df.iloc[train_idx]
test = df.iloc[test_idx]

To display the training set, use the below snippet.

Snippet

train.groupby(['target']).count()

Dataframe Will Look Like

            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    target
    0                      50                50                 50                50
    1                      50                50                 50                50

You can use the below snippet to print the test dataset count.

Snippet

test.groupby(['target']).count()

Dataframe Will Look Like

            sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    target
    2                      50                50                 50                50

This is how you can do a train test split with groups using the group shuffle split.
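Group splitting is typically used when several rows belong to the same entity and must not be spread across both sets. The sketch below uses a small made-up dataset of 12 measurements from 4 subjects. The data, the subject labels s1 to s4, and the variable names are all hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 measurements, 3 from each of 4 subjects.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.repeat(["s1", "s2", "s3", "s4"], 3)

# test_size applies to the number of groups, not the number of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_groups = set(groups[train_idx])
test_groups = set(groups[test_idx])
print("Train groups:", sorted(train_groups))
print("Test groups:", sorted(test_groups))
```

No subject appears in both sets, so the test score is not inflated by the model having already seen rows from the same subject.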

Test Train Split with Seed

In this section, you’ll learn how to do a train test split with a seed value. This is similar to the random train test split method and is used for random sampling of the dataset.

You can get different splits by passing different seed values to the random_state parameter of the train_test_split() method. The same seed always produces the same split.

Use the below snippet to do a train test split with a seed value.

Snippet

from collections import Counter

from sklearn.model_selection import train_test_split

seed = 42 

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.20, random_state=seed)

print(Counter(y_train))

print(Counter(y_test))

Output

    Counter({0: 42, 1: 42, 2: 36})
    Counter({2: 14, 1: 8, 0: 8})

This is how you can split the data into train and test sets with random seed values.

Conclusion

To summarize, you’ve learned what it means to split data into train and test sets. You’ve learned the different methods available in the sklearn library for splitting the data into train and test sets. You’ve also learned how to split without using the sklearn library methods.

If you’ve any questions, comment below.

Frequently Asked Questions

What is Random State in Test Train Split

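It is the seed for the random shuffling that is applied before the split. The shuffling helps the train and test sets contain representative samples from the source dataset, and passing the same integer as random_state gives the same train and test sets on every run.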

What should train test split be?

A common choice is 80% for training and 20% for testing.

How do you check for an imbalanced data set?

You can check for an imbalanced dataset by counting the occurrences of each class in the dataset. If the counts differ and the difference between them is large, it is known as an imbalanced dataset.
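As a rough sketch of that check, the helper below turns class counts into proportions. The function name class_proportions is hypothetical, and the first 110 iris labels are used only as a convenient imbalanced example.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris

_, y = load_iris(return_X_y=True)


def class_proportions(labels):
    """Return the share of each class label as a fraction of all labels."""
    counts = Counter(np.asarray(labels).tolist())
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


balanced = class_proportions(y)          # full iris: 50 rows per class
imbalanced = class_proportions(y[:110])  # only 10 rows of class 2
print(balanced)
print(imbalanced)
```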

How do you split an imbalanced dataset?

You can split an imbalanced dataset using the stratified train test split method available in the train_test_split() method of the sklearn library.

How do you handle an imbalanced data set?

When splitting, you can handle the imbalanced dataset by using the stratified train test split method available in the train_test_split() method of the sklearn library, which keeps the class proportions the same in the train and test sets.

How to Install Test Train Split Sklearn in Python?

You can install sklearn by running pip install -U scikit-learn and then import train_test_split using the statement from sklearn.model_selection import train_test_split.

NameError: name ‘train_test_split’ is not defined

You can solve this error by importing the train test split method from the sklearn model_selection library using the snippet from sklearn.model_selection import train_test_split

How to solve train_test_split() error: Found input variables with inconsistent numbers of samples

The input variables and the target variable must have the same length. For example, if you have 10 samples in your data, then you need 10 target values, one for each sample. Otherwise, you’ll face this error.
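The sketch below reproduces the error on purpose by dropping one label, and then shows the length check that avoids it. The names y_short and error_message are made up for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Deliberately drop the last label so X and y_short differ in length.
y_short = y[:-1]

error_message = None
try:
    train_test_split(X, y_short, test_size=0.2)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Checking the lengths before splitting avoids the error.
if len(X) == len(y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    print(len(X_train), len(X_test))
```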
