In machine learning, a train test split is done to measure how well a machine learning model performs when it is used to predict new data that was not used to train it.
You can use the train_test_split() method available in the sklearn library to split the data into train and test sets.
In this tutorial, you’ll learn how to split data into train and test sets for training and testing your machine learning models.
If You’re in a Hurry
You can use the sklearn library method train_test_split() to split your data into train and test sets.
Snippet
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X = iris.data
y = iris.target
#Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)
print(Counter(y_train))
print(Counter(y_test))
When you print the counts of the target variable, you’ll see how many samples of each class are in y_train and y_test. Because no random_state is passed, the exact counts will change from run to run.
Output
Counter({0: 34, 2: 31, 1: 25})
Counter({1: 25, 2: 19, 0: 16})
This is how you can split data into train and test sets.
If You Want to Understand Details, Read on…
In this tutorial, you’ll understand
- What are train and test sets
- The rule of thumb to configure the percentage of the train test split
- Loading the data from the sklearn datasets package for demonstration
- Splitting the dataset using the sklearn library
- Using the Random and Stratify option
- Split without using the sklearn library
What Are Train and Test Sets
A train test split is the process of splitting the dataset into two different sets called the train and test sets.
Train Set – Used to fit your machine learning model
Test Set – Used to evaluate the fitted machine learning model
The train set is used to teach the machine learning model. The trained model then predicts the output for the test set, and the predictions are compared with the expected output to check whether your machine learning model is trained properly.
By using this, you can calculate how accurately your machine learning model behaves when you pass it new, unseen data.
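As a minimal sketch of this idea, the snippet below fits a model on the train set and measures its accuracy on the test set. The choice of KNeighborsClassifier, the 80/20 split, and random_state=0 are assumptions made for illustration only. Any classifier would work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples as unseen test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the model on the train set only (classifier choice is illustrative)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict on the test set and compare with the expected output
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

The accuracy is computed only from samples the model never saw during training, which is the whole point of the split.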
Configuring Test Train Split
Before splitting the data, you need to know how to configure the train test split percentage.
In most cases, the common split percentages are
- Train: 80%, Test: 20%
- Train: 67%, Test: 33%
- Train: 50%, Test: 50%
However, when choosing the split, you need to consider the computational cost of training and evaluating the model, and how representative the train and test sets are of the whole dataset.
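To get a feel for these percentages, here is a small sketch that applies each of them to the 150-row iris dataset and reports the resulting row counts. The split_sizes dictionary and random_state=0 are illustrative names and values, not part of any required setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples in total

# Map each test fraction to the (train rows, test rows) it produces
split_sizes = {}
for test_size in (0.20, 0.33, 0.50):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train rows, {len(X_test)} test rows")
```

With a dataset this small, a 50% test set leaves only 75 rows for training, which is one reason the 80/20 split is a common starting point.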
Loading the Data
In this section, you’ll learn how to load the sample dataset from the sklearn datasets library.
You’ll load the iris dataset, which has four features: sepal length, sepal width, petal length, and petal width.
It has one output variable which denotes the class of the iris flower. The class will be one of the following.
— Iris Setosa
— Iris Versicolour
— Iris Virginica
Hence with this dataset, you can implement a multiclass classification machine learning program.
You can use the below snippet to load the iris dataset.
In machine learning programs, capital X is normally used to denote the features, and small y is used to denote the output variables of the dataset.
Once the dataset is loaded using the load_iris() method, you can assign the data to X using iris.data and assign the target to y using iris.target.
Snippet
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
This is how you can load the iris dataset from the sklearn datasets library.
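If you want to confirm what was loaded, a quick check like the following sketch prints the shapes and names stored on the object returned by load_iris().

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

print(X.shape)             # (150, 4): 150 samples, 4 features
print(y.shape)             # (150,): one class label per sample
print(iris.feature_names)  # names of the four feature columns
print(iris.target_names)   # names of the three iris classes
```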
Next, you’ll learn how to split the dataset into train and test datasets.
Train Test Split Using Sklearn Library
You can split the dataset into train and test set using the train_test_split() method of the sklearn library.
It accepts one mandatory parameter.
Input Dataset – It is a sequence of array-like objects of the same length. Allowed inputs are lists, NumPy arrays, scipy-sparse matrices, or pandas data frames.
It also accepts a few other optional parameters.
test_size – Size of the test dataset split. It accepts a float or an int. A float between 0.0 and 1.0 is the proportion of the data to use for testing, and an int is the absolute number of test samples. If you want to have 25% of the data for testing, you can pass test_size = 0.25. If it is set to None, the size will be automatically set to complement the train size. If train_size is also None, then it’ll be set to 0.25.
train_size – Size of the train dataset split. It accepts a float or an int in the same way as test_size. If you want to have 75% of the data for training, you can pass train_size = 0.75. If it is set to None, the size will be automatically set to complement the test size. If test_size is also None, then it’ll be set to 0.75.
random_state – It accepts an int, a RandomState instance, or None. It controls the shuffling applied to the dataset before splitting it into two sets. Passing an int gives you the same split on every run.
shuffle – It is a boolean type parameter and is True by default. It is used to denote whether shuffling must be done before the split. If shuffle is False, then the next parameter, stratify, must be None.
stratify – array-like object. It is used to split the data in a stratified fashion using the class labels.
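The following sketch tries a few of these parameters on the iris dataset. The variable names with numeric suffixes and random_state=0 are chosen only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# An int test_size is an absolute number of test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=30, random_state=0)
print(len(X_train), len(X_test))

# train_size can be given instead; the test set gets the remaining rows
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, train_size=0.75, random_state=0)
print(len(X_train2), len(X_test2))

# shuffle=False keeps the original row order, so the last rows become the test set
X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.2, shuffle=False)
print(sorted(set(y_test3.tolist())))
```

Because the iris rows are ordered by class, the unshuffled test set contains only the last class. This is why shuffling is left on in most cases.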
You can use the below snippet to split the dataset into train and test sets.
For this demonstration, only the input dataset is passed as X and y along with test_size = 0.4. It means the data will be split into 60% for training and 40% for testing.
Snippet
from collections import Counter
from sklearn.model_selection import train_test_split
#Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4)
print(Counter(y_train))
print(Counter(y_test))
When you print the count of target variables in each set (train and test sets), you’ll see the below output.
The train set contains 34 samples with label 0, 25 samples with label 1, and 31 samples with label 2.
Output
Counter({0: 34, 2: 31, 1: 25})
Counter({1: 25, 2: 19, 0: 16})
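Here the classes 0, 1, and 2 are not equally represented in the train and test sets, even though the full iris dataset has 50 samples of each class.
In the next section, you’ll see how to split so that each set keeps the class proportions of the full dataset.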
Stratified Train Test Split
When training a classification model, it is advisable for the train and test sets to have the same class proportions as the full dataset. Otherwise, a class may be under-represented in one of the sets and the evaluation can be misleading. This applies only to classification machine learning problems.
To solve this, the class distribution in your train and test sets needs to match the class distribution of the original data. For example, the iris dataset has an equal number of samples for each class, so each set should also have an equal number of samples per class.
You can achieve this by using the stratified train test split strategy. It is especially useful when splitting an imbalanced classification dataset.
You can do a stratified train test split of the dataset using the train_test_split() method by passing the stratify=y parameter.
Use the below snippet to perform the stratified train and test split.
Snippet
from collections import Counter
from sklearn.model_selection import train_test_split
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
print(Counter(y_train))
print(Counter(y_test))
When you see the count of the output classes in the training and test set, each class has 25 data points.
Output
Counter({2: 25, 1: 25, 0: 25})
Counter({2: 25, 0: 25, 1: 25})
This is how you can use the stratified train test split to keep the class proportions the same in both sets, which matters most when you have an imbalanced dataset.
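The iris dataset is balanced, so here is a small sketch on an imbalanced dataset to show what stratify does in that case. The data is hypothetical and made up for this example (90 samples of class 0 and 10 of class 1), and the names X_imb and y_imb are illustrative.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 samples of class 1
X_imb = np.arange(100).reshape(-1, 1)
y_imb = np.array([0] * 90 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=0, stratify=y_imb
)

# Both sets keep the original 90:10 class ratio
train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(train_counts)
print(test_counts)
```

Note that stratification keeps the 90:10 ratio in both sets. It does not make the classes equal in size.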
Random Train Test Split
In this section, you’ll learn how to do a random split into train and test sets.
By default, the train_test_split() method shuffles the data before splitting, so the split is already random. Passing the random_state parameter, for example random_state = 42, fixes the seed of the shuffle so that you get the same split every time you run the code.
You can pass any integer for the random state. 42 is a commonly used number.
The random split is done to ensure that the data is assigned to the train and test sets randomly, so that the subsets are representative samples of the main data.
You can use the below snippet to do the random train test split using the sklearn library.
Snippet
from collections import Counter
from sklearn.model_selection import train_test_split
#Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)
print(Counter(y_train))
print(Counter(y_test))
When you print the count of the target variables, you can see that the train and test sets have different numbers for each class. This shows that the samples are assigned to the sets randomly rather than by class.
Output
Counter({2: 32, 1: 31, 0: 27})
Counter({0: 23, 1: 19, 2: 18})
This is how you can do a random train test split using sklearn for random sampling of data.
Test Train Split Without Using Sklearn Library
In this section, you’ll learn how to split data into train and test sets without using the sklearn library.
You can do a train test split without using the sklearn library by shuffling the data frame and splitting it based on the defined train test size.
Follow the below steps to split manually.
- Load the iris dataset using load_iris()
- Create a dataframe using the features of the iris data
- Add the target variable column to the dataframe
- Shuffle the dataframe using the df.sample() method.
- Create a training size of 70%. It can be calculated by multiplying 0.7 by the total length of the data frame.
- Split the data frame until the train_size using the :train_size slice and assign it to the train set.
- Split the data frame from the train_size until the end of the data frame using the train_size: slice and assign it to the test set.
Snippet
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
# Shuffle the dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
Now when you print the count of the target in the train set, you’ll see the below data frame.
Use the below snippet to print the count of the classes in the train set.
Snippet
train_set.groupby(['target']).count()
Dataframe Will Look Like
| target | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 34 | 34 | 34 | 34 |
| 1 | 39 | 39 | 39 | 39 |
| 2 | 32 | 32 | 32 | 32 |
Now when you print the count of the target in the test set, you’ll see the below data frame.
Use the below snippet to print the count of the classes in the test set.
Snippet
test_set.groupby(['target']).count()
Dataframe Will Look Like
| target | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 16 | 16 | 16 | 16 |
| 1 | 11 | 11 | 11 | 11 |
| 2 | 18 | 18 | 18 | 18 |
This is how you can split the dataset into train and test sets without using the sklearn library.
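If you prefer to work with NumPy arrays instead of a dataframe, the same steps can be sketched as below. This is an alternative illustration, not the method used above. The seeded generator from np.random.default_rng(0) is an assumption added so that the shuffle is repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Shuffle the row indices with a seeded generator so the split is repeatable
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))

# First 70% of the shuffled indices go to train, the rest to test
train_size = int(0.7 * len(X))
train_idx, test_idx = indices[:train_size], indices[train_size:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape)
```

Indexing X and y with the same shuffled indices keeps every feature row paired with its own label.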
Train Test Split With Groups
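In this section, you’ll learn how to split train and test sets based on groups.
You can do a train test split with groups using the GroupShuffleSplit class from the sklearn library. It keeps all the samples that belong to the same group together, so a group never appears in both the train and test sets.
Use the below snippet to do a train test split with groups using GroupShuffleSplit. For demonstration, the target column is used as the group, so the dataset is split based on the different classes available in the dataset.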
Snippet
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupShuffleSplit
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
train_idx, test_idx = next(GroupShuffleSplit(test_size=.20, n_splits=2, random_state = 7).split(df, groups=df['target']))
train = df.iloc[train_idx]
test = df.iloc[test_idx]
To display the count of each class in the training set, use the below snippet.
Snippet
train.groupby(['target']).count()
Dataframe Will Look Like
| target | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 50 | 50 | 50 | 50 |
| 1 | 50 | 50 | 50 | 50 |
You can use the below snippet to print the test dataset count.
Snippet
test.groupby(['target']).count()
Dataframe Will Look Like
| target | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 2 | 50 | 50 | 50 | 50 |
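This is how you can do a train test split with groups using the group shuffle split. Because the target was used as the group here, the train set contains only classes 0 and 1, and the test set contains only class 2.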
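In practice, the group is usually something other than the target, such as the person or device a measurement came from. The sketch below uses hypothetical data invented for this example: 12 measurements from 4 patients, with the made-up array patient_id as the group.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 measurements, 3 from each of 4 patients
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
patient_id = np.repeat(["p1", "p2", "p3", "p4"], 3)

# With 4 groups, test_size=0.25 puts one whole patient in the test set
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

train_groups = set(patient_id[train_idx].tolist())
test_groups = set(patient_id[test_idx].tolist())
print(sorted(train_groups), sorted(test_groups))
```

No patient appears on both sides of the split, so the test score reflects how the model behaves on patients it has never seen.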
Test Train Split with Seed
In this section, you’ll learn how to do a train test split with a seed value. This is similar to the random train test split method and is used for random sampling of the dataset.
You can split data with different random values passed as a seed to the random_state parameter in the train_test_split() method.
Use the below snippet to do a train test split with a seed value.
Snippet
from collections import Counter
from sklearn.model_selection import train_test_split
seed = 42
# The same seed always produces the same split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20, random_state=seed)
print(Counter(y_train))
print(Counter(y_test))
Output
Counter({0: 42, 1: 42, 2: 36})
Counter({2: 14, 1: 8, 0: 8})
This is how you can split the data into train and test sets with random seed values.
Conclusion
To summarize, you’ve learned what it means to split data into train and test sets. You’ve learned the different methods available in the sklearn library to split the data into train and test sets. You’ve also learned how to split without using the sklearn library methods.
If you’ve any questions, comment below.
Frequently Asked Questions
What is Random State in Test Train Split?
It is the seed that controls how the data is shuffled before it is split into train and test sets. Passing the same value gives you the same split every time, which makes your results reproducible.
What should train test split be?
It is normally 80% and 20% split.
How do you check for an imbalanced data set?
You can check for an imbalanced dataset by counting the occurrences of each class in the dataset. If the classes have different counts and the difference between the counts is large, then it is known as an imbalanced dataset.
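A minimal sketch of this check on the iris targets is shown below. The imbalance_ratio measure (largest class count divided by smallest) is just one simple way to summarize the counts and is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_iris

y = load_iris().target

# Count how many samples belong to each class
classes, counts = np.unique(y, return_counts=True)
class_counts = {int(c): int(n) for c, n in zip(classes, counts)}
print(class_counts)  # {0: 50, 1: 50, 2: 50}

# Largest class divided by smallest class; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)  # 1.0
```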
How do you split an imbalanced dataset?
You can split an imbalanced dataset using the stratified train test split, by passing the stratify parameter to the train_test_split() method of the sklearn library.
How do you handle an imbalanced data set?
When splitting an imbalanced dataset, you can pass the stratify parameter to the train_test_split() method of the sklearn library so that the train and test sets keep the same class proportions as the original dataset.
How to Install Test Train Split Sklearn in Python?
You can install sklearn using pip install -U scikit-learn and import train_test_split using the statement from sklearn.model_selection import train_test_split.
NameError: name ‘train_test_split’ is not defined
You can solve this error by importing the train_test_split method from the sklearn model_selection module using the statement from sklearn.model_selection import train_test_split.
How to solve train_test_split() error: Found input variables with inconsistent numbers of samples
The input variables and the target variable need to have the same length. For example, if you have 10 samples in your data, then you need to have 10 target values, one for each sample. Otherwise, you’ll face this error.
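The sketch below reproduces the error on purpose by cutting the iris targets short, and then prints the lengths so that the mismatch is visible. The names y_short and error_message are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Deliberately drop some targets so X and y have different lengths
y_short = y[:100]

error_message = None
try:
    train_test_split(X, y_short, test_size=0.2)
except ValueError as err:
    error_message = str(err)
    print(error_message)

# Checking the lengths before splitting shows the mismatch
print(len(X), len(y_short))
```

Comparing len(X) and len(y) before calling train_test_split() is usually the fastest way to find where rows were dropped.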