Train Test Split Sklearn

In this article, you’ll see how to implement a train-test split in Python using scikit-learn (sklearn). If you want to learn machine learning, I recommend the book “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”.

In this example, we will use the iris dataset, which has 150 rows and 5 columns. We’ll split it into a train set and a test set with an 80:20 ratio: the train set gets 80 percent of the rows (120) and the test set gets the remaining 20 percent (30). Splitting divides the rows only, so the number of columns remains the same. Follow these steps to split the dataset.

Step 1: Load the dataset

In this step, load the iris dataset using the seaborn library. You can use any dataset. After loading the dataset, display its shape and its first five rows.

# Import the Seaborn library
import seaborn as sns

# Load the iris dataset
df = sns.load_dataset("iris")

# Display the shape of the dataset
print(df.shape)

# Display the first five rows of the data
df.head()

(150, 5)

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Step 2: Separate independent and dependent variables

In this step, separate the features from the labels. We’ll store the features in the variable X and the labels in y. There are several ways to do this. Here the features are selected by position with .iloc (every column except the last), and the labels are selected by column name.

# Separate the independent variables (features X) from the dependent variable (labels y)
X = df.iloc[:, :-1]
y = df['species']
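As a rough illustration of the "different methods" mentioned above, the sketch below compares positional selection with .iloc against dropping the label column by name. The small DataFrame is built by hand from the five rows displayed in Step 1 so that the example runs without loading the full dataset. The names X_iloc and X_drop are hypothetical and used only here.

```python
import pandas as pd

# Hand-built copy of the first five iris rows shown in Step 1 (illustration only)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

# Option 1: select every column except the last one by position
X_iloc = df.iloc[:, :-1]

# Option 2: drop the label column by name
X_drop = df.drop(columns="species")

# The labels are the 'species' column in both cases
y = df["species"]

# Both options produce the same feature table
print(X_iloc.equals(X_drop))  # True
```

Selecting by position is short, but it assumes the label is the last column. Dropping by name keeps working if the column order changes.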

Step 3: Train Test Split using Sklearn

This is the main step: splitting the data. The scikit-learn library provides the train_test_split function for this purpose. You can change the value of test_size according to your requirements. If you assign a fixed value to random_state, the split is reproducible: every time you execute the code, the same rows end up in the train set and the test set. If you leave random_state unset, a different random selection is made on each run.

# Import train_test_split from scikit-learn's model_selection module
from sklearn.model_selection import train_test_split

# Pass the features (X) and labels (y) and set the test size (e.g. 0.2 = 20 percent)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
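To see what random_state does in practice, here is a minimal sketch that calls train_test_split twice with the same seed and compares the resulting test sets. As an assumption made only for this example, the iris data is loaded from scikit-learn's bundled copy (load_iris) instead of seaborn, so the snippet runs offline. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled iris data so the sketch runs offline
X, y = load_iris(return_X_y=True)

# Two calls with the same random_state
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=0)

# The same seed selects the same rows in the same order
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)  # True

# A different seed shuffles the rows differently before splitting
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
print(np.array_equal(X_test_a, X_test_c))
```

The second comparison will almost certainly print False, since a different seed produces a different shuffle. The sizes of the sets are the same in every case.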

Step 4: Now check the shape of the train and test set

You can check the shapes of the train and test sets with this code.

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(120, 4)
(120,)
(30, 4)
(30,)
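The shapes confirm the 80:20 ratio, but they do not show how many rows of each species ended up in the test set. The sketch below counts the classes in the test set, first for the plain split and then with the optional stratify parameter of train_test_split, which keeps the class proportions of y in both sets. This goes slightly beyond the steps above and is only an illustration. As an assumption, the data comes from scikit-learn's bundled load_iris, where the three species are encoded as 0, 1 and 2.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data, labels encoded as integers 0, 1, 2 (50 rows each)
X, y = load_iris(return_X_y=True)

# Plain random split: class counts in the test set can be uneven
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)
plain_counts = np.bincount(y_test_plain, minlength=3)
print(plain_counts)

# Stratified split: each class keeps its share of the data in the test set
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
stratified_counts = np.bincount(y_test_strat, minlength=3)
print(stratified_counts)  # [10 10 10]
```

For a balanced dataset like iris the difference is small, but stratifying can matter more when some classes are rare.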
