In this article, you’ll see how to implement a train-test split in Python using Sklearn. If you want to learn machine learning, I recommend you get the “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” book.
In this example, we will use the iris dataset, which has 150 rows and 5 columns. We’ll split it into a train set and a test set with an 80:20 ratio. The train set will hold 80 percent of the rows (120 rows), and the test set will hold the remaining 20 percent (30 rows). The split is made by rows only, so it does not change the number of columns. Follow these steps to split the dataset.
Step 1: Load the dataset
In this step, load the iris dataset using the seaborn library. You can use any other dataset instead. After loading it, display the shape and the first five rows of the data.
# Import the Seaborn library
import seaborn as sns

# Load the iris dataset
df = sns.load_dataset("iris")

# Display the shape of the dataset
print(df.shape)

# Display the first five rows of the data
df.head()
(150, 5)
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
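Note that sns.load_dataset downloads the file from seaborn's online data repository the first time it runs, so it needs a network connection. As a hedged alternative, the sketch below loads scikit-learn's bundled copy of iris instead. This is not the loader used in the rest of this article, and it rests on a few assumptions worth stating: the variable name df_sk is just illustrative, the column names differ (for example "sepal length (cm)" instead of "sepal_length"), and the label column is called "target" and holds integer codes 0 to 2 rather than species names.

```python
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of iris as a pandas DataFrame
# (assumption: used here only as an offline stand-in for sns.load_dataset)
iris = load_iris(as_frame=True)
df_sk = iris.frame

# Same overall shape as the seaborn version: 150 rows, 5 columns
print(df_sk.shape)

# Column names differ from seaborn's; the last one is the label column
print(df_sk.columns.tolist())
```

If you use this version, adapt the column name in Step 2 accordingly (for example, "target" instead of "species").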
Step 2: Separate independent and dependent variables
In this step, separate the features and labels. We’ll store the features in the variable X and their labels in y. There are several ways to do this; here I use the .iloc indexer to take every column except the last one as features, and select the label column by its name.
# Separate independent and dependent variables (Features X)(Labels y)
X = df.iloc[:, :-1]
y = df['species']
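To make the "several ways" concrete, here is a small sketch comparing positional selection with .iloc against selection by column name. It uses a tiny hand-made DataFrame called toy (made-up values, not the real iris data) so that it runs without downloading anything; all variable names in it are illustrative.

```python
import pandas as pd

# Tiny stand-in frame (made-up values) with the label in the last column
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})

# Positional selection: every column except the last, then only the last
X_pos = toy.iloc[:, :-1]
y_pos = toy.iloc[:, -1]

# Selection by name: drop the label column, then pick it directly
X_name = toy.drop(columns="species")
y_name = toy["species"]

# Both approaches give the same features and labels
print(X_pos.equals(X_name))
print(y_pos.equals(y_name))
```

Positional selection only works when the label really is the last column; selecting by name keeps working if the column order changes.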
Step 3: Train Test Split using Sklearn
This is the main step: the split itself. The scikit-learn library provides the train_test_split function for this purpose. You can change the value of test_size according to your requirements. If you assign a fixed value to random_state, the function selects the same rows every time you execute the code, which makes the split reproducible. If you leave random_state unset, the rows are shuffled differently on each run.
# import train_test_split from sklearn module
from sklearn.model_selection import train_test_split

# Pass the features(X) and labels(y) and set the test size (eg. 0.2 = 20 percent)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
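The effect of random_state is easy to check for yourself. The following is a minimal sketch, assuming stand-in NumPy arrays (X_demo and y_demo, hypothetical names) with the same shape as the iris features and labels, so it runs without loading the dataset. It calls train_test_split twice with the same seed and once with no seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (not iris): 150 rows, 4 feature columns, 3 classes of 50 rows
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# Two calls with the same seed
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# One call with no seed: the selected rows can change from run to run
_, X_test_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.2)

# Same seed, same test rows
print(np.array_equal(X_test_a, X_test_b))  # True

# The unseeded split still has the requested size, only the rows vary
print(X_test_c.shape)
```

Fixing the seed is useful when you want to compare models on exactly the same split or share results that others can reproduce.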
Step 4: Check the shape of the train and test sets
You can check the shape of the train and test sets with the help of this code.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(120, 4)
(120,)
(30, 4)
(30,)
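The shapes confirm the row counts, but they don't show whether any row landed in both sets or how the classes are distributed. Below is a hedged sketch of two extra checks, run on a stand-in DataFrame (made-up values with hypothetical column names f1 to f4 and labels a, b, c) shaped like iris. It also passes stratify, an optional train_test_split parameter that is not used in the steps above, to keep the class proportions equal in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data (not iris): 150 rows, 4 features, 3 classes of 50 rows each
X_demo = pd.DataFrame(np.arange(600).reshape(150, 4), columns=["f1", "f2", "f3", "f4"])
y_demo = pd.Series(np.repeat(["a", "b", "c"], 50), name="label")

# stratify=y_demo is an optional extra: it preserves class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Check 1: no row index appears in both sets, and no row is lost
overlap = X_tr.index.intersection(X_te.index)
print(len(overlap))
print(len(X_tr) + len(X_te))

# Check 2: with stratify, each class gets an equal share of the test set
test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

Without stratify, the class counts in the test set depend on the shuffle and are usually not exactly equal, which matters more for small or imbalanced datasets.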