The dataset has 1,079,134 rows.
I am taking 20% of the data for validation and 20% of the data for testing, both from the same dataset.
Why, then, is the accuracy different on the validation and test sets?
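For reference, the 20%/20% split I am describing would correspond to something like this (a minimal sketch with illustrative names; the actual script is further below):

from sklearn.model_selection import train_test_split

# Hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
# Hold out another 20% of all rows (25% of the remaining 80%) for validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, shuffle=True)
# 60% of the rows remain as X_train / y_train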
Output:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(1079134, 9)
(1079134,)
col3 col4 col5 col6 col7
count 1.079134e+06 1.079134e+06 1.079134e+06 1.079134e+06 1.079134e+06
mean 5.965598e+00 7.416868e+00 9.035799e+00 1.504262e-02 7.553835e-02
std 8.436995e-01 2.182468e+00 3.029521e+00 1.221784e-01 2.767082e-01
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 5.467000e+00 5.278000e+00 6.272000e+00 0.000000e+00 0.000000e+00
50% 5.795000e+00 7.869000e+00 8.905000e+00 0.000000e+00 0.000000e+00
75% 6.563000e+00 9.370000e+00 1.184800e+01 0.000000e+00 0.000000e+00
max 7.826000e+00 1.159000e+01 1.492200e+01 2.000000e+00 4.000000e+00
col8 col9 col10 col11
count 1.079134e+06 1.079134e+06 1.079134e+06 1079134.0
mean 2.246755e-01 8.234492e-01 2.491767e+00 0.0
std 4.954273e-01 1.201070e+00 2.386875e+00 0.0
min 0.000000e+00 0.000000e+00 0.000000e+00 0.0
25% 0.000000e+00 0.000000e+00 1.000000e+00 0.0
50% 0.000000e+00 0.000000e+00 2.000000e+00 0.0
75% 0.000000e+00 1.000000e+00 4.000000e+00 0.0
max 6.000000e+00 1.000000e+01 1.800000e+01 0.0
count 1079134
unique 3
top H
freq 459325
Name: Label, dtype: object
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Accuracy on training data: 98.51%
Accuracy on validation data: 82.74%
Accuracy on test data: 90.63%
Python script:
!pip -q install imbalanced-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
from google.colab import drive
drive.mount('/content/drive')
pd_dataframe = pd.read_csv('/content/drive/MyDrive/data_set.dat', delim_whitespace=True)
# Select target and feature variables
y_full = pd_dataframe.iloc[:, 2]
X_full = pd_dataframe.iloc[:, 3:12]
# Keep the full dataset (no subsampling); these are just copies of X_full and y_full
X_small = X_full.iloc[:, :]
y_small = y_full.iloc[:]
print(X_full.shape)
print(y_full.shape)
print(X_small.describe())
print(y_small.describe())
# Encode the target labels as integers: H -> 0, C -> 1, anything else -> 2
y_small_encoded = []
for y in y_small:
    if y == "H":
        y_small_encoded.append(0)
    elif y == "C":
        y_small_encoded.append(1)
    else:
        y_small_encoded.append(2)
y_small_encoded = np.array(y_small_encoded)
# Apply undersampling
from imblearn.under_sampling import RandomUnderSampler
undersampler_obj = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersampler_obj.fit_resample(X_small, y_small_encoded)
# Apply oversampling
from imblearn.over_sampling import RandomOverSampler
oversample_obj = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample_obj.fit_resample(X_small, y_small_encoded)
# Apply SMOTE
from imblearn.over_sampling import SMOTE
smote_obj = SMOTE()
X_sm, y_sm = smote_obj.fit_resample(X_small, y_small_encoded)
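# (RandomUnderSampler drops majority-class rows, RandomOverSampler duplicates
# minority-class rows, and SMOTE synthesizes new minority-class rows by
# interpolating between nearest neighbours; X_under/y_under are not used below)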
# Concatenate the oversampled and SMOTE datasets
X_sampled_concat = pd.concat([pd.DataFrame(X_over), pd.DataFrame(X_sm)], axis=0)
y_sampled_concat = pd.concat([pd.DataFrame(y_over), pd.DataFrame(y_sm)], axis=0)
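# NOTE: pd.DataFrame(y_over) and pd.DataFrame(y_sm) are single-column DataFrames,
# so y_sampled_concat is a column vector; this is what triggers the sklearn
# DataConversionWarning shown in the output above (y_sampled_concat.values.ravel()
# would give the 1-D shape sklearn expects)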
# Split the data into training+validation and testing sets
from sklearn.model_selection import train_test_split
# Split the data into a 60% train+validation set and a 40% test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X_sampled_concat, y_sampled_concat, test_size=0.40, shuffle=True)
# Split the test set 50/50 into the training and validation sets used below
# (X_train_val / y_train_val are never used after this point)
X_train, X_val, y_train, y_val = train_test_split(X_test, y_test, test_size=0.50, shuffle=True)
# Define and train the AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
adaboost_classifier_obj = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=80, algorithm="SAMME.R", learning_rate=0.9)
adaboost_classifier_obj.fit(X_train, y_train)
# Predict the labels on the training, validation, and test data
y_pred_train = adaboost_classifier_obj.predict(X_train)
y_pred_val = adaboost_classifier_obj.predict(X_val)
y_pred_test = adaboost_classifier_obj.predict(X_test)
# Calculate the accuracy score on the training, validation, and test data
accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_val = accuracy_score(y_val, y_pred_val)
accuracy_test = accuracy_score(y_test, y_pred_test)
# Print the accuracy scores
print("Accuracy on training data: %.2f%%" % (accuracy_train * 100.0))
print("Accuracy on validation data: %.2f%%" % (accuracy_val * 100.0))
print("Accuracy on test data: %.2f%%" % (accuracy_test * 100.0))