The dataset has 1,079,134 rows.
I am taking 20% of the data for validation and 20% of the data for testing, both from the same dataset.
Why, then, is the accuracy different on the validation and test sets?
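For reference, the 20%/20% split I am describing would correspond to something like this (a minimal sketch with illustrative names; the actual script is further below):

from sklearn.model_selection import train_test_split

# Hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
# Hold out another 20% of all rows (25% of the remaining 80%) for validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, shuffle=True)
# 60% of the rows remain as X_train / y_train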
Output:
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(1079134, 9)
(1079134,)
col3 col4 col5 col6 col7
count 1.079134e+06 1.079134e+06 1.079134e+06 1.079134e+06 1.079134e+06
mean 5.965598e+00 7.416868e+00 9.035799e+00 1.504262e-02 7.553835e-02
std 8.436995e-01 2.182468e+00 3.029521e+00 1.221784e-01 2.767082e-01
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 5.467000e+00 5.278000e+00 6.272000e+00 0.000000e+00 0.000000e+00
50% 5.795000e+00 7.869000e+00 8.905000e+00 0.000000e+00 0.000000e+00
75% 6.563000e+00 9.370000e+00 1.184800e+01 0.000000e+00 0.000000e+00
max 7.826000e+00 1.159000e+01 1.492200e+01 2.000000e+00 4.000000e+00
col8 col9 col10 col11
count 1.079134e+06 1.079134e+06 1.079134e+06 1079134.0
mean 2.246755e-01 8.234492e-01 2.491767e+00 0.0
std 4.954273e-01 1.201070e+00 2.386875e+00 0.0
min 0.000000e+00 0.000000e+00 0.000000e+00 0.0
25% 0.000000e+00 0.000000e+00 1.000000e+00 0.0
50% 0.000000e+00 0.000000e+00 2.000000e+00 0.0
75% 0.000000e+00 1.000000e+00 4.000000e+00 0.0
max 6.000000e+00 1.000000e+01 1.800000e+01 0.0
count 1079134
unique 3
top H
freq 459325
Name: Label, dtype: object
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Accuracy on training data: 98.51%
Accuracy on validation data: 82.74%
Accuracy on test data: 90.63%
Python script:
!pip -q install imbalanced-learn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
from google.colab import drive
drive.mount('/content/drive')
pd_dataframe = pd.read_csv('/content/drive/MyDrive/data_set.dat', delim_whitespace=True)
# Select target and feature variables
y_full = pd_dataframe.iloc[:, 2]
X_full = pd_dataframe.iloc[:, 3:12]
# Keep the full dataset (no subsampling); these are just copies of X_full and y_full
X_small = X_full.iloc[:, :]
y_small = y_full.iloc[:]
print(X_full.shape)
print(y_full.shape)
print(X_small.describe())
print(y_small.describe())
# Encode the target labels as integers: H -> 0, C -> 1, anything else -> 2
y_small_encoded = []
for y in y_small:
    if y == "H":
        y_small_encoded.append(0)
    elif y == "C":
        y_small_encoded.append(1)
    else:
        y_small_encoded.append(2)
y_small_encoded = np.array(y_small_encoded)
# Apply undersampling
from imblearn.under_sampling import RandomUnderSampler
undersampler_obj = RandomUnderSampler(sampling_strategy='majority')
X_under, y_under = undersampler_obj.fit_resample(X_small, y_small_encoded)
# Apply oversampling
from imblearn.over_sampling import RandomOverSampler
oversample_obj = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample_obj.fit_resample(X_small, y_small_encoded)
# Apply SMOTE
from imblearn.over_sampling import SMOTE
smote_obj = SMOTE()
X_sm, y_sm = smote_obj.fit_resample(X_small, y_small_encoded)
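# (RandomUnderSampler drops majority-class rows, RandomOverSampler duplicates
# minority-class rows, and SMOTE synthesizes new minority-class rows by
# interpolating between nearest neighbours; X_under/y_under are not used below)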
# Concatenate the oversampled and SMOTE datasets
X_sampled_concat = pd.concat([pd.DataFrame(X_over), pd.DataFrame(X_sm)], axis=0)
y_sampled_concat = pd.concat([pd.DataFrame(y_over), pd.DataFrame(y_sm)], axis=0)
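# NOTE: pd.DataFrame(y_over) and pd.DataFrame(y_sm) are single-column DataFrames,
# so y_sampled_concat is a column vector; this is what triggers the sklearn
# DataConversionWarning shown in the output above (y_sampled_concat.values.ravel()
# would give the 1-D shape sklearn expects)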
# Split the data into training+validation and testing sets
from sklearn.model_selection import train_test_split
# Split the data into a 60% train+validation set and a 40% test set
X_train_val, X_test, y_train_val, y_test = train_test_split(X_sampled_concat, y_sampled_concat, test_size=0.40, shuffle=True)
# Split the test set 50/50 into the training and validation sets used below
# (X_train_val / y_train_val are never used after this point)
X_train, X_val, y_train, y_val = train_test_split(X_test, y_test, test_size=0.50, shuffle=True)
# Define and train the AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
adaboost_classifier_obj = AdaBoostClassifier(DecisionTreeClassifier(max_depth=20), n_estimators=80, algorithm="SAMME.R", learning_rate=0.9)
adaboost_classifier_obj.fit(X_train, y_train)
# Predict the labels on the training, validation, and test data
y_pred_train = adaboost_classifier_obj.predict(X_train)
y_pred_val = adaboost_classifier_obj.predict(X_val)
y_pred_test = adaboost_classifier_obj.predict(X_test)
# Calculate the accuracy score on the training, validation, and test data
accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_val = accuracy_score(y_val, y_pred_val)
accuracy_test = accuracy_score(y_test, y_pred_test)
# Print the accuracy scores
print("Accuracy on training data: %.2f%%" % (accuracy_train * 100.0))
print("Accuracy on validation data: %.2f%%" % (accuracy_val * 100.0))
print("Accuracy on test data: %.2f%%" % (accuracy_test * 100.0))