I want to train a multi-label classification model. Below is the code that creates sample data and trains on it.
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
x_data, y_data = make_multilabel_classification(n_samples=10000, n_classes=1, random_state=0)
print(x_data.shape)
print(y_data.shape)
pip = Pipeline([('preprocess', StandardScaler()), ('classifier', DecisionTreeClassifier())])
grid_param = [
{'preprocess': [StandardScaler()], 'classifier': [MultiOutputClassifier(KNeighborsClassifier())], 'classifier__estimator__n_neighbors': range(1,30)},
{'preprocess': [None], 'classifier': [MultiOutputClassifier(LogisticRegression())]},
{'preprocess': [None], 'classifier': [MultiOutputClassifier(DecisionTreeClassifier())], 'classifier__estimator__max_depth': range(2,10), 'classifier__estimator__criterion': ['gini']},
{'preprocess': [None], 'classifier': [MultiOutputClassifier(RandomForestClassifier())], 'classifier__estimator__n_estimators': range(20,200,20)},
]
grid_m = GridSearchCV(pip, grid_param, cv=5, return_train_score=True, verbose=3)
grid_m.fit(x_data, y_data)
- Result
The training and test set scores are mostly similar.
Fitting 5 folds for each of 47 candidates, totalling 235 fits
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.969) total time= 0.0s
[CV 2/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.970) total time= 0.0s
[CV 3/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.977) total time= 0.0s
[CV 4/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.969) total time= 0.0s
[CV 5/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.976) total time= 0.0s
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.967) total time= 0.0s
[CV 2/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.970) total time= 0.0s
[CV 3/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.977) total time= 0.0s
[CV 4/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.970) total time= 0.0s
[CV 5/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.973) total time= 0.0s
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=3, preprocess=StandardScaler();, score=(train=1.000, test=0.967) total time= 0.0s
[CV 2/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=3, preprocess=StandardScaler();, score=(train=1.000, test=0.969) total time= 0.0s
[CV 3/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=3, preprocess=StandardScaler();, score=(train=1.000, test=0.981) total time= 0.0s
[CV 4/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=3, preprocess=StandardScaler();, score=(train=1.000, test=0.968) total time= 0.0s
[CV 5/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=3, preprocess=StandardScaler();, score=(train=1.000, test=0.974) total time= 0.0s
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=4, preprocess=StandardScaler();, score=(train=1.000, test=0.970) total time= 0.0s
[CV 2/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=4, preprocess=StandardScaler();, score=(train=1.000, test=0.971) total time= 0.0s
[CV 3/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=4, preprocess=StandardScaler();, score=(train=1.000, test=0.978) total time= 0.0s
[CV 4/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=4, preprocess=StandardScaler();, score=(train=1.000, test=0.968) total time= 0.0s
...
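For reference, the per-candidate train/test comparison in the log can also be read off `cv_results_` after the search finishes. This is a minimal self-contained sketch, not the full search above — the small dataset and the tiny KNN-only grid here are illustrative choices:

```python
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Small illustrative dataset, same generator as in the question
x_small, y_small = make_multilabel_classification(n_samples=500, n_classes=1, random_state=0)

search = GridSearchCV(
    MultiOutputClassifier(KNeighborsClassifier()),
    {'estimator__n_neighbors': [1, 3, 5]},  # illustrative grid, not the full range(1, 30)
    cv=3,
    return_train_score=True,  # needed for mean_train_score to appear in cv_results_
)
search.fit(x_small, y_small)

# cv_results_ holds one row per candidate; compare train vs. test means directly
results = pd.DataFrame(search.cv_results_)
print(results[['param_estimator__n_neighbors',
               'mean_train_score', 'mean_test_score']])
```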
Next, I increased n_classes to two and trained the same way.
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
x_data, y_data = make_multilabel_classification(n_samples=10000, n_classes=2, random_state=0)
print(x_data.shape)
print(y_data.shape)
pip = Pipeline([('preprocess', StandardScaler()), ('classifier', DecisionTreeClassifier())])
grid_param = [
{'preprocess': [StandardScaler()], 'classifier': [MultiOutputClassifier(KNeighborsClassifier())], 'classifier__estimator__n_neighbors': range(1,30)},
{'preprocess': [None], 'classifier': [MultiOutputClassifier(LogisticRegression())]},
{'preprocess': [None], 'classifier': [MultiOutputClassifier(DecisionTreeClassifier())], 'classifier__estimator__max_depth': range(2,10), 'classifier__estimator__criterion': ['gini']},
{'preprocess': [None], 'classifier': [MultiOutputClassifier(RandomForestClassifier())], 'classifier__estimator__n_estimators': range(20,200,20)},
]
grid_m = GridSearchCV(pip, grid_param, cv=5, return_train_score=True, verbose=3)
grid_m.fit(x_data, y_data)
- Result
The test set score dropped noticeably.
Fitting 5 folds for each of 47 candidates, totalling 235 fits
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.720) total time= 0.0s
[CV 2/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.730) total time= 0.0s
[CV 3/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.731) total time= 0.0s
[CV 4/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.715) total time= 0.0s
[CV 5/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=1, preprocess=StandardScaler();, score=(train=1.000, test=0.726) total time= 0.0s
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.728) total time= 0.0s
[CV 2/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.731) total time= 0.0s
[CV 3/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.730) total time= 0.0s
[CV 4/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.722) total time= 0.0s
[CV 5/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=2, preprocess=StandardScaler();, score=(train=1.000, test=0.719) total time= 0.0s
[CV 1/5] END classifier=MultiOutputClassifier(estimator=KNeighborsClassifier()), classifier__estimator__n_neighbors=3, preprocess=StandardScaler();, score=(train=1.000, test=0.726) total time= 0.0s
...
As n_classes increases, the test set score decreases. Why does this happen?
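One thing that may be relevant to how the score is read: when `y` is a 2-D multilabel array, the default `.score` used by `GridSearchCV` is `accuracy_score`, which for multilabel targets is *subset accuracy* — a sample counts as correct only if every label in the row matches. The toy values below are made up to illustrate that behavior:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical 2-label predictions for 4 samples
y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])  # second row has one wrong label

# For 2-D (multilabel) targets, accuracy_score is subset accuracy:
# the second row is fully wrong even though one of its two labels is right.
print(accuracy_score(y_true, y_pred))  # 0.75 (3 of 4 rows match exactly)
```

Under this metric, scores are expected to fall as the number of labels grows, since a prediction must get all labels right at once.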