I would like to know how to increase the accuracy score and lower the loss in a multilabel classification problem.
The scikit-learn documentation covers multilabel classification under "Multiclass and multioutput algorithms", and I am currently testing the approaches described there.
(https://scikit-learn.org/stable/modules/multiclass.html)
For sample data, I used make_multilabel_classification from sklearn.datasets with 10 features and created several datasets by varying n_classes.
When the multilabel dataset has two classes, the accuracy and loss are reasonably satisfactory.
from numpy import mean
from numpy import std
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, hamming_loss
# define dataset
X, y = make_multilabel_classification(n_samples=10000, n_features=10, n_classes=2, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# hold out 20% of the samples for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
# standardize features using statistics from the training split only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
print(scaler.mean_)
print(scaler.var_)
x_train_std = scaler.transform(X_train)
x_test_std = scaler.transform(X_test)
# fit a 3-nearest-neighbors classifier and predict on the test split
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train_std, y_train)
pred = knn.predict(x_test_std)
# report both metrics on the test split
print(accuracy_score(y_test, pred))
print(hamming_loss(y_test, pred))
accuracy_score: 0.8345, hamming_loss: 0.08875
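For reference, here is a minimal sketch of what these two metrics measure on multilabel input. For multilabel targets, accuracy_score computes subset accuracy (a sample counts as correct only if every one of its labels matches), while hamming_loss is the fraction of individual labels that are wrong. The arrays below are made up for illustration and are not taken from my dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up labels for illustration: 4 samples, 2 labels each.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])

# Subset accuracy: only rows 0 and 2 match on every label.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: 2 wrong labels out of 8 label slots.
ham = hamming_loss(y_true, y_pred)

# Per-label accuracy: how often each label column is right on its own.
per_label_acc = (y_true == y_pred).mean(axis=0)

print(subset_acc)     # 0.5
print(ham)            # 0.25
print(per_label_acc)  # [1.  0.5]
```

Because a sample has to be right on all labels at once, subset accuracy can never exceed 1 - hamming_loss.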
However, once the number of classes reaches 3 or more, the accuracy score drops and the loss increases.
# define dataset
X, y = make_multilabel_classification(n_samples=10000, n_features=10, n_classes=3, random_state=1)
n_classes=3 -> accuracy_score: 0.772, hamming_loss: 0.116
n_classes=4 -> accuracy_score: 0.4875, hamming_loss: 0.194125
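The comparison across n_classes can be run in a single loop, roughly as sketched below. It uses the same dataset settings, split, scaling, and 3-nearest-neighbors model as my code above, but wraps the scaler and classifier in a pipeline. The helper name evaluate_knn is only for illustration, and the exact numbers may vary with the scikit-learn version.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_knn(n_classes):
    """Illustrative helper: return (subset accuracy, Hamming loss) for one n_classes."""
    X, y = make_multilabel_classification(
        n_samples=10000, n_features=10, n_classes=n_classes, random_state=1
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=101
    )
    # The pipeline fits the scaler on the training split only, then the classifier.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return accuracy_score(y_test, pred), hamming_loss(y_test, pred)


# Evaluate the same setup for 2, 3, and 4 classes.
results = {k: evaluate_knn(k) for k in (2, 3, 4)}
for k, (acc, ham) in results.items():
    print(f"n_classes={k} -> accuracy_score: {acc}, hamming_loss: {ham}")
```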
The results are similar when using RandomForestClassifier or MLPClassifier, as shown in the reference, and when using ClassifierChain(estimator=SVC) to wrap an algorithm that does not natively support multilabel classification.
Which hyperparameters should I adjust to improve accuracy?