I have a Spotify dataset with around 100k records and 28 features, a mix of numerical (discrete and continuous) and binary variables; some of the numerical variables have many zero values.
I want to perform machine learning classification over 113 genres, which is a lot of classes. I want to use robust Principal Component Analysis to improve my model, but I am confused about why normalized data (I use StandardScaler) yields much worse accuracy than unnormalized data.
At first, after data preprocessing, I fit a simple decision tree classifier (after hyperparameter tuning) and get 43% accuracy.
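For reference, a minimal sketch of the kind of tuning step described above. This is not the original code: the synthetic data, grid values, and split are assumptions standing in for the real Spotify features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 28 Spotify features (real data not shown)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 28))
y = rng.integers(0, 5, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Cross-validated search over a small, hypothetical hyperparameter grid
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'min_samples_leaf': [1, 3, 10], 'max_depth': [None, 10, 20]},
    cv=3,
)
grid.fit(X_train, y_train)
acc = accuracy_score(y_test, grid.predict(X_test))
print(grid.best_params_, acc)
```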
Then I perform Principal Component Analysis (PCA) to reduce the number of features, so I can cluster the data to reduce the number of classes and detect and visualize outliers more easily.
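The reduce-then-cluster step described above can be sketched roughly as follows (synthetic data, an assumed 2 components, and KMeans with an assumed cluster count, not the original pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature matrix (real data not shown)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 28))

# Project onto the first 2 principal components, then cluster the projection
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(X_2d)
print(X_2d.shape, np.unique(labels))
```

Plotting `X_2d` colored by `labels` (or by genre) is what produces scatter plots like the ones attached below.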
As far as I know, my data needs to be normalized before applying PCA (I used StandardScaler here). However, I discovered that without normalization, the performance of my decision tree classifier with 1-component PCA improves to 51%, whereas with normalization it drops to only 5% accuracy with 2 components.
How is this possible? How can I improve the algorithm's performance?
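One way to see how this can happen: without scaling, PCA is dominated by whichever raw feature has the largest variance (e.g. a milliseconds-scale column versus 0-1 ratios), so PC1 can coincide with a single informative feature; after scaling, that feature's contribution is diluted across many components. A minimal sketch with synthetic data, where the label is (by construction, as an assumption) driven by one high-variance feature:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in: one raw-scale feature dominates the variance
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 28))
X[:, 0] *= 1000                       # dominant high-variance feature
y = (X[:, 0] > 0).astype(int)         # label driven by that feature
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for scale in (False, True):
    steps = [StandardScaler()] if scale else []
    pipe = make_pipeline(*steps, PCA(n_components=2),
                         DecisionTreeClassifier(min_samples_leaf=3, random_state=0))
    pipe.fit(X_train, y_train)
    results['scaled' if scale else 'raw'] = accuracy_score(
        y_test, pipe.predict(X_test))
print(results)
```

In this construction the unscaled pipeline scores much higher, because PC1 without scaling is essentially the label-carrying feature itself; whether that mirrors the real dataset depends on which Spotify features actually carry the genre signal.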
My code:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

pca = PCA(n_components=20)  # I tried many values here; accuracy increases with more components
pca.fit(X_train_norm)
X_train_pca = pca.transform(X_train_norm)
X_test_pca = pca.transform(X_test_norm)

# Decision Tree
dt = DecisionTreeClassifier(min_samples_leaf=3, random_state=42)
dt.fit(X_train_pca, y_train)
y_pred = dt.predict(X_test_pca)
print('Accuracy %s' % accuracy_score(y_test, y_pred))
print('F1-score %s' % f1_score(y_test, y_pred, average=None))
print(classification_report(y_test, y_pred))
import numpy as np
import matplotlib.pyplot as plt

pca_all = PCA(n_components=25)
pca_all.fit(X_train_norm)
X_train_pca = pca_all.transform(X_train_norm)
explained_variance = pca_all.explained_variance_ratio_

plt.figure(figsize=(10, 6))
plt.bar(range(len(explained_variance)), explained_variance, alpha=0.7, align='center',
        label='Individual explained variance')
plt.step(range(len(explained_variance)), np.cumsum(explained_variance), where='mid',
         label='Cumulative explained variance')
plt.xlabel('Principal Component Index')
plt.ylabel('Explained Variance Ratio')
plt.legend(loc='best')
plt.title('Explained Variance by Principal Components with Normalization')
plt.grid()
plt.show()
I attach the data distribution of the PCA after standardization (round shape) and without normalization (elbow shape); the colors represent the genres. I also attach the explained variance of my PCA: [PCA scatter plots and explained variance](https://i.sstatic.net/HaDoNROy.png)