I’m currently working on a machine learning project. It’s a supervised regression problem: my goal is to predict nutritional values (energy, vitamins, etc.) from data about an animal (keeping, size, weight, …). First I cleaned the data and encoded the categorical features with LabelEncoding. I chose Random Forest as the algorithm because I read that trees handle mixed data (categorical and continuous) well. I trained the model with several parameter settings and noticed that I get excellent training results but very bad test results. In my opinion this indicates overfitting: the model is learning the noise. As far as I know, I have two options against that: more data and reducing the complexity of the model. I tried PCA, removing some features, and changing hyperparameters (max_depth to 15), but none of these actions helped. When I reduced max_depth I got a higher training error, but the test error stayed extremely high.
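As an aside, I understand that LabelEncoder is really meant for targets, and that for input features the usual route is OneHotEncoder (or OrdinalEncoder) inside a ColumnTransformer. A minimal sketch of what that would look like; the column names (keeping, size, weight) and values are just illustrative, not my real data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical animal data; column names and values are made up for illustration.
df = pd.DataFrame({
    "keeping": ["indoor", "outdoor", "indoor", "outdoor"],
    "size": ["small", "large", "small", "medium"],
    "weight": [4.2, 30.0, 3.8, 12.5],
})

# One-hot encode the categorical columns, pass the numeric column through unchanged.
ct = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["keeping", "size"])],
    remainder="passthrough",
)
X_enc = ct.fit_transform(df)
print(X_enc.shape)  # (4, 6): 2 + 3 one-hot columns plus the weight column
```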
So what could the problem be here? Is the model overfitting, or is the data too noisy?
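One way I could try to tell these apart is a learning curve: if the gap between training and cross-validated scores stays large as the training set grows, that points to overfitting rather than pure label noise. A minimal sketch (using synthetic stand-in data from make_regression, since I can't share my dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; replace with the real X, Y.
X, Y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Train vs. cross-validated R^2 at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, Y, cv=5, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1,
)
print(train_scores.mean(axis=1))  # stays high if the model fits the training data
print(val_scores.mean(axis=1))    # a persistent large gap suggests overfitting
```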
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import KernelPCA

param_grid = {
    'n_estimators': list(range(50, 500, 50)),
    'max_depth': list(range(5, 20, 5)),
}
estimator = RandomForestRegressor()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=52)

# normalize is my own helper: it fits a scaler on the given data and
# returns (scaled_data, fitted_scaler). The scalers are fitted on the
# training set only and then applied to the test set.
X_train, scalerX = normalize(X_train)
Y_train, scalerY = normalize(Y_train)
X_test = scalerX.transform(X_test)
Y_test = scalerY.transform(Y_test)

gridModel = GridSearchCV(estimator=estimator, param_grid=param_grid,
                         n_jobs=4, cv=5, scoring='neg_mean_squared_error')
gridModel.fit(X_train, Y_train)
print(gridModel.best_params_)
best_params: {'max_depth': 15, 'n_estimators': 150}

When I change the max_depth grid to range(5, 50, 5), I get best_params: {'max_depth': 30, 'n_estimators': 50}.
y_pred_test = gridModel.predict(X_test)
test_r2_score = r2_score(y_true=Y_test, y_pred=y_pred_test)
y_pred_train = gridModel.predict(X_train)
train_r2_score = r2_score(y_true=Y_train, y_pred=y_pred_train)
print("Result Test:", test_r2_score)
print("Result Train:", train_r2_score)
{'max_depth': 15, 'n_estimators': 150}
Result Test: -2.952394644421328e+31
Result Train: 0.8043381537451035
{'max_depth': 30, 'n_estimators': 50}
Result Test: -7.37835882483847e+30
Result Train: 0.9286384515560636
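A test R² on the order of -10³¹ is far beyond normal overfitting, so I also wonder whether my manual target scaling is interacting badly with the evaluation. One thing I could try is letting scikit-learn handle the target scaling via TransformedTargetRegressor, so predictions and scores stay in the original units. A sketch on synthetic stand-in data:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real X, Y.
X, Y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=52)

# The target is scaled inside the estimator and inverse-transformed on
# predict, so R^2 is computed consistently in the original units.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=100, random_state=0),
    transformer=StandardScaler(),
)
model.fit(X_train, Y_train)
print(r2_score(Y_test, model.predict(X_test)))
```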