At the moment I'm developing a machine learning project. It's a supervised learning problem: the input is horse_data (size, weight, performance, …) and the output is the ingredients of a recipe. I want to predict the recipe ingredients for given horse data.
So this is a summary of my horse_data: HorseData
This is a summary of my targets (the recipe):
FirstPartTargets
SecondPartTargets
With this data I want to train an ML model, in this case a random forest regressor, because the inputs contain many categorical variables (keeping, performance, worktype, race and raceType).
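A minimal sketch of how those categorical columns could be one-hot encoded before training. This is an assumption on my part: the input DataFrame name horse_data and the column names are taken from the description above and may differ from the real code.

import pandas as pd

# Assumed names, taken from the problem description above.
categorical_cols = ['keeping', 'performance', 'worktype', 'race', 'raceType']

# One-hot encode the categorical inputs so the random forest
# only sees numeric feature columns.
X = pd.get_dummies(horse_data, columns=categorical_cols)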
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

resDf = pd.DataFrame(columns=['Train R^2 Score', 'Test R^2 Score',
                              'Train MSE', 'Test MSE',
                              'Train RMSE', 'Test RMSE',
                              'Train MAE', 'Test MAE'])
param_grid = {
    'n_estimators': [100, 200, 1000],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Fit one grid-searched random forest per ingredient (target column).
for ing in Y.columns:
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y[ing], test_size=0.2, random_state=52)
    gridModel = GridSearchCV(estimator=RandomForestRegressor(), cv=10,
                             param_grid=param_grid, n_jobs=-1,
                             scoring='neg_mean_squared_error', verbose=True)
    gridModel.fit(X_train, Y_train)

    # Training-set metrics
    y_pred_train = gridModel.predict(X_train)
    train_mse = mean_squared_error(y_true=Y_train, y_pred=y_pred_train)
    train_mae = mean_absolute_error(y_true=Y_train, y_pred=y_pred_train)
    train_r2 = r2_score(y_true=Y_train, y_pred=y_pred_train)
    train_rmse = train_mse ** 0.5

    # Test-set metrics
    y_pred = gridModel.predict(X_test)
    test_mse = mean_squared_error(y_true=Y_test, y_pred=y_pred)
    test_mae = mean_absolute_error(y_true=Y_test, y_pred=y_pred)
    test_r2 = r2_score(y_true=Y_test, y_pred=y_pred)
    test_rmse = test_mse ** 0.5

    resDf.loc[ing] = [train_r2, test_r2, train_mse, test_mse,
                      train_rmse, test_rmse, train_mae, test_mae]
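To make the per-ingredient results easier to interpret, the fitted GridSearchCV object also exposes the winning configuration. A small sketch of what could go inside the loop right after gridModel.fit(...) (not part of my current code):

    # Log the chosen hyper-parameters and the best cross-validated
    # score (negative MSE, given the scoring above) for this ingredient.
    print(f"{ing}: best params = {gridModel.best_params_}, "
          f"best CV score = {gridModel.best_score_:.4f}")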
These are the results:
FirstPartResult
SecondPartResult
The problem for me now is that I sometimes don't understand the results. What I do understand is that my model is overfitting, because in every row the error is higher (and the R² score lower) on the test set than on the training set. But some rows confuse me. VitaminA has a good R² score on the training set but a bad one on the test set (to me this is overfitting), yet the RMSE is very high on both the training and the test set. Also confusing is "schwefel": it has a bad R² score on the training set and a terrible one on the test set, but I can't see from the setup why I get these results. Is the problem the features, or is it that the targets sometimes have a big range?
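To check whether the big target ranges explain the high RMSE values, one idea (not something I have run yet, just a sketch using the Y and resDf objects from the code above) is to normalize each test RMSE by the spread of the corresponding target:

# Compare each ingredient's test RMSE to the spread of that target.
# A high raw RMSE on a target with a huge standard deviation can still
# be an acceptable relative error.
for ing in Y.columns:
    spread = Y[ing].std()
    nrmse = resDf.loc[ing, 'Test RMSE'] / spread  # normalized RMSE
    print(f"{ing}: std = {spread:.2f}, test RMSE / std = {nrmse:.2f}")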