After using cross-validation to see how a custom predictive function performs on unseen data, I applied the function to the original dataset, and the performance (measured by the coefficient of determination) massively decreased.
I’m trying to build a predictive model to take 3 input features and predict a proportion. I defined a custom function to bound the output between 0 and 1:
```python
def xg_curve(X, a, b, c, d, e):
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return 1 / (1 + a * b ** (c * x1 + d * x2 + e * x3))
```
I set up an approximately stratified 5-fold cross-validation by ordering the dataframe by the target variable and then assigning fold labels 0–4 cyclically, so that the distribution of the target in each fold was nearly identical to the other folds and to the dataset as a whole.
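For reference, the fold assignment was done roughly like this (a minimal sketch, assuming a pandas dataframe `over50` with the target column `"Goal Percentage"`, as in the code further down):

```python
import numpy as np

# Sort by the target so consecutive rows have similar target values, then
# hand out fold labels 0-4 cyclically so every fold spans the full range of
# the target (an approximate stratification for a continuous target).
over50 = over50.sort_values("Goal Percentage").reset_index(drop=True)
over50["Fold"] = np.arange(len(over50)) % 5
```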
The mean test-fold R² was 0.6485, with the per-fold values very tightly clustered around it; the mean training R² was 0.655, also very tightly clustered.
Based on domain knowledge, I was happy with around 65% of the variability in the target being accounted for by the features, so I fit the function to the dataset as a whole. However, when I did this, the R² value dropped to 0.478. I have provided my code below; I can't spot any mistakes! Any help would be massively appreciated!
```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.metrics import mean_squared_error, r2_score

def xg_curve(X, a, b, c, d, e):
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return 1 / (1 + a * b ** (c * x1 + d * x2 + e * x3))

X_data = over50[["Required_rate_of_closure", "Shot angle", "Lateral_diff_spin"]].values
y_data = over50["Goal Percentage"].values
folds = over50["Fold"].values

# Initialize lists to store RMSE and R^2 for each fold
train_rmse_values = []
train_r2_values = []
test_rmse_values = []
test_r2_values = []
params_list = []

# Number of folds
k = 5

for i in range(k):
    # Split the data into training and testing sets
    train_idx = folds != i
    test_idx = folds == i
    X_train, X_test = X_data[train_idx], X_data[test_idx]
    y_train, y_test = y_data[train_idx], y_data[test_idx]

    # Fit the model on the training set
    popt, _ = curve_fit(xg_curve, X_train, y_train, p0=[1, 1, 1, 1, 1])

    # Predict on the training set
    y_train_pred = xg_curve(X_train, *popt)

    # Predict on the testing set
    y_test_pred = xg_curve(X_test, *popt)

    # Calculate RMSE and R^2 for the training set
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    train_r2 = r2_score(y_train, y_train_pred)

    # Calculate RMSE and R^2 for the testing set
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)

    # Store the results
    train_rmse_values.append(train_rmse)
    train_r2_values.append(train_r2)
    test_rmse_values.append(test_rmse)
    test_r2_values.append(test_r2)
    params_list.append(popt)

# Calculate mean and standard deviation of RMSE and R^2 across all folds
mean_train_rmse = np.mean(train_rmse_values)
std_train_rmse = np.std(train_rmse_values)
mean_train_r2 = np.mean(train_r2_values)
std_train_r2 = np.std(train_r2_values)
mean_test_rmse = np.mean(test_rmse_values)
std_test_rmse = np.std(test_rmse_values)
mean_test_r2 = np.mean(test_r2_values)
std_test_r2 = np.std(test_r2_values)

print("Training RMSE: Mean = {:.4f}, Std = {:.4f}".format(mean_train_rmse, std_train_rmse))
print("Training R^2: Mean = {:.4f}, Std = {:.4f}".format(mean_train_r2, std_train_r2))
print("Validation RMSE: Mean = {:.4f}, Std = {:.4f}".format(mean_test_rmse, std_test_rmse))
print("Validation R^2: Mean = {:.4f}, Std = {:.4f}".format(mean_test_r2, std_test_r2))

# Fit on the full dataset and report the final parameters and R^2
final_params, _ = curve_fit(xg_curve, X_data, y_data, p0=[1, 1, 1, 1, 1])
a, b, c, d, e = final_params
print(f'Final Parameters: a = {a}, b = {b}, c = {c}, d = {d}, e = {e}')
print(r2_score(xg_curve(X_data, *final_params), y_data))
```