I am working in a nonparametric setting and wish to conduct an independence test using permutations. I am using a random forest as my regression model and have the following code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
X = pd.read_csv("X_p.csv", delimiter=",", engine="c", index_col=0, low_memory=False)
Y = pd.read_csv("Y_p.csv", delimiter=",", engine="c", low_memory=False)
X = X.iloc[:, 1:]
Y = Y.iloc[:, 1:]
Y = Y['0'].values.ravel()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)
rf_model = RandomForestRegressor(n_estimators=100,min_samples_split=5,max_depth=50, random_state=42)
original_mse = mean_squared_error(Y_train, rf_model.fit(X_train, Y_train).predict(X_train))
permuted_mse = []
for _ in range(10):  # Number of permutations
    permuted_Y = np.random.permutation(Y_train)
    permuted_mse.append(mean_squared_error(permuted_Y, rf_model.fit(X_train, permuted_Y).predict(X_train)))
# Step 6: Significance Testing based on MSE
p_value_mse = np.sum(permuted_mse > original_mse) / (len(permuted_mse)+1)
# Print results
print("Original MSE:", original_mse)
print("Permutation test p-value based on MSE:", p_value_mse)
Even with only 10 permutations, the code takes quite a long time to run, and it always returns a very high p-value (basically all of the permuted MSEs are larger than the original one). What is my mistake? I have tried modifying my random forest parameters, but nothing seems to help.
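For comparison, here is a minimal sketch of one common way such a permutation test is set up. It is not necessarily the right design for this data. It runs on synthetic data rather than the CSV files, and the helper names `held_out_mse` and `permutation_p_value`, the forest settings, and the sample sizes are all assumptions made for illustration. It differs from the code above in two ways: the MSE is computed on a held-out split instead of the training data, and the p-value counts permuted MSEs that are at or below the observed one, so a small MSE on the real labels counts as evidence against independence. With the comparison in the other direction, 10 out of 10 larger permuted MSEs gives 10/11, which is about 0.91.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def held_out_mse(X, y, seed=0):
    """Fit a small forest on a train split and return the MSE on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # Small forest (assumed settings) so that repeated refits stay cheap.
    model = RandomForestRegressor(
        n_estimators=20, min_samples_leaf=5, random_state=seed
    )
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))


def permutation_p_value(X, y, n_permutations=30, seed=0):
    """One-sided permutation p-value: is the observed MSE unusually small?"""
    rng = np.random.default_rng(seed)
    observed = held_out_mse(X, y, seed)
    # Permuting y breaks any link between X and y, which simulates the null.
    permuted = np.array(
        [held_out_mse(X, rng.permutation(y), seed) for _ in range(n_permutations)]
    )
    # The +1 in numerator and denominator counts the observed statistic itself.
    p_value = (1 + np.sum(permuted <= observed)) / (n_permutations + 1)
    return observed, permuted, p_value


# Synthetic data (assumption): one response depends on X, the other does not.
data_rng = np.random.default_rng(0)
X_demo = data_rng.normal(size=(300, 3))
y_dependent = 2.0 * X_demo[:, 0] + data_rng.normal(scale=0.5, size=300)
y_independent = data_rng.normal(size=300)

obs_dep, perm_dep, p_dep = permutation_p_value(X_demo, y_dependent)
obs_ind, perm_ind, p_ind = permutation_p_value(X_demo, y_independent)

print("Dependent case:   observed MSE", obs_dep, "p-value", p_dep)
print("Independent case: observed MSE", obs_ind, "p-value", p_ind)
```

In this sketch the smallest attainable p-value is 1 / (n_permutations + 1), so the number of permutations bounds the resolution of the test. Keeping the forest small is one way to limit the run time, since every permutation requires a full refit.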