I want to make a prediction with scikit-learn and a Random Forest, but I want to select the best features first.
I have 70 features and I want the best ones to be selected automatically.
So SequentialFeatureSelector or RFE is no use to me, because I don't want to set an arbitrary number of features to select.
I see that RFECV is interesting because it selects the best features automatically, using cross-validation on top of RFE.
Because it uses cross-validation, I don't split my data into train and test sets; to me it seems better to use 100% of the data with cross-validation. Am I wrong?
Here is my code:

import geopandas as gpd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

data = gpd.read_file(path_train)
X = data.drop(columns=['geometry', 'classvalue', 'classname'])
y = data['classvalue']
classifier = RandomForestClassifier(n_estimators=100, random_state=None, n_jobs=-1)
cv = StratifiedKFold(5)
classifier_selected = RFECV(
    estimator=classifier,
    step=1,
    cv=cv,
    scoring="accuracy",
    min_features_to_select=1,
    n_jobs=-1,
)
classifier_selected.fit(X, y)
print(f"Optimal number of features: {classifier_selected.n_features_}")
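For context, here is a minimal, self-contained sketch of the same fit on synthetic data (the dataset and sizes are placeholders, not my real 70-feature data), showing how to check what RFECV kept: support_ is a boolean mask over all original columns, and n_features_ counts only the retained ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the GeoDataFrame-derived X / y above
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    step=1,
    cv=StratifiedKFold(3),
    scoring="accuracy",
    min_features_to_select=1,
    n_jobs=-1,
)
selector.fit(X, y)

# support_ is a mask over the ORIGINAL 10 columns;
# n_features_ is how many of them were kept
print(selector.n_features_)
print(selector.support_.sum())  # same count, read from the mask
```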
Now I want to use classifier_selected to make predictions, but what is the best solution?

- Option 1: make a prediction directly with classifier_selected on the data to predict? It's not clear whether it will use only the best features, because classifier_selected.feature_names_in_ returns all the features with no selection; its length differs from classifier_selected.n_features_.
- Option 2: create a new classifier trained on only the selected features (with cross-validation, or just a plain Random Forest?)?

I don't know what the proper practice is.
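To make the two options concrete, here is a hedged sketch on synthetic data (names like X_new are placeholders). My understanding is that RFECV.predict applies the column mask and then calls the inner estimator, which was refit on the selected columns, so Option 1 and a by-hand Option 2 using the fitted inner estimator should agree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)
X_new = X[:10]  # hypothetical "data to predict"

sel = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
            cv=StratifiedKFold(3))
sel.fit(X, y)

# Option 1: predict directly; RFECV masks the columns internally,
# then predicts with the estimator refit on the selected columns
pred1 = sel.predict(X_new)

# Option 2 done by hand: mask the columns, then use the fitted
# inner estimator (sel.estimator_) directly
pred2 = sel.estimator_.predict(sel.transform(X_new))

print(np.array_equal(pred1, pred2))
```

If that equivalence holds, retraining a separate model on the selected columns (Option 2) is only needed when I want a standalone classifier rather than the RFECV wrapper.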