I want to implement a prediction architecture that features an intermediate classifier. This model is fitted on one portion of the training set and then predicts probabilities for the classes of a binary feature on another portion. The transformer then removes every instance whose predicted probability exceeds a given threshold. Obviously, the y values corresponding to those instances should also be removed. This is not supported by sklearn's default behaviour, which does not allow y to be transformed within a pipeline that is to be optimized through, e.g., GridSearchCV.
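As far as I can tell from the source, Pipeline only threads X through the steps; schematically it does something like this (a simplified sketch, not the real implementation):

# Each transformer may replace X, but the same y is handed to every
# step untouched -- there is no hook to swap in a filtered y.
def pipeline_fit_sketch(steps, X, y):
    Xt = X
    for name, transformer in steps[:-1]:
        Xt = transformer.fit_transform(Xt, y)  # only Xt is updated; y is reused as-is
    final_name, final_estimator = steps[-1]
    final_estimator.fit(Xt, y)                 # the final step still gets the original y
    return final_estimator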
My custom model-transformer is the following (I’m using sklearn==1.4.0, this is very important):
import math
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier

class My_Int_Classifier(BaseEstimator, TransformerMixin):
    def __init__(self, pctg_int_model=0.5, int_target=None, thres_prob=0.5,
                 y_final_target=None, **kwargs):
        # Intermediate binary classifier; any extra kwargs are forwarded to it.
        self.model = RandomForestClassifier(
            n_estimators=100, criterion='gini', max_depth=None,
            min_weight_fraction_leaf=0.0, max_features='sqrt',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            bootstrap=True, oob_score=False, n_jobs=None,
            random_state=None, verbose=0, warm_start=False,
            class_weight=None, ccp_alpha=0.0, max_samples=None,
            monotonic_cst=None, min_samples_split=2, min_samples_leaf=1)
        self.__dict__.update(kwargs)
        self.pctg_int_model = pctg_int_model  # fraction of rows used to fit the intermediate model
        self.int_target = int_target          # column holding the intermediate binary target
        self.thres_prob = thres_prob          # probability threshold for the filter
        self.y_final_target = y_final_target  # y for the final estimator, row-aligned with X
        self.model = self.model.set_params(**kwargs)

    def fit(self, X, y=None):
        # Fit the intermediate classifier on the first pctg_int_model fraction of X.
        int_X = X.iloc[0:math.floor(self.pctg_int_model * len(X))]
        self.model.fit(int_X.drop(self.int_target, axis=1), int_X[self.int_target])
        return self

    def transform(self, X, y=None):
        # Score the remaining fraction of X with the intermediate model.
        remaining_X = X.iloc[math.floor(self.pctg_int_model * len(X)):].copy()
        remaining_X = remaining_X.drop(self.int_target, axis=1)
        remaining_y = self.y_final_target.iloc[math.floor(self.pctg_int_model * len(self.y_final_target)):]
        probs = self.model.predict_proba(remaining_X)[:, 0]
        remaining_X['proba_int'] = probs
        remaining_X = remaining_X.reset_index(drop=True)
        remaining_y = remaining_y.reset_index(drop=True)
        remaining_y = pd.Series(remaining_y, name='remaining_y')
        # Filter X and y together so their rows stay aligned.
        df = pd.concat([remaining_X, remaining_y], axis=1)
        df = df[df['proba_int'] <= self.thres_prob]
        remaining_X = df.drop('remaining_y', axis=1)
        remaining_y = df['remaining_y']
        return (remaining_X, remaining_y)
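Together with the fit_transform that TransformerMixin provides, this works fine in isolation. A toy run for reference (column names and data are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    'f1': rng.normal(size=100),
    'f2': rng.normal(size=100),
    'int_binary_target': rng.integers(0, 2, size=100),
})
y_train = pd.Series(rng.normal(size=100), name='y')

int_model = My_Int_Classifier(int_target='int_binary_target', y_final_target=y_train)
X_red, y_red = int_model.fit_transform(X_train)
print(X_red.shape, y_red.shape)   # rows are filtered out of X and y together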
The problem is that I need to pass remaining_y as the new y to the following (last) estimator in a pipeline of the form
int_model = My_Int_Classifier(int_target='int_binary_target', y_final_target=y_train)
pipeline_rfr = Pipeline([('scaler', StandardScaler()),
                         ('int_model', int_model),
                         ('final_model', RandomForestRegressor())])
which doesn’t work.
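To make the failure concrete: running everything up to the final estimator by hand shows what final_model actually receives (this assumes the toy X_train/y_train from above; note the scaler needs set_output(transform='pandas') so that my transformer still sees a DataFrame):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

partial = Pipeline([
    ('scaler', StandardScaler().set_output(transform='pandas')),
    ('int_model', My_Int_Classifier(int_target='int_binary_target',
                                    y_final_target=y_train)),
])
out = partial.fit_transform(X_train, y_train)
print(type(out))   # <class 'tuple'> -- (remaining_X, remaining_y)

The full pipeline then calls final_model.fit(out, y_train): RandomForestRegressor rejects the tuple, and even if it unpacked it, y_train would no longer line up with the filtered rows.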
I would like a solution like the one below, of the form provided here:
class MyWrapperEstimator(RealEstimator):
    def fit(self, X, y=None):
        if isinstance(X, tuple):
            X, y = X
        return super().fit(X=X, y=y)
so that I could wrap the last RandomForestRegressor in this class, but this did not work either.
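For concreteness, this is the kind of wrapper I mean, with the real estimator subclassed (a sketch):

from sklearn.ensemble import RandomForestRegressor

class MyWrapperEstimator(RandomForestRegressor):
    def fit(self, X, y=None, **fit_params):
        # If the previous step handed over an (X, y) tuple, unpack it
        # and ignore the y that the Pipeline itself passes down.
        if isinstance(X, tuple):
            X, y = X
        return super().fit(X, y, **fit_params)

My guess is that even where fit can be made to work this way, predict still breaks, since transform returns a tuple at prediction time too.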
Any insight is greatly appreciated.