I’m just a beginner, still learning about sparse matrices and how they interact with other tools.
Here is the problem I ran into; I searched the web and couldn’t find a proper answer.
I one-hot encoded the categorical columns with OneHotEncoder, using the default sparse_output=True.
When I tried to fit a RandomForestClassifier on transformed_X and the target y after the train/test split, it raised the error shown below.
#seed
np.random.seed(42)
#one hot encoding imports
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer as ct
#Data splitting
X = f_data.drop('attended', axis = 1)
y = f_data['attended']
#select columns
cat_col = ['days_before','day_of_week','time','category']
#initialize encoder
enc = OneHotEncoder()
#fitting for encoder using ct
transformer = ct([('enc',enc,cat_col)], remainder = 'passthrough')
transformed_X = transformer.fit_transform(X)
transformed_X
<1480x36 sparse matrix of type '<class 'numpy.float64'>' with 10360 stored elements in Compressed Sparse Row format>
#BaseLine model
np.random.seed(42)
#imports
from sklearn.model_selection import train_test_split as tts
from sklearn.ensemble import RandomForestClassifier
#splitting
X_train,Y_train,X_test,Y_test = tts(transformed_X,y, test_size = 0.2)
#model fitting
model = RandomForestClassifier()
model.fit(X_train,Y_train)
#model score
blsc = model.score(X_test,Y_test)
print(f'Baseline Model Score is : {blsc}')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[416], line 13
11 #modelling
12 model = RandomForestClassifier()
---> 13 model.fit(X_train,Y_train)
15 #model score
16 blsc = model.score(X_test,Y_test)
File G:\Md Jaffer\UDEMY\Machine Learning Course ZTM\Projects\HeartDesease_Classification\env\Lib\site-packages\sklearn\base.py:1474, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1467 estimator._validate_params()
1469 with config_context(
1470 skip_parameter_validation=(
1471 prefer_skip_nested_validation or global_skip_validation
1472 )
1473 ):
-> 1474 return fit_method(estimator, *args, **kwargs)
File G:\Md Jaffer\UDEMY\Machine Learning Course ZTM\Projects\HeartDesease_Classification\env\Lib\site-packages\sklearn\ensemble\_forest.py:361, in BaseForest.fit(self, X, y, sample_weight)
359 # Validate or convert input data
360 if issparse(y):
--> 361 raise ValueError("sparse multilabel-indicator for y is not supported.")
363 X, y = self._validate_data(
364 X,
365 y,
(...)
369 force_all_finite=False,
370 )
371 # _compute_missing_values_in_feature_mask checks if X has missing values and
372 # will raise an error if the underlying tree base estimator can't handle missing
373 # values. Only the criterion is required to determine if the tree supports
374 # missing values.
ValueError: sparse multilabel-indicator for y is not supported.
I then tried setting sparse_output=False, and got the "inconsistent numbers of samples" error instead. The actual shape after one-hot encoding is (1480, 36).
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[444], line 13
11 #modelling
12 model = RandomForestClassifier()
---> 13 model.fit(X_train,Y_train)
15 #model score
16 blsc = model.score(X_test,Y_test)
ValueError: Found input variables with inconsistent numbers of samples: [1184, 296]
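Both errors look consistent with the order in which the split is unpacked rather than with sparsity itself. train_test_split returns the train and test pieces of each input in turn, so for two inputs the order is X_train, X_test, y_train, y_test. With the unpacking used above, the name Y_train would receive the 296-row test slice of transformed_X, which is sparse in the first run and has a different row count in the second. The sketch below is only an illustration: it uses synthetic data of the same shape (1480 x 36) as a stand-in for transformed_X and y, and the variable names, n_estimators and random_state values are assumptions, not part of the original code.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for transformed_X: a 1480 x 36 CSR matrix of 0/1 values.
X = sparse.csr_matrix(rng.integers(0, 2, size=(1480, 36)).astype(np.float64))
# Synthetic stand-in for the binary target column.
y = rng.integers(0, 2, size=1480)

# Return order for two inputs: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# RandomForestClassifier accepts a sparse X; only y must be dense.
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
print(f"Baseline Model Score is : {score}")
```

With this order the sparse output of OneHotEncoder can be passed to the forest directly, so switching to sparse_output=False should not be needed for this step.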