Say we’ve got a dataframe with a mixture of categorical and numerical features, containing missing values, which will be used for binary classification.
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Sample data creation with missing values
data = {
'cat_feature1': np.random.choice(['A', 'B', 'C'], size=100),
'cat_feature2': np.random.choice(['X', 'Y', 'Z'], size=100),
'num_feature1': np.random.rand(100),
'num_feature2': np.random.rand(100),
'num_feature3': np.random.rand(100),
'binary_outcome': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)
# Introduce missing values
nan_indices = np.random.choice(df.index, size=10, replace=False)
df.loc[nan_indices, 'num_feature1'] = np.nan
df.loc[nan_indices, 'num_feature2'] = np.nan
nan_indices_cat = np.random.choice(df.index, size=5, replace=False)
df.loc[nan_indices_cat, 'cat_feature1'] = np.nan
As expected, when I try this (with X being the features without the target):
X = df.drop(columns='binary_outcome')
itp = IterativeImputer(estimator=RandomForestRegressor(), random_state=42)
itp.fit(X)
I get:
ValueError: could not convert string to float: 'A'
An obvious solution would be to first one-hot encode the categorical features, e.g. something like this:
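(Just a sketch of the encoding step I have in mind: I use pd.get_dummies with dtype=float, and manually put NaN back into the dummy columns for rows where the original value was missing, so the imputer still treats them as missing; the cat_feature1_A-style column names come from get_dummies' default prefixing.)
cat_cols = ['cat_feature1', 'cat_feature2']
num_cols = ['num_feature1', 'num_feature2', 'num_feature3']
# One-hot encode the categorical columns as floats (0.0 / 1.0)
dummies = pd.get_dummies(X[cat_cols], dtype=float)
# Rows that were originally NaN come out as all zeros by default,
# so set them back to NaN for IterativeImputer to fill in
for col in cat_cols:
    block = [c for c in dummies.columns if c.startswith(col + '_')]
    dummies.loc[X[col].isna(), block] = np.nan
X_enc = pd.concat([dummies, X[num_cols]], axis=1)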
BUT after doing so, how would I guarantee that only one of the encoded columns is 1 for each row?
To make it clearer: cat_feature1 has values 'A', 'B' and 'C'. After one-hot encoding and running the iterative imputer, is there any guarantee that only one of the A, B, C columns would be 1 and the other two 0?
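As far as I understand, IterativeImputer fits a separate regression for each column that has missing values, so this is roughly what I'd run on X_enc from above to check (again just a sketch; the cat_feature1_* names are the ones produced by get_dummies):
itp = IterativeImputer(estimator=RandomForestRegressor(), random_state=42)
X_imp = pd.DataFrame(itp.fit_transform(X_enc), columns=X_enc.columns, index=X_enc.index)
# Look at the rows where cat_feature1 was originally missing:
# each cat_feature1_* column is filled by its own regressor, so the
# values need not be exactly 0/1 and need not sum to 1
cat1_cols = ['cat_feature1_A', 'cat_feature1_B', 'cat_feature1_C']
print(X_imp.loc[df['cat_feature1'].isna(), cat1_cols])
My worry is exactly that: nothing in the imputer itself seems to enforce a valid one-hot row.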