I have a text classification problem where the input has two features: a text and a language.
The text is a string variable. The language is a string variable that takes values such as “EN”, “FR”, “DE”, etc. The output is an imbalanced categorical variable.
As in a typical NLP problem, the text feature was tokenized with the Keras tokenizer and then padded:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text)
X_train_sequences = tokenizer.texts_to_sequences(X_train_text)
X_test_sequences = tokenizer.texts_to_sequences(X_test_text)
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_len, padding='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_len, padding='post')
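For reference, max_len and vocab_size (used later in the Embedding layer) are not shown above; they were defined beforehand roughly like this (the 95th-percentile choice for max_len is just an example, my actual value may differ):

import numpy as np

# vocabulary size = number of distinct tokens + 1 for the padding index 0
vocab_size = len(tokenizer.word_index) + 1
# sequence length cap derived from the training sequence lengths
max_len = int(np.percentile([len(seq) for seq in X_train_sequences], 95))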
while the language variable was encoded using one-hot encoding:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
language_encoder = OneHotEncoder(handle_unknown="infrequent_if_exist")
X_train_lang_encoded = language_encoder.fit_transform(X_train_lang.values.reshape(-1, 1))
X_test_lang_encoded = language_encoder.transform(X_test_lang.values.reshape(-1, 1))
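Note that OneHotEncoder returns a scipy sparse matrix by default, not a dense NumPy array; a quick check of what the encoded language feature looks like at this point:

print(type(X_train_lang_encoded))   # scipy sparse matrix by default
print(X_train_lang_encoded.shape)   # (n_train_samples, n_languages)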
My target classes are not balanced, so I performed oversampling using SMOTE:
from imblearn.over_sampling import SMOTE
oversampler = SMOTE(random_state=seed, k_neighbors=3)
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train_padded, y_train)
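At this point the row counts diverge, which is what breaks the two-input setup later; a quick check (assuming the variables above):

# SMOTE only saw the padded text, so only the text input grew
print(X_train_padded.shape[0])        # original number of training rows
print(X_train_resampled.shape[0])     # larger after oversampling the minority classes
print(X_train_lang_encoded.shape[0])  # still the original number of rows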
And then I built my NN model as follows:
from tensorflow.keras.layers import Dense, LSTM, Embedding, Input, Dropout, Conv1D, MaxPooling1D, Flatten, GlobalMaxPooling1D, SimpleRNN, Bidirectional, Concatenate
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
input_text = Input(shape=(max_len,), dtype="int32", name="input_text")
embedding = Embedding(input_dim=vocab_size, output_dim=128, input_length=max_len)(input_text)
lstm = Bidirectional(LSTM(64, return_sequences=False))(embedding)
dropout = Dropout(0.5)(lstm)
input_lang = Input(shape=(X_train_lang_encoded.shape[1],), dtype='float32', name='input_lang')
concat = Concatenate()([dropout, input_lang])
output = Dense(len(label_encoder.classes_), activation="softmax")(concat)
model = Model(inputs=[input_text, input_lang], outputs=output)
optimizer = Adam(learning_rate=1e-3)
loss = SparseCategoricalCrossentropy(from_logits=False)
metric = SparseCategoricalAccuracy("accuracy")
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
history = model.fit(
    [X_train_resampled, X_train_lang_encoded], y_train_resampled,
    validation_data=([X_test_padded, X_test_lang_encoded], y_test),
    epochs=10,
    batch_size=16
)
However, combining the resampled text data with the encoded language data failed, because after SMOTE they no longer have the same number of rows.
The following solutions were suggested by ChatGPT but they didn’t work:
1- oversample the language feature to match the resampled text data:
X_train_lang_resampled = np.repeat(
    X_train_lang_encoded,
    oversampler.sample_indices_.shape[0] // X_train_lang_encoded.shape[0],
    axis=0
)
which gave this error:
'SMOTE' object has no attribute 'sample_indices_'
2- combine the text and language features, pass them to SMOTE together, then re-split them:
X_train_combined = np.hstack((X_train_padded, X_train_lang_encoded))
X_train_resampled_combined, y_train_resampled = oversampler.fit_resample(X_train_combined, y_train)
X_train_resampled = X_train_resampled_combined[:, :max_len]
X_train_lang_resampled = X_train_resampled_combined[:, max_len:]
which gave this error:
all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
Can you please help me solve the issue so I can pass both the resampled text and the language as two inputs to my NN model?