I have a text dataset (~42,000 samples) and I am performing sentiment analysis on it. I have encoded the text with scikit-learn's CountVectorizer class. There are 6 classes. When I run the following script, “Fitting 5 folds for each of 2 candidates, totaling 10 fits” is printed in my terminal and my CPU usage temporarily increases, but no further progress is shown. (Note: I have tried this with scikit-learn's wine dataset and it works as expected, printing its progress as it goes.)
Notably, I have also tried converting my data into NumPy arrays, matching the required format of (n_samples, n_features) for X and (n_samples,) for y.
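To make those shapes concrete, here is a minimal pure-Python sketch of what a bag-of-words encoding produces. It is only a stand-in for CountVectorizer, not its implementation; the toy `texts`, `labels`, and the helper `bag_of_words` are hypothetical names used for illustration.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, rows): one count row per text, one column per distinct token."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    column = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok, count in Counter(toks).items():
            row[column[tok]] = count
        rows.append(row)
    return vocabulary, rows


# Hypothetical toy data standing in for tweet_text and cyberbullying_type.
texts = ["you are mean", "have a nice day", "mean mean words"]
labels = ["a", "b", "a"]

vocabulary, X_demo = bag_of_words(texts)
n_samples, n_features = len(X_demo), len(vocabulary)

# X has shape (n_samples, n_features); y has one label per sample.
print((n_samples, n_features), len(labels))
```

The number of columns equals the vocabulary size, so on ~42,000 tweets X can be far wider than in this toy example.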
The program is unbearably slow for my current settings and I need help figuring out what’s causing it.
Here is the code:
import time

import sklearn.model_selection
import torch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score
from xgboost import XGBClassifier

# df: the DataFrame holding the tweet_text and cyberbullying_type columns (loaded elsewhere)
X_train, X_val, y_train, y_val = sklearn.model_selection.train_test_split(df.tweet_text, df.cyberbullying_type, test_size=0.2, random_state=115)

# Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)

parameters = {
    'booster': ['gbtree', 'dart'],
    'learning_rate': [0.01],
    'n_estimators': [100, 200, 300, 400, 500, 600],
}

def tune_hyperparameters(base_model, parameters, n_iter, kfold, X, Y, X_val=None, Y_val=None, SEED=115):
    start_time = time.time()
    # Split the data into k folds (note: KFold itself does not stratify by class)
    k = KFold(kfold)
    optimal_model = RandomizedSearchCV(
        base_model,
        param_distributions=parameters,
        n_iter=n_iter,
        cv=k,
        random_state=SEED,
        scoring='accuracy',
        n_jobs=1,
        verbose=2,
        error_score='raise'
    )
    optimal_model.fit(X, Y)  # , eval_set=zip(X_val, Y_val))
    stop_time = time.time()
    scores = cross_val_score(optimal_model, X, Y, cv=k, scoring="accuracy")
    return optimal_model

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'The device in use is: {device}')
base_model = XGBClassifier(objective='multi:softmax', num_class=6, device=device, n_jobs=None, verbosity=1, random_state=115)

# fit model
model = tune_hyperparameters(base_model, parameters, n_iter=2, kfold=5, X=X_train, Y=y_train)  # , X_val=X_val, Y_val=Y_val)
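For reference, the amount of work this script requests can be estimated with a rough sketch like the one below. The numbers are copied from the code above; it assumes RandomizedSearchCV's default `refit=True`, and that XGBoost builds one tree per class per boosting round for multi-class objectives. The variable names are only illustrative.

```python
from itertools import product

# Values copied from the script above.
parameters = {
    "booster": ["gbtree", "dart"],
    "learning_rate": [0.01],
    "n_estimators": [100, 200, 300, 400, 500, 600],
}
n_iter = 2     # candidates sampled by RandomizedSearchCV
kfold = 5      # folds used by the search and by cross_val_score
num_class = 6

# Size of the full grid that the random search samples from.
grid_size = len(list(product(*parameters.values())))

# One search: n_iter candidates x kfold folds, plus one refit of the best
# candidate on all the data (assumes the default refit=True).
fits_per_search = n_iter * kfold + 1

# cross_val_score is given the search object itself, so the whole search is
# cloned and re-run once per outer fold.
fits_in_cross_val = kfold * fits_per_search

total_fits = fits_per_search + fits_in_cross_val

# Assumption: one tree per class per boosting round for multi-class objectives.
max_trees_per_fit = max(parameters["n_estimators"]) * num_class

print(grid_size, fits_per_search, fits_in_cross_val, total_fits, max_trees_per_fit)
```

Under these assumptions the search itself accounts for 11 model fits, and the cross_val_score call inside tune_hyperparameters adds 55 more, each of which may build up to 3600 trees.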
Please let me know if there is any additional information that I should provide. Thanks.