I have a dataset with 41 features, 4 of which are text features. For those four features I've been given pre-computed "Bag of Words" sparse matrices (saved as .npz files), which I combined with the remaining numerical features to train an SVM model. There are 100,000 records in total.
The model has now been training for 45 minutes :). Is there a way to decrease training time? I used the linear kernel because I know non-linear kernels on 100k rows would take a century to finish. Is there anything wrong with how I pre-process the dataset, particularly how I combine the .npz matrices with the existing numerical features? Are there any other options I could explore?
import numpy as np
from scipy.sparse import load_npz
from sklearn.svm import SVC

# load the pre-computed bag-of-words matrices (scipy sparse, stored as .npz)
title_feature = load_npz('train_title_bow.npz')
overview_feature = load_npz('train_overview_bow.npz')
tagline_feature = load_npz('train_tagline_bow.npz')
production_companies_feature = load_npz('train_production_companies_bow.npz')
# keep every column except the raw text fields and the columns not used as features
numerical_features = df_train[df_train.columns.difference(
    ['title', 'overview', 'tagline', 'production_companies',
     'rate_category', 'average_rate', 'original_language'])]
# densify each sparse BOW matrix and stack them side by side
text_features = np.hstack([title_feature.toarray(), overview_feature.toarray(),
                           tagline_feature.toarray(), production_companies_feature.toarray()])
# combine the numerical features and the dense text features into one training matrix
svm_X_train = np.hstack([numerical_features, text_features])
svm_y_train = df_train['rate_category']
svm_classifier = SVC(kernel='linear')  # linear kernel, since non-linear kernels would be even slower here
svm_classifier.fit(svm_X_train, svm_y_train)
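Would something along these lines be a better direction: keeping the bag-of-words matrices sparse with scipy.sparse.hstack and switching to LinearSVC? This is just a rough sketch I put together, untested on my data, and the names sparse_X_train and linear_svm are placeholders of my own:

from scipy.sparse import hstack as sparse_hstack, csr_matrix
from sklearn.svm import LinearSVC

# keep the BOW matrices sparse instead of calling .toarray();
# the numeric block is converted to sparse so everything can be stacked together
sparse_X_train = sparse_hstack([
    csr_matrix(numerical_features.values),
    title_feature,
    overview_feature,
    tagline_feature,
    production_companies_feature,
]).tocsr()

# LinearSVC (liblinear) accepts sparse input and usually trains much faster
# than SVC(kernel='linear') on data of this size (defaults here, not tuned)
linear_svm = LinearSVC()
linear_svm.fit(sparse_X_train, svm_y_train)

My thinking is that avoiding the dense conversion should also cut memory use, but I'm not sure whether that reasoning holds or whether the fit itself is the real bottleneck.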