I have trained a model using the Pipeline class from sklearn. Now I am trying to load the model and have it predict on a completely new and different set of answers, but I am getting this error:
ValueError: Found array with 0 sample(s) (shape=(0, 6665)) while a minimum of 1 is required by LinearSVC.
What I think is happening is that the new sentence I am trying to predict doesn’t contain any words that my vectorizer saw during training, so once everything is vectorized the SVM errors on an empty vector.
I may be wrong though, it’s just my first impression. Are there any workarounds or solutions to this?
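A quick way to test this hypothesis is to compare what a vectorizer returns for a sentence made only of unseen words with what it returns for no sentences at all. The sketch below uses a made-up two-sentence corpus (an assumption, not the real training data), so only the row counts are relevant:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real training data (illustrative only).
vectorizer = CountVectorizer()
vectorizer.fit(["the cat sat", "the dog ran"])

# One sentence of unseen words still produces one row; it is just all zeros.
unseen = vectorizer.transform(["zebra giraffe"])
print(unseen.shape)  # (1, 5)
print(unseen.nnz)    # 0 non-zero entries

# No input sentences at all produces zero rows.
empty = vectorizer.transform([])
print(empty.shape)   # (0, 5)
```

If this holds for the real model too, shape=(0, 6665) would point to an empty input being passed to predict rather than to a sentence with no known words.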
Here is how I set up and run the grid search:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ('vect', None),  # placeholder; the vectorizer is set by the parameter grid
    ('clf', None)    # placeholder; the classifier is set by the parameter grid
])
params = [
    {
        'vect': [CountVectorizer()],  # CountVectorizer option
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vect__binary': [True, False],
        'vect__min_df': [0.001, 0.0005, 1],
        'clf': [LinearSVC(dual="auto")]  # classifier option BernoulliNB()
    },
    {
        'vect': [TfidfVectorizer()],  # TfidfVectorizer option
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'vect__min_df': [0.001, 0.0005, 1],
        'clf': [LinearSVC(dual="auto")]
    }
]
grid_search = GridSearchCV(pipeline, params, scoring="roc_auc", verbose=3, error_score='raise', cv=3)
grid_search.fit(x_train, y_train)
The best parameters that come out are:
('vect', CountVectorizer(ngram_range=(1, 2))), ('clf', LinearSVC(dual='auto'))
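This is how I load the model and predict. The for loop is there because I am using a dictionary of DataFrames:
from pathlib import Path
import joblib

model_path = Path(__file__).parent / 'model.pkl'
model = joblib.load(model_path)
# self.data maps each sheet name to a DataFrame
for sheet in self.data:
    to_predict = self.data[sheet]["column"]
    self.data[sheet]["code"] = model.predict(to_predict)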
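One possible workaround for the loop above is to skip any sheet whose column has no rows left, since predict needs at least one sample. This is a minimal sketch: the tiny fitted pipeline and the `data` dictionary are stand-ins (assumptions) for the model loaded from model.pkl and for self.data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stand-in for the model loaded from model.pkl (illustrative only).
model = Pipeline([("vect", CountVectorizer()), ("clf", LinearSVC())])
model.fit(["good answer", "bad answer", "good reply", "bad reply"], [1, 0, 1, 0])

# Stand-in for self.data: one populated sheet and one with no rows.
data = {
    "sheet_a": pd.DataFrame({"column": ["good text", "bad text"]}),
    "sheet_b": pd.DataFrame({"column": pd.Series([], dtype="object")}),
}

skipped = []
for sheet, frame in data.items():
    to_predict = frame["column"]
    if to_predict.empty:
        # Nothing to predict; calling model.predict here would raise the ValueError.
        skipped.append(sheet)
        continue
    frame["code"] = model.predict(to_predict)

print("Skipped sheets:", skipped)
```

Printing the skipped sheet names also shows which sheets end up empty, which may help trace where the rows are lost.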
I also ran all_data.dropna(subset="column", inplace=True) before saving the DataFrame in the dictionary.
I have tried removing the min_df attribute from the CountVectorizer, but it didn’t change anything.
Edit: I just realized it seems a bit weird that I get the error, since the word count also changes. I preprocessed my training data; should I apply the same preprocessing to this new data?
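On the preprocessing question, one way to guarantee that training data and new data are treated identically is to make the preprocessing part of the pipeline, for example through the `preprocessor` argument of CountVectorizer. The sketch below is only an illustration: `clean_text` is a hypothetical function, since the actual preprocessing is not shown, and the training sentences are made up.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def clean_text(text):
    """Hypothetical preprocessing: lowercase and keep only letters and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# The vectorizer calls clean_text on every document in both fit and predict.
pipeline = Pipeline([
    ("vect", CountVectorizer(preprocessor=clean_text, ngram_range=(1, 2))),
    ("clf", LinearSVC()),
])
pipeline.fit(["Good answer!!", "BAD answer...", "good reply", "bad reply"], [1, 0, 1, 0])

vocab = pipeline.named_steps["vect"].vocabulary_
print(sorted(vocab))

# Raw, uncleaned text can be passed directly at prediction time.
predictions = pipeline.predict(["GOOD reply??"])
```

If the pipeline is saved with joblib, the function has to be importable under the same module path when the model is loaded, because pickling stores a reference to it rather than its code.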