I have a Pandas dataframe with over 1,000 rows. It contains numbers, objects, strings, and Boolean values. I want to convert each "cell" of the dataframe to a vector and work with the resulting vectors. I then plan to compare the rows of vectors for similarity.
For example, my data is:
Col 0,Col 1,Col 2,Col 3,Col 4,Col 5,Col 6,Col 7,Col 8,Col 9,Col 10
12,65e1e35b7fe333,harry Joe,1,FALSE,swe,1,142.158.0.2,10.10.0.2,text1,0
13,65e1e35b7fe599,allen,1,FALSE,swe,1,142.158.0.20,10.10.0.20,text2,0
14,65e1e35b7fe165,carter,1,FALSE,swe,1,142.158.0.21,10.10.0.21,text3,0
I want to end up with a dataframe of vectors that looks like:
Col 0,Col 1,Col 2,Col 3,Col 4,Col 5,Col 6,Col 7,Col 8,Col 9,Col 10
Vect1,Vect2,Vect3,Vect4,Vect5,Vect6,Vect7,Vect8,Vect9,Vect10,Vect11
Vect12,Vect13,Vect14,Vect4,Vect5,Vect6,Vect7,Vect15,Vect16,Vect17,Vect11
Vect18,Vect19,Vect20,Vect4,Vect5,Vect6,Vect7,Vect21,Vect22,Vect23,Vect11
Is there a good way to do this in Python, perhaps with scikit-learn?
I have tried:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
df=mydata
vectorizer = TfidfVectorizer()
# Transform the data to feature vectors
X = vectorizer.fit_transform(df)
X = pd.DataFrame(_X.todense(), index=df.index, columns=vectorizer.vocabulary_)
X.head()
# Labels
y = df['label']
What I got was:
TypeError Traceback (most recent call last)
Cell In[11], line 14
11 vectorizer = TfidfVectorizer()
13 # Transform the text data to feature vectors
---> 14 X = vectorizer.fit_transform(df)
16 X = pd.DataFrame(_X.todense(), index=df.index, columns=vectorizer.vocabulary_)
17 X.head()
File /anaconda/envs/xxx_py38/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:2079, in TfidfVectorizer.fit_transform(self, raw_documents, y)
2072 self._check_params()
2073 self._tfidf = TfidfTransformer(
2074 norm=self.norm,
2075 use_idf=self.use_idf,
2076 smooth_idf=self.smooth_idf,
2077 sublinear_tf=self.sublinear_tf,
2078 )
-> 2079 X = super().fit_transform(raw_documents)
2080 self._tfidf.fit(X)
2081 # X is already a transformed view of raw_documents so
2082 # we set copy to False
File /anaconda/envs/xxx_py38/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:1338, in CountVectorizer.fit_transform(self, raw_documents, y)
1330 warnings.warn(
1331 "Upper case characters found in"
1332 " vocabulary while 'lowercase'"
1333 " is True. These entries will not"
1334 " be matched with any documents"
1335 )
1336 break
-> 1338 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
1340 if self.binary:
1341 X.data.fill(1)
File /anaconda/envs/xxx_py38/lib/python3.9/site-packages/sklearn/feature_extraction/text.py:1207, in CountVectorizer._count_vocab(self, raw_documents, fixed_vocab)
1205 values = _make_int_array()
1206 indptr.append(0)
-> 1207 for doc in raw_documents:
1208 feature_counter = {}
1209 for feature in analyze(doc):
TypeError: 'Data' object is not iterable