I have an ndarray called `data` that I need to one-hot encode as a preprocessing step. It looks something like this, but is much longer:
| ID | DevID | Colour | Hours |
| --- | --- | --- | --- |
| 1 | 2342 | Black | 2344 |
| 2 | 5645 | White | 234 |
| 3 | 5673 | Black | 952 |
| 4 | 2485 | White | 7542 |
The third column is the only categorical column in the table, so it is the only one that needs to be encoded.
To encode it, I tried this code snippet:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')
data_encoded = scipy.sparse.csr_matrix(ct.fit_transform(data)).toarray()
However, this produces the following error:
ValueError Traceback (most recent call last)
Cell In[187], line 2
1 ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')
----> 2 data_encoded = scipy.sparse.csr_matrix(ct.fit_transform(data)).toarray()
File ~\anaconda3\envs\it3102\Lib\site-packages\scipy\sparse\_compressed.py:88, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
86 raise ValueError(msg) from e
87 coo = self._coo_container(arg1, dtype=dtype)
---> 88 arrays = coo._coo_to_compressed(self._swap)
89 self.indptr, self.indices, self.data, self._shape = arrays
91 # Read matrix dimensions given, if any
File ~\anaconda3\envs\it3102\Lib\site-packages\scipy\sparse\_coo.py:366, in _coo_base._coo_to_compressed(self, swap)
363 indices = np.empty_like(minor, dtype=idx_dtype)
364 data = np.empty_like(self.data, dtype=self.dtype)
--> 366 coo_tocsr(M, N, nnz, major, minor, self.data, indptr, indices, data)
367 return indptr, indices, data, self.shape
ValueError: unsupported data types in input
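For anyone trying to reproduce this, here is a minimal sketch. It assumes that `data` is an object-dtype ndarray, which is what NumPy produces when strings and integers share one array. The sample rows are copied from the table above, and the variable names other than `data` and `ct` are made up for illustration. It only inspects what `ct.fit_transform(data)` returns before `scipy.sparse.csr_matrix` is called on it.

```python
import numpy as np
import scipy.sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Stand-in for the real `data` (assumption: strings and ints in one
# ndarray, so the array has dtype=object).
data = np.array(
    [
        [1, 2342, "Black", 2344],
        [2, 5645, "White", 234],
        [3, 5673, "Black", 952],
        [4, 2485, "White", 7542],
    ],
    dtype=object,
)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [2])],
    remainder="passthrough",
)
transformed = ct.fit_transform(data)

# Inspect the intermediate result instead of converting it straight away.
is_sparse = scipy.sparse.issparse(transformed)
result_dtype = transformed.dtype
print("input dtype:  ", data.dtype)
print("output sparse:", is_sparse)
print("output dtype: ", result_dtype)

# Attempt the same conversion as in the question and keep the exception.
try:
    scipy.sparse.csr_matrix(transformed)
    conversion_error = None
except (ValueError, TypeError) as exc:
    conversion_error = exc
print("csr_matrix error:", conversion_error)
```

If the output turns out to be a dense array with an object dtype, that would be consistent with the traceback, since the failure happens inside SciPy's sparse conversion rather than inside the encoder.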
Weirdly, the only way I’ve found to avoid this error is to encode column 4 in addition to column 3. This isn’t what I want, since column 4 isn’t categorical, but the code refuses to run without it:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2,3])], remainder='passthrough')
For some reason, only column 4 needs to be added and not the first two columns, even though they hold the same data types as column 4 in every row.
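One way to compare the two configurations is to look at what `fit_transform` returns in each case. The sketch below uses synthetic data (an assumption: 20 rows, object dtype, every value in the last column unique) and a hypothetical helper, `describe`, that reports whether the result is sparse, along with its dtype and shape. `ColumnTransformer` has a `sparse_threshold` parameter (default 0.3) that decides between sparse and dense output based on the overall density of the stacked result, so encoding a column with many distinct values can change which kind of output comes back.

```python
import numpy as np
import scipy.sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for `data`: ID, DevID, Colour, Hours (all hypothetical values).
rows = [
    [i + 1, 1000 + 7 * i, "Black" if i % 2 == 0 else "White", 100 + 13 * i]
    for i in range(20)
]
data = np.array(rows, dtype=object)


def describe(columns):
    """Fit a ColumnTransformer that one-hot encodes `columns` and report
    (is_sparse, dtype, shape) of the transformed output."""
    ct = ColumnTransformer(
        transformers=[("encoder", OneHotEncoder(), columns)],
        remainder="passthrough",
    )
    out = ct.fit_transform(data)
    return scipy.sparse.issparse(out), out.dtype, out.shape


for cols in ([2], [2, 3]):
    print(cols, describe(cols))
```

Whether this accounts for the behaviour on the real data depends on its size and on how many distinct values each column has, so treat it as something to check rather than a confirmed explanation.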
I’ve checked column 4 and confirmed that there are no NaNs or stray strings in any row, so I’m unsure why column 4 specifically has to be included in the encoded columns for the code to work.
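A check along these lines can be sketched as follows. The helper name `summarize_column` is hypothetical, and the sample array is a stand-in for the real `data`, assumed to have object dtype. It reports which Python types appear in a column, whether every value converts to float, and whether any NaN is present.

```python
import numpy as np

# Stand-in for the real `data` (assumption: object dtype, same layout as the table).
data = np.array(
    [
        [1, 2342, "Black", 2344],
        [2, 5645, "White", 234],
        [3, 5673, "Black", 952],
        [4, 2485, "White", 7542],
    ],
    dtype=object,
)


def summarize_column(column):
    """Return (type names found, converts to float?, contains NaN?).

    The NaN flag is None when the column cannot be converted to float.
    """
    type_names = {type(value).__name__ for value in column}
    try:
        as_float = np.asarray(column, dtype=float)
    except (ValueError, TypeError):
        return type_names, False, None
    return type_names, True, bool(np.isnan(as_float).any())


for index in range(data.shape[1]):
    print(index, summarize_column(data[:, index]))
```

Running the same helper on every column, not just column 4, makes it easier to see whether the numeric columns really do look identical to one another.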