I have a dataframe with columns whose values are Python lists. When I write this dataframe to Parquet and read it back, the values come back as NumPy arrays instead of lists. MWE:
import os
import pandas as pd

home_directory = os.path.join(rf"C:\Users\{os.getlogin()}")
df = pd.DataFrame({'my_column': [['a', 'b', 'c'], ['c', 'd', 'e']]})
print(type(df['my_column'].iloc[0]))  # type before the round trip
df.to_parquet(os.path.join(home_directory,'data'), engine='pyarrow')
df_from_parquet = pd.read_parquet(os.path.join(home_directory,'data'), engine='pyarrow')
print(type(df_from_parquet['my_column'].iloc[0]))  # type after the round trip
Output:
<class 'list'>
<class 'numpy.ndarray'>
Is there a way to maintain the list type?
Use engine='fastparquet' in your function calls.
Example:
df.to_parquet(os.path.join(home_directory,'data'), engine='fastparquet')
and
pd.read_parquet(os.path.join(home_directory,'data'), engine="fastparquet")
With this engine, the values should be read back as Python lists.
Make sure to install the fastparquet package first:
pip install fastparquet
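If you would rather keep engine='pyarrow', another option is to convert the arrays back to lists after reading. The snippet below is only a minimal sketch: to stay self-contained it does not touch a Parquet file, and instead builds a stand-in dataframe whose cells are NumPy arrays, on the assumption that this is the shape pd.read_parquet returned in the question.

```python
import numpy as np
import pandas as pd

# Stand-in (assumption) for the frame returned by
# pd.read_parquet(..., engine='pyarrow'): each cell is a numpy.ndarray.
df_from_parquet = pd.DataFrame(
    {'my_column': [np.array(['a', 'b', 'c']), np.array(['c', 'd', 'e'])]}
)

# Convert every ndarray cell back to a plain Python list.
df_from_parquet['my_column'] = df_from_parquet['my_column'].apply(
    lambda arr: arr.tolist()
)

print(type(df_from_parquet['my_column'].iloc[0]))
```

In real use you would apply the same conversion to each list-typed column right after pd.read_parquet.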
I hope this helps!