I’m trying to perform data analysis on a large number of quite large Parquet files. The analysis itself is relatively simple, but the obvious approach with e.g. pandas leads to nested for loops that slice the data into increasingly small pieces before extracting the values I need.
Pandas MWE:
import pandas as pd
import numpy as np
col1 = [1,5,8,6,4,5,3,8,5,8,9,6,4,5,6,7,2,89,6,65,5,8,5,8,65,6,4]
col2 = ['a','a','a','a','b','b','b','b','c','c','c','c','a','a','a','a','a','a','a','a','b','b','b','c','c','c','c']
col3 = [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2]
df = pd.DataFrame({'col1':col1, 'col2':col2, 'col3':col3})
for i in np.unique(df.col3):
    slice_1 = df[df.col3 == i]
    for j in np.unique(slice_1.col2):
        slice_2 = slice_1[slice_1.col2 == j]
        # last col1 value in each (col3, col2) group: prints 6, 8, 6, 65, 5, 4
        print(slice_2.col1.values[-1])
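To be explicit about the operation I want: it is the last col1 value per (col3, col2) group, which in pandas could also be written as:

# same six values as the loops above, without explicit iteration
print(df.groupby(['col3', 'col2'])['col1'].last())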
It seems that pyarrow should be ideal for this kind of task, but beyond the very minimal examples in the documentation I can’t find any guidance on how to actually use the package.
I can filter the Parquet file in a way equivalent to a pandas slice, but this still requires nested for loops.
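Roughly like this (a sketch: data.parquet is a placeholder for one of my files, with the same columns as the MWE, and pyarrow.compute needs a reasonably recent pyarrow):

import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table('data.parquet')  # placeholder path
for i in pc.unique(table['col3']).to_pylist():
    # build a boolean mask and filter, mirroring the pandas slicing above
    slice_1 = table.filter(pc.equal(table['col3'], i))
    for j in pc.unique(slice_1['col2']).to_pylist():
        slice_2 = slice_1.filter(pc.equal(slice_1['col2'], j))
        print(slice_2['col1'][-1].as_py())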
Is there a way of getting to the print() statement in my MWE without for loops?
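What I’d hope for is something along these lines, though I haven’t verified that a 'last' aggregation exists (and Table.group_by itself is only in newer pyarrow versions):

# sketch: group once, take the last col1 value per (col3, col2) group;
# use_threads=False because 'last' is order-sensitive
result = table.group_by(['col3', 'col2'], use_threads=False) \
              .aggregate([('col1', 'last')])
print(result.to_pandas())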
P.S. Importing the actual data with pyarrow.parquet.read_table(data) gives output of the form [['a','a','a','a',],['a','a','a','a','a','a','a']], which differs from the output of pyarrow.Table.from_pandas(df) on the MWE above. I don’t know why that is.
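My guess (unverified) is that this is chunking: read_table returns each column as a ChunkedArray with one chunk per row group of the file, while Table.from_pandas produces a single chunk. If so, combine_chunks() should flatten it:

import pyarrow.parquet as pq

table = pq.read_table('data.parquet')  # placeholder path, as above
print(table['col2'].num_chunks)  # one chunk per row group in the file
flat = table.combine_chunks()    # merge each column into a single chunk
print(flat['col2'].num_chunks)   # 1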