I have this pure Pandas statement that works on a small dataset:
grouped_dfs = {key: group.drop(columns=['country']) for key, group in df.groupby('country')}
Now, to manage very large CSV files, I am reading them in using Dask and trying to get a dict of grouped dfs like the one above. Is there a way to do exactly the same using Dask?
For now, I am converting the Dask dataframe to a pandas df (df = df.compute()) and proceeding to run my tasks. It sounds crazy, but my dataset is currently small enough that I can get by.
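Roughly, my current workaround looks like this (the file name and columns are just placeholders):

import dask.dataframe as dd

# read the large CSV lazily with Dask, then pull everything into pandas anyway
ddf = dd.read_csv('data.csv')
df = ddf.compute()

# same pure-pandas grouping as above
grouped_dfs = {key: group.drop(columns=['country']) for key, group in df.groupby('country')}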
I have read many posts and articles and couldn’t find much. Dask seems to require some sort of aggregation after grouping.
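For example, as far as I can tell the supported pattern is something like this (using the same ddf as above, with an illustrative 'value' column), which is not what I need:

# a groupby followed by an aggregation works fine in Dask...
totals = ddf.groupby('country')['value'].sum().compute()
# ...but there is no direct equivalent of iterating over the groups as separate dfs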
Any help is appreciated.
You should check out DuckDB. Read your CSV file into a pandas df, insert the df into DuckDB, and then you can perform queries like this:
import duckdb

# your_df is the in-scope pandas DataFrame; DuckDB can query it by name
query = "SELECT * FROM your_df WHERE country = ?;"
with duckdb.connect(local_file_path) as conn:
    result = conn.execute(query, ('US',)).fetchdf()
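If you want the df to actually live inside the DuckDB database file (the "insert the df" step above), a minimal sketch could look like this; the CSV path and the table name my_df are just placeholders:

import duckdb
import pandas as pd

your_df = pd.read_csv('data.csv')

with duckdb.connect(local_file_path) as conn:
    # materialize the DataFrame as a persistent table inside the database file
    conn.execute("CREATE TABLE my_df AS SELECT * FROM your_df")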
Or you can build the dict of grouped dfs like this:
con = duckdb.connect(local_file_path)

countries = con.execute("SELECT DISTINCT country FROM my_df").fetchall()
grouped_dfs = {}
for (country,) in countries:
    # one query per country, each returning a pandas DataFrame
    query = "SELECT value FROM my_df WHERE country = ?"
    grouped_dfs[country] = con.execute(query, (country,)).df()
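If you want to mirror the original pandas version exactly (every column except country), you could instead select everything and drop the grouping column in SQL, for example with DuckDB's star EXCLUDE syntax (again, my_df is just a placeholder table name):

query = "SELECT * EXCLUDE (country) FROM my_df WHERE country = ?"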