I'm working with a large Dask DataFrame (data) and I need to filter rows based on a specific ID ('id12') in the column named 'ID'. Additionally, I want to select only two columns ('col1' and 'col2') from the filtered DataFrame. Here's the approach I'm currently using:
data_filtered = data.query("ID == 'id12'")[['col1', 'col2']].reset_index(drop=True).compute()
Where:
data is my Dask DataFrame.
'ID' is the column containing the IDs.
'id12' is the specific ID I'm filtering for.
'col1' and 'col2' are the columns I want to select after filtering.
I'm using query() to filter rows on the ID, then selecting the specific columns with [['col1', 'col2']], and finally calling compute() to bring the result into memory.
Issue:
Performance Concerns: This approach works, but I'm concerned about its efficiency; it takes around 30 seconds to execute.
Any insights or alternative approaches to optimize performance and memory usage would be greatly appreciated. Thank you!