I have a Python function that uses pandas to run a series of operations on some DataFrames. This function currently consumes a lot of RAM. I have tried to minimize memory usage as much as possible, but so far I have not been successful.
My data is ~1.5 GB in memory (all DataFrames summed), but the peak RAM usage of the Python process running this function is ~15 GB.
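For reference, this is roughly how I obtained those two numbers; the `report_memory` helper below is just an illustration (I call it on the four inputs before running the function), not part of the actual code:

```python
import resource

import pandas as pd


def report_memory(*dfs: pd.DataFrame) -> None:
    # Sum of the in-memory sizes of the inputs (the ~1.5 GB figure);
    # deep=True also counts the contents of object/string columns.
    total_bytes = sum(df.memory_usage(deep=True).sum() for df in dfs)
    print(f"inputs: {total_bytes / 1e9:.2f} GB")

    # Peak RSS of the current process so far (the ~15 GB figure).
    # ru_maxrss is in KiB on Linux (bytes on macOS); resource is Unix-only.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"peak RSS: {peak_kib / 1e6:.2f} GB")
```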
Here is the current code of the function (df1, df2, df3, df4 are my function’s inputs):
```python
return (
    df1
    # keep only the rows of interest before merging
    .query("`col1` == 'val1'")
    # enrich with the three lookup tables
    .merge(df2, on="col2", how="left")
    .merge(df3, on="col3", how="left")
    .merge(df4, on="col4", how="left")
    # keep one "best" row per (col2, col3) pair
    .sort_values(by=["col1", "col2", "col3"], ascending=False)
    .drop_duplicates(["col2", "col3"])
)
```
I have already tried the following:
- Reducing the memory footprint by setting optimal dtypes on my pandas DataFrames (roughly as sketched below)
- Reducing the memory footprint by filtering out unnecessary columns
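Concretely, the dtype and column optimizations I applied look roughly like this (the `shrink` helper, column selection, and the 0.5 cardinality threshold are illustrative, not my real schema):

```python
import pandas as pd


def shrink(df: pd.DataFrame, keep: list[str]) -> pd.DataFrame:
    """Illustrative pre-processing: keep only needed columns and downcast dtypes."""
    df = df[keep].copy()
    for col in df.select_dtypes(include="integer"):
        # downcast 64-bit integers to the smallest integer type that fits
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include="float"):
        df[col] = pd.to_numeric(df[col], downcast="float")
    for col in df.select_dtypes(include="object"):
        # convert low-cardinality string columns to categories
        if df[col].nunique() / max(len(df), 1) < 0.5:
            df[col] = df[col].astype("category")
    return df
```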
Any further ideas on what could be optimized? What am I doing wrong?