As part of our production we generate fairly large xlsx files, around 1000 columns and 200k rows.
We have noticed that older versions of pandas/openpyxl are considerably more memory and time efficient at producing these files.
Using:
openpyxl=3.0.7
pandas=1.2.4
the runtime is roughly half an hour, the output file is around 100 MB, and RAM usage is around 4 GB.
Using:
openpyxl=3.1.4
pandas=2.1.4
the runtime is about 2 hours, the output file is around 400 MB, and RAM usage climbs to almost 16 GB, which is all the memory available.
I’ll do some more experimentation to find out whether this is down to pandas or openpyxl, but I was wondering if anyone knows what’s going on here.
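
For anyone who wants to reproduce this, here is a rough sketch of how the two libraries could be separated: time the same DataFrame once through pandas' `to_excel` and once through openpyxl's write-only mode directly, in each environment. The helper names (`measure`, `make_frame`, `write_with_pandas`, `write_with_openpyxl`) and the random float data are assumptions for illustration, not our production code, and `tracemalloc` only sees Python-level allocations, so the peak figure is indicative rather than the full process RSS.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak


def make_frame(rows, cols):
    # Hypothetical stand-in data: random floats, not the real production content.
    import numpy as np
    import pandas as pd
    data = np.random.default_rng(0).random((rows, cols))
    return pd.DataFrame(data, columns=[f"c{i}" for i in range(cols)])


def write_with_pandas(df, path):
    # Path under suspicion: pandas' Excel writer on top of openpyxl.
    df.to_excel(path, engine="openpyxl", index=False)


def write_with_openpyxl(df, path):
    # Bypasses pandas' writer: streams rows through openpyxl's write-only mode.
    from openpyxl import Workbook
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    ws.append(list(df.columns))
    for row in df.itertuples(index=False, name=None):
        ws.append(row)
    wb.save(path)
```

Start with a scaled-down frame, for example `df = make_frame(20_000, 100)`, then compare `measure(write_with_pandas, df, "pandas.xlsx")` against `measure(write_with_openpyxl, df, "openpyxl.xlsx")` under both version pairs. If only the pandas path regresses between environments, the change is more likely in pandas' writer. If both regress, openpyxl itself is the more likely cause.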