I am using the Pandas API on Spark (`pyspark.pandas`) for a data-preprocessing script that was originally written in plain Pandas. I am seeing that the date operations are very slow, and some are not supported at all. For example, I cannot do `df[time_col] + pd.Timedelta(1, unit='D')`; instead I had to write `df[time_col].apply(lambda x: x + timedelta(days=1))`.
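For reference, here is a minimal reproduction of what I mean (the column name `time_col` and the data are just illustrative, and I assume an active Spark session):

```python
import pandas as pd
import pyspark.pandas as ps
from datetime import timedelta

# toy frame; "time_col" stands in for my real timestamp column
df = ps.DataFrame({"time_col": pd.to_datetime(["2024-01-01", "2024-01-02"])})

# this works in plain pandas but fails for me in pandas-on-Spark:
# df["time_col"] + pd.Timedelta(1, unit="D")

# the workaround I ended up with, which is noticeably slow:
df["time_col"] = df["time_col"].apply(lambda x: x + timedelta(days=1))
```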
Is there another way to do `date_add`-style operations? And why would pandas-on-Spark be slow under the hood?
I have tried the equivalent PySpark code using the `INTERVAL` expression, and it runs fast.
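This is roughly the PySpark version that works fast for me (again, column and variable names are illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [("2024-01-01",), ("2024-01-02",)], ["time_col"]
).withColumn("time_col", F.to_timestamp("time_col"))

# stays in the JVM as a native Spark SQL expression
sdf = sdf.withColumn("next_day", F.expr("time_col + INTERVAL 1 DAY"))

# for date (not timestamp) columns, date_add also works:
# sdf = sdf.withColumn("next_day", F.date_add("time_col", 1))
```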