Tag Archive for python, pandas, pyspark, out-of-memory

Unable to filter rows out of a huge DataFrame in PySpark

I have a huge PySpark DataFrame with 1.5B rows, including a column fieldA. I have a list of 8.8M unique fieldA values that I want to filter out of the 1.5B rows. However, likely due to the sheer data size, I keep getting errors such as StackOverflowError or OutOfMemoryError.
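
A common cause of StackOverflowError here is passing millions of values to isin(), which builds an enormous expression tree in the query plan. A minimal sketch of an alternative, assuming the big DataFrame is called df and the 8.8M values live in a Python list called values (both names are placeholders, not from the original question): put the values into their own DataFrame and use a left anti join, which keeps only the rows of df whose fieldA does not appear in the value list.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical placeholders: `df` is the 1.5B-row DataFrame,
# `values` is the Python list of 8.8M fieldA values to remove.

# Put the filter values into their own single-column DataFrame instead of
# a giant isin() list; isin() with millions of literals inflates the
# expression tree and can trigger StackOverflowError.
values_df = spark.createDataFrame([(v,) for v in values], ["fieldA"])

# left_anti keeps only rows of df whose fieldA does NOT match values_df.
filtered = df.join(values_df, on="fieldA", how="left_anti")
```

The anti join lets Spark distribute the comparison across the cluster instead of materializing one huge predicate on the driver; if the value list came from a file, reading it directly with spark.read (rather than collecting it into a Python list first) would reduce driver memory pressure further.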