Unable to filter out rows from a huge dataset in PySpark
I have a huge PySpark dataframe with about 1.5B rows, including a column `fieldA`. I also have a list of 8.8M unique `fieldA` values that I want to filter out of those 1.5B rows. However, presumably due to the large data size, I keep getting errors like `StackOverflowError` or `OutOfMemoryError`.
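For illustration, here is a minimal sketch of the kind of filter I mean (assuming an `isin()`-based approach; `values_to_remove` and the input path are placeholders, not my actual code):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder path for the ~1.5B-row dataset
df = spark.read.parquet("/path/to/huge_dataset")

# Placeholder: in reality this is a Python list of ~8.8M unique fieldA values
values_to_remove = ["a", "b", "c"]

# isin() inlines every literal value into the query plan; with millions of
# values this can overwhelm the driver, which may be why I'm seeing
# StackOverflowError / OutOfMemoryError
filtered = df.filter(~F.col("fieldA").isin(values_to_remove))
```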