I am facing a problem with an Azure Synapse Notebook. I have a large Python script that uses a pandas DataFrame. I can load a parquet file with Spark, but I cannot convert it to pandas with toPandas(), because it throws the error: 'org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow'.
The data is not huge: about 7 million rows and 80 columns.
I have tried:

- Adding more resources to the cluster, which did not help.
- Increasing the Kryo buffer size:

  ```python
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("YourAppName")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryoserializer.buffer.max", "512m")
      .getOrCreate()
  )
  ```

  But this did not work either.
- Reducing the number of columns, which also did not help.

The only thing that worked was reducing the number of rows, but that is not an option: I need all the records.
I also need to keep the script as it is (I cannot rewrite the pandas code into PySpark).
I hope someone has an idea of how to fix this.
Thanks