I am trying to parse data from XML files and load it into a Delta table in Azure Databricks using a Python (PySpark) script.
We receive the files in XML format and parse them into a structured DataFrame.
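For context, the read/write flow is roughly the sketch below. This is only an illustration of the kind of code involved, not the exact script: the file path, rowTag value, and table name are placeholders, and it assumes the spark-xml reader is installed on the cluster.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the raw XML files into a DataFrame.
    # "record" and the mount path are placeholders for the real values.
    raw_df = (
        spark.read.format("xml")
        .option("rowTag", "record")
        .load("/mnt/raw/xml_files/")
    )

    # (Column selection / flattening of nested structs happens here.)
    parsed_df = raw_df

    # Write the parsed DataFrame to the target Delta table.
    # This write step is where the serialization error is raised.
    (
        parsed_df.write.format("delta")
        .mode("append")
        .saveAsTable("my_schema.my_xml_table")
    )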
While writing the DataFrame to the table, the job fails with the following serialization error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 35.0 failed 4 times, most recent failure: Lost task 13.3 in stage 35.0 (TID 901) (10.139.64.13 executor 1): org.apache.spark.api.python.PythonException: 'ValueError: can not serialize object larger than 2G'.
Below is the cluster configuration I am using for this job:
Driver: Standard_D64s_v3 · Workers: Standard_D64s_v3 · 1-8 workers · DBR: 13.3 LTS (includes Apache Spark 3.4.1, Scala 2.12)
I have tried multiple cluster configurations and still get the same error.
Any idea how to resolve this serialization issue?
Thanks in advance!