I have a Spark DataFrame with multiple columns that are complex structs. I am trying to transform the value of a field in one struct column based on the value of a field in another struct column. I am working with Spark 3.5, so I am using the `withField` function. The transformation:
```python
from pyspark.sql import functions as F

df = df.withColumn(
    "Column_1",
    F.when(
        F.col("Column_2.field_2").isNotNull(),
        F.col("Column_1").withField("field_1", F.col("Column_2.field_2") * F.lit(1000000)),
    ),
)
```
I am trying to update the value of `Column_1.field_1` based on the value of `Column_2.field_2`.
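
For context, here is a minimal stand-in for the schema (the field names match the snippet above, but the types and sample values are assumptions; the real structs are much wider):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two struct columns; Column_2.field_2 is nullable, which is what the
# isNotNull() guard in the transformation above checks for.
df = spark.createDataFrame(
    [((1.0,), (2.5,)), ((3.0,), (None,))],
    "Column_1 struct<field_1:double>, Column_2 struct<field_2:double>",
)
```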
When I try to display/write the DataFrame after the transformation, it takes a long time and the cluster driver crashes with the following error:

```
java.lang.OutOfMemoryError: GC overhead limit exceeded
```
Is this not the correct way to use `withField`? It does not give me any error during planning, and `df.explain()` does produce output, although it takes significantly longer to do so (50-60 seconds, compared to a second or two for other chains of transformations).
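
For reference, the timings above come from calls like these (`show` here is just a stand-in for any action that materializes the result; `display`/`write` behave the same way):

```python
df.explain()  # plans without errors, but takes ~50-60 s to print the plan
df.show()     # runs for a long time, then the driver crashes with:
              # java.lang.OutOfMemoryError: GC overhead limit exceeded
```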