Suppose I want to drop duplicates or perform an aggregation on 3 columns in my Spark DataFrame.
Would it be more efficient to first hash the columns and deduplicate on the hash:
from pyspark.sql import functions as f

# hash the 3 key columns (nulls replaced by an empty string) and dedupe on the hash
df = df.withColumn(
    "hash_dup",
    f.hash(
        f.coalesce(f.col("id_1"), f.lit("")),
        f.coalesce(f.col("id_2"), f.lit("")),
        f.coalesce(f.col("id_3"), f.lit(""))
    )
).dropDuplicates(["hash_dup"])
or to drop the duplicates directly on the three columns:
df = df.dropDuplicates(["id_1","id_2","id_3"])
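For context, here is a minimal sketch of how I could compare the two variants; the SparkSession setup and the toy data/column values are made up for illustration, and I'm only looking at the physical plans with explain():

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# toy data with a duplicate row and a null in one key column (made up)
df = spark.createDataFrame(
    [("1", "a", None), ("1", "a", None), ("2", "b", "x")],
    ["id_1", "id_2", "id_3"],
)

# variant 1: dedupe on a hash of the coalesced key columns
hashed = df.withColumn(
    "hash_dup",
    f.hash(
        f.coalesce(f.col("id_1"), f.lit("")),
        f.coalesce(f.col("id_2"), f.lit("")),
        f.coalesce(f.col("id_3"), f.lit("")),
    ),
).dropDuplicates(["hash_dup"])

# variant 2: dedupe directly on the three columns
direct = df.dropDuplicates(["id_1", "id_2", "id_3"])

# compare the physical plans produced for each variant
hashed.explain()
direct.explain()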
Thanks.