How to loop over Spark Row data?
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.Row.html
Spark SQL LeftSemi join: the right side is skewed, but the AQE rule OptimizeSkewedJoin does not work?
I have a Spark SQL query running on Spark 3.2. The query uses a sort-merge join (SMJ) with join type LeftSemi, but the right side is skewed.
Remote file change detection in S3 while running S3 queries [apache-spark]
Can someone help us understand Spark's behavior in the scenarios listed below?
spark sql error UNRESOLVED_COLUMN on aggregate with group by after createOrReplaceTempView on same thread
The error is “A column or function parameter with name elevation cannot be resolved.”
Table being broadcasted in YARN but not in K8s
I am running the same queries in Spark on YARN and Spark on K8s. Both K8s and YARN refer to the same Hive metastore and HDFS path. When I run the job on YARN, a certain table is broadcast in a join, but the same does not happen on K8s. The broadcast threshold is the same in both environments, the table is the same, and broadcast is enabled in both, yet the plan differs between YARN and K8s.
How to remove duplicates if a specific value exists
My SparkSQL DataFrame looks like this: