I have been testing Spark jobs with Hive tables backed by Google Cloud Storage (GCS) buckets.
I have tested two cases:

1. Spark SQL, something like:

   ```sql
   INSERT OVERWRITE TABLE output_table PARTITION (date_key, hour)
   SELECT some_columns FROM input_table
   ```

2. A Spark DataFrame write followed by a metastore update:

   ```python
   test.write.partitionBy("date_key", "hour").mode("overwrite").parquet("gs://my-bucket")
   spark.sql("ALTER TABLE output_table ADD PARTITION (...) LOCATION '...'")
   ```
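To make case 2 concrete, here is roughly the full pattern I mean (a sketch only; the table names, bucket path, and partition values are placeholders, and I am using `ADD PARTITION ... LOCATION` for the metastore update):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# "some_columns" stands in for the real projection
test = spark.table("input_table").select("some_columns", "date_key", "hour")

# 1) Write the partitioned Parquet files directly to the bucket
(test.write
     .partitionBy("date_key", "hour")
     .mode("overwrite")
     .parquet("gs://my-bucket/output_table"))

# 2) Register the new partition with the Hive metastore, pointing at
#    the directory that was just written (values are placeholders)
spark.sql("""
    ALTER TABLE output_table ADD IF NOT EXISTS
    PARTITION (date_key='2024-06-01', hour=0)
    LOCATION 'gs://my-bucket/output_table/date_key=2024-06-01/hour=0'
""")
```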
Since I am writing the output Parquet files to GCS, will there be any difference between the two approaches? In my tests, approach 2 has always been faster than approach 1. Also, will both of them write to a staging folder first and then rename into the final folder (a rename operation that is very slow on object stores like GCS)?
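For reference, this is roughly how I have been inspecting the commit-related settings (a sketch; it assumes the default Hadoop FileOutputCommitter is in play, and `_jsc` is the usual, if unofficial, handle to the Hadoop config from PySpark):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The Hadoop configuration controls the commit/rename behavior for gs:// output
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# v1 renames task output twice (task dir -> job dir -> final dir);
# v2 renames task output directly into the destination
print(hadoop_conf.get("mapreduce.fileoutputcommitter.algorithm.version"))

# "static" overwrite deletes all matching partitions up front;
# "dynamic" only overwrites the partitions actually being written
print(spark.conf.get("spark.sql.sources.partitionOverwriteMode"))
```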
P.S. I am using Spark 3.5.1 on Kubernetes with AQE enabled.