I have an AWS Glue job that merges data into an Apache Iceberg table partitioned by `product_id`.

What I'm trying to achieve is to run concurrent merge operations from AWS Glue jobs, each for a different `product_id`.
Example of the columns stored:

`product_id, name, ... , user_id`
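
For context, the table is defined roughly like this; a minimal sketch where the table name and column types are placeholders, with the key point being the partitioning by `product_id`:

```python
# Sketch of the table DDL (name and types are placeholders; the real
# table has more columns). Partitioned by product_id.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.default.my_iceberg_table (
        product_id STRING,
        name       STRING,
        user_id    STRING
    )
    USING iceberg
    PARTITIONED BY (product_id)
""")
```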
The merge join condition:

```sql
existing_data.product_id = '{here_product_id}'
AND new_data.product_id = existing_data.product_id
AND existing_data.user_id = new_data.user_id
```
The code that runs the merge:

```python
# new_data is the incoming DataFrame, registered as the temp view "td"
new_data.createOrReplaceTempView("td")

merge_sql = f"""
    MERGE INTO glue_catalog.default.{APACHE_ICEBERG_PREFIX}{target_path} existing_data
    USING td new_data
    ON existing_data.product_id = '{here_product_id}'
        AND new_data.product_id = existing_data.product_id
        AND existing_data.user_id = new_data.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
"""
spark.sql(merge_sql)
```
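
For completeness, the job configures the Iceberg catalog in the standard Glue Catalog way; a sketch assuming the usual setup, where the S3 warehouse path is a placeholder:

```python
from pyspark.sql import SparkSession

# Sketch of the Iceberg-on-Glue catalog wiring (standard configuration;
# the warehouse path below is a placeholder).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-bucket/warehouse/")
    .getOrCreate()
)
```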
As a result I discovered that even with partitioning, the merge takes a long time, and for concurrent executions on different `product_id` values the execution time increases.
For example:

| Jobs running | Rows merged per job | Execution times [s] |
|---|---|---|
| 1 | 5k | 395 |
| 4 | 5k | 450, 420, 423, 452 |
| 8 | 5k | 630, 628, 672, 677, 695, 613, 631, 641 |
| 1 | 10k | 432 |
| 4 | 10k | 628, 508, 597, 619 |
| 8 | 10k | 840, 861, 809, 846, 882, 876, 887, 861 |
I have also tried the Delta table format with the same problem: merge time increases a lot when there are parallel executions on the same table but for different `product_id` values.
Is it possible to speed up the merge time for a single job execution (395 s seems pretty long for a 5k-row merge)? Is it possible to speed up the merge time for parallel executions on different `product_id` values? For now it looks like the AWS Glue jobs lock/wait for each other during the merge operation.
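
In case it matters: Iceberg commits optimistically and retries on conflict, and that behavior is controlled by table properties. A sketch of where those knobs live, with illustrative values only (not tuning I have verified):

```python
# Iceberg commit-retry and MERGE isolation settings are table properties.
# The values below are illustrative, not verified tuning advice.
spark.sql(f"""
    ALTER TABLE glue_catalog.default.{APACHE_ICEBERG_PREFIX}{target_path}
    SET TBLPROPERTIES (
        'commit.retry.num-retries' = '10',          -- default is 4
        'commit.retry.min-wait-ms' = '100',
        'write.merge.isolation-level' = 'snapshot'  -- default is serializable
    )
""")
```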