Hi community,
I am experimenting with Delta Lake as the primary storage solution for my tabular data, which is updated daily. I tried to mimic the basic use case: an existing target table is updated with new data that can change existing rows, i.e. upserts. I am using a MERGE INTO operation, where the target is my Delta table and the table with updates is simply saved as a Parquet file.
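For reference, the merge looks roughly like the sketch below (paths, the location of the update file, and the id_hash column name are placeholders rather than my exact code):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session with the Delta Lake extensions enabled
spark = (
    SparkSession.builder
    .appName("delta-merge-test")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Target: the existing Delta table; source: the daily updates saved as plain Parquet
target = DeltaTable.forPath(spark, "/data/delta/main_table")   # placeholder path
updates = spark.read.parquet("/data/updates/latest")           # placeholder path

# Upsert: match on the string-hash identifier, update matched rows, insert new ones
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id_hash = u.id_hash")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```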
My tables are unusual in their number of columns: there are up to 10,000 columns, most of them binary. One column holds a row identifier, represented as a string hash, which is used in the matching condition of the merge operation.
I am experimenting with a small main table of 5,000 rows, which is a single 10 MB Parquet file on disk, and the same table stored as plain Parquet, which consists of several small 3.5 MB files.
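To give an idea of the shape of the data, a synthetic table like mine could be built roughly as follows (a simplified sketch; the column names, payloads, and path are illustrative, not my actual data):

```python
from pyspark.sql import functions as F

# `spark` is the same Delta-enabled SparkSession as in the snippet above
NUM_ROWS = 5_000
NUM_COLS = 10_000  # most of these columns hold binary payloads

# Wide set of binary columns with small dummy payloads
wide_cols = [
    F.lit("payload").cast("binary").alias(f"c{i}")
    for i in range(NUM_COLS)
]

df = (
    spark.range(NUM_ROWS)
    # String-hash identifier used as the merge key
    .select(F.sha2(F.col("id").cast("string"), 256).alias("id_hash"), *wide_cols)
)

df.write.format("delta").save("/data/delta/main_table")  # placeholder path
```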
My merge operation takes extremely long and appears to be stuck without doing any computation. What am I missing? I haven't tried partitioning, since the whole table is only 10 MB, and I expected this operation to be extremely fast even on non-optimized tables.
I would appreciate your help, thank you!