The following URLs are for two DeltaTables ake_original and ake_updates:
- ake_original
- ake_updates
Each DeltaTable has two columns: - AKE_ID PRIMARY KEY pa.String()
- MODIFIED_DATE pa.timestamp(‘us’)
Details:
- First DeltaTable ‘ake_original’ has 59,767 rows. for all the data inserted on 07 June 2024 from 00:00:00 to 23:59:59.
- Second DeltaTable ‘ake_updates’ has 6,262 rows. representing some of the data inserted on 07 June 2024 from 00:00:00 to 23:59:59 but updated on 08 June 2024 from 00:00:00 to 23:59:59.
Using below Python code to merge updated rows into the original rows using unique column AKE_ID resulting in rows duplication.
All the 6,262 AKE_ID values of ake_updates DeltaTable already exist in the ake_original DeltaTable.It inserts/duplicates 4,276 and updates 1,986 rows.
What is the reason for data duplication ?
from deltalake import DeltaTable, write_deltalake
import pyarrow.dataset as ds
import pandas as pd
deltaTable = DeltaTable('ake_original')
dataset_update = ds.dataset('ake_updates')
df = dataset_update.to_table().to_pandas()
(
deltaTable.merge(
source=df,
predicate="s.AKE_ID = t.AKE_ID",
source_alias="s",
target_alias="t",
)
.when_matched_update_all()
.when_not_matched_insert_all()
.execute()
)
print(deltaTable.history())