I am working on implementing CDC (Change Data Capture) which will use Azure Sql as the source and Azure Databricks delta lake as the target.
Azew Sql creates CDC tables with a very specific column names which include special characters: “__$start_lsn” is being one example.
I am having problems executing the following PySpark / Delta “MERGE” statement:
MERGE INTO poc_table_2 AS target
USING cdc_data AS source
ON target.last_processed_lsn = source.__$start_lsn
WHEN MATCHED AND source.__$operation = 1 THEN DELETE
WHEN MATCHED AND source.__$operation IN (3, 4) THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
In this context, the “source” refers to the Azure Sql cdc table, which I cannot change.
When I run this statement I get this error:
[PARSE_SYNTAX_ERROR] Syntax error at or near '$'. SQLSTATE: 42601
File <command-1987778677957359>, line 40
30 merge_query = """
31 MERGE INTO poc_table_2 AS target
32 USING cdc_data AS source
(...)
36 WHEN NOT MATCHED THEN INSERT *
37 """
39 # Execute the merge query
---> 40 spark.sql(merge_query)
42 # Update the last processed LSN
43 last_lsn = cdc_data.agg({"__$start_lsn": "max"}).collect()[0][0]
I have tried using single and double quotes around the column names, but it does not work. When I use quotes, I get a slightly different error:
[PARSE_SYNTAX_ERROR] Syntax error at or near '"__$start_lsn"'. SQLSTATE: 42601
How do I get around this issue?