I am working on a use case that needs to ingest a large volume of data (~10 million rows) from an Azure Databricks materialized view into an Azure SQL database. The database uses the elastic standard pool (eDTU 50) as its pricing tier. I have already implemented various optimization measures on the Databricks side, but the Spark job still makes no progress at all! This makes me wonder whether the bottleneck is actually the database rather than the Spark configuration. The write is implemented as follows:
try:
    # Bulk insert via the Apache Spark Connector for SQL Server
    (df.write
        .format("com.microsoft.sqlserver.jdbc.spark")
        .mode("overwrite")
        .option("url", sql_db_url)
        .option("dbtable", target_table)
        .option("user", username)
        .option("password", password)
        .option("batchsize", "100000")
        .option("tableLock", "true")
        .option("schemaCheckEnabled", "false")
        .option("reliabilityLevel", "BEST_EFFORT")
        .save())
    print("Successfully wrote data into the target SQL database")
except Exception as error:
    print("An exception occurred:", error)
(Whenever the insert runs from Databricks, the CPU utilization of the database hits 100%.)
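For reference, this is roughly how I check which resource dimension is saturated while the job runs. It is only a sketch: it reuses the same sql_db_url / username / password variables from the write above and queries the standard Azure SQL DMV sys.dm_db_resource_stats over plain Spark JDBC.

# Rough sketch: read recent resource stats from the target Azure SQL DB
# to see whether CPU, data IO, or log writes are the saturated dimension.
# sys.dm_db_resource_stats records one row roughly every 15 seconds.
resource_stats = (
    spark.read
        .format("jdbc")
        .option("url", sql_db_url)
        .option("query", """
            SELECT TOP (30)
                   end_time,
                   avg_cpu_percent,
                   avg_data_io_percent,
                   avg_log_write_percent
            FROM sys.dm_db_resource_stats
            ORDER BY end_time DESC
        """)
        .option("user", username)
        .option("password", password)
        .load()
)
resource_stats.show(truncate=False)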
Any advice would be appreciated.
I have already tried various optimization measures in Databricks as well as different cluster sizes, roughly along the lines sketched below.
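One example of such a measure was reducing the number of output partitions so that fewer concurrent bulk-insert sessions compete for the 50 eDTUs; the partition count and batch size below are only illustrative values, not the exact ones used.

# Illustrative sketch of one Databricks-side measure (values are examples only):
# fewer output partitions -> fewer concurrent bulk-insert sessions against the
# small eDTU pool; a smaller batch size reduces the work done per round trip.
df_small = df.coalesce(4)   # partition count chosen arbitrarily for illustration

(df_small.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("url", sql_db_url)
    .option("dbtable", target_table)
    .option("user", username)
    .option("password", password)
    .option("batchsize", "10000")          # example value; other sizes were tried as well
    .option("tableLock", "true")
    .option("reliabilityLevel", "BEST_EFFORT")
    .save())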