I am writing an ETL job in AWS Glue using PySpark.
I am reading data from S3 and loading it into a DataFrame. After manipulating the DataFrame, I want to update the data in a SQL table.
Since I have a lot of data to update (around 100,000 rows), I was figuring out ways to do the update operation.
The limitation is that df.write.jdbc does not provide any mode for updates; it only provides modes for insertion (e.g. append and overwrite).
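For context, here is roughly what the job looks like today; the S3 path, file format, JDBC URL, table name and credentials below are all placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Placeholder connection details, for illustration only
jdbc_url = "jdbc:mysql://my-db-host:3306/mydb"
table_name = "main_table"
connection_properties = {"user": "db_user", "password": "db_password"}

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read the source data from S3 into a DataFrame (placeholder path and format)
df = spark.read.parquet("s3://my-bucket/input/")

# ... transformations on df ...

# The problem: write.jdbc only offers insert-style modes ("append", "overwrite", ...),
# there is no mode that updates existing rows in place.
df.write.jdbc(url=jdbc_url, table=table_name, mode="append", properties=connection_properties)
```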
A few approaches I have considered are:
- Insert the data into a separate staging table and then update the main table with an inner join against it (a sketch of the update step is shown just after this list). The staging insert itself would be:
df.write.jdbc(url=jdbc_url, table="main_table_staging", mode="append", properties=connection_properties)
- Create batches of rows and use multi-threading to run the update query for each batch.
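For the first approach, once the rows are staged, the update on the database side would be a single set-based statement, roughly like this (the key column id, the columns col1/col2, the MySQL-style UPDATE ... JOIN syntax, and the pymysql driver are all assumptions about my target database):

```python
import pymysql  # assuming a MySQL target; any DB-API driver packaged with the Glue job would do

# Same placeholder connection details as in the sketch above
db_host, db_name = "my-db-host", "mydb"
db_user, db_password = "db_user", "db_password"

# Apply the staged rows to the main table with one set-based UPDATE ... JOIN
update_sql = """
    UPDATE main_table m
    JOIN main_table_staging s ON m.id = s.id
    SET m.col1 = s.col1,
        m.col2 = s.col2
"""

conn = pymysql.connect(host=db_host, user=db_user, password=db_password, database=db_name)
try:
    with conn.cursor() as cur:
        cur.execute(update_sql)
    conn.commit()
finally:
    conn.close()
```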
Both approaches have cons. With the first one, a separate staging table would have to be created for each job, since I have multiple jobs.
With the second one, I would have to handle the multithreading and batching myself, which complicates the job (roughly the sketch shown below).
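To illustrate, the second approach would look roughly like this (same placeholder names and MySQL/pymysql assumptions as above), and this is the threading and batching logic I would prefer not to maintain:

```python
from concurrent.futures import ThreadPoolExecutor
import pymysql  # same assumption: MySQL target, DB-API driver packaged with the job

def run_batch(batch):
    # One connection per batch of (col1, col2, id) parameter tuples
    conn = pymysql.connect(host=db_host, user=db_user, password=db_password, database=db_name)
    try:
        with conn.cursor() as cur:
            cur.executemany(
                "UPDATE main_table SET col1 = %s, col2 = %s WHERE id = %s",
                batch,
            )
        conn.commit()
    finally:
        conn.close()

# Around 100,000 rows is small enough to collect on the driver and split into batches
rows = [(r["col1"], r["col2"], r["id"]) for r in df.collect()]
batch_size = 5000
batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_batch, batches))
```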
Is there a better approach for doing a bulk update on a SQL table from a Glue job?