My AWS Glue PySpark script joins several tables and writes the result to S3 with the code below:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

sink = glueContext.getSink(
    connection_type="s3",
    path="s3://bucket1234/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="db1", catalogTableName="table1")
sink.writeFrame(df1)
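For context, df1 is the result of the joins. A minimal sketch of how it is built, using placeholder table and column names rather than my real schema:

from awsglue.dynamicframe import DynamicFrame

# Placeholder source tables; the real script joins several of these.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="db1", table_name="orders"
).toDF()
customers = glueContext.create_dynamic_frame.from_catalog(
    database="db1", table_name="customers"
).toDF()

joined = orders.join(customers, on="customer_id", how="left")

# writeFrame() takes a DynamicFrame, so the joined DataFrame is converted back.
# The year/month/day partition columns already exist on the joined data.
df1 = DynamicFrame.fromDF(joined, glueContext, "df1")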
When I checked the logs, I noticed the error below being printed repeatedly for different partitions, and the job keeps running forever without completing.
INFO [Thread-16] sinks.HadoopDataSink (DataSink.scala:$anonfun$forwardPotentialDynamicFrameToCatalog$23(327)): Error creating partitions: {PartitionValues: [2019_03_07],ErrorDetail: {ErrorCode: AlreadyExistsException,ErrorMessage: Partition already exists.}}
The output bucket was empty before the run, so there was no existing data in it, and db1.table1 did not exist in the Glue Data Catalog.
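For reference, the pre-run state can be verified along these lines; a quick boto3 sketch using the same bucket, database, and table names as above:

import boto3

# The bucket had no objects before the run.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket1234")
print(resp.get("KeyCount", 0))  # should print 0 for an empty bucket

# The table did not exist in the Data Catalog before the run.
glue = boto3.client("glue")
try:
    glue.get_table(DatabaseName="db1", Name="table1")
except glue.exceptions.EntityNotFoundException:
    print("db1.table1 does not exist")

How can I fix this?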