I’ve got a Spark Structured Streaming job that reads data from Kafka and writes it to S3 (an on-prem NetApp StorageGRID appliance) as an Apache Iceberg table (via a Nessie catalog).
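For context, the write path looks roughly like this; it's a minimal sketch, and the catalog name, endpoints, bucket, topic and table names below are placeholders rather than my exact configuration:

```python
from pyspark.sql import SparkSession

# Sketch of the streaming write path. All names/URLs below are placeholders.
spark = (
    SparkSession.builder
    .appName("measurements-stream")
    # Iceberg catalog backed by Nessie, data on the S3-compatible StorageGRID endpoint
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3://s3bucket")
    .config("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.nessie.s3.endpoint", "https://storagegrid.example.com")
    .getOrCreate()
)

# Read measurement events from Kafka ...
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "measurements")
    .load()
)

# ... and append them to the Iceberg table (created beforehand, partitioned by
# days(timestamp)). The real job parses the Kafka value into measurement columns first.
query = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3a://s3bucket/checkpoints/measurements")
    .toTable("nessie.db.measurements")
)
query.awaitTermination()
```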
Afterwards I access the table via Dremio, which worked in the beginning, but now it seems that some Parquet files are corrupt – or at least Dremio is not able to read them.
I am getting the following error in Dremio:
IOException: /s3bucket/measurements_7b67d322-277d-4e26-bb8a-c2c833423ca6/data/timestamp_day=2024-06-28/00120-6-b193ded0-6b70-4506-bf77-be6da1863b21-00001.parquet is not a Parquet file. expected magic number [80, 65, 82, 49] at tail, but found [118, -33, 125, -73]
Querying the data via Spark and Starburst/Trino works fine. Downloading the mentioned Parquet file and showing its content with parquet-tools works as well.
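The magic-number complaint can also be verified directly on the downloaded copy: a valid Parquet file has to end with the four bytes `PAR1`, which is easy to check with a few lines of Python (the file name is the one from the error message, saved locally):

```python
import os

# A valid Parquet file must end with the 4-byte magic "PAR1" ([80, 65, 82, 49]),
# which is exactly what Dremio says is missing. This checks the downloaded copy.
path = "00120-6-b193ded0-6b70-4506-bf77-be6da1863b21-00001.parquet"  # local download

with open(path, "rb") as f:
    f.seek(-4, os.SEEK_END)   # jump to the last four bytes of the file
    tail = f.read(4)

print(list(tail), tail)       # expect [80, 65, 82, 49] b'PAR1' if the footer is intact
```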
In the past I used Spark to fully rewrite the table once, and Dremio was then able to query the data again, but after some time I got the mentioned error again. So it seems that Spark writes the Iceberg table in a way that leads to such errors.
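For reference, such a rewrite can be done with Iceberg's `rewrite_data_files` Spark procedure; a sketch with placeholder catalog and table names:

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession configured with the Nessie catalog as in the first sketch.
spark = SparkSession.builder.getOrCreate()

# Rewrite/compact the table's data files via Iceberg's rewrite_data_files procedure
# (catalog "nessie" and table "db.measurements" are placeholders).
spark.sql("CALL nessie.system.rewrite_data_files(table => 'db.measurements')").show()
```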
I could imagine the streaming job crashing during writes, as I have a very low-performance and unstable demo environment here, but since Iceberg commits are ACID transactions I would rather expect a rollback than the invalid files being visible to consumers.
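One way to test that expectation would be to check whether the file Dremio complains about is even referenced by the current table snapshot, e.g. via Iceberg's `files` metadata table (placeholder names again):

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession configured with the Nessie catalog as in the first sketch.
spark = SparkSession.builder.getOrCreate()

# If the suspect file is NOT listed in the current snapshot's data files, it is
# an orphan left over from an aborted write, and Dremio should not need it when
# querying the table. (Catalog and table names are placeholders.)
suspect = "00120-6-b193ded0-6b70-4506-bf77-be6da1863b21-00001.parquet"

files = spark.sql("SELECT file_path FROM nessie.db.measurements.files")
files.filter(files.file_path.contains(suspect)).show(truncate=False)
```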
Does anyone have an idea how to fix this issue, or even better: what might be the root cause?