I’m working on an Apache Beam pipeline that processes data and writes it to BigQuery. The pipeline works perfectly with the DirectRunner, but when I switch to the DataflowRunner it completes without any errors or warnings, yet no rows are inserted into BigQuery. I also see large leftover files in the temporary directory of my Cloud Storage bucket (gs://my-bucket/temp/bq_load/...), and no data ever appears in the target table.
Here’s the pipeline structure:
import apache_beam as beam

worker_options.sdk_container_image = '...'

with beam.Pipeline(options=pipeline_options) as p:
    processed_data = (
        p
        | "ReadFiles" >> beam.Create(FILE_LIST)
        | "ProcessFiles" >> beam.ParDo(ProcessAvroFileDoFn())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table=f"{PROJECT_ID}:{DATASET_ID}.{TABLE_ID}",
            schema=BQ_SCHEMA,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
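For context, the pipeline options are set up roughly like this (a sketch only; the region and bucket paths below are placeholders, not my exact configuration):

from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
    WorkerOptions,
)

# Sketch of the option setup -- project, region, and bucket values are placeholders.
pipeline_options = PipelineOptions()
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.region = 'us-central1'  # placeholder region
google_cloud_options.temp_location = 'gs://my-bucket/temp'
google_cloud_options.staging_location = 'gs://my-bucket/staging'
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = pipeline_options.view_as(WorkerOptions)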
Key Observations:
- The pipeline succeeds with the DirectRunner, writing data to BigQuery without any issues.
- With the DataflowRunner, the pipeline completes without errors or warnings, but:
  - No rows are written to BigQuery.
  - Large temporary files remain in the bucket (e.g., bq_load/…).
- The data being processed is valid NDJSON.
- The BigQuery schema matches the data structure.
What I’ve Tried:
- Inspecting the leftover temp files: I downloaded one of the temp files and verified that it contains valid NDJSON rows. Manually loading that file into BigQuery with the bq load command works fine (roughly equivalent to the Python sketch after this list).
- Testing with other datasets: I tried many different inputs, but the issue persists.
- Checking Dataflow logs: I looked at the logs in the Dataflow Monitoring Console but found no errors or warnings.
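For reference, the manual load I did is roughly equivalent to this (a sketch using the google-cloud-bigquery client instead of the bq CLI; the source URI is a placeholder for one of the leftover temp files):

from google.cloud import bigquery

client = bigquery.Client(project=PROJECT_ID)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
# Loading one of the leftover temp files directly succeeds, so the file
# contents and the table schema appear to be compatible.
load_job = client.load_table_from_uri(
    'gs://my-bucket/temp/bq_load/some-leftover-file',  # placeholder path
    f'{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}',
    job_config=job_config,
)
load_job.result()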
I saw one other thread about this (Can’t make apache beam write outputs to bigquery when using DataflowRunner), but it was never resolved.