I currently have a Beam pipeline running on Dataflow as a streaming job. It reads from a Kafka topic and uses a PeriodicImpulse to periodically reload a whitelist from a BigQuery table as a side input.
The problem is that the PeriodicImpulse transform causes two unintended behaviors:
- It prevents the pipeline from draining: upon a drain request, the PeriodicImpulse does not stop firing, so Dataflow thinks the job is still doing work and never sends the cancellation request for it.
- It prevents autoscaling from scaling down: for the same reason, the Dataflow runner thinks the worker running the PeriodicImpulse is always busy, so it never brings that worker down.
This has been acknowledged by both Google and the Beam project.
What alternatives are left for doing the same operation? I just need to run a query every hour and load the results as a side input. I've read about GenerateSequence, but honestly I can't find enough information about it in the Beam docs to tell how it differs from PeriodicImpulse.