I have a Dataflow pipeline (in Java) which reads messages from a Kafka topic and writes row updates to CloudSQL (Postgres).
The pipeline doesn't perform any aggregations, and the transformation steps take 4-5 milliseconds per element. According to the metrics, reading a record from the Kafka topic takes 400-500 ms, but Dataflow reports a processing latency of 8-10 seconds. I can't figure out where the time is spent. How can I achieve sub-5-second latency?
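For context, the read side of the pipeline is shaped roughly like this (the broker address, topic name, and deserializers below are simplified placeholders, and the real transforms are omitted):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaToCloudSqlPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read key/value pairs from the Kafka topic; Kafka metadata is dropped.
    PCollection<KV<String, String>> records =
        pipeline.apply(
            "ReadFromKafka",
            KafkaIO.<String, String>read()
                .withBootstrapServers("broker-1:9092")   // placeholder
                .withTopic("row-updates")                // placeholder
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata());

    // Lightweight per-element transformations go here (4-5 ms each),
    // followed by the JdbcIO.write() step sketched further below.

    pipeline.run();
  }
}
```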
For the JDBC write I use batching and auto-sharding (JdbcIO.write().withBatchSize().withAutoSharding()). Connection pooling is also enabled, and the CloudSQL instance has enough resources.
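Here is a simplified sketch of that write step (the JDBC URL, credentials, table, columns, and the upsert statement below are placeholders, not my real ones):

```java
import java.sql.PreparedStatement;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CloudSqlWrite {
  /** Batched, auto-sharded JdbcIO write; URL, credentials, table and columns are placeholders. */
  static void applyWrite(PCollection<KV<String, String>> rows) {
    rows.apply(
        "WriteToCloudSql",
        JdbcIO.<KV<String, String>>write()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver",
                        "jdbc:postgresql://10.0.0.5:5432/mydb") // placeholder
                    .withUsername("dataflow")                   // placeholder
                    .withPassword("secret"))                    // placeholder
            .withStatement(
                "INSERT INTO row_updates (id, payload) VALUES (?, ?) "
                    + "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")
            .withPreparedStatementSetter(
                (KV<String, String> element, PreparedStatement stmt) -> {
                  stmt.setString(1, element.getKey());
                  stmt.setString(2, element.getValue());
                })
            .withBatchSize(1000)   // records buffered per executeBatch() call
            .withAutoSharding());  // let the runner choose the number of write shards
  }
}
```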
I also tried to measure the time between a particular Kafka record being produced and its data being stored in the database table. For 95% of messages, it takes 9-11 seconds to read the record from the Kafka topic and store the row in the DB. The data freshness and latency metrics in Dataflow show 8-9 seconds.
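To give an idea of the kind of check I'm doing on the Beam side, a pass-through DoFn such as the simplified sketch below records how old each element already is when it reaches the write step (the class name and metric name are just for illustration, not my exact code):

```java
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

/** Pass-through step that records how old each element is when it reaches this point. */
public class MeasureLagFn extends DoFn<KV<String, String>, KV<String, String>> {
  private final Distribution lagMs =
      Metrics.distribution(MeasureLagFn.class, "kafka_to_write_lag_ms"); // arbitrary metric name

  @ProcessElement
  public void processElement(ProcessContext c) {
    // The element timestamp is assigned by KafkaIO (record/log-append time by default).
    long lag = Instant.now().getMillis() - c.timestamp().getMillis();
    lagMs.update(lag); // shows up as a custom metric on the Dataflow job page
    c.output(c.element());
  }
}
```

It is applied with `ParDo.of(new MeasureLagFn())` immediately before the JdbcIO write, so the reported distribution reflects everything up to (but not including) the database insert.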