I’m running a Spark Structured Streaming application that reads from a Kinesis stream, applies some non-stateful transformations and then performs a dropDuplicates
operation using an id, but with no watermark. The state is being stored on RocksDB (Spark 3.5.1)
I’m trying to understand how state is stored:
- Where is the state related data stored? Memory, Disk or on my checkpoint location? Looking at the
memoryUsedBytes
metric, I see that the memory used to track state keeps growing and then at some point it goes down.
- When checking for duplicates, considering that not all keys are stored in memory, does RocksDB have to access a store that does not provide low latency? So far I haven’t had major issues with latency, could that happen if state gets too big?
- When restarting the streaming query, how does it recover previous state? If state is retrieved from my checkpoint location, does it mean that if it keeps getting bigger, recovering will start to become slower?
I tried looking for some more detailed articles about how RocksDB is used in Spark Streaming, but couldn’t find one