I have been experiencing quite a few problems with Spark and Kinesis, and wanted to clarify my understanding of it. Say I have the following logic:
Subscribe to stream
Event arrives and RDD is built
Transformation
Transformation
Transformation
ForEachBatch
Action
It seems that when I do the ACTION, Spark goes back to Kinesis and asks for the actual batch of records using lazy evaluation, potentially from multiple executors. I would expect the mechanism to be that an RDD is created in memory earlier in the process.
This kind of breaks my mental model of streaming as it seems Kinesis is telling us an event is ready, then we have to go back and ask for it when the action is processed.
A few specific questions:
-
Can this model not result in multiple executors and tasks connecting to Kinesis in parallel to pull batches, meaning that 1 inbound batch turns into tens or hundreds of calls back into Kinesis from different executors?
-
The connection back to Kinesis takes up to 5 seconds to instantiate. Rather than do it once, this model seems to imply that our executors repeatedly need to connect back to Kinesis to process actions. This also seems inefficient and slow.
Any guidance what is happening here or sources of documentation? Thanks in advance!