I have a Spark Structured Streaming job that reads from Kafka, parses Avro, explodes a column, computes some extra columns as simple combinations (sum/product/division) of existing columns, and writes the result to a Delta table. There are no windows or state, and I am not using foreachBatch.
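For reference, the pipeline is roughly shaped like the sketch below; the broker, topic, Avro schema and column names are placeholders rather than my real job, and it assumes the spark-avro package for from_avro:

```scala
// Rough shape of the job; broker, topic, schema and column names are placeholders.
// from_avro comes from the spark-avro package (org.apache.spark:spark-avro_2.12).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("kafka-to-delta").getOrCreate()

// Avro schema of the Kafka value: a record holding an array of items.
val avroSchema =
  """{"type":"record","name":"Envelope","fields":[
    |  {"name":"items","type":{"type":"array","items":{
    |    "type":"record","name":"Item","fields":[
    |      {"name":"price","type":"double"},
    |      {"name":"qty","type":"double"}
    |    ]}}}
    |]}""".stripMargin

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .load()
  .select(from_avro(col("value"), avroSchema).as("payload"))   // parse Avro
  .select(explode(col("payload.items")).as("item"))            // explode a column
  .select(col("item.price").as("price"), col("item.qty").as("qty"))
  .withColumn("total", col("price") * col("qty"))              // simple derived columns
  .withColumn("sum", col("price") + col("qty"))
  .withColumn("ratio", col("price") / col("qty"))

// Plain append to a Delta table, no windows, no state, no foreachBatch.
parsed.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-to-delta")
  .start("/tmp/delta/kafka-to-delta")
  .awaitTermination()
```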
I am struggling to get the latency down, and in the logs I see statements like
24/05/20 19:09:53 INFO CodeGenerator: Generated method too long to be JIT compiled: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$ is 43612 bytes
24/05/20 19:09:53 INFO CodeGenerator: Code generated in 249.483881 ms
and from the timestamps they seem to correspond to the micro-batches. After enabling debug logging I can see that it is in fact the same code that gets generated for every micro-batch.
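For completeness, this is roughly how I turned on the extra codegen logging (assuming Spark 3.3+ with log4j2; an equivalent entry in log4j2.properties would do the same):

```scala
// Raise the codegen package to DEBUG so the generated source shows up in the logs.
// Programmatic log4j2 call; an equivalent entry in log4j2.properties works too.
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

Configurator.setLevel("org.apache.spark.sql.catalyst.expressions.codegen", Level.DEBUG)
```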
This raised the following questions:
- Is this expected, or is it a sign that something is wrong?
- If it is expected, what is the rationale? The transformations I want are the same for every batch, so why regenerate and recompile the code instead of just reusing the old code on the new data?
- I found this cache; is it supposed to help here? (See the config sketch below.)
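For that last question, the only knob I have found so far is the setting sketched below. It looks internal and static to me, so the name and default are my reading of the Spark source rather than documented behaviour:

```scala
import org.apache.spark.sql.SparkSession

// Assumed knob: spark.sql.codegen.cache.maxEntries (internal, static, default 100
// as far as I can tell). Being static, it has to be set before the session exists,
// e.g. via --conf on spark-submit or on the session builder as below.
val spark = SparkSession.builder()
  .appName("kafka-to-delta")
  .config("spark.sql.codegen.cache.maxEntries", "500")
  .getOrCreate()
```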