Going off the docs and other posts online, you should cache() before checkpoint(), because checkpoint() is executed afterwards as a separate action, so without caching the transformations would be recomputed. However, looking at the Spark query plan, this doesn't seem to be true for what I'm doing:
<bunch of spark transformations - filters, joins, column renaming>
df_main = df_main.checkpoint()
<bunch of spark transformations - filters, joins, column renaming>
df_main.write.mode("overwrite").parquet('s3://<location>')
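For reference, my understanding of the pattern the docs and posts recommend is roughly the following (a minimal sketch, not my actual code; the checkpoint directory and bucket name are placeholders, and spark is the usual SparkSession):

spark.sparkContext.setCheckpointDir("s3://<bucket>/checkpoints")

df_main = df_main.cache()        # persist the pre-checkpoint result
df_main = df_main.checkpoint()   # eager by default, so this runs a job straight away;
                                 # without cache(), that job recomputes the whole lineage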
Looking at the query plan for the final parquet write job, you can see there's a Scan ExistingRDD, which is the result of the previous query, i.e. the transformations up to the checkpoint. I was thinking maybe this was some special quirk of doing a write, so I tried count() instead and got the same result.
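In case I'm reading the wrong thing, this is roughly how I'm inspecting the plan (df_main here is the DataFrame just before the write):

df_main.explain(True)  # extended mode: prints the parsed, analyzed, optimized and physical plans
# the physical plan starts from Scan ExistingRDD rather than the original source scans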
In a different, much longer Spark job, I tried the same thing, but this time the checkpoint DID re-evaluate the transformations, so now I'm even more confused about checkpoint() behaviour.
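For what it's worth, both jobs call checkpoint() the same way, with default arguments; I know there's an eager flag but I haven't touched it (sketch, not my exact code):

df_main = df_main.checkpoint()              # default eager=True: materializes immediately
# df_main = df_main.checkpoint(eager=False) # would defer materialization to the next action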
I want to understand exactly what's happening so that I know how, or whether, I should be using checkpoint() and cache() together. Am I looking at the wrong thing? Is my understanding incorrect?
Thanks