I’m currently struggling with Spark checkpoints and trying to understand the difference between DataFrame and RDD checkpoints.
To make it clear: after DataFrame.checkpoint(), Spark creates a file on HDFS at a path like hdfs:///sdfasd-dfasdf-dfasdf/rdd-2, so it saves the RDD underlying the DataFrame. The same kind of file is written when checkpointing an RDD.
So, the main question: do these files differ? And, if they are the same, will Spark use that checkpoint for the DataFrame when we only called RDD.checkpoint()?
I also have a side question: RDD has a getCheckpointFile() method that returns the full path of the checkpoint file. Is there any way to get the same for a DataFrame? There is no out-of-the-box method for it.