We are using Spark to process petabytes of event data. The events are JSON with no fixed schema. We are wondering whether we should use DataFrames with schema inference, or RDDs.
We know that a DataFrame needs a schema, and that Spark will scan the data once to infer the schema and a second time to load the data into memory using that schema. Because of this, would an RDD be faster? I assume no schema is needed and each record only has to be parsed with Jackson.
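For context, here is a minimal sketch of the DataFrame path we are considering. The input path and field names are hypothetical; `samplingRatio` is a real JSON reader option that limits how much data the inference scan reads:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val spark = SparkSession.builder.appName("json-ingest").getOrCreate()

// Schema inference: Spark scans the data once to infer the schema,
// then a second time to actually load it (the double pass above).
// samplingRatio restricts the inference pass to a fraction of the input.
val inferred = spark.read
  .option("samplingRatio", "0.01")
  .json("s3://bucket/events/") // hypothetical path

// Supplying an explicit schema skips the inference pass entirely,
// at the cost of fixing up front which fields we care about.
val schema = StructType(Seq(
  StructField("eventType", StringType),
  StructField("userId", StringType),
  StructField("timestamp", LongType)
))
val explicit = spark.read.schema(schema).json("s3://bucket/events/")
```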
Later, with the RDD, we can still do filtering, joining, etc. We don't necessarily need to use SQL. A sketch of what we have in mind follows.
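This is the RDD alternative, parsing each line with Jackson; the `eventType` and `userId` fields are hypothetical stand-ins for whatever our events actually contain:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Load each line as raw text: no schema pass, each record parsed once.
val lines = spark.sparkContext.textFile("s3://bucket/events/") // hypothetical path

val events = lines.mapPartitions { iter =>
  // One ObjectMapper per partition; it is expensive to build per record.
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  iter.flatMap { line =>
    try Some(mapper.readValue(line, classOf[Map[String, Any]]))
    catch { case _: Exception => None } // drop malformed records
  }
}

// Plain RDD transformations, no SQL involved.
val clicks = events.filter(_.get("eventType").contains("click"))
val byUser = clicks.keyBy(_.getOrElse("userId", "").toString)
```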
Thank you.