Related Content

Tag Archive for apache-spark, apache-spark-sql

Shut down an Apache Spark batch job programmatically

My use case is to start a Spark job from Dolphin Scheduler or Airflow. The job reads data from an Apache Pulsar stream for a fixed timespan, processes the records, and then shuts the application down completely.
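A minimal sketch of that shutdown pattern, assuming the StreamNative pulsar-spark connector is on the classpath (option names follow its docs and may vary by version); the broker URL, topic, sink path, checkpoint path, and timeout below are all illustrative:

import org.apache.spark.sql.SparkSession

object BoundedPulsarJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bounded-pulsar-job")
      .getOrCreate()

    val stream = spark.readStream
      .format("pulsar")
      .option("service.url", "pulsar://localhost:6650") // assumed broker URL
      .option("topics", "events")                       // assumed topic name
      .load()

    val query = stream.writeStream
      .format("parquet")
      .option("path", "/tmp/out")                // assumed sink path
      .option("checkpointLocation", "/tmp/ckpt") // assumed checkpoint path
      .start()

    // Run for a fixed timespan, then shut everything down so the
    // scheduler (Dolphin Scheduler / Airflow) sees the job finish.
    query.awaitTermination(15 * 60 * 1000L) // blocks up to 15 minutes, returns false on timeout
    query.stop()  // stop the streaming query
    spark.stop()  // release the SparkSession and cluster resources
    sys.exit(0)   // ensure the driver JVM actually exits
  }
}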

Catalyst rule returns the wrong LogicalPlan

def apply(plan: LogicalPlan): LogicalPlan = {
  plan transform {
    case unresolvedRelation: UnresolvedRelation =>
      // Split "schema.table" into its two parts.
      val tblSchemaName: Array[String] = unresolvedRelation.tableName.split("\\.")
      if (tblSchemaName.length == 1) return plan
      val schema = tblSchemaName(0)
      val tblName = tblSchemaName(1)
      for (ref <- this.refs) {
        if (tblName == ref.nqName) {
          // Rewrite the relation with an upper-cased schema.table identifier.
          return unresolvedRelation.copy(
            multipartIdentifier = Seq(schema.toUpperCase, tblName.toUpperCase),
            options = unresolvedRelation.options,
            isStreaming = unresolvedRelation.isStreaming)
        }
      }
      unresolvedRelation
    case unresolvedWith: UnresolvedWith => […]
  }
}

Do you still need to cache() before checkpoint()?

Going off the docs and other posts online, you should call cache() before checkpoint(), because the checkpoint is written afterwards by a separate action. However, looking at the Spark query plan, this doesn't seem to be true for what I'm doing:
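For context, a minimal RDD-level sketch of the pattern the docs describe; the checkpoint directory and the workload are illustrative:

import org.apache.spark.sql.SparkSession

object CacheBeforeCheckpoint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ckpt-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/ckpt-demo") // assumed checkpoint directory

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2) // stand-in for real work
    rdd.cache()      // without this, the lineage is recomputed by the checkpoint job
    rdd.checkpoint() // only marks the RDD; the actual write happens on the next action
    println(rdd.count()) // first action: computes, caches, and writes the checkpoint

    // The debug string now shows a CheckpointRDD instead of the full lineage.
    println(rdd.toDebugString)
    spark.stop()
  }
}

Note that Dataset.checkpoint() is eager by default, so it materializes immediately rather than waiting for a later action, which can make the query plan look different from the lazy RDD behavior above.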

Spark DataFrame on JSON vs. RDD

We are using Spark to process a large volume (petabytes) of JSON events with no fixed schema. We are wondering whether to use DataFrames with schema inference or to use RDDs.
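For comparison, a minimal sketch of both approaches; the input paths and the samplingRatio value are illustrative:

import org.apache.spark.sql.SparkSession

object JsonDfVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-demo")
      .master("local[*]")
      .getOrCreate()

    // Option 1: DataFrame with schema inference. Spark scans the input to
    // infer a schema; samplingRatio caps that cost on very large inputs.
    val df = spark.read
      .option("samplingRatio", "0.01")  // infer from ~1% of records (assumed trade-off)
      .json("/data/events/*.json")      // assumed input path
    df.printSchema()

    // Option 2: RDD of raw strings, parsed with a JSON library of your choice.
    // No inference pass, but you lose Catalyst optimization and columnar IO.
    val rdd = spark.sparkContext.textFile("/data/events/*.json")
    val parsed = rdd.map(line => line.length) // stand-in for real per-record parsing
    println(parsed.take(1).mkString)

    spark.stop()
  }
}

The usual trade-off: schema inference costs an extra scan (reducible via samplingRatio or by supplying an explicit schema), while the RDD route skips that scan but gives up the optimizer entirely.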