From what I’ve found, I need to use a SparkSubmitOperator to submit my PySpark script. But what if I want to assign the extract, transform, and load parts of my Spark job to different tasks in my Airflow DAG, so that the DAG reads:

start_etl >> createsession >> extract >> transform >> load >> end_etl

instead of:

start_etl >> etl >> end_etl
My current approach, which doesn’t work, is to define a function for each step (creating the SparkSession, extract, transform, load) and turn each one into a task with PythonOperator.
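Roughly, this is the shape of what I’m attempting (a minimal sketch with placeholder stage functions, assuming Airflow 2.4+ imports and arguments; the SparkSession clearly can’t be handed between tasks this way, since each PythonOperator runs in its own process):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator
from pyspark.sql import SparkSession


def create_session():
    # Each PythonOperator task runs in its own process, so this session
    # cannot actually be shared with the downstream tasks.
    return SparkSession.builder.appName("etl").getOrCreate()


def extract():
    ...  # placeholder: read the source data


def transform():
    ...  # placeholder: apply transformations


def load():
    ...  # placeholder: write the result


with DAG("etl", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    start_etl = EmptyOperator(task_id="start_etl")
    end_etl = EmptyOperator(task_id="end_etl")

    createsession = PythonOperator(task_id="createsession", python_callable=create_session)
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    start_etl >> createsession >> extract_task >> transform_task >> load_task >> end_etl
```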
Can the DAG be set up this way? If it isn’t possible, is this a feature coming to Airflow in the future? It would be very useful to know where an ETL pipeline fails if it does fail, and the orchestration would be an added bonus.
That’s not proper design. Spark, not Airflow, should be responsible for logging the execution of your Spark job.

Simply create a single Spark job to run on the cluster and a single Airflow task that submits it.
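A minimal sketch of that layout, assuming the apache-airflow-providers-apache-spark package is installed, a spark_default connection points at your cluster, and /path/to/etl_job.py is a placeholder for your PySpark script:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("spark_etl", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    # One task submits the whole PySpark application; extract, transform and load
    # stay inside the Spark job, where Spark logs their execution.
    etl = SparkSubmitOperator(
        task_id="etl",
        application="/path/to/etl_job.py",  # placeholder: your PySpark ETL script
        conn_id="spark_default",
    )
```

If the job fails, the driver logs and the Spark UI show which stage failed; Airflow only needs to know whether the submitted job succeeded.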