I’m struggling to set up a batch data processing architecture with Spark, Docker, and Kubernetes, as shown in the attached diagram. The examples I’ve found online are unclear and often don’t work as written.
Architecture Components:
• Web UI: HUE UI
• Processing Layer: Spark
• Resource Layer: Hive
• Storage Layer: Hadoop (HDFS)
• Scheduling Layer: Airflow for DAG-based orchestration (see the DAG sketch after this list)
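For context, this is roughly the kind of Airflow DAG I’m trying to build. It’s only a sketch: the DAG id, connection id, container image, namespace, and file paths are all placeholders I made up, and it assumes Airflow 2.4+ with the apache-spark provider installed:

```python
# Hypothetical sketch only: the names, image, namespace, and paths below
# are placeholders, not working values.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_batch_processing",   # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    run_spark_job = SparkSubmitOperator(
        task_id="run_spark_job",
        # Path to the job script inside the Spark container image
        application="local:///opt/spark/jobs/batch_job.py",
        # Airflow connection whose master would be k8s://https://<api-server>:<port>
        conn_id="spark_k8s",
        conf={
            "spark.kubernetes.container.image": "my-registry/spark-job:latest",
            "spark.kubernetes.namespace": "data-platform",
            "spark.executor.instances": "2",
        },
        verbose=True,
    )
```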
Tools and Technologies:
• Docker
• Kubernetes
• Apache Spark
• Apache Hive
• Apache Hadoop
• Apache Airflow
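And this is roughly the PySpark job I want that DAG to run, tying the processing, Hive, and storage layers together. Again, the table and output path are made-up examples, and it assumes the Spark image ships a hive-site.xml pointing at the Hive metastore:

```python
# Hypothetical sketch only: table and path names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily_batch_job")
    .enableHiveSupport()  # needs a hive-site.xml pointing at the Hive metastore
    .getOrCreate()
)

# Read from a Hive table, aggregate, and write the result back to HDFS as Parquet.
orders = spark.table("raw.orders")
daily_totals = orders.groupBy("order_date").sum("amount")
daily_totals.write.mode("overwrite").parquet("hdfs:///warehouse/curated/daily_totals")

spark.stop()
```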
Do you know of any training or documentation that could help me? Any guidance or example configurations would be greatly appreciated.