Dear Apache Spark community,
I would like to ask some questions.
My questions:
- From Apache Spark 3.4 onwards there is a new design, Spark Connect (remote connect). What is the difference between it and connecting through the master URL? Does it utilize all the existing workers? I could not find any docs about this.
- When we use SparkSession.builder.master(...) on server B, do we also need to have Spark installed on server B?
Background: server A hosts a Spark cluster with a master and multiple workers. Server B tries to access server A's Spark cluster from a Python IDE.
Note: server B does not have any Spark installation, only pyspark installed!
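For context, here is how I verify the client-side pyspark version on server B (it should match the 3.5.1 server images):

import pyspark
print(pyspark.__version__)  # I expect 3.5.1 here, matching the apache/spark:3.5.1 images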
I have two cases to show:
Case 1: On server B, I can connect to server A using Spark Connect (remote) in the IDE like this:
from pyspark.sql import SparkSession

spark_connect_url = "sc://ServerA_IP_ADDRESS:15002"
spark = (
    SparkSession.builder
    .appName("test")
    .remote(spark_connect_url)
    .getOrCreate()
)
This one works perfectly fine!
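For example, a trivial job of my own (not from any doc) returns results over this Connect session:

spark.range(10).selectExpr("sum(id) AS total").show()  # simple aggregation to confirm the session executes queries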
Case 2: On server B, I try to connect to server A using the classic master URL in the IDE like this:
from pyspark.sql import SparkSession

spark_master_url = "spark://ServerA_IP_ADDRESS:7077"
spark = (
    SparkSession.builder
    .appName("test")
    .master(spark_master_url)
    .getOrCreate()
)
I get this warning when running the code:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
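My understanding is that with .master(...) the driver runs on server B itself, so the workers on server A must be able to connect back to the driver. Below is a sketch of the kind of configuration I assume this would need; ServerB_IP_ADDRESS is a placeholder and I have not confirmed this is the fix:

from pyspark.sql import SparkSession

# Sketch (unconfirmed): spark.driver.host should be an address of server B
# that the workers can reach; spark.driver.bindAddress controls the local bind.
spark = (
    SparkSession.builder
    .appName("test")
    .master("spark://ServerA_IP_ADDRESS:7077")
    .config("spark.driver.host", "ServerB_IP_ADDRESS")  # placeholder address
    .config("spark.driver.bindAddress", "0.0.0.0")
    .getOrCreate()
)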
My docker-compose YAML file (from https://medium.com/@yssmelo/spark-connect-launch-spark-applications-anywhere-with-the-client-server-architecture-dbt-f99399c566fe):
version: '3'
services:
  spark-master:
    image: apache/spark:3.5.1
    container_name: spark-master
    hostname: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=7077
      - SPARK_MASTER_WEBUI_PORT=8080
    entrypoint:
      - "bash"
      - "-c"
      - "/opt/spark/sbin/start-master.sh && tail -f /dev/null"
    volumes:
      - ./data:/opt/spark/work-dir/spark-warehouse/data:rw
  spark-connect:
    image: apache/spark:3.5.1
    container_name: spark-connect
    hostname: spark-connect
    ports:
      - "4040:4040"
      - "15002:15002"
    depends_on:
      - spark-master
    volumes:
      - ./jars/spark-connect_2.12-3.5.1.jar:/opt/spark/jars/spark-connect_2.12-3.5.1.jar
      - ./data:/opt/spark/work-dir/spark-warehouse/data:rw
    command:
      - "bash"
      - "-c"
      - "/opt/spark/sbin/start-connect-server.sh --jars /opt/spark/jars/spark-connect_2.12-3.5.1.jar && tail -f /dev/null"
  spark-worker:
    image: apache/spark:3.5.1
    container_name: spark-worker
    hostname: spark-worker
    depends_on:
      - spark-master
    entrypoint:
      - "bash"
      - "-c"
      - "/opt/spark/sbin/start-worker.sh spark://spark-master:7077 && tail -f /dev/null"
    volumes:
      - ./data:/opt/spark/work-dir/spark-warehouse/data:rw