When installing PySpark from pypi.org with
pip install pyspark==3.5.0
there doesn't seem to be any requirement to set the SPARK_HOME environment variable.
How does that work?
In contrast, if I download 'Apache Spark' from
https://www.apache.org/dyn/closer.lua/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
which also contains PySpark, the following steps are needed:
export SPARK_HOME=/opt/software/spark
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH
How does `pip install pyspark` obviate the need to set up SPARK_HOME? What is the mechanism?
Regardless of the PySpark installation method, you can run PySpark with `spark-submit` or `pyspark`.

If `SPARK_HOME` is not set, `spark-submit` or `pyspark` tries to set it automatically by executing the `find-spark-home` script. That script checks whether PySpark was installed with pip. When it is pip installed, `find_spark_home.py` is available in `$VIRTUAL_ENV/bin`. Then `find_spark_home.py` does the rest of the job; you will find more detail in that Python script.
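You can call the same resolution logic yourself to see what it finds (a minimal sketch; `_find_spark_home` is a private helper inside the pip-installed package, so its name may change between versions):

# Show what SPARK_HOME the pip-installed PySpark resolves to, using the
# same module that the find-spark-home launcher script executes.
import os
from pyspark.find_spark_home import _find_spark_home  # private helper, may change between versions

print("SPARK_HOME in the environment:", os.environ.get("SPARK_HOME"))  # usually None after a plain pip install
print("Resolved by find_spark_home  :", _find_spark_home())            # points into site-packages/pyspark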
So as long as you have Java installed, you should be fine without additional configuration like those three exports. But as always, errors can happen when the whole process relies on so many environment variable checks. Hope that helps.
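For example, in a fresh virtual environment with only `pip install pyspark` and a JDK available, something like this should start a local session without any of the three exports (a sketch; the master and app name are arbitrary choices):

# Start a local Spark session without exporting SPARK_HOME, PATH or PYTHONPATH;
# the launcher bundled with the pip package resolves SPARK_HOME on its own.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("no-spark-home-demo")
         .getOrCreate())
print(spark.version)  # e.g. 3.5.0
spark.stop()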