I have trouble getting different versions of PySpark to work correctly on my Windows machine in combination with different versions of Python installed via pyenv.
The setup:
- I installed pyenv and let it set the environment variables (PYENV, PYENV_HOME, PYENV_ROOT and the entry in PATH)
- I installed the Amazon Corretto Java JDK (jdk1.8.0_412) and set the JAVA_HOME environment variable.
- I downloaded the winutils.exe & hadoop.dll from here and set the HADOOP_HOME environment variable.
- Via pyenv I installed Python 3.10.10 and then pyspark 3.4.1
- Via pyenv I installed Python 3.8.10 and then pyspark 3.2.1 (roughly the commands sketched below)
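From memory, the install steps per version were roughly the following (exact flags may have differed slightly):

pyenv install 3.10.10
pyenv global 3.10.10
pip install pyspark==3.4.1

pyenv install 3.8.10
pyenv global 3.8.10
pip install pyspark==3.2.1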
Python works as expected, but I’m having trouble with PySpark.
For one, I cannot start PySpark from the PowerShell console by running pyspark:
-> The term 'pyspark' is not recognized as the name of a cmdlet, function, script file....
More annoyingly, my repo-scripts (with a .venv created via pyenv & poetry) also fail:
Caused by: java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
[…]Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
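For completeness, the repo's .venv was set up with pyenv and poetry along these lines (I don't have the exact commands at hand, so this is only a rough sketch; pyspark is declared as a dependency in pyproject.toml):

pyenv local 3.10.10
poetry env use python
poetry install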
However, both work after I add the following two entries to the PATH environment variable:
- C:\Users\myuser\.pyenv\pyenv-win\versions\3.10.10
- C:\Users\myuser\.pyenv\pyenv-win\versions\3.10.10\Scripts
but then I would have to “hardcode” the Python version, which is exactly what I don’t want to do while using pyenv.
If I hardcode the path, then even after I switch to another Python version with pyenv global 3.8.10, running pyspark in PowerShell still starts PySpark 3.4.1 from the Python 3.10.10 entry in PATH. I was hoping to start PySpark 3.2.1 from the Python 3.8.10 I just “activated” globally with pyenv.
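For reference, this is roughly how I switch versions and check what is active (again from memory):

pyenv global 3.8.10
pyenv version
python --version
pip show pyspark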
What do I have to do to be able to switch between the Python installations (and thus also between the PySpark versions) with pyenv, without “hardcoding” the Python paths?
Example PySpark script:
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("myapp")
    .getOrCreate()
)

data = [
    ("Finance", 10),
    ("Marketing", 20),
]
df = spark.createDataFrame(data=data)
df.show(10, False)
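In the repo the script is started via poetry, roughly like this (the file name my_job.py is just a placeholder):

poetry run python my_job.py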