I have a PySpark application that runs fine locally with master “local”. Now I want to spark-submit the application to a simple cluster (standalone, client mode, running in Docker).
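For context, locally I start it along these lines (simplified, the exact flags don't matter here):
spark-submit --master local src/healthcheck.py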
/opt/bitnami/spark/data-processing/dist/
contains:
- src.zip (all python files)
- venv.zip (created with venv-pack, as the documentation suggests)
Additionally, I unpacked src.zip to /opt/bitnami/spark/data-processing/dist/src because it seems I cannot run my Python entry script directly from the zip.
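For completeness, venv.zip was built more or less like this (simplified; the venv name and the requirements file name are illustrative, the requirements include pydantic-settings among many other packages):
python3 -m venv build-venv                      # illustrative venv name
source build-venv/bin/activate
pip install -r requirements.txt                 # includes pydantic-settings among others
pip install venv-pack
venv-pack -o /opt/bitnami/spark/data-processing/dist/venv.zip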
I run this command on the driver (a simple Docker image based on docker.io/bitnami/spark:3.5.1):
# needed to resolve modules within src
PYTHONPATH=$PYTHONPATH:/opt/bitnami/spark/data-processing/dist/ \
spark-submit \
  --master spark://spark:7077 \
  --archives /opt/bitnami/spark/data-processing/dist/venv.zip#venv \
  --py-files /opt/bitnami/spark/data-processing/dist/src.zip \
  --conf PYSPARK_PYTHON=/opt/bitnami/spark/data-processing/dist/venv/bin/python \
  /opt/bitnami/spark/data-processing/dist/src/healthcheck.py
Running this yields the error ModuleNotFoundError: No module named 'pydantic_settings'.
I can understand that, because the whole program has many dependencies that are shipped in the venv. But as far as I can tell, I followed the documentation exactly?
What I tried:
running spark-submit from inside the activated venv, which fails with:
python3: error while loading shared libraries: libexpat.so.1: cannot open shared object file: No such file or directory
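Concretely, that attempt looked roughly like this (assuming the venv unpacked at dist/venv, i.e. the same interpreter the --conf above points to):
source /opt/bitnami/spark/data-processing/dist/venv/bin/activate
spark-submit ...   # same flags as the command above
# python3: error while loading shared libraries: libexpat.so.1: cannot open shared object file: No such file or directory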
So how do I get this running with a remote master?
And why do I need to ship venv.zip and src.zip to the master (as the documentation says)? Will the program run there and not only on the driver?
Is --conf PYSPARK_PYTHON a setting that takes effect on the driver or on the master?