I'm quite new to PySpark and was looking for some advice on how to kit out production environments with Docker. I'm building an ML pipeline that continuously consumes events from Kafka. I mention this in passing to emphasise that (a) there will be many dependencies for my main.py file and (b) the job will be running on the cluster at all times. In digging around there seem to be quite a few options (a couple are mentioned below), but I was looking for guidance on best practices.
- Build the PySpark worker container images with the dependencies installed (pip install -r requirements.txt, etc.), then copy main.py in (rough Dockerfile sketch after this list).
- Tar the dependencies and run spark-submit with the relevant files (see the spark-submit sketch after this list).
- There seems to be a spark-extension library that can be used to load dependencies at run time.
- Have Airflow submit the job.
- I was also wondering whether it would be possible to mount a volume into the container and keep the dependencies and main.py on the Docker host system.
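To make the first option concrete, this is roughly what I had in mind for baking everything into the worker image (the base image, tag and paths below are just placeholders, not something I've settled on):

```dockerfile
# Rough sketch only - base image, tag and paths are placeholders
FROM apache/spark-py:v3.4.0

# Some Spark images run as a non-root user; switch to root to install packages
USER root
WORKDIR /opt/app

# Bake the Python dependencies into the worker image
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the job itself
COPY main.py .

# Drop back to the image's default Spark user (UID may differ per image)
USER 185
```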
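For the second option, my understanding is that the usual pattern is to pack a virtualenv and ship it alongside the job with --archives, along these lines (master URL and file names are made up):

```bash
# Rough sketch only - pack a virtualenv containing the dependencies...
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install -r requirements.txt venv-pack
venv-pack -o pyspark_venv.tar.gz

# ...then ship it with the job; Spark unpacks it as ./environment on the executors
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --master spark://spark-master:7077 \
  --archives pyspark_venv.tar.gz#environment \
  main.py
```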
I quite like the last option (rough sketch of what I mean below), but I haven't actually seen anyone implement it, which makes me think it may be a non-option. Anyway, experience and thoughts much appreciated.
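Something like this, docker-compose style (service name, image and paths are made up, and I haven't verified the workers would actually pick the packages up this way):

```yaml
# Rough sketch only - main.py and its deps live on the docker host
services:
  spark-worker:
    image: my-spark-worker:latest      # whatever Spark worker image is in use
    volumes:
      - ./app:/opt/app                 # host dir containing main.py + installed deps
    environment:
      - PYTHONPATH=/opt/app/deps       # point the workers' Python at the mounted packages
```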
This is more a philosophy and experience question.