I’m new to Apache Spark and I have a problem, but I think it’s more conceptual than technical.
I have set up a cluster environment with one master and two workers. Then I try to execute some really simple code that creates a DataFrame from a CSV file and shows its contents:
from pyspark.sql import SparkSession

# Create SparkSession (the chained builder calls are wrapped in
# parentheses so the multi-line expression is valid Python)
spark = (
    SparkSession.builder
    .master("spark://127.0.0.1:7077")
    .appName("Test")
    .config("spark.driver.host", "[myIP]")
    .getOrCreate()
)
spark.sparkContext.setLogLevel("DEBUG")

file = "/opt/bitnami/spark/apps/mpg.csv"
mpg_data = spark.read.csv(file, header=True, inferSchema=True)
mpg_data.show()
Then I receive this error:
Traceback (most recent call last):
  File "c:spark-ml.py", line 41, in <module>
    mpg_data = spark.read.csv(file, header=True, inferSchema=True)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:readwriter.py", line 740, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "C:captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/opt/bitnami/spark/apps/mpg.csv.
All the Docker containers have a volume mounted at /opt/bitnami/spark/apps/, and each of them can reach the file.
I think the problem is that my machine is the driver, so the code is executed on my computer, not on the workers. But if I point to a file on my computer instead, the workers cannot reach it.
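For a small file, I guess one workaround could be to read the CSV with pandas on the driver and hand the result to Spark, so no worker ever needs the path. Just a sketch of what I mean; the Windows path below is hypothetical, and the whole file has to fit in the driver's memory:

import pandas as pd

# Read the file locally on the driver only...
pdf = pd.read_csv("C:/apps/mpg.csv")

# ...then convert it into a distributed Spark DataFrame.
mpg_data = spark.createDataFrame(pdf)
mpg_data.show()

But this obviously doesn't scale to files that are too big for one machine.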
In the real world, what is the best way to do this? Should I create another Docker container for the driver, or use some other system to ship the file?
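For example, I imagine that pointing every node (and the driver) at the same shared storage might be the idiomatic approach, something like the line below. The s3a bucket here is just a placeholder, and I understand it would also need the hadoop-aws package plus credentials configured:

# Every node resolves the same URI, so driver-local paths stop mattering.
mpg_data = spark.read.csv("s3a://my-bucket/mpg.csv", header=True, inferSchema=True)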
Thank you!