I’m not very knowledgeable about Python packages and Java, so please bear with me.
I’m having some trouble with the following code in my Jupyter notebook.
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.functions import explode, col
from pyspark.conf import SparkConf
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
uri = "mongodb://localhost:27017/connection1"
spark = (
    SparkSession.builder
    .master("local")
    .appName("Tesi")
    .config("spark.mongodb.read.connection.uri", uri)
    .config("spark.mongodb.write.connection.uri", uri)
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.13:10.3.0")
    .config("spark.driver.bindAddress", "127.0.0.1")
    .getOrCreate()
)
df = (
    spark.read
    .format("mongodb")
    .option("uri", uri)
    .option("database", "local")
    .option("collection", "electricity_readings")
    .load()
)
Running this code triggers the following exception:
An error occurred while calling o213.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: mongodb.
While investigating the problem, I came across a few possible solutions, but none of them worked. Either my problem is different or I’m doing something wrong (I suspect the latter), so I’m looking for some help to finish setting up this project.
These are the relevant environment variables:
- PYSPARK_HOME = C:\Users\filos\AppData\Roaming\Python\Python39\site-packages\pyspark
- SPARK_HOME = C:\Spark\spark-3.5.1-bin-hadoop3
- JAVA_HOME = C:\Progra~1\Java\jdk-18.0.1.1
- Path contains the above variables with their respective bin folders
- HADOOP_HOME = C:\Hadoop (it only contains a bin folder with winutils.exe)
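In case it matters, the values can be checked from inside the notebook kernel with plain os.environ lookups, e.g.:

import os

# Print the variables as the notebook kernel sees them, to confirm
# they match what is set at the system level.
for var in ("PYSPARK_HOME", "SPARK_HOME", "JAVA_HOME", "HADOOP_HOME", "CLASSPATH"):
    print(var, "=", os.environ.get(var, "<not set>"))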
No CLASSPATH or JAVA_CLASSPATH is set. I downloaded the jar files for mongo-spark and put them in the %PYSPARK_HOME%\jars folder, hoping the jars would be picked up from there, but that didn’t work. I then put them in a separate folder I made (C:\Progra~1\Java\jars) and set the CLASSPATH variable to that path, but the following exception got triggered instead:
An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.IOException: Failed to connect to xxx/xxx.xxx.x.x:59587
...
Caused by: java.net.ConnectException: Connection refused: no further information
...
This also triggers if I put mongo-spark-connector_2.13-10.3.0.jar in the C:\Program Files\Java\jre-1.8\lib folder. The problem is not exclusive to the cell with the code I’ve submitted: a previous cell that creates a different SparkSession and reads from a JSON file also triggers the same exception when run.
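One alternative I’ve come across is to skip CLASSPATH entirely and set PYSPARK_SUBMIT_ARGS before the SparkSession is created, so that pyspark resolves and downloads the connector itself. A sketch of what I understand that to look like:

import os

# Must be set before the SparkSession (and thus the JVM) is started;
# "pyspark-shell" has to be the last token.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.mongodb.spark:mongo-spark-connector_2.13:10.3.0 pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Tesi").getOrCreate()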
I believe the jars I’ve downloaded need to go somewhere specific, and perhaps the CLASSPATH variable needs to be set so those jars can be found, but I’m not sure how, since simply pointing CLASSPATH at an arbitrary folder containing the jars triggers an exception.
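If the right fix is instead to point Spark at the downloaded jar files directly, my understanding is that the spark.jars config option takes a comma-separated list of local paths, something like the sketch below (using the folder I made earlier; I’m not sure this is the correct usage, which is part of what I’m asking):

from pyspark.sql import SparkSession

uri = "mongodb://localhost:27017/connection1"

# Point Spark directly at the downloaded jar(s) instead of relying on
# CLASSPATH; spark.jars takes a comma-separated list of local paths.
spark = (
    SparkSession.builder
    .master("local")
    .appName("Tesi")
    .config("spark.jars", r"C:\Progra~1\Java\jars\mongo-spark-connector_2.13-10.3.0.jar")
    .config("spark.mongodb.read.connection.uri", uri)
    .config("spark.mongodb.write.connection.uri", uri)
    .getOrCreate()
)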