I am trying to create a job in EMR Studio that runs on an EMR Serverless application. It's a relatively basic script that uses PySpark to read some Athena tables, perform some joins, build an output dataframe, and write it back to S3 as Parquet files.
However, I keep hitting the same errors, and they fail the job every time.
Traceback (most recent call last):
  File "/tmp/spark-05cfc8f7-6c38-4b42-bd7c-82bedb9b2920/AMF_EMR_test.py", line 37, in <module>
    users_df = read_athena_table("<table_name>")
  File "/tmp/spark-05cfc8f7-6c38-4b42-bd7c-82bedb9b2920/AMF_EMR_test.py", line 32, in read_athena_table
    driver=driver
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 184, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o267.load.
: java.lang.ClassNotFoundException: com.amazon.athena.jdbc.AthenaDriver
The driver log shows both JARs being added, yet shortly afterwards the executors fail with a different ClassNotFoundException:

24/07/23 11:00:06 INFO SparkContext: Added JAR s3://<bucket location>/athena-jdbc-3.2.1-with-dependencies.jar at s3://<bucket location>/athena-jdbc-3.2.1-with-dependencies.jar with timestamp 1721732405904
24/07/23 11:00:06 INFO SparkContext: Added JAR s3://<bucket location>/hadoop-aws-3.3.1.jar at s3://<bucket location>/hadoop-aws-3.3.1.jar with timestamp 1721732405904
....
24/07/23 11:00:12 ERROR TaskSchedulerImpl: Lost executor 2 on [2a05:d01c:126:af00:e372:7f2e:9945:40dc]: Unable to create executor due to java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
The top of the script is below.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, lit, split, when, date_add, expr, lead, lag, datediff
from pyspark.sql.window import Window
import logging

# Initialize Spark session
spark = (
    SparkSession.builder
    .appName("ComplexQueryProcessing")
    .config("spark.jars", "s3://<bucket location>/athena-jdbc-3.2.1-with-dependencies.jar,s3://<bucket location>/hadoop-aws-3.3.1.jar")
    .config("spark.driver.extraClassPath", "s3://<bucket location>/athena-jdbc-3.2.1-with-dependencies.jar,s3://<bucket location>/hadoop-aws-3.3.1.jar")
    .config("spark.executor.extraClassPath", "s3://<bucket location>/athena-jdbc-3.2.1-with-dependencies.jar,s3://<bucket location>/hadoop-aws-3.3.1.jar")
    .getOrCreate()
)

logging.info("Spark session created successfully")
logging.info("JARs loaded: %s", spark.sparkContext.getConf().get("spark.jars"))

athena_jdbc_url = "jdbc:athena://AwsRegion=eu-west-2;S3OutputLocation=s3://<bucket location>/temp-files/"
driver = "com.amazon.athena.jdbc.AthenaDriver"

def read_athena_table(dbtable):
    # Read an Athena table over JDBC into a Spark dataframe
    return spark.read.format("jdbc").options(
        url=athena_jdbc_url,
        dbtable=dbtable,
        driver=driver,
    ).load()
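For completeness, the call that triggers the failure (line 37 in the traceback above) is just:

users_df = read_athena_table("<table_name>")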
I've tried both the JDBC 2.x and JDBC 3.x drivers, switching the driver class name and URL prefix accordingly (the variants I cycled through are sketched below). I've also tried dropping hadoop-aws-3.3.1 entirely, and I've submitted the job with a job configuration JSON to see if that makes any difference. Same errors every time. This has become a real blocker, so all ideas are welcome.
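These are the two driver/URL combinations I cycled through. The 3.x values are the ones in the script above; the 2.x values are the Simba-based class and URL prefix as I understood them from the 2.x driver docs, so treat those as approximate:

# JDBC 3.x (the uber JAR referenced in the script above)
driver = "com.amazon.athena.jdbc.AthenaDriver"
athena_jdbc_url = "jdbc:athena://AwsRegion=eu-west-2;S3OutputLocation=s3://<bucket location>/temp-files/"

# JDBC 2.x (Simba-based driver; class name and URL prefix from the 2.x docs)
driver = "com.simba.athena.jdbc.Driver"
athena_jdbc_url = "jdbc:awsathena://AwsRegion=eu-west-2;S3OutputLocation=s3://<bucket location>/temp-files/"

And the job configuration JSON I submitted was roughly the following (a minimal sketch of an EMR Serverless StartJobRun request; the application ID, role ARN, and bucket paths are placeholders):

{
  "applicationId": "<application id>",
  "executionRoleArn": "arn:aws:iam::<account>:role/<job role>",
  "jobDriver": {
    "sparkSubmit": {
      "entryPoint": "s3://<bucket location>/AMF_EMR_test.py",
      "sparkSubmitParameters": "--jars s3://<bucket location>/athena-jdbc-3.2.1-with-dependencies.jar,s3://<bucket location>/hadoop-aws-3.3.1.jar"
    }
  }
}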