My use case is to create an EMR 7.3 cluster in AWS and invoke a Lambda that sends an HTTP request to Livy, passing the configs and payload over to initiate a Spark job.
These are my software settings:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.acls.enable": "false",
      "spark.authenticate": "false",
      "spark.driver.memory": "4g",
      "spark.master": "yarn-cluster",
      "spark.modify.acls": "livy",
      "spark.modify.acls.groups": "",
      "spark.ui.view.acls": "livy",
      "spark.ui.view.acls.groups": ""
    }
  },
  {
    "Classification": "hdfs-site",
    "Properties": {
      "dfs.client.failover.connection.retries": "3",
      "dfs.client.failover.connection.retries.on.timeouts": "3",
      "dfs.datanode.handler.count": "50"
    }
  },
  {
    "Classification": "spark-log4j2",
    "Properties": {
      "logger.IdentifierForClass.level": "info",
      "logger.IdentifierForClass.name": "com.package",
      "rootLogger.level": "info",
      "rootLogger.name": "com.package"
    }
  }
]
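For reference, these classifications are the same structures the RunJobFlow API takes at cluster creation. Below is a minimal sketch of supplying them through the AWS SDK for Java v2, assuming the software.amazon.awssdk:emr artifact is on the classpath; the cluster name is made up, and required fields such as instances, roles and the Spark application are elided:

import java.util.List;
import java.util.Map;
import software.amazon.awssdk.services.emr.EmrClient;
import software.amazon.awssdk.services.emr.model.Configuration;
import software.amazon.awssdk.services.emr.model.RunJobFlowRequest;
import software.amazon.awssdk.services.emr.model.RunJobFlowResponse;

public class CreateClusterSketch {
    public static void main(String[] args) {
        // Mirror of the "spark-defaults" classification above (trimmed to two keys).
        Configuration sparkDefaults = Configuration.builder()
                .classification("spark-defaults")
                .properties(Map.of(
                        "spark.driver.memory", "4g",
                        "spark.acls.enable", "false"))
                .build();

        try (EmrClient emr = EmrClient.create()) {
            RunJobFlowRequest request = RunJobFlowRequest.builder()
                    .name("emr7-promobatch")        // hypothetical cluster name
                    .releaseLabel("emr-7.3.0")
                    .configurations(List.of(sparkDefaults))
                    // instances(...), serviceRole(...), jobFlowRole(...) and
                    // applications(...) are also required in a real request
                    .build();
            RunJobFlowResponse response = emr.runJobFlow(request);
            System.out.println("Cluster ID: " + response.jobFlowId());
        }
    }
}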
This is my Java program's pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.package</groupId>
  <artifactId>promobatch</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>promobatch</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>17</java.version>
    <scala.version>2.12</scala.version>
    <spark.version>3.5.1</spark.version>
    <postgresql.version>42.7.3</postgresql.version>
    <spark.excel.version>0.20.4</spark.excel.version>
    <spring.version>6.1.14</spring.version>
    <ojdbc.version>19.6.0.0</ojdbc.version>
    <aws.sdk.version>2.25.70</aws.sdk.version>
    <maven.compiler.source>${java.version}</maven.compiler.source>
    <maven.compiler.target>${java.version}</maven.compiler.target>
  </properties>

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>bom</artifactId>
        <version>${aws.sdk.version}</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>

  <dependencies>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>s3</artifactId>
    </dependency>
    <!-- Spark -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.version}</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>${postgresql.version}</version>
    </dependency>
    <dependency>
      <groupId>com.oracle.database.jdbc</groupId>
      <artifactId>ojdbc8</artifactId>
      <version>${ojdbc.version}</version>
    </dependency>
    <dependency>
      <groupId>com.crealytics</groupId>
      <artifactId>spark-excel_${scala.version}</artifactId>
      <version>${spark.version}_${spark.excel.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-jdbc</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>kms</artifactId>
      <version>${aws.sdk.version}</version>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>sns</artifactId>
      <version>${aws.sdk.version}</version>
    </dependency>
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>dynamodb</artifactId>
      <version>${aws.sdk.version}</version>
    </dependency>
  </dependencies>

  <build>
    <finalName>emr7-promobatch-0.0.1-SNAPSHOT</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <!-- Align with java.version; EMR 7 runs on Java 17 -->
          <source>${java.version}</source>
          <target>${java.version}</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-jar-plugin</artifactId>
        <version>3.2.0</version>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.package.promobatch.App</mainClass>
            </manifest>
          </archive>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
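One note on the build: EMR 7 already ships the Spark 3.5 runtime on the cluster, so the Spark artifacts are usually marked provided to keep them out of the job jar's classpath. A sketch of that change for one dependency (the same scope would apply to spark-core and spark-mllib):

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_${scala.version}</artifactId>
  <version>${spark.version}</version>
  <!-- supplied by the EMR cluster at runtime -->
  <scope>provided</scope>
</dependency>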
This is my log4j2.properties, which is located at src/main/resources/log4js.properties:
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = Console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = warn
logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
logger.thriftserver.level = warn
logger.jetty1.name = org.sparkproject.jetty
logger.jetty1.level = warn
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = error
logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.replexprTyper.level = info
logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.replSparkILoopInterpreter.level = info
logger.parquet1.name = org.apache.parquet
logger.parquet1.level = error
logger.parquet2.name = parquet
logger.parquet2.level = error
logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.RetryingHMSHandler.level = fatal
logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.FunctionRegistry.level = error
appender.console.filter.1.type = RegexFilter
appender.console.filter.1.regex = .*Thrift error occurred during processing of message.*
appender.console.filter.1.onMatch = deny
appender.console.filter.1.onMismatch = neutral
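As a side note, Log4j 2 auto-discovers only a classpath resource named exactly log4j2.properties; a file under any other name has to be referenced explicitly, for example via the log4j2.configurationFile system property in the driver options. A hedged example of such a conf entry (the classpath-relative file name here is an assumption):

"conf": {
  "spark.driver.extraJavaOptions": "-Dlog4j2.configurationFile=log4j2.properties"
}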
This is my Java program's main class:
// Library imports; SNSUtil, PropertiesConstants and DateTimeUtil are
// project-internal helpers, so their imports are omitted here.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.spark.sql.SparkSession;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class Test {

    private static final String JOB_NAME = "Test";
    private static final Logger logger = LogManager.getLogger(Test.class);

    private String clusterId;

    public Test(String clusterId) {
        logger.info("Dealer Cluster ID: {}", clusterId);
        this.clusterId = clusterId;
    }

    public void start() {
        logger.info("Starting EMR Job");
        SparkSession spark = null;
        AnnotationConfigApplicationContext context = null;
        String topicArnDevOps = "";
        try {
            boolean isEncrypt = true;
            String os = System.getProperty("os.name");
            if (os.startsWith("Windows")) {
                // Local development run.
                spark = SparkSession.builder().appName(JOB_NAME).master("local").getOrCreate();
                isEncrypt = false;
            } else {
                // On the cluster, the master comes from spark-submit/Livy.
                spark = SparkSession.builder().appName(JOB_NAME).getOrCreate();
                isEncrypt = true;
            }
            configureSparkSession(spark);
            topicArnDevOps = PropertiesConstants.getSns_topic();
            uploadFileToS3();
        } catch (Exception e) {
            logger.error("Exception for EMR Job: {}", JOB_NAME, e);
            SNSUtil.sendNotification(topicArnDevOps, String.format(
                    "EMR Job %s failed. Kindly check logs for Cluster Id: %s. Exception: %s",
                    JOB_NAME, clusterId, e.toString()));
        } finally {
            closeResources(spark, context);
        }
    }

    private void configureSparkSession(SparkSession spark) {
        spark.conf().set("spark.sql.session.timeZone", DateTimeUtil.ZONE_ID);
        spark.conf().set("materializationProject", "maxis-cdw-shared-analysis");
        spark.conf().set("materializationDataset", "TEMP");
    }

    private void closeResources(SparkSession spark, AnnotationConfigApplicationContext context) {
        logger.info("Stopping Spark Context");
        if (spark != null && !spark.sparkContext().isStopped()) {
            spark.sparkContext().stop();
            spark.close();
        }
        if (context != null) {
            context.close();
        }
    }

    public void uploadFileToS3() {
        logger.info("Starting S3 File Upload");
        try {
            String bucketName = "nba-dev-generic-bucket";
            String key = "voon_test/test.txt";
            String content = "This is a test content to be written into S3";
            S3Client s3Client = S3Client.builder().build();
            PutObjectRequest putObjectRequest = PutObjectRequest.builder()
                    .bucket(bucketName)
                    .key(key)
                    .build();
            s3Client.putObject(putObjectRequest, RequestBody.fromString(content));
            logger.info("Successfully written to S3: s3://{}/{}", bucketName, key);
        } catch (Exception e) {
            logger.error("Exception during S3 File Upload: ", e);
        }
    }
}
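SNSUtil, PropertiesConstants and DateTimeUtil are internal helpers of the project. For completeness, a hypothetical sketch of what SNSUtil.sendNotification could look like on top of the sns artifact already declared in the pom (only the class and method names match the calls above; the body is assumed):

import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public final class SNSUtil {

    private SNSUtil() {
    }

    // Publishes a plain-text message to the given topic ARN.
    public static void sendNotification(String topicArn, String message) {
        try (SnsClient sns = SnsClient.create()) {
            sns.publish(PublishRequest.builder()
                    .topicArn(topicArn)
                    .message(message)
                    .build());
        }
    }
}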
The main program runs successfully; it is just that the logs are not showing.
These are the logs at the Livy console:
24/12/17 14:59:33 INFO BatchSessionManager: Registered new session 1
24/12/17 14:59:36 INFO LineBufferedStream: 24/12/17 14:59:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/17 14:59:38 INFO LineBufferedStream: 24/12/17 14:59:38 WARN DependencyUtils: Skip remote jar s3://nba-dev-generic-bucket/EMR7/lib/spoiwo_2.12-2.2.1.jar.
...
24/12/17 14:59:38 INFO LineBufferedStream: 24/12/17 14:59:38 WARN DependencyUtils: Skip remote jar s3://nba-dev-generic-bucket/EMR7/lib/xmlbeans-5.2.1.jar.
24/12/17 14:59:38 INFO LineBufferedStream: 24/12/17 14:59:38 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-156-120-90.aws.men.package.com.my/10.156.120.90:8032
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO Configuration: resource-types.xml not found
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO ResourceUtils: Unable to find 'resource-types.xml'.
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11712 MB per container)
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO Client: Will allocate AM container, with 4505 MB memory including 409 MB overhead
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO Client: Setting up container launch context for our AM
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO Client: Setting up the launch environment for our AM container
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 INFO Client: Preparing resources for our AM container
24/12/17 14:59:39 INFO LineBufferedStream: 24/12/17 14:59:39 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
24/12/17 14:59:40 INFO LineBufferedStream: 24/12/17 14:59:40 INFO Client: Uploading resource file:/mnt/tmp/spark-ec16008a-d702-40d7-b17e-c6177e1894bb/__spark_libs__17854039851719844287.zip -> hdfs://ip-10-156-120-90.aws.men.package.com.my:8020/user/livy/.sparkStaging/application_1734418464855_0002/__spark_libs__17854039851719844287.zip
...
24/12/17 15:00:17 INFO LineBufferedStream: 24/12/17 15:00:17 INFO S3NativeFileSystem: Opening 's3://nba-dev-generic-bucket/EMR7/lib/spring-expression-6.1.14.jar' for reading
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO SecurityManager: Changing view acls to: livy
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO SecurityManager: Changing modify acls to: livy
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO SecurityManager: Changing view acls groups to:
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO SecurityManager: Changing modify acls groups to:
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: livy; groups with view permissions: EMPTY; users with modify permissions: livy; groups with modify permissions: EMPTY
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO Client: Submitting application application_1734418464855_0002 to ResourceManager
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO YarnClientImpl: Submitted application application_1734418464855_0002
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO Client: Application report for application_1734418464855_0002 (state: ACCEPTED)
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO Client:
24/12/17 15:00:22 INFO LineBufferedStream: client token: N/A
24/12/17 15:00:22 INFO LineBufferedStream: diagnostics: [Tue Dec 17 15:00:22 +0800 2024] Application is Activated, waiting for resources to be assigned for AM. Details : AM Partition = <DEFAULT_PARTITION> ; Partition Resource = <memory:11712, vCores:4> ; Queue's Absolute capacity = 100.0 % ; Queue's Absolute used capacity = 0.0 % ; Queue's Absolute max capacity = 100.0 % ; Queue's capacity (absolute resource) = <memory:11712, vCores:4> ; Queue's used capacity (absolute resource) = <memory:0, vCores:0> ; Queue's max capacity (absolute resource) = <memory:11712, vCores:4> ;
24/12/17 15:00:22 INFO LineBufferedStream: ApplicationMaster host: N/A
24/12/17 15:00:22 INFO LineBufferedStream: ApplicationMaster RPC port: -1
24/12/17 15:00:22 INFO LineBufferedStream: queue: default
24/12/17 15:00:22 INFO LineBufferedStream: start time: 1734418822248
24/12/17 15:00:22 INFO LineBufferedStream: final status: UNDEFINED
24/12/17 15:00:22 INFO LineBufferedStream: tracking URL: http://ip-10-156-120-90.aws.men.package.com.my:20888/proxy/application_1734418464855_0002/
24/12/17 15:00:22 INFO LineBufferedStream: user: livy
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO ShutdownHookManager: Shutdown hook called
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-e03a87a5-3496-48ce-bc98-b6e3d7239cbf
24/12/17 15:00:22 INFO LineBufferedStream: 24/12/17 15:00:22 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-ec16008a-d702-40d7-b17e-c6177e1894bb
This is the Lambda request payload that triggers the Spark job:
{
  "file": "s3://nba-dev-generic-bucket/emr7-promobatch-0.0.1-SNAPSHOT.jar",
  "className": "com.package.promobatch.App",
  "name": "Test-20241217-0659",
  "args": [
    "Test",
    "-",
    "j-MP182UNW7396",
    "100"
  ],
  "conf": {
    "spark.driver.extraJavaOptions": "-Dlog4j.configuration=log4j.properties"
  },
  "jars": [
    "s3://nba-dev-generic-bucket/EMR7/lib/SparseBitSet-1.3.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/aircompressor-0.25.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/annotations-17.0.0.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/apache-client-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/arns-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/audience-annotations-0.5.0.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/auth-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/aws-core-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/aws-json-protocol-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/aws-query-protocol-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/aws-xml-protocol-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/checker-qual-3.42.0.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/checksums-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/checksums-spi-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/crt-core-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/curvesapi-1.08.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/dynamodb-2.25.70.jar",
    "s3://nba-dev-generic-bucket/EMR7/lib/endpoints-spi-2.25.70.jar"
  ]
}
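For reference, this payload goes to Livy's POST /batches endpoint. A minimal sketch of the call the Lambda makes with java.net.http (the Livy host, the default port 8998, and the trimmed payload string are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LivySubmitSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Livy endpoint on the EMR primary node (default port 8998).
        String livyUrl = "http://ip-10-156-120-90.aws.men.package.com.my:8998/batches";

        // The batch definition; in the real flow this is the full payload shown above.
        String payload = "{\"file\":\"s3://nba-dev-generic-bucket/emr7-promobatch-0.0.1-SNAPSHOT.jar\","
                + "\"className\":\"com.package.promobatch.App\",\"name\":\"Test\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(livyUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Livy answers with the created batch session, e.g. its id and state.
        System.out.println(response.statusCode() + " " + response.body());
    }
}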
What could be the issue? The logs should print out at the Livy console, or show up in the logs at S3.