I am trying to output logs into the Spark event logs so that they are accessible in the History Server. I have tried two approaches:
- adding my own custom logger object that extends Serializable
- extending the org.apache.spark.internal.Logging trait in my object

The code is below.
MyCustomLogger object
package com.vikas

import org.apache.logging.log4j.LogManager

// Serializable holder for a Log4j2 logger. The @transient lazy val keeps the
// logger out of serialized closures; each JVM re-creates it on first use.
object MyCustomLogger extends Serializable {
  @transient lazy val log = LogManager.getRootLogger()
}
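For context, the point of the @transient lazy val is that the object can be referenced from closures that run on executors: the logger is not shipped with the closure but re-created lazily on each JVM. A minimal sketch of executor-side usage (someRdd is a placeholder, not part of the original job):

// Runs on the executors; each executor JVM initializes its own logger
// the first time MyCustomLogger.log is touched.
someRdd.foreachPartition { partition =>
  // .size consumes the iterator; fine here since we only log
  MyCustomLogger.log.warn(s"processing ${partition.size} records on an executor")
}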
WordCount object
package com.vikas

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object WordCount extends Logging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountExample").getOrCreate()
    // val filePath = "src/main/resources/sample.txt"
    val filePath = args(0).trim
    log.warn("Vikas: reading files")
    // logWarning("reading files") // note: the Logging trait provides logWarning, not logWarn
    val lines = spark.sparkContext.textFile(filePath)
    MyCustomLogger.log.warn("transforming lines")
    val wordCounts = transform(lines)
    // println("foreach over collect will output in the same order as the text file")
    val currdate: String = DateTimeFormatter.ofPattern("yyyyMMdd_HHmm").format(LocalDateTime.now)
    val outputDir = args(1).trim + currdate
    MyCustomLogger.log.warn("writing output")
    wordCounts.coalesce(1).saveAsTextFile(outputDir)
    MyCustomLogger.log.warn("finished writing output")
    spark.stop()
  }

  def transform(rdd: RDD[String]): RDD[(String, Int)] = {
    rdd
      .flatMap(_.split("\\s"))           // Split on whitespace ("\s" alone is an invalid escape in a string literal)
      .map(_.replaceAll("[,.!?:;]", "")  // Remove punctuation
        .trim
        .toLowerCase)                    // and convert to lowercase
      .filter(!_.isEmpty)                // Drop empty tokens
      .map((_, 1))                       // Initialize word count pairs
      .reduceByKey(_ + _)                // Count words
      .sortByKey()                       // and sort the counts in lexical order
  }
}
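As an aside, the org.apache.spark.internal.Logging trait (a Spark-internal API, so subject to change) exposes helpers such as logInfo, logWarning, and logError; there is no logWarn, which is why the commented-out call above is named logWarning. A minimal illustrative sketch:

import org.apache.spark.internal.Logging

// Hypothetical object, just to show the trait's helper methods.
object TraitLoggingDemo extends Logging {
  def demo(): Unit = {
    logInfo("starting")              // by-name argument: the message is built only if INFO is enabled
    logWarning("something to flag")  // goes to the driver's Log4j output, not the event log
  }
}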
After running the code, I am not seeing my log output in the job logs on the History Server.
Can someone please help me understand what I am doing wrong here?
log4j2.properties file
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
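For reference, Log4j2 also lets you scope a logger to the application's package alongside the root logger; a sketch assuming the code lives under com.vikas (the logger name "app" is arbitrary):

# Illustrative: route the application package's loggers at WARN to the console
logger.app.name = com.vikas
logger.app.level = warn
logger.app.appenderRef.stdout.ref = console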
Log File from History Server
24/09/10 06:19:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/09/10 06:19:07 WARN DependencyUtils: Skip remote jar s3://vikasemrbucket/jars/SampleSparkProject-1.0.jar.
24/09/10 06:19:07 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-172-31-20-226.ap-southeast-2.compute.internal/172.31.20.226:8032
24/09/10 06:19:08 INFO Configuration: resource-types.xml not found
24/09/10 06:19:08 INFO ResourceUtils: Unable to find 'resource-types.xml'.
24/09/10 06:19:08 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
24/09/10 06:19:08 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
24/09/10 06:19:08 INFO Client: Setting up container launch context for our AM
24/09/10 06:19:08 INFO Client: Setting up the launch environment for our AM container
24/09/10 06:19:08 INFO Client: Preparing resources for our AM container
24/09/10 06:19:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
24/09/10 06:19:11 INFO Client: Uploading resource file:/mnt/tmp/spark-60dda2ce-2cc0-41b4-9800-9a8e679e3833/__spark_libs__429154216109170134.zip -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/__spark_libs__429154216109170134.zip
24/09/10 06:19:13 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
24/09/10 06:19:13 INFO Client: Uploading resource s3://vikasemrbucket/jars/SampleSparkProject-1.0.jar -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/SampleSparkProject-1.0.jar
24/09/10 06:19:14 INFO S3NativeFileSystem: Opening 's3://vikasemrbucket/jars/SampleSparkProject-1.0.jar' for reading
24/09/10 06:19:15 INFO Client: Uploading resource file:/etc/hudi/conf.dist/hudi-defaults.conf -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/hudi-defaults.conf
24/09/10 06:19:16 INFO Client: Uploading resource file:/mnt/tmp/spark-60dda2ce-2cc0-41b4-9800-9a8e679e3833/__spark_conf__6454346447521834005.zip -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/__spark_conf__.zip
24/09/10 06:19:16 INFO SecurityManager: Changing view acls to: hadoop
24/09/10 06:19:16 INFO SecurityManager: Changing modify acls to: hadoop
24/09/10 06:19:16 INFO SecurityManager: Changing view acls groups to:
24/09/10 06:19:16 INFO SecurityManager: Changing modify acls groups to:
24/09/10 06:19:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
24/09/10 06:19:16 INFO Client: Submitting application application_1725949012477_0002 to ResourceManager
24/09/10 06:19:16 INFO YarnClientImpl: Submitted application application_1725949012477_0002
24/09/10 06:19:17 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:17 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1725949156784
final status: UNDEFINED
tracking URL: http://ip-172-31-20-226.ap-southeast-2.compute.internal:20888/proxy/application_1725949012477_0002/
user: hadoop
24/09/10 06:19:18 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:19 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:20 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:21 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:22 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:23 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:23 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: ip-172-31-29-193.ap-southeast-2.compute.internal
ApplicationMaster RPC port: 34877
queue: default
start time: 1725949156784
final status: UNDEFINED
tracking URL: http://ip-172-31-20-226.ap-southeast-2.compute.internal:20888/proxy/application_1725949012477_0002/
user: hadoop
24/09/10 06:19:24 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:25 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:26 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:27 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:28 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:29 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:30 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:31 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:32 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:33 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:34 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:35 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:36 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:37 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:38 INFO Client: Application report for application_1725949012477_0002 (state: FINISHED)
24/09/10 06:19:38 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: ip-172-31-29-193.ap-southeast-2.compute.internal
ApplicationMaster RPC port: 34877
queue: default
start time: 1725949156784
final status: SUCCEEDED
tracking URL: http://ip-172-31-20-226.ap-southeast-2.compute.internal:20888/proxy/application_1725949012477_0002/
user: hadoop
24/09/10 06:19:38 INFO ShutdownHookManager: Shutdown hook called
24/09/10 06:19:38 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fd81d76f-bb66-490e-b4be-83149cb4ea4a
24/09/10 06:19:38 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-60dda2ce-2cc0-41b4-9800-9a8e679e3833
Command exiting with ret '0'
Leaving this as an answer in case someone runs into the same problem: adding my log4j2.properties file to spark.driver.extraClassPath and spark.executor.extraClassPath did the trick. The logs do not go into the event logs (those only record structured Spark events for the History Server UI) but into the pod logs.
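For completeness, a sketch of how those settings can be passed at submit time; this assumes YARN cluster mode, and the input/output paths are placeholders (the jar location is taken from the log above). --files ships log4j2.properties into each container's working directory, and putting ./ on the extra classpath lets Log4j2 discover it there:

spark-submit \
  --deploy-mode cluster \
  --class com.vikas.WordCount \
  --files log4j2.properties \
  --conf spark.driver.extraClassPath=./ \
  --conf spark.executor.extraClassPath=./ \
  s3://vikasemrbucket/jars/SampleSparkProject-1.0.jar \
  s3://your-bucket/input.txt s3://your-bucket/output/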