I am trying to output logs into the Spark event logs so that they are accessible in the History Server. I have tried two approaches:
- adding my own custom logger object that extends Serializable
- extending the org.apache.spark.internal.Logging trait in my object

The code is below.
MyCustomLogger object
package com.vikas

import org.apache.logging.log4j.LogManager

// Serializable holder for a Log4j2 logger. The @transient lazy val keeps the
// logger out of serialized closures; each JVM re-creates it on first use.
object MyCustomLogger extends Serializable {
  @transient lazy val log = LogManager.getRootLogger()
}
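For context, the point of the @transient lazy val is that the object can be referenced from closures that run on executors: the logger is not shipped with the closure but re-created lazily on each JVM. A minimal sketch of executor-side usage (someRdd is a placeholder, not part of the original job):

// Runs on the executors; each executor JVM initializes its own logger
// the first time MyCustomLogger.log is touched.
someRdd.foreachPartition { partition =>
  // .size consumes the iterator; fine here since we only log
  MyCustomLogger.log.warn(s"processing ${partition.size} records on an executor")
}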
WordCount object
package com.vikas

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object WordCount extends Logging {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCountExample").getOrCreate()
    // val filePath = "src/main/resources/sample.txt"
    val filePath = args(0).trim
    log.warn("Vikas: reading files")
    // logWarning("reading files") // note: the Logging trait provides logWarning, not logWarn
    val lines = spark.sparkContext.textFile(filePath)
    MyCustomLogger.log.warn("transforming lines")
    val wordCounts = transform(lines)
    // println("foreach over collect will output in the same order as the text file")
    val currdate: String = DateTimeFormatter.ofPattern("yyyyMMdd_HHmm").format(LocalDateTime.now)
    val outputDir = args(1).trim + currdate
    MyCustomLogger.log.warn("writing output")
    wordCounts.coalesce(1).saveAsTextFile(outputDir)
    MyCustomLogger.log.warn("finished writing output")
    spark.stop()
  }

  def transform(rdd: RDD[String]): RDD[(String, Int)] = {
    rdd
      .flatMap(_.split("\\s"))           // Split on whitespace ("\s" alone is an invalid escape in a string literal)
      .map(_.replaceAll("[,.!?:;]", "")  // Remove punctuation
        .trim
        .toLowerCase)                    // and convert to lowercase
      .filter(!_.isEmpty)                // Drop empty tokens
      .map((_, 1))                       // Initialize word count pairs
      .reduceByKey(_ + _)                // Count words
      .sortByKey()                       // and sort the counts in lexical order
  }
}
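As an aside, the org.apache.spark.internal.Logging trait (a Spark-internal API, so subject to change) exposes helpers such as logInfo, logWarning, and logError; there is no logWarn, which is why the commented-out call above is named logWarning. A minimal illustrative sketch:

import org.apache.spark.internal.Logging

// Hypothetical object, just to show the trait's helper methods.
object TraitLoggingDemo extends Logging {
  def demo(): Unit = {
    logInfo("starting")              // by-name argument: the message is built only if INFO is enabled
    logWarning("something to flag")  // goes to the driver's Log4j output, not the event log
  }
}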
After running the code, I am not seeing my log output in the job logs on the History Server.
Can someone please help me understand what I am doing wrong here?
log4j2.properties file
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
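For reference, Log4j2 also lets you scope a logger to the application's package alongside the root logger; a sketch assuming the code lives under com.vikas (the logger name "app" is arbitrary):

# Illustrative: route the application package's loggers at WARN to the console
logger.app.name = com.vikas
logger.app.level = warn
logger.app.appenderRef.stdout.ref = console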
Log File from History Server
24/09/10 06:19:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/09/10 06:19:07 WARN DependencyUtils: Skip remote jar s3://vikasemrbucket/jars/SampleSparkProject-1.0.jar.
24/09/10 06:19:07 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-172-31-20-226.ap-southeast-2.compute.internal/172.31.20.226:8032
24/09/10 06:19:08 INFO Configuration: resource-types.xml not found
24/09/10 06:19:08 INFO ResourceUtils: Unable to find 'resource-types.xml'.
24/09/10 06:19:08 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
24/09/10 06:19:08 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
24/09/10 06:19:08 INFO Client: Setting up container launch context for our AM
24/09/10 06:19:08 INFO Client: Setting up the launch environment for our AM container
24/09/10 06:19:08 INFO Client: Preparing resources for our AM container
24/09/10 06:19:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
24/09/10 06:19:11 INFO Client: Uploading resource file:/mnt/tmp/spark-60dda2ce-2cc0-41b4-9800-9a8e679e3833/__spark_libs__429154216109170134.zip -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/__spark_libs__429154216109170134.zip
24/09/10 06:19:13 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
24/09/10 06:19:13 INFO Client: Uploading resource s3://vikasemrbucket/jars/SampleSparkProject-1.0.jar -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/SampleSparkProject-1.0.jar
24/09/10 06:19:14 INFO S3NativeFileSystem: Opening 's3://vikasemrbucket/jars/SampleSparkProject-1.0.jar' for reading
24/09/10 06:19:15 INFO Client: Uploading resource file:/etc/hudi/conf.dist/hudi-defaults.conf -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/hudi-defaults.conf
24/09/10 06:19:16 INFO Client: Uploading resource file:/mnt/tmp/spark-60dda2ce-2cc0-41b4-9800-9a8e679e3833/__spark_conf__6454346447521834005.zip -> hdfs://ip-172-31-20-226.ap-southeast-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1725949012477_0002/__spark_conf__.zip
24/09/10 06:19:16 INFO SecurityManager: Changing view acls to: hadoop
24/09/10 06:19:16 INFO SecurityManager: Changing modify acls to: hadoop
24/09/10 06:19:16 INFO SecurityManager: Changing view acls groups to:
24/09/10 06:19:16 INFO SecurityManager: Changing modify acls groups to:
24/09/10 06:19:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
24/09/10 06:19:16 INFO Client: Submitting application application_1725949012477_0002 to ResourceManager
24/09/10 06:19:16 INFO YarnClientImpl: Submitted application application_1725949012477_0002
24/09/10 06:19:17 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:17 INFO Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1725949156784
final status: UNDEFINED
tracking URL: http://ip-172-31-20-226.ap-southeast-2.compute.internal:20888/proxy/application_1725949012477_0002/
user: hadoop
24/09/10 06:19:18 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:19 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:20 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:21 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:22 INFO Client: Application report for application_1725949012477_0002 (state: ACCEPTED)
24/09/10 06:19:23 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:23 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: ip-172-31-29-193.ap-southeast-2.compute.internal
ApplicationMaster RPC port: 34877
queue: default
start time: 1725949156784
final status: UNDEFINED
tracking URL: http://ip-172-31-20-226.ap-southeast-2.compute.internal:20888/proxy/application_1725949012477_0002/
user: hadoop
24/09/10 06:19:24 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:25 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:26 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:27 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:28 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:29 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:30 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:31 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:32 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:33 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:34 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:35 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:36 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:37 INFO Client: Application report for application_1725949012477_0002 (state: RUNNING)
24/09/10 06:19:38 INFO Client: Application report for application_1725949012477_0002 (state: FINISHED)
24/09/10 06:19:38 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: ip-172-31-29-193.ap-southeast-2.compute.internal
ApplicationMaster RPC port: 34877
queue: default
start time: 1725949156784
final status: SUCCEEDED
tracking URL: http://ip-172-31-20-226.ap-southeast-2.compute.internal:20888/proxy/application_1725949012477_0002/
user: hadoop
24/09/10 06:19:38 INFO ShutdownHookManager: Shutdown hook called
24/09/10 06:19:38 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-fd81d76f-bb66-490e-b4be-83149cb4ea4a
24/09/10 06:19:38 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-60dda2ce-2cc0-41b4-9800-9a8e679e3833
Command exiting with ret '0'
Leaving this as an answer in case someone runs into the same problem: adding my log4j2.properties file to spark.driver.extraClassPath and spark.executor.extraClassPath did the trick. The logs do not go into the event logs (those only record structured Spark events for the History Server UI) but into the pod logs.
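For completeness, a sketch of how those settings can be passed at submit time; this assumes YARN cluster mode, and the input/output paths are placeholders (the jar location is taken from the log above). --files ships log4j2.properties into each container's working directory, and putting ./ on the extra classpath lets Log4j2 discover it there:

spark-submit \
  --deploy-mode cluster \
  --class com.vikas.WordCount \
  --files log4j2.properties \
  --conf spark.driver.extraClassPath=./ \
  --conf spark.executor.extraClassPath=./ \
  s3://vikasemrbucket/jars/SampleSparkProject-1.0.jar \
  s3://your-bucket/input.txt s3://your-bucket/output/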