I have a lazy val named studentDataReader defined in my code, which eventually reads data from an S3 path. My understanding is that even if I access it multiple times, it should hit S3 only once.
"The compiler does not immediately evaluate the bound expression of a lazy val. It evaluates the variable only on its first access. Upon initial access, the compiler evaluates the expression and stores the result in the lazy val. Whenever we access this val at a later stage, no execution happens, and the compiler returns the result." Source: https://www.baeldung.com/scala/lazy-val
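A plain lazy val does behave that way; for example:

lazy val answer: Int = { println("evaluating"); 42 }

println(answer) // prints "evaluating", then 42
println(answer) // prints 42 only; the initializer does not run again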
However, that is not what I observe: every invocation triggers a fresh execution, i.e. another read from S3 (confirmed in the Spark UI).
Below is skeleton code (Spark 2.4.5, Scala 2.1). The Reader used is a custom trait, not shown in full for simplicity; roughly, it is a function wrapper along these lines (an illustrative sketch, not my exact definition):
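// Illustrative stand-in for my custom Reader: wraps a function A => B
// and supports map/flatMap so it works in for-comprehensions.
case class Reader[A, B](run: A => B) {
  def map[C](f: B => C): Reader[A, C] = Reader(a => f(run(a)))
  def flatMap[C](f: B => Reader[A, C]): Reader[A, C] =
    Reader(a => f(run(a)).run(a))
}

The skeleton: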
trait CommonProcessor {
  val spark: SparkSession = SparkSession.getActiveSession.get
  import spark.implicits._

  lazy val studentDataReader: Reader[Setting, DataFrame] =
    Reader { config =>
      getStudentData(config)
    }

  def getStudentData(setting: Setting): DataFrame = {
    // Behind the scenes getInput calls spark.read.format("delta")
    val studentData = getInput(setting, deltaPathToS3)
    // cache marks the DataFrame for caching and returns the same DataFrame
    studentData.cache
  }
}
trait StudentDataProcessor { self: CommonProcessor =>
  import spark.implicits._

  lazy val getResult: Reader[Setting, DataFrame] =
    for {
      df <- getFinalData
    } yield dropDuplicates(df)

  lazy val getFinalData: Reader[Setting, DataFrame] =
    for {
      studentData        <- studentDataReader
      studentTeacherData <- studentTeacherDataReader
    } yield doSomeProcess(studentData, studentTeacherData)

  lazy val studentTeacherDataReader: Reader[Setting, DataFrame] =
    for {
      s <- studentDataReader
      f <- teacherDataReader
    } yield mergeData(s, f)
}
object StudentData extends StudentDataProcessor with CommonProcessor {
  def process(setting: Setting): DataFrame = {
    // error handling elided
    val result = getResult.run(setting)
    result
  }
}
StudentData.process(setting)
As you can see, studentDataReader is consumed twice: once via studentData <- studentDataReader and once via s <- studentDataReader. For each of these, Spark reads from S3 again.
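A minimal reproduction of the behaviour I am seeing, using the sketched Reader above (the counter is purely illustrative):

var calls = 0
lazy val countingReader: Reader[Int, Int] =
  Reader { n => calls += 1; n * 2 }

countingReader.run(1)
countingReader.run(1)
println(calls) // prints 2: the lazy val memoizes the Reader value itself,
               // but the wrapped function still runs on every .run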
What can I do here so that studentDataReader returns a cached result instead of going through the whole execution every time?
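For illustration only, the behaviour I am after would be something like this hypothetical memoizing wrapper over the sketched Reader (assuming Setting is a case class with value-based equality):

import scala.collection.concurrent.TrieMap

// Hypothetical: cache the result of run per input, so repeated runs with
// the same Setting reuse the first result instead of re-reading from S3.
def memoize[A, B](reader: Reader[A, B]): Reader[A, B] = {
  val cache = TrieMap.empty[A, B]
  Reader(a => cache.getOrElseUpdate(a, reader.run(a)))
}

// e.g. lazy val studentDataReader = memoize(Reader(getStudentData _))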