I developed a streaming application using Spark Structured Streaming v3.1.2.
When I run the Spark application in YARN cluster mode, it fails with an exception while trying to access the HDFS root directory.
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=app-stream, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:255)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkDefaultEnforcer(RangerHdfsAuthorizer.java:589)
at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:350)
I looked into what raised the exception and found that Spark tries to create a temporary checkpoint location when the application starts:
2024-07-04 13:27:33,283 WARN streaming.StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-39eda0e7-a3a4-4a17-8810-6c3a8c136be2. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
2024-07-04 13:27:33,415 WARN streaming.StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-57b737e7-382d-4035-8f20-dcf19e4280ce. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
So, when the app runs in client mode it creates the temporary checkpoint under the local /tmp directory, but in cluster mode it tries to create it under the HDFS root directory (hdfs://).
I tried every method I found on the web, such as setting checkpointDir, but AFAIK the temporary checkpoint location is not something I can control; it is handled by Spark internal code (https://issues.apache.org/jira/browse/SPARK-26825).
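For context, this is the kind of thing I tried (a minimal sketch only; the app name, HDFS paths, and the rate/console source and sink are placeholders, not my actual job):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("app-stream") // placeholder app name
  .getOrCreate()

// RDD-level checkpoint directory (does not control the streaming temporary checkpoint).
spark.sparkContext.setCheckpointDir("hdfs:///user/app-stream/rdd-checkpoints")

// Default root for Structured Streaming checkpoints.
spark.conf.set("spark.sql.streaming.checkpointLocation", "hdfs:///user/app-stream/checkpoints")

// Dummy query: rate source and console sink stand in for the real pipeline.
val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

df.writeStream
  .format("console")
  // Explicit per-query checkpoint location.
  .option("checkpointLocation", "hdfs:///user/app-stream/checkpoints/app1")
  .start()
  .awaitTermination()
```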
I verified that the error does not occur as long as no temporary checkpoint is created.
My app1 in caseA (no join with a static DataFrame) does not create a temporary checkpoint and runs without errors.
My app2 in caseB (join with a static DataFrame backed by a Delta table, refreshed every 6 hours) tries to create a temporary checkpoint and fails with the error above (a rough sketch of this setup follows).
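Here is a rough, simplified sketch of caseB, not the real job: the Delta table path and join columns are placeholders, a rate source stands in for the real input, and the 6-hourly refresh logic of the static side is omitted.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("app2-caseB").getOrCreate()

// Static side: a Delta table that the real job re-reads every 6 hours (refresh logic omitted).
val staticDf = spark.read.format("delta").load("hdfs:///data/dim/static_table")

// Streaming side (placeholder source).
val streamDf = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Stream-static join; in my setup this is the variant that produces the
// "Temporary checkpoint location created" warnings shown above.
val joined = streamDf.join(staticDf, streamDf("value") === staticDf("id"), "left")

joined.writeStream
  .format("console")
  .start()
  .awaitTermination()
```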
So, my questions are:
- Is there a way to stop Spark from trying to create a temporary checkpoint?
- When does Spark Structured Streaming try to create a temporary checkpoint?