I am currently trying to read and display a file from the Databricks File System (DBFS), but I encountered an issue. Here is the code I was using:
file_path = "/dbfs/cluster-logs/use_case/default_job_cluster/cluster_id/init_scripts/cluster_id/20240801_proxy-init.sh.stderr.log"

with open(file_path, 'r') as file:
    contents = file.read()
print(contents)
However, interestingly, I got the following error:
bash: line 11: /Volumes/landing/default/artifacts/projects/use_case/databricks/scripts/proxy-init.sh: No such file or directory
As you can see, the path in the error does not match the path I passed in.
In the end, I was able to correctly read and display the log file content with the following code:
file_path = "/dbfs/cluster-logs/use_case/default_job_cluster/cluster_id/init_scripts/cluster_id/20240801_proxy-init.sh.stderr.log"

from pyspark.sql import functions as F
from pyspark.sql.functions import collect_list

if dbutils.fs.ls(file_path):
    file_df_to_check = spark.read.text(file_path).agg(collect_list("value").alias("all_lines"))
    display(file_df_to_check)
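For reference, this is the kind of non-Spark approach I would have expected to also work. This is only a sketch: I am assuming dbutils.fs.head here, and I am assuming the dbfs:/ form of the same path, since dbutils.fs addresses files by the dbfs:/ scheme rather than the /dbfs FUSE mount:

# Sketch only: dbutils.fs expects the dbfs:/ URI, not the /dbfs/ FUSE path,
# so I assume the same log file would be addressed like this
uri_path = "dbfs:/cluster-logs/use_case/default_job_cluster/cluster_id/init_scripts/cluster_id/20240801_proxy-init.sh.stderr.log"

# Print the first 64 KB of the log without going through Spark
print(dbutils.fs.head(uri_path, 65536))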
Questions:
- Why does the first code snippet produce an error referring to the volume path?
- What does it mean in the documentation that DBFS provides a scheme for volumes? Shouldn't the first snippet work then?
- Why can the file only be read using Spark and not with the standard Python open() function?
Thank you for your assistance.