I have files in each directory inside an Azure storage account (devcome) which has the container (inputdata), for example:
devcome
  inputdata
    abc
      01
        Module191.json
        Module192.json
      02
        Module191.json
        Module192.json
    def
      03
        Module191.json
        Module192.json
      04
        Module191.json
        Module192.json
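So, if I have the URL right, the full paths should look something like this (illustrative examples only, built from the account and container names in the load call below):

abfss://[email protected]/abc/01/Module191.json
abfss://[email protected]/def/03/Module192.json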
I am supposed to find the size of each file and add it to the dataframe I am reading. I am using the script below:
from pyspark.sql.functions import input_file_name, expr

def get_json_file_size(json_data):
    import json
    json_bytes = [json.dumps(row).encode('utf-8') for row in json_data]
    return sum(len(row) for row in json_bytes)

# register the function so it can be called from expr()
spark.udf.register("get_json_file_size_udf", get_json_file_size)

df = (
    spark.read.format("json")
    .load("abfss://[email protected]/{abc,def}/*/*/*/")
    .withColumn("file_name", input_file_name())
    .withColumn("file_data_size", expr("get_json_file_size_udf(file_name)"))
)
I am unsure where I am getting it wrong, since the size it returns does not match the actual size of the files. Can I have some advice?