We found that Hive cannot concatenate some ORC files that are generated by Spark 3.2.1 and higher and that contain long strings.
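By "concatenate" I mean Hive's file merge command; a minimal sketch of the failing operation (test_orc is a hypothetical table name pointing at the ORC files below):

ALTER TABLE test_orc CONCATENATE;

The affected files can be reproduced as follows: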
val valid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1024, 'A') as string;")
val invalid = spark.sql("SELECT 1 as id, cast(NULL as string) as null, lpad('A', 1025, 'A') as string;")
valid.withColumn("len", length($"string")).show()

+---+----+--------------------+----+
| id|null|              string| len|
+---+----+--------------------+----+
|  1|null|AAAAAAAAAAAAAAAAA...|1024|
+---+----+--------------------+----+

invalid.withColumn("len", length($"string")).show()

+---+----+--------------------+----+
| id|null|              string| len|
+---+----+--------------------+----+
|  1|null|AAAAAAAAAAAAAAAAA...|1025|
+---+----+--------------------+----+
Then I write the invalid DataFrame to S3 (the dump below corresponds to this write):

invalid.write.format("orc")
.option("path", "s3://bucket/test/test_orc/")
.option("compression", "zlib")
.mode("overwrite")
.save()
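To rule out data corruption, the written files can be read back and the length re-checked (a quick sanity check, assuming the same output path):

// Read the written ORC files back and recompute the string length
spark.read.orc("s3://bucket/test/test_orc/")
  .withColumn("len", length($"string"))
  .show()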
When I check the statistics of the generated files with the command hive --orcfiledump s3://bucket/test/test_orc/,
I can see incorrect statistics for the string column: min and max are reported as null even though hasNull is false and the only value is a non-null 1025-character string:
Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025
The full output of hive --orcfiledump:

Processing data file s3://bucket/test/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc [length: 488]
Structure for s3://bucket/test/test_orc/part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc
File Version: 0.12 with FUTURE
Rows: 1
Compression: ZLIB
Compression size: 262144
Type: struct<id:int,null:string,string:string>
Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
    Column 2: count: 0 hasNull: true bytesOnDisk: 5
    Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025
File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 1 max: 1 sum: 1
  Column 2: count: 0 hasNull: true bytesOnDisk: 5
  Column 3: count: 1 hasNull: false bytesOnDisk: 23 min: null max: null sum: 1025
Stripes:
  Stripe: offset: 3 data: 34 rows: 1 tail: 66 index: 108
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 2 section ROW_INDEX start: 38 length 19
    Stream: column 3 section ROW_INDEX start: 57 length 54
    Stream: column 1 section DATA start: 111 length 6
    Stream: column 2 section PRESENT start: 117 length 5
    Stream: column 2 section DATA start: 122 length 0
    Stream: column 2 section LENGTH start: 122 length 0
    Stream: column 2 section DICTIONARY_DATA start: 122 length 0
    Stream: column 3 section DATA start: 122 length 16
    Stream: column 3 section LENGTH start: 138 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DICTIONARY_V2[0]
    Encoding column 3: DIRECT_V2
File length: 488 bytes
Padding length: 0 bytes
Padding ratio: 0%
User Metadata: org.apache.spark.version=3.4.1
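For reference, the same metadata can be dumped without Hive using the standalone ORC tools jar (the version number below is only an example, and the part file is assumed to have been copied locally first):

java -jar orc-tools-1.8.3-uber.jar meta part-00000-ec01de8f-8f6b-4937-b107-e88f5a5d2d67-c000.zlib.orc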
For a DataFrame whose strings are at most 1024 characters long, the statistics are correct.
I tried different Spark properties (spark.sql.orc.impl, spark.sql.hive.convertMetastoreOrc, etc.), but none of them changed the result.
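In case it matters, this is roughly how they were toggled (a minimal sketch; the values shown are the documented alternatives to the defaults):

// Switch between the native and Hive ORC implementations
spark.conf.set("spark.sql.orc.impl", "hive") // default is "native"
// Disable conversion of Hive metastore ORC tables to Spark's native reader
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false") // default is "true"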
Am I missing something?