I tried the following to determine the compression format (the file is in a managed volume in a Databricks catalog):
- Find the file format:
<code>%sh
file -d path/*     # output: [try json 1]
gzip -d path/*.gz  # output: not in gzip format
</code>
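Since both file and gzip suggest the payload is plain JSON, I also checked the leading bytes directly: a real gzip stream always starts with the magic bytes 0x1f 0x8b. A minimal sketch of that check (the volume path is a placeholder for mine):
<code>import glob

# A genuine gzip stream always begins with the magic bytes 0x1f 0x8b;
# plain-text JSON usually starts with '{' (0x7b) or '[' (0x5b).
for p in glob.glob("/Volumes/my_catalog/my_schema/my_volume/path/*.gz"):  # placeholder path
    with open(p, "rb") as f:
        head = f.read(2)
    print(p, head.hex(), "gzip" if head == b"\x1f\x8b" else "NOT gzip")
</code>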
So this is a JSON file with a .gz extension, and I am reading it with Databricks Auto Loader. Code below:
<code>spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("header", "False")
    .option("multiline", "true")
</code>
Even though this is not actually a gzip-compressed file, the read fails with java.io.IOException: incorrect header check.
Below is the error stack:
<code>at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:227)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:539)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:133)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:256)
at com.fasterxml.jackson.core.JsonFactory._createParser(JsonFactory.java:1744)
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:1143)
at org.apache.spark.sql.catalyst.json.CreateJacksonParser$.inputStream(CreateJacksonParser.scala:83)
at org.apache.spark.sql.catalyst.json.CreateJacksonParser$.$anonfun$forInputStream$5(CreateJacksonParser.scala:133)
at org.apache.spark.sql.catalyst.json.CreateJacksonParser$.$anonfun$forDataStream$1(CreateJacksonParser.scala:142)
at org.apache.spark.sql.catalyst.json.JsonInferSchema.$anonfun$infer$1(JsonInferSchema.scala:85)
at org.apache.spark.util.SparkErrorUtils.tryWithResource(SparkErrorUtils.scala:47)
</code>
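From the stack trace, the read goes through Hadoop's ZlibDecompressor, and as far as I understand Hadoop selects the decompression codec purely from the file extension, not from the file contents, so any file named *.gz is pushed through gzip decompression regardless of what it contains. One workaround I am considering is copying the files to names without the .gz suffix before pointing Auto Loader at them; a rough sketch (both paths are placeholders):
<code># Copy each mis-named *.gz file to a name without the suffix, so Hadoop
# does not pick a gzip codec from the extension (paths are placeholders).
src_dir = "/Volumes/my_catalog/my_schema/my_volume/path/"
dst_dir = "/Volumes/my_catalog/my_schema/my_volume/path_plain/"
for info in dbutils.fs.ls(src_dir):
    if info.name.endswith(".gz"):
        dbutils.fs.cp(info.path, dst_dir + info.name[:-3])  # strip ".gz"
</code>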
I have looked at other questions about this error on Stack Overflow, and they all point to the compression format.
Any help?