I’m trying to find a way to decompress a column in pyspark from a SQL Server Table exported in parquet.
What i have:
1. SQL Server Table with an xml compressed field
This is the original column type: XML(.)
This is the compressed column type: varbinary(max)
the expression to compress the column is COMPRESS(CAST([unzipped] AS VARCHAR(MAX)))
an extra cast as VARCHAR is needed because as the documentation says XML is not supported as expression for compression
The compression algorithm is gzip – documentation
Tipically to extract the original data this expression works perfectly fine: CAST(DECOMPRESS([zipped]) AS XML)
2. The .parquet extraction of the table made using the “Data Copy” activity of Azure Data Factory
The mapping table – parquet maps the data type of the column like this:
Sql Server Table type | partquet type |
---|---|
varbinary |
BINARY |
The parquet is saved on the datalake and accessed via Databricks
3. The dataframe column type as read by spark
StructField('column_name', BinaryType(), True)
What i would like to achive:
The goal is to obtain a table where the compressed field is correctly decompressed as xml.
e.g.
Source table
Fields 1 to N | Compressed_field |
---|---|
“Classic” types | BINARY |
Target table
Fields 1 to N | Uncompressed_field |
---|---|
“Classic” types | String or XML (if supported) |
My Tentative
import gzip
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# limit on small portion just to test fast, table will be way bigger
parquetFile = spark.read.parquet("path_to_file/file.parquet").limit(10)
# Fast python test to check the algorithm suggest bi the SQL Server documentation is correct (Result is ok)
df = parquetFile.toPandas()
str(gzip.decompress(df['zipped'][0]))
# Now with the whole dataframe
def ungzip(buf):
result = gzip.decompress(buf)
return result
ungzip_udf = udf(ungzip, StringType())
parquetFile = parquetFile.withColumn("unzipped", ungzip_udf("zipped"))
My results
The fast python experiments works fine:
But when I use the UDF my the table display seems wrong
what am I missing? Text encoding?