I’m using Apache Spark to write Parquet files with Snappy compression enabled. The Parquet schema is quite large: 300+ columns holding numbers, strings, and raw bytes.
All output files on HDFS have the snappy.parquet suffix.
I ran hdfs dfs -cat hdfs://cluster/dir/file.snappy.parquet
and I can see raw strings in the output; some strings are even repeated multiple times in a row. Is this by design, or is something wrong with my compression settings?
These strings are column values, not headers or any other metadata.