I have multiple questions about how Spark handles data internally.
First:
I have heard from several sources I can no longer find (Medium, YouTube, Data with Zack) that Spark takes Parquet encoding into account (in my case PySpark),
but I cannot find any piece of official documentation that confirms it.
In particular, when I cache the data it grows from about 300 MB to 2 GB, so it does not look like any encoding or compression is kept.
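For context, here is roughly what I do (the Parquet path is just a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # ~300 MB of Parquet on disk (placeholder path)
    df = spark.read.parquet("/data/events.parquet")

    df.cache()
    df.count()  # materialize the cache

    # The Storage tab of the Spark UI then reports roughly 2 GB in memory,
    # so the Parquet encoding/compression does not appear to survive caching.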
Second topic:
About caching, here is what the official PySpark documentation says:
--> It says that in Python, records are always serialized (as quoted below).
    """.. note:: The following four storage level constants are deprecated in 2.0,
    since the records will always be serialized in Python."""

    StorageLevel.MEMORY_ONLY_SER = StorageLevel.MEMORY_ONLY
    """.. note:: Deprecated in 2.0, use StorageLevel.MEMORY_ONLY instead."""

    StorageLevel.MEMORY_ONLY_SER_2 = StorageLevel.MEMORY_ONLY_2
    """.. note:: Deprecated in 2.0, use StorageLevel.MEMORY_ONLY_2 instead."""

    StorageLevel.MEMORY_AND_DISK_SER = StorageLevel.MEMORY_AND_DISK
    """.. note:: Deprecated in 2.0, use StorageLevel.MEMORY_AND_DISK instead."""

    StorageLevel.MEMORY_AND_DISK_SER_2 = StorageLevel.MEMORY_AND_DISK_2
    """.. note:: Deprecated in 2.0, use StorageLevel.MEMORY_AND_DISK_2 instead."""
--> However, in the Spark UI, the cached DataFrames are always marked as "Deserialized".
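To make the comparison concrete, here is a minimal sketch of the kind of check I mean (dummy DataFrame, nothing from my real job):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000)  # dummy DataFrame, just for illustration

    df.cache()
    df.count()  # materialize the cache

    # Storage level actually applied to the cached DataFrame; this should
    # match the "Deserialized" label shown in the Spark UI Storage tab
    print(df.storageLevel)

    # The Python constant itself, by contrast, is defined with
    # deserialized=False, consistent with the "records are always
    # serialized in Python" note quoted above
    print(StorageLevel.MEMORY_AND_DISK)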
Does anyone know why?
Thanks for your help.