We all know .gz is non-splittable, which means only a single core can read it. So when I place a huge .gz file on HDFS, I would expect it to be stored as a single block. Yet I see it getting split into 128 MB blocks. How can it be splittable in HDFS but not in Spark?
From the HDFS perspective, non-splittable means the file cannot be *processed* in parallel; it does not mean a large file cannot span more than one block. HDFS chops every file into fixed-size blocks (128 MB by default) purely as a storage mechanism, regardless of format. A gzip stream, however, must be decompressed sequentially from the beginning, so a reader like Spark cannot start a task at an arbitrary block boundary and is forced to pull all of those blocks through a single task.
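You can see the effect directly in Spark by comparing partition counts. Below is a minimal sketch; the HDFS paths `hdfs:///data/huge.gz` and `hdfs:///data/huge.txt` are hypothetical stand-ins for a compressed file and an uncompressed copy of the same data:

    import org.apache.spark.sql.SparkSession

    object GzipPartitions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("gzip-partitions")
          .getOrCreate()

        // gzip is a non-splittable codec: the whole file becomes one
        // partition, i.e. one core does all the decompression, even
        // though the file occupies many 128 MB HDFS blocks.
        val gz = spark.sparkContext.textFile("hdfs:///data/huge.gz")
        println(s"gzip partitions: ${gz.getNumPartitions}")   // typically 1

        // The same data stored uncompressed gets roughly one partition
        // per HDFS block, so the blocks can be processed in parallel.
        val plain = spark.sparkContext.textFile("hdfs:///data/huge.txt")
        println(s"plain-text partitions: ${plain.getNumPartitions}")

        spark.stop()
      }
    }

Independently of Spark, `hdfs fsck /data/huge.gz -files -blocks` will list every block the file occupies, confirming that HDFS happily spreads a non-splittable file across many blocks.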