On my HDFS cluster, the block size is 512 MB.
I have an offline flow which generates data daily and outputs the data to HDFS.
I noticed that it generated many small files (several KB each), so I updated the flow so that it combines the small files before writing them to HDFS.
After the change, I did see the small files (several KB each) combined into bigger files (several GB each) on HDFS.
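For context, the combining step is conceptually similar to the sketch below. This is only an illustration assuming a PySpark-based flow; the paths, app name, and partition count are placeholders rather than my real job.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("combine-small-files").getOrCreate()

# Placeholder paths; the real flow reads from and writes to other locations.
input_path = "/jobs/myflow/raw"
output_path = "/jobs/myflow/combined"

df = spark.read.parquet(input_path)

# coalesce() merges the existing partitions into fewer, larger ones, so the job
# writes a handful of multi-GB files instead of thousands of KB-sized files.
df.coalesce(8).write.mode("overwrite").parquet(output_path)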
Before applying the change, I ran the original flow once as a test.
Before the test:
$ hdfs dfs -count -q -h -v /jobs/myflow
 QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
48.8 K     25.8 K         20 T            9.4 T        122      22.9 K         3.5 T  /jobs/myflow
After the test:
$ hdfs dfs -count -q -h -v /jobs/myflow
 QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
48.8 K     23.3 K         20 T            7.9 T        130      25.4 K         4.0 T  /jobs/myflow
Then I applied the change and ran the updated flow as a test.
Before the test:
$ hdfs dfs -count -q -h -v /jobs/myflow
 QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
48.8 K     29.3 K         20 T            7.8 T        130      19.4 K         4.1 T  /jobs/myflow
After the test:
$ hdfs dfs -count -q -h -v /jobs/myflow
 QUOTA  REM_QUOTA  SPACE_QUOTA  REM_SPACE_QUOTA  DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
48.8 K     28.9 K         20 T            6.3 T        138      19.7 K         4.6 T  /jobs/myflow
Judging by the drop in REM_QUOTA, the original flow created about 2.5 K files per run (25.8 K - 23.3 K), while the updated flow created only about 0.4 K (29.3 K - 28.9 K).
Therefore, from my understanding, the storage consumption should have become smaller after the change, because bigger files occupy fewer blocks (again, 512 MB per block) on HDFS.
However, the storage consumption, measured as the drop in REM_SPACE_QUOTA, stayed exactly the same at 1.5 T per run: 9.4 T - 7.9 T = 1.5 T before the change, and 7.8 T - 6.3 T = 1.5 T after it.
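For clarity, here is the arithmetic behind those numbers as a small sketch; the values are copied straight from the REM_QUOTA and REM_SPACE_QUOTA columns of the outputs above.

# Files created per run = drop in the remaining name quota (in thousands).
files_original = 25.8 - 23.3   # ~2.5 K per run with the original flow
files_updated  = 29.3 - 28.9   # ~0.4 K per run with the updated flow

# Raw space consumed per run = drop in the remaining space quota (in TB).
space_original = 9.4 - 7.9     # ~1.5 T per run with the original flow
space_updated  = 7.8 - 6.3     # ~1.5 T per run with the updated flow

print(files_original, files_updated)   # roughly 2.5 and 0.4
print(space_original, space_updated)   # roughly 1.5 and 1.5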
Can someone please help me understand it?