I recently built the Zig compiler from source using the provided build
script:
./build x86_64-linux-gnu broadwell
and am wondering about the extreme inflation of the total size of the files in the source directory after the executable has been built.
A 30-fold increase in size suggests that each piece of information in the source files was “translated” into some other form 29 times during the process.
Let’s analyse the process of building an executable from source. In which build step does the input get repeated so extensively that it explains why, after the build, the directory with the source code and all other files, including the resulting executable, is 30 times larger?
What I am aware of is this: we have a source code file. It gets preprocessed and the result is saved. This doubles the amount of data, because the content now exists twice: once as source and once as the preprocessed file. The next step is tokenization: the file is translated into indexed and marked tokens. This adds a bit to the original content, because the marks give the tokens context and meaning. Now we arrive at 3.5 times the size of the source files. Then the tokens are translated into CPU instructions, which gives a total of 4.5 times the original code size in different forms, i.e. “languages”. But this is far from enough to explain a factor of 30, so there must be over 20 other forms/languages into which the source code is translated or duplicated. Which ones?
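The running total in this mental model can be written out as a short calculation. This is only a sketch of the estimate above: the per-stage factors are the rough guesses from the paragraph (with the token stream taken as 1.5 times the source size), not measurements, and the names `stages` and `cumulative_factors` are made up for illustration.

```python
# Rough size of each stage's output relative to the source size.
# These are the guesses from the paragraph above, not measured values.
stages = [
    ("source", 1.0),
    ("preprocessed file", 1.0),
    ("token stream", 1.5),
    ("CPU instructions", 1.0),
]


def cumulative_factors(stages):
    """Return (stage name, running total) pairs, in stage order."""
    total = 0.0
    result = []
    for name, factor in stages:
        total += factor
        result.append((name, total))
    return result


for name, total in cumulative_factors(stages):
    print(f"after {name}: {total:.1f}x the source size")

observed_factor = 30.0  # the overall growth I am asking about
explained_factor = cumulative_factors(stages)[-1][1]
unexplained_factor = observed_factor - explained_factor
print(f"not explained by this model: {unexplained_factor:.1f}x")
```

The gap between the model's total and the observed factor is what the question is about.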
Let’s consider the process to be a sequence of translations from one language to another, where the format understood by the machine is the machine “language”, i.e. the language spoken by the CPU. The source code file gets preprocessed and translated into a language the tokenizer understands. The tokenizer translates it into a language understood by the part that turns tokens into CPU instructions. It is a chain of translations, each one into the language understood by the next tool. It is like stating something in Japanese when no translator from Japanese to Russian is available: the text has to be translated to English first and from there to Russian, which triples the size instead of doubling it.
In other words, the genius able to translate source code directly into machine code is not available, but some specialized intermediate translators are, so they are chained to arrive at the result. With a 30-fold duplication of the code there must be more than 20 intermediate steps I am not aware of. Which ones?
Here are some numbers:
- 45 MByte : zig-bootstrap-0.12.0.tar.xz
- 370 MByte : [zig-bootstrap-0.12.0] directory extracted from the archive
- 10.7 GByte : size of [zig-bootstrap-0.12.0] after the build has finished
- 158 MByte : size of the built executable file [zig-bootstrap-0.12.0]/out/zig-x86_64-linux-gnu-broadwell/zig
In other words, the original 45 MByte of information turns into 10.7 GByte by duplicating this information into other forms/languages. Isn’t that extreme? How does this come about?
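To make the factors explicit, here is a small sketch that computes the ratios between the sizes listed above. It assumes 1 GByte = 1024 MByte (the figures may have been reported in other units), and the variable names are made up for illustration.

```python
# Sizes from the list above, in MByte.
MBYTE_PER_GBYTE = 1024  # assumption: binary units

archive_mb = 45
extracted_mb = 370
after_build_mb = 10.7 * MBYTE_PER_GBYTE
executable_mb = 158

ratios = {
    "extracted / archive": extracted_mb / archive_mb,
    "after build / extracted": after_build_mb / extracted_mb,
    "after build / archive": after_build_mb / archive_mb,
    "after build / executable": after_build_mb / executable_mb,
}

for label, value in ratios.items():
    print(f"{label}: {value:.1f}x")
```

The "after build / extracted" ratio is the factor of about 30 I keep referring to. Measured against the compressed archive, the growth is much larger still.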
~ $ pwd
/home/o/oOo/@@/zigPrgLang/zig-bootstrap-0.12.0
~ $ du --max-depth=7 --bytes | sort -n --reverse | head -n 21
11534816050 .
9798353789 ./out
3641046291 ./out/build-llvm-host
2396029965 ./out/host
1946493794 ./out/build-llvm-x86_64-linux-gnu-broadwell
1845856812 ./out/build-llvm-host/bin
1707984748 ./out/host/bin
1518799725 ./zig
1413107665 ./out/build-llvm-host/lib
1378744641 ./out/build-llvm-x86_64-linux-gnu-broadwell/lib
1330181057 ./zig/zig-cache
1193590140 ./zig/zig-cache/o
1020543936 ./out/build-zig-host
654602936 ./out/build-llvm-x86_64-linux-gnu-broadwell/lib/Target
647543403 ./out/build-llvm-host/lib/Target
642163088 ./out/host/lib
481527064 ./out/x86_64-linux-gnu-broadwell
435528487 ./out/x86_64-linux-gnu-broadwell/lib
353705094 ./out/build-llvm-host/tools
333373368 ./out/build-zig-host/stage3
311775221 ./out/zig-x86_64-linux-gnu-broadwell
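To see where the bytes end up, the listing above can be turned into percentages. The sketch below hard-codes the first- and second-level entries copied from the `du` output. The helper names `parse_du` and `shares` are made up for illustration.

```python
# First- and second-level entries copied from the du output above.
DU_OUTPUT = """\
11534816050 .
9798353789 ./out
3641046291 ./out/build-llvm-host
2396029965 ./out/host
1946493794 ./out/build-llvm-x86_64-linux-gnu-broadwell
1518799725 ./zig
1020543936 ./out/build-zig-host
481527064 ./out/x86_64-linux-gnu-broadwell
311775221 ./out/zig-x86_64-linux-gnu-broadwell
"""


def parse_du(text):
    """Map each path in the du listing to its size in bytes."""
    sizes = {}
    for line in text.splitlines():
        size, path = line.split(maxsplit=1)
        sizes[path] = int(size)
    return sizes


def shares(sizes, parent="."):
    """Percentage of the parent's size taken by each direct child in the listing."""
    prefix = "./" if parent == "." else parent + "/"
    result = {}
    for path, size in sizes.items():
        if path == parent or not path.startswith(prefix):
            continue
        if "/" in path[len(prefix):]:
            continue  # skip deeper levels
        result[path] = 100 * size / sizes[parent]
    return result


sizes = parse_du(DU_OUTPUT)
for parent in (".", "./out"):
    print(f"share of {parent}:")
    for path, percent in sorted(shares(sizes, parent).items(), key=lambda kv: -kv[1]):
        print(f"  {percent:5.1f}%  {path}")
```

Going by the directory names alone, the largest entries under ./out are the LLVM build directories and the host installation, not anything derived one-to-one from the Zig sources.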