C is one of the most widely used languages in the world. It accounts for a huge proportion of existing code and continues to be used for a vast amount of new code. It’s beloved by its users, it’s so widely ported that being able to run C is, to many, the informal definition of a platform, and it is praised by its fans for being a “small” language with a relatively clean set of features.
So where are all the compilers?
On the desktop, there are (realistically) two: GCC and Clang. Thinking about it for a few seconds, you’ll probably remember that Intel exists as well. There are a handful of others, far too obscure for the average person to name, and almost universally not bothering to support a recent language version (or often even a well-defined language subset, just “a subset”). Half of the members of this list are historical footnotes; most of the rest are very specialized and still don’t actually implement the full language. Very few actually seem to be open source.
Scheme and Forth – other small languages beloved by their fans for exactly that smallness – probably have more compilers than actual users. Even something like SML has more “serious” implementations to choose between than C. Meanwhile, the announcement of a new (unfinished) C compiler aiming at verification actually draws some pretty negative responses, and veteran implementations struggle to get enough contributors even to catch up to C99.
Why? Is implementing C so hard? It isn’t C++. Do users simply have a very skewed idea of which complexity group it falls into (i.e. that it is actually closer to C++ than to Scheme)?
21
Today, a real C compiler needs to be an optimizing compiler, notably because C is no longer a language close to the hardware: current processors are incredibly complex (out-of-order, pipelined, superscalar, with complex caches and TLBs, hence needing instruction scheduling, etc.). Today’s x86 processors are not like the i386 processors of the previous century, even if both are able to run the same machine code. See David Chisnall’s paper “C Is Not a Low-Level Language (Your Computer Is Not a Fast PDP-11)”.
Few people use naive non-optimizing C compilers like tinycc or nwcc, since they produce code that is several times slower than what optimizing compilers can give.
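For a rough sense of where that gap comes from, consider a plain reduction loop (a sketch only; the actual speedup depends on the compiler, the flags and the target CPU):

```c
#include <stddef.h>

/* The kind of loop where an optimizing compiler earns its keep:
 * at -O2/-O3, GCC and Clang will typically unroll and auto-vectorize
 * it, while a naive code generator emits one scalar load, multiply
 * and add per iteration, often several times slower on modern CPUs. */
long dot(const int *a, const int *b, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (long)a[i] * b[i];
    return sum;
}
```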
Coding an optimizing compiler is difficult. Notice that both GCC and Clang optimize a “source-language-neutral” code representation (GIMPLE for GCC, LLVM IR for Clang). The complexity of a good C compiler is not in the parsing phase!
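You can look at that shared representation yourself; assuming a reasonably recent GCC and Clang, something like the following dumps the intermediate code the optimizers actually work on:

```c
/* square.c -- a trivial input for inspecting the intermediate code.
 *
 *   gcc   -O2 -c -fdump-tree-gimple square.c   # writes a *.gimple dump file
 *   clang -O2 -S -emit-llvm square.c           # writes square.ll (LLVM IR)
 *
 * Both dumps are already language-neutral: nothing in them says the
 * source was C rather than C++ or Fortran. */
int square(int x) {
    return x * x;
}
```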
In particular, making a C++ compiler is not much harder than making a C compiler: parsing C++ and transforming it into some internal code representation is complex (because the C++ specification is complex) but well understood, while the optimization parts are even more complex (inside GCC, the middle-end optimizations, which are source-language- and target-processor-neutral, form the majority of the compiler, with the rest split between front-ends for several languages and back-ends for several processors). Hence most optimizing C compilers are also able to compile some other languages, like C++, Fortran, D, … The C++-specific parts of GCC are about 20% of the compiler…
Also, C (or C++) is so widely used that people expect their code to compile even when it does not exactly follow the official standards, which do not define the semantics of the language precisely enough (so each compiler may have its own interpretation of it). Look also at the CompCert proved C compiler, and the Frama-C static analyzer, which care about a more formal semantics of C.
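A classic example of such an under-specified corner is signed overflow, which the standard leaves undefined (the behaviour described below is what GCC and Clang typically do, not something the standard mandates):

```c
#include <limits.h>
#include <stdio.h>

/* Signed overflow is undefined behaviour, so an optimizing compiler
 * is allowed to assume it never happens and fold this test to
 * "always true", while a naive translation performs the wrapping
 * comparison literally.  For INT_MAX, GCC and Clang typically print
 * 1 at -O2 and 0 at -O0 -- two interpretations, both "correct". */
static int will_not_overflow(int x) {
    return x + 1 > x;
}

int main(void) {
    printf("%d\n", will_not_overflow(INT_MAX));
    return 0;
}
```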
And optimizations are a long-tail phenomenon: implementing a few simple optimizations is easy, but they won’t make a compiler competitive! You need to implement a lot of different optimizations, and to organize and combine them cleverly, to get a real-world compiler that is competitive. In other words, a real-world optimizing compiler has to be a complex piece of software. BTW, both GCC and Clang/LLVM have several internal specialized C/C++ code generators. And both are huge beasts (several million source lines of code, with a growth rate of several percent each year) with a large developer community (a few hundred people, working mostly full-time, or at least half-time).
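As a small illustration of how passes only pay off in combination (a sketch, not a description of GCC’s or LLVM’s actual pass pipelines):

```c
/* No single optimization gets very far here, but inlining square(),
 * unrolling the loop, propagating the constants and eliminating the
 * dead code together let a good optimizer reduce compute() to a
 * plain "return 42;". */
static int square(int x) { return x * x; }

int compute(void) {
    int acc = 0;
    for (int i = 0; i < 3; ++i)
        acc += square(i + 1);   /* 1 + 4 + 9 = 14 */
    return acc * 3;             /* 14 * 3 = 42    */
}
```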
Notice that there is (to the best of my knowledge) no multi-threaded C compiler, even if some parts of a compiler could run in parallel (e.g. intra-procedural optimization, register allocation, instruction scheduling…). And a parallel build with make -j is not always enough (especially with LTO).
Also, it is difficult to get funding for coding a C compiler from scratch, and such an effort needs to last several years. Finally, most C or C++ compilers today are free software (there is no longer a market for new proprietary compilers sold by startups) or at least monopolistic commodities (like Microsoft Visual C++), and being free software is nearly a requirement for compilers (because they need contributions from many different organizations).
I’d be delighted to get funding to work on a C compiler from scratch as free software, but I am not naive enough to believe that is possible today!
20
I would like to contest your underlying assumption that there are only a small number of C implementations.
I don’t even know C, I don’t use C, I am not a member of the C community, and yet even I know of far more compilers than the few you mentioned.
First and foremost, there is the compiler which probably completely dwarfs both GCC and Clang on the desktop: Microsoft Visual C. Despite the inroads that both OS X and Linux have been making on the desktop, and the market share that iOS and Android have “stolen” away from former traditional desktop users, Windows is still the dominant desktop OS, and the majority of Windows desktop C programs are probably compiled using Microsoft tools.
Traditionally, every OS vendor and every chip vendor had their own compilers. Microsoft, as an OS vendor, has Microsoft Visual C. IBM, as both an OS vendor and a chip vendor, has XLC (which is the default system compiler for AIX, and the compiler with which both AIX and i/OS are compiled). Intel has their own compiler. Sun/Oracle have their own compiler in Sun Studio.
Then, there are the high-performance compiler vendors like PathScale and The Portland Group, whose compilers (and OpenMP libraries) are used for number crunching.
Digital Mars is also still in business. I believe Walter Bright has the unique distinction of being the only person on the planet who managed to create a production-quality C++ compiler (mostly) by himself.
Last but not least, we have all the proprietary compilers for embedded microcontrollers. IIRC, more microcontrollers are sold every year than desktop, mobile, server, workstation, and mainframe CPUs have been sold in the entire history of computing combined. So those are definitely not niche products.
An honorary mention goes out to TruffleC, a C interpreter(!) running on the JVM(!), written using the Truffle AST interpreter framework, that is only 7% slower than GCC and Clang (whichever is fastest on any given benchmark) across the Computer Language Benchmarks Game, and faster than both on microbenchmarks. Using TruffleC, the Truffle team was able to get their version of JRuby+Truffle to execute Ruby C extensions faster than the actual C Ruby implementation!
So, those are 6 implementations, in addition to the ones you listed, that I can name off the top of my head, without even knowing anything about C.
14
How many compilers do you need?
If they have different feature sets, you create a portability problem. If they’re commoditised, you choose the “default” (GCC, Clang, or VS). If you care about the last 5% of performance, you have a benchmark-off.
If you’re doing programming language work recreationally or for research purposes, it’s likely to be in a more modern language. Hence the proliferation of toy compilers for Scheme and ML. Although OCaml seems to be getting some traction for non-toy non-academic uses.
Note that this varies a lot by language. Java has essentially the Sun/Oracle toolchain and the GNU one. Python has various compilers, none of which is really respected compared to the standard interpreter. Rust and Go have exactly one implementation each. C# has Microsoft and Mono.
4
C/C++ is unique amongst compiled languages in that it has 3 major implementations of a common specification.
Going by the rule of dismissing anything that’s not used much, every other compiled language has 0 to 1.
And I think JavaScript is the only reason you need to specify ‘compiled’.
6
So what is your target language?
SML compilers often target C or something like LLVM (or, as seen in your link, the JVM or JavaScript).
If you’re compiling C, it’s not because you’re going to the JVM. You’re going to something worse than C. Far worse. And then you get to duplicate that minor hell a bunch of times for all your target platforms.
And sure, C isn’t C++, but I’d say that it’s closer to C++ than to Scheme. It has its own share of undefined-behavior evilness and implementation-defined minutiae (I’m looking at you, sizes of built-in types). And if you screw up that minutiae (or do it “correctly” but unexpectedly), then you have decades of existing code on vital systems that will tell you how terrible you are. If you screw up an SML compiler, it just won’t work – and someone might notice. Someday.
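For what it’s worth, that minutiae is why portable C tends to spell out its assumptions rather than trusting the built-in types – something like this sketch (the asserted sizes are assumptions about the target, not guarantees from the standard):

```c
#include <limits.h>
#include <stdint.h>

/* The standard only guarantees minimum ranges for the built-in types;
 * their exact sizes are implementation-defined, so code that assumes
 * more should say so explicitly. */
_Static_assert(CHAR_BIT == 8,    "this code assumes 8-bit bytes");
_Static_assert(sizeof(int) == 4, "this code assumes 32-bit int");

/* long is 64-bit on LP64 Unix but 32-bit on LLP64 Windows, which is
 * exactly the sort of detail a new C compiler has to get right for
 * decades of existing code. */
int64_t widen(long x) { return (int64_t)x; }
```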
7