I’ve heard in a number of places now that people expect languages to use, or at least have, a self-hosting compiler in order to deserve respect.
I’m curious as to why this is. A compiler seems like a very significant piece of software to write, and I imagine not all languages are well-suited to creating them. Wouldn’t it make more sense to spend the effort working in something that will give better results?
“Wouldn’t it make more sense to spend the effort working in something that will give better results?”
Like what?
The nice thing about compilers is that they don’t have many dependencies. This makes them good candidate projects for a new language that likely doesn’t yet have a very large or diverse standard library.
Better yet, they require a variety of things, while also being well studied. The variety helps make sure that your example tests various parts of the language. Being well studied means that you have other compilers to compare against, and it lends you credibility with academic sorts, showing that you know what you’re doing.
And while compilers seem like a ton of work, they’re pretty small in the grand scheme of things. If the language implementers can’t even do something they’ve done before in the new language, how are they going to do novel things? How are they going to handle the really big stuff like standard libraries or an IDE?
The goal of having a compiler in the language that is being compiled is often part of the practice of “eating your own dog food.” It demonstrates to the world that you consider the language, compiler, and ecosystem of supporting modules and tools to be “good enough for serious work” or “production ready.”
It also has the virtuous effect of forcing those closest to the language, compiler, and runtime design to directly face the effects of all the decisions they have made and the development priorities they have chosen, warts and all. This often leads to a core group that not only understands the language environment in theory, but also has extensive practical experience using the language and tools in the crucible of hard, real-world conditions.
People create new general purpose languages for one main reason: they hate at least one thing about every other language out there. This is why so many languages don’t get off the ground. You have a great idea for a language that would improve your programming life, but you have to make the first implementation in a language that annoys you in at least one way. Self-hosting means you no longer have to work in that old annoying language. That’s why a language’s creators work toward that step, and see it as a major milestone.
A lot of language features look good on paper, but when you get around to using them in a real project you start to see their limitations. For example, a lot of languages don’t have decent Unicode support at first. Completing a large project helps ensure that many of those situations have been encountered and dealt with, and a self-hosting compiler is as good a project as any. That’s why people other than the language’s creators see it as a major milestone.
That doesn’t mean it’s the only milestone worth noting. There is functionality that isn’t exercised by a compiler, such as database integration, graphical interfaces, networking, etc.
Steve Yegge wrote a great blog post that, somewhat indirectly, addresses this.
Big point #1: compilers encompass pretty much every aspect of computer science. They’re an upper-level course because you need to know all the other things you learn in the computer science curriculum just to get started. Data structures, searching and sorting, asymptotic performance, graph coloring? It’s all in there.
There’s a reason Knuth has been working on his monumental (and never-ending) “Art of Computer Programming” for several decades, even though it started out as (just) a compiler textbook. In the same way that Carl Sagan said “If you wish to make an apple pie from scratch, you must first invent the universe”, if you wish to write a compiler, you must first deal with nearly every aspect of computer science.
That means if the compiler is self-hosted, then it’s pretty sure to be able to do what I need, no matter what I’m doing. Conversely, if you didn’t write a compiler in your language, there’s a good chance it misses something that’s really important to somebody, because the language implementors never had to write a program that would require them to think about all those issues.
Big point #2: from 30,000 feet, a surprising number of problems look just like compilers.
Compilers take a stream of symbols, figure out their structure according to some domain-specific predefined rules, and transform them into another symbol stream. Sounds pretty general, doesn’t it? Well, yeah.
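As a toy illustration of that shape (the grammar and the function names here are invented for the example, not taken from any real compiler), here is a minimal sketch in C: it reads one symbol stream (single-digit infix arithmetic), applies precedence rules, and emits another symbol stream (postfix notation).

```c
/* Toy "symbol stream in, rules applied, symbol stream out" example:
 * translate single-digit infix expressions such as 1+2*(3+4)
 * into postfix. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>

static const char *src;   /* the input symbol stream */

static void expr(void);   /* forward declaration for recursion */

/* primary := digit | '(' expr ')' */
static void primary(void) {
    if (*src == '(') {
        src++;            /* consume '(' */
        expr();
        if (*src != ')') { fprintf(stderr, "expected ')'\n"); exit(1); }
        src++;            /* consume ')' */
    } else if (*src >= '0' && *src <= '9') {
        putchar(*src++);  /* emit the digit to the output stream */
    } else {
        fprintf(stderr, "unexpected '%c'\n", *src);
        exit(1);
    }
}

/* term := primary { '*' primary } -- '*' binds tighter than '+' */
static void term(void) {
    primary();
    while (*src == '*') { src++; primary(); putchar('*'); }
}

/* expr := term { '+' term } */
static void expr(void) {
    term();
    while (*src == '+') { src++; term(); putchar('+'); }
}

int main(int argc, char **argv) {
    src = (argc > 1) ? argv[1] : "1+2*(3+4)";
    expr();               /* parse the input, emitting postfix as we go */
    putchar('\n');
    return 0;
}
```

Run on 1+2*(3+4) it prints 1234+*+. The toy itself doesn’t matter; the point is that this read-recognize-emit shape is what keeps turning up in everyday programming.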
Whether you’re on the Visual C++ team or not, you will very often find yourself needing to do something that looks just like part of a compiler. I do it literally every day.
Unlike most other professions, programmers don’t just use tools; they build their own. A programmer who can’t write tools, whether due to lack of skill or a lack of usable tools to build them with, will forever be handicapped, limited to the tools that somebody else provides.
If a language is “not well-suited to creating” programs that take a stream of symbols, apply rules to them, and transform them into another stream of symbols, that sounds like a pretty limited language, and not one that would be useful to me.
(Fortunately, I don’t think there are many programming languages which are ill-suited to transforming symbols. C is probably among the worst such languages in use today, yet C compilers are usually self-hosted, so that has never stopped anyone.)
A third reason I’ll end with, from personal experience, not mentioned by Yegge (because he wasn’t writing about “why self-host”): it shakes out bugs. When you’re writing a compiler, that means every time you build it (not just every time you run it), you depend on it to work, and to work correctly against a decent-sized codebase (the compiler itself).
This month I’ve been using a relatively new and famous non-self-hosted compiler (you can probably guess which one), and I can’t go 2 days without segfaulting the thing. I wonder how much the designers actually had to use it.
If you want a compiler for language X to be self-hosting, you first have to implement it in some other language, say Y, such that it takes input in language X and spits out assembly code, some intermediate code, or even object code for the machine the compiler is running on. You want language Y to be as similar to language X as possible, since at some point you will be translating the code written in Y into X.
But you don’t want to write any more of the compiler in language Y than necessary, so to start with, you implement only a subset of the language, eliminating redundant constructs: in the case of a ‘C’-type language, while but no for or do-while; if but no switch/case or the ternary operator; no structures, unions, or enumerations; and so on. What you have left is just enough of the language to write a parser and a rudimentary code generator for language X. Then you check the output, over and over.
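To make the subset concrete, here is a purely illustrative sketch: the same loop written twice in C, once in the full language and once using only the kind of subset described above (while but no for, if but no ternary operator).

```c
/* Illustrative only: the same loop in the full language and in a
 * bootstrap-friendly subset (while but no for, if but no ternary). */
#include <stdio.h>

int main(void) {
    /* Full language: */
    for (int i = 0; i < 5; i++)
        printf("%d\n", i > 2 ? 1 : 0);

    /* Bootstrap subset the stage-one compiler must handle: */
    int j = 0;
    while (j < 5) {
        int flag;
        if (j > 2) { flag = 1; } else { flag = 0; }
        printf("%d\n", flag);
        j = j + 1;
    }
    return 0;
}
```

Because every for loop and ternary expression can be rewritten mechanically this way, restricting the stage-one compiler to the subset costs some convenience but no expressive power.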
Once you have this working, you can rewrite the compiler source that was written in language Y into language X, and compile that language X source using the compiler written in language Y. The output is a new compiler, written in language X, that compiles language X; it is now self-hosting. However, it is not yet complete, since so far it only implements the subset of the language.
So now you add in the missing features, testing each one (or group of features) to verify that it generates correct code. That is, once a feature is implemented in the compiler, you can write test programs that use it, then compile and test them, but you shouldn’t use the feature in the compiler source yet. Once the new features are verified, you can use them in the compiler source itself, perhaps replacing some of the original code written in the language subset, and recompile the compiler using the version with the new features.
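A hedged sketch of what one such verification test might look like (the function names and the feature under test are hypothetical): the test exercises a newly implemented feature, for in this case, against a trusted subset equivalent, while the test driver itself sticks to the subset.

```c
/* Hypothetical regression test for a newly implemented feature.
 * Compile this with the new compiler: the feature under test
 * ('for', here) must agree with the already-trusted subset
 * equivalent. The driver itself uses only the subset. */
#include <stdio.h>

int sum_with_for(int n) {               /* new feature under test */
    int s = 0;
    for (int i = 0; i < n; i++)
        s = s + i;
    return s;
}

int sum_with_while(int n) {             /* trusted subset version */
    int s = 0;
    int i = 0;
    while (i < n) {
        s = s + i;
        i = i + 1;
    }
    return s;
}

int main(void) {
    int n = 0;
    while (n < 100) {
        if (sum_with_for(n) != sum_with_while(n)) {
            printf("FAIL at n=%d\n", n);
            return 1;
        }
        n = n + 1;
    }
    printf("OK\n");
    return 0;
}
```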
You now have a mechanism for adding new features to the language: once the code generation for a feature has been verified correct, the feature can be used in the next generation of the compiler itself.
Back 60 years or so ago, when computers first arrived on the scene (and later again when microprocessors first arrived), there was no other language Y suitable for implementing the initial compiler. So the first compilers had to be written in assembly code, and when enough of the compiler was running, the assembly code would be replaced by a version written in the new language. No assembler either? Then the whole process dropped down another level, with the assembler initially being written in machine code.
Is it possible to produce a programming language that is not well designed for writing a compiler but is well designed for some other purpose?
Looking at a language like SQL I suppose the answer is yes. But languages of that nature are not general purpose.
Who says that? …anyway, it’s just an opinion. Some might agree, some may not, there is no right or wrong here. Some languages have compilers written in itself, others don’t. Whatever.
Nevertheless, I think it’s a nice exercise/proof-of-concept if a language is able to “self-compile” …it’s just …nice …and it proves the language is suited to do some complex stuff.
I’d also like to mention that despite being nice, there are still a variety of reasons a compiler might be written in another language.
For instance, most JavaScript engines are not written in JavaScript. There are many reasons for this: integration with other software, linking to existing libraries/dependencies, superior tools, performance, legacy code… Sometimes a language self-compiling is nice, but it still makes sense to maintain the core compiler in another language. The language itself can still make sense; it’s just that you usually can’t afford to redevelop an entire ecosystem.
Clang is written in C++. It wouldn’t be too hard to rewrite the Clang Objective-C compiler in Objective-C, but then it would be quite useless. Any change in the C++ compiler would have to be redone in Objective-C and vice versa. So why?
There’s now a Clang Swift compiler. Surely that compiler could be rewritten in Swift. But what purpose would that serve? To demonstrate that the language is powerful enough to write a compiler in? Nobody cares whether you can write compilers in Swift. People do care whether you can write user interfaces in Swift, and you demonstrably can.
If you have a well tested compiler that can easily be adapted to compile different languages, it’s quite pointless to rewrite it into different languages, unless the rewrite in one different language would make it easier to work with the compiler. And if it would make sense to write Clang in Swift, for example, then the Clang C, C++, and Objective-C compilers would all be written in Swift.
There are more important things to do than to prove that you can write a compiler in some programming language.
It shows that the language is capable of complex string processing, and of translating to another language or interpreting itself.
In the process of creating a compiler (the first big project), there will be issues that come to the fore.