I’m slowly working to finish my degree, and this semester is Compilers 101. We’re using the Dragon Book. Shortly into the course, we’re talking about lexical analysis and how it can be implemented via deterministic finite automata (hereafter, DFA). Set up your various lexer states, define transitions between them, etc.
But both the professor and the book propose implementing them via transition tables, which amount to a giant 2D array (the various non-terminal states as one dimension, the possible input symbols as the other) plus a switch statement that handles all of the terminal states and dispatches to the transition table when in a non-terminal state.
The theory is all well and good, but as someone who’s actually written code for decades, the implementation is vile. It’s not testable, it’s not maintainable, it’s not readable, and it’s a pain and a half to debug through. Worse yet, I can’t see how it would be remotely practical if the language were UTF-capable. Having a million or so transition table entries per non-terminal state gets unwieldy in a hurry.
So what’s the deal? Why is the definitive book on the subject saying to do it this way?
Is the overhead of function calls really that much? Is this something that works well or is necessary when the grammar isn’t known ahead of time (regular expressions?)? Or perhaps something that handles all cases, even if more specific solutions will work better for more specific grammars?
(note: possible duplicate “Why use an OO approach instead of a giant switch statement?” is close, but I don’t care about OO. A functional approach or even saner imperative approach with standalone functions would be fine.)
And for the sake of example, consider a language that only has identifiers, and those identifiers are [a-zA-Z]+. In the DFA implementation, you’d get something like:
private enum State
{
    Error = -1,
    Start = 0,
    IdentifierInProgress = 1,
    IdentifierDone = 2
}

private static State[][] transition = new State[][]
{
    // /* Start */                new State[] { State.Error, State.Error (repeat until 'A'), State.IdentifierInProgress, ...
    // /* IdentifierInProgress */ new State[] { State.IdentifierDone, State.IdentifierDone (repeat until 'A'), State.IdentifierInProgress, ...
    // /* etc. */
};
public static string NextToken(string input, int startIndex)
{
    State currentState = State.Start;
    int currentIndex = startIndex;
    while (currentIndex < input.Length)
    {
        switch (currentState)
        {
            case State.Error:
                // Whatever, example
                throw new NotImplementedException();
            case State.IdentifierDone:
                return input.Substring(startIndex, currentIndex - startIndex);
            default:
                currentState = transition[(int)currentState][input[currentIndex]];
                currentIndex++;
                break;
        }
    }
    return String.Empty;
}
(though something that would handle end of file correctly)
Compared to what I would expect:
public static string NextToken(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && IsLetter(input[currentIndex]))
    {
        currentIndex++;
    }
    return input.Substring(startIndex, currentIndex - startIndex);
}
public static bool IsLetter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
With the code in NextToken refactored out into its own function once you have multiple destinations from the start of the DFA.
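That refactoring might look something like the following sketch (in Java rather than the question's C#, with a hypothetical second token type added so the start state actually has multiple destinations; all names are illustrative):

```java
// Sketch of the hand-written style described above: dispatch on the first
// character, then let a dedicated function consume the rest of the token.
public class HandLexer {
    public static String nextToken(String input, int start) {
        if (start >= input.length()) return "";
        char c = input.charAt(start);
        if (isLetter(c)) return readIdentifier(input, start);
        if (c >= '0' && c <= '9') return readNumber(input, start);
        return "";  // error handling elided for the sketch
    }

    static String readIdentifier(String input, int start) {
        int i = start;
        while (i < input.length() && isLetter(input.charAt(i))) i++;
        return input.substring(start, i);
    }

    static String readNumber(String input, int start) {
        int i = start;
        while (i < input.length() && input.charAt(i) >= '0' && input.charAt(i) <= '9') i++;
        return input.substring(start, i);
    }

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Each token type gets its own small, independently testable function, and the dispatch at the top is the only place that knows about all of them.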
The motivation for the particular algorithm is largely that it’s a learning exercise, so it tries to stay close to the idea of a DFA, and keep states and transitions very explicit in the code. As a rule, nobody would actually manually write any of this code anyway – you would use a tool to generate code from a grammar. And that tool wouldn’t care about the readability of the code because it isn’t source code, it’s an output based on the definition of a grammar.
Your code is cleaner for someone maintaining a hand-written DFA, but a little farther removed from the concepts being taught.
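For comparison, a table-driven version that stays close to the DFA concept but also handles end of input might look like this sketch (in Java rather than C#; the two-class alphabet and all names are deliberate simplifications for illustration):

```java
// Minimal table-driven DFA recognizing identifiers matching [a-zA-Z]+,
// with explicit end-of-input handling.
public class TableLexer {
    static final int START = 0, IN_IDENT = 1, ERROR = 2;

    // One row per state, one column per character class (0 = letter, 1 = other).
    static final int[][] TRANSITION = {
        /* START    */ { IN_IDENT, ERROR },
        /* IN_IDENT */ { IN_IDENT, ERROR },
    };

    static int charClass(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ? 0 : 1;
    }

    public static String nextToken(String input, int start) {
        int state = START;
        int i = start;
        while (i < input.length()) {
            int next = TRANSITION[state][charClass(input.charAt(i))];
            if (next == ERROR) break;   // the longest match ends here
            state = next;
            i++;
        }
        // Accept only if we stopped in an accepting state.
        return state == IN_IDENT ? input.substring(start, i) : "";
    }
}
```

Running off the end of the input is just another way for the loop to stop; acceptance is decided by the state we stopped in, which removes the need for a separate IdentifierDone state.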
In practice these tables are generated from regular expressions that define the tokens of the language:
number := [digit][digit|underscore]*
reserved_word := 'if' | 'then' | 'else' | 'for' | 'while' | ...
identifier := [letter][letter|digit|underscore]*
assignment_operator := '=' | '+=' | '-=' | '*=' | '/='
addition_operator := '+' | '-'
multiplication_operator := '*' | '/' | '%'
...
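Conceptually, the generated lexer behaves as if those token definitions were combined into one big alternation tried at each position. A sketch of that behavior (in Java, using java.util.regex as a stand-in for the generated DFA; the group names and token set are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The token regexes combined into one alternation; each find() yields the
// next token, and the named group that matched tells us its kind.
public class RegexLexer {
    static final Pattern TOKENS = Pattern.compile(
        "(?<NUMBER>\\d[\\d_]*)"
      + "|(?<IDENT>[A-Za-z][A-Za-z0-9_]*)"
      + "|(?<OP>[+\\-*/%]=?|=)"
      + "|(?<WS>\\s+)");

    public static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        while (m.find()) {
            if (m.group("WS") != null) continue;  // skip whitespace
            for (String name : new String[] { "NUMBER", "IDENT", "OP" }) {
                if (m.group(name) != null) out.add(name + ":" + m.group(name));
            }
        }
        return out;
    }
}
```

A real generator compiles the alternation down to a single DFA table, with tie-breaking rules (longest match, earliest rule) deciding between overlapping tokens such as reserved words and identifiers.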
We have had utilities to generate lexical analyzers since 1975 when lex was written.
You are basically suggesting replacing regular expressions with procedural code. This expands a couple of characters in a regular expression into several lines of code. Handwritten procedural code for lexical analysis of any moderately interesting language tends to be both inefficient and difficult to maintain.
The inner loop of:
currentState = transition[(int)currentState][input[currentIndex]];
currentIndex++;
break;
has a lot of performance advantages. There are no branches in it at all, because you do exactly the same thing for every input character. The performance of the compiler can be gated by the lexer, which must touch every single character of the input. This was even more true when the Dragon Book was written.
In practice, besides CS students studying lexers, no one has to implement (or debug) that inner loop, because it is part of the boilerplate that comes with the tool that builds the transition table.
From memory (it’s a long time since I read the book, I’m pretty sure I didn’t read the latest edition, and I certainly don’t remember anything that looked like Java), that part was written with the code intended as a template, the table being filled in by a lex-like lexer generator. Also from memory, there was a section on table compression (written in such a way that it also applied to table-driven parsers, so perhaps further into the book than what you have seen yet). Similarly, the book I remember assumed an 8-bit character set; I’d expect later editions to have a section on handling bigger character sets, probably as part of the table compression. I’ve given an alternative way to handle that as an answer to a SO question.
There is a real performance advantage to a tight, data-driven loop on modern architectures: it’s quite cache-friendly (if you have compressed the tables), and branch prediction is about as good as it can get (one miss at the end of the lexeme, perhaps one miss for the switch dispatching to the code that depends on the symbol; that’s assuming your table decompression can be done with predictable jumps). Moving that state machine into pure code would hurt branch prediction and perhaps increase cache pressure.
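One common form of that table compression is alphabet compression: each code point is first mapped to a small equivalence class, so the transition table needs one column per class rather than one per character. This also answers the Unicode concern from the question. A sketch (in Java; the three classes and all names are illustrative, and the classifier leans on the JDK’s Unicode tables instead of a million-entry row):

```java
// Sketch of alphabet compression for a large (e.g. Unicode) input alphabet:
// code points are mapped to a handful of equivalence classes, and only the
// classifier ever sees the full alphabet.
public class CompressedDfa {
    static final int CLS_LETTER = 0, CLS_DIGIT = 1, CLS_OTHER = 2;

    static int equivClass(int codePoint) {
        if (Character.isLetter(codePoint)) return CLS_LETTER;
        if (Character.isDigit(codePoint))  return CLS_DIGIT;
        return CLS_OTHER;
    }

    static final int START = 0, IN_IDENT = 1, ERROR = 2;

    // Identifiers: a letter followed by letters or digits.
    static final int[][] TRANSITION = {
        /* START    */ { IN_IDENT, ERROR,    ERROR },
        /* IN_IDENT */ { IN_IDENT, IN_IDENT, ERROR },
    };

    public static String nextToken(String input, int start) {
        int state = START, i = start;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            int next = TRANSITION[state][equivClass(cp)];
            if (next == ERROR) break;
            state = next;
            i += Character.charCount(cp);  // step over surrogate pairs correctly
        }
        return state == IN_IDENT ? input.substring(start, i) : "";
    }
}
```

The table stays (number of states) × (number of classes), no matter how large the character set is; only the classifier grows, and it can itself be a compact lookup structure.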
Having worked through the Dragon Book before, the principal reason for having table-driven lexers and parsers is so that you can use regular expressions to generate the lexer and BNF to generate the parser. The book also covers how tools like lex and yacc work, so that you know what those tools are doing. Furthermore, it is important for you to work through some practical examples.
Despite what many of the comments say, it has nothing to do with the style of code that was written in the ’40s, ’50s, ’60s…; it has to do with gaining a practical understanding of what the tools are doing for you and what you have to do to make them work. It has everything to do with a fundamental understanding of how compilers work, both from a theoretical and a practical standpoint.
Hopefully, your instructor will also let you use lex and yacc (unless it is a graduate level class and you get to write lex and yacc).
Late to the party 🙂 The tokens are matched against regular expressions. Since there are many of them, you have a multi-regex engine, which in turn is a giant DFA.
“Worse yet, I can’t see how it would be remotely practical if the language was UTF capable.”
It is irrelevant (or rather, transparent). Besides, UTF-8 has the nice property that its encoded characters do not overlap even partially. For example, the byte representing the character “A” (from the ASCII-7 table) is never used as part of the encoding of any other UTF-8 character.
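That property is easy to check mechanically: every byte of a multi-byte UTF-8 sequence has its high bit set, so no byte of one can be mistaken for an ASCII character. A small sketch (in Java; the method name is illustrative):

```java
import java.nio.charset.StandardCharsets;

// Verifies the self-synchronization property mentioned above: the UTF-8
// encoding of non-ASCII text contains no byte with the high bit clear,
// so an ASCII byte like 'A' (0x41) never appears inside another character.
public class Utf8Property {
    public static boolean noAsciiBytes(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0x80) == 0) return false;  // found an ASCII-range byte
        }
        return true;
    }
}
```

This is why a byte-oriented lexer can scan UTF-8 input and still dispatch on ASCII punctuation and keywords without decoding anything.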
So you have a single DFA (which is the multi-regex) for the entire lexer. What better way to write it down than a 2D array?