Background
I’m currently designing my own programming language as a research project.
I have most of the grammar done and written down as context-free grammar, and it should be working as is. – Now I’m working on the actual compiler that should translate the language into x86 binary assembly code
, more specifically, I am working on the parser
(the front end).
The language’s syntax is, for the most part, very similar to Java/C/C++.
The parser, which constructs an intermediate representation out of source code, works as follows:
The grammar is built as a big tree in which the actual source code only determines the leaves; Each syntactic variable (or nonterminal) has it’s own class, and each of these classes has a static get(byte[] source, int offset)
method which returns a new leaf (node), or null
if the source code syntax does not fit this nonterminal structure.
I am using a variation of predictive parsing
, by the way.
For the nonterminal DataType
, I have chosen the following grammatical structure:
DataType:
PrimitiveDataType
| ArrayDataType
| ComplexDataType
ArrayDataType:
DataType []
Did I mention this language is object-oriented?
So the problem here is that when DataType
‘s get
method is called, it first checks whether the following is a primitive data type, PrimitiveDataType
‘s get
method is called. Assuming we have an array, this would return null
, so it continues on to check whether it’s an ArrayDataType
, by calling it’s get
method.
Problem
Arrays may be created of any data type, including arrays themselves (which would look like Type[][]
). So what ArrayDataType
‘s get
method would do is again call DataType
‘s get
method to figure out what type the array is of.
Unfortunately, this is where my parser implementation would fail because this behavior results in a loop!
Question
Would there be any good/better design alternatives to this?
1
Predictive parser you selected (LL(k)) means you will have to solve left-recursion problems. Algorithm for solving direct and indirect recursions is clearly described on wikipedia:
http://en.wikipedia.org/wiki/Left_recursion
Some info can be found in posts here on StackOverflow:
https://stackoverflow.com/questions/2652060/removing-left-recursion
https://stackoverflow.com/questions/2999755/removing-left-recursion-in-antlr
https://stackoverflow.com/questions/4994036/left-recursion-elimination
In human language (non-scientific 🙂 “left-recursion problem” means you can’t endlessly go into recursion with non-terminal (A -> Ab) again and again. At some time you HAVE TO feed parser algorithm with a terminal to breake a loop.
In BNF this could look like:
Recursion Problem:
NT: NT T
NT: T
One solution:
NT: T NT2
NT2: T NT2
NT2:
For your grammar this could look like:
DataType:
PrimitiveDataType ArrayDimensions
| ComplexDataType ArrayDimensions
ArrayDimensions:
[] ArrayDimensions
|
If your parser generator doesn’t allow empty productions and/or if you want to process array types separately, try something like this:
DataType:
DataTypeName
| ArrayDataType
ArrayDataType:
DataTypeName ArrayDimensions
DataTypeName:
PrimitiveDataType
| ComplexDataType
ArrayDimensions:
[]
| [] ArrayDimensions
The problem is that using a LL(1) parser can lead to left-recursion. You can avoid this by switching to a LR parser or using this grammar to avoid recursion:
DataType:
PrimitiveDataType ArrayDataType
| ComplexDataType ArrayDataType
ArrayDataType:
[] ArrayDataType
| (empty)
If parsing is not part of your project, just search for a parsing library/generator, there are probably many out there.
The key here is simply to use an LR parser rather than LL. There’s a reason why virtually no implementations actually use LL as it’s stated, and it’s because it really can’t recognize many particularly useful grammars.
1
This problem you are facing looks with something I’ve already faced. In that case, it should be implemented in the Syntatic Analyzer class, and yes, it will generate a tree, it would be indeed recursive.
E.g.: in my case, I had to implement a way that the function would receive a list of arguments, so my grammar rule for arguments list would be something like this:
<ArgLi> ::= <Arg> <ArgLi> |
where:
<Arg> ::= '(' <Type> Identifier ')'
<ArgLi>
is called recursivelly
<Type> ::= 'int' | 'real' | 'str'
Identifier = {ID Head}{ID Tail}*
So, to implement that feature, the command has delimiters, so even if it enters a loop, it will check the rule, and determine the action based on the current token I’m looking.
For example, let’s say I want to write this:
( def main [ ( int a1 ) ( str a2 ) ]
...
)
The code my language follows to understand this is:
/**
* <Def> ::= '(' 'def' Identifier '[' <ArgLi> ']' <CommandLi> ')'
*/
public Node Def() {
Node node = new Node(NodeIdentifier.DEF);
if (token.getImage().equals("(")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
if (token.getImage().equals("def")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
if (token._getClass().equals("ID")
|| token.getImage().equals("main")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
if (token.getImage().equals("[")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
node.addSon(argLi());
if (token.getImage().equals("]")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
node.addSon(commandLi());
if (token.getImage().equals(")")) {
node.addSon(new Node(NodeIdentifier.TOKEN,
token));
token = readToken();
} else {
errors.add("Error: token ')' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token ']' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token '[' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token 'ID' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token 'def' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token '(' not recognized. line: "
+ token.getLine());
}
return node;
}
/**
* <ArgLi> ::= <Arg> <ArgLi> |
*/
public Node argLi() {
Node node = new Node(NodeIdentifier.ARGLI);
if (token.getImage().equals("(")) {
node.addSon(arg());
node.addSon(argLi());
}
return node;
}
/**
* <Arg> ::= '(' <Type> Identifier ')'
*/
public Node arg() {
Node node = new Node(NodeIdentifier.ARG);
if (token.getImage().equals("(")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
node.addSon(type());
if (token._getClass().equals("ID")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
if (token.getImage().equals(")")) {
node.addSon(new Node(NodeIdentifier.TOKEN, token));
token = readToken();
} else {
errors.add("Error: token ')' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token 'ID' not recognized. line: "
+ token.getLine());
}
} else {
errors.add("Error: token '(' not recognized. line: "
+ token.getLine());
}
return node;
}
As you can see the methods arg()
and argLi()
are called recursivelly, and that’s how I’ve implemented it to avoid infinit loops in an efficient way.
I’m pretty sure this technique can help you with your problem.
2