Programming Language Parser (in Java) – What would be a better design alternative for a special case?

Background

I’m currently designing my own programming language as a research project.
I have most of the grammar done and written down as context-free grammar, and it should be working as is. – Now I’m working on the actual compiler that should translate the language into x86 binary assembly code, more specifically, I am working on the parser (the front end).

The language’s syntax is, for the most part, very similar to Java/C/C++.
The parser, which constructs an intermediate representation out of source code, works as follows:

The grammar is built as a big tree in which the actual source code only determines the leaves; Each syntactic variable (or nonterminal) has it’s own class, and each of these classes has a static get(byte[] source, int offset) method which returns a new leaf (node), or null if the source code syntax does not fit this nonterminal structure.

I am using a variation of predictive parsing, by the way.

For the nonterminal DataType, I have chosen the following grammatical structure:

DataType:
    PrimitiveDataType
 |  ArrayDataType
 |  ComplexDataType

ArrayDataType:
    DataType []

Did I mention this language is object-oriented?

So the problem here is that when DataType‘s get method is called, it first checks whether the following is a primitive data type, PrimitiveDataType‘s get method is called. Assuming we have an array, this would return null, so it continues on to check whether it’s an ArrayDataType, by calling it’s get method.

Problem

Arrays may be created of any data type, including arrays themselves (which would look like Type[][]). So what ArrayDataType‘s get method would do is again call DataType‘s get method to figure out what type the array is of.
Unfortunately, this is where my parser implementation would fail because this behavior results in a loop!

Question

Would there be any good/better design alternatives to this?

Predictive parser you selected (LL(k)) means you will have to solve left-recursion problems. Algorithm for solving direct and indirect recursions is clearly described on wikipedia:

http://en.wikipedia.org/wiki/Left_recursion

Some info can be found in posts here on StackOverflow:

https://stackoverflow.com/questions/2652060/removing-left-recursion
https://stackoverflow.com/questions/2999755/removing-left-recursion-in-antlr
https://stackoverflow.com/questions/4994036/left-recursion-elimination

In human language (non-scientific 🙂 “left-recursion problem” means you can’t endlessly go into recursion with non-terminal (A -> Ab) again and again. At some time you HAVE TO feed parser algorithm with a terminal to breake a loop.

In BNF this could look like:

Recursion Problem:

NT: NT T
NT: T

One solution:

NT: T NT2
NT2: T NT2
NT2:

For your grammar this could look like:

DataType:
    PrimitiveDataType ArrayDimensions
 |  ComplexDataType ArrayDimensions

ArrayDimensions:
    [] ArrayDimensions
 |

If your parser generator doesn’t allow empty productions and/or if you want to process array types separately, try something like this:

DataType:
    DataTypeName
 |  ArrayDataType

ArrayDataType:
    DataTypeName ArrayDimensions

DataTypeName:
    PrimitiveDataType
 |  ComplexDataType

ArrayDimensions:
    []
 |  [] ArrayDimensions

The problem is that using a LL(1) parser can lead to left-recursion. You can avoid this by switching to a LR parser or using this grammar to avoid recursion:

DataType:
    PrimitiveDataType ArrayDataType
 |  ComplexDataType ArrayDataType

ArrayDataType:
    [] ArrayDataType
 |  (empty)

If parsing is not part of your project, just search for a parsing library/generator, there are probably many out there.

The key here is simply to use an LR parser rather than LL. There’s a reason why virtually no implementations actually use LL as it’s stated, and it’s because it really can’t recognize many particularly useful grammars.

This problem you are facing looks with something I’ve already faced. In that case, it should be implemented in the Syntatic Analyzer class, and yes, it will generate a tree, it would be indeed recursive.

E.g.: in my case, I had to implement a way that the function would receive a list of arguments, so my grammar rule for arguments list would be something like this:

<ArgLi> ::= <Arg> <ArgLi> |

where:

<Arg> ::= '(' <Type> Identifier ')'

<ArgLi> is called recursivelly

<Type> ::= 'int' | 'real' | 'str'

Identifier = {ID Head}{ID Tail}*

So, to implement that feature, the command has delimiters, so even if it enters a loop, it will check the rule, and determine the action based on the current token I’m looking.

For example, let’s say I want to write this:

( def main [ ( int a1 ) ( str a2 )  ]

...

)

The code my language follows to understand this is:

/**
 * <Def> ::= '(' 'def' Identifier '[' <ArgLi> ']' <CommandLi> ')'
 */
public Node Def() {

    Node node = new Node(NodeIdentifier.DEF);
    if (token.getImage().equals("(")) {
        node.addSon(new Node(NodeIdentifier.TOKEN, token));
        token = readToken();
        if (token.getImage().equals("def")) {
            node.addSon(new Node(NodeIdentifier.TOKEN, token));
            token = readToken();
            if (token._getClass().equals("ID")
                    || token.getImage().equals("main")) {
                node.addSon(new Node(NodeIdentifier.TOKEN, token));
                token = readToken();
                if (token.getImage().equals("[")) {
                    node.addSon(new Node(NodeIdentifier.TOKEN, token));
                    token = readToken();
                    node.addSon(argLi());
                    if (token.getImage().equals("]")) {
                        node.addSon(new Node(NodeIdentifier.TOKEN, token));
                        token = readToken();
                        node.addSon(commandLi());
                        if (token.getImage().equals(")")) {
                            node.addSon(new Node(NodeIdentifier.TOKEN,
                                    token));
                            token = readToken();
                        } else {
                            errors.add("Error: token ')' not recognized. line: "
                                    + token.getLine());
                        }
                    } else {
                        errors.add("Error: token ']' not recognized. line: "
                                + token.getLine());
                    }
                } else {
                    errors.add("Error: token '[' not recognized. line: "
                            + token.getLine());
                }
            } else {
                errors.add("Error: token 'ID' not recognized. line: "
                        + token.getLine());
            }
        } else {
            errors.add("Error: token 'def' not recognized. line: "
                    + token.getLine());
        }
    } else {
        errors.add("Error: token '(' not recognized. line: "
                + token.getLine());
    }
    return node;
}

/**
 * <ArgLi> ::= <Arg> <ArgLi> |
 */
public Node argLi() {

    Node node = new Node(NodeIdentifier.ARGLI);
    if (token.getImage().equals("(")) {
        node.addSon(arg());
        node.addSon(argLi());
    }
    return node;
}

/**
 * <Arg> ::= '(' <Type> Identifier ')'
 */
public Node arg() {
    Node node = new Node(NodeIdentifier.ARG);
    if (token.getImage().equals("(")) {
        node.addSon(new Node(NodeIdentifier.TOKEN, token));
        token = readToken();
        node.addSon(type());
        if (token._getClass().equals("ID")) {
            node.addSon(new Node(NodeIdentifier.TOKEN, token));
            token = readToken();
            if (token.getImage().equals(")")) {
                node.addSon(new Node(NodeIdentifier.TOKEN, token));
                token = readToken();
            } else {
                errors.add("Error: token ')' not recognized. line: "
                        + token.getLine());
            }
        } else {
            errors.add("Error: token 'ID' not recognized. line: "
                    + token.getLine());
        }
    } else {
        errors.add("Error: token '(' not recognized. line: "
                + token.getLine());
    }

    return node;
}

As you can see the methods arg() and argLi() are called recursivelly, and that’s how I’ve implemented it to avoid infinit loops in an efficient way.

I’m pretty sure this technique can help you with your problem.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 14:22

Thẻ: compiler, java, languages, parsing

Thiết kế website giá rẻ

Danh mục

Programming Language Parser (in Java) – What would be a better design alternative for a special case?

Background

Problem

Question