I am using ANTLR4 4.13.2, the latest version at the time of posting, compiled from source.
The grammar is the java8 grammar provided by the grammars-v4 ANTLR repo.
I intend to parse large projects for static analysis, which requires a parser implemented in C++. I turned to ANTLR4 to avoid writing a parser by hand, as the Java grammar is intensely recursive, verbose and heavy on context.
The generated parser performs abysmally: it takes seconds to parse medium-sized files and minutes for this specific one, openapi-generator-java-class, which takes one and a half minutes. It is a 2K LoC file, so a parse time of this order seems excessive. A 1.2K LoC file, oas-gen-java-class, takes around 25.5 seconds. The measurements were done using a QElapsedTimer (and confirmed by sitting and waiting for the complete parse to finish, which takes a long time).
The code I've written to use the parser is the simplest possible:
stream.open(javaFile);
QElapsedTimer timer;
timer.start();
ANTLRInputStream input(stream);
Java8Lexer lexer(&input);
BufferedTokenStream tokens(&lexer);
Java8Parser parser(&tokens);
tree::ParseTree *tree = parser.compilationUnit();
std::cout << "Parsing time: " << timer.elapsed() << " ms\n";
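For anyone who wants to reproduce the timing without Qt, the same measurement can be sketched with only the standard library. This is a minimal sketch, not the code I actually ran: `timeMs` and `printParseTime` are hypothetical helper names, and the callable passed in stands for whatever wraps the `parser.compilationUnit()` call.

```cpp
#include <chrono>
#include <iostream>
#include <utility>

// Hypothetical helper (not part of ANTLR or Qt): runs `work` once and
// returns the elapsed wall-clock time in whole milliseconds.
template <typename F>
long long timeMs(F&& work) {
    // steady_clock is monotonic, so the result is not affected by
    // system clock adjustments during the measurement.
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

// Hypothetical wrapper: times `work` and prints the result in the same
// format as the snippet above.
template <typename F>
void printParseTime(F&& work) {
    std::cout << "Parsing time: " << timeMs(std::forward<F>(work)) << " ms\n";
}
```

With this in place, the parse call would be wrapped as `printParseTime([&] { tree = parser.compilationUnit(); });`, assuming `tree` and `parser` are declared as in my snippet. Like the `QElapsedTimer` version, it measures wall-clock time for a single run only.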
My questions are:
- What steps would you take to improve its performance?
- For this use case, parsing Java code in C++, is there any alternative I should try out (libraries, parser generators, etc.)?
Things I have tried so far (I will edit this if I try something new):
- Using BufferedTokenStream instead of CommonTokenStream
- Linking antlr statically (using the .a file instead of the .so file, with -fpic)