I have to parse a legacy file format that is not well defined, using Python (pyparsing).
It is of the curly brace family (in fact, having to parse arbitrary curly brace formats is a recurring issue because people are always like “XML is too verbose, let’s invent our own format”).
So I have things like
Variable = [1, 2]
Variable2 = {
Variable,
"Literal",
}
Variable[1] = Variable2
Namespace::Function(argument, {"string literal",
100})
Newlines are significant outside of expressions (separating statements like ; does in C) but not significant inside any kind of braces ({}, () and []).
To make sense of it, it needs to be parsed such that:
- Each line is a child of the root of the AST (where “line” refers to lines delimited by line breaks not inside any kind of braces).
- Each kind of parentheses constitutes a node (annotated with the type of parentheses).
- The children of a parenthesis list are comma separated (in this sense newlines also constitute a kind of brace).
- Anything else is treated as a single token to be handled in another pass.
So for the above example the AST should look like this:
Root
* Brace: n
  * Token: 'Variable ='
  * Brace: []
    * Token: '1'
    * Token: '2'
* Brace: n
  * Token: 'Variable2 ='
  * Brace: {}
    * Token: 'Variable'
    * Token: '"Literal"'
* Brace: n
  * Token: 'Variable'
  * Brace: []
    * Token: '1'
  * Token: '= Variable2'
* Brace: n
  * Token: 'Namespace::Function'
  * Brace: ()
    * Token: 'argument'
    * Brace: {}
      * Token: '"string literal"'
      * Token: '100'
Represented in Python as something like:
[
    Brace('n', [
        'Variable =',
        Brace('[]', ['1', '2'])
    ]),
    ...
The issues are
1. Specifying the syntax for the parser,
2. Converting the result from the parser into a data structure that I can use afterwards.
(I used a dedicated parser package that handled (1), but its result was completely unusable for (2) or any further steps, so I’m trying again, this time with pyparsing.)
Parsing Curly Brace Formats with pyparsing:
- Define the grammar: Use pyparsing to create rules for variables, literals, and the different bracket types ('[]', '{}', '()'). Use nestedExpr for nested structures.
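from pyparsing import (Forward, Group, LineEnd, Suppress, Word, ZeroOrMore,
                       alphanums, alphas, nestedExpr, nums, quotedString)
identifier = Word(alphas + "_", alphanums + "_")
number = Word(nums)  # needed for items such as 1 and 2
literal = quotedString
# item is a Forward so the three bracket types can nest in any order; commas are dropped.
item = Forward()
square_braces = nestedExpr('[', ']', content=item)
curly_braces = nestedExpr('{', '}', content=item)
parentheses = nestedExpr('(', ')', content=item)
item <<= identifier | number | literal | Suppress(',') | square_braces | curly_braces | parentheses
# Only covers "name = <bracketed expression>"; newlines inside brackets are skipped as whitespace.
statement = Group(identifier + '=' + (square_braces | curly_braces | parentheses))
grammar = ZeroOrMore(statement + LineEnd().suppress())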
- Convert to AST: Use parse actions to build an abstract syntax tree (AST). Define a Brace class to represent the different brace types.
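class Brace:
    def __init__(self, brace_type, children):
        self.brace_type = brace_type
        self.children = children
    def __repr__(self):
        return f"Brace({self.brace_type!r}, {self.children!r})"
def make_brace(brace_type):
    # A parse action on a grouped expression receives the group as tokens[0].
    return lambda tokens: Brace(brace_type, list(tokens[0]))
square_braces.setParseAction(make_brace('[]'))
curly_braces.setParseAction(make_brace('{}'))
parentheses.setParseAction(make_brace('()'))
statement.setParseAction(make_brace('n'))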
- Example usage: Parse your input and generate an AST.
input_text = """
Variable = [1, 2]
Variable2 = {
Variable,
"Literal",
}
"""
result = grammar.parseString(input_text)
print(result)
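Once the parse actions hand back Brace objects, the tree can be walked recursively. Here is a minimal sketch of that, assuming a Brace class shaped like the one above; the `dump` helper and the hand-built `tree` are hypothetical and only illustrate the traversal, printing in the same layout as the tree in the question.

```python
class Brace:
    """Assumed to match the Brace class above: a brace type plus a list of children."""

    def __init__(self, brace_type, children):
        self.brace_type = brace_type
        self.children = children


def dump(node, depth=0):
    """Return the tree as a list of indented lines; plain strings are leaf tokens."""
    indent = "  " * depth
    if isinstance(node, Brace):
        lines = [f"{indent}* Brace: {node.brace_type}"]
        for child in node.children:
            lines.extend(dump(child, depth + 1))
        return lines
    return [f"{indent}* Token: {node!r}"]


# Hand-built node (not parser output), used only to show the traversal.
tree = Brace('n', ['Variable =', Brace('[]', ['1', '2'])])
print("\n".join(dump(tree)))
```

The same recursion works for a later pass that interprets the tokens instead of printing them.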
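This approach handles simple assignments of bracketed expressions and organizes the output into Brace objects. Other statement forms from the question, such as Variable[1] = Variable2 or the Namespace::Function(...) call, would need additional rules in statement. I hope this helps!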
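For comparison, the four rules in the question are simple enough that the exact tree it asks for can also be built with a small hand-written scanner instead of pyparsing. The following is only a sketch under stated assumptions: `parse` and `parse_group` are hypothetical names, double-quoted strings contain no escaped quotes, commas at the top level of a line are left inside the token, and unbalanced brackets are not reported as errors.

```python
from dataclasses import dataclass, field

PAIRS = {"[": "]", "{": "}", "(": ")"}


@dataclass
class Brace:
    brace_type: str
    children: list = field(default_factory=list)


def parse(text):
    """Return a list of Brace('n', ...) nodes, one per top-level line."""
    pos = 0

    def parse_group(brace_type, closer):
        nonlocal pos
        node = Brace(brace_type)
        buf = []

        def flush():
            # Emit the pending raw text as one token, ignoring blank runs.
            token = "".join(buf).strip()
            buf.clear()
            if token:
                node.children.append(token)

        while pos < len(text):
            ch = text[pos]
            if ch == closer:
                pos += 1
                break
            if ch in PAIRS:
                flush()
                pos += 1
                node.children.append(parse_group(ch + PAIRS[ch], PAIRS[ch]))
            elif ch == '"':
                # Copy a quoted string verbatim so its commas and brackets are inert.
                end = text.find('"', pos + 1)
                end = len(text) if end == -1 else end + 1
                buf.append(text[pos:end])
                pos = end
            elif ch == "," and brace_type != "n":
                flush()
                pos += 1
            else:
                buf.append(ch)
                pos += 1
        flush()
        return node

    root = []
    while pos < len(text):
        # A top-level line is treated as a group closed by a newline.
        line = parse_group("n", "\n")
        if line.children:
            root.append(line)
    return root


sample = '''Variable = [1, 2]
Variable2 = {
Variable,
"Literal",
}
Variable[1] = Variable2
Namespace::Function(argument, {"string literal",
100})
'''
tree = parse(sample)
for line in tree:
    print(line)
```

Because a newline only closes a group at the top level, line breaks inside brackets end up in the token buffer and are stripped, which is what makes them insignificant there.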