I am learning Paul McGuire’s fabulous pyparsing Python module. I’m posting this for discussion and hopefully to get a deeper insight into pyparsing.
The following code parses a parenthetical expression that looks similar to a Django Q() object expression used for querying data. Here is an example expression:
'( title__regex=r"^(An?|The) +" ) | ( id__in=[1, 3, 4] )'
Here’s the code:
import pyparsing as p
ident = p.Word(p.identchars, p.identbodychars)
cond_op = p.Opt('!') + '='
quoted_string = p.quoted_string
unquoted_string = p.Word(p.string.printable, exclude_chars='()\'"')
rvalue = p.OneOrMore(quoted_string | unquoted_string)
infix_operator = p.one_of('& | ^').set_name('"logical operator"')
whitespace = p.ZeroOrMore(p.White())
expr = p.infix_notation(
    p.Combine(ident + whitespace + cond_op + whitespace + rvalue),
    [(infix_operator, 2, p.opAssoc.LEFT)]
)
string = '( title__regex=r"^(An?|The) +" ) | ( id__in=[1, 3, 4] )'
try:
    results = expr.parse_string(string, parse_all=True).as_list()
except p.ParseException as e:
    print(e.explain())
else:
    print(results)
The code correctly outputs:
[['title__regex=r"^(An?|The) +"', '|', 'id__in=[1, 3, 4]']]
What other solutions are there?
Is there a better way to express a sequence of unquoted and quoted characters? Here quotes prevent parens from being interpreted as the end or start of a parenthetical expression.
If I mangle the input string by removing the = char, i.e. title__regex r"^(An?|The) +", the parser returns:
(title__regex r"^(An?|The) +") | (id__in=[1, 3, 4])
              ^
ParseException: Expected "logical operator" term, found 'r' (at char 14), (line:1, col:15)
Why is the parser expecting a “logical operator”? It should be expecting “=”.
Check out some of the examples (https://github.com/pyparsing/pyparsing/tree/master/examples) that use infix_notation, such as simpleBool.py and simpleArith.py, and eval_arith.py, which not only parses but also evaluates the arithmetic expression.
Precedence of operations is important when parsing infix expressions – look at how NOT, AND, and OR get handled at different precedence levels in this short example:
import pyparsing as pp
logical_operand = pp.one_of("True False", as_keyword=True)
NOT, AND, OR = pp.CaselessKeyword.using_each("not and or".split())
logical_expression = pp.infix_notation(
    logical_operand,
    [
        (NOT, 1, pp.OpAssoc.RIGHT),
        (AND, 2, pp.OpAssoc.LEFT),
        (OR, 2, pp.OpAssoc.LEFT),
    ]
)
logical_expression.run_tests("""
True
True or False
True and not False
True and False or True
True and (False or True)
""", full_dump=False)
Gives this output (note how “not” gets higher precedence than “and”, and “and” gets higher precedence than “or”, and that this can be overridden using ()’s):
True
['True']
True or False
[['True', 'or', 'False']]
True and not False
[['True', 'and', ['not', 'False']]]
True and False or True
[[['True', 'and', 'False'], 'or', 'True']]
True and (False or True)
[['True', 'and', ['False', 'or', 'True']]]
With this template for the logical expression part of your parser, you just need to focus on how the logical_operand expression will be defined (in place of the True/False literals in the example). Maybe something like variable_name + pp.one_of("= !=") + rvalue.
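To make that suggestion concrete, here is a minimal sketch of such an operand plugged into infix_notation. The definitions of variable_name and rvalue are placeholder assumptions (an identifier, and a quoted string or integer), and putting &, ^ and | on separate precedence levels in that order simply mirrors Python's own operators; none of this comes from the original post.

```python
import pyparsing as pp

# Hypothetical stand-ins for the names used in the suggestion above.
variable_name = pp.Word(pp.identchars, pp.identbodychars)
integer = pp.Word(pp.nums)
rvalue = pp.quoted_string | integer

# Group keeps the three tokens of each condition together in the results.
condition = pp.Group(variable_name + pp.one_of("= !=") + rvalue)

# Operators listed from highest to lowest precedence (assumed ordering,
# borrowed from Python's &, ^ and |).
expr = pp.infix_notation(
    condition,
    [
        ("&", 2, pp.OpAssoc.LEFT),
        ("^", 2, pp.OpAssoc.LEFT),
        ("|", 2, pp.OpAssoc.LEFT),
    ],
)

text = '(name = "Bob" | id != 3) & active = 1'
result = expr.parse_string(text, parse_all=True).as_list()
print(result)
# [[[['name', '=', '"Bob"'], '|', ['id', '!=', '3']], '&', ['active', '=', '1']]]
```

Because each condition is a Group rather than a Combine, the name, operator and value arrive as separate tokens, which should be easier to turn into a query later than one concatenated string.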
Some other tips:
- Avoid inserting whitespace terms in your parser if you can help it. pyparsing will skip whitespace by default.
- Write out a BNF for yourself to think through what kinds of things you’ll want to parse for. It will help as a kind of checklist of what you need to implement in your parser.
- Your unquoted_string term will match a lot of things, even some important words like AND and OR. It is worth the trouble to make this term more specific.
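As a rough illustration of these tips, the sketch below starts from a small BNF, uses no explicit whitespace terms, and replaces the catch-all unquoted_string with specific value types. The BNF and the choice of value types (raw-prefixed strings, quoted strings, integers, bracketed lists) are assumptions based only on the example expression in the question, so treat it as a starting point rather than a complete grammar.

```python
import pyparsing as pp

# Informal BNF for this sketch (an assumption about the intended grammar):
#   expr      ::= condition | expr log_op expr | "(" expr ")"
#   condition ::= ident ("=" | "!=") rvalue
#   rvalue    ::= raw_string | string | integer | list
#   list      ::= "[" [ rvalue ("," rvalue)* ] "]"

ident = pp.Word(pp.identchars, pp.identbodychars)
cond_op = pp.one_of("= !=")
integer = pp.Word(pp.nums)
# Combine requires the r prefix to be directly adjacent to the opening quote.
raw_string = pp.Combine(pp.one_of("r R") + pp.quoted_string)

# Forward lets a list contain any rvalue, including nested lists.
rvalue = pp.Forward()
list_value = pp.Group(
    pp.Suppress("[")
    + pp.Opt(rvalue + (pp.Suppress(",") + rvalue)[...])
    + pp.Suppress("]")
)
rvalue <<= raw_string | pp.quoted_string | integer | list_value

# No whitespace terms anywhere: pyparsing skips whitespace between tokens.
condition = pp.Group(ident + cond_op + rvalue)
log_op = pp.one_of("& | ^")
expr = pp.infix_notation(condition, [(log_op, 2, pp.OpAssoc.LEFT)])

text = '( title__regex=r"^(An?|The) +" ) | ( id__in=[1, 3, 4] )'
parsed = expr.parse_string(text, parse_all=True).as_list()
print(parsed)
# [[['title__regex', '=', 'r"^(An?|The) +"'], '|', ['id__in', '=', ['1', '3', '4']]]]
```

The parentheses inside the regex are still protected because they sit inside a quoted string, and the list comes back as its own nested group instead of raw text.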