I have a simple markup grammar that can have matched and standalone tags:
{tag}Tagged content{/tag} plain text {standalone}{tag}Tagged again!{/tag}
I wrote the following pyparsing grammar:
import pyparsing as pp
pp.ParserElement.set_default_whitespace_chars("")
LBRACE, RBRACE, SLASH = map(pp.Suppress, "{}/")
IDENTIFIER_CHARS = pp.alphanums + "_"
TAG_NAME = pp.Word(IDENTIFIER_CHARS)
# Tags
OPEN_TAG = LBRACE + TAG_NAME("open_tag") + RBRACE
CLOSE_TAG = LBRACE + SLASH + TAG_NAME("close_tag") + RBRACE
# Forward declaring content due to its recursivity
content = pp.Forward()
# Main elements
TAGGED_CONTENT = pp.Group(OPEN_TAG + content + CLOSE_TAG)("tagged_content*")
PLAIN_TEXT = pp.Group(pp.CharsNotIn("{}"))("plain_text*")
STANDALONE_TAG = pp.Group(LBRACE + TAG_NAME + RBRACE)("standalone_tag*")
# Recursive definition of content
content <<= pp.ZeroOrMore(TAGGED_CONTENT | STANDALONE_TAG | PLAIN_TEXT)
test = "{tag}Tagged content{/tag} plain text {standalone}{tag}Tagged again!{/tag}"
p = content.parse_string(test, parse_all=True)
print(p.dump(indent=" "))
This currently matches the first occurrence as standalone without pairing the correct tags, which is a separate problem, but it highlights the situation I don’t quite get. Why does it choose {standalone} as standalone and not {tag}? At this point it can match either way, but TAGGED_CONTENT is first in the MatchFirst list, so I expected it to take precedence.
It gets worse when I try to extend the grammar to support optional attributes that look like this (yeah, nasty markup, but oh well…):
{tag}Tagged text.|attribute|{/tag}
PIPE = pp.Suppress("|")
ATTRIBUTE = PIPE + TAG_NAME("attribute") + PIPE
# Updated
TAGGED_CONTENT = pp.Group(OPEN_TAG + content + pp.Opt(ATTRIBUTE) + CLOSE_TAG)("tagged_content*")
This doesn’t work: the attribute is matched as PLAIN_TEXT along with the actual preceding text. I would think Opt would match if possible. Is there maybe a stronger Opt alternative? Or maybe my approach is flawed in general?
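To make the attribute case reproducible in one piece, the fragments above can be merged into a single script. This is a minimal sketch that assumes pyparsing 3.x (for Opt, parse_string and set_default_whitespace_chars); every name comes from the snippets above and nothing new is introduced. The comments describe the behaviour reported in this question, not a fix.

```python
import pyparsing as pp

# Treat whitespace as significant, as in the original grammar.
pp.ParserElement.set_default_whitespace_chars("")

LBRACE, RBRACE, SLASH, PIPE = map(pp.Suppress, "{}/|")
TAG_NAME = pp.Word(pp.alphanums + "_")

OPEN_TAG = LBRACE + TAG_NAME("open_tag") + RBRACE
CLOSE_TAG = LBRACE + SLASH + TAG_NAME("close_tag") + RBRACE
ATTRIBUTE = PIPE + TAG_NAME("attribute") + PIPE

# Forward declaration, because content is recursive.
content = pp.Forward()

TAGGED_CONTENT = pp.Group(
    OPEN_TAG + content + pp.Opt(ATTRIBUTE) + CLOSE_TAG
)("tagged_content*")
# CharsNotIn("{}") excludes only braces, so "|" is accepted as plain text.
PLAIN_TEXT = pp.Group(pp.CharsNotIn("{}"))("plain_text*")
STANDALONE_TAG = pp.Group(LBRACE + TAG_NAME + RBRACE)("standalone_tag*")

content <<= pp.ZeroOrMore(TAGGED_CONTENT | STANDALONE_TAG | PLAIN_TEXT)

result = content.parse_string("{tag}Tagged text.|attribute|{/tag}", parse_all=True)

# The pipes end up inside the plain-text group; "attribute" is never set.
print(result.as_list())
print("attribute" in result[0])
```

Running this shows the whole string "Tagged text.|attribute|" inside the plain-text group, which is the behaviour the question is about: by the time Opt(ATTRIBUTE) is tried, content has already consumed the pipes.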