I am writing a parser in Python which operates on a stream. The stream may be unbounded, so I cannot keep it all in memory. As a convenience, I would like to offer a regular expression matcher which matches part of the stream. I believe that means that I need to come up with a regular expression that matches any prefix of a full regular expression match. For purposes of this question, let us assume that all regular expression matches will only match a finite set of characters.* My intent is to simply buffer up the stream in chunks until the regular expression match is complete
I’m trying to figure out an algorithm to determine whether I need to fetch another chunk or not. If the match failed to find a match because it needs more characters to match or could match more characters, then I need to get another chunk. If it failed because the characters it already has could never match the string, then I should not buffer more data (because I might end up trying to buffer an unbounded amount of data). As an example, consider the regular expression for a pair of identifiers separated by a colon: ([0-9A-Za-z]+):([0-9A-Za-z]+)
This does not match “Hello”, but there’s the potential that the next chunk may give it a matching string such as “Hello:World”. However, the string “^Hello” cannot possibly match that regular expression because ^ is not part of [0-9A-Za-z]
. No additional characters could create a match (thus my parser needs to use a different rule to match the “^” first)
I believe this can be done with the careful application of ?
to make later parts of the regular expression optional. I’m looking for an algorithm that can transform a regular expression matching a string into a regular expression matching a prefix of any possible matching string.
My goal is to do this using only regular expression functionality available in Python. I do not want to code my own regex engine because the built-in one will be far faster than anything I can write in python.
*. In practice, I’ll end up setting a limit on how large a regular expression match can get, but I’d rather exclude that detail from this question.