I am trying to parse a series of mathematical formulas and need to extract variable names efficiently using Polars in Python.
Regex support in Polars seems to be limited, particularly with look-around assertions.
Is there a simple, efficient way to parse symbols from formulas?
Here is a snippet of my code:
import re
import polars as pl
# Define the regex pattern
FORMULA_DECODER = r"\b[A-Za-z][A-Za-z_0-9_]*\b(?!\()"
# \b             # Assert a word boundary to ensure matching at the beginning of a word
# [A-Za-z]       # Match an uppercase or lowercase letter at the start
# [A-Za-z0-9_]*  # Match zero or more following valid characters (letters, digits, or underscores)
# \b             # Assert a word boundary to ensure matching at the end of a word
# (?!\()         # Negative lookahead to ensure the match is not followed by an open parenthesis (indicating a function)
# Sample formulas
formulas = ["3*sin(x1+x2)+A_0",
            "ab*exp(2*x)"]
# expected result
pl.Series(formulas).map_elements(lambda formula: re.findall(FORMULA_DECODER, formula), return_dtype=pl.List(pl.String))
# Series: '' [list[str]]
# [
# ["x1", "x2", "A_0"]
# ["ab", "x"]
# ]
# Polars does not support this regex pattern
pl.Series(formulas).str.extract_all(FORMULA_DECODER)
# ComputeError: regex error: regex parse error:
#     \b[A-Za-z][A-Za-z_0-9_]*\b(?!\()
#                               ^^^
# error: look-around, including look-ahead and look-behind, is not supported