I’m working on a project where I need to evaluate complex logical queries on a DataFrame in Python. The queries can contain nested subqueries, and I need to process and evaluate each subquery individually before integrating the results back into the main query.Now in my case: using df.eval() would solve pretty much everything however, i need to know where exactly data was not matching my query, in which part of my condition / nested condition did it not match? Which column or subcondition was the main problem.
For some reason my head cant break it down any further and i am stuck with trying to replace parts of the query but then not being able to process that query through since i am changing the query in a way my function breaks.
Sample Data:
data = {
'a': ['ON', 'OFF', 'ON', 'OFF', 'ON'],
'b': [0, 0, 0, 0, 0],
'c': [0, 0, 0, 0, 0],
'd': [1, 0, 1, 1, 0],
'e': [2, 1, 2, 2, 1],
'f': [3, 3, 3, 3, 3]
}
df = pd.DataFrame(data)
Sample Query:
query_str = "((a == 'ON' and b == 0) or (c == 0 and d == 1)) and (e == 2 or e == 3)"
Query Parser
# Methode to just evaluate a simple expression using eval()
def evaluate_expression(expression, df):
try:
return df.eval(expression)
except Exception as e:
raise ValueError(f"Error evaluating expression: '{expression}': {e}")
# Methode to somehow slice my query into subqueries, my goal was to transform my query into something like this: ((True and True) or (True and False)) and (True or False). Having a) a mask for each single condition, b) a mask for every nested subcondition and then a final overall result.
So i would know that a: True, b: True, a and b: True and so on...
def parse_query(query_str, df):
result_df = pd.DataFrame(index=df.index)
def process_subquery(subquery, df):
while '(' in subquery:
match = re.search(r'(([^()]+))', subquery)
if match:
inner_query = match.group(1).strip()
result = evaluate_expression(inner_query, df)
result_df[match.group()] = result
subquery = subquery[:match.start()] + str(result_str) + subquery[match.end():]
else:
break
subquery = subquery.replace('AND', 'and').replace('OR', 'or')
return subquery
simplified_query = process_subquery(query_str, df)
...
If you want my full notepad please let me know, tried to keep it clean.
3