I’m currently learning about fuzzing and testing and there’s a part that I’m not too sure how to do.
I am given grammars like this:
grammar = {
"<start>": ["<product>;<quantity><separater><status>"],
"<product>": ["book", "game", "child", "software"],
"<quantity>": ["0", "1", "2", "3", "5", "10"],
"<status>": ["pending", "processed", "shipped", "cancelled", "<status>"],
"<separater>": [";", ""]
}
The strings that can be generated from this grammar will be used as input to test some basic programs (the programs only have one input()
max
And given a value n
, we want to generate n
input strings that maximise the branch coverage.
I currently have this code to generate a string from the grammar:
import random
import time
def tokenise(production):
"""extract tokens from a production rule."""
tokens = []
buffer = ""
in_token = False
for char in production:
if char == "<":
if buffer: # Flush buffer as a literal text
tokens.append(buffer)
buffer = ""
in_token = True
buffer += char
elif char == ">":
buffer += char
tokens.append(buffer)
buffer = ""
in_token = False
else:
buffer += char
if buffer: # Catch any remaining literals after the last token
tokens.append(buffer)
return tokens
def expand(symbol, grammar, depth=0, max_depth=10):
"""Recursively expands a symbol using the provided grammar."""
if depth > max_depth:
return ''
if symbol not in grammar: # Base case: symbol is a terminal
return symbol
production = random.choice(grammar[symbol])
# Split the production into tokens (both terminals and non-terminals)
tokens = tokenise(production)
# Recursively expand each token and concatenate the result
expanded = ''.join(expand(token, grammar, depth+1, max_depth) for token in tokens)
return expanded
However, the issue with this code is that it’s very inefficient because it uses random.choice()
to generate a string, and I need to test if it increases branch coverage. If it does, I add it to my list of test input strings.
To make it more efficient, I need to read the program as a tree using AST, and then use it to determine which strings in the grammar you should generate to cover specific branches. You can reduce the running time of string production in your grammar by preserving the non-terminal productions you generate and using those to generate sets of strings instead of starting form the start state every time.”
But I’m not quite sure how I’d do that. I can identify all the branches using AST, but how would I decide what path I’d take through the grammar? and how do I track the paths I’ve already taken to make sure I take the next best path?
I read about symbolic execution, but Im not allowed to not use it because only allowed to use Python standard libraries in the final exam, so it’ll be better practise to only use standard libraries.
I have tried randomly generating the input but that is very inefficient.
Tahsin Abedin is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.