I am parsing some game config files with Python and storing their contents in dictionaries. It all seemed to work well until I encountered the following edge case:
random_owned_controlled_state = {
    create_unit = {
        division = "name = "6. Belarusian Red Riflemen" division_template = "Belarusian Red Riflemen" start_experience_factor = 0.5"
        owner = PREV
    }
    create_unit = {
        division = "name = "7. Belarusian Red Riflemen" division_template = "Belarusian Red Riflemen" start_experience_factor = 0.5"
        owner = PREV
    }
}
I want to remove all spaces, tabs and newlines except those inside double quotes, and then delimit separate statements with semicolons, to get something like the following, which is easier to extract information from:
random_owned_controlled_state={create_unit={division="name = "6. Belarusian Red Riflemen" division_template = "Belarusian Red Riflemen" start_experience_factor = 0.5";owner=PREV;};create_unit={division="name = "7. Belarusian Red Riflemen" division_template = "Belarusian Red Riflemen" start_experience_factor = 0.5";owner=PREV;};};
I created this function to achieve this, or so I thought:
import re
from typing import List, Tuple


def remove_spacers(string: str, brackets: Tuple[str, str], operators: List[str], replacement: str, commentchar: str) -> str:
    """
    Get rid of all whitespace in the file: remove it or replace it with replacement
    :param string: Raw String
    :param brackets: Opening and closing bracket characters
    :param operators: Operators whose surrounding whitespace is removed
    :param replacement: What to replace with
    :param commentchar: Character used to indicate comments
    :return: Cleaned String
    """
    # Strip front and end
    result = string.strip(" \n\t")
    # Remove all comments from text
    result = re.sub(rf"{re.escape(commentchar)}.*", "", result)
    # Remove all whitespace bundles except for inside "" and replace with single space
    result = re.sub(r'\s+(?=([^"]*"[^"]*")*[^"]*$)', " ", result)
    # Remove whitespaces around operators (=,<,>)
    for operator in operators:
        result = re.sub(rf"\s*{re.escape(operator)}\s*", operator, result)
    # Remove whitespaces after {
    result = re.sub(rf"{re.escape(brackets[0])}\s", brackets[0], result)
    # Replace whitespaces after anything else with ;
    result = re.sub(r'\s(?=([^"]*"[^"]*")*[^"]*$)', replacement, result)
    # Replace any potential multiple consecutive ; with a single ;
    # (the quantifier braces are doubled so the f-string does not evaluate them)
    result = re.sub(rf"(?:{re.escape(replacement)}){{2,}}", replacement, result)
    # Make sure every statement ends with ;
    result = re.sub(rf"(?<!{re.escape(replacement)}){re.escape(brackets[1])}", rf"{replacement}{brackets[1]}", result)
    if not result.endswith(replacement):
        result += replacement
    result = f"{{{result}}}{replacement}"
    return result
Calling it like: cleaned_text = remove_spacers(text, ('{', '}'), ['=', '<', '>'], ";", "#")
I likely need to adjust the following regex line:
    result = re.sub(r'\s+(?=([^"]*"[^"]*")*[^"]*$)', " ", result)
But I am unsure how to achieve this behaviour, since I am by no means experienced with regex.
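To make the failure concrete, here is a small reproduction together with one possible direction. The lookahead decides "inside" or "outside" purely by quote parity, so the nested quotes in the division value flip the state and the inner name is treated as unquoted. The second half is only a sketch, not a full parser: the names PARITY_WS, compact and sample are made up for illustration, and it assumes one statement per line, no comments, and that a quoted value runs from the first to the last double quote on its line.

```python
import re

# The lookahead from the question: whitespace counts as "outside a string"
# when an even number of double quotes follows it.
PARITY_WS = re.compile(r'\s+(?=([^"]*"[^"]*")*[^"]*$)')

line = 'division = "name = "6. Red   Riflemen" factor = 0.5"'
# The inner quote flips the parity, so the run of spaces inside the inner
# name is collapsed even though it belongs to the outer quoted value.
collapsed = PARITY_WS.sub(" ", line)


def compact(text: str) -> str:
    """Hypothetical line-based alternative (assumptions listed above)."""
    parts = []
    for raw in text.splitlines():
        stmt = raw.strip()
        if not stmt:
            continue
        first, last = stmt.find('"'), stmt.rfind('"')
        if first != -1 and last > first:
            # Keep everything from the first to the last quote verbatim,
            # and drop whitespace only before and after that span.
            head = re.sub(r"\s+", "", stmt[:first])
            tail = re.sub(r"\s+", "", stmt[last + 1:])
            stmt = head + stmt[first:last + 1] + tail
        else:
            stmt = re.sub(r"\s+", "", stmt)
        # An opening bracket starts a block; everything else ends a statement.
        parts.append(stmt if stmt.endswith("{") else stmt + ";")
    return "".join(parts)


sample = '''outer = {
    division = "name = "6. Red   Riflemen" factor = 0.5"
    owner = PREV
}'''

print(collapsed)
print(compact(sample))
```

Working line by line sidesteps the parity problem for this particular layout, but it would break if several statements shared a line or a quoted value spanned multiple lines.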