As the title says: I have some weird code that my program parses line by line, and I need to strip the comments from it. Each line can contain text inside or outside of quotes.
Each comment starts with a #, but I can’t just replace #.+$
with nothing, because if a # occurs inside a string, I would remove half of that string along with all the code that comes after it, even though none of it is a comment.
So I want to first catch and sanitize (by removing them) all strings that contain a #.
The regex engine I’m using is the re library of Python.
Basically, I need a regex that will match all strings in a line that contain the # character, will not match strings without it, and won’t match # that is between strings.
I tried using a simple: "[^\r\n"]*?#[^\r\n"]*?"
but it fails on lines that have # between strings, like:
_,aa."bb"!@#cc1"dd,#/ee"(ff)"gg|#|hh"_ii "jj "_kk,"ll,## mm" #comment and "comment"
because it will catch "!@#cc1", which is not a string at all but the text between two strings,
and removing it also uncomments a lot of garbage.
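To make the failure easy to reproduce, here is a minimal sketch using Python's re module. It assumes the character class in the pattern above is meant to be [^\r\n"] (with backslashes); the variable names are only for illustration.

```python
import re

# Test line from this post (raw string, so nothing is escaped).
line = r'_,aa."bb"!@#cc1"dd,#/ee"(ff)"gg|#|hh"_ii "jj "_kk,"ll,## mm" #comment and "comment"'

# First attempt: any quote-delimited run that contains a '#'.
# Assumption: [^\r\n"] is the intended character class.
naive = re.compile(r'"[^\r\n"]*?#[^\r\n"]*?"')

matches = naive.findall(line)
print(matches)

# The bogus match sits between the closing quote of "bb" and the
# opening quote of "dd,#/ee"; the real string "dd,#/ee" is then missed.
print('"!@#cc1"' in matches)    # True
print('"dd,#/ee"' in matches)   # False
```

The pattern has no notion of which quote opens a string and which one closes it, so it happily starts a match at a closing quote.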
My 2nd approach was to count opening and closing quotes (escaped quotes and escaped backslashes are already sanitized, so that simplifies things), and I came up with this:
^[^\r\n"]*?("[^\r\n"]*?")*?[^\r\n"]*?"[^\r\n"]*?#[^\r\n"]*?"
but the problem is, it will only catch the first (or last, if I change the non-greedy *? to just *) string with # in it, not all.
So if I call regex replace on the line, it will only deal with 1 string that has # in it, and I need to catch all of them.
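The single-match behaviour can be checked with a short sketch like the one below. It again assumes [^\r\n"] is the intended character class, and it uses finditer rather than findall because the pattern contains a capturing group.

```python
import re

# Test line from this post.
line = r'_,aa."bb"!@#cc1"dd,#/ee"(ff)"gg|#|hh"_ii "jj "_kk,"ll,## mm" #comment and "comment"'

# Second attempt: skip complete "..." pairs from the start of the line,
# then match one string that contains a '#'.
# Assumption: [^\r\n"] is the intended character class.
anchored = re.compile(
    r'^[^\r\n"]*?("[^\r\n"]*?")*?[^\r\n"]*?"[^\r\n"]*?#[^\r\n"]*?"'
)

matches = list(anchored.finditer(line))

# Because of the leading ^ (without re.MULTILINE), a match can only
# start at position 0, so there is never more than one per line.
print(len(matches))
print(matches[0].group(0))
```

Here the match runs from the start of the line up to and including "dd,#/ee", and the later strings "gg|#|hh" and "ll,## mm" are never reached.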
Is there some smart way of achieving that? Counting quotes before the match through some kind of lookbehind?
I tried this in Notepad++, but it does not accept variable-length patterns (wildcards) inside a lookbehind.
Maybe Python is more forgiving?
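As far as I can tell, the standard re module has the same restriction: a lookbehind must have a fixed width. A small check, where compiles is just a hypothetical helper written for this post and the two patterns are arbitrary examples:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind: a single quote character before the '#'.
fixed = compiles(r'(?<=")#')

# Variable-width lookbehind: a quote followed by any number of non-quotes.
variable = compiles(r'(?<="[^"]*)#')

print(fixed, variable)  # True False
```

The third-party regex package is reported to support variable-length lookbehind, but that would mean an extra dependency instead of plain re.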
Also, I’m not good with lookahead/lookbehind stuff (I’m just dumb in this particular department).
Btw feel free to use this test string, because I think it covers most cases:
_,aa."bb"!@#cc1"dd,#/ee"(ff)"gg|#|hh"_ii "jj "_kk,"ll,## mm" #comment and "comment"