I am working on an NLP project that requires me to remove computer code from a piece of text. The code is enclosed between the tags <pre><code> and </code></pre>. I could do a simple regex match, but I want to generalize this function so that it can remove the text between any two specified strings, even when they are nested.
For example, if I have a string:
string = "<pre><code> Should be deleted <pre><code> also deleted </code></pre> delete this too </code></pre> don't delete <pre><code> delete </code></pre> "
then I expect the output to be the string " don't delete ".
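If I do re.sub('(<pre><code>)(.*)(</code></pre>)', '', string), I get an essentially empty string (only the trailing space survives), because the greedy .* matches everything from the first opening tag to the last closing tag.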
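To make the failure concrete, here is a small sketch comparing the greedy pattern with its lazy counterpart on the example string. The variable names `greedy` and `lazy` are only for illustration. Neither quantifier tracks nesting: the greedy one swallows the text that should be kept, and the lazy one stops at the first closing tag and leaves a dangling one behind.

```python
import re

string = ("<pre><code> Should be deleted <pre><code> also deleted </code></pre> "
          "delete this too </code></pre> don't delete <pre><code> delete </code></pre> ")

# Greedy: .* runs from the first opening tag to the LAST closing tag,
# so the text that should survive is removed as well.
greedy = re.sub(r'(<pre><code>)(.*)(</code></pre>)', '', string)

# Lazy: .*? stops at the FIRST closing tag, so the outer block is only
# partly removed and an unmatched closing tag is left in the result.
lazy = re.sub(r'(<pre><code>)(.*?)(</code></pre>)', '', string)

print(repr(greedy))
print(repr(lazy))
```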
I know how to do this for single-character delimiters, such as removing the text between curly braces. For example, if my string were string = "{ Should be deleted { also deleted } delete this too } don't delete { delete } ", then the following gives me the desired output:
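import re

a = '{'
b = '}'
# match an innermost pair: a, then any characters that are not a or b, then b
regexp = a + '[^' + a + '^' + b + ']*' + b
# repeat until no delimiters remain, peeling off one nesting level per pass
while a in string or b in string:
    string = re.sub(regexp, '', string)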
Here regexp evaluates to '{[^{^}]*}'. The while loop is necessary because a single call such as re.sub(r'{[^{^}]*}', '', string) only removes the innermost pairs, so it does not handle nested cases.
I tried to apply the same logic as in the single-character case by doing re.sub('(<pre><code>)[^(</code></pre>)^(<pre><code>)](</code></pre>)', '', string), but it returns the string unchanged: "<pre><code> Should be deleted <pre><code> also deleted </code></pre> delete this too </code></pre> don't delete <pre><code> delete </code></pre> ", which means nothing is matched.
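The attempt above fails because everything inside [...] is treated as a set of individual characters, so a character class cannot exclude a multi-character string (and the pattern also has no * after the class). One possible way to carry the single-character idea over to arbitrary delimiters is to replace the negated class with a negative lookahead that refuses to step onto either delimiter. The following is a minimal sketch under that assumption; `remove_between` is a hypothetical helper name, not an existing library function.

```python
import re

def remove_between(text, start, end):
    """Remove every start...end span from text, including nested spans."""
    s, e = re.escape(start), re.escape(end)
    # An innermost span: start, then any run of characters at which neither
    # delimiter begins, then end. DOTALL lets a span cross line breaks.
    innermost = re.compile(s + r'(?:(?!' + s + r'|' + e + r').)*' + e, re.DOTALL)
    while True:
        # Each pass strips one nesting level; stop when nothing matched.
        text, count = innermost.subn('', text)
        if count == 0:
            return text

string = ("<pre><code> Should be deleted <pre><code> also deleted </code></pre> "
          "delete this too </code></pre> don't delete <pre><code> delete </code></pre> ")
result = remove_between(string, "<pre><code>", "</code></pre>")
print(repr(result))
```

Looping until `subn` reports zero replacements, rather than while a delimiter is still present, means the function also terminates on unbalanced input; any unmatched delimiter is simply left in place. The same call works for the curly-brace case with `remove_between(text, "{", "}")`.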