I want to match the closing quote together with the opening quote of the following string if both are on the same line. Two strings may be separated either by a blank (` `) or by a blank-plus-blank (` + `).
Regex engine: Python
For instance, from
this is "some string" "; which should match" 234
"and this" + "should also match\"" "\"and this"
but not this: " " a " + "
I’d like to see matches for:
- line 1: `" "` between `some string` and `; which...`
- line 2: `" + "` between `and this` and `should also match"`, and `" "` between `should also match"` and `"and this`
- line 3: no matches
So in fact, I think it might be best to only match the groups `" "` and `" + "` if there is an odd number of quotes before and after the group. Since lookbehind/lookahead is fixed-length only, I didn’t find a good way to do it.
I tried

re.compile(r'(" \+ ")|(" ")(?!;|,)')

but this assumes that there is never a semicolon or comma at the start of a string. I also tried

re.compile(r'"[^"]+"')

but this only finds the strings themselves, not the "inter-string" quotes.
Here’s the character loop parsing method I mentioned above. I track whether we are inside a quote or not, and I track the characters between quotes.
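# Raw string, so the \" escapes in the sample reach check() unchanged.
data = r"""
this is "some string" "; which should match" 234
"and this" + "should also match\"" "\"and this"
but not this: " " a " + "
"""

def check(line):
    in_quotes = False   # are we currently inside a quoted string?
    between = "xxxx"    # text since the last quote; sentinel until one is seen
    found = []
    escape = False      # True right after a backslash
    for c in line:
        if escape:
            # This character is escaped; skip it.
            escape = False
        elif c == '"':
            # An opening quote preceded by a valid separator is a match.
            if not in_quotes and between in (' ', ' + '):
                found.append(between)
            between = ""
            in_quotes = not in_quotes
        elif c == '\\':
            escape = True
        elif not in_quotes:
            between += c
    return found

# strip() drops the empty first line produced by the leading newline.
for line in data.strip().splitlines():
    print(line)
    matches = check(line)
    print(matches)

Output:

this is "some string" "; which should match" 234
[' ']
"and this" + "should also match\"" "\"and this"
[' + ', ' ']
but not this: " " a " + "
[]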
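If the positions of the separators are needed rather than just their text, the same loop can be adapted. The following is only a sketch under that assumption: `find_spans` and `gap_start` are names made up for this illustration, and the separator text is read from a slice of the line instead of being accumulated character by character.

```python
def find_spans(line):
    """Return (start, end) index pairs of ' ' or ' + ' between two strings."""
    in_quotes = False
    escape = False
    gap_start = None  # index just after the most recent closing quote
    found = []
    for i, c in enumerate(line):
        if escape:
            # Character after a backslash: never treated as a quote.
            escape = False
        elif c == '\\':
            escape = True
        elif c == '"':
            if in_quotes:
                # Closing quote: a separator may start right after it.
                gap_start = i + 1
            elif gap_start is not None and line[gap_start:i] in (' ', ' + '):
                # Opening quote directly preceded by a valid separator.
                found.append((gap_start, i))
            in_quotes = not in_quotes
    return found


sample = r'"and this" + "should also match\"" "\"and this"'
for start, end in find_spans(sample):
    print(start, end, repr(sample[start:end]))
```

Each pair can be used directly as a slice, so `sample[start:end]` gives back the separator itself.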
So in fact, I think it might be best to only match the groups `" "` and `" + "` if there is an odd number of quotes before and after the group. Since lookbehind/lookahead is fixed-length only, I didn’t find a good way to do it.
If you use `re.finditer`, you can call `.start()` and `.end()` on each match to find where it starts and ends. That lets you take the substring before the match and count the quote characters in it. Consider the following simple example:
import re

text = 'uno " dos " tres " '
for m in re.finditer(r'\s"\s', text):
    # Everything before the match; count the quotes it contains.
    before = text[:m.start()]
    print("Match starting at", m.start(), "and ending at", m.end(), "has", before.count('"'), "quotes in front")
gives output
Match starting at 3 and ending at 6 has 0 quotes in front
Match starting at 9 and ending at 12 has 1 quotes in front
Match starting at 16 and ending at 19 has 2 quotes in front
In your case, you would need to split your text into lines before applying this solution.
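Applied to the question, the counting idea could look roughly like the sketch below. It is an illustration, not a tested solution for every input: `find_separators` is a hypothetical helper name, and it assumes that escaped characters such as `\"` should not count as quotes, so they are masked with two placeholder characters first (which keeps every index aligned with the original line). A lookahead is used for the opening quote so that a rejected candidate does not consume a quote that a later candidate needs.

```python
import re

def find_separators(line):
    """Return the ' ' / ' + ' separators that sit between two strings."""
    # Mask escape sequences like \" so they are not counted as quotes.
    masked = re.sub(r'\\.', 'xx', line)
    found = []
    # Closing quote, separator, then a lookahead for the opening quote.
    for m in re.finditer(r'"( | \+ )(?=")', masked):
        before = masked[:m.start()]      # text before the closing quote
        after = masked[m.end() + 1:]     # text after the opening quote
        # Odd counts on both sides: one string is open before the group
        # and one string still has to be closed after it.
        if before.count('"') % 2 == 1 and after.count('"') % 2 == 1:
            found.append(m.group(1))
    return found


text = r'''this is "some string" "; which should match" 234
"and this" + "should also match\"" "\"and this"
but not this: " " a " + "'''

for line in text.splitlines():
    print(find_separators(line))
```

Counting on the masked copy rather than the original line is what keeps the escaped quotes in the second sample line from flipping the parity.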
I would indeed suggest to match pairs of (unescaped) quotes and the part that follows it. With a little lookahead you can also put the next character in a capture group, so you can check that it is a quote (and not the end of the line).
Here is an implementation:
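import re

def get_matches(s):
    # Group 1: the text that follows a quoted string (up to the next quote
    # or line end). Group 2: the next quote, or empty if there is none.
    return (
        m.span(1)
        for m in re.finditer(r'"(?:\\.|[^"\n\r])*"((?:\\.|[^"\n\r])*)(?=("?))', s)
        if m.group(2) and m.group(1) in (' ', ' + ')
    )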
This generator yields spans, so you get the location where each match occurred.
Run as follows:
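# The backslash after the opening quotes keeps the string from starting
# with a newline, so offsets count from the first sample line.
s = """\
this is "some string" "; which should match" 234
"and this" + "should also match" " ""and this"
but not this: " " a " + "
"""

for start, end in get_matches(s):
    print(f'{start}:{end} = "{s[start:end]}"')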
Output:
21:22 = " "
59:62 = " + "
81:82 = " "
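To see why the filter on the two groups works, it can help to look at what each group holds for every quoted string on a line. The sketch below uses a pattern of the same shape as the one in `get_matches`, written with explicit backslashes; `PATTERN` and `show_groups` are names introduced only for this illustration.

```python
import re

# Quoted string (escapes allowed), then the text up to the next quote or
# line end (group 1), then a lookahead that captures the next quote, if
# there is one (group 2).
PATTERN = re.compile(r'"(?:\\.|[^"\n\r])*"((?:\\.|[^"\n\r])*)(?=("?))')

def show_groups(line):
    """Return (group 1, group 2) for every quoted string found in line."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(line)]


line = 'this is "some string" "; which should match" 234'
for following, next_quote in show_groups(line):
    print(repr(following), repr(next_quote))
# ' ' '"'     <- separator followed by another string: kept
# ' 234' ''   <- trailing text, no further quote: filtered out
```

Because the lookahead does not consume the next quote, that quote is still available as the start of the following match.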