I have input text that is loosely structured, but where each main line can spread over multiple sub-lines. The spreading can occur in different ways, however, and I can’t figure out how to catch all of these scenarios.
The basic structure of the input (a text file) is as follows:
11 abc 1 1
22 abc 2 2
def
33 abc
def 3 3
In English: a line is composed of 2 digits, then some text, and then 2 individual digits (e.g. “11 abc 1 1”). The text may, however, be spread over 2 or more lines. Sometimes the trailing digits appear on the first sub-line (e.g. “22 abc 2 2\ndef”), and sometimes on the last sub-line of a block of text that starts with the 2 digits (e.g. “33 abc\ndef 3 3”).
My regexes manage to catch only one of the two scenarios.
I always used the following expression to get the matches:
re.findall(pat, t, re.M|re.DOTALL|re.X)
So I used re.M to specifically allow multiline matches, re.DOTALL to include newline characters, and re.X to make the patterns more readable with whitespace.
I expect the following result:
[('11', 'abc', '1', '1', ''),
('22', 'abc', '2', '2', 'def'),
('33', 'abc\ndef', '3', '3', '')]
In other words, I want the numbers always to appear in the same locations of the tuples, and the text may be split in 2 parts (2nd and last position of the tuple), but none of the parts may be ignored.
I tried with the following:
pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?)?'
But this doesn’t catch the 2nd part of the “22 ...” line.
Then I tried a more greedy approach:
pat = r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*)?'
But this catches the entire string.
Then I tried a lookahead (a positive one, as written), with the intent to start the next match as soon as a double-digit is encountered:
pat = r'^(\d\d) \s (.*?) \s? (\d)\s(\d) (.*?) (?=\d\d)'
But this doesn’t catch the “33 ...” line, because it is the last line, and therefore no double-digits follow.
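To make the three attempts easy to reproduce, here is a minimal script that runs them against the sample input. The labels in `attempts` and the names `t`, `FLAGS`, and `results` are just my own for illustration, and I am assuming the input ends with a trailing newline.

```python
import re

# Sample input from above; the trailing newline is an assumption.
t = "11 abc 1 1\n22 abc 2 2\ndef\n33 abc\ndef 3 3\n"

# Same flags as in the findall call shown earlier.
FLAGS = re.M | re.DOTALL | re.X

# The three attempts, keyed by hypothetical labels.
attempts = {
    "lazy": r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*?)?',
    "greedy": r'^(\d\d) \s (.*?) \s (\d)\s(\d) (.*)?',
    "lookahead": r'^(\d\d) \s (.*?) \s? (\d)\s(\d) (.*?) (?=\d\d)',
}

# Run every pattern and keep the list of tuples that findall returns.
results = {name: re.findall(pat, t, FLAGS) for name, pat in attempts.items()}

# Print the number of matches and the leading two-digit field of each.
for name, matches in results.items():
    print(name, len(matches), [m[0] for m in matches])
```

With this input, the first pattern should find all three records but leave the last field empty for “22”, the second should return a single match that swallows the rest of the text, and the third should stop after “22”.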
I tried a few other crooked things, not worth mentioning, but I can’t find a solution to my problem.
Any hints would be greatly appreciated.