I have to do a lot of text matching, and I am relying on regex, which seems well suited to this.
This is the text I am searching, shown below. Note that there is a hierarchy: Test1_1 is under Test1, and Test1_1_1 and Test1_1_2 are under Test1_1.
Likewise, Test1_2 is under Test1, and Test1_2_1 and Test1_2_2 are under Test1_2.
<Class Test1>\n
<Class Test1_1>\n
<Class Test1_1_1>\n
<Function testAmplitude at 10.7.10.9>\n
<Function testAmplitude at 10.7.10.18>\n
<Function testAverage at 10.7.10.9>\n
<Function testAverage at 10.7.10.18>\n
<Function testSquareWave at 10.7.10.9>\n
<Function testSquareWave at 10.7.10.18>\n
......
<Class Test1_1_2>
<Function testAverage at 10.7.10.9> \n
<Function testAverage at 10.7.10.18> \n
<Function testSquareWave at 10.7.10.9> \n
<Function testSquareWave at 10.7.10.18> \n
........
<Class Test1_2>\n
<Class Test1_2_1>
<Function testAmplitude at 10.7.10.9>\n
<Function testAmplitude at 10.7.10.18>\n
<Function testAverage at 10.7.10.9>\n
<Function testAverage at 10.7.10.18>\n
<Function testSquareWave at 10.7.10.9>\n
<Function testSquareWave at 10.7.10.18>\n
......
<Class Test1_2_2>
<Function testAverage at 10.7.10.9> \n
<Function testAverage at 10.7.10.18> \n
<Function testSquareWave at 10.7.10.9> \n
<Function testSquareWave at 10.7.10.18> \n
........
I want to be able to get the two IPs that are part of a test, provided that they are under the correct hierarchy. For example, I want to search for all testAmplitude entries under Test1, Test1_1, Test1_1_1 and get both 10.7.10.9 and 10.7.10.18. I show an example in Python below.
I am using Python. The closest I could come up with is:
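import re
rec_pattern = re.compile(
    r"<CLASS TEST1>.+?"      # top-level class
    r"<CLASS TEST1_1>.+?"    # second-level class
    r"<CLASS TEST1_1_1>"     # third-level class
    r"("                     # outer group: captures everything matched below
    r"(?:.+?"                # lazily skip text up to the next function line
    r"<FUNCTION testAmplitude AT (?:.*?)>"
    r")*"                    # repeat for each testAmplitude line
    r")",
    re.IGNORECASE | re.DOTALL)
# self.output_text holds the text shown above
matches = re.findall(rec_pattern, self.output_text)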
The regex pattern would be:
<CLASS TEST1>.+?<CLASS TEST1_1>.+?<CLASS TEST1_1_1>((?:.+?<FUNCTION testAmplitude AT (?:.*?)>)*)
This gives me the text below in one match (not two separate matches, and with extra text; I only want the IPs, in two separate matches).
<Function testAmplitude at 10.7.10.9>
<Function testAmplitude at 10.7.10.18>
How can I get just 10.7.10.9 and 10.7.10.18, in two separate matches? I tried a non-greedy approach and it didn't work.
The other thing I found hard to tackle is that websites like https://regex101.com/ and https://regexr.com/ only seem to support one match, which makes it hard to test and find the correct answer. The workaround I was given was to test on these websites, find one match, and then test in Python to find multiple matches using findall. Is there a way to get these websites to show multiple matches, or are you familiar with other websites that can? Am I getting this wrong or missing something?
If the only difference between the two IPs is the last part (the "host identifier"), one way to do this is to write your patterns using:
\d{2}\.\d\.\d{2}\.\d
\d{2}\.\d\.\d{2}\.\d{2}
such as:
(?im)^\s*(<function\s+.*\sat\s+(\d{2}\.\d\.\d{2}\.\d))\b\s*>
(?im)^\s*(<function\s+.*\sat\s+(\d{2}\.\d\.\d{2}\.\d{2}))\b\s*>
Example for the first IP:
import re
st = """
<Class Test1>\n
<Class Test1_1>\n
<Class Test1_1_1>\n
<Function testAmplitude at 10.7.10.9>\n
<Function testAmplitude at 10.7.10.18>\n
<Function testAverage at 10.7.10.9>\n
<Function testAverage at 10.7.10.18>\n
<Function testSquareWave at 10.7.10.9>\n
<Function testSquareWave at 10.7.10.18>\n
......
<Class Test1_1_2>
<Function testAverage at 10.7.10.9> \n
<Function testAverage at 10.7.10.18> \n
<Function testSquareWave at 10.7.10.9> \n
<Function testSquareWave at 10.7.10.18> \n
........
<Class Test1_2>\n
<Class Test1_2_1>
<Function testAmplitude at 10.7.10.9>\n
<Function testAmplitude at 10.7.10.18>\n
<Function testAverage at 10.7.10.9>\n
<Function testAverage at 10.7.10.18>\n
<Function testSquareWave at 10.7.10.9>\n
<Function testSquareWave at 10.7.10.18>\n
......
<Class Test1_2_2>
<Function testAverage at 10.7.10.9> \n
<Function testAverage at 10.7.10.18> \n
<Function testSquareWave at 10.7.10.9> \n
<Function testSquareWave at 10.7.10.18> \n
"""
regex = r'(?im)^\s*(<function\s+.*\sat\s+(\d{2}\.\d\.\d{2}\.\d))\b\s*>'
print(re.findall(regex, st))
which gives you a list of tuples:
[('<Function testAmplitude at 10.7.10.9', '10.7.10.9'), ('<Function testAverage at 10.7.10.9', '10.7.10.9'), ('<Function testSquareWave at 10.7.10.9', '10.7.10.9'), ('<Function testAverage at 10.7.10.9', '10.7.10.9'), ('<Function testSquareWave at 10.7.10.9', '10.7.10.9'), ('<Function testAmplitude at 10.7.10.9', '10.7.10.9'), ('<Function testAverage at 10.7.10.9', '10.7.10.9'), ('<Function testSquareWave at 10.7.10.9', '10.7.10.9'), ('<Function testAverage at 10.7.10.9', '10.7.10.9'), ('<Function testSquareWave at 10.7.10.9', '10.7.10.9')]
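Because the pattern above has two capture groups, re.findall returns tuples. If only the addresses are wanted, one option is to keep a single capture group. The sketch below also widens the last octet to \d{1,2} so that both IPs match in one pass, and it names the function explicitly. The shortened sample text and the name ip_regex are made up for illustration; they are not part of the original answer.

```python
import re

# Hypothetical, shortened sample; the real input is the longer text above.
sample = """
<Class Test1_1_1>
<Function testAmplitude at 10.7.10.9>
<Function testAmplitude at 10.7.10.18>
<Function testAverage at 10.7.10.9>
"""

# A single capture group, so findall returns plain strings, not tuples.
# \d{1,2} in the last octet lets one pattern cover both .9 and .18.
ip_regex = r'(?im)^\s*<function\s+testAmplitude\s+at\s+(\d{2}\.\d\.\d{2}\.\d{1,2})\s*>'

ips = re.findall(ip_regex, sample)
print(ips)  # ['10.7.10.9', '10.7.10.18']
```

Each address comes back as its own list element, which is the "two separate matches" shape asked for in the question. Note that this still ignores which class the function lines belong to.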
If you only need the entire line, the capture groups can be removed, such as with:
(?im)^[^\r\n]*<function\s+.*\sat\s+\d{2}\.\d\.\d{2}\.\d{2}\b\s*>\s*$
Example for the second IP:
import re
st = """
<Class Test1>\n
<Class Test1_1>\n
<Class Test1_1_1>\n
<Function testAmplitude at 10.7.10.9>\n
<Function testAmplitude at 10.7.10.18>\n
<Function testAverage at 10.7.10.9>\n
<Function testAverage at 10.7.10.18>\n
<Function testSquareWave at 10.7.10.9>\n
<Function testSquareWave at 10.7.10.18>\n
......
<Class Test1_1_2>
<Function testAverage at 10.7.10.9> \n
<Function testAverage at 10.7.10.18> \n
<Function testSquareWave at 10.7.10.9> \n
<Function testSquareWave at 10.7.10.18> \n
........
<Class Test1_2>\n
<Class Test1_2_1>
<Function testAmplitude at 10.7.10.9>\n
<Function testAmplitude at 10.7.10.18>\n
<Function testAverage at 10.7.10.9>\n
<Function testAverage at 10.7.10.18>\n
<Function testSquareWave at 10.7.10.9>\n
<Function testSquareWave at 10.7.10.18>\n
......
<Class Test1_2_2>
<Function testAverage at 10.7.10.9> \n
<Function testAverage at 10.7.10.18> \n
<Function testSquareWave at 10.7.10.9> \n
<Function testSquareWave at 10.7.10.18> \n
"""
regex = r'(?im)^[^\r\n]*<function\s+.*\sat\s+\d{2}\.\d\.\d{2}\.\d{2}\b\s*>\s*$'
print(re.findall(regex, st))
That returns:
[' <Function testAmplitude at 10.7.10.18>\n', ' <Function testAverage at 10.7.10.18>\n', ' <Function testSquareWave at 10.7.10.18>\n', ' <Function testAverage at 10.7.10.18> \n', ' <Function testSquareWave at 10.7.10.18> \n ', ' <Function testAmplitude at 10.7.10.18>\n', ' <Function testAverage at 10.7.10.18>\n', ' <Function testSquareWave at 10.7.10.18>\n', ' <Function testAverage at 10.7.10.18> \n', ' <Function testSquareWave at 10.7.10.18> \n \n\n']
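Neither pattern above restricts the search to one class, which the question asks for. A possible two-step sketch: first cut out the block that follows a class header, up to the next <Class ...> line, then run re.findall with a single capture group on that block only. This assumes that class names are unique and that a class section ends at the next <Class line. The helper name ips_under and the shortened sample are hypothetical, and the helper checks only the innermost class name, not the full Test1 / Test1_1 / Test1_1_1 chain.

```python
import re

# Hypothetical, shortened version of the text from the question.
text = """
<Class Test1>
<Class Test1_1>
<Class Test1_1_1>
<Function testAmplitude at 10.7.10.9>
<Function testAmplitude at 10.7.10.18>
<Function testAverage at 10.7.10.9>
<Class Test1_1_2>
<Function testAverage at 10.7.10.9>
<Class Test1_2>
<Class Test1_2_1>
<Function testAmplitude at 10.7.10.9>
<Function testAmplitude at 10.7.10.18>
"""

def ips_under(text, class_name, func_name):
    """Return the IPs of func_name listed directly under class_name."""
    # Step 1: capture everything after the class header, lazily, up to
    # the next "<Class " header or the end of the text.
    block = re.search(
        rf"<Class {re.escape(class_name)}>(.*?)(?=<Class |\Z)",
        text,
        re.IGNORECASE | re.DOTALL,
    )
    if block is None:
        return []
    # Step 2: a single capture group, so findall yields one string per match.
    return re.findall(
        rf"<Function {re.escape(func_name)} at ([\d.]+)\s*>",
        block.group(1),
        re.IGNORECASE,
    )

print(ips_under(text, "Test1_1_1", "testAmplitude"))  # ['10.7.10.9', '10.7.10.18']
print(ips_under(text, "Test1_1_2", "testAmplitude"))  # []
```

Splitting the work this way keeps each regex small, and the lazy (.*?) with a lookahead stops the block at the next class header, so matches from Test1_2_1 do not leak into the Test1_1_1 result. It only works for classes that list functions directly; a parent such as Test1 yields an empty block here.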