I need to scan text files and pull their text, but there are certain sections I want to ignore.
The sections to ignore start with:
--- start private ---
and end with:
--- end private ---
I don’t mind if the start and end phrases are included or not.
Example
test1
test2
--- start private ---
this text should not be returned by the regex
--- end private ---
test3
test4
--- start private ---
this text should also not be returned by the regex
--- end private ---
test5
Only the lines starting with “test” should be returned.
I have the .NET 6 flavour of Regex to do this with, but I don’t have access to the .NET code. I can only use regex to get what I need.
The canonical approach in .NET would be to use `Regex.Replace` to replace everything that starts with `--- start private ---` and ends with `--- end private ---` with an empty string, leaving just the lines you want.
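For comparison, the replace-based approach might look like the sketch below. It uses Python's `re` module as a stand-in for .NET, and the pattern itself is an assumption (the answer does not spell one out): it lazily matches one private block, including an optional trailing newline.

```python
import re

# Assumed pattern, not taken from the answer: one private block,
# matched lazily so that each block is removed separately.
PRIVATE_BLOCK = r"(?s)--- start private ---.*?--- end private ---\n?"

text = (
    "test1\n"
    "--- start private ---\n"
    "secret\n"
    "--- end private ---\n"
    "test2\n"
)

# Replace every private block with an empty string.
public_text = re.sub(PRIVATE_BLOCK, "", text)
print(public_text)
```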
However, you mention in your question that you cannot use `Regex.Replace`; you can only parameterize an existing call to `Regex.Match`. So we need to get creative.
This should do it:
(?s)(^|--- end private ---).*?(--- start private ---|$)
It matches everything that
- starts at the start of the document (`^`) or with `--- end private ---`, and
- ends with `--- start private ---` or at the end of the document (`$`).
`(?s)` is the single-line flag, which ensures that newlines are also matched by `.`.
The question mark at the end of `.*?` ensures that the match is non-greedy. Otherwise the regex would just match everything from the start to the end of the document.
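To see which chunks this pattern yields, here is a minimal sketch that runs it over the sample from the question. It uses Python's `re` module as a stand-in; the constructs involved (`(?s)`, `^`, `$`, lazy `.*?`) should behave the same way there, but that equivalence with .NET is an assumption. It also assumes the host code iterates over every match rather than taking only the first one.

```python
import re

PATTERN = r"(?s)(^|--- end private ---).*?(--- start private ---|$)"

SAMPLE = "\n".join([
    "test1",
    "test2",
    "--- start private ---",
    "this text should not be returned by the regex",
    "--- end private ---",
    "test3",
    "test4",
    "--- start private ---",
    "this text should also not be returned by the regex",
    "--- end private ---",
    "test5",
])

# Collect every match; each one is a public chunk plus its separator lines.
matches = [m.group(0) for m in re.finditer(PATTERN, SAMPLE)]
for chunk in matches:
    print(repr(chunk))
```

Each match still contains the separator lines that bound it, but none of the private text.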
If you want to exclude the separator lines, you can use look-ahead/look-behind assertions for matching the separator texts:
(?s)(^|(?<=--- end private ---)).*?((?=--- start private ---)|$)
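The look-around variant can be checked the same way. This is again a sketch in Python's `re` module rather than .NET, under the same assumptions as before: the host code iterates over all matches, and fixed-width look-behind behaves alike in both engines.

```python
import re

PATTERN = r"(?s)(^|(?<=--- end private ---)).*?((?=--- start private ---)|$)"

SAMPLE = "\n".join([
    "test1",
    "test2",
    "--- start private ---",
    "this text should not be returned by the regex",
    "--- end private ---",
    "test3",
    "test4",
    "--- start private ---",
    "this text should also not be returned by the regex",
    "--- end private ---",
    "test5",
])

# The separators are only asserted, not consumed, so they stay out of the matches.
matches = [m.group(0) for m in re.finditer(PATTERN, SAMPLE)]
for chunk in matches:
    print(repr(chunk))
```

The matches now hold only the public lines, along with the newlines that surrounded the separator lines.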