I am trying to build an efficient regular expression that joins multiple lines meeting the following constraints into a single line.
- The first lines start with an upper case letter and end with a colon character.
- Other consecutive lines contain 1 to 10 words; the first word should start with an upper case letter. The rest of the words can be mixed case. All the words in a line can contain characters such as ,-./
Positive Samples:
Example 1:
My favorite books are as follows:\nA great book\nThe Effective Book onscala.io\nTest Book\n
Matches:
My favorite books are as follows:\nA great book\nThe Effective Book onscala.io\nTest Book\n
Example 2:
My favorite books are as follows:\nA great book\nanother sentence\nAnother book\n
Matches: only the consecutive lines whose first letter is upper case.
My favorite books are as follows:\nA great book\n
Negative Sample:
My favorite books are as follows:\na great book\nanother sentence\n
Doesn’t match
I created the following regular expression, but it matches every line regardless of whether it starts with an uppercase.
(^[A-Z].*:)\n(^[A-Z].+(?:\s+[a-zA-Z0-9,./ ]+){1,10})
If the above regex matched the expected lines, I would replace \n with a space and comma.
You may use this regex to get your matches:
^[A-Z][^:\n]*:\n(?:[A-Z][\w,./-]*(?:\h+[\w,./-]+){0,9}\n)+
RegEx Details:
- ^ : Start of a line
- [A-Z] : Match an upper case letter
- [^:\n]* : Match 0 or more characters that are not : or a line break
- :\n : Match colon followed by a line break
- (?: : Start non-capture group
- [A-Z][\w,./-]* : Match first word that must start with an uppercase letter
- (?:\h+[\w,./-]+){0,9} : Match 0 to 9 other words, each preceded by 1+ horizontal whitespace characters
- \n : Match a line break
- )+ : End non-capture group. Repeat this group 1+ times
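The pattern above can be tried outside a regex tester. The following is a minimal Python sketch, not part of the original answer: Python's re module does not support \h, so [ \t] is swapped in for it, and re.MULTILINE is assumed so that ^ can anchor at any line start. The helper names join_block and join_lines and the ", " delimiter are illustrative assumptions based on the question's stated goal.

```python
import re

# [ \t] stands in for \h, which Python's re module does not support.
BLOCK = re.compile(
    r"^[A-Z][^:\n]*:\n"                                # header line ending in a colon
    r"(?:[A-Z][\w,./-]*(?:[ \t]+[\w,./-]+){0,9}\n)+",  # 1+ lines of 1 to 10 words
    re.MULTILINE,
)

def join_block(match):
    # Assumed delimiter: ", " between the joined lines.
    lines = match.group(0).rstrip("\n").split("\n")
    return ", ".join(lines) + "\n"

def join_lines(text):
    # Rewrite every matched block as a single line; other text is left alone.
    return BLOCK.sub(join_block, text)

example_2 = "My favorite books are as follows:\nA great book\nanother sentence\nAnother book\n"
negative = "My favorite books are as follows:\na great book\nanother sentence\n"

print(repr(join_lines(example_2)))
print(repr(join_lines(negative)))
```

With these inputs, only the header and the first upper-case line of Example 2 are joined, and the negative sample is returned unchanged.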
First off use this site to test things: https://regex101.com/
It will save you so much time.
This should get you what you want, although you’ll need to make sure you’re not using a multiline flag so that it doesn’t match your negative case.
^([A-Z](.*\h?){0,9}\n?)+
You can replace .* with a more restricted pattern for a single line (excluding the first character) like this:
^(([A-Z])([\w:.,-]*\h?){0,9}\n?)+
What’s important here is that you do need a starting match, but you also need a \n ending match to make sure you only partially match your negative case. I added the \h to only match tabs and spaces but not newlines.
Here’s a demo with it
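The effect of the multiline flag on the restricted pattern can also be checked locally. This is a rough Python sketch under a couple of assumptions: [ \t] replaces \h (which Python's re module does not support), and the helper name matches is purely illustrative.

```python
import re

# [ \t] stands in for \h, which Python's re module does not support.
PATTERN = r"^(([A-Z])([\w:.,-]*[ \t]?){0,9}\n?)+"

example_2 = "My favorite books are as follows:\nA great book\nanother sentence\nAnother book\n"
negative = "My favorite books are as follows:\na great book\nanother sentence\n"

def matches(text, flags=0):
    # Collect the full text of every match found in the input.
    return [m.group(0) for m in re.finditer(PATTERN, text, flags)]

# Without the multiline flag, ^ anchors only at the very start of the string.
print(matches(example_2))
print(matches(negative))

# With the multiline flag, later upper-case lines start matches of their own.
print(matches(example_2, re.MULTILINE))
```

Without the flag, Example 2 yields a single match that stops before the lower-case line, and the negative sample matches only its header line. With re.MULTILINE, the trailing "Another book" line is picked up as a separate second match.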
The biggest issues with your regex are:
- It wasn’t matching anything without a multiline flag. If you use the website I linked you can see that the second start of string meta character does not work without the m flag. That’s because it will only match the start of the entire string unless you tell it to match the start of each line (that’s what the multiline aspect means).
- .+(?:\s+[a-zA-Z0-9,./ ]+) is redundant and not actually doing what you think. .+ is basically matching the entire line except the last word (based on the space), and then [a-zA-Z0-9,./ ] matches the last word or symbol and then repeats for 15 lines.
Problem 2 isn’t that bad because it does do what you want for the most part, but what I provide should be lean enough that you can build off of it if you run into problems later. Again, your original regex could work if you remove the second ^, but I imagine something might mess up because there’s a lot of unnecessary things in it.
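Both issues can be reproduced with the regex from the question. The snippet below is a small Python sketch for illustration only. One likely reason the pattern runs past lower-case lines is that \s also matches line breaks, so the repeated group can step from one line to the next.

```python
import re

# The regex from the question, unchanged.
ORIGINAL = r"(^[A-Z].*:)\n(^[A-Z].+(?:\s+[a-zA-Z0-9,./ ]+){1,10})"

example_2 = "My favorite books are as follows:\nA great book\nanother sentence\nAnother book\n"

# Issue 1: without the multiline flag the second ^ can never match mid-string.
no_flag = re.search(ORIGINAL, example_2)
print(no_flag)

# Issue 2: with the flag, \s+ consumes the line breaks, so the second group
# also swallows the line that starts with a lower-case letter.
with_flag = re.search(ORIGINAL, example_2, re.MULTILINE)
print(repr(with_flag.group(2)))
```

Here the first search finds nothing, while the second captures all three remaining lines in group 2, including "another sentence".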