I have a text with special cases that I must handle when splitting it into an array of sentences.
Here is an example of a possible text, together with the regular expression I tried: https://regex101.com/r/QDbm5b/1
In my case, a sentence ends with a dot followed by whitespace, and the next sentence must start with an uppercase letter. If no sentence follows, the remaining text should still be recognized as a sentence. If any of these conditions is not met, for example the uppercase letter is missing, the words must be assigned to the preceding sentence.
I need this because there might be inputs such as 2000 I.E. or Unit 2.5.
Examples:
- Lorem ipsum dolor sit amet,invidunt ut 06.03.24 invidunt 12.03.24. This is the next sentence. (In this case everything is fine and the text is split into two sentences.)
- Lorem ipsum dolor sit amet,invidunt ut 06.03.24 invidunt 12.03.24. this is the next sentence. (In this case the second part starts with a lowercase letter, so the text must not be split into two sentences.)
- Lorem ipsum dolor sit amet,invidunt ut 06.03.24 invidunt 12.03.24. This is the next sentence.
Lorem ipsum dolor sit amet,invidunt ut 06.03.24 invidunt 12.03.24. This is the next sentence.
(In this case the regex should detect four sentences.)
My approach to get these sentences is this regular expression: \b\.{1}\s{1}[A-Z]. But it only matches the dot, the whitespace and the uppercase letter, as you can see on regex101. I haven’t found a working solution on the web for my needs.
Which regular expression would fit my needs?
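
One possible direction, shown here only as a minimal sketch in Python (the language is an assumption, since the question does not name one): the pattern above consumes the dot, the whitespace and the uppercase letter, so it describes the boundary rather than the sentence. Splitting on the whitespace alone, with a lookbehind for the dot and a lookahead for the uppercase letter, keeps both in the resulting sentences. The names SENTENCE_BOUNDARY and split_sentences are hypothetical, and the sketch assumes that a line break counts as whitespace and that only ASCII uppercase letters start a sentence.

```python
import re
from typing import List

# Boundary = whitespace that is preceded by a dot and followed by an
# uppercase ASCII letter. The lookbehind and lookahead are zero-width, so
# the dot stays with the sentence before and the letter with the one after.
# Assumption: one or more whitespace characters (including newlines) count.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences at 'dot, whitespace, uppercase letter'."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    first = ("Lorem ipsum dolor sit amet,invidunt ut 06.03.24 invidunt "
             "12.03.24. This is the next sentence.")
    second = ("Lorem ipsum dolor sit amet,invidunt ut 06.03.24 invidunt "
              "12.03.24. this is the next sentence.")

    print(split_sentences(first))                 # two sentences
    print(split_sentences(second))                # one sentence (lowercase after the dot)
    print(len(split_sentences(first + "\n" + first)))  # 4
```

This rule alone cannot tell an abbreviation from a sentence end when the abbreviation is followed by an uppercase word: "2000 I.E. Then ..." would still be split after "I.E.". Handling that would need an explicit list of abbreviations or a different rule.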