I want to improve the NLTK sentence tokenizer. Unfortunately, it doesn’t work too well when the text doesn’t leave any whitespace between the period and the next sentence.
from nltk.tokenize import sent_tokenize
text = "I love you.i hate you.I understand. i comprehend. i have 3.5 lines.I am bored"
sentences = sent_tokenize(text)
sentences
Output:
['I love you.i hate you.I understand.',
'i comprehend.',
'i have 3.5 lines.I am bored']
With a regex I can split the first item into 3 separate sentences. However, I don’t know how to also capture the last sentence, which doesn’t end in a punctuation mark.
import re
new_sentences = []
for i in sentences:
    sents = re.findall(r'\w+.*?[.?!$](?!\d)', i, flags=re.S)
    new_sentences.extend(sents)
new_sentences
Output:
['I love you.',
'i hate you.',
'I understand.',
'i comprehend.',
'i have 3.5 lines.']
I put the $ there to indicate the end of the line, but it doesn’t seem to have any effect.
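
A likely cause: inside a character class such as [.?!$], the $ is a literal dollar sign, not an end-of-string anchor, so a trailing sentence with no punctuation never matches. Below is a minimal sketch of one possible fix, which moves the anchor out of the brackets and into an alternation. It assumes the sentences list is the sent_tokenize output shown above (hardcoded here so the snippet runs without NLTK); the names pattern and chunk are only illustrative.

```python
import re

# Output of sent_tokenize from the post, hardcoded so the snippet runs without NLTK.
sentences = [
    "I love you.i hate you.I understand.",
    "i comprehend.",
    "i have 3.5 lines.I am bored",
]

# A sentence ends either at '.', '?' or '!' not followed by a digit
# (so "3.5" stays intact), or at the end of the string.
# The $ sits outside the brackets, where it acts as an anchor.
pattern = re.compile(r"\w+.*?(?:[.?!](?!\d)|$)", flags=re.S)

new_sentences = []
for chunk in sentences:
    new_sentences.extend(pattern.findall(chunk))

print(new_sentences)
```

With this pattern the unpunctuated tail "I am bored" is returned as its own item. Note that $ also matches just before a trailing newline; \Z could be used instead if only the absolute end of the string should count.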