After extracting lots of text from a PDF using PyPDF2, I realized that most of the text had spacing issues, as PyPDF2 wasn’t able to pick up many of the spaces in the PDF.
This resulted in text being either bunched up together or split apart.
An example of the issue I’m trying to solve:
‘thespeedofsoundinsea’
This is one of the strings extracted from a question paper PDF.
After multiple attempts at splitting the text into its constituent words, I was not able to solve the problem due to inconsistencies in the dictionary datasets I was using (namely english_words, enchant and NLTK WordNet). There was also an issue with the word ‘a’, because it was difficult to tell whether ‘a’ was a standalone word or part of another word.
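To illustrate the ‘a’ ambiguity, here is a minimal sketch that lists every way a string can be split into known words. The word set `TOY_WORDS`, the helper `all_splits`, and the input "awayfrom" are hypothetical and chosen only for illustration; they are not taken from the real dictionaries or from the PDF.

```python
# Hypothetical toy word set, used only to illustrate the ambiguity.
TOY_WORDS = {"a", "way", "away", "from"}


def all_splits(sequence, words):
    """Return every segmentation of `sequence` into entries of `words`."""
    if not sequence:
        return [[]]  # one way to split the empty string: no words at all
    splits = []
    for end in range(1, len(sequence) + 1):
        prefix = sequence[:end]
        if prefix in words:
            # Keep this prefix and segment the remainder recursively.
            for rest in all_splits(sequence[end:], words):
                splits.append([prefix] + rest)
    return splits


for split in all_splits("awayfrom", TOY_WORDS):
    print(split)
```

Both "a way from" and "away from" come back as valid splits, so dictionary membership alone cannot decide whether ‘a’ stands on its own.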
I haven’t found an answer for the inconsistencies and the ‘a’ issue, but I figured out an answer to the issue where a word would be broken down into other words: a ‘greedy longest match’, which for each start position tries the longest candidate first and shrinks it from the back until it matches a dictionary word. Here’s a snippet of the code:
from english_words import get_english_words_set
from nltk.corpus import wordnet  # requires nltk.download("wordnet") once
import enchant

d = enchant.Dict("en_US")
eng_set = set(get_english_words_set(['web2'], lower=True))


def check_dict(word):
    # A string counts as a word if any of the three sources accepts it.
    return word in eng_set or d.check(word) or bool(wordnet.synsets(word))


def divide_into_largest_words(sequence):
    divided_words = []
    length = len(sequence)
    start = 0
    while start < length:
        # Greedy longest match: start with the longest candidate and shrink it.
        end = length
        # Stop at one character so an unknown letter never yields an empty slice.
        while end > start + 1 and not check_dict(sequence[start:end]):
            end -= 1
        divided_words.append(sequence[start:end])
        start = end
    return divided_words


sequence = "twostudentsaremeasuringthespeedofsound"
largest_words = divide_into_largest_words(sequence)
print("Divided into largest words:", largest_words)
However, this fell short because of the dataset inconsistencies again. The output was
[‘twos’, ‘tu’, ‘dents’, ‘are’, ‘measuring’, ‘the’, ‘speed’, ‘of’, ‘sound’]
because the datasets consider ‘twos’ and ‘tu’ to be words (I’m not sure why).
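The failure can be reproduced without the three dictionaries. The sketch below runs the same greedy longest match against a small, hypothetical word set (`TOY_WORDS` and `greedy_split` are illustrative names, not part of any of the libraries above) to show how a single extra entry such as "twos" derails everything after it.

```python
# Hypothetical toy word set standing in for the real dictionaries.
TOY_WORDS = {"two", "twos", "tu", "students", "dents", "are"}


def greedy_split(sequence, words):
    """Greedy longest match: at each position take the longest known prefix."""
    result = []
    start = 0
    while start < len(sequence):
        end = len(sequence)
        # Shrink the candidate until it is a known word or a single character.
        while end > start + 1 and sequence[start:end] not in words:
            end -= 1
        result.append(sequence[start:end])
        start = end
    return result


print(greedy_split("twostudentsare", TOY_WORDS))
# Same input with "twos" removed from the word set.
print(greedy_split("twostudentsare", TOY_WORDS - {"twos"}))
```

With "twos" present the greedy pass commits to it and is then forced into "tu" and "dents"; with "twos" removed the same input splits into "two", "students", "are". The algorithm never revisits an earlier choice, so any spurious long entry in the dictionary can cause this.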
Can someone help me solve the dataset inconsistencies (are there any datasets of English words without as many inconsistencies?) and the issue of checking whether ‘a’ is meant to be standalone or not in the sentence? Thanks!