Find exact and similar phrases in a string with millions of characters
I have a list of phrases and a corpus, which is a single string of text containing millions of words. For each phrase in my phrase list, I want to find and record the most similar phrases in the corpus string.
For my purposes I need to use SBERT similarity, and I found the sentence-transformers library to be the best option.
My problem is that while there is documentation for finding similarities between two lists, I couldn't find any for finding a list of phrases within a large string. I tried splitting my string into a list of sentences, but the compute time is incredibly long: for each phrase in the phrase list, I need to loop through every sentence (and there are plenty) and then append all matches to a dictionary I am creating.
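To make the goal concrete, here is a minimal sketch of the kind of batched lookup I am after, not a working solution I have verified at scale. It assumes the corpus has already been split into sentences, that embeddings for both lists fit in memory, and that NumPy is available. The function name `top_matches` and its `encode` parameter are hypothetical names I made up for illustration; `encode` stands in for something like the `encode` method of a sentence-transformers model.

```python
import numpy as np


def top_matches(phrases, sentences, encode, top_k=3):
    """Return {phrase: [(sentence, cosine_score), ...]} with the best matches first.

    `encode` is a hypothetical parameter: any callable that maps a list of
    strings to a 2-D array-like of embeddings, one row per string.
    """
    # Encode each list once, instead of once per (phrase, sentence) pair.
    phrase_emb = np.array(encode(phrases), dtype=np.float32)
    sent_emb = np.array(encode(sentences), dtype=np.float32)

    # L2-normalise the rows so that a dot product equals cosine similarity.
    phrase_emb /= np.linalg.norm(phrase_emb, axis=1, keepdims=True) + 1e-12
    sent_emb /= np.linalg.norm(sent_emb, axis=1, keepdims=True) + 1e-12

    # One matrix product gives every phrase-vs-sentence score at once.
    scores = phrase_emb @ sent_emb.T

    k = min(top_k, len(sentences))
    results = {}
    for i, phrase in enumerate(phrases):
        best = np.argsort(-scores[i])[:k]  # indices of the k highest scores
        results[phrase] = [(sentences[j], float(scores[i, j])) for j in best]
    return results


# Possible usage with sentence-transformers (the model name is an assumption):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# matches = top_matches(my_phrases, my_sentences, model.encode, top_k=5)
```

The idea is to replace the nested Python loop with a single matrix product over precomputed embeddings. As far as I can tell, sentence-transformers also ships `util.semantic_search`, which does a similar top-k lookup in chunks, so it may be a better fit when the full score matrix is too large to hold in memory.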