Looking through the documentation about n-grams and the different vectorizers, I came across the multi-word expression tokenizer (MWETokenizer), which locates phrases in a text and converts them into a single token.
from nltk.tokenize import MWETokenizer

sents = "in a nutshell"
mwe_tok = MWETokenizer()
# add_mwe expects a sequence of tokens, not a raw string
mwe_tok.add_mwe(sents.split())
output = mwe_tok.tokenize(sents.split())
The result is a list containing a single token, the three words concatenated with an underscore ("_"):
["in", "a", "nutshell"] -> ["in_a_nutshell"]
From here one can use a vectorizer on the documents to compute word frequencies. I noticed that something similar can be achieved by setting the n-gram range to greater than 1.
from nltk import ngrams

sents = "in a nutshell"
# ngrams returns a generator of tuples, so materialize it with list()
grams = list(ngrams(sents.split(), 3))

The result is a list of 3-word tuples; for this sentence there is exactly one:

["in", "a", "nutshell"] -> [("in", "a", "nutshell")]
In scikit-learn, the same idea is exposed through the ngram_range parameter of CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(ngram_range=(1, 3))
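As a sketch with a tiny made-up corpus (get_feature_names_out needs scikit-learn >= 1.0; note also that the default token_pattern ignores single-character words such as "a"):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "that is the idea in a nutshell",
    "the idea works in practice",
]

# Count every unigram, bigram, and trigram across the corpus.
count_vectorizer = CountVectorizer(ngram_range=(1, 3))
counts = count_vectorizer.fit_transform(corpus)

# The vocabulary mixes plain unigrams with overlapping bigrams and
# trigrams, e.g. "the idea" and "idea in nutshell" ("a" is dropped
# by the default token_pattern).
print(count_vectorizer.get_feature_names_out())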
What are the benefits of using the MWETokenizer as opposed to setting the n-gram range? Does it matter, or do these two perform the same function, just differently?
This is more a question about methodology: I am trying to understand why one would perform one task over the other, and whether one is better than the other.