I'm writing a function that takes in tokenized text and converts it into n-grams of order n, so n=2 for bigrams, n=5 for five-grams, and so on. I'm trying to add special tokens at the beginning and end: I need n-1 special tokens at the beginning and 1 special token at the end.
This is my function, and it works perfectly fine except I don’t know a good way to add the special tokens that I mentioned above:
from nltk.util import ngrams  # assuming `ngrams` is NLTK's helper

def create_ngrams(tokens, n):
    n_grams = ngrams(tokens, n)
    start = '<s>'
    end = '</s>'
    ## FIGURE OUT HOW TO ADD SPECIAL TOKENS BASED ON THE ORDER, n
    return [grams for grams in n_grams]
This is what it's currently outputting:
[('1609', 'sonnets'),
('sonnets', 'william'),
('william', 'shakespeare'),
('shakespeare', '1'),
('1', 'fairest'),
('fairest', 'creatures'),
('creatures', 'desire'), ...
But for bigrams, n=2, I want it to look like this:
[('<s>', '1609'),
('1609', 'sonnets'),
('sonnets', 'william'),
('william', 'shakespeare'),
('shakespeare', '1'),
('1', 'fairest'),
('fairest', 'creatures'), ...
Of course, it would also have one ending token, </s>, but I only copied the beginning of the output.
I need it to work for other n-gram orders as well; I just used n=2 as an example.
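To make the desired behavior concrete, here is a minimal sketch of one possible way to do the padding in plain Python, without NLTK. The function name `create_padded_ngrams` and its `start`/`end` parameters are hypothetical, and it assumes `tokens` is a list of strings and n >= 1:

```python
def create_padded_ngrams(tokens, n, start='<s>', end='</s>'):
    """Return n-grams of `tokens`, padded with n-1 start tokens and 1 end token."""
    # Assumption: n >= 1. For n == 1 no start tokens are added, only the end token.
    padded = [start] * (n - 1) + list(tokens) + [end]
    # Slide a window of width n across the padded sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


tokens = ['1609', 'sonnets', 'william']
print(create_padded_ngrams(tokens, 2))
# [('<s>', '1609'), ('1609', 'sonnets'), ('sonnets', 'william'), ('william', '</s>')]
```

For n=3 the same call would start with ('<s>', '<s>', '1609') and end with ('sonnets', 'william', '</s>'), i.e. two start tokens and a single end token.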