I am trying to create a new tokenizer by reordering the token ids in my existing tokenizer based on frequency. In theory, the order of token ids should have no effect on performance or usability, but the reordered tokenizer no longer recognizes some of the tokens (see the final output below). I am doing this for a variety of reasons, and there isn't another way to accomplish what I need other than reordering. Why is this happening?
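For clarity, this toy sketch (made-up tokens, not my actual code) is the kind of permutation I have in mind:

# Toy illustration of the intent: permute ids, keep the token strings intact.
vocab = {"the": 0, "cat": 1, "sat": 2}              # token string -> id
by_frequency = ["sat", "the", "cat"]                # most frequent first (made up)
reordered = {tok: new_id for new_id, tok in enumerate(by_frequency)}
print(reordered)                                    # {'sat': 0, 'the': 1, 'cat': 2}

My actual code is below.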
import json
import numpy as np
from transformers import GPT2Tokenizer

top_amt = 10000
old_tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
inputs = old_tokenizer("Hello, my dog is cute", return_tensors="pt")
print("input ids: ", inputs['input_ids'])
Outputs:
input ids: tensor([[15496, 11, 616, 3290, 318, 13779]])
# Map token strings to ids from the original vocabulary
tokens_to_id = old_tokenizer.get_vocab()
# Obtain frequencies (tokenized_val is my validation set, already tokenized
# with old_tokenizer)
frequencies = np.zeros(len(tokens_to_id))
for sample in tokenized_val['input_ids']:
    for token in sample:
        frequencies[token] += 1
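As a side note, the same counts could be computed without the Python loop; this sketch assumes each sample is a plain list (or array) of ints:

# Equivalent vectorized count (assumption: samples are lists/arrays of ints)
all_ids = np.concatenate([np.asarray(s) for s in tokenized_val['input_ids']])
frequencies = np.bincount(all_ids, minlength=len(tokens_to_id)).astype(float)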
# Sort token ids by descending frequency
sorted_indices = np.argsort(frequencies)[::-1]
toptenk = sorted_indices[:top_amt]
print("Top 10 tokens: ", old_tokenizer.decode(toptenk[:10]))
# Decoding single ids yields duplicate strings, so the remaining tokens are re-IDed
top_tokens = {old_tokenizer.decode([sorted_indices[i]]): i for i in range(len(sorted_indices))}
top_keys = list(top_tokens.keys())
top_tokens = {top_keys[i]: i for i in range(len(top_keys))}
print("Occurrences of least used token: ", top_keys[top_amt], frequencies[top_tokens[top_keys[top_amt]]])
Outputs:
Top 10 tokens: <|endoftext|>.
and the, to a was it
Occurrences of least used token: Often 477.0
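As an aside, the duplicates mentioned in the code comment above can be confirmed directly; this is just a diagnostic of mine, not part of the pipeline:

# Decode every id individually and count how many distinct strings come back.
decoded = [old_tokenizer.decode([i]) for i in range(len(tokens_to_id))]
# If the set is smaller, decode() maps different ids to the same string.
print(len(decoded), len(set(decoded)))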
vocab_file_path = 'reduced_vocab.json'
with open(vocab_file_path, 'w') as f:
    f.write(json.dumps(top_tokens))
print(f"Reduced vocabulary file saved to {vocab_file_path}")

# merges.txt is the merges file saved from the original tokenizer
new_tokenizer = GPT2Tokenizer(vocab_file=vocab_file_path, merges_file="merges.txt", pad_token='<|endoftext|>')
new_tokenizer.save_pretrained("TinyStoriesTokenizer")
print("Tokenizer saved.")
inputs = new_tokenizer("Hello, my dog is cute", return_tensors="pt")
print("input ids: ", inputs['input_ids'])
Outputs:
Reduced vocabulary file saved to reduced_vocab.json
Tokenizer saved.
input ids: tensor([[564, 5, 0, 0, 0, 0]])
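A check that might help narrow this down (my own sketch, using only get_vocab()) is comparing the token strings present in the two vocabularies:

# Compare the token strings in the original vs. the reordered vocab.
old_keys = set(old_tokenizer.get_vocab().keys())
new_keys = set(new_tokenizer.get_vocab().keys())
print(len(old_keys & new_keys), "token strings shared, out of", len(old_keys))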