```python
from tensorflow.keras.layers import TextVectorization


def get_vectorize_layer(all_text, vocab_size, max_seq, special_tokens=["[MASK]"]):
    """
    Build Text vectorization layer

    Args:
        all_text (list): List of input strings.
        vocab_size (int): Vocabulary size.
        max_seq (int): Maximum sequence length.
        special_tokens (list, optional): List of special tokens. Defaults to ['[MASK]'].

    Returns:
        layers.Layer: The adapted TextVectorization Keras layer.
    """
    vectorize_layer = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        # standardize=custom_standardization,
        output_sequence_length=max_seq,
    )
    vectorize_layer.adapt(all_text)

    # Insert mask token in vocabulary: drop '' and '[UNK]' (set_vocabulary
    # re-adds them), trim to leave room for the special tokens, and append
    # '[mask]' as the last entry.
    vocab = vectorize_layer.get_vocabulary(include_special_tokens=True)
    vocab = vocab[2 : vocab_size - len(special_tokens)] + ["[mask]"]
    vectorize_layer.set_vocabulary(vocab)
    return vectorize_layer
```
This code snippet was adapted from the Keras masked-language-modeling tutorial (https://keras.io/examples/nlp/masked_language_modeling/). I am creating a TextVectorization layer with a [mask] token added to the vocabulary. When I then tokenize a sample text such as "I go swimming in the [mask] with friends", I would like [mask] to be tokenized as its index in the vocabulary. What is occurring now is that the layer tokenizes [mask] as [UNK], i.e. as 1, instead of as [mask]'s vocabulary index.
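For reference, here is a minimal sketch of how I am calling the layer (`all_text` is my corpus; the `vocab_size` and `max_seq` values are just placeholders):

```python
vectorized_layer = get_vectorize_layer(all_text, vocab_size=3000, max_seq=20)

vocab = vectorized_layer.get_vocabulary()
mask_id = vocab.index("[mask]")  # the id I expect "[mask]" to map to

tokens = vectorized_layer(["I go swimming in the [mask] with friends"])
print(mask_id)             # e.g. 2999 ("[mask]" is the last vocabulary entry)
print(tokens.numpy()[0])   # "[mask]" comes out as 1, i.e. [UNK]
```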
I am assuming this is because I adapt the layer first, so it learns its token->id "mapping" (I don't know if this is accurate, just speculation), and when I then set the vocabulary, the original token->id mapping is not updated. However, I need to change this mapping just to add [mask] -> vocab.index('[mask]').
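To check that assumption, I inspected the vocabulary the layer reports after `set_vocabulary` has run (again just a sketch):

```python
vocab = vectorized_layer.get_vocabulary()
print(vocab[:2])   # ['', '[UNK]'] - the special tokens are re-added up front
print(vocab[-1])   # '[mask]' - so the token itself is in the vocabulary
```

So `[mask]` is present in the vocabulary, yet the layer still maps the input string `[mask]` to 1.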
Does this seem like a systems/OS issue, or a coding/understanding one? Any help is much appreciated.
At first I just tried assigning vectorized_layer("[mask]").numpy()[0][0] = vocab.index('[mask]'), since I believed vectorized_layer("[mask]").numpy()[0][0] was its mapping, but that did not work.
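In code, that attempt looked like this; as far as I can tell, `.numpy()` returns a plain NumPy copy of the output tensor, so the assignment only mutates that copy and never touches the layer's internal token->id table:

```python
out = vectorized_layer(["[mask]"]).numpy()  # NumPy copy of the output tensor
out[0][0] = vocab.index("[mask]")           # mutates only the local copy
print(vectorized_layer(["[mask]"]).numpy()[0][0])  # still 1, i.e. [UNK]
```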