I have the following code:
import keras

v = {
    "deck": ['a', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
}
print(len(v["deck"]))

l = keras.layers.TextVectorization(
    max_tokens=len(v["deck"]) + 2,
    vocabulary=v["deck"],
    output_mode='count',
    name="deck")

print(l.vocabulary_size())
print(l.get_vocabulary())
print(l('a A b B'))
The output is:
12
13
['[UNK]', 'a', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
tf.Tensor([2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], shape=(13,), dtype=float32)
I would expect at least one of the b's to be counted.
If I use l.adapt(v["deck"]) instead of passing the vocabulary, things seem to work as expected, but the vocabulary is all in lowercase.
Like this:
import keras

v = {
    "deck": ['a', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
}
print(len(v["deck"]))

l = keras.layers.TextVectorization(
    max_tokens=len(v["deck"]) + 2,
    # vocabulary=v["deck"],
    output_mode='count',
    name="deck")
l.adapt(v['deck'])

print(l.vocabulary_size())
print(l.get_vocabulary())
print(l('a A b B'))
And the output:
12
13
['[UNK]', 'l', 'k', 'j', 'i', 'h', 'g', 'f', 'e', 'd', 'c', 'b', 'a']
tf.Tensor([0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 2.], shape=(13,), dtype=float32)
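Both outputs look consistent with the layer lowercasing its input before the lookup, while a vocabulary passed via vocabulary= is used exactly as given. This assumes the default standardize="lower_and_strip_punctuation" setting is in effect, which I have not changed. The sketch below is plain Python rather than Keras code; standardize and count_vector are hypothetical helpers that only approximate the layer's behavior, so treat it as an illustration of the suspected mechanism, not as the actual implementation.

```python
import string


def standardize(text):
    # Assumption: approximates the default "lower_and_strip_punctuation" step.
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


def count_vector(text, vocabulary, lowercase_input=True):
    # Index 0 is reserved for the out-of-vocabulary token "[UNK]".
    vocab = ["[UNK]"] + list(vocabulary)
    index = {token: i for i, token in enumerate(vocab)}
    # Split on whitespace after the optional standardization step.
    tokens = (standardize(text) if lowercase_input else text).split()
    counts = [0] * len(vocab)
    for token in tokens:
        # Tokens missing from the vocabulary are counted in the "[UNK]" slot.
        counts[index.get(token, 0)] += 1
    return counts


deck = ['a', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']

# Lowercased input, mixed-case vocabulary: 'b' never matches 'B'.
mixed = count_vector('a A b B', deck)

# Lowercased input, lowercased vocabulary: both 'a' and 'b' are found.
lowered = count_vector('a A b B', [t.lower() for t in deck])

# No lowercasing at all: only exact-case matches are counted.
raw = count_vector('a A b B', deck, lowercase_input=False)

print(mixed)
print(lowered)
print(raw)
```

If this is indeed the cause, two things seem worth trying in the real layer: lowercasing the vocabulary before passing it in, or passing standardize=None to TextVectorization so that the input is matched case-sensitively.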