I am trying to create a program that displays a word and lets the user attempt to pronounce what is on the screen, as an introduction to pronunciation and vocabulary. The language I am working with is Thai, but the examples here will be in English.
I found some basic starting points in this GitHub repo (https://github.com/PyThaiNLP/pythaiasr), which I have been able to use to transcribe sentences I speak. It is based on existing models like this HuggingFace dataset. I don’t know whether it works from a dictionary of words or by combining phonemes into words.
Either way, how can I restrict its predictions to a specific set of words/phrases, and get a confidence level for each one? There would be 44 in total, one per consonant in the alphabet.
For example, an English program might have the “phrases”:
[b ball, k car, d dog, f fish, j jelly, l lemon, m moon, ...]
It should not recognize mix-ups like these:
[b car, d fish, l dog, m ball, ...]
Assume that the words already exist in the language.
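To make the goal concrete, here is a rough sketch of the kind of scoring I have in mind. It assumes (I have not confirmed this for pythaiasr) that the model is CTC-based and can give me a matrix of per-frame log-probabilities over its character vocabulary. The function names, the toy vocabulary, and the fake frames are all made up for illustration. The idea is to score every allowed phrase against the audio with the CTC forward algorithm, then normalize those scores over the allowed phrases only to get confidences.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, target, blank=0):
    """Log-probability that `target` (a list of token ids) is the CTC
    collapse of the frames. log_probs[t][k] is the log-probability of
    token k at frame t (hypothetical model output)."""
    # Interleave blanks: [blank, c1, blank, c2, ..., blank]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)

    # A valid alignment starts on the leading blank or the first token.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            paths = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                paths.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[s - 2])          # skip the blank in between
            new[s] = logsumexp(*paths) + log_probs[t][ext[s]]
        alpha = new

    # A valid alignment ends on the last token or the trailing blank.
    return logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]


def rank_phrases(log_probs, phrases, vocab):
    """Confidence for each allowed phrase, normalized over the allowed set."""
    scores = {
        p: ctc_log_likelihood(log_probs, [vocab[ch] for ch in p])
        for p in phrases
    }
    total = logsumexp(*scores.values())
    return {p: math.exp(s - total) for p, s in scores.items()}


# Toy data (invented): id 0 is the CTC blank, the rest are characters.
vocab = {"c": 1, "a": 2, "r": 3, "d": 4, "o": 5, "g": 6}


def fake_frame(token_id, size=7, peak=0.88):
    """One fake frame putting `peak` probability on a single token."""
    rest = (1.0 - peak) / (size - 1)
    return [math.log(peak if k == token_id else rest) for k in range(size)]


# Four fake frames that "sound like" d, o, <blank>, g.
frames = [fake_frame(4), fake_frame(5), fake_frame(0), fake_frame(6)]

confidences = rank_phrases(frames, ["dog", "car"], vocab)
for phrase, conf in sorted(confidences.items(), key=lambda kv: -kv[1]):
    print(phrase, round(conf, 4))
```

In a real setup the `frames` matrix would come from the model's output for my recording rather than from `fake_frame`, and the vocabulary would be the model's own. Because the confidences are normalized over the allowed phrases only, I would probably also need a threshold on the raw log-likelihood to reject audio that matches none of them.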
I’m a bit new to AI and neural networks in general, but if someone can point me to a method or framework, I can take it from there. Thanks!