I understand how a transformer model combines the encoder input and the shifted decoder output during training, but I don’t understand how it runs a loop to generate the output word by word once training is done.
My guess is that the output sequence starts with just a start token; the model is fed the input together with this partial output, the prediction at the last filled position is taken as the next token, that token is written into the next slot of the output, and the whole thing is fed back into the model (see the code sketch after the walkthrough).
Something like this:
The goal is to translate ‘I Am Student’ into ‘Ich Bin Student’
English tokens: {0: <pad>, 1: <start>, 2: <end>, 3: Am, 4: Student, 5: I}
German tokens: {0: <pad>, 1: <start>, 2: <end>, 3: Bin, 4: Ich, 5: Student}
Loop 1.
input: [1, 5, 3, 4, 2, 0]
output: [1, 0, 0, 0, 0]
predict: transformer((input, output)) = [4, 0, 0, 0, 0]
next-token = 4 (Ich)
Loop 2.
input: [1, 5, 3, 4, 2, 0]
output: [1, 4, 0, 0, 0]
predict: transformer((input, output)) = [4, 3, 0, 0, 0]
next-token = 3 (Bin)
Loop 3.
input: [1, 5, 3, 4, 2, 0]
output: [1, 4, 3, 0, 0]
predict: transformer((input, output)) = [4, 3, 5, 0, 0]
next-token = 5 (Student)
Loop 4.
input: [1, 5, 3, 4, 2, 0]
output: [1, 4, 3, 5, 0]
predict: transformer((input, output)) = [4, 3, 5, 2, 0]
next-token = 2 (<end>)
Loop 5.
next-token from Loop 4 is 2 (<end>), so the loop ends here; the generated tokens [4, 3, 5] map back to ‘Ich Bin Student’.
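To make the loop concrete, here is a minimal runnable sketch in Python. The `transformer` function is a hypothetical stand-in that simply replays the predictions from the walkthrough above; a real trained model would return logits over the target vocabulary at every decoder position, and greedy decoding would take the argmax at the last filled position.

```python
PAD, START, END = 0, 1, 2

# German vocabulary from the example, used to map ids back to words.
de_tokens = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "Bin", 4: "Ich", 5: "Student"}

def transformer(src, out):
    """Hypothetical stand-in for a trained model.

    A real model would return logits over the target vocabulary for
    every decoder position; here we just replay the predictions from
    the walkthrough above, keyed by how many slots are filled so far.
    """
    replay = {
        1: [4, 0, 0, 0, 0],   # Loop 1
        2: [4, 3, 0, 0, 0],   # Loop 2
        3: [4, 3, 5, 0, 0],   # Loop 3
        4: [4, 3, 5, 2, 0],   # Loop 4
    }
    n_filled = sum(1 for t in out if t != PAD)
    return replay[n_filled]

def greedy_decode(src, max_len=5):
    out = [START] + [PAD] * (max_len - 1)   # output starts as just <start>
    for i in range(1, max_len):
        pred = transformer(src, out)        # predict a token for every position
        next_token = pred[i - 1]            # read the last filled position
        out[i] = next_token                 # write it into the next slot
        if next_token == END:               # stop once <end> is produced
            break
    return [t for t in out if t not in (PAD, START, END)]

src = [1, 5, 3, 4, 2, 0]                    # <start> I Am Student <end> <pad>
print(" ".join(de_tokens[t] for t in greedy_decode(src)))  # Ich Bin Student
```

With a real model, `pred[i - 1]` would be the argmax over the logits at position `i - 1`, but the shape of the loop and the stop condition stay the same.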
Am I thinking right?