I don’t really understand the difference between using classical MLPs and self-attention transformers for NLP. What do self-attention transformers do that MLPs can’t? How is it different from just adding more hidden layers? I understand the gist of computing keys and queries and then forming the attention weights, but intuitively (and I know I’m wrong) that seems like an extra abstraction to do what an MLP could do with more hidden layers. What’s fundamentally wrong with the architecture of an MLP for NLP that transformers fix?
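
For reference, here’s roughly what I have in mind, in case I’m misunderstanding something. It’s just a minimal NumPy sketch of single-head scaled dot-product self-attention next to a plain hidden layer, so the function names and shapes are my own, not from any particular library:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no masking, no multi-head)."""
    Q = X @ W_q                               # queries, one per token
    K = X @ W_k                               # keys, one per token
    V = X @ W_v                               # values, one per token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                        # each output mixes the values, weighted per input

def mlp_hidden_layer(X, W, b):
    """A plain hidden layer: the same fixed weights hit every input, regardless of content."""
    return np.maximum(0.0, X @ W + b)         # ReLU(XW + b)

# toy sequence: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```

The part I’m stuck on is why the data-dependent `weights` matrix here buys you anything that stacking more `mlp_hidden_layer`-style layers couldn’t.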