What are the best practices for optimizing batch size and learning rate in training Large Language Models (LLMs)?
How should these two hyperparameters be adjusted relative to each other to achieve efficient convergence without hurting final model quality?
Additionally, could you provide a concise example illustrating the interplay between batch size and learning rate adjustments in training an LLM on a text generation task?
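For concreteness, here is a minimal sketch of the kind of interplay I mean, assuming the linear-scaling heuristic (Goyal et al., 2017) and its square-root variant that is often suggested for Adam-style optimizers. The baseline values (`BASE_BATCH_SIZE`, `BASE_LR`), the helper `scaled_lr`, and the toy model are illustrative placeholders I made up, not recommendations:

```python
import torch
import torch.nn as nn

# Illustrative baseline hyperparameters (placeholders, not prescriptions).
BASE_BATCH_SIZE = 32
BASE_LR = 3e-4

def scaled_lr(batch_size: int, base_batch: int = BASE_BATCH_SIZE,
              base_lr: float = BASE_LR, rule: str = "linear") -> float:
    """Scale the learning rate with batch size.

    'linear' follows the linear-scaling heuristic (Goyal et al., 2017);
    'sqrt' is the square-root variant sometimes preferred for Adam-style
    optimizers. Both are heuristics, not guarantees.
    """
    ratio = batch_size / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)

# Toy stand-in for an LLM: an embedding table with a language-model head.
VOCAB = 1000
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))

batch_size = 256                          # e.g., scaled up 8x from baseline
lr = scaled_lr(batch_size, rule="sqrt")   # sqrt rule, since we use AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

# Larger batches and learning rates are typically paired with warmup.
warmup_steps = 200
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

# One illustrative training step on synthetic token data
# (next-token prediction, as in a text generation task).
tokens = torch.randint(0, VOCAB, (batch_size, 16))
logits = model(tokens)                    # (batch, seq, vocab)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, VOCAB),    # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1))            # targets are the next tokens
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
print(f"batch={batch_size}, lr={lr:.2e}, loss={loss.item():.3f}")
```

In particular, I would like to know whether this pairing of batch-size-dependent learning-rate scaling with warmup reflects current practice at LLM scale, and when the linear rule should be preferred over the square-root variant (or vice versa).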