I am currently coding a transformer model from scratch and have completed it. I usually code just for fun and don't overthink what I do, but I have never tuned a model's hyperparameters before. I started learning Optuna and have implemented random search with it. Here's the code defining the hyperparameter search space:
d_model = trial.suggest_categorical('d_model', [64, 128, 256, 512, 1024])
d_v = trial.suggest_categorical('d_v', [32, 64, 128, 256, 512])
d_ff = trial.suggest_categorical('d_ff', [128, 256, 512, 1024, 2048, 4096])
N = trial.suggest_categorical('N', [1, 2, 3, 4, 5, 6, 7, 8])
h = trial.suggest_categorical('h', [2, 4, 8])
dropout = trial.suggest_categorical('dropout', [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
optimizer = trial.suggest_categorical('optimizer', ['adam', 'adamw'])
lr = trial.suggest_float('lr', 1e-4, 1e-2, log=True)
Of course, the paper "Attention Is All You Need" uses a warmup-based formula for the learning rate, but for now I am sampling it this way instead.
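For comparison, the paper's schedule can be written as a small function (the function name and step handling are my own; warmup_steps = 4000 is the value used in the paper):

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning-rate schedule from 'Attention Is All You Need':
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    It increases linearly for warmup_steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # guard against step 0 (would divide by zero)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

So in the paper the learning rate is fully determined by d_model and the step count rather than being a tuned hyperparameter.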
Now, while studying, I couldn't figure out exactly how many trials they ran in the paper. So my question is: how many trials should one go for during hyperparameter tuning?
I tried simply counting the total number of combinations, but realized that since the learning rate is sampled from a continuous range, the same discrete combination can recur with a different learning rate on each trial.
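To make that concrete, the discrete part of the search space above is finite and easy to count (a quick sketch; the sizes mirror the lists in my code):

```python
import math

# Number of options for each categorical hyperparameter above.
choices = {
    'd_model': 5, 'd_v': 5, 'd_ff': 6, 'N': 8,
    'h': 3, 'dropout': 6, 'optimizer': 2,
}

# Total discrete combinations, ignoring the continuous lr.
n_discrete = math.prod(choices.values())
print(n_discrete)  # 43200
```

Each of these 43,200 discrete configurations is then paired with a learning rate drawn from a continuous range, so the combined space is effectively infinite and repeats of the discrete part are expected.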