I am trying to write a custom script that can train an ELECTRA model from scratch. First I created the generator with the following code:
```python
import torch
from transformers import ElectraConfig, ElectraModel


class ElectraForGenerating(torch.nn.Module):
    def __init__(self, hidden_size, vocab_size):
        super(ElectraForGenerating, self).__init__()
        config = ElectraConfig.from_json_file("./config.json")
        self.ElectraModel = ElectraModel(config)
        # project the encoder output to vocabulary logits
        self.generator_predictions = torch.nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids, token_type_ids=None, position_ids=None):
        encoder_output = self.ElectraModel(input_ids)
        generator_predictions = self.generator_predictions(encoder_output.last_hidden_state)
        return generator_predictions
```
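For reference, a quick shape check on dummy data looks roughly like this (the batch shape is arbitrary; `hidden_size=64` and `vocab_size=30522` come from the config below):

```python
import torch

generator = ElectraForGenerating(hidden_size=64, vocab_size=30522)

dummy_input_ids = torch.randint(0, 30522, (8, 128))  # (batch_size, seq_len)
logits = generator(dummy_input_ids)
print(logits.shape)  # torch.Size([8, 128, 30522]) -- one distribution over the vocab per position
```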
With the following config (the generator is sized at 1/4 of the small model, as recommended):
```json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_size": 64,
  "intermediate_size": 256,
  "num_attention_heads": 1,
  "num_hidden_layers": 12,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "initializer_range": 0.02,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "type_vocab_size": 2,
  "vocab_size": 30522,
  "num_labels": 1,
  "model_type": "electra"
}
```
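One thing worth noting about this config: `embedding_size` (128) differs from `hidden_size` (64), and as far as I understand the Transformers implementation, `ElectraModel` then projects the embeddings down to `hidden_size` internally, so `last_hidden_state` is 64-dimensional, which is what the Linear head above consumes. A quick check:

```python
import torch
from transformers import ElectraConfig, ElectraModel

config = ElectraConfig.from_json_file("./config.json")
encoder = ElectraModel(config)

# ElectraModel adds an embeddings_project layer when embedding_size != hidden_size,
# so the encoder output already has hidden_size (64) features per token
out = encoder(torch.randint(0, config.vocab_size, (2, 16)))
print(out.last_hidden_state.shape)  # torch.Size([2, 16, 64])
```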
For this I created an AdamW optimizer and an LR scheduler:
```python
optimizerGen = torch.optim.AdamW(generator.parameters(), lr=5e-4, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01)
schedulerGen = WarmupThenLinearDecayScheduler(optimizerGen, warmup_steps, 1000000, 1e-7, 5e-4, 0)
```
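`WarmupThenLinearDecayScheduler` is my own helper; the intent is essentially what transformers' `get_linear_schedule_with_warmup` does (linear warmup to 5e-4, then linear decay over 1,000,000 steps, except that mine floors at 1e-7 instead of 0):

```python
from transformers import get_linear_schedule_with_warmup

# roughly equivalent schedule using the built-in helper
schedulerGen = get_linear_schedule_with_warmup(
    optimizerGen,
    num_warmup_steps=warmup_steps,
    num_training_steps=1_000_000,
)
```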
I prepare the data as recommended (a sketch of the masking code is shown after this list):

- I select 15% of the tokens and replace them with [MASK]
- 10% of those selected tokens are replaced with a random token instead of [MASK]
- another 10% of them keep the original token
- the loss is computed only on the selected (masked) positions
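For reference, here is a sketch of the masking function (special-token handling omitted for brevity; `MASK_ID` stands in for my tokenizer's [MASK] id, and labels set to -100 are ignored by `CrossEntropyLoss`):

```python
import torch

MASK_ID = 103        # placeholder for the tokenizer's [MASK] id
IGNORE_INDEX = -100  # CrossEntropyLoss skips positions labelled with this value

def mask_tokens(input_ids, vocab_size, mlm_probability=0.15):
    """BERT-style masking (modifies input_ids in place):
    select 15% of positions, then 80% [MASK] / 10% random / 10% original."""
    labels = input_ids.clone()

    # select 15% of the positions for prediction
    selected = torch.bernoulli(torch.full(input_ids.shape, mlm_probability)).bool()
    labels[~selected] = IGNORE_INDEX  # loss is computed only on the selected positions

    # 80% of the selected positions -> [MASK]
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[replaced] = MASK_ID

    # half of the remaining selected positions (10% overall) -> random token
    random_pos = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~replaced
    random_tokens = torch.randint(vocab_size, input_ids.shape, dtype=torch.long)
    input_ids[random_pos] = random_tokens[random_pos]

    # the remaining 10% keep the original token
    return input_ids, labels

# loss only on the selected positions
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
# logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
# loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
```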
After a bit of training, the model learns to correctly reproduce the tokens that were not replaced with [MASK], but at the [MASK] positions it always predicts the same token for the whole batch. For example, every [MASK] token gets predicted as token 0, which is obviously wrong.
Is there anything else I need to set up specifically to make the generator work?