I’m trying to confirm that my GPT-2 model is being trained from scratch, rather than using any pre-existing pre-trained weights. Here’s my approach:
- Load the pre-trained GPT-2 XL model: I load a pre-trained GPT-2 XL model using `AutoModelForCausalLM.from_pretrained("gpt2-xl")` and calculate the total L2 norm of its weights.
- Initialize a new GPT-2 model from scratch: I then initialize a new GPT-2 model from scratch with a custom `GPT2Config`.
- Compare L2 norms: I calculate the L2 norm of the weights for both the pre-trained model and the freshly initialized model. My assumption is that if the scratch model is truly initialized from random weights, its L2 norm should be much smaller than that of the pre-trained model.
Here’s the code snippet:
```python
import torch
from transformers import GPT2LMHeadModel, GPT2Config, AutoModelForCausalLM

# Step 1: Load the pre-trained GPT-2 XL model
pretrained_model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

# Step 2: Calculate the L2 norm of the weights for the pre-trained model
pretrained_weight_norm = 0.0
for param in pretrained_model.parameters():
    pretrained_weight_norm += torch.norm(param, p=2).item()
print(f"Total L2 norm of pre-trained model weights: {pretrained_weight_norm:.2f}")

# Step 3: Initialize a new GPT-2 model from scratch with a custom configuration
config = GPT2Config(
    vocab_size=52000,  # Ensure this matches the tokenizer's vocabulary size
    n_ctx=1024,        # Context window size (number of tokens the model can see at once)
    bos_token_id=0,    # Beginning-of-sequence token
    eos_token_id=1,    # End-of-sequence token
)
model = GPT2LMHeadModel(config)

# Step 4: Calculate the L2 norm of the weights for the freshly initialized model
scratch_weight_norm = 0.0
for param in model.parameters():
    scratch_weight_norm += torch.norm(param, p=2).item()
print(f"Total L2 norm of model initialized from scratch: {scratch_weight_norm:.2f}")
```
Is this method a valid way to confirm that the model is being trained from scratch? Are there any potential issues or better ways to verify that the model has no pre-existing learned weights?
Looks right. Here is the output from a run:
```
~/beyond-scale-language-data-diversity$ /opt/conda/envs/beyond_scale_div_coeff/bin/python /home/ubuntu/beyond-scale-language-data-diversity/playground/test_gpt2_pt_vs_reinit_scratch.py
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 689/689 [00:00<00:00, 8.05MB/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████| 6.43G/6.43G [00:29<00:00, 221MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 124/124 [00:00<00:00, 1.03MB/s]
Total L2 norm of pre-trained model weights: 24542.74
Total L2 norm of model initialized from scratch: 1637.31
(beyond_scale_div_coeff)
```
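One caveat: the scratch model in your snippet uses the default GPT-2 (small) architecture, so part of the norm gap just reflects model size rather than training. For an apples-to-apples check, you could randomly initialize the *same* XL architecture and compare norms; a sketch, assuming the same metric as your script:

```python
import torch
from transformers import AutoModelForCausalLM, GPT2Config, GPT2LMHeadModel

# Pre-trained XL vs. a randomly initialized model with the *same* XL config
pretrained = AutoModelForCausalLM.from_pretrained("gpt2-xl")
xl_config = GPT2Config.from_pretrained("gpt2-xl")  # same shapes/sizes as the checkpoint
scratch_xl = GPT2LMHeadModel(xl_config)            # weights come from random init, not the checkpoint

def sum_of_norms(model) -> float:
    # Same metric as the original script: sum of per-tensor L2 norms
    return sum(torch.norm(p, p=2).item() for p in model.parameters())

print(f"pre-trained XL: {sum_of_norms(pretrained):.2f}")
print(f"scratch XL:     {sum_of_norms(scratch_xl):.2f}")
```

If the scratch-XL norm is still far below the pre-trained one, the gap is attributable to training rather than architecture.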
cross: https://discuss.huggingface.co/t/how-to-reinitialize-from-scratch-gpt-xl-in-hugging-face-hf/101905
ref: https://github.com/alycialee/beyond-scale-language-data-diversity/issues/18