I’m running Llama 2 70B from Hugging Face and using the output_attentions=True argument in Transformers to get the attention weights. It’s supposed to return a tuple shaped (layers, batch size, attention heads, input size, input size), but what I’m getting has size (7, 80, 1, 64, input size, input size). What is the extra dimension of 7 supposed to be?

Also, I thought causal models were supposed to have lower-triangular attention matrices, but the output isn’t lower triangular. Why is this?
This is the code I used:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

# input_prompt and sequence_bias are defined earlier (omitted here)
input_ids = tokenizer([input_prompt], return_tensors="pt")
outputs = model.generate(
    input_ids['input_ids'],
    do_sample=False,
    temperature=None,
    top_p=None,
    top_k=None,
    max_new_tokens=7,
    sequence_bias=sequence_bias,
    return_dict_in_generate=True,
    output_attentions=True,
)
attention = outputs.attentions
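
In case it helps, this is roughly how I looked at the shapes (a minimal sketch; it just prints the lengths of the nested tuples and the shape of the first tensor):

# Inspect the nesting of outputs.attentions
print(len(attention))         # 7
print(len(attention[0]))      # 80
print(attention[0][0].shape)  # torch.Size([1, 64, input size, input size])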