I am working with a PyTorch model (AutoModelForCausalLM) from the transformers library and encountering a NotImplementedError related to tensor dtypes and operator support. Here is a simplified version of my code:
import torch
import requests
from PIL import Image
from IPython.display import display
from transformers import AutoModelForCausalLM, LlamaTokenizer
# Load tokenizer and model
tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,  # Using torch.float16
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval()
def generate(query: str, img_url: str, max_length: int = 2048) -> str:
    image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
    display(image)

    # Build the conversation inputs for CogVLM
    inputs = model.build_conversation_input_ids(
        tokenizer, query=query, history=[], images=[image], template_version='vqa'
    )

    # Convert tensors to the expected dtypes and add a batch dimension
    input_ids = inputs['input_ids'].unsqueeze(0).to(torch.long)
    token_type_ids = inputs['token_type_ids'].unsqueeze(0).to(torch.long)
    attention_mask = inputs['attention_mask'].unsqueeze(0).to(torch.float16)
    images = [[inputs['images'][0].to(torch.float16)]]

    inputs = {
        'input_ids': input_ids,
        'token_type_ids': token_type_ids,
        'attention_mask': attention_mask,
        'images': images,
    }
    gen_kwargs = {"max_length": max_length, "do_sample": False}

    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
        outputs = outputs[:, input_ids.shape[1]:]
        return tokenizer.decode(outputs[0])
query = 'Describe this image in detail'
img_url = 'https://i.ibb.co/x1nH9vr/Slide1.jpg'
generate(query, img_url)
The above code throws the following error:
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
    query     : shape=(1, 1226, 16, 112) (torch.float16)
    key       : shape=(1, 1226, 16, 112) (torch.float16)
    value     : shape=(1, 1226, 16, 112) (torch.float16)
    attn_bias : <class 'NoneType'>
    p         : 0.0
`ck_decoderF` is not supported because:
    device=cpu (supported: {'cuda'})
    operator wasn't built - see `python -m xformers.info` for more info
`ckF` is not supported because:
    device=cpu (supported: {'cuda'})
    operator wasn't built - see `python -m xformers.info` for more info
To summarize: the model is loaded in torch.float16 via AutoModelForCausalLM, and everything runs on the CPU (device=cpu). The NotImplementedError above indicates that xformers' memory_efficient_attention_forward has no operator that supports torch.float16 inputs on CPU.
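A quick check along these lines (using the variable names from my generate() function above, placed right before model.generate()) confirms that the devices and dtypes match what the traceback reports:

    # Sanity check: print device, dtype, and shape of every input passed to the model
    for name, tensor in [('input_ids', input_ids),
                         ('token_type_ids', token_type_ids),
                         ('attention_mask', attention_mask),
                         ('image', images[0][0])]:
        print(name, tensor.device, tensor.dtype, tuple(tensor.shape))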
Is there a way to make memory_efficient_attention_forward work with torch.float16 on CPU? Are there alternative approaches or configurations I should consider to resolve this issue?
I am trying to run this on a MacBook Pro with an Intel Core i7 processor, so there is no CUDA-capable GPU available.
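One configuration I was considering is sketched below: loading the model in torch.float32 (since float16 support for many CPU kernels is limited) and requesting the plain attention path via attn_implementation='eager'. Both parts are assumptions on my side: attn_implementation requires a recent transformers release, and I don't know whether the custom CogVLM code loaded via trust_remote_code honors it or still calls into xformers directly.

    # Rough sketch of an alternative CPU configuration (unverified):
    # - torch.float32 instead of torch.float16
    # - attn_implementation='eager' to request standard attention instead of
    #   memory-efficient attention (newer transformers only; the remote CogVLM
    #   code may ignore this and call xformers anyway)
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.float32,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        attn_implementation='eager',
    ).eval()

    # The float16 casts in generate() would then become float32 as well:
    #   attention_mask = inputs['attention_mask'].unsqueeze(0).to(torch.float32)
    #   images = [[inputs['images'][0].to(torch.float32)]]

Would something like this be a reasonable direction on a CPU-only machine?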