I am trying to deploy a ColBERT reranker in my RAG pipeline on a T4 GPU. The LLM I am using is Meta-Llama-3-8B-Instruct, already quantized to 4-bit:
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization so the 8B model fits on a T4
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    system_prompt=system_prompt,                # defined earlier in my script
    query_wrapper_prompt=query_wrapper_prompt,  # defined earlier in my script
    context_window=8192,
    max_new_tokens=512,
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,
        "quantization_config": quantization_config,
    },
    generate_kwargs={
        "do_sample": False,
        "temperature": 0.05,
        "top_p": 0.3,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)
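
For completeness, stopping_ids is built the usual Llama-3 way (a sketch; the tokenizer load mirrors the model above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", token=hf_token
)
# Llama-3 chat models emit <|eot_id|> at the end of each assistant turn
stopping_ids = [tok.eos_token_id, tok.convert_tokens_to_ids("<|eot_id|>")]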
My ColBERT reranker also uses Meta-Llama-3-8B-Instruct as both model and tokenizer, reusing the quantized LLM object from above:
from llama_index.postprocessor.colbert_rerank import ColbertRerank

colbert_reranker = ColbertRerank(
    top_n=5,
    model=llm,  # the quantized HuggingFaceLLM object defined above
    tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
    keep_retrieval_score=True,
)
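
The reranker is wired into the query engine as a node postprocessor, roughly like this (a sketch; the index and the similarity_top_k value are placeholders from my setup):

query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=10,                     # retrieve 10 candidates ...
    node_postprocessors=[colbert_reranker],  # ... then rerank down to top_n=5
)
response = query_engine.query("What does the document say about X?")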
However, when I run the reranker, I get this error:

RecursionError: maximum recursion depth exceeded while getting the repr of an object

From here, it seems the problem lies with Transformers; however, I am not sure whether it is a Transformers issue or a problem with my pipeline. How can I resolve this error with the reranker in this setup?
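
One thing I am wondering: ColbertRerank seems to expect model/tokenizer name strings rather than an LLM object, so should I be passing a ColBERT checkpoint instead, like this (untested sketch using the library's default colbert-ir/colbertv2.0 checkpoint)?

colbert_reranker = ColbertRerank(
    top_n=5,
    model="colbert-ir/colbertv2.0",      # a dedicated ColBERT checkpoint
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
)

Or is there a supported way to run ColBERT-style reranking with the quantized Llama-3 model itself?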