I’ve been building a Question-Answering bot on a RAG (Retrieval-Augmented Generation) architecture, using LlamaIndex with the Llama-2-70b-chat-hf model served via the Together API. During the evaluation phase, while assessing the Faithfulness and Relevancy of the query engine with LlamaIndex’s `BatchEvalRunner`, I hit a rate-limit error:
```
Retrying llama_index.llms.openai.base.OpenAI._acomplete in 0.7368610084778021 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Request was rejected due to rate limiting. As a free user, your QPS is 1. If you want more, please upgrade your account.', 'type': 'credit_limit', 'param': None, 'code': None}}.
```
Although I’m using an open-source LLM and embedding model, and I never provided an OpenAI key or used OpenAI explicitly, I’m still getting this error.
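For what it’s worth, the traceback points at `llama_index.llms.openai.base.OpenAI._acomplete`, and as far as I know LlamaIndex falls back to an OpenAI LLM whenever a component doesn’t resolve to an explicitly configured one. A quick standard-library check to rule out a stray key in the environment (this is just `os.environ`, nothing LlamaIndex-specific):

```python
import os

# If OPENAI_API_KEY is present, any LlamaIndex component that falls back to
# the default OpenAI LLM will use it without an explicit key appearing in code.
print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```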
Here’s a breakdown of my setup:
**LLM and Embedding Model Initialization**
```python
llm_model = "meta-llama/Llama-2-70b-chat-hf"
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"

embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name
)

llm = TogetherLLM(
    model=llm_model,
    api_key=Together_API,
    context_window=context_window,
    temperature=0.4,
    max_tokens=max_tokens,
    top_p=0.9,
    top_k=20,
    is_chat_model=False,
)

Settings.llm = llm
Settings.embed_model = embed_model
```
**Vector Store Initialization**
```python
db = chromadb.PersistentClient(path="./db")
chroma_collection = db.get_or_create_collection("PPM")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)
```
**Query Engine**
```python
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

query_engine = RetrieverQueryEngine.from_args(
    llm=llm,
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    stream=False,
    node_postprocessors=[
        reranker,
        MetadataReplacementPostProcessor(target_metadata_key="window"),
    ],
)
```
**Q/A Dataset Generation**

For evaluation, I first generated a Q/A dataset using LlamaIndex’s `generate_question_context_pairs` utility:
```python
from llama_index.core.evaluation import generate_question_context_pairs

qa_dataset = generate_question_context_pairs(
    nodes,  # nodes generated by SentenceWindowNodeParser
    llm=llm,
    num_questions_per_chunk=1,
)
```
**Evaluation**
```python
from llama_index.core.evaluation import (
    BatchEvalRunner,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

queries = list(qa_dataset.queries.values())

faithfulness = FaithfulnessEvaluator(llm=llm)
relevancy = RelevancyEvaluator(llm=llm)

runner = BatchEvalRunner(
    {"faithfulness": faithfulness, "relevancy": relevancy},
    workers=8,
)

eval_results = await runner.aevaluate_queries(
    query_engine, queries=queries
)
```
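Separately, since the Together free tier allows only 1 QPS, `workers=8` would exceed the limit even once the right LLM is being called. As a generic sketch (independent of LlamaIndex; the helper name and structure are my own), async calls can be spaced out to respect a QPS budget like this:

```python
import asyncio
import time

async def throttled_gather(coro_fns, qps=1.0):
    """Await each async callable in turn, spacing starts ~1/qps seconds apart."""
    interval = 1.0 / qps
    results = []
    for fn in coro_fns:
        start = time.monotonic()
        results.append(await fn())
        elapsed = time.monotonic() - start
        if elapsed < interval:
            await asyncio.sleep(interval - elapsed)
    return results

async def main():
    async def fake_call(i):
        # stand-in for an evaluator call that hits the LLM API
        return i * 2
    fns = [lambda i=i: fake_call(i) for i in range(3)]
    return await throttled_gather(fns, qps=100.0)  # high qps just for the demo

print(asyncio.run(main()))  # [0, 2, 4]
```

Within `BatchEvalRunner` itself, dropping `workers=8` to `workers=1` should have a similar effect, though that alone wouldn’t explain why an OpenAI client is being invoked at all.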
**Warnings**
```
Retrying llama_index.llms.openai.base.OpenAI._acomplete in 0.5485069173398549 seconds as it raised RateLimitError: Error code: 429 - {'error': {'message': 'Request was rejected due to rate limiting. As a free user, your QPS is 1. If you want more, please upgrade your account.', 'type': 'credit_limit', 'param': None, 'code': None}}.
```
I’m uncertain why I’m encountering this rate-limit error, since I haven’t explicitly used any OpenAI key. Any insights or suggestions would be greatly appreciated.