I’m experiencing significant performance and output quality issues when running the LLaMA 13B model using the llama_cpp library on my laptop. The same setup works efficiently with the LLaMA 7B model. Below are the details:
Environment:
Laptop Specifications:
CPU: Intel Core i7 (8 cores)
RAM: 16 GB (with ~2 GB free during inference)
GPU: None
Software:
Operating System: Windows 10
Python Version: 3.10
llama_cpp Version: 0.1.0
Model Files: llama-2-13b.Q2_K.gguf for 13B and llama-2-7b.Q2_K.gguf for 7B
Issue:
When using the following code to run the LLaMA 13B model:
import psutil
import os
import warnings
from llama_cpp import Llama
import time

# Suppress warnings
warnings.filterwarnings("ignore")

# Path to the model
model_path = "C:/Llama_project/models/llama-2-13b.Q2_K.gguf"

# Load the model with the adjusted n_ctx parameter (token limit)
llm = Llama(model_path=model_path, n_ctx=4096)  # Set the context size to 4096 tokens

# System message to set the behavior of the assistant
system_message = "You are a helpful assistant."

# Function to ask questions
def ask_question(question):
    # Use user input for the question prompt
    prompt = f"Answer the following question: {question}"

    # Fallback to approximate token count (this can be refined if needed)
    prompt_tokens = len(prompt.split())  # Basic word count as a rough token count estimate
    print(f"Prompt token count: {prompt_tokens}")

    # Calculate the remaining tokens for the output based on the model's 4096-token limit
    max_output_tokens = 4096 - prompt_tokens
    print(f"Remaining tokens for output: {max_output_tokens}")

    # Monitor memory usage and CPU utilization before calling the model
    process = psutil.Process(os.getpid())
    mem_before = process.memory_info().rss / 1024 ** 2  # Memory in MB
    cpu_before = psutil.cpu_percent(interval=1)  # CPU utilization before processing
    print(f"Memory before: {mem_before:.2f} MB")
    print(f"CPU utilization before: {cpu_before:.2f}%")

    # Get the output from the model with the calculated max tokens for output
    start_time = time.time()  # Track time taken to generate the response
    try:
        output = llm(prompt=prompt, max_tokens=max_output_tokens, temperature=0.7, top_p=1.0)
        print("Response generated successfully.")
    except Exception as e:
        print(f"An error occurred during model inference: {e}")
        return ""
    end_time = time.time()  # End time after processing

    # Monitor memory usage and CPU utilization after calling the model
    mem_after = process.memory_info().rss / 1024 ** 2  # Memory in MB
    cpu_after = psutil.cpu_percent(interval=1)  # CPU utilization after processing
    print(f"Memory after: {mem_after:.2f} MB")
    print(f"CPU utilization after: {cpu_after:.2f}%")
    print(f"Time taken for response: {end_time - start_time:.2f} seconds")

    # Clean the output and return only the answer text
    return output["choices"][0]["text"].strip()

# Main loop for user interaction
while True:
    # Take user input for the question
    user_input = input("Ask a question (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        print("Exiting the program.")
        break

    # Get the model's response
    answer = ask_question(user_input)

    # Print only the answer, no extra logs or warnings
    print(f"Answer: {answer}")
Observed Behavior:
With LLaMA 7B:
Response Times: Quick.
Answer Quality: Appropriate and relevant.
With LLaMA 13B:
Response Time: Approximately 3 minutes for simple prompts like “hello”.
Extended Prompts: “write a detailed essay on Allama Iqbal” takes about 25 minutes, leading me to terminate the program before completion.
Memory Usage: Before processing – ~8.4 GB; After processing – ~8.4 GB.
CPU Utilization: Before processing – ~5.8%; After processing – ~8.7%.
Output Quality: On top of the slowness, responses are often inappropriate or low-quality.
Additional Details:
RAM Usage: I have 16 GB of RAM installed, with about 8 GB free before running the model and approximately 2 GB free during inference. There doesn’t seem to be a RAM bottleneck, though I still want to rule out swapping (see the psutil check after this list).
GPU: My laptop does not have a GPU, which I suspect might be contributing to the slow performance with the larger model.
Model Quantization: Using Q2_K quantization for both 7B and 13B models.
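To double-check the RAM point above, this is the swap check I plan to run while the 13B model is generating (sketch only, using standard psutil calls):

import psutil

# Sketch: see whether the system dips into swap during 13B inference
vm = psutil.virtual_memory()
sm = psutil.swap_memory()
print(f"Available RAM: {vm.available / 1024 ** 3:.2f} GB of {vm.total / 1024 ** 3:.2f} GB")
print(f"Swap used: {sm.used / 1024 ** 3:.2f} GB ({sm.percent:.1f}%)")
# If swap usage climbs while the model is generating, the slowdown is
# probably paging rather than raw CPU speed.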
Attempts to Resolve:
Reduced the desired word count: Tried lowering the requested output length to see if it affects performance.
Optimized Parameters: Adjusted parameters like temperature and top_p for better responses.
Library Updates: Ensured that llama_cpp is up-to-date.
System Monitoring: Checked RAM and CPU usage during model inference.
Tried Different Prompts: Simple prompts are slow but manageable; complex prompts are excessively slow and incomplete.
Questions:
Performance Optimization: What can I do to optimize the performance of the LLaMA 13B model on a CPU-only system with 16 GB of RAM? (For example, would explicit n_threads/n_batch settings like the ones sketched after these questions help?)
Inappropriate Responses: Why are the responses with the 13B model inappropriate despite longer processing times?
Quantization Issues: Could the Q2_K quantization be causing compatibility or performance issues with the llama_cpp library?
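For the performance question, this is the kind of loading configuration I am considering but have not benchmarked yet; n_ctx, n_threads, and n_batch are constructor parameters in llama-cpp-python, and the exact values below are guesses for my 8-core i7:

from llama_cpp import Llama

# Sketch: load the 13B model with explicit CPU settings (values are guesses)
llm = Llama(
    model_path="C:/Llama_project/models/llama-2-13b.Q2_K.gguf",
    n_ctx=2048,    # smaller context window than 4096 to reduce KV-cache/memory pressure
    n_threads=8,   # match the number of CPU cores
    n_batch=256,   # prompt-processing batch size
)

# Cap the response length instead of handing the model the whole remaining context
output = llm("Answer the following question: hello", max_tokens=256, temperature=0.7, top_p=1.0)
print(output["choices"][0]["text"].strip())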
What I’ve Tried So Far:
Running the same script with the LLaMA 7B model works efficiently.
Lowering the max_output_tokens to reduce processing time.
Ensuring all libraries are updated to their latest versions.
Monitoring system resources to identify potential bottlenecks (I also plan to measure tokens per second with the streaming sketch after this list).
Attempting to use different quantization settings without success.
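In connection with the resource monitoring above, I also want to measure raw generation speed with streaming so I can watch tokens appear as they are produced; a rough sketch (stream=True makes the call yield chunks instead of a single dict):

import time

# Sketch: stream a completion and estimate tokens per second
prompt = "Answer the following question: write a short paragraph on Allama Iqbal"
start = time.time()
chunks = 0
for chunk in llm(prompt, max_tokens=256, temperature=0.7, top_p=1.0, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
    chunks += 1
elapsed = time.time() - start
print(f"\n{chunks} chunks in {elapsed:.1f} s (~{chunks / elapsed:.2f} tokens/s)")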
Additional Information:
When I use prompts like “write a detailed essay on Allama Iqbal”, generation runs for about 25 minutes, at which point I stop the program because I’m unsure how much longer it will take.