Hi everyone, I fine-tuned Llama 3 70B Instruct on RunPod using LoRA, merged the adapter with the base model, converted it to GGUF with 4-bit quantization, and pushed the model to Ollama. When I use it with ChatOllama or Ollama from LangChain, I get no response: the request just keeps processing, and sometimes I receive an "ollama failed" error.
The model I want to run inference with: https://ollama.com/utshav/llama3-70b-4bitextraction
Is there anything I am doing wrong? Please let me know; any help would be very useful, as I have already spent a lot of time and money on this.
I can get inference from the base model (the non-finetuned version) on the same GPU, but this fine-tuned version takes forever and never finishes generating a response.
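For reference, this is roughly how I'm calling the model from LangChain (a minimal sketch; the model name is taken from the Ollama link above, and everything else is left at defaults):

```python
# Minimal sketch of how I'm invoking the fine-tuned model via LangChain.
# Assumes the Ollama server is running locally and the model from the link
# above has already been pulled with `ollama pull`.
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(
    model="utshav/llama3-70b-4bitextraction",  # fine-tuned 4-bit GGUF model
    temperature=0,
)

# This call keeps "processing" and never returns, while the same call
# against the base llama3:70b-instruct model responds normally.
response = llm.invoke("Extract the key fields from this text: ...")
print(response.content)
```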