I am trying to improve the inference speed of my encoder model using the NVIDIA Triton Inference Server, but when I test manually I cannot reproduce the inferences/sec that model-analyzer reports. I am using the best configuration recommended by model-analyzer, yet I am seeing only about half the inference rate. Am I doing something wrong here?
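For context, my manual test is roughly the synchronous loop sketched below. The input names, server URL, batch size, and sequence length are placeholders for my actual setup, and my real test feeds tokenized sentences rather than random ids:

import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # placeholder URL

batch_size = 16   # placeholder; I try a few batch sizes
seq_len = 128     # placeholder max sequence length
num_batches = 200

# Dummy tokenized input; the real test uses tokenized sentences
input_ids = np.random.randint(0, 30000, (batch_size, seq_len), dtype=np.int64)
attention_mask = np.ones((batch_size, seq_len), dtype=np.int64)

inputs = [
    httpclient.InferInput("input_ids", input_ids.shape, "INT64"),
    httpclient.InferInput("attention_mask", attention_mask.shape, "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

start = time.time()
for _ in range(num_batches):
    client.infer("all-MiniLM-L6-v2", inputs)
elapsed = time.time() - start
print(f"throughput: {num_batches * batch_size / elapsed:.0f} sentences/sec")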
I ran model-analyzer with:
model-analyzer profile --model-repository=models --profile-models=all-MiniLM-L6-v2 --output-model-repository-path=results
It produced the best configurations, and the throughput reported by model-analyzer was ~3900 sentences/sec. Running the model myself with that configuration, I am struggling to reach even 1700 inferences/sec.
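As a sanity check, since I understand model-analyzer drives its measurements with perf_analyzer, I am planning to run perf_analyzer directly against the deployed configuration to see whether it reproduces the ~3900 figure. Something like the command below (the concurrency range is a guess, and I may also need --shape or --input-data for the variable-length text inputs):

perf_analyzer -m all-MiniLM-L6-v2 -u localhost:8001 -i grpc --concurrency-range 1:64:8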