I am using this configuration (config.pbtxt):
name: "llama2"
backend: "python"
input [
{
name: "prompt"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "generated_text"
data_type: TYPE_STRING
dims: [1]
}
]
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "/mnt/data/model_repository/llama2/pbtxt_name.tar.gz"}
}
instance_group [
{
count: 1
kind: KIND_GPU
}
]
and this code in the execute method inside model.py:
def execute(self, requests):
    logger = pb_utils.Logger
    logger.log("execute-Specific Msg!", logger.INFO)
    responses = []
    for request in requests:
        logger.log_info("request specific ")
        print(request)
        # Decode the Byte Tensor into Text
        inputs = pb_utils.get_input_tensor_by_name(request, "prompt")
        logger.log_info("inputs before numpy ")
        inputs = inputs.as_numpy()
        logger.log_info("inputs after numpy ")
        logger.log_info(inputs)
        DEFAULT_SYSTEM_PROMPT = """You are a helpful AI assistant. Keep short answers of no more than 2 sentences."""
        prompts = [self.get_prompt(i[0].decode(), [], DEFAULT_SYSTEM_PROMPT) for i in inputs]
        ...
The Triton model also loads fine:
I0624 ************ 1 model_lifecycle.cc:264] ModelStates()
I0624 ************ 1 server.cc:633]
+--------+---------+--------+
| Model | Version | Status |
+--------+---------+--------+
| llama2 | 1 | READY |
+--------+---------+--------+
I0624 ************ 1 metrics.cc:864] Collecting metrics for GPU 0: NVIDIA A10G
I0624 ************ 1 metrics.cc:757] Collecting CPU metrics
But when I run a simple inference via curl like this:
curl --location --request POST 'http://localhost:8000/v2/models/llama2/infer' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "inputs": [
      {
        "name": "prompt",
        "shape": [1],
        "datatype": "BYTES",
        "data": ["capital of India"]
      }
    ]
  }'
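For reference, the same request fails the same way through the Python HTTP client (the exception quoted below comes from tritonclient). This is roughly what I am running, sketched against the standard tritonclient API, so the client-side variable names here are my own:

import numpy as np
import tritonclient.http as httpclient

# Rough equivalent of the curl call above via the Python HTTP client
# (a sketch assuming the standard tritonclient package).
client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array(["capital of India"], dtype=object)  # one string, shape [1]
infer_input = httpclient.InferInput("prompt", [1], "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llama2", inputs=[infer_input])
print(result.as_numpy("generated_text"))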
I am not getting any results. Here is a summary:
- No log line with a dynamic value is printed, e.g. the actual contents of "request" or "inputs".
- I get an error on the line "prompts = [self.get_prompt(i[0].decode(), [], DEFAULT_SYSTEM_PROMPT) for i in inputs]" (see the standalone repro sketch after this list). The error is:
tritonclient.utils.InferenceServerException: [400] Failed to process the request(s) for model instance 'llama2_0', message: AttributeError: 'int' object has no attribute 'decode'
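I can reproduce the same AttributeError outside Triton with plain numpy, assuming as_numpy() returns a 1-D object array of bytes for a BYTES tensor of shape [1] (that assumption may be exactly where I am going wrong):

import numpy as np

# Standalone repro under the assumption above: each element of the
# 1-D array is a bytes object, so i[0] indexes into the bytes and
# yields an int, not a one-element sub-array.
inputs = np.array([b"capital of India"], dtype=object)
for i in inputs:
    print(type(i))     # <class 'bytes'>
    print(type(i[0]))  # <class 'int'> -- indexing bytes in Python 3 gives an int
    i[0].decode()      # AttributeError: 'int' object has no attribute 'decode'

This makes me suspect the list comprehension is indexing into the bytes value itself rather than into a row of a 2-D array, but I am not sure which side should change (the config dims, the request shape, or the comprehension).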
It seems to be some basic issue, but I am not able to find it. This has already cost me a lot of time and money on the GPU; can someone please help me?
I am following this guide: https://blog.marvik.ai/2023/10/16/deploying-llama2-with-nvidia-triton-inference-server/