What is the proper way to build a FastAPI app that executes a SageMaker inference call on every request?
Should I create a single boto3 client or one client per invocation? My app will handle around 400 rps. Right now I am doing:
import ujson
import boto3

client = boto3.client("sagemaker-runtime", region_name="us-east-1")

def inference(payload):
    response = client.invoke_endpoint(
        EndpointName=endpoint,
        Body=payload,
        ContentType='application/json'
    )
    response_body = response["Body"].read().decode("utf-8")
    response_body = ujson.loads(response_body)
    return response_body
I am trying to figure out why my SageMaker integration is making my application slow; I am not sure how the client should be used.
I have plenty of async code before the call to inference(), and I suspect this piece of code is the problem: it looks like something is blocking my threads.
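One likely cause of the blocking described above: `invoke_endpoint` is a synchronous call, so invoking it directly from an `async def` endpoint stalls the event loop. A minimal sketch of the usual fix, offloading the blocking call to a thread pool with `asyncio.to_thread` (the `blocking_inference` function below is a hypothetical stand-in for the real boto3 call, since this sketch cannot hit SageMaker):

```python
import asyncio
import time

def blocking_inference(payload):
    # Stand-in for client.invoke_endpoint(...): any blocking I/O.
    time.sleep(0.1)
    return {"echo": payload}

async def inference_async(payload):
    # Run the blocking call in the default thread pool so the
    # event loop stays free to serve other requests meanwhile.
    return await asyncio.to_thread(blocking_inference, payload)

async def main():
    # Five concurrent calls overlap instead of running serially.
    return await asyncio.gather(*(inference_async(i) for i in range(5)))
```

In a FastAPI route you would `await inference_async(payload)`; alternatively, declaring the route as plain `def` makes FastAPI run it in its own thread pool automatically.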