Bear with me here since I’m new to AWS. I’m trying to process a big database of documents; specifically, I’m using Mistral-7B-v0.3 to generate summaries. I’m deploying the model for Real-Time Inference with these settings (the deploy call is sketched below):
instance_type = "ml.g5.2xlarge"
number_of_gpus = 1
health_check_timeout = 600
initial_instance_count = 5
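Roughly, the deployment looks like this (a simplified sketch; I’m assuming the Hugging Face LLM/TGI container, and the model ID and environment values are placeholders for what I actually pass):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face LLM (TGI) serving container
llm_image = get_huggingface_llm_image_uri("huggingface")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "mistralai/Mistral-7B-v0.3",  # model to serve
        "SM_NUM_GPUS": "1",                          # number_of_gpus
    },
)

# Real-Time Inference endpoint
predictor = llm_model.deploy(
    instance_type="ml.g5.2xlarge",
    initial_instance_count=5,
    container_startup_health_check_timeout=600,  # health_check_timeout
)
```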
I’m calling the endpoint from another big AWS machine, a c6a.24xlarge with 96 vCPUs. The problem is that when I try to send requests from all 96 processors in parallel, they just never reach SageMaker; if I check CloudWatch, there are no logs at all.
I then tried decreasing the number of processors used from 96 down to 15, and that worked (it still failed with 16 processors).
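For reference, the client side looks roughly like this (a sketch; the endpoint name, `load_documents`, and the payload format are simplified placeholders for what I actually run):

```python
import json
from concurrent.futures import ProcessPoolExecutor

import boto3

ENDPOINT_NAME = "mistral-7b-endpoint"  # placeholder name
NUM_WORKERS = 96                       # one worker per vCPU on the c6a.24xlarge


def summarize_doc(document: str) -> str:
    """Send one document to the SageMaker endpoint and return the summary."""
    client = boto3.client("sagemaker-runtime")  # a new client per call, so nothing is shared across processes
    payload = {"inputs": f"Summarize the following document:\n{document}"}
    response = client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())[0]["generated_text"]


if __name__ == "__main__":
    documents = load_documents()  # placeholder for reading from the database
    with ProcessPoolExecutor(max_workers=NUM_WORKERS) as pool:
        summaries = list(pool.map(summarize_doc, documents))
```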
My next thought was to move to a bigger instance type and increase the count:
instance_type = "ml.g5.4xlarge"
number_of_gpus = 1
health_check_timeout = 600
initial_instance_count = 10
but I got a ResourceLimitExceeded error (with that instance type my account is limited to 2 instances). So I stuck with:
instance_type = "ml.g5.4xlarge"
number_of_gpus = 1
health_check_timeout = 600
initial_instance_count = 10
expecting to be able to use 30 processors instead of 15. But that wasn’t the case: I could still only use 15.
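For what it’s worth, this is roughly how the account limit behind the ResourceLimitExceeded can be checked (a sketch using the Service Quotas API; the exact quota-name wording is an assumption on my part):

```python
import boto3

# List SageMaker quotas and look for the g5.4xlarge endpoint-usage limit
quotas = boto3.client("service-quotas")

paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        # Quota names look like "ml.g5.4xlarge for endpoint usage" (assumption)
        name = quota["QuotaName"]
        if "g5.4xlarge" in name and "endpoint" in name.lower():
            print(name, quota["Value"])
```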
Anyway, is there anything I can do to increase the number of parallel requests I can send to the SageMaker endpoint? Any workaround or anything else I could try?