When sending thousands of messages to SNS we had an incident where occasional messages would fail after 3 retry attempts.
For the failed messages, the endpoint had resolved to an EC2 instance address:
ec2-3-25-138-X.ap-southeast-2.compute.amazonaws.com
The messages were being sent from within a private network (external to AWS), on a Windows machine, using a custom Java application.
The initial failures started randomly one day, after 12 midnight, and then eventually seemed to clear up after about 12 hours.
All other endpoints worked during this time and resolved to the standard, expected SNS endpoint domains.
Visiting the address manually in a browser produced an SSL_ERROR_BAD_CERT_DOMAIN error, stating that the TLS certificate was only valid for SNS domain names.
Could this be caused by a DNS cache inside our network, or by the application or the AWS SDK itself?
Where does the SNS SDK get the endpoint domains from in the first place?
AWS support were not able to help.
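
To help narrow down the DNS cache question, here is a minimal diagnostic sketch that could be run on the sending machine. It prints the JVM's own DNS cache setting (the networkaddress.cache.ttl security property) and the addresses the JVM currently resolves for the SNS hostname. The class name DnsCacheCheck, the helper describeTtl and the default hostname sns.ap-southeast-2.amazonaws.com are assumptions for illustration (the region is guessed from the address above); this is not taken from our application or from the SDK.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical diagnostic class; the names here are illustrative only.
class DnsCacheCheck {

    // Turns the raw value of the networkaddress.cache.ttl security property
    // into a readable description. A negative value means "cache forever",
    // zero means "do not cache", a positive value is a lifetime in seconds.
    static String describeTtl(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return "not set (JVM default applies)";
        }
        int seconds;
        try {
            seconds = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return "unparseable value: " + raw;
        }
        if (seconds < 0) {
            return "cache forever";
        }
        if (seconds == 0) {
            return "no caching";
        }
        return "cache for " + seconds + " seconds";
    }

    public static void main(String[] args) throws UnknownHostException {
        // Assumed default endpoint; pass the real hostname as the first argument.
        String host = args.length > 0 ? args[0] : "sns.ap-southeast-2.amazonaws.com";

        // How long this JVM keeps successful lookups.
        String ttl = Security.getProperty("networkaddress.cache.ttl");
        System.out.println("networkaddress.cache.ttl: " + describeTtl(ttl));

        // What this JVM (via the OS resolver) currently returns for the host.
        for (InetAddress addr : InetAddress.getAllByName(host)) {
            System.out.println(host + " -> " + addr.getHostAddress());
        }
    }
}
```

Running this at intervals and comparing the output with nslookup on the same Windows machine should show whether a stale address is coming from the JVM's cache or from a resolver further upstream in our network. If the TTL reports "cache forever", a long-running JVM would keep using whatever address it resolved first.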