I have an NGINX proxy server, hosted on AWS Lightsail, which sits between users’ web browsers and a backend container, a frontend container, and an S3 bucket. Here’s the basic setup:
    # Enforce ipv4 through AWS VPC resolver
    ## https://docs.aws.amazon.com/vpc/latest/userguide/vpc-dns.html#AmazonDNS
    resolver 169.254.169.253 ipv6=off valid=10s;

    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      close;
    }

    server {
        listen 80 default_server;
        server_name localhost;

        error_log /var/log/nginx/error.log error;

        # Giving a bit of wiggle room for /api/programs on 1G containers
        # Increase buffer for larger files coming from S3
        client_body_buffer_size 100m;
        proxy_buffer_size 128k;
        proxy_buffers 4 256k;
        proxy_busy_buffers_size 256k;
        proxy_connect_timeout 120s;

        location /api/messaging/ {
            proxy_pass https://$WS_FQDN;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;
            proxy_read_timeout 86400;
            proxy_ssl_server_name on;
            proxy_ssl_name $WS_FQDN;
        }

        location /common {
            proxy_pass https://$S3_IMG_FQDN;
            proxy_http_version 1.1;
            proxy_pass_request_body off;
            proxy_set_header Content-Length "";
            proxy_set_header Connection "";
            proxy_set_header Authorization "";
            proxy_set_header Host $S3_IMG_FQDN;
            proxy_set_header User-Agent $S3_IMG_TOKEN;
            proxy_pass_request_headers on;
            auth_request off;
        }

        location /api/ {
            proxy_pass https://$BACKEND_FQDN;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_set_body "$request_body";
        }

        location / {
            proxy_pass https://$FRONTEND_FQDN;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_set_body "$request_body";
        }
    }
When the NGINX server starts, everything works great. But after a certain amount of time (or perhaps a certain number of requests), one or more of the containers, or the S3 bucket, will start intermittently returning 499s to the proxy’s requests. The browser then gets a 504 error as a result.
For example, when I load the application in a browser, if it sends 5 requests to the backend for something, 1 might fail.
I’ve been looking for patterns but haven’t been able to find any that might help me figure out why this is happening.
- Sometimes it takes a day or two to happen; sometimes it takes upwards of 10 days of uptime before it happens. It seems to be related to load (we have periods of high and low activity), so I think it’s request-based, but I am not sure.
- It isn’t any one container that always returns the 499s. Sometimes it’s the frontend, sometimes it’s the backend, sometimes it’s both, and sometimes it’s the S3 bucket, too.
- It isn’t any one endpoint/file on those servers. I can refresh the browser and the 499 will happen on a different request than the previous error.
- There aren’t any related errors on the containers when the issue starts happening.
I can immediately solve the issue just by restarting the proxy server, but it returns again, without fail, after some amount of time.
We have tried various DNS settings:
- Using Cloudflare DNS (1.1.1.1)
- Using Google DNS (8.8.8.8)
- Adding and removing the DNS cache validity time (currently valid=10s, as above)
- Disabling or enabling IPv6
The current AWS VPC resolver setup shown above was implemented as per a support ticket with AWS, but we are still getting the same issues.
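To make those attempts concrete, this is roughly what the resolver directive has looked like across them. The exact valid= values on the earlier variants, and which combinations we ran together, are from memory, so treat this as a sketch of what we tried rather than a verbatim config history:

    # Earlier variants (tried one at a time, shown commented out here):
    # resolver 1.1.1.1 valid=10s;                    # Cloudflare DNS
    # resolver 8.8.8.8 valid=10s;                    # Google DNS
    # resolver 169.254.169.253;                      # AWS VPC resolver, default TTL-based caching
    # resolver 169.254.169.253 valid=10s;            # AWS VPC resolver, IPv6 answers still allowed

    # Current setting (AWS VPC resolver, IPv4 only, 10s cache):
    resolver 169.254.169.253 ipv6=off valid=10s;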
I am not an NGINX expert, but I understand some 499s are expected (generally when the user’s browser suddenly or unexpectedly closes the connection). I can see those errors from time to time, and they’re easily identifiable because a whole batch of in-flight requests from a single IP gets a 499. But these are different: I can easily pinpoint in the logs when they start happening, and they tend to be single requests out of a batch from a single user, while the rest return 200s.