Our application has started returning a lot of 504 responses from Nginx. In most cases the only trace is the Nginx log entry saying the request hit the 60-second timeout, with no logs at all from the PHP-FPM side. For some of the affected endpoints we do have application logs, and in most of those cases the cause is an unoptimized or heavy DB query. But most 504 responses have no application logs at all (only the Nginx entry), so I can't debug them or understand what the issue is.
My first thought was log buffering: we also have a timeout on the FPM side (the same 60 seconds), so PHP might be getting interrupted before it sends its buffered logs. But we use a stream logger, and for some of the requests that end in a 60-second 504 we do have logs, so I rejected that idea.
My second idea was that there are not enough PHP workers. We have:
pm: static
pm_max_children: 30
However, our Prometheus metrics show that the number of active processes peaks at 15, which is well below 30. We also have an HPA configured to start new pods when any pod reaches 50% active workers.
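To make the worker check above concrete, here is a minimal sketch of how the same numbers could be read straight from the FPM status page. It assumes the status page is enabled (`pm.status_path`) and queried in its JSON format. The function name, the sample payload, and its values are hypothetical and only mirror the figures mentioned here (30 children, a peak of 15 active, a 50% HPA threshold); they are not real output from our pods.

```python
import json


def summarize_fpm_status(status_json: str, max_children: int = 30,
                         hpa_threshold: float = 0.5) -> dict:
    """Summarize worker saturation from a PHP-FPM status page JSON payload."""
    status = json.loads(status_json)
    utilization = status["active processes"] / max_children
    return {
        # Share of the static pool that is busy right now.
        "utilization": utilization,
        # True once the pod is at or above the HPA scaling threshold.
        "above_hpa_threshold": utilization >= hpa_threshold,
        # True if requests have ever waited in the listen queue.
        "queue_seen": status["max listen queue"] > 0,
        # True if FPM has ever hit the process limit.
        "max_children_reached": status["max children reached"] > 0,
    }


# Hypothetical payload (assumed values, not taken from a real pod).
sample = json.dumps({
    "pool": "www",
    "process manager": "static",
    "listen queue": 0,
    "max listen queue": 0,
    "idle processes": 15,
    "active processes": 15,
    "total processes": 30,
    "max active processes": 15,
    "max children reached": 0,
    "slow requests": 0,
})

summary = summarize_fpm_status(sample)
print(summary)
```

If `queue_seen` or `max_children_reached` ever came back true on a pod, that would contradict what the Prometheus graphs suggest, so it seems like a cheap thing to rule out.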
According to the K8s resource metrics, everything looks good: there are no CPU or memory spikes. What else could be causing this, and where should I look?