I’ve done some significant testing of AFD in an attempt to reach the following outcomes:
- Zero downtime deploys (Blue/Green)
- Automatic / immediate failover due to app service being stopped or restarting
I have two origins, both Umbraco .NET web applications, each in a different region. They are identical, and their configuration in AFD is also identical:
- Priority: 1
- Weight: 50
Health monitoring is currently off because it was generating far too many requests (300/min).
I’ve tried various load balancing settings but have trimmed them back as tightly as they’ll go, so:
- Sample Size: 1
- Successful Samples: 1
- Latency: 0
Regardless of how I adjust these, and even with health monitoring enabled, a failover event always takes a number of minutes before users are fully routed to the healthy origin.
If I stop one of the web apps, some users continue to get the 403 Stopped status page for up to two minutes.
If I restart one of the web apps, many users will see the 502 status page, and then experience the slow startup of the web app while the other origin is 100% healthy.
Sometimes I refresh and get HTML but no stylesheets. I have no idea why.
What I would expect is a much faster failover to the healthy origin, especially when one of the servers starts reporting that it is down.
Are health probes required in this scenario? I would like to configure these to check every ten seconds but the number of requests coming in from all the POPs is insane. I can’t quite understand why the last response from the origin doesn’t seem to influence whether the origin is healthy or not.
Is this maybe not the right tool for the job? Any advice appreciated.
Based on the outcomes you’re attempting to reach, you will want to use the Front Door health probes. As you have seen, they generate significant traffic, since each Front Door edge location sends health probes to the origins.
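To get a feel for the probe volume, here is a rough back-of-the-envelope sketch. It assumes every edge location probes each origin once per interval; the function name and the figure of 150 edge locations are made up for illustration, and the real count depends on where your traffic comes from.

```python
def probes_per_minute(edge_locations: int, interval_seconds: float) -> float:
    """Rough upper bound on probe requests per minute reaching ONE origin.

    Assumes each edge location sends one probe per interval.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return edge_locations * (60.0 / interval_seconds)


# 150 is a hypothetical number of actively probing edge locations.
for interval in (30, 10):
    print(f"{interval}s interval: {probes_per_minute(150, interval):.0f} probes/min")
```

Under those assumptions, going from a 30 second interval to a 10 second interval triples the probe traffic per origin.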
The load balancing settings are correct for what you are trying to achieve – sample size 1, successful samples 1 and latency 0.
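As a simplified model of why those values are the tightest option (this is only a sketch of the idea, not Front Door's actual implementation, and the function name is hypothetical): an origin counts as healthy when at least the required number of the most recent samples succeeded.

```python
from typing import Sequence


def is_healthy(results: Sequence[bool], sample_size: int,
               successful_samples_required: int) -> bool:
    """Evaluate origin health from probe results, oldest first.

    Only the last `sample_size` results are considered.
    """
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    recent = results[-sample_size:]
    successes = sum(1 for ok in recent if ok)
    return successes >= successful_samples_required


# Sample size 1, successful samples 1: a single failed probe marks the origin unhealthy.
print(is_healthy([True, False], 1, 1))
# Sample size 4, successful samples 2: two failures in a row are still tolerated.
print(is_healthy([True, True, False, False], 4, 2))
```

With larger sample sizes, older successful probes keep the origin in rotation for longer, which delays failover.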
There is a tradeoff between achieving “zero downtime” and “lots of health probe traffic to the application.” You could try beefing up the application to be able to handle the health probes, or having a specific low-cost health route the probes can hit in your app. Using the HEAD probe method can be lighter, since the origin returns only the response headers and no message body.
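A low-cost health route can be very small. The sketch below uses Python's standard library purely to illustrate the shape of it; your app is .NET, so treat it as pseudocode for the idea rather than something to deploy. The `/health` path and the `make_server` helper are hypothetical names. It answers HEAD with headers only and does no database or rendering work.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

HEALTH_PATH = "/health"  # hypothetical probe path


class HealthHandler(BaseHTTPRequestHandler):
    """Answers probes on HEALTH_PATH with 200 and no expensive work."""

    def _respond(self, include_body: bool) -> None:
        if self.path != HEALTH_PATH:
            self.send_error(404)
            return
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self) -> None:
        # HEAD: status line and headers only, no message body.
        self._respond(include_body=False)

    def do_GET(self) -> None:
        self._respond(include_body=True)

    def log_message(self, format, *args) -> None:
        # Keep frequent probe requests out of the log output.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Build a server bound to localhost; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

To try it locally, call `make_server(8080).serve_forever()` and request `/health`. In a real app the route should return a non-200 status as soon as the app knows it cannot serve traffic, so the probe reflects actual health.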
Set your health probe interval as short as your application can handle. That’ll determine how close to “zero downtime” you can get. Good luck!
https://learn.microsoft.com/en-us/azure/frontdoor/health-probes