Adobe Primetime Authentication (formerly Adobe Pass) is committed to 99.99% uptime. That allows only about 4 minutes of downtime per month. Every month this year Primetime Authentication achieved 99.99% uptime or better, except this month. This month, an external factor put our systems to the test, and our preparations paid off. Primetime Authentication will end October with 99.95% uptime, which is 0.04 percentage points shy of our goal. This post is about how we keep our uptime commitment.
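To make those percentages concrete, here is a minimal sketch of the arithmetic. The function name and the choice of a 31-day month (October) are ours, for illustration only.

```python
def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over `days` at a given uptime percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


# October has 31 days, so compare the 99.99% target with a 99.95% month.
print(round(downtime_budget_minutes(99.99, days=31), 1))
print(round(downtime_budget_minutes(99.95, days=31), 1))
```

For a 31-day month this works out to about 4.5 minutes at 99.99% and about 22.3 minutes at 99.95%, assuming the published percentages are exact rather than rounded.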
Architecting robust services
Infrastructure breaks. No matter how much care you take, it breaks. There is no question about it. The only question is when. And the really hard question is actually how prepared you are to deal with it when it breaks.
This holds for any type of infrastructure. Things break in the public cloud and in the private cloud. A wide range of factors, such as software bugs, hardware defects, or third-party service failures, can put any system to the test. So it takes a robust architecture to keep a service up even when infrastructure breaks.
Here is a recent example. Primetime Authentication is architected as a distributed system running in an active-active configuration between the eastern and western United States. It relies on our DNS provider to geo-balance the traffic. If we encounter problems in the east, the traffic is automatically shifted to the west, and vice versa. This is a major factor in our high availability. Our DNS provider is reliable and not expected to fail. And still…
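The routing decision behind that setup can be sketched in a few lines. This is a simplified illustration, not our DNS provider's actual logic: the region names, the health map, and the pick_region function are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical region names for an east/west active-active deployment.
REGIONS = ("east", "west")


def pick_region(preferred: str, healthy: Dict[str, bool]) -> Optional[str]:
    """Return the region that should serve a client, or None if no region is healthy."""
    # Serve from the geographically preferred region while it is healthy.
    if healthy.get(preferred, False):
        return preferred
    # Otherwise shift the traffic to any other healthy region.
    for region in REGIONS:
        if region != preferred and healthy.get(region, False):
            return region
    return None


# An east-coast client is sent west while the eastern instance is unhealthy.
print(pick_region("east", {"east": False, "west": True}))
```

Note that this decision runs at the DNS layer. If the DNS service itself is unavailable, nothing is left to make the decision, no matter how healthy both regions are.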
Last week our DNS provider experienced a major outage that affected multiple states in the eastern United States and impacted several high profile services. That hit us hard. Let’s see how we dealt with it.
Within minutes of the DNS outage, our team was in a virtual incident war-room scrambling to react. It was after midnight in our time zone, but we were online in minutes. This is where all those annoying little things paid off: the fine-tuned uptime monitoring, the automated alerting system, the discipline of 24×7 on-call, and the practiced incident response procedures. They all came together like a well-oiled machine to save valuable time.
OK, so what’s next? At first we thought that something had gone wrong with our eastern instance. So we brought it down and expected the traffic to shift automatically to our western instance. There is no time for deep investigation during an incident: you need to act quickly to restore the service, and that is the first priority. But the traffic didn’t shift and critical alerts kept pouring in. By the time we had narrowed the problem down to DNS, we got an email from our provider acknowledging the outage.
A DNS outage is bad. Your service is up, but Internet users cannot reach it because the name cannot be resolved to an actual IP address. Most services simply wait for the outage to pass and for user access to be restored. But we can’t afford that with a 99.99% commitment. So we have a backup DNS system.
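Telling "the name does not resolve" apart from "the service is down" early can save precious minutes. Here is a minimal sketch of such a check using Python's standard socket module. The diagnose function and its return labels are hypothetical and not part of our monitoring stack. The resolver is passed in as a parameter so the check can be exercised without network access.

```python
import socket
from typing import Callable


def diagnose(hostname: str, port: int = 443,
             resolver: Callable = socket.getaddrinfo) -> str:
    """Report whether a hostname resolves, to separate DNS failures from service failures."""
    try:
        # getaddrinfo raises socket.gaierror when the name cannot be resolved.
        addresses = resolver(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    # An empty result is treated the same way as a failed lookup.
    return "resolves" if addresses else "dns_failure"
```

A result of "dns_failure" while the instances themselves report healthy points at the DNS layer rather than at the service.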
The right precaution
Some time ago we migrated from our in-house DNS solution to a cloud provider. As a precaution, we kept our DNS settings in the in-house system as well, in an inactive state. This proved to be the key to our recovery. We re-activated the DNS settings in our in-house system and made it our primary DNS provider. The fact that we had kept the “dormant” records there allowed us to enforce aggressive propagation of the new DNS records. The technical details are less trivial than depicted here and involve a pre-defined hierarchy of domains and subdomains that allowed us to control our DNS in this way. The point is that you cannot just adopt a new DNS provider on the spot, because such a change takes hours to propagate the first time.
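Why propagation can take hours comes down to caching: a resolver keeps serving a cached record until its time-to-live (TTL) expires. The sketch below is a generic illustration of that rule, not a description of our domain hierarchy. The record values and TTLs are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    value: str          # what the resolver currently believes
    ttl_seconds: int    # how long the resolver may keep that answer
    cached_at: float    # when the resolver fetched it, in seconds


def seconds_until_refresh(record: CachedRecord, now: float) -> float:
    """Return how long a resolver keeps serving the cached value before asking again."""
    return max(0.0, record.cached_at + record.ttl_seconds - now)


# Illustrative values only: a delegation cached for two days versus
# a record published with a 60-second TTL, both cached at time 0.
delegation = CachedRecord("ns.old-provider.example", 172800, cached_at=0.0)
short_lived = CachedRecord("192.0.2.10", 60, cached_at=0.0)

# One hour after caching, the delegation is still held, the short record is not.
print(seconds_until_refresh(delegation, now=3600.0))
print(seconds_until_refresh(short_lived, now=3600.0))
```

Under these assumed values, a change to the long-lived delegation stays invisible to that resolver for almost two more days, while a change to the short-lived record is picked up on the next query. This is why records that already exist, with suitable TTLs, can be switched far faster than a provider introduced from scratch.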
So our precaution of keeping the backup DNS system saved us. We were back online and fully accessible in a short amount of time. It took us more than 4 minutes, so we missed our 99.99% target for the month by a small margin. But we were back before everybody else, and we were one of the very few services to manage that.
Continuing to protect Primetime Authentication uptime
In retrospect, this looks like a simple thing. Such things always do; hindsight is 20/20. Of course, another DNS provider, what could be simpler? You would be amazed how few services actually have one at the ready. Most of the affected services could not react and simply waited in frustration until the DNS provider fixed its outage, 90 minutes later. A DNS outage at this scale is rare. But when it happens, you can lose invaluable uptime minutes if you are not prepared.
We remain prepared to protect the uptime of Primetime Authentication when the next challenge strikes.