Author Archive: Cristian Radu

Our Commitment to 99.99% Uptime for Adobe Primetime Authentication

Adobe Primetime Authentication (formerly Adobe Pass) is committed to 99.99% uptime. That allows only around 4 minutes of downtime per month. Every month this year Primetime Authentication achieved 99.99% uptime or better, except this month, when an external factor put our systems to the test. Our preparations paid off: Primetime Authentication will end October with 99.95% uptime, which is 0.04 percentage points shy of our goal. This post is about how we keep our uptime commitment.
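To make the numbers concrete, here is a minimal sketch of the arithmetic behind an uptime target. The function name and the choice of month length are ours for illustration; nothing here is part of the Primetime Authentication service.

```python
def downtime_budget_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a month allows at the given uptime percentage."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # October has 31 days; compare the goal with this month's result.
    for target in (99.99, 99.95):
        minutes = downtime_budget_minutes(target, days_in_month=31)
        print(f"{target}% uptime -> {minutes:.1f} minutes of downtime")
```

For a 30-day month, 99.99% works out to about 4.3 minutes, while 99.95% corresponds to roughly 21.6 minutes.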

Architecting robust services

Infrastructure breaks. No matter how much care you take, it breaks. There is no question about it. The only question is when. And the really hard question is how prepared you are to deal with it when it does.
This is true of any type of infrastructure. Things break in the public cloud and in the private cloud. A wide range of factors such as software bugs, hardware defects, or third‐party service failures can put any system to the test. So it takes a robust architecture to keep a service up even when infrastructure breaks.

Here is a recent example. Primetime Authentication is architected as a distributed system running in an active‐active configuration between the eastern and western United States. It relies on our DNS provider to geo‐balance the traffic. If we encounter problems in the east, the traffic is automatically shifted to the west and vice‐versa. This is a major factor in our high availability. Our DNS provider is reliable and not expected to fail. And still...
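The routing decision described above can be sketched in a few lines. This is a simplified illustration under our own assumptions, not the logic our DNS provider actually runs: the `Region` type, the `pick_region` function, and the region names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Region:
    name: str
    healthy: bool


def pick_region(client_region: str, regions: Dict[str, Region]) -> str:
    """Prefer the region nearest the client; otherwise use any healthy one."""
    preferred = regions.get(client_region)
    if preferred is not None and preferred.healthy:
        return preferred.name
    for region in regions.values():
        if region.healthy:
            return region.name
    raise RuntimeError("no healthy region available")


regions = {
    "east": Region("east", healthy=False),  # simulate a problem in the east
    "west": Region("west", healthy=True),
}
chosen = pick_region("east", regions)  # traffic shifts to the west
```

Note that this decision lives in the DNS layer, which is exactly why an outage of the DNS provider itself defeats it.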

The outage

Last week our DNS provider experienced a major outage that affected multiple states in the eastern United States and impacted several high‐profile services. That hit us hard. Let’s see how we dealt with it.

Rapid response

Within minutes of the DNS outage, our team was in a virtual incident war‐room scrambling to react. It was after midnight in our time zone, but we were online in minutes. This is where all those annoying little things paid off: the fine‐tuned uptime monitoring, the automated alerting system, the discipline of 24x7 on‐call, and the practiced incident response procedures. They all came together like a well‐oiled machine to save valuable time.
OK, so what next? At first we thought something had gone wrong with our eastern instance, so we brought it down and expected the traffic to shift automatically to our western instance. There is no time for deep investigation during an incident: you need to act quickly to restore the service, and that is the first priority. But the traffic didn’t shift and critical alerts kept pouring in. By the time we had narrowed the problem down to DNS, we had received an email from our provider acknowledging the outage.

A DNS outage is bad. Your service is up, but Internet users cannot reach it because the name cannot be resolved to an actual IP address. Most services simply wait for the outage to pass and for user access to be restored. But we can’t afford that with a 99.99% commitment. So we have a backup DNS system.

The right precaution

Some time ago we migrated from our in‐house operated DNS solution to a cloud provider, and we took the precaution of keeping our DNS settings in the in‐house system as well, in an inactive state. This proved to be the key to our recovery. We re‐activated the DNS settings in our in‐house system and made that our primary DNS provider. The fact that we kept the “dormant” records there allowed us to enforce aggressive propagation of the new DNS records. The technical details are less trivial than depicted here and involve a pre‐defined hierarchy of domains and subdomains that allowed us to control our DNS in this way. The point is that you cannot just adopt a new DNS provider on the spot, because the change takes hours to propagate the first time.
So our precaution of keeping the backup DNS system saved us. We were back online and fully accessible in a short amount of time. It certainly took us more than 4 minutes, so we missed our 99.99% target for the month by a small margin. But we were back before everybody else, and we were one of the very few services to manage that.
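Propagation is slow largely because resolvers cache DNS answers until their time‐to‐live (TTL) expires. The toy model below illustrates that behavior only; the `ResolverCache` class, the hostname, the IP address, and the TTL value are assumptions made for the example and do not describe our actual records.

```python
from typing import Dict, Optional, Tuple


class ResolverCache:
    """Toy model of a caching DNS resolver (illustrative only)."""

    def __init__(self) -> None:
        # name -> (value, expiry time in seconds)
        self._entries: Dict[str, Tuple[str, int]] = {}

    def put(self, name: str, value: str, ttl: int, now: int) -> None:
        """Cache an answer until now + ttl."""
        self._entries[name] = (value, now + ttl)

    def get(self, name: str, now: int) -> Optional[str]:
        """Return the cached answer, or None if it is unknown or expired."""
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None  # the resolver has to ask upstream again
        return entry[0]


cache = ResolverCache()
cache.put("auth.example.com", "192.0.2.10", ttl=3600, now=0)
stale = cache.get("auth.example.com", now=600)     # old answer is still served
expired = cache.get("auth.example.com", now=3600)  # None: time to re-query
```

Until the cached entry expires, a resolver keeps handing out the old answer no matter what the authoritative servers now say, which is why changes that sit behind long TTLs cannot be rushed.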

Continuing to protect Primetime Authentication uptime

In retrospect, this looks like a simple thing. Such things always do; hindsight is 20/20. Of course, another DNS provider, what could be simpler? You would be amazed how few services actually have one at the ready. Most of the affected services could only wait in frustration, unable to react, until the DNS provider fixed the outage 90 minutes later. A DNS outage at this scale is rare. But when it happens, you can lose invaluable uptime minutes if you are not prepared.

We remain prepared to protect the uptime of Primetime Authentication when the next challenge strikes.

How We’ve Architected Adobe Primetime for TV Everywhere’s Growth

The first step in the TV Everywhere viewing experience, the authentication step, must be as flawless as possible. Content programmers and pay‐TV providers don’t want authentication issues to stand in the way of subscribers accessing content they paid for. So, we’ve enhanced the service architecture for Adobe Primetime to provide the most robust authentication system possible for TV Everywhere. The enhancements we’ve made are helping deliver smooth, reliable authentication experiences even at peak levels of concurrent video starts.

Our “four nines” service availability

Adobe Primetime authentication (formerly Adobe Pass) handles the vast majority of all TV Everywhere authentications and has an audience coverage of 98% in the US and 95% in Canada. It’s available at the very high “four nines” level of service, which means that it can authenticate viewers with only around 4 minutes of service downtime every month.

Ready for growth

Our enhancements come in preparation for two kinds of TV Everywhere growth. First, we anticipate that overall growth will continue. According to Adobe Digital Index, unique TV Everywhere viewers increased 117% from 2013 to 2014 and authenticated TV Everywhere video starts increased 266%. Second, we expect an increase in the peak levels of concurrent video starts. There are already explosive concurrent video starts around big events like the Olympics, March Madness and the World Cup. The peak numbers are only going to get bigger, especially with new tools like our push notifications for mobile applications, which can invite millions of subscribers to authenticate and watch the same live stream at the same time.

Less latency, more stability with our robust service architecture

We’ve made a number of architecture improvements to make Adobe Primetime the most robust authentication system possible.

We’ve graduated Adobe Primetime authentication from using a single, main data center with a failover reserve to a multiple data center footprint that operates from both coasts in the United States. The new footprint is reducing latency by using the data center that’s the closest to each viewer to respond to authentication requests.

We’ve also introduced real‐time operational analytics of entitlement transactions, which rely on big data storage technologies like Hadoop and HBase. All infrastructure operations are now fully automated, and product releases now happen without any impact on the end‐user experience. This has helped us upgrade components of our architecture with modern technology that’s more appropriate for the scale of the use case that we’re serving.

Fewer service interruptions, more reliability with our intelligent service architecture

We’ve also upgraded our services architecture to improve our early warning notifications, outage protection, and live event monitoring.

We’re now anticipating and fixing technical issues before they have any visible effect on our customers or their subscribers. We use an on‐call process that takes advantage of our significant investment in an early warning and monitoring system. The system automatically alerts on‐call engineers about possible issues so they can respond quickly. This improves uptime and service excellence for customers authenticating through their pay‐TV providers.
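As an illustration of the early warning idea, here is a minimal sketch of an alert that fires when the error rate over a sliding window of recent requests crosses a threshold. The `ErrorRateAlert` class, the window size, and the threshold are hypothetical and chosen only for the example; our production monitoring is more involved.

```python
from collections import deque


class ErrorRateAlert:
    """Signals an alert when recent failures exceed a threshold (sketch)."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # Only the most recent `window_size` results are kept.
        self.results = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if on-call should be paged."""
        self.results.append(success)
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results) > self.threshold


alert = ErrorRateAlert(window_size=10, threshold=0.2)
quiet = [alert.record(True) for _ in range(8)]  # healthy traffic, no alert
first = alert.record(False)    # 1 failure in 9 requests: below threshold
second = alert.record(False)   # 2 in 10: at the threshold, not above it
third = alert.record(False)    # 3 in the last 10: alert fires
```

Using a window rather than a single failure keeps one stray error from paging anyone, while a sustained problem still surfaces within a few requests.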

We can keep authentication and streaming working for TV Everywhere viewers, even when there’s a temporary service disruption with their pay‐TV provider’s online authentication system. Our enterprise‐grade outage protection service can temporarily and seamlessly take over the authentication and authorization of transactions for a pay‐TV provider. This gives the network the time to fix the issue on their side without any disruption to customers. When everything is fixed, we seamlessly return the entitlement back to our customer’s system.
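The takeover pattern can be sketched as a simple fallback. This is only an illustration of the idea under our own assumptions: `ProviderUnavailable`, `authorize`, and the two check callables are hypothetical names, and the sketch says nothing about how the real service decides entitlement during an outage.

```python
from typing import Callable, Tuple


class ProviderUnavailable(Exception):
    """Raised when the pay-TV provider's system cannot be reached (hypothetical)."""


def authorize(
    subscriber_id: str,
    provider_check: Callable[[str], bool],
    fallback_check: Callable[[str], bool],
) -> Tuple[bool, str]:
    """Ask the provider first; take over only if the provider is unavailable."""
    try:
        return provider_check(subscriber_id), "provider"
    except ProviderUnavailable:
        # Temporary takeover while the provider fixes its system.
        return fallback_check(subscriber_id), "outage-protection"
```

Because the provider is tried first on every request, control returns to the provider's system automatically as soon as it responds again.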

We’re also able to monitor customer applications during their important live events with the same rigor with which we monitor our own systems. With live event monitoring, we aim to anticipate and fix any technical issues with a customer’s system that crop up during a major, planned surge of traffic. Live event monitoring can work hand‐in‐hand with outage protection to make the highest‐profile TV events reach TV Everywhere subscribers without a hitch.

The right partner for TV Everywhere success

We’ve made big investments across our systems and services architecture to provide the most robust authentication system possible for TV Everywhere. We’ve achieved “four nines” on our side of the TV Everywhere equation.

We’re also helping pay‐TV providers with stability and reliability on their side. Any content programmer or pay‐TV provider that’s forecasting increased audiences or planning for a big live event should look into our outage protection and live event monitoring. Through partnership, we can provide a great experience for TV Everywhere viewers.

By Cris Radu, Director of Engineering, Adobe Systems