Whether you’re using your own tools or someone else’s, cloud service downtime can be nerve-wracking, especially if the software is mission-critical. Here are some things to keep in mind ahead of time.
We’re using more cloud services than ever, and that means an outage at any one of them can have widespread effects.
Certainly, this is nothing new—remember when Twitter used to go down every other day?—but far-reaching downtime is increasingly hitting services that companies rely on to function. Like Slack (which went down last weekend), or Google Docs, or even your web host.
The reasons are, of course, diverse. Last Monday, a bout of early-morning downtime affected thousands of sites that rely on Cloudflare, including popular web services such as the chat tool Discord. The company explained that the cause was a misconfiguration that arose when Verizon misinterpreted a command from a small internet provider.
“This was the equivalent of Waze routing an entire freeway down a neighborhood street—resulting in many websites on Cloudflare, and many other providers, to be unavailable from large parts of the internet,” Cloudflare’s Tom Strickx explained in a blog post.
And at the beginning of June, Google’s cloud service went down, taking out a variety of other services that rely on Google’s software. This was ironic, because Google had no way to access the cloud-based tools it needed to repair its own software. “And so, for an entire afternoon and into the night, the internet was stuck in a crippling ouroboros: Google couldn’t fix its cloud, because Google’s cloud was broken,” Wired’s Brian Barrett wrote.
These incidents are not isolated. In May, the Uptime Institute, which sets standards for data center redundancy, said that slightly more than a third of respondents to its Global Data Center Survey reported business impacts related to downtime from public cloud services.
For organizations, the potential impact is twofold: Your association is likely reliant on cloud infrastructure for both productivity tools you subscribe to (Google Docs, Slack, Office 365) and mission-critical managed tools (your association management system or content management system). If the wrong server goes down, all of a sudden you’re out of commission—internally or externally.
Part of the problem is that many services use the same small handful of cloud providers, so if one provider falters, the ripple effect can be felt across an organization. It’s one thing if your chat tool is hosted on Google Cloud, but if the plumbing behind your mobile app lives there too, an outage can really hurt.
You don’t have to just cross your fingers and hope it doesn’t happen to you. Here are a few things you can do to avoid downtime—or at least mitigate its effects:
Look at the numbers. You may not have control over a service that relies on a cloud provider, but you can control your own stack. Recently, a Network World piece compared the “big three” cloud services: Google Cloud, Microsoft Azure, and Amazon Web Services. Author Zeus Kerravala highlighted two significant issues with Microsoft’s platform in particular. First, it doesn’t self-report as much detail on uptime as its competitors do. Second, according to self-reported numbers from each service, Azure has had significantly more downtime. Other factors may sway your organization toward one service or another (Azure has gained popularity thanks to its aggressive offering of credits to enterprises, Kerravala notes), but be sure to look at the available data when making a decision.
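When you do look at the numbers, it can help to translate an uptime percentage into hours, since "99.9 percent" sounds close to perfect until you see what it allows. The short Python sketch below does that arithmetic. The function name and the percentages are illustrative assumptions, not figures reported by any provider or by the Network World piece.

```python
# Minimal sketch: convert an uptime percentage into allowed downtime per year.
# The percentages used below are illustrative assumptions, not provider data.

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - uptime_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical uptime levels, chosen only to show the scale of the difference.
    for pct in (99.0, 99.9, 99.99):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99 percent uptime still leaves room for more than three and a half days of outages a year, while 99.99 percent leaves less than an hour. That gap is worth keeping in mind when comparing self-reported figures.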
Consider leaving mission-critical apps off the public cloud. The Uptime Institute noted that roughly a fifth of organizations surveyed have been wary of moving mission-critical operations to public cloud offerings because they’re concerned they wouldn’t get enough visibility into their stack. That’s a fair hesitation, given that half of respondents with mission-critical tools currently on the public cloud express that very concern. This might lead organizations to keep tools they need to stay online on a private cloud, even if risks like power outages threaten uptime.
Look at the equipment and software being used. It’s easy to cut corners with a cloud server in an attempt to save a buck, but if you frequently have downtime on your own stack, the culprit may be one of two issues: underpowered hardware for the task or inefficient software that fails to properly use the hardware.
Come up with contingency plans for when key equipment is offline. It’s important to have a plan B, even if you prefer plan A. Your team may prefer talking on Slack, but if Slack is down, you always have email or (surprisingly) face-to-face conversation to fall back on. Consider how you can bake this into secondary tools you already use, so you can keep some functionality online even if your primary tools aren’t.
Weigh the risks and benefits of a tool, even after a downtime incident. Cloudflare is not so much a cloud hosting provider as a security and caching service. Its value proposition is different (much of the service is free), and the cost savings are often worth it, even with a modest level of downtime. In fact, removing it would likely create more downtime than keeping it. So be sure you’re analyzing your tools correctly after an incident that might shake your trust in a platform you rely on.
Downtime happens—it’s a fact of life. But understanding ways to avoid it, along with having strategies in place for when you can’t control it, might make it a little easier to bear.
Even if it means you can’t post memes on Slack for a while.