
Fault recovery – a good reason to have your servers in the cloud

What would happen if a server in your company’s server room, the one running a company-critical service, suddenly developed a crippling hardware fault? How long would it take you to restore that service?

Cloud computing is all the rage these days. But as the saying goes: “There is no cloud, it’s just somebody else’s computer”. It’s true too – if you think cloud computing means your website or other service runs on a ‘cloud’ of CPUs, that isn’t really how it works.

Rather, your service runs on an ‘instance’: a virtual computer onto which you load an OS installation (most likely Windows or Linux), and which runs on a specific (part of a) CPU on a specific server in a large server park. The OS is installed on a slice of physical storage that is allocated to you and then hooked into your instance so that it looks like a disk drive to the OS.

So in the end, your instance runs on a ‘real’ server, and one only (barring any load balancing setups), in the cloud company’s server park.

But here’s the thing: Even if it’s just somebody else’s computer, that somebody has a lot of computers, and they are yours for the taking (or at least for the leasing). And you can do it in a hurry.

Any cloud provider’s cloud is really the sum of that hardware and the tools at your disposal for managing it. You get access to an admin console with a tool set for managing your instances.

Back to the question: what would happen if the critical server in your company’s server room developed a hardware fault?

I had a server hardware fault happen to me last month, so I know what happened in my company. We aren’t big enough to have fancy server rooms with lots of (or any, really) redundant systems, and we certainly don’t have extra server hardware readily available. The server in question runs our customer support ticket system, Helpspot, so it had to be up by Monday morning. Not NASA-type mission critical, but certainly operationally critical to us.

I got the automated alert about the hardware fault, or ‘degradation’, on Sunday morning. Fortunately, this was not a server in our own server room, but one running in Amazon’s cloud. So I grabbed a cup of coffee and sat down with my laptop.

The first thing I tested was whether Helpspot was reachable at the URL it lives on. It was not. The fault was real.
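
That first check is easy to script if you would rather not do it by hand in a browser. The following is only a minimal sketch using Python’s standard library; the function name `is_reachable` and the example URL in the note below are made up for illustration and are not part of the setup described here.

```python
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if the URL answers with a success status within the timeout.

    Connection errors, timeouts and HTTP error responses (4xx/5xx) all
    count as "not reachable" for the purpose of this quick check.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # urlopen follows redirects, so this is the final status code.
            return 200 <= resp.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling `is_reachable("https://support.example.com/")` with your own service’s address in place of that placeholder gives a quick yes or no before you start digging into the console.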

I logged into the console. The instance was still there and still operational on a management level. I proceeded to do the following, using the tools in the console (picture me calmly doing this with my right hand while sipping coffee with my left):

  • Created a backup of the root (boot) virtual disk the instance booted from, containing the OS and the Helpspot installation. Just in case.
  • Highlighted the degraded instance and selected ‘Launch more like this’. This let me easily ‘spin up’ a new instance with the same setup as the old faulty one, but on a new good piece of hardware.
  • While the new instance was being created, I ‘unplugged’ the root disk from the instance running on faulty hardware and stopped it.
  • With the new instance now created and booted up, I ‘unplugged’ the new virtual disk that it had been given by default, plugged in the existing virtual disk that I had previously unplugged from the faulty instance (still containing the complete Helpspot installation), and set it as the new instance’s boot disk.
  • I reassigned the IP address from the old faulty instance to the new one and restarted the new instance.
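
The steps above were done by clicking through the console, but the disk swap and IP move can also be scripted. Below is a rough sketch of what that might look like with the boto3 EC2 client. It is not the procedure from Amazon’s email, and it rests on several assumptions: the replacement instance has already been launched, both instances are in the same Availability Zone (an EBS volume can only be attached to an instance in its own zone), the IP address is an Elastic IP identified by an allocation ID, and both instances are stopped before their root disks are detached. The function names and all IDs are hypothetical.

```python
def root_volume(ec2, instance_id):
    """Return (device_name, volume_id) for the instance's root EBS volume."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    root_device = instance["RootDeviceName"]
    for mapping in instance["BlockDeviceMappings"]:
        if mapping["DeviceName"] == root_device:
            return root_device, mapping["Ebs"]["VolumeId"]
    raise LookupError(f"no root volume found for {instance_id}")


def move_root_volume(ec2, old_id, new_id, allocation_id):
    """Move the old instance's root disk and Elastic IP to the new instance.

    `ec2` is expected to behave like boto3.client("ec2"). The new instance
    is assumed to exist already (the console's 'Launch more like this').
    """
    _, old_vol = root_volume(ec2, old_id)
    new_device, new_vol = root_volume(ec2, new_id)

    # Back up the existing root disk first, just in case.
    ec2.create_snapshot(VolumeId=old_vol, Description="Pre-migration backup")

    # Assumption: stop both instances before touching their root disks.
    ec2.stop_instances(InstanceIds=[old_id, new_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[old_id, new_id])

    # 'Unplug' the old root disk and the new instance's default disk.
    for volume_id in (old_vol, new_vol):
        ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[old_vol, new_vol])

    # 'Plug' the old disk into the new instance at its root device name.
    ec2.attach_volume(VolumeId=old_vol, InstanceId=new_id, Device=new_device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[old_vol])

    # Move the Elastic IP over and start the new instance.
    ec2.associate_address(
        AllocationId=allocation_id, InstanceId=new_id, AllowReassociation=True
    )
    ec2.start_instances(InstanceIds=[new_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])
    return old_vol
```

With boto3 installed and credentials configured, this would be called as `move_root_volume(boto3.client("ec2"), old_id, new_id, allocation_id)` with your own IDs filled in. The new instance’s default disk is only detached, not deleted, so nothing is lost if you want to back out.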

When it was done booting, Helpspot came back online as if nothing had happened. It used the same database as before, which is served from a different type of Amazon instance, and it started fetching the backlog of emails that were waiting on the email server we use.

The whole procedure took about 30 minutes, and I had never done it before. Everything I needed to know was in fact explained in the automated email from Amazon.

So what if that hardware fault had occurred on a server in your own server room or office? Would you even have gotten an automated alert? Or would it have been discovered by a staff member Monday morning as they were trying to use whatever service was running on that server? If you had discovered it on a Sunday, would you have had to drive to the server room to fix it? Would you have had spare hardware ready to transfer the service to? If yes, how long would it have taken you to be up and running again?

I don’t know the answers in your case; all I know is that it took me 30 minutes to fix a hardware fault on a Sunday, from home, on my laptop. And nobody in the company except me even knew the server had ever been down.

This is one reason more companies should consider moving their servers and services to the cloud. You also get more control and a better ability to scale up or down depending on the traffic you see. A cloud presence is not only for the big guys; it is in fact well suited to almost anybody.