When Cloud Computing Goes Wrong

Technical things go wrong. So what should businesses think about to ensure reliable and consistent operations with an added layer of complexity? The first step is recognising that things will go wrong. Whether operations are in an in-house data centre, an external commercial colocation data centre, or in a hybrid cloud arrangement with workload split between in-house and cloud, the principles are the same.

Cloud Isn’t New

No matter what marketing would have us believe, cloud is not a new concept. It is simply the remote hosting of some or all of a workload in a data centre, not dissimilar in principle to the timesharing services of the 1960s. The difference between 1964 and 2014 is the speed and data capacity of fibre optic cables, which opens up a whole host of new possibilities for business owners. But the principle remains the same, as do the principles of resilient design. As some or all of the workload can be hosted remotely, the most critical new consideration is the communication between the user and the data centres where cloud operations take place.

Securing The Right Data Partner

It is important that businesses choose a high-quality data centre with strong data communications and cloud experience to help minimise risks. Any data centre which claims never to have had an outage of any sort is either too new to have a track record or is not training its sales staff to be honest. Even major players with more money to spend than most businesses can dream of, such as Google, Facebook and Amazon, have experienced very public data centre outages in the last five years.

Most recently, in June this year, Microsoft Office's 50 million users in the US experienced a nationwide two-day outage. Operations managers and architects need to ask the right questions carefully to find out the truth, and to work through the options for automatic fail-over or manual switching in the event of something going wrong. Ultimately, it comes down to choosing a data centre that you trust.

Moving The Right Workload

Choosing the right workload to move to cloud is also important, especially in the early days when in-house IT staff have less experience of cloud operations. In general, workload with infrequent, small transactions that are not latency-critical works well in cloud. A CRM system is a good example: the submission of a visit report or the retrieval of a customer phone number is infrequent, small, and not time-critical.

On the other hand, voice telephony, which is a continuous stream of time critical data, is not a good application to move to cloud, except for specialist suppliers who know how to do this and will be located in carrier-rich, carrier-neutral data centres to get the connectivity and diversity they need.

Automatic switching of IP address allocations is a particular problem which needs careful thought. The difficulty of automatically detecting a failure and instantly transferring all the IP addresses to another set of equipment in another location leads many smaller installations to accept a short outage and transfer the addresses manually.
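The article does not prescribe a detection mechanism, but the trade-off it describes can be sketched in a few lines. The following is a minimal, hypothetical sketch (the host names, port and three-strike threshold are illustrative assumptions, not from the article): a simple TCP health check on the primary site, and a counter that only recommends switching to the secondary after several consecutive failures, which is where a smaller installation might page an operator to move the IP addresses manually rather than automate the transfer.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_active(primary_ok: bool, failures: int, threshold: int = 3):
    """Decide whether to stay on the primary site or flag a switch.

    Returns (new_failure_count, site_to_use). A single failed probe is
    tolerated; only after `threshold` consecutive failures is the
    secondary recommended -- at which point, in a small installation,
    an operator would transfer the IP addresses manually.
    """
    if primary_ok:
        return 0, "primary"
    failures += 1
    if failures >= threshold:
        return failures, "secondary"
    return failures, "primary"
```

The threshold exists precisely because of the difficulty the paragraph above describes: distinguishing a real site failure from a transient network blip before committing to moving addresses.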

In resilient or safety critical design, every element must be considered and there is one key question which must be asked – “what will happen if this element fails?” The design can then be changed so operations will continue without interruption. If that is not possible, then a plan has to be put in place to deal with the effects of a failure that cannot be mitigated.

Testing Is Key

Continuous testing is essential, as is reconsidering the effects of each potential failure anew each time the system design or architecture is changed. So is rehearsal and practice of both automatic fail-overs and manual procedures to deal with failures. At least once a year, every likely failure should be forced to happen, so that its effect on the overall system operation can be checked. This is one of the main principles of ensuring reliable, continuous operations, and is the same whether a business is operating an in-house data centre or a remotely hosted operation in a data centre in a cloud environment.
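The "at least once a year" rule lends itself to a simple drill log. As a minimal sketch (the scenario names and dates are invented for illustration), this flags every failure scenario whose last forced rehearsal is more than a year old:

```python
from datetime import date, timedelta


def overdue_drills(last_tested: dict, today: date, interval_days: int = 365) -> list:
    """Return scenarios whose last forced-failure rehearsal is older than the interval."""
    return sorted(
        scenario
        for scenario, tested in last_tested.items()
        if today - tested > timedelta(days=interval_days)
    )
```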

Roger Keenan

Roger Keenan joined City Lifeline as managing director in 2005. Prior to City Lifeline, Roger was general manager at Trafficmaster, during which time he progressed to managing director for Germany and then CEO of Trafficmaster in Detroit. Roger belongs to a number of industry and trade associations, including the Chartered Institute of Marketing (MCIM) and the Institution of Engineering and Technology (MIET), and is a Chartered Electrical Engineer (CEng). Roger studied at the University of Wales, where he was awarded a BSc Hons degree in Electronic Engineering. He then went on to study for an MBA at Cranfield School of Management. Roger is an experienced public speaker and in his spare time has a keen interest in classic cars.