When the UCAS (Universities and Colleges Admissions Service) web site crashed last week after a huge spike in traffic on A-level results day, the organisation faced a stream of criticism from the media – and from frustrated students posting comments on social networks.
Many of the finger-pointers argued that UCAS should have been well aware of the likely number of students who would be logging on to its web site to check university places – allowing it to predict the traffic that needed to be supported.
The negative publicity brought to mind other examples of well-known sites and IT systems that have suffered similar outages. Not long ago the London Olympics ticketing system hit problems under the strain of too many people trying to register for ticket allocations.
To the outside observer it may seem ridiculous that so many respected organisations aren’t better prepared. But being prepared is not always straightforward. Scaling up IT resources to cope with occasional very high demand for a web site or application – especially if these same resources are likely to be sat idle for long periods when demand comes nowhere near those peaks – is hard to justify when budgets are tight.
However, as these high profile incidents show, burying your head in the sand or ‘hoping for the best’ can have a lasting detrimental effect on an organisation’s reputation if it all goes wrong.
So, as IT professionals, it is our role both to speak up when we see the risk of potential performance bottlenecks and to think of creative solutions to the problem. For example: some larger organisations are starting to tackle the problem of short term demand spikes by temporarily calling on additional resources from other parts of their IT estate.
Virtualisation technology – which enables workload to be shared across hardware at different locations – can prove useful in such situations. Other organisations are starting to consider using cloud technology to ‘rent’ extra processing capacity via the web to help absorb traffic peaks.
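At its simplest, the burst-capacity idea comes down to a sizing rule: work out how many servers the current request rate demands, plus some headroom, and rent the difference. A minimal sketch – the function name, throughput figures and headroom factor are all illustrative, not drawn from any real platform:

```python
import math

def instances_needed(current_rps, rps_per_instance, headroom=0.25):
    """Estimate how many instances are needed to absorb the current
    request rate, with some headroom for a further spike.
    All figures here are illustrative."""
    target = current_rps * (1 + headroom)
    return max(1, math.ceil(target / rps_per_instance))

# A quiet day versus results day: same formula, very different bill.
print(instances_needed(current_rps=50, rps_per_instance=100))    # quiet day
print(instances_needed(current_rps=5000, rps_per_instance=100))  # traffic spike
```

The economics in the sizing rule are exactly the dilemma above: owning enough hardware for the spike means most of it sits idle the rest of the year, whereas renting it by the hour only costs you during the peak.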
But even if you increase capacity, how do you know when you have enough? It’s not always easy to verify this by testing. Real-world load testing works by simulating peak usage levels, but generating that workload using traditional methods is expensive and technically difficult.
Testing also needs to be repeated many times, over a long enough period to instil confidence – and extended testing on a live system is difficult as the testing process can, in itself, adversely affect performance. Things are changing, though; modern load testing software can take much of the pain out of simulating high volumes of traffic.
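The core of any load generator is simply firing many concurrent requests and collecting latency statistics. A minimal sketch of the idea – here `send_request` is a stub standing in for a real HTTP call to the system under test, with made-up latencies:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(_):
    """Stub standing in for a real HTTP call to the system under test."""
    latency = random.uniform(0.01, 0.05)  # pretend network round-trip
    time.sleep(latency)
    return latency

def run_load(total_requests, concurrency):
    """Fire total_requests requests across `concurrency` parallel workers
    and report simple latency statistics."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(send_request, range(total_requests)))
    return {
        "requests": len(latencies),
        "mean": sum(latencies) / len(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
    }

stats = run_load(total_requests=200, concurrency=20)
print(f"{stats['requests']} requests, "
      f"mean {stats['mean']:.3f}s, p95 {stats['p95']:.3f}s")
```

Commercial and open source tools add the hard parts – realistic user journeys, ramp-up schedules, distributed generation – but the mean and 95th-percentile figures above are the numbers you would watch either way.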
Of course, once you have tested your systems to breaking point you have another challenge: finding out what actually broke – and then fixing it. Any IT infrastructure is only as good as its weakest link.
But finding that broken link – which could be in a back-end application or database rather than your web front end – can be time consuming when faced with your typical organisational mix of platforms and technologies, new and old. All the more reason to do it upfront rather than while the world is watching.
Quality of diagnostic information makes a huge difference here – if you can quickly drill down to the source of a problem, be it a single line of programming code or database call, you will get your systems working again that much faster.
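One cheap way to get that drill-down is to make slow calls identify themselves. A sketch of the idea – a timing decorator that logs any function call exceeding a threshold; the `fetch_places` function and its 0.2-second query are invented stand-ins for a slow back-end call:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def traced(threshold_s=0.1):
    """Decorator that logs any call taking longer than threshold_s,
    so slow code paths and database calls show up in the logs."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed > threshold_s:
                    logging.warning("%s took %.3fs (threshold %.3fs)",
                                    fn.__name__, elapsed, threshold_s)
        return inner
    return wrap

@traced(threshold_s=0.05)
def fetch_places(student_id):
    """Invented stand-in for a slow back-end database call."""
    time.sleep(0.2)  # simulate a slow query
    return {"student": student_id, "offers": []}

fetch_places(42)
```

Application performance monitoring products do this far more thoroughly – down to individual database statements – but the principle is the same: timing data attached to named calls turns “the site is slow” into “this query is slow”.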
Then, once your systems are live, it’s a case of monitoring them constantly – and making sure you have the tools to warn you when performance starts to slide so you can be proactive and catch problems early. Before your customers do.
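The “warn you when performance starts to slide” part can be as simple as a rolling average with a threshold. A minimal sketch – the window size and threshold values are illustrative:

```python
from collections import deque

class LatencyMonitor:
    """Track a rolling window of response times and flag when the
    average crosses a threshold, so the slowdown is caught early.
    Window size and threshold here are illustrative."""

    def __init__(self, window=5, threshold_s=1.0):
        self.samples = deque(maxlen=window)
        self.threshold_s = threshold_s

    def record(self, latency_s):
        """Record one response time; return True if an alert should fire."""
        self.samples.append(latency_s)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold_s

monitor = LatencyMonitor(window=3, threshold_s=0.5)
alerts = [monitor.record(s) for s in (0.2, 0.3, 0.4, 0.9, 1.2)]
print(alerts)  # the last two samples push the rolling average over 0.5s
```

Averaging over a window rather than alerting on single slow responses keeps one unlucky request from paging anyone, while still catching a genuine downward slide within a few samples.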