Let us start with a little allegory. Suppose there had been a series of serious accidents involving a new model of car, accidents that seemed to point to a defective design. In a subsequent interview by Anne, a reporter, with Tom, the managing director of the company responsible, we read the following:
- Tom: Well, these things happen sometimes; it’s impossible to avoid them.
- Anne: Surely it’s possible to design reliable cars?
- Tom: Not really, we always expect some glitches in the field.
- Anne: But surely you can test the cars thoroughly before releasing them?
- Tom: Too expensive; you don’t really find the problems until they are out there.
- Anne: But doesn’t that mean you can expect to have catastrophic situations?
- Tom: Exactly, but that’s industry standard practice.
- Anne: But they seem to be able to build reliable aeroplanes, for example?
- Tom: Impractical, and much too expensive for us to use those techniques.
Our reaction would be a mixture of amazement, incredulity, and anger. But if we substitute software systems for automobiles, the above is regrettably close to the situation we actually encounter. Whether it is the embarrassing rollout of HealthCare.gov, the website for Obama’s health-care reform, or the disastrous delays in the new Army Recruiting System in the UK, the press and the public seem quite content to accept as normal a situation in which major software systems are delayed and then malfunction when they are put into service.
Every month brings fresh news of this kind. Indeed, stories of such software “glitches” are so common that we hardly regard them as big news. Whether it is the chaos of Heathrow’s Terminal 5 opening, or the software problem that caused hundreds of very dangerous criminals to be released in California, or people being killed by errors in medical instruments delivering excessive radiation, or hundreds of millions of dollars lost when a never-tested trading program goes berserk, we shrug off the malfunctions as inevitable in large software systems.
In fact, respectable professors of Computer Science have told me that it is impossible to build large software systems that do not contain serious errors. There is a cartoon captioned “The #1 programmer excuse for legitimately slacking off: My Code’s Compiling”. It seems we have a real-life version of that: “The #1 manager excuse for being responsible for some disastrous situation: It’s a software glitch”.
So, as a member of the public, or perhaps as a member of a jury considering the consequences of one of these software messes, should we accept this #1 excuse? What about the claim that it is impossible to build large software that does not contain major glitches? Many professionals in the field would agree with this claim, but in my opinion they are all seriously wrong.
Many years ago, eBay went offline for nearly a week because of a software “glitch”, and the company lost billions of dollars in stock value. I wrote to the founders telling them that there was no excuse, and that if their software people were claiming this sort of thing was inevitable, they should fire them all and hire people who do know how to write reliable software. I never got a reply.
Now, given all the news stories you have read, you may be sceptical, but there is in fact a simple demonstration that it is possible to build reliable software. The clue is in the last exchange between Anne and Tom. Aeroplanes are not just mechanical contraptions; they rely heavily on extremely sophisticated and complex software to control every aspect of the aircraft’s flight.
The Boeing 787 reportedly has 8 million lines of source code. To get some impression of the scale this implies, take 10 lines per day as a typical productivity figure for high-integrity software: 8 million lines then translates into roughly four thousand person-years of programming work.
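The arithmetic behind that estimate can be spelled out as a quick back-of-the-envelope calculation (the figure of 200 working days per person-year is an assumption for illustration):

```python
# Back-of-the-envelope estimate of the effort behind the 787's software.
# The 200 workdays/year figure is an assumed value for illustration.
lines_of_code = 8_000_000
lines_per_day = 10               # typical productivity for high-integrity code
workdays_per_year = 200          # assumed working days in one person-year

person_days = lines_of_code / lines_per_day      # 800,000 person-days
person_years = person_days / workdays_per_year   # 4,000 person-years
print(int(person_years))  # prints 4000
```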
One serious bug in any of that software could cause a major malfunction leading to a fatal crash. Yet we have never lost a life on a commercial flight because of a software bug. Avionics software is not perfect, and we have had a few (thankfully very few) non-critical malfunctions, but software is certainly not the weak link in the chain when it comes to aeroplane safety.
So, there are two questions to ask. First, can we apply the same approach used for aeroplanes more generally in producing software systems (and if not, why not)? Second, as our technology advances, can we do even better than we do now with aeroplanes? My answer to both questions: yes, and yes.
There are three ingredients that go into making safe and reliable avionics software:
- Using the right tools: safe programming languages and safe operating environments.
- Following rigorous standards and procedures in the production of the software.
- Building a programming culture of total commitment to reliability.
All three are important. The right tools certainly help. Many avionics systems, for example, have been built using the Ada language, which has a strong commitment to safety and reliability. But Ada is not some magic bullet that guarantees success. Standards such as DO-178B, widely used in civilian and military avionics, are also a key to success, but again not a guarantee.
Probably ultimately building the right programming culture is key. If you have programmers who think it’s just fine for software systems to fail all the time, with a “we’ll fix it when we get bug reports” mentality, you will get what they expect. If you have programmers who are committed to producing reliable software and have a “failure is not an option” view where they know they will be held responsible if the software malfunctions, you will be much more likely to achieve this goal.
The usual objections to using these kinds of techniques in more general settings are twofold:
- Too expensive.
- Too inflexible; we have to get the product out tomorrow, and a new version every year.
Well, just what is the cost of disastrous software malfunctions? Imagine a conversation with President Obama that goes: “Mr. President, we could make sure the health-care rollout is a success, but it would cost $X and be delayed by Y months”, and try to imagine him replying “we can’t afford $X or the delay, and it doesn’t matter if the rollout results in chaos”.
Programming professionals have a responsibility to make such choices clear to management, and management must make responsible choices, without hiding behind “all-software-has-glitches” excuses. When a banking “glitch” at the Royal Bank of Scotland meant millions of customers couldn’t access their accounts, the bank had to pay £125 million in compensation, and that does not count the loss of goodwill and the cost of bad publicity. Just a fraction of that £125 million could have bought a lot more care in the software production cycle.
Traditionally we have identified “safety-critical” software as deserving this special treatment, where by safety-critical we mean software whose errors can kill or severely harm people. But in our interconnected world, nearly all software is at least indirectly safety-critical in this respect.
Take the case of the dangerous criminals released by accident in California. Victims of crimes they subsequently committed can most certainly blame their plight on the software bug that allowed this to happen. We simply cannot countenance serious errors in critical systems of any kind.
Can we do even better? We certainly can. We are now getting to the point where we can use advanced mathematics to actually prove that programs are reliable. Can such techniques be used generally? Potentially, yes; but for now I would be happy to settle for a first step: applying techniques and tools that have been understood and used for decades more generally, and achieving in all critical systems the same kind of reliability that we have come to rely on in aeroplane travel.
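As a toy illustration of the spirit of such proofs (not the real machinery, which uses theorem provers and verification tools), one can write down a program’s specification explicitly and check the program against it exhaustively over a bounded domain; a genuine formal proof extends the same guarantee to all inputs. The `clamp` function and its specification below are hypothetical examples, not drawn from any avionics system:

```python
# Toy illustration of checking a program against an explicit specification.
# Real formal verification proves the property for ALL inputs with a
# theorem prover; here we merely check a small bounded domain exhaustively.

def clamp(x: int, lo: int, hi: int) -> int:
    """Return x limited to the range [lo, hi] (assumes lo <= hi)."""
    return max(lo, min(x, hi))

def spec_holds(x: int, lo: int, hi: int) -> bool:
    """Specification: the result lies in [lo, hi], and equals x when x does."""
    y = clamp(x, lo, hi)
    return lo <= y <= hi and (y == x or x < lo or x > hi)

# Exhaustive check over a small bounded domain.
ok = all(spec_holds(x, lo, hi)
         for lo in range(-5, 6)
         for hi in range(lo, 6)
         for x in range(-10, 11))
print(ok)  # prints True
```

A theorem prover would discharge the same property symbolically, with no bound on the domain, which is what “proving programs reliable” means in practice.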