Humans inevitably make errors; it is part of our nature. It is no surprise that the software industry reflects this human nature. Errors exist and bugs happen.

“Software is written by humans and therefore has bugs.”
- John Jacobs

Over the years, research has tried to fight this human nature. We cannot let this certainty frustrate us; we can only follow good practices that help us prevent more errors, detect them faster, and correct them more easily.

So, what do programs do when they fail?

Once we accept the certainty that bugs will occur, we must also accept that programs will eventually fail. A failing program exhibits one of three main behaviours:

  • They crash. (It is important to note that in a properly designed system, a single program crashing does not bring down the overall system.)
  • They hang, i.e. never terminate (e.g. infinite loops).
  • They run to completion but produce wrong output.

Of these three behaviours, the last is the most harmful and dangerous.
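This last failure mode can be surprisingly quiet. As a hypothetical illustration (the function and data here are invented for this article, not taken from any real system), consider an off-by-one bug that runs without any error yet reports the wrong result:

```python
def mean(values):
    """Average a list of numbers -- but with a silent off-by-one bug."""
    total = 0
    # Bug: range stops one element early, so the last value is never added.
    for i in range(len(values) - 1):
        total += values[i]
    return total / len(values)

doses = [10, 20, 30, 40]
print(mean(doses))  # prints 15.0, although the true mean is 25.0
```

No crash, no hang, no warning: only someone who already knows the right answer would notice the program is wrong, which is exactly what makes this behaviour the most dangerous of the three.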

This article examines the Therac-25 case study, revealing the social cost of not following good practices to prevent this last behaviour.

Therac-25

In 1901 a German physics professor, Wilhelm Conrad Roentgen, was awarded the first Nobel Prize in Physics; his contribution was a stepping stone in proving that radiation could be used as a treatment for cancer.

In the early days, radiologists used the skin of their arms to test the strength of radiation emitted by radiotherapy machines. These radiologists looked for a dose that would produce a pink reaction on their skin that looked like a sunburn (erythema)¹. They called this the “erythema dose”.

[Image: example of erythema on an arm]

It is no surprise that many of these radiologists developed leukaemia from regular exposure to radiation. Advances in radiation physics and computer systems during the final decades of the 20th century made it possible to abandon these primitive and dangerous techniques and rely more on technology, allowing radiation to be aimed far more precisely.

All of this is wonderful until we take into account that those computer systems are made by humans who, by nature, are prone to errors.

The Therac-25 was a computer-controlled radiation therapy machine produced by Atomic Energy of Canada Limited (AECL) in 1982, following the Therac-6 and Therac-20 units. Between June 1985 and January 1987, the Therac-25 overdosed six people with radiation, in what is described as the worst series of radiation accidents in the 35-year history of medical accelerators².

To give context, a thousand rads (a unit of absorbed dose of ionising radiation³) spread over a human body can be fatal. Physicists estimate that one of the patients received, on average, between 15,000 and 20,000 rads over a very small area⁴.

What went wrong?

As one of the most influential cases demonstrating the importance of software quality in safety-critical systems, the Therac-25 was analysed by many groups of researchers, who identified the chain of events that led to this series of fatal accidents. Here I list those events and try to explain in more depth what could have been done instead.

  • Bad practice: the company responsible for the Therac-25 did not have the software code independently reviewed and chose to rely on an in-house review.
    Countermeasure: It is good practice to have a phase where code is reviewed. This may be done by the same person who wrote the code, by a different person on the same project or company, or by an independent contractor who reviews your code as a service. What was most criticised about the Therac-25 code was that, while the company told the media the code had been reviewed in-house, the software was designed in a way that made it realistically impossible to test in a clean, automated way⁵.
  • Bad practice: Machine operators were reassured by company personnel that overdoses were impossible, leading them to dismiss the Therac-25 as the potential cause of many incidents.
    Countermeasure: Rule #1 of debugging: always assume you are the one who made the mistake. It is very common to be protective of our work; nonetheless, we should always be humble enough to accept that we will always make mistakes. Don’t let your arrogance deepen the hole you’re already in.
  • Bad practice: The company never tested the Therac-25 with the combination of software and hardware until it was assembled at the hospital.
    Countermeasure: Once again, the solution seems obvious, but we often neglect such things because we have no methodology or documented plan, or because we hold ourselves in too high esteem. We have to be humble enough to recognise our error-prone nature. Another point is that testing environments should always aim to replicate the production environment as closely as possible.
  • Bad practice: Several error messages merely displayed the word “MALFUNCTION” followed by a number from 1 to 64. The user manual did not explain or even address the error codes, nor give any indication that these errors could pose a threat to patient safety.
    Countermeasure: According to the Nielsen Norman Group, who are “world leaders in Research-Based User Experience”, a good error message should be explicit, human-readable, polite, precise, and offer constructive advice⁶. The error message “MALFUNCTION” complies with none of these characteristics. A good error message lets the user know what course of action to take. We should always take into account that when something bad happens, there is always room to make the situation worse. It is our job as developers to prevent this and guide the user toward a correct course of action.
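To make the contrast concrete, here is a hypothetical sketch (the wording and the dose-rate scenario are invented for this article, not taken from the actual Therac-25 software) of an opaque error message versus one that follows the guidelines above:

```python
# Opaque, Therac-25 style: gives the operator nothing to act on.
bad = "MALFUNCTION 54"

# Explicit, human-readable, polite, precise, constructive:
# it says what happened, why it matters, and what to do next.
good = (
    "Dose-rate check failed: the measured beam intensity exceeds the "
    "prescribed limit, so treatment was stopped for safety. "
    "Do not resume. Please contact the service engineer before retrying."
)

print(good)
```

An operator reading the second message knows the machine stopped on purpose, that resuming is unsafe, and whom to call; the first message communicates none of that.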

Conclusion:

Years will go by and new methodologies and methods to prevent bugs will appear; however, I personally think we should stop fighting this human nature and embrace it. Some say bugs will be eradicated when humans stop writing programs and AI takes the reins of software development. Once again, it will be a human configuring that AI. It is time we recognise the importance of this and understand that good software development is a critical part of infrastructure.

Hope you liked this article. If you have any comment, or agree or disagree with something, don’t hesitate to leave a comment to discuss it further, or send me an email at: mau.tech@protonmail.com

Bibliography:

  1. Evolution of Cancer Treatments: Radiation. (2014, June 12). Retrieved November 23, 2020, from https://www.cancer.org/cancer/cancer-basics/history-of-cancer/cancer-treatment-radiation.html
  2. Rawlinson, J. A. (1987, May). Report on the Therac-25. In OCTRF/OCI Physicists Meeting, Kingston, Ontario.
  3. Rad. (n.d.). Retrieved November 23, 2020, from https://www.britannica.com/science/rad
  4. Rose, B. W. (1994, June). Radiation Deaths linked to AECL Computer Errors. Retrieved November 23, 2020, from http://www.ccnr.org/fatal_dose.html
  5. Leveson, N. (1995). Safeware: System Safety and Computers, Appendix A: Medical Devices: The Therac-25. Addison-Wesley.
  6. Nielsen, J. (2001, June 23). Error Message Guidelines. Retrieved November 27, 2020, from https://www.nngroup.com/articles/error-message-guidelines/

Computer scientist, security engineer, web app infrastructure. Kubernetes & microservices. Nodejs & React. Contact: mau.tech@protonmail.com