
The Institution of Railway Signal Engineers Inc, Australasian Section Incorporated

Software Reliability – An Oxymoron?

Alena Griffiths MIEAust, CPEng, PhD, BSc(Hons), LLB, RGB Assurance Pty Ltd

SUMMARY: Rarely a week goes by without a major software failure featuring prominently in the news. Some problems, such as the reported computer glitches with Virgin Blue's check-in software in 2010, merely result in financial loss. Others, such as the Queensland Health payroll debacle in 2011, contribute to the downfall of governments. And of course there have also been cases where software unreliability has contributed to the unavailability of critical public infrastructure and, in some cases, to loss of life. But how vulnerable is the rail industry to software unreliability, and what is the real likelihood that software problems could actually stop the trains (or even crash the trains)? This paper will provide a brief survey of the extent to which modern railways depend on correct software operation. We will show that this dependency extends from customer-facing applications, such as web-based journey planners and fare sales and collection systems, through to critical service delivery applications such as routing trains, scheduling essential maintenance, and responding to emergencies. Having elaborated the dependence of modern railways on software technology, we will then proceed to discuss the vulnerabilities this presents. We will describe the main reasons why software engineering is different from other engineering disciplines, and hence why the reliability of software must be approached differently from the reliability of other engineering products. The explanation will range from the science that underpins software engineering, through the complexity inherent in modern software systems, to social issues such as regulation of the software engineering profession and the psychology of the software development process. In particular, we will consider traditional approaches to reliability engineering and explain why these approaches in general translate poorly to software. Finally, we will describe how software reliability is being approached in the Australian rail industry today, and provide some suggestions for improving our handling of, and hence reducing our vulnerability to, software reliability issues.

1 INTRODUCTION

This paper considers the topic of software reliability, and the relatively poor success record of major software development projects. It considers the extent to which modern railways are vulnerable to software failures, and finds that the current level of vulnerability is high. The paper then considers why software is so hard to get right, having regard to factors ranging from the underpinning mathematics of computer science through to the lack of regulation of the software profession. We also explain why the reliability of software is notoriously difficult to predict with any accuracy. We conclude by describing the approaches currently being taken to this problem within the Australian rail industry.


2 SOFTWARE DISASTERS

Most practising engineers make extensive use of simple office software such as MS Word, Excel and PowerPoint, and will have seen, at some point in their career, the well-known "blue screen of death" [1], which replaces the normal Windows display when you do something outside the parameters of what Windows considers expected use. While annoying, and occasionally extremely inconvenient, such software failure will not usually cost you your job, nor is it life-threatening. However, the same cannot be said of software failures generally. Indeed, the debacle surrounding the Queensland Health payroll system [2] was blamed in part for the downfall of the Bligh Labor Government in Queensland. That system was originally estimated to cost $6.13M; by the go-live date of March 2010 it had in fact cost $40M. It has since delivered over $75M in overpayments to some staff, left other staff unpaid, and at the time of the last Queensland election had incurred further repair costs of $220M. Estimates by the newly elected government of the cost of correcting the remaining problems range as high as half a billion dollars.

Other spectacular failures have included the grounding of a major airline (problems with Virgin Blue's check-in software in September 2010 cancelled 116 flights and cost millions of dollars [3]), the outage of the entire AT&T public telephone network in the US in 1990 (75 million calls missed, 200 thousand airline reservations lost [4]), deaths from excessive radiation dosages (three people killed by the Therac-25 machine in 1985 [5], and another eight in 2000 as a result of incorrect software calculation of required dosage levels [6]), and the loss of the Ariane 5 rocket in 1996 [7]. In short, in the relatively brief history of software engineering (the term is believed to have first been coined at a NATO conference in 1968), software has played a disproportionately large role in the causes of engineering system failures.


3 RELIANCE OF RAILWAYS UPON CORRECT SOFTWARE OPERATION AND CONSEQUENT VULNERABILITIES

So, how vulnerable are modern railways to software failures? In this section we examine how software is used for the smooth operation of the railway (i.e. what is needed to make things go right), and how this in turn gives insight into the kinds of things that could go wrong.

3.1 What is needed to make things go right?

Most users' interaction with a passenger railway these days begins with an Internet search to reveal travel times and routes. Whether they travel using a smart card (like the Hong Kong Octopus card or the Queensland Go Card) or buy a single-use ticket, such interactions involve computers and, in the case of smart cards, involve users trusting the rail operator to correctly manage e-commerce transactions on their behalf. On commencing their journeys, passengers will typically be admitted to stations via automatic revenue systems, which admit them through turnstile gates and register their ticket or smart card. While waiting on the platform, digital passenger information displays will advise them of their approaching train, often reinforced by computer-activated audible public address systems. Once on board, the passengers will breathe air which may have been cleaned and cooled via an automatic air-conditioning system, which in turn depends upon activation of tunnel ventilation systems for underground travel.

The movement of the train will be at least computer-supervised, if not directly computer-controlled, on most modern railways, and train drivers will obey movement authorities communicated to them via coloured-light signals or in-cab digital displays. The railway infrastructure will be maintained based on sophisticated maintenance schedules developed using software-supported reliability-centred maintenance tools, and managed using maintenance management systems. Maintenance may also be influenced by sophisticated image recognition software, which records images of rails or other infrastructure and schedules corrective or preventative maintenance when defects are recognised.

3.2 What things can go wrong?

Each reliance of railways upon software technology in turn gives rise to a vulnerability in the event of a problem with that software. For example, in the previous subsection we mentioned the Octopus and Go Card smart cards. In February 2007 a glitch with Octopus value-adding software was reported in which patrons' accounts were debited but their smart cards not topped up, even when the transaction had been cancelled [8]. Subsequent investigations revealed that 3.7 million Hong Kong dollars had been wrongly deducted, in over 15,270 individual cases. This caused public embarrassment for Octopus Holdings and temporary loss of faith in the main ticketing system used on Hong Kong railways (it also resulted in the company that ran the electronic payment system permanently losing Octopus's business). In Queensland, a bug in the Go Card logic [9] meant that if a passenger touched on at the front of the bus, and then immediately off at the back of the bus, they would not be charged for the journey. This caused considerable loss of revenue to the bus services, but had less impact on urban rail operations, where passengers touch on and off when they enter and leave a station, not when they enter or leave the train itself.
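
The report [9] does not spell out the faulty logic, so the sketch below is hypothetical; it simply shows how a distance-based fare rule that is correct for every ordinary journey can silently charge nothing when tap-on and tap-off coincide. The zone model and rate are invented for the example:

```python
# Hypothetical distance-based fare rule -- NOT the actual Go Card code.
RATE_PER_ZONE = 1.25  # dollars per zone boundary crossed (invented figure)

def fare(tap_on_zone: int, tap_off_zone: int) -> float:
    """Charge in proportion to the number of zone boundaries crossed."""
    return RATE_PER_ZONE * abs(tap_off_zone - tap_on_zone)

assert fare(1, 4) == 3.75   # an ordinary journey is charged as expected
assert fare(3, 3) == 0.00   # touch on, touch straight off: a free ride
```

The defect is in no single line; every statement does what it says. What is missing is an unstated requirement, such as a minimum fare or a rule rejecting an immediate same-stop touch-off, which is exactly the kind of specification gap discussed in Section 4.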

The above incidents had implications for customer goodwill in the first case and for revenue in the second, but there are also areas where software malfunction could cause more serious incidents. Software now controls many safety-of-life functions performed on a railway. For example, the interlockings that perform vital signalling functions, interlocking the movement of trains with signals, with other train movements, and with the movement and position of points, are now routinely implemented using program-like code (e.g. Boolean logic or function block diagrams) running on a computer-based interlocking, rather than a traditional relay circuit.

For some time, non-vital (though operationally critical) services such as train radio have operated over open communications systems, while vital communications have been achieved with cabled or wired (i.e. closed) network solutions. This is changing, however, with the movement towards centralised authority management systems such as European Train Control System (ETCS) Level 2 and higher and Communications Based Train Control (CBTC): vital communications increasingly travel over open networks such as radio or wireless networks, and the encoding, transmission and decoding of messages is all performed digitally, via software. In general, such systems are designed so that a failure to deliver a message does not represent a wrong-side failure; however, a software error could cause transmission of a correctly encoded message with an incorrect payload, and this cannot be fully mitigated.

Peak hour passenger trains travelling through tunnels rely on computerised tunnel ventilation systems not just for passenger comfort but, if the train is delayed for lengthy periods underground, for passenger health.

Finally, some rail services are replacing drivers partly or completely with computer-based train control systems, removing that final line of defence between computer control and potential disaster. (Of course, there is a school of thought and supporting research that suggests that removing the possibility for human error here will, as a whole, improve the risk profile of the system, but that presupposes a certain level of integrity and trust in the underlying control software).


3.3 Safety and Reliability

In this paper so far, we have generally referred to software reliability in terms of what could go wrong, and pointed out that this can have an impact either on operational reliability or on railway safety, and occasionally on both. Of course safety and reliability are not the same thing. A railway where the trains never move will have no train crashes. However, to the extent that a system implements a safety function, such as interlocking signals and points, unreliable (i.e. incorrect) implementation of that system can be a safety issue, so software reliability can be a software safety problem.


4 WHY SOFTWARE IS DIFFERENT

While people are generally comfortable accepting electrical or mechanical interlocks, many experienced engineers remain uncomfortable about accepting software-based interlocks. In this section we consider why software is different.

4.1 Underpinning Science

Many branches of engineering are based on what Engineers Australia calls an underpinning science, which is inherently continuous in nature. For example, if we test a structure to show that it can support a weight force of 100 tonnes, then it is reasonable to assume that it can support any weight force below that, including no weight force at all. On the other hand, it would be distinctly unwise to assume that it would support a weight in excess of that number without further analysis and/or testing. By contrast, the fact that a computer program correctly computes the multiplicative inverse of the number 100 provides no insight whatsoever into how it will behave if presented with the input zero (0), and indeed no insight into how it will behave with any other input greater than or less than 100. This is because the underlying mathematics is based upon discrete disciplines such as set theory and logic, and not upon continuous mathematics such as differential and integral calculus.
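
A trivial sketch makes the point. The function below passes a test at the input 100; in a continuous discipline that data point would bound the behaviour of nearby inputs, but here it says nothing at all about the input 0:

```python
def reciprocal(x: float) -> float:
    """Return the multiplicative inverse of x."""
    return 1.0 / x

assert reciprocal(100.0) == 0.01   # behaviour at 100 is correct

reciprocal(0.0)                    # raises ZeroDivisionError: behaviour at 0
                                   # is not "slightly degraded", it is failure
```

There is no analogue of the proof load: no finite set of sample inputs covers its neighbours, so correctness must be argued path by path or derived from the discrete mathematics itself.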


4.2 Inherent Complexity


The complexity of systems engineered using technologies other than software is usually subject to natural limits. For example, the size of a physical structure is limited by the strength of different kinds of building materials and their ability to withstand natural forces such as wind and seismic activity (and of course by the cost of building materials!). Traditional electronic circuits are also fundamentally limited by the amount of board real estate available (although the advent of nano-technologies has seen the complexity of integrated circuits approach that of software, and is leading electronics engineers to apply computer science techniques to manage the inherent complexity). However, there are no such physical limits on the size, and hence the complexity, of software systems. Also, the natural tendency of engineers to overcomplicate matters often leads to designs that are more complex than necessary.

4.3 Software Engineering

The term software engineering first appeared at the 1968 NATO Software Engineering Conference [10], where it was meant to provoke thought regarding the perceived "software crisis" of the time. To quote: "The term (software crisis) was used to describe the impact of rapid increases in computer power and the complexity of the problems that could be tackled. In essence, it refers to the difficulty of writing correct, understandable, and verifiable computer programs. The roots of the software crisis are complexity, expectations, and change."

Software engineering is the application of disciplined, systematic and scientifically justified methods to the design, development, maintenance and operation of software. Since the term was first coined in 1968, the discipline of software engineering has evolved enormously, all of it aimed at the goal of delivering demonstrably correct software, on time and on budget. For the moment, compliance with good software engineering practice is generally agreed to be the best defence against problems traceable to unreliable software. Good software engineering involves many practices that most engineers would consider common sense, such as:

1. Writing down and agreeing requirements before building a product;
2. Documenting design decisions so that someone else can maintain the software when you aren't there anymore;
3. Subjecting software to a structured review, by a person other than the one who first produced the software;
4. Requiring software to be independently tested, prior to use;
5. Requiring the record of review, test and certification for service to be documented and available for third-party scrutiny;
6. Etc.

4.4 The Software Culture

Computer science as a field of study has a long history of defying convention and authority, and those who have been successful have often done things differently. For example, both Apple and Microsoft were founded by university drop-outs (Apple famously in a garage). Stories about the early history of both companies abound with tales of people working through the night to get products to market; so much for fatigue management for safety-critical workers!


Industry commentator Scott Rosenberg has stated [11] that the history of software development is marked by missed deadlines, blown budgets and broken promises. In his book [12], he discusses the culture of personal heroics that dots the computer science landscape. While this culture has been responsible for tremendous successes, it also runs directly counter to the ideas of transparency, accountability and group ownership of engineering outcomes that underpin the discipline of software engineering discussed above. This is such a difficult problem to solve that entire books have been written about how to establish a good software engineering culture [13].

4.5 Software Engineering as a Profession

Given that the term software engineering was only coined in 1968 [10], it is little surprise that software engineering as a profession is still relatively young. In 2004 the IEEE, in conjunction with the ACM, published the Software Engineering Body of Knowledge (SWEBOK) [14]. This has continued to be maintained over the years and is still considered the best place to find an answer to the question "What is software engineering?". In the USA, the IEEE certifies software development professionals, but only some states and territories recognise software engineering as a route to the award of the post-nominals PEng (professional engineer). The UK was arguably far quicker off the mark here, with the British Computer Society and (what is now) the Institution of Engineering and Technology both being able to award Chartered Engineer status to software engineers since 1984.

In Australia, there is almost no registration of the software engineering profession and almost no restriction on professional practice. This is because in Australia the main route to Chartered Engineer status is via membership of Engineers Australia, which in turn typically requires a person to have completed an accredited engineering degree. The first software engineering degree was not accredited by IEAust until about 2001. Most software engineers who would be eligible to become Chartered Engineers are therefore still relatively inexperienced (less than 11 years in service). As a consequence, there are very few software engineers in Australia who are Chartered professionals. Indeed, most who would classify themselves as software engineers and who are also Chartered either received their designation overseas, were originally chartered in another engineering discipline, or came to Chartered status via a non-traditional route (e.g. people who obtained a computer science or similar degree, and later did a competency-based assessment to become members of IEAust). The Australian Computer Society has no power in Australia to charter software engineers.

The Institution of Railway Signal Engineers, the most respected and arguably the most relevant professional body in the area of railway engineering, has the power to charter engineers in the UK. No such power has been awarded to the IRSE in Australia, however, and a Chartered Engineer with the IRSE and the UK Engineering Council may not have their status recognised in Australia, due to the absence of any mutual recognition agreement between IEAust and the UK Engineering Council. So far as the author is aware, software engineers are generally not well represented amongst the ranks of signalling engineers, notwithstanding that modern authority management systems, such as the Radio Block Centre at the core of an ETCS Level 2 system or the Zone Controller at the core of a CBTC system, are all large software systems.

In general, the practice of software engineering in Australia is unregulated. Queensland has a Professional Engineers Act which requires all practising engineers to be registered, and this in turn requires the equivalent of Chartered Engineer status to be demonstrated; but thus far, so far as the author is aware, there has been no enforcement of this legislation in the field of software engineering. Yet the railways in Queensland are as dependent on software as the railways in the rest of the world!

4.6 Summary of Why Software is Different

In summary then, what is needed to build good software is not that different from what is needed to engineer products using other technologies. You need a solid grasp of the underpinning science, applied in a disciplined manner and subjected to independent review, and you need to produce an audit trail along the way to provide the appropriate assurance for systems on which the public will ultimately depend. However, the underpinning science is different: it depends on computer science, which in turn depends heavily on discrete mathematics. Also, the software industry as a culture has been slow to embrace some traditional engineering concepts, and, to be fair, the engineering community in Australia, including the railway engineering community, has been extremely slow to recognise that software is an essential part of engineering modern systems today.

5 WHY SOFTWARE RELIABILITY PREDICTION IS HARD

Most simple hardware devices, including some electronic computing components, possess failure modes that depend on the intrinsic physical characteristics of the equipment. The frequencies of these failure modes are distributed randomly about some mean. That is, for a simple electronic component, there are accepted methods for determining the rate at which components of that type will fail. Excluding an initial period of high failure ("infant mortality") and a final period of high failure associated with end of life, the failure rate between these extremes is typically constant. The mathematics associated with this kind of failure behaviour is well understood, and it is possible to predict with confidence the failure behaviour of a system comprised of such components, given an understanding of the failure behaviour of the individual components.

It is also possible to design tests for this kind of equipment that will determine, with a measurable level of confidence, the likelihood that the equipment will fail. Tests to prove extremely low failure rates with a high degree of confidence require a large number of test hours to be accumulated, and are therefore expensive to run. However, the problem is tractable, and it is fairly easy to conduct cost-benefit trade-offs, comparing the lower cost of a shorter test against the increased risk associated with lower confidence in the predicted failure rate. Also, it is possible to introduce equipment into service with operational restrictions that make allowance for the lower confidence level. Subsequently, based on observation of its performance in the field, it is possible to safely re-adjust the estimates of expected failure behaviour and, if appropriate, remove some of the initial limitations on use. Applied to railways, this means that for simple systems comprised of components with random failure behaviour, it might be possible to introduce systems into operational service with restrictions, but to seek to lift these restrictions following some period of incident-free operation.

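As a sketch of how tractable the random-failure case is, the fragment below combines invented component failure rates using the standard constant-rate exponential model; moving between component data, system failure rate, MTBF and mission reliability is simple arithmetic:

```python
import math

# Invented failure rates (failures per hour) for a simple series system;
# real figures would come from a component reliability data book.
component_rates = {
    "power_supply": 2.0e-6,
    "relay":        5.0e-7,
    "lamp_driver":  1.0e-6,
}

system_rate = sum(component_rates.values())   # series: any failure fails all
mtbf_hours = 1.0 / system_rate                # mean time between failures

def reliability(t_hours: float) -> float:
    """P(no failure within t_hours) under a constant failure rate."""
    return math.exp(-system_rate * t_hours)

print(f"system rate {system_rate:.1e}/h, MTBF {mtbf_hours:,.0f} h")
print(f"P(no failure in one year of continuous use) = {reliability(8760):.3f}")
```

No comparable composition exists for software, for the reasons set out next.
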
In contrast, software failures reflect errors in the logical design of the code. A software program, when in a given state (i.e. its internal variables possessing certain values), and when presented with a certain set of inputs, will exercise a particular logical path through the code, produce a certain set of outputs, and transition to another state. Whether this behaviour is considered a failure or not depends on the specification. However, whenever the software is in a particular state and presented with a particular set of inputs, its behaviour will be identical. As such, the reliability at a certain time depends on the logical path just exercised, and it will be either 1 (correct) or 0 (incorrect). That is, software fails systematically, as a result of defects introduced during its construction. Some of these may be picked up during verification activities, but proving that all such defects have been identified is an elusive goal. (NB: other technologies are also vulnerable to design errors introduced during construction, e.g. relays mistakenly wired in parallel rather than in series, but the overall complexity of software makes it more prone to this kind of failure.)

As a result, for complex software systems it is far more difficult to design tests that demonstrate a certain reliability. The fact that a program has behaved correctly for a period in the past is no guarantee that it will behave correctly in the future, unless one is sure that all logical paths that will be exercised in the future have been exercised in the past. For complex systems the number of logical paths is huge, and there are several studies in the literature showing that exhaustive testing is intractable (the time required for testing would exceed, by a huge factor, the expected life of the underlying equipment); refer to [15] and [16] for a discussion of these problems.

On the other hand, there is a body of research based on the idea that although software fails systematically, the demands placed upon the software in a certain setting, and hence the logical paths that will be exercised, can be characterised by an operational profile. By designing a test profile that is statistically representative of the operational profile, it is possible to infer, based on the size of the test, the likelihood of system failure in use. This body of research is now well established (refer [17]); however, constructing a valid operational profile is a difficult problem, and in any case the test time required to establish reliability even to modest levels is considerable. For example, [16] notes that to establish 99% confidence in a failure rate of 1E-04 failures per hour would require approximately 46,000 hours, or over 5 years, of steady-state testing, i.e. testing where the underlying configuration does not change. Thus empirical determination of software reliability is usually infeasible for most railway systems, where the software integrity requirement is for a wrong-side failure no more frequently than about 3.16E-09 failures per hour.
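
The figure quoted from [16] can be reproduced from the standard zero-failure demonstration model. Assuming failures arrive at a constant rate λ, the probability that T hours of testing sees no failure is exp(-λT), so claiming λ is at or below λ0 with confidence C requires T = -ln(1 - C)/λ0 failure-free hours (the model, not the paper's own method, is the assumption here):

```python
import math

def zero_failure_hours(target_rate: float, confidence: float) -> float:
    """Failure-free test hours needed to claim the failure rate is at or
    below target_rate, with the stated confidence (exponential model)."""
    return -math.log(1.0 - confidence) / target_rate

t = zero_failure_hours(1e-4, 0.99)
print(f"1E-04/h at 99%: {t:,.0f} h = {t / 8760:.1f} years")  # ~46,052 h, 5.3 y

t = zero_failure_hours(3.16e-9, 0.99)
print(f"3.16E-09/h at 99%: {t / 8760:,.0f} years of testing")  # ~166,000 years
```

The second figure shows why the wrong-side failure targets for vital railway systems are simply not demonstrable by test, whatever the budget.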

6 APPROACHES TO MANAGING THE PROBLEM OF SOFTWARE RELIABILITY ON MODERN RAILWAYS

6.1 Where Reliability Problems Cause Safety Problems

Where software failure can cause safety problems, there is now well-established good practice for safety-related and safety-critical software development, and this is generally applied to all new software development for safety-involved railway systems. The CENELEC standards EN50126/8/9 ([18], [19] and [20]), which are in turn based on IEC/AS 61508 [21], cover dependability of railway signalling systems and, depending on the safety risk associated with a particular system, mandate certain practices to be applied during the development of software for those systems (EN50128), and mandate the development of a safety case to demonstrate that the risk has been managed to an acceptable level. The process works broadly as follows:

1. The hazards associated with the operation, maintenance, installation or disposal of a proposed system are identified. Such an analysis covers not only hazards arising from physical properties of the system's equipment, such as sharp edges or the possibility of electric shock, but also the effects of failure of the system's functions (e.g. failure to stop a train prior to it reaching the limit of its movement authority).

2. The risk associated with each hazard is then assessed, having regard to the severity of the worst credible outcome that may result from the hazard, coupled with the anticipated frequency of occurrence of that outcome, which in turn has regard to the likelihood of hazard occurrence and to the likelihood that any barriers to escalation will fail. Sometimes the risk is assessed for each hazard individually, but more and more frequently an aggregate quantitative risk model is developed, using a technique such as fault tree analysis and/or event tree analysis, to quantify the overall risk presented by introduction of the system.

3. A judgement about the tolerability of risk is made. The legal requirement in Australia is to show that risk has been reduced so far as is reasonably practicable (SFAIRP), which generally entails implementing all reasonably practicable options that would deliver tangible risk improvement. In determining what is reasonably practicable, regard may be had to:
   a. What is current good practice in the industry;
   b. Whether the proposed system improves or diminishes overall safety (obviously, a reduction in safety won't be permitted, but if an improvement can be demonstrated, it may well be permitted as an interim measure, provided one can show that one is on track to ultimately achieving SFAIRP status);
   c. The cost of the improvement, compared with the safety benefit to be gained.


4. There are many strategies that can be used to reduce risk, and most safety textbooks [22] describe a hierarchy of hazard control, beginning with elimination of the hazard altogether, if this is possible, and ending with procedural controls and the use of measures such as personal protective equipment. For modern computer-based systems, however, at some point the options analysis will usually need to consider the tolerable hazard rate (THR) of the system being introduced: by decreasing the rate at which the system can be expected to fail dangerously, one decreases overall risk. Depending on the THR that is deemed appropriate, a Safety Integrity Level (SIL) is assigned to the system, and for software components this in turn implies development of the software in accordance with a tailorable set of development processes. The EN50128 standard recognises five SILs, from 0 through to 4, with requirements on the rigour of the software development process increasing with the SIL.

5. The supplier typically then proceeds to implement the system and, at the conclusion of the development process and prior to acceptance into revenue operation, must prepare a Safety Case to show that the system has achieved its safety requirements and is therefore safe to operate, maintain, install, commission, dispose of, etc.

6. Usually, the supplier's safety case will be subject to independent safety assessment (ISA) by a reputable third party, who will assess the logical soundness and technical credibility of the claims made in the safety case, and either endorse the findings or express reservations.

7. Sometimes in parallel with, and sometimes as part of, the above processes, most rail authorities also operate type approval processes involving technical reviews of products proposed for use on a railway, including software-based products.

In noting the above as current good practice, it is important to remember that to date there have been no empirical studies demonstrating that building software in accordance with the CENELEC processes for (say) SIL 2 actually delivers software whose in-service failure rate is within the range 1E-07 to 1E-06 failures per hour (the tolerable hazard rate range for SIL 2). Indeed, because of the fundamental difficulties in measuring in-service software reliability (as discussed in the previous section), it may be some time before any such evidence (or counter-evidence) becomes available. In the meantime, however, this approach reflects the currently accepted approach, and since current good practice is one way to demonstrate SFAIRP, suppliers and procurers are well-advised to adopt it.
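
For readers unfamiliar with the allocation, the commonly quoted THR bands behind this process can be sketched as follows. The band edges follow the usual IEC 61508/EN 50129 table for continuous operation, but the standards themselves remain the normative source:

```python
# Commonly quoted THR-to-SIL allocation bands (failures per hour of
# continuous operation); consult EN 50129 / IEC 61508 for normative values.
SIL_BANDS = [
    (1e-9, 1e-8, 4),   # 1E-09 <= THR < 1E-08  ->  SIL 4
    (1e-8, 1e-7, 3),
    (1e-7, 1e-6, 2),   # the SIL 2 band referred to in the text above
    (1e-6, 1e-5, 1),
]

def sil_for_thr(thr: float) -> int:
    """Return the SIL whose tolerable hazard rate band contains thr."""
    for low, high, sil in SIL_BANDS:
        if low <= thr < high:
            return sil
    raise ValueError("THR outside the tabulated bands")

assert sil_for_thr(5.0e-7) == 2    # mid-band example
assert sil_for_thr(3.16e-9) == 4   # the vital target quoted in Section 5
```

The gap identified above is then easy to state: the THR selects a development process, but nothing about the resulting software's failure rate can be confirmed empirically against the band.
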
6.2 Where Software Reliability Impacts Operational Availability, Not Safety

The above process, while now common in Australia for new system developments, is typically only used for safety-related or safety-critical software. However, the author has had personal experience with railway development contracts in Australia and Southeast Asia where software for an operationally critical system, though not directly safety-critical, was nonetheless required to be developed in accordance with a particular software safety integrity level. In these cases, the contract principal acknowledged using the SIL mechanism as a way to drive a base level of software quality, and hence reliability.

Suppliers should also be aware that EN50128 lays out some basic expectations for SIL 0 software, and some railway authorities are now insisting that all software, whether involved in a safety function or not, be shown to at least meet the minimum standards required of SIL 0 (which demands, as a minimum, specification and design documentation, test plans and records of their execution, along with basic configuration management and requirements management). The author is aware of several cases where such an approach has been used for software on board trains, in control centres, or used to control (say) traction power supply, but is not aware of any case where a SIL approach has been required for strictly front-of-house services such as revenue collection, online journey planners, or the like.

It is quite standard these days to contractually impose a quantitative availability or reliability requirement on a system; however, in demonstrating achievement of such a target, the analysis is usually limited to the failure behaviour of the hardware components. If software is mentioned at all, it is rare that anything more than a qualitative good-practice argument needs to be made for it. Where contracts specify an availability run as part of the system acceptance process, the system as a whole, including the software, is operated under conditions approaching or simulating normal operation. In such tests, critical software failures usually have the effect of resetting the clock, requiring the test to be restarted. An availability run in the order of (say) a month, while insufficient to allow any meaningful empirical conclusions about software reliability to be drawn, is nevertheless prudent as a way of establishing a base level of software reliability for operationally critical systems.
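
A rough calculation, borrowing the constant-rate model of Section 5 (an assumption, since software does not in fact fail randomly), shows how coarse a one-month run is as reliability evidence:

```python
import math

def p_clean_run(failure_rate: float, run_hours: float) -> float:
    """P(an availability run of run_hours sees no critical software failure),
    under an assumed constant failure rate."""
    return math.exp(-failure_rate * run_hours)

one_month = 30 * 24   # 720 hours of continuous operation

# Software failing critically once per 3,000 hours on average still has
# roughly a four-in-five chance of passing the month untouched:
print(f"{p_clean_run(1 / 3000, one_month):.2f}")   # ~0.79
```

A clean month therefore establishes only a weak floor on software reliability, consistent with the text's assessment of the availability run as prudent rather than conclusive.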

7 CONCLUSION

This paper has examined the issue of software reliability in the rail environment, and concluded that it presents an increasing threat to keeping the trains moving. We considered the record of software performance in society over the years, and cited several examples where software failures have had a massive impact on the public, and occasionally resulted in loss of life. We also considered the extent to which software is used on a modern railway, and determined that the exposure of the modern railway to service outages or safety problems arising from software failure is very high.

We then considered the question of why software apparently seems so hard to get right, and showed that software engineering differs from other engineering disciplines in some fundamental respects. We also noted that, particularly in Australia (as opposed to the UK and Europe), software engineering as a profession is still in its infancy and remains relatively unregulated. We also considered the technical difficulties associated with credibly predicting software reliability, and concluded that thus far there is no scientifically supportable way to predict with any credibility that software will be sufficiently reliable to support safety-critical (vital) operations.

Finally, we considered the current approach within the Australian rail industry to tackling the problem of software reliability. We showed that where software reliability impacts safety, the SIL approach reflected in standards such as CENELEC EN50126/8/9 and AS 61508 is now widely adopted for new developments. This approach is occasionally applied to software which is operationally critical but not safety-critical; however, its use in this way is not yet widespread.

8 REFERENCES

[1] Article about Bill Gates seeing the blue screen of death during a public demonstration of Windows software: http://www.techrepublic.com/blog/geekend/video-bill-gates-meet-the-blue-screen-of-death/623 (accessed 26-Sep-2012).

[2] Article about the Queensland Health payroll disaster: http://delimiter.com.au/2012/06/07/abomination-qld-health-payroll-needs-837m-more/

[3] Article about the Virgin Blue check-in software outage: http://www.computerworld.com.au/article/362175/virgin_blue_system_crash_causes_chaos/

[4] Kuhn, R., "Sources of Failure in the Public Switched Telephone Network", IEEE Computer, Vol. 30, No. 4, April 1997.

[5] Leveson, N., "Medical Devices: The Therac-25", available from http://sunnyday.mit.edu/papers/therac.pdf (accessed 26-Sep-2012).

[6] Article about problems with the Multidata software program that computed radiation dosage levels: http://www.thepanamanews.com/pn/v_10/issue_01/science_01.html (accessed 26-Sep-2012).

[7] Ariane 5 article pointing to many articles about Flight 501: http://en.wikipedia.org/wiki/Ariane_5 (accessed 26-Sep-2012).

[8] http://en.wikipedia.org/wiki/Octopus_card#EPS_add-value_glitch (accessed 26-Sep-2012).

[9] http://www.brisbanetimes.com.au/queensland/go-card-fare-evasion-loophole-revealed-20100301-pdfc.html (accessed 26-Sep-2012).

[10] Naur, P. and Randell, B. (eds), Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, 7-11 October 1968, Scientific Affairs Division, NATO.

[11] Article by Scott Rosenberg: http://www.cioinsight.com/c/a/Expert-Voices/Scott-Rosenberg-What-Makes-Software-So-Hard/

[12] Rosenberg, S., Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software, Crown, 2007.

[13] Wiegers, K., Creating a Software Engineering Culture, Dorset House, August 1996.

[14] Software Engineering Body of Knowledge (SWEBOK): http://www.computer.org/portal/web/swebok (accessed 26-Sep-2012).

[15] Butler, R.W. and Finelli, G.B., "The infeasibility of quantifying the reliability of life-critical real-time software", IEEE Transactions on Software Engineering, Vol. 19, No. 1, Jan. 1993, pp. 3-12.

[16] Littlewood, B., "The problems of assessing software reliability when you really need to rely on it", Centre for Software Reliability, City University, London, UK, 2000: http://www.csr.city.ac.uk/people/bev.littlewood/bl_public_papers/SCSS_2000/SCSS_2000.pdf

[17] Musa, J., Software Reliability Engineering, McGraw-Hill, 1998.

[18] CENELEC EN 50126, Railway Applications: The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS), 15 December 1999.

[19] CENELEC EN 50128, Railway Applications: Communications, Signalling and Processing Systems, Software for Railway Control and Protection Systems, March 2001.

[20] CENELEC EN 50129, Railway Applications: Communication, Signalling and Processing Systems, Safety Related Electronic Systems for Signalling, February 2003.

[21] AS 61508, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES).

[22] Leveson, N., Safeware: System Safety and Computers, Addison-Wesley, 1995.



AUTHOR

Dr Alena Griffiths is a Chartered Engineer with over seventeen years' experience in the field of systems assurance engineering and management in the Transport, Defence and Research sectors. She is a specialist in the area of system safety engineering (incorporating reliability, availability and maintainability as well as safety) and in verification and validation (V&V), and has also worked extensively on software development programs for SIL 2 and SIL 4 railway applications. She has successfully managed the safety and V&V programs for several large rail control system projects, from inception to delivery of the final system safety report (safety case), and has been involved with many projects of varying value. She has a detailed working knowledge of the CENELEC standards EN50126, EN50128 and EN50129, which are based on IEC 61508, and has also produced various work items to comply with other safety standards and guidelines, including but not limited to Def(Aust) 5679, IEC 61508, the UK railways' "Yellow Book", and MIL-STD-882C/D/E. She has published widely in the areas of systems assurance and high-integrity software engineering and is considered an expert in the field. She has also been retained as Independent Safety Assessor on several occasions, and has acted as an Expert Witness in legal proceedings concerning software reliability.

