
More Reliable Software Faster and Cheaper:

An Overview of Software Reliability Engineering



John D. Musa
Software Reliability Engineering and Testing Courses
j.musa@ieee.org



Keywords: software reliability engineering, operational profile, reliability objective, failure intensity, failure intensity objective, time to market, development cost


Abstract: Arguably the most important software development problem is building software that meets customer demands: that it be more reliable, built faster, and built cheaper (generally in that order of importance). Your success in meeting these demands affects the market share and profitability of a product for your company, and hence your career. Because the demands conflict, they create risk and overwhelming pressure, and hence a strong need for a practice that can help you deal with them.

Software reliability engineering (SRE) is such a practice, one that is a standard, proven, widespread best practice that is widely applicable. It is low in cost, and its implementation has virtually no schedule impact. We will show what it is, and how it works. We will then outline the SRE process to give you a feel for the practice, using a single consistent example throughout. Finally, we will list some resources that will help you learn more about it.
1. Software Reliability Engineering

SRE differs from other approaches by being primarily quantitative. In applying SRE, you add it to and integrate it with other good processes and practices; you do not replace them. With SRE you control the development process; it does not control you. The development process is not externally imposed. You use quantitative information to choose the most cost-effective software reliability strategies for your situation.

1.1. What It Is And Why It Works

Let’s look in a little more depth now at just what SRE is. SRE is a practice for quantitatively planning and guiding software development and test, with emphasis on reliability and availability. It is a practice that is backed with science and technology (Musa, Iannino, and Okumoto 1987). But we will describe how it works in business-oriented terms.

SRE works by quantitatively characterizing and applying two things about the product: the expected relative use of its functions and its required major quality characteristics. The major quality characteristics are reliability, availability, delivery date, and life-cycle cost. In applying SRE, you can vary the relative emphasis you place on these factors.

When you have characterized use, you can substantially increase development efficiency by focusing resources on functions in proportion to use and criticality. You also maximize test effectiveness by making test highly representative of use in the field. Increased efficiency increases the effective resource pool available to add customer value, as shown in Figure 1. For a detailed discussion of ways in which use data can increase development efficiency, see Musa 2004.

Figure 1. Increased resource pool resulting from increased development efficiency.

When you have determined the precise balance of major quality characteristics that meets user needs, you can spend your increased resource pool to carefully match them. You choose software reliability strategies to meet the objectives, based on data collected from previous projects. You also track reliability in system test against its objective to adjust your test process and to determine when test may be terminated. The result is greater efficiency in converting resources to customer value, as shown in Figure 2.

We have set delivery times and budgeted software costs for software-based systems for some time. It is only relatively recently that SRE, the technology for setting and tracking reliability and availability objectives for software, has developed (Musa, Iannino, and Okumoto 1987).

1.2. A Proven, Standard, Widespread Best Practice

Software reliability engineering is a proven, standard, widespread best practice. As one example of the proven benefit of SRE, AT&T applied SRE to two different releases of a switching system, the International Definity PBX. Customer-reported problems decreased by a factor of 10, the system test interval decreased by a factor of 2, and total development time decreased 30 percent. No serious service outages occurred in 2 years of deployment of thousands of systems in the field (Lyu 1996).


Figure 2. Increased customer value resulting from increased resource pool and better match to major quality characteristics needed by users.

SRE has been an AT&T Best Current Practice since May 1991 (Lyu 1996). To become a Best Current Practice, a practice must have substantial application (usually at least 8 to 10 projects) and this application must show a strong, documented benefit-to-cost ratio. For SRE, this ratio was 12 or higher for all projects. The practice undergoes a probing review by two boards, at the third and fourth levels of management. More than 70 project managers or their representatives reviewed the SRE proposal. There were more than 100 questions and issues requiring resolution, a process that took several months. In 1991, SRE was one of five practices that were approved, out of 30 that were proposed.

SRE is also a standard practice. McGraw-Hill published an SRE handbook in 1996 (Lyu 1996). SRE has been a standard of the American Institute of Aeronautics and Astronautics since 1993, and IEEE standards are currently under development.

SRE is a widespread practice. There have been almost 70 published articles by users of SRE, and the number continues to grow (Musa 2004, 2002). Since practitioners do not generally publish very frequently, the actual number of users is probably many times the above number.

Users include Alcatel, AT&T, Bellcore, CNES (France), ENEA (Italy), Ericsson Telecom, Hewlett Packard, Hitachi, IBM, NASA’s Jet Propulsion Laboratory, Lockheed-Martin, Lucent Technologies, Microsoft, Mitre, Nortel, Saab Military Aircraft, Tandem Computers, the U.S. Air Force, and the U.S. Marine Corps.

Tierney (1997) reported the results of a late 1997 survey that showed that Microsoft had applied software reliability engineering in 50 percent of its software development groups, including projects such as Windows and Word. The benefits they observed were increased test coverage, improved estimates of amount of test required, useful metrics that helped them establish ship criteria, and improved specification reviews.

SRE is widely applicable. From a technical viewpoint, you can apply SRE to any software-based product, starting at the beginning of any release cycle. From an economic viewpoint, you can also apply SRE to any software-based product, except for very small components, perhaps those involving a total effort of less than two staff months. However, if such a small component is used for several projects, it probably will be feasible to use SRE. If not, it still may be worthwhile to implement SRE in abbreviated form.

SRE is independent of development technology and platform. It requires no changes in architecture, design, or code, but it may suggest changes that would be useful. It can be deployed in one step or in stages.

SRE is very customer-oriented: it involves frequent direct close interaction with customers. This enhances a supplier’s image and improves customer satisfaction, greatly reducing the risk of angry customers. Developers who have applied SRE have described it with adjectives such as “unique, powerful, thorough, methodical, and focused.” It is highly correlated with attaining Levels 4 and 5 of the Software Engineering Institute Capability Maturity Model.

Despite the word “software,” software reliability engineering deals with the entire product, although it focuses on the software part. It takes a full-life-cycle, proactive view, as it is dependent on activities throughout the life cycle. It involves system engineers, system architects, developers, users (or their representatives, such as field support engineers and marketing personnel), and managers in a collaborative relationship.

The cost of implementing SRE is small. There is an investment cost of not more than 3 equivalent staff days per person in an organization, which includes a 2-day course for everyone and planning with a much smaller number. The operating cost over the project life cycle typically varies from 0.1 to 3 percent of total project cost, as shown in Table 1. The largest cost component is the cost of developing the operational profile.

Table 1. Operating cost of SRE

The schedule impact of SRE is minimal. Most SRE activities involve only a small effort that can parallel other software development work. The only significant critical path activity is 2 days of training.

2. SRE Process and Fone Follower Example

Let’s now take a look at the SRE process. There are six principal activities, as shown in Figure 3. We show the software development process below and in parallel with the SRE process, so you can relate the activities of one to those of the other. Both processes follow spiral models, but we don’t show the feedback paths for simplicity. In the field, we collect certain data and use it to improve the SRE process for succeeding releases.

The Define the Product, Implement Operational Profiles, Define “Just Right” Reliability, and Prepare for Test activities all start during the Requirements and Architecture phases of the software development process. They all extend to varying degrees into the Design and Implementation phase, as they can be affected by it. The Execute Test and Guide Test activities coincide with the Test phase.

Before we proceed further, let’s define some of the terms we will be using. Reliability is the probability that a system or a capability of a system functions without failure for a specified period in a specified environment. The period may be specified in natural or time units.
The concept of natural units is relatively new to reliability, and it appears to have originated in the software sphere. A natural unit is a unit other than time that is related to the amount of processing performed by a software-based product, such as pages of output, transactions, telephone calls, jobs, semiconductor wafers, queries, or application program interface calls.

Availability is the average (over time) probability that a system or a capability of a system is currently functional in a specified environment. If you are given an average down time per failure, availability implies a certain reliability.

Failure intensity, used particularly in the field of software reliability engineering, is simply the number of failures per natural or time unit. It is an alternative way of expressing reliability.
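The connection between availability and failure intensity noted above can be made concrete. A common steady-state approximation (an assumption in this sketch, not a formula stated in this article) is that availability equals uptime divided by total time, which reduces to 1 / (1 + failure intensity × mean down time per failure):

```python
def availability(failure_intensity, mean_downtime):
    """Steady-state availability implied by a failure intensity
    (failures per hour) and an average down time per failure (hours):
    uptime fraction = 1 / (1 + failure_intensity * mean_downtime)."""
    return 1.0 / (1.0 + failure_intensity * mean_downtime)

# 0.002 failures per hour with half an hour of down time per failure
print(round(availability(0.002, 0.5), 6))  # 0.999001
```

This is why, given an average down time per failure, an availability requirement implies a particular failure intensity (and hence reliability) requirement.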
Some people speak of software products, but this is really incorrect, because pure software cannot function. You really have “software-based products.” In discussing SRE, we should always be thinking of total systems that also contain hardware and often human components.

Note that we deliberately define software reliability in the same way as hardware reliability. This is so that we can determine system reliability from hardware and software component reliabilities, even though the mechanisms of failure are different (Musa, Iannino, and Okumoto 1987).

We will illustrate the SRE process with Fone Follower, an example adapted from an actual project at AT&T. We have changed the name and certain details to keep the explanation simple and protect proprietary data. Subscribers to Fone Follower call and enter, as a function of time, the phone numbers to which they want to forward their calls. Fone Follower forwards a subscriber’s incoming calls (voice or fax) from the network according to the program the subscriber entered. Incomplete voice calls go to the subscriber’s pager (if the subscriber has one) and then, if unanswered, to voice mail. If the subscriber does not have a pager, incomplete voice calls go directly to voice mail.

Figure 3. SRE Process

2.1. Define the Product

The first activity is to define the product. You must establish who the supplier is and who the customers and users are, which can be a nontrivial enterprise in these days of outsourcing and complex inter- and intracompany relationships. Then you list all the systems associated with the product that for various reasons must be tested independently. These are generally of two types:

1. base product and variations
2. supersystems

Variations are versions of the base product that you design for different environments. For example, you may design a product for both Windows and Macintosh platforms. Supersystems are combinations of the base product or variations with other systems, where customers view the reliability or availability of the base product or variation as that of the combination.

2.2. Implement Operational Profiles

This section deals with quantifying how software is used. To fully understand it, we need to first consider what operations and operational profiles are.

An operation is a major system logical task, which returns control to the system when complete. Some illustrations from Fone Follower are Phone number entry, Process fax call, and Audit a section of the phone number data base. An operational profile is a complete set of operations with their probabilities of occurrence. Table 2 shows an illustration of an operational profile from Fone Follower.

Table 2. Fone Follower Operational Profile

There are five principal steps in developing an operational profile:

1. Identify the operation initiators

2. List the operations invoked by each initiator

3. Review the operations list to ensure that the operations have certain desirable characteristics and form a set that is complete with high probability

4. Determine the occurrence rates

5. Determine the occurrence probabilities by dividing the occurrence rates by the total occurrence rate
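Steps 4 and 5 above amount to a simple normalization. The sketch below uses operation names in the style of the Fone Follower example, but the occurrence rates are invented for illustration, not project data:

```python
# Step 4: occurrence rates for each operation (here, events per hour).
# The operation names echo the article's example; the numbers are
# hypothetical.
occurrence_rates = {
    "Process voice call": 74000,
    "Process fax call": 17000,
    "Phone number entry": 8000,
    "Audit a section of the phone number data base": 1000,
}

# Step 5: divide each occurrence rate by the total occurrence rate
# to obtain occurrence probabilities.
total_rate = sum(occurrence_rates.values())
operational_profile = {op: rate / total_rate
                       for op, rate in occurrence_rates.items()}

for op, p in operational_profile.items():
    print(f"{p:.2f}  {op}")
```

The resulting probabilities necessarily sum to one, which is what makes the operational profile a complete set of operations with their probabilities of occurrence.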

There are three principal kinds of initiators: user types, external systems, and the system itself. You can determine user types by considering customer types. For Fone Follower, one of the user types is subscribers and the principal external system is the telephone network. Among other operations, subscribers initiate Phone number entry and the telephone network initiates Process fax call. Fone Follower itself initiates Audit a section of the phone number data base.

When implementing SRE for the first time, some software practitioners are initially concerned about possible difficulties in determining occurrence rates. Experience indicates that this is usually not a difficult problem. Software practitioners are often not aware of all the use data that exists, as it is typically in the business side of the house. Occurrence rate data is often available or can be derived from a previous release or similar system. New products are not usually approved for development unless a business case study has been made, and this must typically estimate occurrence rates for the use of various functions to demonstrate profitability. One can collect data from the field, and if all else fails, one can usually make reasonable estimates of expected occurrence rates. In any case, even if there are errors in estimating occurrence rates, the advantage of having an operational profile far outweighs not having one at all.

Once you have developed the operational profile, you can employ it, along with criticality information, to:
1. Review the functionality to be implemented for operations that are not likely to be worth their cost and remove them or handle them in other ways (Reduced Operation Software or ROS)

2. Suggest operations where looking for opportunities for reuse will be most cost-effective

3. Plan a more competitive release strategy using operational development. With operational development, development proceeds operation by operation, ordered by the operational profile. This makes it possible to deliver the most used, most critical capabilities to customers earlier than scheduled because the less used, less critical capabilities are delivered later.

4. Allocate development resources among operations for system engineering, architectural design, requirements reviews, and design to cut schedules and costs

5. Allocate development resources among modules for code, code reviews, and unit test to cut schedules and costs
6. Allocate new test cases of a release among the new operations of the base product and its variations

7. Allocate test time

2.3. Define “Just Right” Reliability

To define the “just right” level of reliability for a product, you must first define what “failure” means for the product. We will define a failure as any departure of system behavior in execution from user needs. You have to interpret exactly what this means for your product. The definition must be consistent over the life of the product, and you should clarify it with examples. A failure is not the same thing as a fault; a fault is a defect in system implementation that causes the failure when executed. Beware, as there are many situations where the two have been confused in the literature.

The second step in defining the “just right” level of reliability is to choose a common measure for all failure intensities, either failures per some natural unit or failures per hour.

Then you set the total system failure intensity objective (FIO) for each associated system. To determine an objective, you should analyze the needs and expectations of users.

For each system you are developing, you must compute a developed software FIO. You do this by subtracting the total of the expected failure intensities of all hardware and acquired software components from the system FIO. You will use the developed software FIOs to track the reliability growth during system test of all the systems you are developing, using the failure intensity to failure intensity objective (FI/FIO) ratios.
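The subtraction described above is straightforward arithmetic; the only care needed is that all failure intensities be in the same units. The numbers below are illustrative assumptions, not figures from the article:

```python
def developed_software_fio(system_fio, component_fis):
    """Developed-software FIO: the system FIO minus the total expected
    failure intensities of hardware and acquired software components.
    All values must be in the same units (e.g. failures per Mcalls)."""
    fio = system_fio - sum(component_fis)
    if fio <= 0:
        raise ValueError("components alone already exceed the system FIO")
    return fio

# Hypothetical budget (failures per million calls): system FIO of 100,
# hardware contributing 2, an acquired operating system contributing 8.
print(developed_software_fio(100, [2, 8]))  # 90
```

The developed software must then be engineered and tested to a failure intensity no worse than this remainder.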

You will also apply the developed software FIOs in choosing the mix of software reliability strategies that meet these and the schedule and product cost objectives with the lowest development cost. These include strategies that are simply selected or not (requirements reviews, design reviews, and code reviews) and strategies that are selected and controlled (amount of system test, amount of fault tolerance). SRE provides guidelines and some quantitative information for the determination of this mix. However, projects can improve the process by collecting information that is particular to their environment.


2.4. Prepare For Test

The Prepare for Test activity uses the operational profiles you have developed to prepare test cases and test procedures. You allocate test cases in accordance with the operational profile. For example, for the Fone Follower base product there were 500 test cases to allocate. The Process fax call operation received 17 percent of them, or 85.
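The proportional allocation described above can be sketched as follows. The 500 test cases and the 17 percent share for Process fax call come from the article; the rounding scheme (leftover cases to the largest fractional remainders) is this sketch's own choice:

```python
def allocate_test_cases(total, profile):
    """Allocate a pool of new test cases among operations in
    proportion to the operational profile.  Whole cases only:
    leftover cases go to the largest fractional remainders."""
    raw = {op: total * p for op, p in profile.items()}
    alloc = {op: int(share) for op, share in raw.items()}
    leftover = total - sum(alloc.values())
    for op in sorted(raw, key=lambda o: raw[o] - alloc[o],
                     reverse=True)[:leftover]:
        alloc[op] += 1
    return alloc

alloc = allocate_test_cases(500, {"Process fax call": 0.17,
                                  "All other operations": 0.83})
print(alloc["Process fax call"])  # 85
```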

After you assign test cases to operations, you specify the test cases within the operations by selecting from all the possible intraoperation choices with equal probability. The selections are usually among different sets of values of input variables associated with the operations, sets that cause different processing to occur. These sets are called equivalence classes. For example, one of the input variables for the Process fax call operation was the Forwardee (number to which the call was forwarded) and one of the equivalence classes of this input variable was Local calling area. You then select a specific value within the equivalence class so that you define a specific test case.
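Selecting among intraoperation choices with equal probability can be sketched as below. The Forwardee variable and its Local calling area equivalence class come from the article; the other classes and values are hypothetical, invented to round out the illustration:

```python
import random

# Hypothetical equivalence classes for the Process fax call operation.
equivalence_classes = {
    "Forwardee": ["Local calling area", "Long distance", "International"],
    "Originator type": ["Fax machine", "Fax server"],
}

def specify_test_case(classes, rng):
    """Select among the intraoperation choices (equivalence classes)
    with equal probability, one choice per input variable."""
    return {var: rng.choice(options) for var, options in classes.items()}

print(specify_test_case(equivalence_classes, random.Random(7)))
```

In practice you would then pick a specific value within each selected equivalence class (say, a particular local number) to make the test case fully concrete.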

The test procedure is the controller that invokes test cases during execution. It uses the operational profile, modified to account for critical operations and for reused operations from previous releases.

2.5. Execute Test

In the Execute Test activity, you will first allocate test time among the associated systems and types of test (feature, load, and regression).

Invoke feature tests first. Feature tests execute all the new test cases of a release independently of each other, with interactions and effects of the field environment minimized (sometimes by reinitializing the system). Follow these by load tests, which execute test cases simultaneously, with full interactions and all the effects of the field environment. Here you invoke the test cases at random times, choosing operations randomly in accord with the operational profile. Invoke a regression test after each build involving significant change. A regression test executes some or all feature tests; it is designed to reveal failures caused by faults introduced by program changes.
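Choosing operations randomly in accordance with the operational profile, as load test requires, is a weighted random draw. The profile below is illustrative, not project data:

```python
import random

# Illustrative operational profile in the style of the article's example.
profile = {
    "Process voice call": 0.74,
    "Process fax call": 0.17,
    "Phone number entry": 0.08,
    "Audit a section of the phone number data base": 0.01,
}

def next_operation(rng):
    """Pick the next operation to invoke at random, weighted by the
    operational profile."""
    ops = list(profile)
    return rng.choices(ops, weights=[profile[op] for op in ops], k=1)[0]

rng = random.Random(1)
draws = [next_operation(rng) for _ in range(10000)]
print(draws.count("Process voice call") / 10000)  # roughly 0.74
```

Over a long load test run, each operation's share of invocations converges to its occurrence probability, which is what makes the test representative of field use.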

Identify failures, along with when they occur. The “when” can be with respect to natural units or time. This information will be used in Guide Test.

2.6. Guide Test

The last activity involves guiding the product’s system test phase and release. For software that you develop, track reliability growth as you attempt to remove faults. Then certify the supersystems, which simply involves accepting or rejecting the software in question. Also use certification test for any software that you expect customers will acceptance test.

For certification test you first normalize failure data by multiplying by the failure intensity objective. Plot each new failure as it occurs on a reliability demonstration chart, as shown in Figure 4 (the unit “Mcalls” in the figure is millions of calls). Note that the first two failures fall in the Continue region. This means that there is not enough data to reach an accept or reject decision. The third failure falls in the Accept region, which indicates that you can accept the software, subject to the levels of risk associated with the chart you are using. If these levels of risk are unacceptable, you construct another chart with the levels you desire (Musa 2004) and replot the data.

Figure 4. Reliability Demonstration Chart Applied to Fone Follower
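The boundaries of a reliability demonstration chart come from sequential sampling theory. The sketch below is a generic sequential probability ratio test formulation, not code from the article; the default discrimination ratio (2) and supplier/consumer risk levels (0.1 each) are assumptions chosen to match commonly published charts, and `tau` is the failure's occurrence point (in natural or time units) multiplied by the FIO:

```python
import math

def chart_region(n, tau, gamma=2.0, alpha=0.1, beta=0.1):
    """Classify the n-th failure, occurring at normalized measure tau,
    against sequential probability ratio test boundaries.  gamma is
    the discrimination ratio; alpha and beta are the supplier and
    consumer risk levels."""
    reject_bound = math.log((1 - beta) / alpha)
    accept_bound = math.log(beta / (1 - alpha))
    # log likelihood ratio for a Poisson failure process
    llr = n * math.log(gamma) - (gamma - 1) * tau
    if llr >= reject_bound:
        return "Reject"
    if llr <= accept_bound:
        return "Accept"
    return "Continue"

# Early failures: not enough data yet to decide
print(chart_region(1, 0.6), chart_region(2, 1.9))  # Continue Continue
# A third failure only after many more normalized units: accept
print(chart_region(3, 5.0))  # Accept
```

Tightening the risk levels or the discrimination ratio widens the Continue region, which is the quantitative counterpart of constructing a new chart when the default risks are unacceptable.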

To track reliability growth, input the failure data that you collect in Execute Test to a reliability estimation program such as CASRE (available through the SRE website; see Section 4 of this article). Normalize the data by multiplying by the failure intensity objective in the same units. Execute this program periodically and plot the FI/FIO ratio, as shown in Figure 5 for Fone Follower. If you observe a significant upward trend in this ratio, you should determine and correct the causes. The most common causes are system evolution, which may indicate poor change control, and changes in test selection probability with time, which may indicate a poor test process.

If you find you are close to your scheduled test completion date but have an FI/FIO ratio substantially greater than 0.5, you have three feasible options: defer some features or operations, rebalance your major quality characteristic objectives, or increase work hours for your organization. When the FI/FIO ratio reaches 0.5, you should consider release as long as essential documentation is complete and you have resolved outstanding high severity failures (you have removed the faults causing them).

Developers sometimes worry that systems with ultrareliable FIOs might require impractically long hours of test to certify the FIOs specified. But there are many ameliorating circumstances that make the problem more tractable than that for ultrareliable hardware (Musa 2004). First, in most cases only a few critical operations, not the entire system, must be ultrareliable. Second, software reliability relates to the execution time of the software, not the clock time for which the system operates, as hardware reliability does. Since the critical operations often occur only rarely, their execution time is frequently a small fraction of the clock time, so the FIO for the entire system need not be ultrareliable. Finally, since processing capacity is cheap and rapidly becoming cheaper, it is quite feasible to test at a rate that is hundreds of times real time by using parallel processors. Thus testing of ultrareliable software can be manageable.

2.7. Collect Field Data

The SRE process is not complete when you ship a product. We collect certain field data to use in succeeding releases and in other products. In many cases, we can collect the data easily and inexpensively by building recording and reporting routines into the product. In this situation, we collect data from all field sites. For data that requires manual collection, take a small random sample of field sites.

We collect data on failure intensity and on customer satisfaction with the major quality characteristics and use this information in setting the failure intensity objective for the next release. We also measure operational profiles in the field and use this information to correct the operational profiles we estimated. Finally, we collect information that will let us refine the process of choosing reliability strategies in future projects.

Figure 5. Plot of FI/FIO Ratio for Fone Follower

3. Conclusion

If you apply SRE in all the software-based products you develop, you will be controlling the process rather than it controlling you. You will find that you can be confident of the reliability and availability of the products. At the same time, you will deliver them in minimum time and cost for those levels of reliability and availability. You will have maximized your efficiency in satisfying your customers’ needs. This is a vital skill to possess if you are to be competitive in today’s marketplace.

 

4. To Explore Further

Books

Musa, J. D. 2004. Software Reliability Engineering: More Reliable Software Faster and Cheaper – Second Edition. Detailed, extensive treatment of practice. Browse & order at http://members.aol.com/JohnDMusa/book.htm

Musa, J. D., A. Iannino, and K. Okumoto. 1987. Software Reliability: Measurement, Prediction, Application, ISBN 0-07-044093-X, McGraw-Hill, New York. Very thorough treatment of software reliability theory.


SRE Website: The essential guide to software reliability:

http://members.aol.com/JohnDMusa/

Courses

John D. Musa - conducts onsite and public classroom courses and also distance learning courses for practitioners, see SRE website

University of Maryland - has doctoral program, contact Professor Carol Smidts

Conferences

International Symposium on Software Reliability Engineering (ISSRE)


Professional organization

IEEE Computer Society Technical Committee on Software Reliability Engineering. Publishes newsletter, sponsors ISSRE annual international conference. Join through SRE Website above.


SRE Network

Communicate by email with hundreds of people interested in field. See SRE Website above.


Journals publishing in the field

IEEE Software
IEEE Transactions on Software Engineering
IEEE Transactions on Reliability

 

References

Lyu, M. (Editor). 1996. Handbook of Software Reliability Engineering, ISBN 0-07-039400-8, McGraw-Hill, New York.

Musa, J. D. 2004. Software Reliability Engineering: More Reliable Software Faster and Cheaper, ISBN: 1418493880 (hardcover), 1418493872 (paperback), AuthorHouse. Browse and order from http://members.aol.com/JohnDMusa/book.htm.

Musa, J. D. 2002 (updated regularly). More Reliable Software Faster and Cheaper (Software Reliability Engineering) website: http://members.aol.com/JohnDMusa/

Musa, J.D., A. Iannino, and K. Okumoto. 1987. Software Reliability: Measurement, Prediction, Application, ISBN 0-07-044093-X, McGraw-Hill, New York.

Tierney, J. 1997. SRE at Microsoft. Keynote speech at 8th International Symposium on Software Reliability Engineering, November 1997, Albuquerque, NM.

About the Author: John D. Musa is an independent senior consultant in software reliability engineering. He has more than 35 years experience as software practitioner and manager in a wide variety of development projects. He is one of the creators of the field of software reliability engineering and is widely recognized as the leader in reducing it to practice. He was formerly Technical Manager of Software Reliability Engineering (SRE) at AT&T Bell Laboratories, Murray Hill, NJ.

He has been involved in SRE since 1973. His many contributions include the two most widely-used models (one with K. Okumoto), the concept, practice, and application of the operational profile, and the integration of SRE into all phases of the software development cycle. Musa has published some 100 articles and papers, given more than 200 major presentations, and made a number of videos. He is principal author of the widely-acclaimed pioneering book Software Reliability: Measurement, Prediction, Application and author of the eminently practical books Software Reliability Engineering: More Reliable Software, Faster Development and Testing and Software Reliability Engineering: More Reliable Software Faster and Cheaper.

He organized and led the transfer of SRE into practice within AT&T, spearheading the effort that defined it as a “best current practice.” He was actively involved in research to advance the theory and practice of the field. Musa has been an international leader in its dissemination.

His leadership has been recognized by every edition of Who’s Who in America and American Men and Women of Science since 1990. He is an international leader in software engineering and a Fellow of the IEEE, cited for “contributions to software engineering, particularly software reliability.” He was recognized in 1992 as the individual that year who had contributed the most to testing technology. He was co-founder of the IEEE Committee on SRE. He has very extensive international experience as a lecturer and teacher. In 2004 the IEEE Reliability Society named him “Engineer of the Year.”

