# User:Wyatts/Draft article C

Reliability engineering is the discipline of ensuring that a system will be reliable when operated in a specified manner. It is performed throughout the entire life cycle of a system, including development, test, production and operation. Reliability engineers rely heavily on statistics, probability theory, and reliability theory. Many engineering techniques are used, such as reliability prediction, Weibull analysis, thermal management, reliability testing and accelerated life testing. Because of the large number of reliability techniques, their expense, and the varying degrees of reliability required for different situations, most projects develop a Reliability Program Plan to specify the reliability tasks that will be performed for that specific system.

The function of reliability engineering is to develop the reliability requirements for the system, establish an adequate reliability program, and perform appropriate reliability analyses and tasks to ensure the system will meet its requirements. These tasks are managed by a reliability engineer, who usually holds an accredited engineering degree and has additional reliability-specific education and training. Reliability engineering is closely associated with maintainability engineering and logistics engineering. This article provides an overview of some of the most common reliability engineering tasks. Please see the references for a more comprehensive treatment.

A Reliability Block Diagram

## Reliability

Main articles: reliability, reliability theory, failure rate.

Reliability theory is the foundation of reliability engineering. For engineering purposes, reliability is defined as:

the probability that a system will perform its intended function during a specified period of time under stated conditions.

Mathematically, this may be expressed as,

${\displaystyle R(t)=\int _{t}^{\infty }f(x)\,dx\ \!}$,
where ${\displaystyle f(x)\!}$ is the failure probability density function.
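
As a worked illustration, assume (this is an assumption made here for the example, not part of the definition) that failures follow an exponential distribution with a constant failure rate ${\displaystyle \lambda }$, so that ${\displaystyle f(x)=\lambda e^{-\lambda x}}$. The integral then evaluates to

${\displaystyle R(t)=\int _{t}^{\infty }\lambda e^{-\lambda x}\,dx=e^{-\lambda t}}$.

With the illustrative values ${\displaystyle \lambda =0.001}$ failures per hour and ${\displaystyle t=100}$ hours, this gives ${\displaystyle R(100)=e^{-0.1}\approx 0.905}$, that is, roughly a 90.5% chance of operating for 100 hours without failure. Other failure distributions give different forms for ${\displaystyle R(t)}$.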

There are four key elements in the definition of reliability. Reliability engineering is concerned with each of these elements of reliability.

• First, reliability is a probability. This means that there is always some chance for failure. Reliability engineering is concerned with meeting the specified probability of success, at a specified statistical confidence level.
• Second, reliability is predicated on the system performing its "intended function". Generally, this is taken to mean operation without failure. However, if no individual part of the system fails but the system as a whole still does not do what it was supposed to do, the shortfall is charged against the system reliability. The system requirements specification is the criterion against which reliability is measured. Reliability engineering ensures adequate reliability system testing and other assessments to verify compliance with the requirements.
• Third, reliability applies to a specified period of time. In practical terms, this means that a system has a specified chance that it will operate without failure before time ${\displaystyle t\!}$. Reliability engineering ensures that components and materials will meet the reliability requirements during the specified time.
• Fourth, reliability is restricted to operation under stated conditions. This constraint is necessary because it is impossible to design a system for unlimited conditions. A Mars Rover will have very different specified conditions from the family automobile. Reliability engineering ensures that the operating environment is adequately addressed during system design and test.

## Reliability Program Plan

There are myriad tasks, methods, and tools that can be used to help achieve the specified system reliability. Every system requires a different level of reliability. For example, a commercial airliner must operate under a wide range of conditions and the consequences of failure are grave, but it has a correspondingly higher budget. A pencil sharpener may be more reliable than an airliner, but it has a very different set of operational conditions, mild consequences of failure, and a correspondingly lower budget.

A reliability program plan is used to document exactly what tasks, methods, tools, analyses, and tests are required for a particular system. For complex systems, the reliability program plan is a separate document. For less complex systems, it may be combined with the systems engineering management plan. The reliability program plan is essential for a successful reliability program and is developed early during system development. It specifies not only what the reliability engineer does, but also tasks performed by other organizations and engineers that affect system reliability. The reliability program plan is approved by top program management.

## Reliability Requirements

For any system, one of the first tasks of reliability engineering is to adequately specify the reliability requirements. Reliability requirements address the reliability of the system itself, reliability test and assessment requirements, and associated reliability tasks and documentation for the system. Reliability requirements are included in the appropriate system/subsystem requirements specifications, test plans, and contract statements of work.

### System reliability parameters

System reliability requirements are specified using reliability parameters. The most common reliability parameter is the mean time between failures (MTBF), which can also be specified as the failure rate or the number of failures during a given period. These parameters are very useful for systems that are operated on a regular basis, such as most vehicles, machinery, and electronic equipment. Reliability increases as the MTBF increases. The MTBF is usually specified in hours, but can also be expressed in any other measure of usage, such as miles or cycles.

In other cases, reliability is specified as the probability of mission success. For example, reliability of a scheduled aircraft flight can be specified as a mission reliability. Mission reliability is expressed as a dimensionless probability or a percentage.
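
The relationship between these parameters can be sketched in a few lines of Python. This is a minimal illustration that assumes a constant failure rate (the exponential model), which the text above does not require; the function names and the numbers in the usage section are hypothetical.

```python
import math


def failure_rate_from_mtbf(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) implied by an MTBF."""
    return 1.0 / mtbf_hours


def expected_failures(mtbf_hours: float, period_hours: float) -> float:
    """Expected number of failures over a period at a constant failure rate."""
    return period_hours / mtbf_hours


def mission_reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of completing a mission without failure.

    Assumes exponentially distributed times between failures.
    """
    return math.exp(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1000.0  # hypothetical requirement, in hours
    print("failure rate per hour:", failure_rate_from_mtbf(mtbf))
    print("expected failures in 5000 h:", expected_failures(mtbf, 5000.0))
    print("reliability of a 100 h mission:", mission_reliability(mtbf, 100.0))
```

Under this assumption, the same requirement can be stated as an MTBF, a failure rate, an expected failure count, or a mission reliability, which is why specifications often treat them as interchangeable.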

A special case of mission success is the single-shot device or system. These are devices or systems that remain relatively dormant and only operate once. Examples include automobile airbags, thermal batteries and missiles. Single-shot reliability is specified as a probability of success or subsumed into a related parameter. For example, single-shot missile reliability may be incorporated into a requirement for the probability of hit.

In addition to system level requirements, reliability requirements may be specified by the customer for critical subsystems. In all cases, reliability parameters are specified with appropriate statistical confidence intervals.

### Reliability test requirements

Because reliability is a probability, even highly reliable systems have some chance of failure. However, testing reliability requirements is problematic for several reasons. A single test is insufficient to generate enough statistical data. Multiple tests or long-duration tests are usually very expensive. Some tests are simply impractical. Reliability engineering is used to design a realistic and affordable test program that provides sufficient evidence that the system meets its requirement.

Statistical confidence levels are often used to address some of these concerns. This is done by expressing a certain parameter along with a corresponding confidence level, for example, an MTBF of 1000 hours at a 90% confidence level. From this specification, the reliability engineer can design a suitable test with explicit criteria for the number of hours and number of failures until the requirement is met or failed. The combination of reliability parameter value and confidence level can greatly affect the development cost and the risk to both the customer and producer. Extreme care is needed to select the right combination of requirements.

Reliability testing may be performed at various levels, such as component, subsystem, and system. Many environmental factors must also be addressed during testing, such as extreme temperature and humidity, shock, and vibration. Reliability engineering is used to determine an effective test strategy to ensure all parts of the system are exercised in relevant environments. For systems that must last many years, reliability engineering may be used to design an accelerated life test. The reliability test strategy must be included in the requirements.
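
To make the idea of an MTBF requirement at a confidence level concrete, the following sketch computes how many total test hours a fixed-duration test would need. It assumes exponentially distributed failure times and a time-terminated test that passes if no more than a chosen number of failures occur; these assumptions, and the function names, are illustrative and are not taken from any particular standard.

```python
import math


def poisson_cdf(r: int, mean: float) -> float:
    """P(N <= r) for a Poisson-distributed failure count with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, r + 1):
        term *= mean / k
        total += term
    return total


def required_test_hours(mtbf_req: float, confidence: float,
                        allowed_failures: int = 0) -> float:
    """Total test hours needed to demonstrate `mtbf_req` at `confidence`.

    The test passes if at most `allowed_failures` failures are observed.
    The duration is chosen so that a system whose true MTBF is only
    `mtbf_req` would pass with probability 1 - confidence.
    """
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # Grow the upper bracket until the pass probability falls below target.
    while poisson_cdf(allowed_failures, hi) > target:
        hi *= 2.0
    # Bisection on the ratio (test hours / required MTBF).
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(allowed_failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return mtbf_req * hi


if __name__ == "__main__":
    for failures in (0, 1, 2):
        hours = required_test_hours(1000.0, 0.90, failures)
        print(f"allowing {failures} failure(s): {hours:.0f} test hours")
```

Allowing more failures lengthens the test but lowers the chance that a good system is rejected by bad luck, which is one way the producer and customer risks mentioned above are traded against each other.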

### Requirements for reliability tasks

Reliability engineering must also address requirements for various reliability tasks and documentation during system development, test, production, and operation. These requirements are generally specified in the contract statement of work and depend on how much leeway the customer wishes to provide to the contractor. Reliability tasks include various analyses, planning tasks, failure reporting systems, and similar tasks. Task selection depends on the criticality of the system as well as cost. For example, a critical system may require a formal failure reporting and review process throughout development, whereas a non-critical system may rely on final test reports. The most common reliability program tasks are documented in reliability program standards, such as MIL-STD-785 and IEEE 1332.

## Design for Reliability

It is axiomatic that reliability must be "designed in" to the system. During system design, the top-level reliability requirements are flowed down, or allocated, to subsystems and lower levels. Reliability design tasks are usually performed jointly by the design engineers and reliability engineers.
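
One simple way to picture allocation is equal apportionment, sketched below. It assumes the subsystems are independent and all must work for the system to work (a series arrangement), and it gives every subsystem the same target; real allocations usually weight subsystems by complexity or criticality. The function name and numbers are hypothetical.

```python
def equal_apportionment(system_reliability: float, n_subsystems: int) -> float:
    """Reliability each of n independent series subsystems must achieve
    so that their product equals the system requirement."""
    return system_reliability ** (1.0 / n_subsystems)


if __name__ == "__main__":
    # Hypothetical requirement: system reliability 0.95 across 5 subsystems.
    target = equal_apportionment(0.95, 5)
    print(f"each subsystem needs R >= {target:.4f}")
    print(f"check: product of five targets = {target ** 5:.4f}")
```

The example shows why flowed-down requirements are always tighter than the top-level requirement: every subsystem must be more reliable than the system as a whole.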

Reliability design often begins with a system reliability model. Reliability models are usually expressed using reliability block diagrams and fault trees. They provide a graphical means of evaluating the relationships between different parts of the system for reliability purposes. Reliability models often incorporate reliability predictions based on parts count failure rates. These predicted failure rates come from databases of historical failure data. While these predictions are often not accurate in an absolute sense, they are very valuable to assess relative differences in design alternatives. Reliability models and predictions are performed using commercially available software tools and databases.
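
A parts count prediction can be sketched as a sum of part failure rates. The sketch below assumes constant failure rates and that any part failure fails the system; the parts list and its failure rates are invented for illustration and do not come from any real database.

```python
# Hypothetical parts list: part type -> (quantity, failure rate per 10^6 hours).
# These numbers are made up for illustration only.
PARTS = {
    "resistor": (10, 0.002),
    "capacitor": (4, 0.01),
    "microcontroller": (1, 0.05),
}


def parts_count_failure_rate(parts) -> float:
    """Sum quantity * failure rate over all part types (series assumption)."""
    return sum(qty * rate for qty, rate in parts.values())


def mtbf_hours(rate_per_million_hours: float) -> float:
    """MTBF in hours for a constant failure rate given per 10^6 hours."""
    return 1e6 / rate_per_million_hours


if __name__ == "__main__":
    total = parts_count_failure_rate(PARTS)
    print(f"predicted failure rate: {total:.3f} per 10^6 h")
    print(f"predicted MTBF: {mtbf_hours(total):,.0f} h")
```

Because every design alternative is scored with the same rates, the comparison between alternatives can be informative even when the absolute prediction is not.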

A Fault Tree Diagram

Several reliability design techniques are employed to meet the specified reliability. One of the most important techniques is redundancy. This means that, if one part of the system fails, there is an alternate success path, such as a backup system. As a simple example, an automobile brake light might use two light bulbs. If one bulb fails, the brake light still operates using the other bulb. Redundancy significantly increases system reliability, and is often the only viable means of doing so. However, redundancy is usually difficult and expensive, and is therefore limited to critical parts of the system.

Another design technique is physics of failure, which relies on understanding the physical processes of stress, strength and failure at a very detailed level. The material or component can then be redesigned to reduce the probability of failure.

Another common design technique is component derating. This means selecting components whose ratings significantly exceed the expected stress. A simple example would be using a heavier gauge wire than the normal specification requires for the expected amount of electrical current.
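
The effect of redundancy in the brake light example can be sketched with the usual series and parallel formulas for a reliability block diagram. The sketch assumes independent failures, and the bulb reliability of 0.9 is a hypothetical figure chosen only to make the arithmetic visible.

```python
import math


def series(reliabilities) -> float:
    """All blocks must work: multiply the block reliabilities."""
    return math.prod(reliabilities)


def parallel(reliabilities) -> float:
    """At least one block must work: one minus the product of unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    bulb = 0.9  # hypothetical reliability of one bulb over the period of interest
    print("single bulb:", bulb)
    print("two redundant bulbs:", parallel([bulb, bulb]))
    # Redundant bulbs in series with a hypothetical switch of reliability 0.99.
    print("with switch in series:", series([0.99, parallel([bulb, bulb])]))
```

The last line illustrates why redundancy is usually applied selectively: the non-redundant switch now limits the reliability of the whole light.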

Many reliability design tasks, techniques and analyses are specific to particular industries and applications. Common design tasks, techniques and analyses include:

Results of reliability design tasks are presented during the system design reviews and logistics reviews. Reliability is just one requirement among many system requirements. Engineering trade studies are used to determine the optimum balance between reliability and other requirements and constraints.

## Reliability Testing

A Reliability Sequential Test Plan

The purpose of reliability testing is to discover potential problems with the design as early as possible and, ultimately, provide confidence that the system meets its reliability requirements.

Reliability testing may be performed at several levels. Complex systems may be tested at component, circuit board, unit, assembly, subsystem and system levels. (The test level nomenclature varies among applications.) For example, performing environmental stress screening tests at lower levels, such as piece parts or small assemblies, catches problems before they cause failures at higher levels. Testing proceeds during each level of integration through full-up system testing, developmental testing, and operational testing, thereby reducing program risk. System reliability is calculated at each test level. Reliability growth techniques and failure reporting, analysis and corrective action systems (FRACAS) are often employed to improve reliability as testing progresses. The drawbacks to such extensive testing are the schedule and expense. Customers may choose to accept more risk by eliminating some or all lower levels of testing.

It is not always feasible to test all system requirements. Some systems are prohibitively expensive to test; some failure modes may take years to observe; some complex interactions result in a huge number of possible test cases; and some tests require the use of limited test ranges or other resources. In such cases, different approaches to testing can be used, such as accelerated life testing, design of experiments, and simulation techniques.

The desired level of statistical confidence also plays an important role in reliability testing. Statistical confidence is increased by increasing either the test time or the number of items under test. Reliability test plans are designed to achieve the specified reliability at the specified confidence level with the minimum number of test units and test time. Different test plans result in different levels of risk to the producer and consumer. The desired reliability, statistical confidence, and risk levels for each side influence the ultimate test plan. Good reliability test requirements ensure that the customer and developer agree in advance on how reliability requirements will be tested.
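
The link between confidence and the number of items under test can be illustrated with a zero-failure ("success-run") test plan. This sketch assumes independent pass/fail trials and a plan that passes only if every unit succeeds; it is one simple plan among many, and the function name is hypothetical.

```python
import math


def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Units that must all pass to demonstrate `reliability` at `confidence`.

    Derived from reliability ** n <= 1 - confidence for a zero-failure test.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


if __name__ == "__main__":
    for r, c in [(0.90, 0.90), (0.90, 0.95), (0.95, 0.90)]:
        n = success_run_sample_size(r, c)
        print(f"R = {r:.2f} at {c:.0%} confidence: {n} units, zero failures")
```

Raising either the required reliability or the confidence level increases the sample size, which is the cost side of the trade-off described above.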

A key aspect of reliability testing is to define "failure". Although this may seem obvious, there are many situations where it is not clear whether a failure is really chargeable to the system. Variations in test conditions, operator differences, weather, and unexpected situations create disagreements between the customer and the system developer. One strategy to address this issue is to use a scoring conference process. A scoring conference includes representatives from the customer, the developer, the test organization, the reliability organization, and sometimes independent observers. The scoring conference process is defined in the statement of work. Each test case is considered by the group and "scored" as a success or failure. This scoring is the official result used by the reliability engineer.

As part of the requirements phase, the reliability engineer develops a reliability test strategy in coordination with the customer. The test strategy makes trade-offs between the needs of the reliability organization, which wants as much data as possible, and the constraints such as cost, schedule, and other available resources. Test plans and test procedures are developed for each reliability test, and results are documented in official test reports.

## Software Reliability

Software reliability engineering is a special aspect of reliability engineering. System reliability, by definition, includes all parts of the system, including hardware, software, operators and procedures. Traditionally, reliability engineering focuses on critical hardware parts of the system. Since the widespread use of digital integrated circuit technology, software has become an increasingly critical part of most electronics and, hence, nearly all present-day systems.

There are significant differences, however, in how software and hardware perform. Most hardware unreliability is the result of a component or material failure that results in the system not performing its intended function. Repairing or replacing the hardware component restores the system to its original unfailed state. However, software does not fail in the same sense that hardware fails. Instead, software unreliability is the result of unanticipated results of software operations. Even relatively small software programs can have astronomically large combinations of inputs and states that are infeasible to exhaustively test. Restoring software to its original state only works until the same combination of inputs and states results in the same unintended result. Consequently, software reliability engineering must take into account the differences between hardware and software.

As with hardware, software reliability depends on good requirements, design and implementation. Software reliability engineering relies heavily on a disciplined software engineering process to anticipate and design against unintended consequences. There is more overlap between software quality engineering and software reliability engineering than exists between hardware quality and reliability. A good software development plan is a key aspect of the software reliability program. The software development plan describes the design and coding standards, peer reviews, unit test process, configuration management, software metrics and software models that will be used during software development.

A common reliability metric is the number of software faults, usually expressed as faults per thousand lines of code. This metric, along with software execution time, is key to most software reliability models and estimates. The theory is that the software reliability increases as the number of faults (or fault density) goes down. Establishing a direct connection between fault density and mean time between failures is difficult, however, because of the way software faults are distributed in the code, their severity, and the probability of the combination of inputs necessary to encounter the fault. Nevertheless, fault density serves as a useful indicator for the reliability engineer. Other software metrics, such as complexity, are also frequently used.
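
The fault density metric itself is simple arithmetic, as the short sketch below shows. The function name and the fault and line counts are hypothetical.

```python
def fault_density(faults: int, lines_of_code: int) -> float:
    """Faults per thousand lines of code (KLOC)."""
    return faults / (lines_of_code / 1000.0)


if __name__ == "__main__":
    # Hypothetical module: 30 known faults in 12,000 lines of code.
    print("faults per KLOC:", fault_density(30, 12_000))
```

Tracking this figure across builds gives a trend, even though it cannot be converted directly into a failure rate for the reasons given above.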

Testing is even more important for software than hardware. Even the best software development process results in some software faults that are nearly undetectable until tested. As with hardware, software is tested at several levels, starting with individual software units, through software integration and full-up software system testing. Unlike hardware, it is inadvisable to skip levels of software testing. During all phases of testing, software faults are discovered, corrected, and re-tested. Software reliability estimates are updated based on the fault density and other metrics. At system level, software mean-time-between-failure data is collected and used for reliability estimates. Unlike hardware, performing the exact same test on the exact same software configuration does not provide increased statistical confidence. Instead, software reliability uses different metrics such as test coverage.

Eventually, the software is integrated with the hardware in the top-level system, and software reliability is subsumed by system reliability. The Software Engineering Institute's Capability Maturity Model is a common means of assessing the overall software development process for reliability and quality purposes.

## Reliability Operational Assessment

After a system is produced, reliability engineering is performed during the system operation phase to monitor, assess, and correct reliability deficiencies. Data collection and analysis are the primary tools used during this phase. When possible, system failures and corrective actions are reported to the reliability engineering organization. The data is constantly analyzed using statistical techniques, such as Weibull analysis and linear regression, to ensure the system reliability meets the specification. Reliability data and estimates are also key inputs for system logistics. Data collection is highly dependent on the nature of the system. Most large organizations have maintenance organizations that collect failure data on their vehicles, equipment, machinery, etc. Consumer product failures are often tracked based on the number of returns. For some systems, especially those in dormant storage or standby state, it is necessary to establish a formal reliability surveillance program to inspect and test random samples of systems. Any changes to the system, such as field upgrades or recall repairs, require additional reliability testing to ensure the reliability of the modification.
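
The combination of Weibull analysis and linear regression mentioned above can be sketched as a median rank regression. This is a minimal version that assumes complete data (every unit ran to failure, with no suspensions), uses Bernard's approximation for the median ranks, and regresses the transformed ranks on log time; the failure times in the usage section are hypothetical.

```python
import math


def weibull_fit(failure_times):
    """Estimate Weibull shape (beta) and scale (eta) by median rank regression.

    Linearizes F(t) = 1 - exp(-(t / eta) ** beta) as
    ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta)
    and fits a straight line by ordinary least squares.
    """
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)  # Bernard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope is the shape parameter
    intercept = mean_y - beta * mean_x     # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


if __name__ == "__main__":
    # Hypothetical field failure times in hours.
    hours = [105.0, 190.0, 260.0, 340.0, 450.0]
    beta, eta = weibull_fit(hours)
    print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
```

A shape parameter below 1 suggests early-life failures, a value near 1 suggests a roughly constant failure rate, and a value above 1 suggests wear-out, which helps direct corrective action. Field data usually includes units that have not yet failed, which this simplified sketch does not handle.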

## Reliability Organizations

Systems of any significant complexity are developed by organizations of people, such as a commercial company or a government agency. The reliability engineering organization must be consistent with the company's organizational structure. For small, non-critical systems, reliability engineering may be performed informally. As system complexity grows, the need arises for a formal reliability function. Because reliability is important to the customer, the customer may even specify certain aspects of the reliability organization.

There are several common types of reliability organizations. The project manager or chief engineer may employ one or more reliability engineers directly. In larger organizations, there is usually a product assurance or specialty engineering organization, which may include reliability, maintainability, quality, safety, human factors, logistics, etc. In such cases, the reliability engineer reports to the product assurance manager or specialty engineering manager.

In some cases, a company may wish to establish an independent organization to perform reliability engineering. This is desirable in order to ensure that reliability work, which is often expensive and time consuming, is not unduly slighted due to budget and schedule pressures. In such cases, the reliability engineer works for the project on a day-to-day basis, but is actually employed and paid by a separate organization within the company.

Because reliability engineering is critical to early system design, it has become common for reliability engineers, however the organization is structured, to work as part of an integrated product team.

## Reliability Engineering Education and Certification

Reliability engineers typically have an engineering degree, which can be in any field of engineering, from an accredited university or college program. Many engineering programs offer reliability courses, and some universities have entire reliability engineering programs. A reliability engineer may be registered as a Professional Engineer by the state, but this is not required by most employers. There are many professional conferences and industry training programs available for reliability engineers. Several professional organizations exist for reliability engineers, including the IEEE Reliability Society, the American Society for Quality (ASQ), and the Society of Reliability Engineers (SRE). The ASQ has a well-known certification course for reliability engineers.

## References

Texts

• Blanchard, Benjamin S. (1992), Logistics Engineering and Management, Fourth Ed., pp 26-32, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
• Ebeling, Charles E., (1997), An Introduction to Reliability and Maintainability Engineering, pp 23-32, McGraw-Hill Companies, Inc., Boston.
• Kapur, K.C., and Lamberson, L.R., (1977), Reliability in Engineering Design, pp 8-30, John Wiley & Sons, New York.
• MacDiarmid, Preston; Morris, Seymour; et al., (no date), Reliability Toolkit: Commercial Practices Edition, pp 35-39, Reliability Analysis Center and Rome Laboratory, Rome, New York.
• Neufelder, Ann Marie, (1993), Ensuring Software Reliability, Marcel Dekker, Inc., New York.
• Shooman, Martin, (1987), Software Engineering: Design, Reliability, and Management, McGraw-Hill, New York.

Standards

• MIL-STD-785, Reliability Program for Systems and Equipment Development and Production, U.S. Department of Defense.
• MIL-HDBK-217, Reliability Prediction of Electronic Equipment, U.S. Department of Defense.
• IEEE 1332, IEEE Standard Reliability Program for the Development and Production of Electronic Systems and Equipment, Institute of Electrical and Electronics Engineers.