Is resiliency emerging as a richer version of reliability and a higher system priority than optimization?
By John Blyler, JB Systems
Resiliency has been described as a richer metric than reliability, as resilient systems have the capacity to survive, adapt and grow in the face of change and uncertainty.
Resiliency has been proposed as yet another needed capability for today’s increasingly complex, “smart” systems. Understandably, system architects and design engineers may be reluctant to add yet another “ilities-like” requirement to an already long list that includes reliability, maintainability, safety, and more.
What is resiliency, especially when applied to the engineering of complex hardware-software embedded systems? What is the difference between resiliency and reliability with feedback and control? Or the difference between resiliency and maintainability in terms of some measure of repairability that would restore partial or full functionality over a specified period of time and in a specified environment? What are some of the quantitative measures of a resilient system? This paper will attempt to answer these questions.
First, to help avoid semantic entanglements, let’s define a few terms. In general, a resilient system is one that can recover from a failure. INCOSE defines resilience as the capability of a system with specific characteristics before, during and after a disruption to absorb the disruption, recover to an acceptable level of performance, and sustain that level for an acceptable period of time. Further, it lists the main attributes of resilience as capacity, flexibility, tolerance and cohesion (1).
The IEEE adds a security element to resilience by defining it as a combination of trustworthiness and tolerance. (2) Wikipedia describes resilient control systems as those that maintain a state of awareness and an accepted level of operational normalcy in response to disturbances, including threats of an unexpected and malicious nature. (3)
It’s noteworthy that resilience is not defined in the usual reliability terms of subsystem or component MTBF and MTTR numbers. As Jim Rodenkirch notes (4), resiliency is the extended part of the reliability problem that deals with what can “go wrong” across the breadth of the system-of-systems (SOS) domain and the time required to “undo the wrong” to return the system to an acceptable – albeit different – level of operation.
Resiliency has been described as a richer metric than reliability, as resilient systems have the capacity to survive, adapt and grow in the face of change and uncertainty (5). In today’s world of complex embedded systems, resiliency might be equated with “smart recovery” systems, those that contain the capacity to evaluate and act on situational inputs via microprocessor hardware, software and connectivity to other systems like the Internet.
Unlike reliability, maintainability and systems safety, resilience is less of a specific topic and more of an over-arching set of considerations and design principles that help a system recover from a disruption. For our purposes, we are considering designed-in resilience, as opposed to intrinsic resilience, where the latter is the focus of material science, psychology and ecology.
A good analogy that ties resiliency, reliability and maintainability together is provided by Ivan Mactaggart, Principal Systems Engineer at Dstl and President-Elect of INCOSE UK: “My car is reliable in that it starts every time and has never broken down. The vehicle is reliable in part due to scheduled maintenance by a trained mechanic, which helps it perform the primary transportation function. However, it is not resilient to a head-on impact with another vehicle, in which case it may no longer perform its primary function. It is not resilient to that shock. I might be able to return the car to a normal (acceptable) level of performance with repair. Or the damage may be too severe to repair.”
Resiliency might have been added to the design of the car by selecting a hybrid architecture – gas and electric – though a severe accident might damage both systems. If one considers the system to extend beyond the car, resiliency can be added with public transportation – until the car is repaired or replaced. Public transportation is more limited than a car in where it can travel, but it might be acceptable. At least it returns some level of transportation function to the overall system.
Kenneth Lloyd, CEO of Systems Science at Watt Systems Technologies, explains that resiliency relates to the continued functional integrity (at some level) despite component failures (and other perturbations) through a range of operating conditions. Reliability relates component failures to MTTR and MTBF independent of functional integrity.
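The contrast Lloyd draws can be sketched numerically. Below is a minimal, illustrative Python snippet (not from the article) that pairs the classic reliability calculation of steady-state availability from MTBF and MTTR with a simple resilience-flavored measure of functional integrity retained after a disruption; the function names and example numbers are assumptions chosen for illustration.

```python
# Illustrative sketch contrasting a classic reliability metric with a
# simple resilience-style metric; names and numbers are assumptions.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures
    and mean time to repair: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def performance_retention(nominal: float, degraded: float) -> float:
    """A resilience-flavored measure: the fraction of nominal
    functional performance retained in a degraded operating state."""
    return degraded / nominal

# A component that fails rarely and repairs quickly is highly available...
print(availability(1000.0, 10.0))          # ~0.99
# ...yet the system may still retain only half its function after a shock.
print(performance_retention(100.0, 50.0))  # 0.5
```

The point of the sketch is that the two numbers are independent: a system can score well on availability while retaining little functional integrity after a perturbation, which is Lloyd’s distinction.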
What does this tell us about resilience in the context of the systems engineering “ilities” disciplines such as reliability, maintainability, and safety? Our previous automotive example shows that resiliency has strong connections to reliability and safety. This is one reason why many argue that resiliency is not a separate and distinct discipline from the other ‘ilities.’ Rather, resiliency depends upon the other “ilities,” in the same way that safety depends upon reliability.
How does one design for resiliency? This question assumes that resiliency is a measurable quantity. There is some debate on this point. The over-arching nature of resiliency may be one reason why measurements are difficult, e.g., multiple threats, multiple failure modes and multiple recovery modes. These issues make it hard to predict the resilience of a system.
According to the Systems Engineering Body of Knowledge (SEBoK) (6), a resilient system must possess the four attributes of capacity, flexibility, tolerance and cohesion. Let’s concentrate on the first one as a metric for resilient systems. According to Rodenkirch, the capacity attribute allows the system to withstand a threat. “Resilience allows for the capacity of a system to be exceeded, forcing the system to rely on the remaining attributes to achieve recovery.”
If engineers can quantify the capacity of a system to withstand failures, then that quantity can serve as a measure of resilience. In the case of SOS, resilience can be defined as the level of performance achieved relative to different levels of failure. Capacity is required to withstand these various levels of failure.
In a related study, researchers at Purdue University (7) considered the challenge of measuring resilience. To perform this measurement, they first defined two types of SOS resilience: conditional and total. Conditional resilience is the percentage of SOS performance maintained in response to a failure in a particular system or combination of systems. It can be thought of as a performance measure that indicates how much functionality is retained when a given set of systems fails.
Total resilience shows how performance is degraded as the total level of component system failures increases. According to the researchers, resilience patterns for the system are influenced by two factors: architecture type and system-level risk of the SOS. The architecture determines the general shape of the resilience pattern. The goal is to architect a system design that recovers to the highest level of performance possible after the failure.
In contrast to the resilience pattern, the system-level risk determines the scale or magnitude of the pattern, that is, how the system performance degrades as systems fail.
In the Purdue paper, researchers determine the two most critical systems of a multi-component threat detection SOS using the conditional resilience metric. They demonstrated that adding a communications link between these two systems increased the resilience, resulting in higher expected performance and slower expected performance degradation as a result of system failure. The goal now is to develop resilience patterns for more complex interactions.
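A toy expected-value calculation suggests why such a cross-link helps. The model below is an assumption for illustration only (it is not the Purdue study’s interdependency analysis): two independent systems each fail with some probability, and the cross-link is represented simply as raising the performance a lone surviving system can deliver.

```python
# Toy model (all numbers and the model itself are assumptions): two
# independent systems, each failing with probability p_fail. Both up
# gives full (1.0) performance, one up gives single_survivor_perf,
# both down gives 0.0.

def expected_performance(p_fail: float, single_survivor_perf: float) -> float:
    """Expected SOS performance over the four up/down states."""
    both_up = (1.0 - p_fail) ** 2
    one_up = 2.0 * p_fail * (1.0 - p_fail)
    return both_up * 1.0 + one_up * single_survivor_perf

# Without a link, a lone survivor delivers, say, 50% performance;
# with a link it can cover more of the mission, say 80%.
print(expected_performance(0.1, 0.5))  # ~0.90
print(expected_performance(0.1, 0.8))  # ~0.95
```

Even in this crude sketch, raising the lone-survivor performance increases expected SOS performance at every failure probability, consistent with the qualitative result the researchers reported.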
The Purdue study shows that some attributes of a resilient system can be measured. Treating resilience as an evolving, richer function of reliability might help facilitate further interest in and study of this emerging system design consideration. Finally, there is a need to place a greater emphasis on recoverability instead of just optimal states in the engineering of systems, which is another reason to consider augmenting reliability with resilient design.
- Resilient Systems Working Group, RMS Partnership, https://www.rmspartnership.org/
- “Resilience in computer systems and networks,” IEEE
- “Understanding Resilience’s Role in Designing Reliable Complex Systems,” Jim Rodenkirch
- “Evaluating System of Systems Resilience using Interdependency Analysis,” Seung Yeob Han, Karen Marais, and Daniel De Laurentis, School of Aeronautics and Astronautics, Purdue University (IEEE)
- Systems Engineering Body of Knowledge (SEBoK), http://sebokwiki.org/wiki/Guide_to_the_Systems_Engineering_Body_of_Knowledge_(SEBoK)