# Dependability, Reliability, and Availability
There is a set of concepts related to the probability of technical artifacts operating as intended under certain circumstances, concepts that tend to be used interchangeably: dependability, reliability, and availability. Do they all mean the same thing? Let's see.
In the literature—for example, in *Dependable computing and fault-tolerance*[^95]—dependability is defined as an umbrella concept that encompasses reliability, availability, safety, integrity, and maintainability. The IEC international vocabulary defines the dependability of an item as the "ability to perform as and when required".
Something dependable is both reliable and highly available, but also maintainable and safe. What, then, is reliability more specifically? Reliability is the ability to perform a required function under given conditions for a given time interval; in other words, the continuity of correct service.
There are several relevant reliability metrics, such as MTBF (Mean Time Between Failures) and MTTR (Mean Time to Repair), which describe the reliability of a system in figures, as well as the FIT (Failures in Time) number, which is the number of failures that can be expected in one billion ($1 \times 10^{9}$) device-hours of operation. The Mean Time to Failure (MTTF) is the average time the system operates until a failure occurs, and the Mean Time Between Failures (MTBF) is the average time between two consecutive failures. The difference between the two is the time needed to repair the system following the first failure:
$\large MTBF\ = \ MTTF\ + \ MTTR$
On the other hand, availability is the ability to be in a state to perform a required function, under given conditions, at a given instant of time, or after enough time has elapsed, assuming that the required external resources are provided. In short, availability is readiness for correct service. Another take states that availability is the degree to which a system is in an operable and committable state when it is called for at an unknown—i.e., a random—time. Availability tends to be also called *uptime*, although it is very easy to see uptime mixed up with reliability.
The long-term availability can be calculated from MTTF, MTBF, and MTTR as follows:
$\large A\ = \ \frac{MTTF}{MTBF}\ = \ \frac{MTTF}{MTTF\ + \ MTTR}$
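These relations can be checked numerically in a few lines of Python. The MTTF and MTTR figures below are purely illustrative, not taken from any particular system:

```python
mttf = 990.0  # mean time to failure, in hours (illustrative figure)
mttr = 10.0   # mean time to repair, in hours (illustrative figure)

mtbf = mttf + mttr          # MTBF = MTTF + MTTR
availability = mttf / mtbf  # long-term (steady-state) availability

print(f"MTBF = {mtbf:.0f} h, availability = {availability:.3f}")  # availability = 0.990
```

Note how a short repair time keeps availability high even when failures do occur: what matters for availability is the ratio between time spent operating and total time, not the absolute number of failures.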
High availability is specified and quantified in an unusual unit called *nines*. Percentages of a particular order of magnitude are sometimes referred to by the number of nines, or "class of nines", in their digits. For example, electricity that is delivered without interruptions 99.999% of the time would have 5 nines availability, or class 5. This unit finds application in enterprise computing and telecommunication equipment, often as part of a service-level agreement. Mind that availability, and the number of nines a service complies with, is a somewhat slippery metric that can be hard to gauge, depending on the scope of the service and how the downtimes are—sometimes opportunistically—counted.
The following table shows some examples of high availability for different numbers of *nines*, with an indication of the "downtime" per relevant unit of time (year, month, week, day) which the number of nines represents.
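The downtime figures follow directly from the definition, so they can also be generated programmatically. The sketch below approximates a month as 30 days and a year as 365.25 days:

```python
periods = {
    "year":  365.25 * 24 * 3600,  # seconds per (average) year
    "month": 30 * 24 * 3600,      # approximating a month as 30 days
    "week":  7 * 24 * 3600,
    "day":   24 * 3600,
}

for nines in range(2, 7):
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.1% downtime allowed
    downtimes = ", ".join(
        f"{unavailability * seconds:,.1f} s/{period}"
        for period, seconds in periods.items()
    )
    print(f"{1 - unavailability:.6%} ({nines} nines): {downtimes}")
```

For instance, five nines (99.999%) allow roughly 315 seconds, about five minutes, of downtime per year.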

> [!Note]
> With low downtime come great challenges as well. Systems running for long periods may suffer from software bugs that were not sufficiently exercised during development. For instance, the FAA issued an airworthiness directive^[https://www.govinfo.gov/content/pkg/FR-2016-12-02/pdf/2016-29064.pdf] (AD) recommending power cycling the redundant Flight Control Modules in Boeing 787 Dreamliners because they may simultaneously reset after being powered on (therefore, running) for 22 days.
> 
> In another incident, software problems with the Patriot missile system led to a failure at Dhahran, Saudi Arabia. The subsequent report found that the failure to track incoming Scud missiles was caused by a precision problem in the software. The computer used to control the Patriot missile was based on a 1970s design and used 24-bit arithmetic. The Patriot system tracked its target by measuring the time it took for radar pulses to bounce back from it. Time was recorded by the system clock in tenths of a second but stored as an integer. To enable tracking calculations, the time was converted to a 24-bit floating-point number. Rounding errors in the time conversions caused shifts in the system's range gate, which was used to track the target. The inaccurate time calculations shifted the range gate so much that the system could not track the incoming missile, and the error increased with uptime (see table below).
> 
A fundamental difference between reliability and availability is that reliability refers to failure-free operation during an interval, while availability refers to failure-free operation at a given instant of time, usually the instant when a device or system is requested to provide a required function or service. Reliability is a measure that characterizes the failure process of an item and its probability of failure. Availability combines the failure process with the restoration or recovery process, and looks at the probability that, at a given instant, the item is operational, independently of the number of failure/repair cycles the item has already undergone. This is interesting because it means that something can be considered reliable and yet of low availability. Let's see an example using one of the most dreadful machines ever conceived: printers.
Are printers available or reliable? You would probably say neither. But, more accurately, provided we patiently supply them with all the elements they need (toner, paper, electricity, love, and comprehension) and ensure the right configuration (IP address, drivers, etc.), they may work for a given amount of time, at least as long as all those conditions are met. This is somehow enough to call them "reliable"; remember that reliability is the ability to perform a function **for a given period of time**. But are printers available? Definitely not. Most of the time when you randomly need them, and urgently need to get something printed, they will turn their back on you. Therefore, printers cannot be called dependable devices, for they are not reliable *and* available.
Designing for dependability requires a set of engineering measures which not only include using the right materials to ensure constructive reliability, but also ensuring that failures, should they happen, will be sorted out in good time to always keep the system in its right state to respond if needed. What are typical examples of dependable systems? A few very quickly pop up: life support equipment, surgery rooms, power grids, aircraft. The implications of these systems being undependable are intolerable. To make them dependable, a set of technologies and architectural decisions must be combined: alternative power sources, hot/cold redundancies, cross-strapping circuitry, self-diagnostics, failure detection and handling, etc.
System dependability is largely a design decision, and it requires the right amount of architectural flexibility to accommodate the errors and faults that will surely take place.
All in all, the life cycle of a product can be divided into three periods:
- An "early life" stage, marked by "infant mortality" failures
- A "useful life" stage, with essentially constant failure rates
- An "end-of-life" stage, marked by "wear-out" failures.

> [!Figure]
> _Bathtub curve_
During the early-life or "burn-in" stage, the failure rate decreases. The reliability of the product grows over time. At this stage, failures are generally due to process implementation problems and to the environmental stress screening of the design and components. The useful life stage is represented by a constant failure rate. The failure rate is independent of the age of the product (which is why these failures are often described as "random"). This period, often non-existent for mechanical goods, is the reference period for electronics. During the end-of-life or "wear-out" stage, the failure rate increases with the number of hours of operation: the older the product, the more likely it is to fail. This type of behavior is characteristic of items subject to wear or other forms of gradual deterioration. It is reflected in rising failure rates (see figure above).
## Weibull distribution
The Weibull distribution is a continuous probability distribution used to model lifespans, reliability, and failure rates of items over time. It was first introduced by Swedish mathematician Waloddi Weibull in 1951 and has since found widespread application in various fields, including reliability analysis.
The Weibull distribution allows for a wide range of shapes to describe the failure behavior of items, as it can model scenarios where the failure rate increases, decreases or remains constant over time.
The Weibull distribution is defined by two parameters:
- Shape Parameter ($k$): Determines the shape of the distribution curve.
- Scale Parameter ($λ$): Sets the characteristic or reference lifespan.
The probability density function of the Weibull distribution describes the likelihood of a random variable taking on a specific value. The PDF formula for the Weibull distribution is:
$\large f(x;k,\lambda) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}$
where $x$ is the random variable, $k$ is the shape parameter, and $\lambda$ is the scale parameter.
The cumulative distribution function gives the probability that the random variable is less than or equal to a certain value. The CDF for the Weibull distribution can be expressed as:
$\large F(x;k,\lambda) = 1 - e^{-(x/\lambda)^k}$
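The two formulas above translate directly into code. As a sanity check, the short sketch below compares them against SciPy's `weibull_min` implementation (the same one used in the fitting example later in this section):

```python
import numpy as np
from scipy import stats

def weibull_pdf(x, k, lam):
    # f(x; k, lam) = (k/lam) * (x/lam)**(k-1) * exp(-(x/lam)**k)
    return (k / lam) * (x / lam)**(k - 1) * np.exp(-(x / lam)**k)

def weibull_cdf(x, k, lam):
    # F(x; k, lam) = 1 - exp(-(x/lam)**k)
    return 1.0 - np.exp(-(x / lam)**k)

k, lam = 2.0, 100.0
x = np.linspace(1.0, 300.0, 50)
assert np.allclose(weibull_pdf(x, k, lam), stats.weibull_min.pdf(x, k, scale=lam))
assert np.allclose(weibull_cdf(x, k, lam), stats.weibull_min.cdf(x, k, scale=lam))
```

Note that SciPy calls the shape parameter `c` and exposes $\lambda$ through its generic `scale` argument.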
The mean $\mu$ and variance $\sigma^2$ of the Weibull distribution can be expressed in terms of the shape and scale parameters:
- Mean: $\large \mu = \lambda \cdot \Gamma\left(1 + \frac{1}{k}\right)$, where $\Gamma$ denotes the gamma function.
- Variance: $\large \sigma^2 = \lambda^2 \cdot \left[\Gamma\left(1 + \frac{2}{k}\right) - \left(\Gamma\left(1 + \frac{1}{k}\right)\right)^2\right]$
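For example, with $k = 2$ (the Rayleigh case) the gamma function takes closed-form values, which makes both expressions easy to verify:

```python
from math import gamma, sqrt, pi

k, lam = 2.0, 100.0
# gamma(1.5) = sqrt(pi)/2 and gamma(2) = 1, so both moments are exact here
mean = lam * gamma(1 + 1 / k)                                  # lam*sqrt(pi)/2 ~ 88.62
variance = lam**2 * (gamma(1 + 2 / k) - gamma(1 + 1 / k)**2)   # lam**2*(1 - pi/4) ~ 2146

assert abs(mean - lam * sqrt(pi) / 2) < 1e-9
assert abs(variance - lam**2 * (1 - pi / 4)) < 1e-6
```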
The Weibull distribution is widely used to model the reliability and failure rates of mechanical, electrical, and electronic systems. It is also applied in industries such as manufacturing and finance to analyze the lifespans of products and components. For shape parameters below one, its tail is heavier than the exponential's, which makes it useful for modeling extreme events, such as extreme weather patterns or rare but catastrophic failures.
Below is some Python code to play with the Weibull distribution:
```Python
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Generate random samples from a Weibull distribution
shape = 2.0 # Shape parameter
scale = 100 # Scale parameter
size = 1000 # Number of samples
data = np.random.weibull(shape, size) * scale
# Fit a Weibull distribution to the data using maximum likelihood estimation
shape_fit, loc_fit, scale_fit = stats.weibull_min.fit(data, floc=0)
# Plot histogram of the data
plt.hist(data, bins=30, density=True, alpha=0.6, color='g', label='Histogram')
# Generate x values for plotting the PDF
x = np.linspace(0, max(data), 1000)
# Plot the PDF of the fitted Weibull distribution
plt.plot(x, stats.weibull_min.pdf(x, shape_fit, loc=loc_fit, scale=scale_fit), 'r-', lw=2, label='Weibull PDF')
plt.title('Histogram and Weibull PDF')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend()
plt.grid(True)
plt.show()
```
Which yields:

The Weibull distribution often arises when dealing with populations that exhibit heterogeneity, meaning that the individuals within the population have different underlying characteristics. This heterogeneity can manifest as variations in material strength, component reliability, or survival times. The Weibull distribution accommodates such heterogeneity by allowing the shape parameter to capture the variability in failure or survival times among individuals within the population.
Many physical systems, such as mechanical components or biological organisms, undergo aging and wear processes over time. These processes can change failure rates or survival probabilities as the system ages. The Weibull distribution can capture such non-constant hazard rates, allowing for an increasing, decreasing, or constant failure rate over time, depending on the value of the shape parameter.
The Weibull distribution is well-suited for modeling systems that exhibit distinct phases in their lifecycle, such as early life failures, a period of relatively stable operation, and eventual wear-out failures. The shape parameter of the Weibull distribution can control the transition between these different phases, leading to characteristic shapes in the distribution curve.
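These regimes can be seen directly in the Weibull hazard (instantaneous failure) rate, $h(t) = (k/\lambda)(t/\lambda)^{k-1}$, which follows from the PDF and CDF given earlier as $h(t) = f(t)/(1 - F(t))$:

```python
import numpy as np

def weibull_hazard(t, k, lam):
    # h(t) = f(t) / (1 - F(t)) = (k/lam) * (t/lam)**(k-1)
    return (k / lam) * (t / lam)**(k - 1)

t = np.array([10.0, 50.0, 100.0])
# k < 1: decreasing hazard (infant mortality / burn-in)
# k = 1: constant hazard (useful life; the exponential special case)
# k > 1: increasing hazard (wear-out)
for k in (0.5, 1.0, 2.0):
    print(f"k = {k}: h(t) = {weibull_hazard(t, k, 100.0)}")
```

A piecewise combination of these three regimes is one common way to approximate the bathtub curve described earlier.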
In some cases, the Weibull distribution may arise from extreme value analysis, focusing on modeling rare but catastrophic events or extreme observations. For shape parameters below one, its tail is heavier than that of the exponential distribution, allowing it to capture the likelihood of such extreme events occurring and making it useful for risk assessment and reliability analysis in situations where extreme values are of interest.
> [!Note]
> What would it take to create a Galton board that shows a Weibull distribution?
>
> To create a Galton Board that demonstrates a Weibull distribution instead, we would need to modify its design and mechanisms to reflect the characteristics of the Weibull distribution. Some potential modifications that could be made to a Galton Board to produce a Weibull distribution are:
>
> **Varying Width of Pins or Channels:** In a standard Galton Board, the pegs through which the balls pass are typically evenly spaced and have uniform widths. To simulate a Weibull distribution, you could vary the width of the channels or the spacing between the pegs. This variation could represent the heterogeneity in failure or survival times observed in systems modeled by the Weibull distribution.
>
> **Non-Uniform Heights of Rows:** In addition to varying the width of the channels, you could also vary the heights of the rows of pegs. For example, you could have rows with increasing or decreasing heights to represent changing failure rates over time. This would allow the balls to experience different levels of resistance or obstacles as they fall through the board, leading to a distribution that reflects the characteristics of the Weibull distribution.
>
> **Introducing Obstacles or Barriers:** Another approach could be to introduce obstacles or barriers within the Galton Board that the balls encounter as they fall through the board. These obstacles could represent aging mechanisms, wear and tear, or other factors that affect the failure or survival times of systems modeled by the Weibull distribution. By strategically placing obstacles of varying heights and positions, you could create a distribution that exhibits the desired characteristics of the Weibull distribution.
[^95]: https://ieeexplore.ieee.org/document/532603