Reliable, according to Google
Consistently good in quality or performance; able to be trusted.
The reliability of software can be seen in terms of expectations about its basic functionality, tolerance of human error, performance under load, and security. A system that continues to work correctly even when faults occur is called a Reliable System (in other words, a fault-tolerant or resilient system).
When we talk about Faults, we also hear the word Failure. But what is the difference between a Fault and a Failure?
| Fault | Failure |
|---|---|
| When one component stops working or deviates from its spec | When the whole system stops working |
It is impossible to create a system with zero faults, so the goal is to design a fault-tolerant system that prevents faults from causing failures.
Why is it good to deliberately introduce faults?
A system can break for many reasons, such as poor error handling. It therefore makes sense to deliberately trigger faults in a system to ensure the fault-tolerance machinery is working as intended. For example, Netflix's Chaos Monkey is an open-source tool whose job is to randomly kill instances and services within a system, so that engineers are forced to build their services to be resilient to instance failures.
Netflix: The best way to avoid failure is to fail constantly.
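To make the idea concrete, here is a minimal toy sketch of a chaos-testing loop in Python. This is not Chaos Monkey's actual code or API; the fleet, the `kill` helper, and the health check are all invented for illustration:

```python
import random

# Hypothetical fleet of instances; in reality these would come from
# your orchestrator's or cloud provider's inventory API.
instances = ["web-1", "web-2", "api-1", "api-2", "worker-1"]

def kill(instance: str) -> None:
    # Stand-in for a real terminate call to the orchestrator.
    print(f"terminating {instance}")
    instances.remove(instance)

def chaos_round() -> None:
    victim = random.choice(instances)
    kill(victim)
    # The health check is the whole point of the exercise: the system
    # must keep serving traffic after losing any single instance.
    assert len(instances) > 0, "fleet wiped out; no redundancy left"

chaos_round()
```

If a service can only survive a chaos round by luck, the assertion (or a real health check) fails in testing rather than during a production outage.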
Note – There are cases where deliberately introducing faults is not an option, such as in security matters: once a breach has happened, it cannot be undone. In those cases, prevention is better than cure, because a cure doesn't exist.
Faults can be broadly categorized in three ways: Hardware, Software, and Human-induced.
Hardware Faults
Possible causes of hardware faults include faulty RAM, power blackouts, hard disk crashes, etc. Hard disks are reported to have a mean time to failure (MTTF) of 10-50 years, so on a storage cluster with 10,000 disks, we should expect one disk to die per day on average.
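A quick back-of-the-envelope check of that claim (the cluster size and the 30-year midpoint are assumptions):

```python
# Expected disk failures per day, assuming independent failures at a
# constant rate (a simplification; real failure rates vary with disk age).
disks = 10_000                 # assumed cluster size
mttf_days = 30 * 365           # assumed MTTF: 30 years, mid of the 10-50 range
failures_per_day = disks / mttf_days
print(f"{failures_per_day:.2f} failures/day")  # ~0.91, i.e. about one per day
```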
To tackle such situations, we can add redundancy: setting up a RAID configuration across multiple drives, using dual power supplies and hot-swappable CPUs, or adding backup power such as batteries and generators.
A Redundant Array of Independent Disks (RAID) is a strategy that spreads data across multiple drives for better performance and reliability of data stores.
On cloud platforms, it is fairly common for virtual machines to be shut down or become unavailable without warning, since these platforms prioritize flexibility and elasticity over single-machine reliability. That is why we also need software fault-tolerance techniques that can tolerate the loss of entire machines.
Software Faults
Software faults are systematic and tend to be more correlated than hardware faults: a bug in the code typically affects every node running that code. They can also be hard to debug, since tracing an error back to its origin takes time if the system lacks proper monitoring and testing.
Software faults can happen for many reasons: bugs in code, a runaway process using up a shared resource (CPU, memory, network, etc.), the slowing down of a service on which your system depends, or classic cascading failures, where an error in one component triggers faults in other components of your system.
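One common defense against a slow dependency turning into a cascading failure is to bound every remote call with a timeout and fall back to a degraded response. Here is a minimal sketch in Python; the pool size, timeout, and fallback are assumptions, and a production system would add retries, circuit breakers, and load shedding on top:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# A shared pool bounds how many in-flight calls to the dependency we allow.
_pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s=2.0, fallback=None):
    """Run fn, but give up after timeout_s and return a fallback,
    so one slow service cannot stall every thread that calls it."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # Note: the abandoned call still occupies a pool slot until it
        # finishes; we degrade gracefully instead of hanging the caller.
        return fallback

def slow_dependency():
    time.sleep(5)  # simulates an overloaded downstream service
    return "fresh data"

print(call_with_timeout(slow_dependency, timeout_s=0.5, fallback="stale cache"))
```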
Thorough testing, careful monitoring, process isolation, and allowing processes to crash and restart are measures that help in such scenarios; good telemetry in particular lets developers trace a fault back to its cause.
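As a tiny illustration of process isolation with crash-and-restart, here is a hypothetical supervisor loop ("worker.py" is an invented script name; real deployments would use systemd, Kubernetes, or a supervisor daemon instead):

```python
import subprocess
import time

# Keep a worker running in its own OS process; if it crashes, restart it,
# so a fault in the worker cannot take the rest of the system down with it.
for attempt in range(5):                     # cap restarts for this demo
    result = subprocess.run(["python", "worker.py"])
    if result.returncode == 0:
        break                                # clean exit: stop supervising
    print(f"worker exited with {result.returncode}; restarting in 1s")
    time.sleep(1)
```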
Human Errors
Humans are unreliable and can cause systems to fail. Studies of large internet services have found that configuration errors by operators are the leading cause of outages, ahead of server or network faults.
To guard against this, we can design APIs and admin interfaces that make it easy to do the right thing. We can also provide a separate sandbox environment for testing, where people can experiment with real data without affecting real users.
Techniques like fast rollback of changes and gradual releases are also great ways to recover quickly when human errors slip through.
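As a sketch of how a gradual release can work, here is a deterministic percentage rollout in Python; the feature name, hashing scheme, and percentages are all assumptions:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Place a user in the first `percent` of a rollout by hashing
    (feature, user_id). The decision is deterministic, so ramping
    1% -> 10% -> 100% only ever adds users, and setting percent
    back to 0 acts as an instant rollback."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < percent / 100.0

# Hypothetical feature ramped to 5% of users first.
print(in_rollout("user-42", "new-recommender", percent=5))
```

Because the bucketing is a pure function of the user and feature, a bad release only ever reaches the ramped percentage of users before it is rolled back.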
Reliability is essential to software quality and to users' trust. Beyond that, unreliable systems can lose data, are difficult to improve, and their failures can be extremely costly.
References
Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017. [DDIA]