

Five 9’s reliability has been an important part of “Telecom Grade” network equipment for a long time. Over the years it has been used to justify everything from much higher prices for “Telecom Grade” IP switches to large product development budgets. For a network node to achieve five 9’s reliability is a daunting challenge. A network node contains a huge number of electronic devices, and the failure of any one of them could result in a node failure. The sheer number of devices creates a major problem. In very simplistic terms, if 100 devices are each designed to fail, on average, only once in 100 years, then a system that depends on all 100 of them will fail, on average, once a year. Thus, it is not possible to build each device well enough to ensure that the node will be five 9’s reliable.
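The arithmetic above can be sketched in a few lines of Python. This is a simplified illustration, not a reliability model: it assumes every device fails independently at a constant average rate and that any single device failure takes the node down. The function names are hypothetical, and the year is taken as 365 days.

```python
def series_failure_rate(device_mtbf_years, device_count):
    """Average failures per year for a system that fails when ANY device fails.

    Assumes independent devices with constant failure rates, so the
    individual rates (1 / MTBF each) simply add up.
    """
    return device_count / device_mtbf_years


def allowed_downtime_minutes_per_year(nines):
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines          # five 9's -> 0.00001
    return unavailability * 365 * 24 * 60  # minutes in a 365-day year


if __name__ == "__main__":
    # 100 devices, each failing on average once in 100 years
    print(series_failure_rate(100, 100), "failures per year")
    print(round(allowed_downtime_minutes_per_year(5), 2), "minutes per year")
```

Under these assumptions the 100-device system fails about once a year, while five 9’s allows only about 5.3 minutes of downtime per year, which shows how far apart the two figures are.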
To solve this problem, telecom nodes are built with extreme redundancy, eliminating any single point of failure. A node failure can then only be caused by multiple device failures. With redundancy, statistics starts to work in our favor. Multiple independent simultaneous failures are extremely rare. Think of rolling dice. The odds of getting a six when rolling a single die are 1 in 6. The odds of rolling a pair of sixes with a pair of dice multiply to 1 in 36, which is much less likely. Eliminating single points of failure makes five 9’s node reliability achievable.
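The dice analogy can be written out directly. The sketch below assumes the redundant copies fail independently, which is exactly the condition the multiplication rule needs; the function name and the 1-in-1000 figure are illustrative assumptions, not values from any real equipment.

```python
from fractions import Fraction


def all_fail_probability(p_single, copies):
    """Probability that every one of `copies` independent units fails at once."""
    return p_single ** copies


if __name__ == "__main__":
    # One die showing a six, then a pair of dice both showing six
    print(all_fail_probability(Fraction(1, 6), 1))  # 1/6
    print(all_fail_probability(Fraction(1, 6), 2))  # 1/36

    # Hypothetical device that is down 1 time in 1000, duplicated
    print(all_fail_probability(Fraction(1, 1000), 2))  # 1/1000000
```

Each added independent copy multiplies the odds again, which is why duplicating devices improves node reliability far more than improving any single device does.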

Even with high reliability components and extreme redundancy, node reliability alone is still not enough to guarantee network reliability. Two system level problems arise.
The first problem is that, in any telecom network, there are a large number of nodes. This large number of nodes affects network reliability the same way the large number of devices in a node affects node reliability. The failure of any one of the nodes can take down significant parts of the network. The solution is also similar: eliminate or reduce the impact of a single node failure.
For example, most view SONET rings as a way of protecting against fiber breaks. The nodes on both sides of the break reroute traffic to the standby protection fiber. However, rings are also used to minimize the impact of a failed node. The nodes on both sides of the failed node reroute traffic just as they would for a fiber break. Of course, traffic which was local to that node would be disrupted, and if the failed node was a bridge to other rings or networks, major disruptions would occur. As a result, we are seeing more and more movement from rings to meshes to eliminate the single points of failure which remain in SONET rings.
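The rerouting idea can be illustrated with a toy model of a ring. This is only a sketch of the topology argument and does not represent actual SONET protection switching: nodes are numbered 0 to N-1 around the ring, and the hypothetical `ring_path` function simply tries one direction and then the other.

```python
def ring_path(ring_size, src, dst, failed=frozenset()):
    """Return a list of nodes from src to dst around a ring, or None.

    Tries the clockwise direction first, then counter-clockwise,
    and gives up on a direction as soon as it meets a failed node.
    """
    if src in failed:
        return None
    for step in (1, -1):
        path = [src]
        node = src
        while node != dst:
            node = (node + step) % ring_size
            if node in failed:
                break  # this direction is blocked
            path.append(node)
        else:
            return path  # reached dst without meeting a failed node
    return None


if __name__ == "__main__":
    print(ring_path(6, 0, 2))                 # normal path
    print(ring_path(6, 0, 2, failed={1}))     # rerouted the long way round
    print(ring_path(6, 0, 1, failed={1}))     # traffic local to the failed node
    print(ring_path(6, 0, 3, failed={1, 5}))  # two failures isolate node 0
```

In this model a single failed node still leaves a path between every pair of surviving nodes, traffic terminating at the failed node is lost, and a second failure can cut the ring in two. A mesh adds more alternate paths, which is the motivation for the move away from rings.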
The second problem is that all of the probability and statistics we have used apply to random, independent events, an assumption that doesn’t account for the problems of the real world. The analysis also only covers failures resulting from internal faults. Real world problems include multiple simultaneous node failures caused by external events. How many telecom nodes failed on 9/11 or during Katrina? The recent Taiwanese earthquake is another example. No level of node or link reliability helps protect from these disasters. On a less dramatic level, a single software glitch (malicious or not) can also take down large numbers of nodes. This is why it is prudent for major Internet users to use not only multiple Internet pipes, but pipes from different carriers. Again, the only true solution is to have sufficient resources and capability at the network level to swiftly reroute traffic around problem areas.
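A small calculation shows why the independence assumption matters so much. The sketch below uses a deliberately simple, hypothetical model: each node of a redundant pair is down with probability `p_independent` from its own internal faults, and a shared external event (a disaster, a common software bug) takes both down with probability `p_common`. The numbers are illustrative only.

```python
def pair_outage_probability(p_independent, p_common):
    """Probability that both nodes of a redundant pair are down together.

    Either a shared external event takes out both nodes, or no such
    event occurs and both nodes fail independently at the same time.
    """
    return p_common + (1 - p_common) * p_independent ** 2


if __name__ == "__main__":
    # Independent faults only: the multiplication rule applies
    print(pair_outage_probability(1e-3, 0.0))
    # Add a shared cause that is ten times rarer than a single-node fault
    print(pair_outage_probability(1e-3, 1e-4))
```

With these assumed values, a shared cause that is rarer than any single node fault still raises the chance of a double outage by roughly a factor of one hundred, so the common event dominates and further hardening of the individual nodes changes little.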
This raises some interesting questions: If the only way to ensure a high reliability network is to design a network which continues to operate in the presence of one or more node failures, how important are five 9’s reliable nodes? The issue of five 9’s reliability is no longer one of network needs, but network economics. We know that adding five 9’s reliability to network nodes adds major costs. Is the added node robustness worth the added costs? This is not as heretical as it initially seems. From university researchers to Internet exchange operators, networks are being designed and run where five 9’s reliability applies only at the network level, where it belongs.
Figure 2. AMSIX Redundant Network.
