Crash-proofing Linux: Red Hat on fault-tolerant servers
by Nick Carr
A multitude of factors determine an application’s level of availability. Many are beyond the control of technology suppliers, and rest with the people and processes in place to support the applications and underlying IT infrastructure.
Even so, as Red Hat® Enterprise Linux® elbows its way into mainstream business applications and data centers, its reliability and the adequacy of its support will be under close scrutiny, as will the reliability and support of the hardware. An operating system with four-nines (99.99%) reliability running on a server with only three-nines (99.9%) uptime yields a solution that is less reliable than its weakest link: if the two fail independently, their availabilities multiply to roughly 99.89%. For most business-critical applications, that’s unacceptable. What does “unacceptable” mean in real terms?
A well-known package delivery company estimates the cost of downtime at $25,000 per minute if its principal hub loses its computer-controlled conveying system during a nightly three-hour sorting window. One and a half million dollars an hour.
If the communications system of a New York City securities exchange is down for more than 15 minutes, the exchange has to call the office of the President of the United States.
If a glitch in a computer-aided dispatch system slowed the response time of an EMT to the scene of an accident, it could cost a life.
Not all mission-critical and business-critical applications have such dire consequences when they go down. But if it’s your customer-facing application that goes offline, your ATM transaction that fails, or your compliance data that gets lost, the resulting pain is very real nevertheless.
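The availability arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, assuming the operating system and the server fail independently so that their availabilities multiply. The function names are hypothetical, and the $25,000-per-minute figure is the package delivery company's estimate quoted above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def combined_availability(*availabilities):
    """Availability of components in series, assuming independent failures."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_cost(minutes, cost_per_minute):
    """Cost of an outage lasting the given number of minutes."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Four-nines OS on a three-nines server (figures from the article).
    stack = combined_availability(0.9999, 0.999)
    print(f"combined availability: {stack:.4%}")
    print(f"expected downtime: {downtime_minutes_per_year(stack):.0f} min/year")
    print(f"five nines: {downtime_minutes_per_year(0.99999):.2f} min/year")
    print(f"one hour at $25,000/min: ${downtime_cost(60, 25_000):,.0f}")
```

Under these assumptions the four-nines/three-nines stack works out to roughly 99.89% availability, or a little over nine and a half hours of downtime a year, while five nines allows just over five minutes.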
Crash-proof Linux
Since August 2006, Red Hat Enterprise Linux 4 AS customers have had a new “availability” option: the ability to deploy their applications – from web servers to databases – on Intel-based, fault-tolerant servers from Stratus Technologies. The Stratus ftServer family has proven itself in the field supporting standard, unmodified applications and enabling them to consistently deliver better than 99.999% uptime for the server and OS together. For customers who prefer, need, demand, or cannot live without continuous application availability, this is a big deal.
In early 2006, Red Hat and Stratus collaborated to add code to the Linux 2.6 kernel enabling the operating system to run in a fault-tolerant architecture. Accepted by the community, this code is now in the standard distribution, and its purpose is described in more detail below. Virtually any application for the Red Hat Enterprise Linux operating environment will run unmodified on a supported Stratus server.
A fault-tolerant server is a different animal from a traditional server. Most importantly, it is not two or more servers in a high-availability cluster configuration. The fault-tolerant architecture is inherently redundant, and designed to prevent failures from occurring, rather than to recover from failure after the fact (the operational characteristic of high-availability clusters). Think of it as the equivalent of two complete servers in one box. Part of the magic of fault-tolerant technology is that the two physical units run in complete lockstep with each other, doing the exact same thing at the same time. The operating system and application see only one server. Lock-stepping lets the server and the application ride through most transient errors. If something breaks, the server isolates the component from the rest of the system while the companion part continues to run unaffected, as does the application. There is no downtime, no service degradation, and no loss of data.
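The lockstep idea described above can be illustrated with a toy software model. This is only a conceptual sketch, not a description of how ftServer hardware works internally: the real mechanism lives in Stratus chipsets, and the `Unit` and `LockstepPair` classes, the `healthy` self-check flag, and the fault injection are all hypothetical names invented for illustration.

```python
class UnitFault(Exception):
    """Raised when a unit's (hypothetical) self-check detects a fault."""


class Unit:
    """One of two redundant processing units (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def execute(self, operation, *args):
        if not self.healthy:
            raise UnitFault(self.name)
        return operation(*args)


class LockstepPair:
    """Presents two redundant units to the caller as one logical server."""

    def __init__(self):
        self.online = [Unit("A"), Unit("B")]
        self.isolated = []

    def run(self, operation, *args):
        results = []
        # Every online unit performs the same operation on the same input.
        for unit in list(self.online):
            try:
                results.append(unit.execute(operation, *args))
            except UnitFault:
                # Isolate the failed unit; its partner keeps running.
                self.online.remove(unit)
                self.isolated.append(unit)
        if not results:
            raise RuntimeError("both units failed")
        if any(result != results[0] for result in results):
            raise RuntimeError("lockstep divergence detected")
        return results[0]


if __name__ == "__main__":
    server = LockstepPair()
    print(server.run(sum, [1, 2, 3]))         # both units healthy
    server.online[0].healthy = False          # inject a fault into unit A
    print(server.run(sum, [1, 2, 3]))         # caller still gets an answer
    print([u.name for u in server.isolated])  # unit A has been isolated
```

The point of the sketch is that the caller of `run` never sees the failure: the result is returned from the surviving unit in the same call, which is the difference from a cluster that must first detect a failure and then fail over.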
The technology that enables 99.999+% availability resides in Stratus-designed chipsets and Stratus ftServer System Software (ftSSS) layered between the OS kernel and the application. Among other things, the chipsets are the system traffic cops, monitoring and checking all transaction activity for any anomalies. The ftSSS, together with the OS code modifications and device-driver hardening, enables surprise device removal and insertion, as well as hot addition/removal of RAID partitions, in an operating server. When a customer-replaceable unit is removed and reinserted, a feature called rapid disk resynchronization (RDR) maintains availability by re-mirroring disks up to six times faster than without it. In addition, an RDR utility continuously sweeps the disks for bad blocks, fixes them, and updates the mirror without system interruption.
So Red Hat Enterprise Linux customers today have three primary deployment models for their servers: traditional standalone servers, clustered servers that provide rapid recovery from failure using failover technology, and, now, support for continuous operation using highly redundant fault-tolerant servers. The three models provide increasing levels of application availability.
And while reliability is vital, IT management has other concerns, such as staffing resources, system management complexity, meeting SLA commitments to business clients, and budgets. Red Hat Enterprise Linux on a Stratus ftServer system is no more complex to install, deploy, operate, and manage than a basic vanilla server. When availability is a key user concern, this is a very cost-effective solution that can’t be beat for application uptime.
Now, with the option of a fault-tolerant ftServer system from Stratus, the community has yet another reason to welcome Red Hat Enterprise Linux into mainstream business operations.
More resources
Learn more at stratus.com, including:
- The two lines of ftServer systems that support Red Hat Enterprise Linux: ftServer for the enterprise and ftServer T Series for telecommunications.
- Fault-tolerant technology whitepapers.
- An uptime meter on the homepage. This actual uptime calculation for the installed base of ftServer systems tracks the previous 60 days and is updated daily. It typically reads 99.9998% for server and OS combined.







January 14th, 2007 at 11:54 am
I think that I have a few ideas on how to crash-proof Linux on any fault-tolerant setup.
I just install any ia32 or ia64 Red Hat Enterprise Linux with the defaults, then add some or all of the Perl, Python, and Java supporting programs, all the drivers needed for that specific system, and a few more tweaks that I keep in my recipes-for-Linux manual.
One thing that I’d like to see is someone developing a grammar corrector for Linux. Please, use Python and Java for it!
January 20th, 2007 at 1:19 am
hello sir
I'm really a fan of Linux, but the problem is I have no knowledge of Linux. Can you help me figure out where to start? I want to be a network administrator and a system administrator. I'm a computer technician and I've only been exposed to Windows and nothing else, so I want to shift to Linux because somebody told me that Linux is a great and wonderful OS. How can I know? Please help me with how to start.
thank you very much hoping for your reply
January 23rd, 2007 at 10:48 pm
There must be an old workstation around your workplace; use that resource (box) and install Fedora Zod on it. That's a nice starting point. If you do so, then welcome to the brave new world of Linux.
February 22nd, 2007 at 5:02 am
hi man,
I also had a lot of problems in the early stages when I migrated from Win to Lin, but you should not give up. Linux takes a lot of effort, but really it is not that hard; it is just held back because some of its usage is difficult and information on it is hard to find.
I can help you.
You can contact me at ghulamzakria@gmail.com
thanks
1st of all learn linux installation
then networking
then disaster recovery
user management
Directory Management
learn Services of linux
and some of common useful commands and websites
www.tldp.
If you can, try a certification called RHCE.