Avoiding Single Points of Failure

A Content Store's NFS server is a potential single point of failure: if it goes down and you have taken no precautions, your web site goes down too. The only way to solve this problem is to duplicate the component: install the same software on two hosts, but run it on only one of them and keep the other ready as a backup. A heartbeat daemon (see http://haproxy.1wt.eu) monitors the availability of the service and, if the service goes down, starts it on the backup host.
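The monitoring logic described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular heartbeat daemon: the names `service_is_up`, `HeartbeatMonitor`, `start_backup` and the `max_failures` threshold are assumptions made for illustration. The sketch probes the service with a TCP connection and triggers the backup only after several consecutive failed probes, so that one lost probe does not cause an unnecessary failover.

```python
import socket
from typing import Callable


def service_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HeartbeatMonitor:
    """Hypothetical monitor: starts the backup after repeated failed probes."""

    def __init__(
        self,
        probe: Callable[[], bool],
        start_backup: Callable[[], None],
        max_failures: int = 3,  # assumed threshold, tune for your site
    ) -> None:
        self.probe = probe
        self.start_backup = start_backup
        self.max_failures = max_failures
        self.failures = 0
        self.failed_over = False

    def tick(self) -> bool:
        """Run one probe; return True once the backup has been started."""
        if self.failed_over:
            return True
        if self.probe():
            # A successful probe resets the consecutive-failure counter.
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.start_backup()
                self.failed_over = True
        return self.failed_over
```

In a real deployment, `tick()` would be called at a fixed interval with a probe such as `lambda: service_is_up("nfs-primary.example.com", 2049)` (a hypothetical host name), and `start_backup` would start the service on the backup host.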

This heartbeat/failover solution should also include a virtual IP address for the host running the critical service. All users of the service access it via the virtual IP address. If the service's primary host goes down and the heartbeat daemon starts the service on a backup host, the virtual IP address is moved from the primary host to the backup host, so no configuration changes are needed in any of the components using the service. Any components using the service at the time of failure lose their current transactions and connections, but subsequent requests and transactions are served by the backup host.
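From a client's point of view, this means that an operation interrupted by a failover must be started again, against the same virtual IP address. The following Python sketch shows one way a client could do that. The address `192.0.2.10`, the function `call_with_retry` and its `attempts` and `delay` parameters are all hypothetical, and the approach assumes the operation is safe to repeat from the beginning.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical virtual IP address (taken from the documentation range).
VIRTUAL_IP = "192.0.2.10"


def call_with_retry(
    operation: Callable[[str], T],
    address: str = VIRTUAL_IP,
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(address), retrying while the failover is in progress.

    The address never changes: after a failover the same virtual IP
    address is answered by the backup host.
    """
    last_error: Optional[OSError] = None
    for attempt in range(attempts):
        try:
            return operation(address)
        except OSError as exc:  # includes refused and reset connections
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)  # give the backup host time to take over
    raise ConnectionError(
        f"service at {address} unavailable after {attempts} attempts"
    ) from last_error
```

The retry restarts the whole operation rather than resuming it, which matches the behavior described above: work in progress at the time of failure is lost, and only new requests succeed.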