Failover and Fault Tolerance

Developers of high-performance computing systems now have access to more processors than ever before. A major challenge at that scale is coping with process failures: the more processors a job keeps busy, the more often at least one of them fails, and on heavily loaded machines a failure can occur as often as every few minutes. Failover is the ability to switch over to a standby server, system, or network when the one in use terminates without warning. Fault tolerance is the ability of a system to recover immediately and continue operating when a failure occurs. Companies that run large systems rely heavily on both, because an interruption in operations can quickly turn into a large financial loss. If a Toyota plant were to shut down without warning, for example, the company would lose a great deal of money, because in as little as ten minutes the plant produces cars worth hundreds of thousands of dollars. Failover would switch the plant's machines over to another system so that they keep running. Fault tolerance would let the original system recover automatically, without switching to another server.
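
The failover behavior described above can be illustrated with a minimal Python sketch. It assumes each server is just a callable that raises `ConnectionError` when it is down; the names `call_with_failover`, `primary`, and `standby` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Sequence


class AllServersFailed(Exception):
    """Raised when neither the primary nor any standby could serve the request."""


def call_with_failover(servers: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each server in order; fail over to the next one when a server fails."""
    errors = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:  # the server went away without warning
            errors.append(exc)
    raise AllServersFailed(errors)


def primary(request: str) -> str:
    # Hypothetical primary server that has just crashed.
    raise ConnectionError("primary is down")


def standby(request: str) -> str:
    # Hypothetical standby server that is still healthy.
    return f"standby handled {request!r}"


# The caller never sees the primary's failure; the standby answers instead.
result = call_with_failover([primary, standby], "build order 17")
```

A real deployment would also need health checks and a way to keep the standby's data in sync with the primary, which this sketch leaves out.
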
One tool for programming such systems is MPI (the Message Passing Interface), which is a message-passing standard for distributed-memory programs rather than a form of shared memory. MPI provides two options for handling failures. The first, which is the default (MPI_ERRORS_ARE_FATAL), aborts the application immediately. The second (MPI_ERRORS_RETURN) hands control back to the user without guaranteeing that any further communication can occur. On computers with thousands of processors, the only fault-tolerance technique currently available is to restart the application, which has both performance and conceptual limits. Process fault tolerance built on MPI has been investigated in numerous application scenarios to show that it is feasible for high-performance computing. MPI is a direct and effective way to program the nodes of such a system. Because these systems contain so many individual hardware components, hardware faults are more likely to occur during long-running experiments. Users want their programs to adapt to hardware faults and continue working. That ideal is clearly unattainable in general (nothing survives if every node fails), yet users can still achieve a high degree of fault tolerance. Here, fault tolerance refers to an implementation, and the programs linked against it, that can survive a hardware failure such as a network failure or a computer crash. Recovery can take three forms. In the first, the MPI implementation automatically recovers from some set of faults. In the second, the program is notified of the problem and takes corrective action itself. In both of these cases, the program arranges for the computation to proceed on the processes that have not failed. In the third, the program is aborted and restarted from state that was saved outside the processes themselves, typically on disk.
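
The last recovery option above, aborting and restarting from state saved on disk, can be sketched in plain Python. This is not MPI code; it only illustrates the restart idea for a single process. The function names, the JSON file format, and the `crash_at` parameter (used to simulate a failure) are all assumptions made for this example.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to disk atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def run(path: str, n: int, crash_at: Optional[int] = None) -> int:
    """Sum 0..n-1, checkpointing after every step; optionally simulate a crash."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i
        state["next_i"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


checkpoint_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    run(checkpoint_path, 10, crash_at=6)   # first attempt dies partway through
except RuntimeError:
    pass                                   # the state survives on disk
resumed_total = run(checkpoint_path, 10)   # restart picks up at step 6
```

Checkpointing after every step keeps the example short; a real application would checkpoint less often, trading lost work after a failure against the cost of writing to disk.
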
A good fault-tolerant system uses checkpoint recovery: intermediate system states are recorded so that, after a failure, the system can resume from one of them. In a mobile environment, for example, one major problem is the dependency among communicating processes. Mobile support stations, which have large storage and high computing power, can take care of recovery quite easily. A complementary technique is message logging, which keeps a record of the messages a process has received so that their contents are still available after the process resumes from a checkpointed state. Logging each message to storage as it arrives is a simple way to ensure that the data is still there if the system fails. With enough storage to retain this information, failover and fault tolerance together help a company keep its machines running without losing time or money.
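
As a rough illustration of how checkpointing and message logging work together, the Python sketch below restores a process from its last checkpoint and then replays the messages logged since. The `LoggedProcess` class and its "balance" state are hypothetical, and the checkpoint and log are kept in memory here as a stand-in for the stable storage a real system would use.

```python
import copy


class LoggedProcess:
    """A process whose state is a running balance updated by incoming messages."""

    def __init__(self) -> None:
        self.state = {"balance": 0}
        self.checkpoint = copy.deepcopy(self.state)  # stand-in for stable storage
        self.log = []  # messages received since the last checkpoint

    def receive(self, amount: int) -> None:
        self.log.append(amount)        # log first, so the message outlives a crash
        self.state["balance"] += amount

    def take_checkpoint(self) -> None:
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()               # earlier messages are covered by the checkpoint

    def crash(self) -> None:
        self.state = None              # volatile state is lost

    def recover(self) -> None:
        self.state = copy.deepcopy(self.checkpoint)
        for amount in self.log:        # replay messages that arrived after the checkpoint
            self.state["balance"] += amount


p = LoggedProcess()
p.receive(10)
p.take_checkpoint()
p.receive(5)
p.receive(-3)
p.crash()
p.recover()
balance_after_recovery = p.state["balance"]
```

Without the log, recovery would return to the checkpointed balance and silently drop the two messages that arrived afterward.
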