Tutorial: Fault Tolerance in Data Communication

 

In data communication applications, one party sends data and another receives and processes it. (Program-to-program communication is not much different from person-to-person communication.) But what happens when the sender sends and the receiver doesn’t get it?  As the quote from Cool Hand Luke says, “What we have here is a failure to communicate.”  When this happens between people, there can be black eyes, bloody noses, and unhappy husbands, all things to be avoided.  The results of a program-to-program communication failure can be just as devastating and costly.  If a monetary transaction is lost, a customer can be more than a little upset.  If her account is charged twice for a single purchase, she may lose her cool. A manufacturing production line that produced 100 items but generated and securely stored information on only 99 may cause significant problems in the distribution and sales departments. A systems analyst does not want to go to management and say, “What we have here is a failure to communicate.”  In life, there is little counsel that says, “Plan for failure.”  In data communication, that is exactly the counsel to follow.

 

 

Data Requirements

 

When planning data communications, the first requirement to address is the level of data integrity the application needs.  Related questions to ask are:

The remainder of this tutorial concerns the situation where data delivery is required.

 

Points of Failure

 

When my wife is talking at me and I am absorbed in tracking down a problem on my computer, I may not hear her.  In this case the failure is within the receiver.  If we call our children to dinner but the TV drowns out our words, the medium prevents the communication.  If I am talking with my wife over the phone and a warning sounds on my computer requiring immediate attention, I may stop in mid-sentence to deal with the emergency.  In this case the sender causes the problem.

In any communication scenario, there are points of failure.  It’s important to identify how to deal with them when they occur. 

When my oldest daughter was three, I remember her sitting upon my lap after I had returned home from work.  I was reading the paper and she was talking with me, though I didn’t really hear her.

“Um hmm,” I said absently.

Grabbing my beard in her hands, she pulled my face to hers and said, “Listen to me, Daddy!” 

She had my attention.  She knew that when she spoke, I heard. She knew it by my response.

How is a program sending information to another program going to know that the receiver is listening—and, if listening, actually processing the data?

 

Guaranteed Delivery

 

When a message must be delivered, it is possible to guarantee delivery.  This is done through a queued front end.  The program that generates data places it in a queue.  A communication program retrieves a message from that queue and sends it to the target process.  The communication program does not dequeue that message until it knows the receiver has successfully processed it.
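
The queued front end can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the names drain_queue and send_and_wait are hypothetical, and send_and_wait stands in for whatever transport sends one message and reports whether the receiver confirmed processing it.

```python
from collections import deque


def drain_queue(queue, send_and_wait):
    """Send queued messages in order, dequeuing each only after confirmation.

    `send_and_wait` is a hypothetical callable: it sends one message and
    returns True only if the receiver confirmed successful processing.
    """
    delivered = []
    while queue:
        message = queue[0]  # peek: the message stays queued while in flight
        if not send_and_wait(message):
            break  # no confirmation: leave it queued for a later attempt
        delivered.append(queue.popleft())  # confirmed: now safe to dequeue
    return delivered


# Example: a stand-in transport that confirms everything except "order-2".
outbound = deque(["order-1", "order-2", "order-3"])
confirmed = drain_queue(outbound, lambda message: message != "order-2")
```

After the example runs, "order-1" has been delivered and removed, while "order-2" and "order-3" are still in the queue, in their original order, waiting for the next attempt.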

This sounds simple enough.  But there is a problem that arises from this solution.  When delivery is guaranteed, duplicate data is possible.  Consider the following scenario.  The communication program sends the message.  Then, before receiving a response, a failure occurs.  The failure can be in the target machine, the sending machine, or the medium.  Wherever it is, no response is received by the sender saying the message was processed.  When communication is restored, the state is as follows:

·        The sender may or may not know that the message was sent.

·        The target process may have received the message or it may not have received it.

·        If it did receive it, it may or may not have successfully processed it.

·        If the message was processed, a response may or may not have been sent.

If the failure occurred outside the sender, the sender will time out waiting for the response.  In any case, the communication program will try to send the same message again.  If the target process had not processed the message, everything is OK.  However, if that process did receive the message and did successfully process it, the retry will deliver a duplicate.
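
The scenario above is easy to reproduce in a small simulation.  The sketch below is not a real network program; Receiver, send_with_retry, and the lose_reply_on_attempts parameter are hypothetical names used only to show how a lost response turns one message into two.

```python
class Receiver:
    """Stand-in target process that records every message it processes."""

    def __init__(self):
        self.processed = []

    def handle(self, message):
        self.processed.append(message)  # the message is processed...
        return "ACK"                    # ...and a response is sent back


def send_with_retry(message, receiver, lose_reply_on_attempts, max_attempts=3):
    """Resend until a response arrives; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        reply = receiver.handle(message)
        if attempt in lose_reply_on_attempts:
            reply = None  # the response is lost; the sender only sees a timeout
        if reply == "ACK":
            return attempt
    raise TimeoutError("no response after %d attempts" % max_attempts)


# The response to the first attempt is lost, so the sender sends again.
receiver = Receiver()
attempts = send_with_retry("charge $25", receiver, lose_reply_on_attempts={1})
```

The sender needed two attempts, and from its point of view the message was delivered once.  The receiver, however, processed "charge $25" twice, which is exactly the customer charged twice for a single purchase.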

 

Ack/Nak

 

To guarantee delivery, queue messages and expect the target process to reply.  Usually, a reply takes the form of an acknowledgement (ACK) or a negative acknowledgement (NAK) accompanied by a reason code identifying why the message was rejected.  The reason code may indicate a problem with the message itself, in which case the message must be saved somewhere and processed later, after a support person has fixed the problem.  Or it may indicate an environmental error (a database problem, for example), in which case the communication program should delay and try again.
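
One way a communication program might act on these replies is sketched below.  The reason codes BAD_MESSAGE and ENVIRONMENT, the deliver_all function, and the parked list are assumptions made for illustration; a real application protocol would define its own codes and its own place to save rejected messages.

```python
import time
from collections import deque

ACK, NAK = "ACK", "NAK"
# Hypothetical reason codes; a real protocol defines its own.
BAD_MESSAGE, ENVIRONMENT = "BAD_MESSAGE", "ENVIRONMENT"


def deliver_all(queue, send, parked, retry_delay=1.0, max_retries=5,
                sleep=time.sleep):
    """Drain `queue`; return True if emptied, False if retries ran out.

    `send` is a hypothetical callable returning (status, reason_code).
    """
    retries = 0
    while queue:
        status, reason = send(queue[0])  # message stays queued until answered
        if status == ACK:
            queue.popleft()  # processed by the receiver: safe to dequeue
            retries = 0
        elif reason == BAD_MESSAGE:
            parked.append(queue.popleft())  # save it for a support person
            retries = 0
        else:
            # Any other NAK is treated here as an environmental error.
            if retries == max_retries:
                return False  # give up for now; the message stays queued
            retries += 1
            sleep(retry_delay)  # delay, then try the same message again
    return True


# Example with scripted replies instead of a real receiver.
replies = iter([(ACK, None), (NAK, ENVIRONMENT), (ACK, None), (NAK, BAD_MESSAGE)])
pending = deque(["m1", "m2", "m3"])
parked, delays = [], []
finished = deliver_all(pending, lambda message: next(replies), parked,
                       sleep=delays.append)
```

In the example, "m2" is retried once after an environmental NAK and then acknowledged, while "m3" is rejected as a bad message and moved to the parked list so that it does not block the messages behind it.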

 

MESockets

 

MESockets lets the designer determine how to implement the Application Protocol.  Its intuitive calls simplify the use of sockets.

·        serverSetup  - sets up a program to receive requests.

·        clientSetup  - sets up a program to send requests to a server.

·        send  - used to send short records of character data.

·        hsend  - used to send large records or binary data.

·        receive  - used to receive a message.

·        closeConnection  - used to close an established connection.