Configuration and Maintenance
however, we can quell the desire to run for help. A few tests will almost always reveal a problem. Experience allows us to expand our repertoire and recognize clues, but there is no reason why cold logic should not bring us home in every case.

Having eliminated the obvious avenues of error, we are led into murkier waters of fault diagnosis. When a situation is confusing, it is of paramount importance to keep a clear head. Writing down a log of what we try and the effect it has on the problem prevents a forgetful mind from losing its way. Drawing a conceptual map of the problem, as a picture, is also a powerful way of persuading the human mind to do its magic.

One of the most powerful features of the human mind (the thing which makes it, by far, the most powerful pattern recognition agent in existence) is its ability to associate information input with conceptual models from previous experience. Even the most tenuous of connections can lead us to be amused at a likeness. We recognize human faces in clouds and old cars; we recognize a song from just a few notes. The ability to make connections leads us in circles of thought which sooner or later lead to 'inspiration'. As most professionals know, however, inspiration is seldom worth waiting for. A competent person knows how to work through these mental contortions systematically to come up with the same answer. While this might be a less romantic notion than waiting for inspired enlightenment, it is usually more efficient.

7.6.3 Establishing Cause and Effect
If a problem has arisen, then something in the system is different than it was before the error occurred. Our task then is to determine the source of that change, and identify a chain of events which resulted in the unfortunate effect. The hope is that this will tell us whether we can prevent the problem from recurring, and perhaps also whether we can fix it. It is not merely so that we can fill out a report in triplicate that we need to debug errors. Problem diagnosis is one of the hardest problems in any field, be it system administration, medicine or anything else. Once a cause has been found, a cure can be simple, but finding the problem itself often requires experience, a large knowledge base and an active imagination. There is a three stage process:

• Gather evidence from users and from other tests.
• Make an informed guess as to probable cause.
• Try to reproduce (or perhaps just fix) the error.
There is no particular order in which these pieces of the puzzle must be executed. Normally, they will all be repeated until a satisfactory explanation has been uncovered. It is only when we have shown that a particular change can switch the error on or off that we can say with certainty what the cause of the error was. Sometimes it is not possible to directly identify the causal chain which led to an error with certainty. Trying to reproduce a problem on an unimportant host is one way of verifying a theory, but this will not always work. Computers are complex systems which are affected by the behaviour of users, interactions between subsystems, network traffic, and any combination of these things. Any one of these factors can have changed in the meantime. Sometimes it can be a chance event which creates a unique set of conditions for an error to occur.[2]

[2] I tend to classify all such inexplicable occurrences under the heading 'cosmic ray'.
Fault Report and Diagnosis
Usually this is not the case, though; most problems are reproducible with sufficient time and imagination. Trying to establish probable cause in such a web of intrigue as a computer system is enough to task the best detectives. Indeed, we shall return to this point in chapter 11, and consider the nature of the problems in more detail. To employ a tried and tested strategy, in the spirit of Sherlock Holmes, we can gradually eliminate possibilities and therefore isolate the problem, little by little. This requires a certain inspiration for hypothesizing causes which can be found from any number of sources:

• One should pay attention to all the facts available about the problem. If users have reported it, then one should take seriously what they have to say, but always attempt to verify the facts before taking too much on trust.
• Reading documentation can sometimes reveal simple misunderstandings in configuration which would lead to the problem.
• Talking to others who might have seen the problem before can provide a short cut to the truth. They might have done the hard work of diagnosis before. Again, their solutions need to be verified before taking them on trust.
• Reading old bug and problem reports can provide important clues.
• Examining system log files will sometimes provide answers.
• Performing simple tests and experiments, based on a best guess scenario, sharpens the perception of the problem, and can even allow cause to be pinpointed.

If the system is merely running slower than it should, then some part of it is struggling to allocate resources. Is the disk nearly full, or the memory, or even the process table? Entertain the idea that it is choking in garbage. For instance, deleted files take up space on systems like Novell, since the files are stored in such a way that they can be undeleted. One needs to purge the file system every so often to remove these, otherwise the system will spend much longer than it should looking for free blocks. Unix systems thrash when processes build up to unreasonable levels. Garbage collection is a powerful tool in system maintenance. Imagine how human health would suffer if we could never relieve ourselves of dead cells or the by-products of a healthy consumption. All machines need to do this.

Gathering Evidence
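As an illustration of the simple tests mentioned above, the questions about a slow system (is the disk nearly full, or the process table?) can be put to the machine directly. The following is only a minimal sketch in Python, not a method prescribed by the text. The function names gather_readings and find_suspects and the threshold values are hypothetical, the process count assumes a Linux-style /proc directory, and memory is left out because the standard library offers no portable way to read it.

```python
import os
import shutil


def gather_readings(path="/"):
    """Collect a few resource readings; a value is None where it cannot be measured."""
    usage = shutil.disk_usage(path)
    readings = {
        "disk_used_fraction": usage.used / usage.total if usage.total else None,
        "load_average_1min": None,
        "process_count": None,
    }
    try:
        readings["load_average_1min"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # load average is not available on this platform
    if os.path.isdir("/proc"):
        # Assumption: Linux-style process table, one numeric directory per process.
        readings["process_count"] = sum(
            1 for name in os.listdir("/proc") if name.isdigit()
        )
    return readings


def find_suspects(readings, limits):
    """Return the names of readings that meet or exceed their limit, sorted by name."""
    return sorted(
        name
        for name, limit in limits.items()
        if readings.get(name) is not None and readings[name] >= limit
    )


# Illustrative thresholds only; sensible values depend on the host.
EXAMPLE_LIMITS = {
    "disk_used_fraction": 0.90,
    "load_average_1min": 8.0,
    "process_count": 2000,
}

if __name__ == "__main__":
    current = gather_readings("/")
    for name, value in sorted(current.items()):
        print(f"{name}: {value}")
    print("suspects:", find_suspects(current, EXAMPLE_LIMITS) or "none")
```

A reading that exceeds its limit is only a clue to follow up, not a diagnosis; as with any other evidence, it needs to be verified before it is taken on trust.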
From best guess to verification of fault can be a puzzling time in which one grapples with the possible explanations and seeks tests which can confirm or deny their plausibility. One could easily write a whole book exemplifying techniques for troubleshooting, but that would take us beyond the limits set for this book. Let us just provide two examples of real cases which help to illustrate how the process of detection can proceed.

Network services become unavailable. A common scenario is the sudden disappearance of a network service, like, say, the WWW. If a network service fails to respond it can only be due to a few possibilities:

• The service has died on the server host.
