Aug 22, 2006

It’s 2:00am. Do You Know Where Your Unattended Systems Are?

Unattended software systems—almost any business with some kind of IT operations has at least one, if not several.

Maybe you have a program that runs every Tuesday at 1:30 am to extract data and upload it to a vendor’s remote FTP server. Or maybe you have a database stored procedure that runs every two hours to keep the database cache for the web server from building up too much data. Or maybe you have a tape backup program that runs every night at 4:00 am.

The concept of “unattended” refers to the fact that a human operator is not sitting in front of the program, clicking a button to make it run, and watching it run until it completes. Instead, the system is designed and configured to run on its own, without the need for a human operator.

In some cases, like the backup program that came with your tape drive, these unattended systems are generic enough that a packaged application is all you need (the double-edged sword, of course, being that you’re limited to the features offered by the program). But in many cases, if not most, the unattended software systems running quietly in the background of your enterprise are custom-developed.

This is by necessity—the logic that the program must follow, and the data involved, are unique to your business needs. In terms of their technical design, these unattended programs may utilize generic tools and components, such as a scheduling program to launch the software or a DLL for communicating with FTP file servers, but the “glue” between these generic tools—the logic, the real essence of what the program does for your business—was most likely created especially for you by someone in your employ or by a firm that you hired.

What’s more, there’s a good chance that some or all of your unattended systems could be labeled “mission critical.” If that file doesn’t get uploaded to your vendor at 1:30 am next Tuesday, you’ll miss the deadline for the next catalog and lose a whole quarter’s worth of sales. If the backup program fails and the next day something bad happens, you could lose data. If the database cache fills up, the web site will slow down, and customers will go elsewhere.

Questions like the ones below are important, and may have a direct effect on whether you can sleep at night without worrying about your unattended systems:

Is the system running now?
How long has it been running?
How long until it’s completed?
Was the operation successful the last time the program ran?
Do I know exactly how much work it did, if any?
Did the last run produce any errors or warnings?
Just because the system didn’t produce any errors or warnings, can I still be confident that it did everything it was supposed to do?
Are the system’s assumptions and dependencies in place so that we can have some confidence that the next run will be successful? For example, is the hard drive getting full? Has the network administrator changed any related security settings?

I have identified four essential qualities that a system must have to ensure robust and reliable unattended operation: independence, defensiveness, robustness, and transparency. As you might imagine, these qualities do not happen magically, and they do not come for free. They must be designed and built into the system from the start. In the remainder of this article, we will examine each of these qualities. Along the way you can determine whether your unattended systems fit this profile, and perhaps decide to take steps to ensure that future systems do as well.

Independence refers to the ability of the system to run on its own. In one sense, the concept of independence is painfully obvious: by definition an “unattended” program must be able to “run on its own” if there is to be no human operator to launch it. As such, the concept of independence starts with obvious things like using a scheduler to launch the program, and ensuring that the program starts doing its work automatically, without the need for a human to press any buttons or confirm anything.

Moving a little bit beyond the obvious, another key to achieving independence is to minimize the dependence on other systems, especially systems that are beyond your control and systems that lack good monitoring capabilities. The fewer external assumptions and dependencies in your program, the fewer chances there are for things to go awry. Because external dependencies usually cannot be avoided (for instance, you can hardly upload a file to your vendor’s remote FTP server without your program’s interacting with that server), the other qualities we’re discussing are your best tools for protecting yourself from failures related to these external dependencies.

Defensiveness for an unattended program is much like the concept of “defensive driving.” We were taught in driving school to be alert to the other drivers around us, aware of their movements and trajectories, keeping in mind the fact that they are all unknown quantities to us. Any one of the drivers around us could make a mistake, lose consciousness, or do something ill-advised, and may not even be aware that we are driving right next to or behind them. They may be talking on the phone, or trying to calm down a crying baby in the back seat. And while we are being defensive, we are of course still actively focused on our goal of moving forward, getting to where we are going.

An unattended program must also adopt this defensive-yet-assertive posture. This means validating assumptions (and communicating clearly when assumptions fail—see transparency, below). It means building an explicit awareness of external dependencies and their potential failures into the logic of the program. Usually the “hot spots” are easy to identify: databases, file systems, servers, other network resources. Less obvious, though, is an alert posture at the level of data; for example, the content of a data field that was filled in by a human is less predictable than content filled in by a software process.
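As a rough illustration of an alert posture at the level of data, the sketch below checks one human-entered record before the program acts on it. The field names (sku, price, ship_date) and the date format are hypothetical, chosen only for the example; a real system would validate whatever its own data requires.

```python
import math
from datetime import datetime


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record; empty means it passed."""
    problems = []

    # Hypothetical field: a product code typed in by a person.
    sku = str(record.get("sku", "")).strip()
    if not sku:
        problems.append("sku is missing or blank")

    # Hypothetical field: a price that must be a positive, finite number.
    try:
        price = float(record.get("price", ""))
        if not math.isfinite(price) or price <= 0:
            problems.append(f"price must be a positive number, got {price}")
    except (TypeError, ValueError):
        problems.append(f"price is not a number: {record.get('price')!r}")

    # Hypothetical field: a date expected in YYYY-MM-DD form.
    try:
        datetime.strptime(str(record.get("ship_date", "")), "%Y-%m-%d")
    except ValueError:
        problems.append(f"ship_date is not YYYY-MM-DD: {record.get('ship_date')!r}")

    return problems
```

Returning the full list of problems, rather than stopping at the first one, lets the program report every failed assumption in a single message.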

Robustness refers to the ability of a system to a) prevent a failure in the first place, b) fail gracefully when failure cannot be avoided, and c) recover well the next time the program starts. Defensiveness is a key aspect of preventing failure. For example, if we know that the hard drive might fill up, we can design the program to check how much space is available before copying a file. As this example suggests, failure prevention usually applies to one narrow, specific area of the program.
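The disk-space check might look something like the following minimal sketch. The function name and the margin_bytes parameter are assumptions made for illustration; the standard-library calls (shutil.disk_usage and shutil.copy2) are real.

```python
import shutil
from pathlib import Path


def copy_if_space_available(source: Path, dest_dir: Path, margin_bytes: int = 0) -> Path:
    """Copy source into dest_dir only if the target volume has room for it."""
    # Size of the file plus an optional safety margin (an assumed parameter).
    needed = source.stat().st_size + margin_bytes
    free = shutil.disk_usage(dest_dir).free
    if free < needed:
        # Fail before touching the destination, with a message that says why.
        raise OSError(
            f"not enough space in {dest_dir}: need {needed} bytes, {free} free"
        )
    return Path(shutil.copy2(source, dest_dir))
```

The check does not guarantee the copy will succeed, since another process could fill the drive in between, so the program still needs to handle a failed copy.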

However, failing gracefully and recovering well are more global in scope, meaning that the overall structure of the logic of the program must be crafted to ensure that the logic flows along predictable and deliberate paths. The designer of the program must take pains to ensure that the program “cleans up” after a failure, that data is not left in an inconsistent state, that incomplete work is “rolled back” to a clean starting point. Similarly, before the program performs a piece of work, it should check to ensure that a previous run of the program did not leave behind a mess. If possible, the program should pick up where a previous run left off.
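One possible way to pick up where a previous run left off is to keep a small checkpoint file listing the items already completed. The sketch below assumes the work can be split into items identified by strings, and the function names are hypothetical. The checkpoint is written to a temporary file and then swapped into place, so a crash in the middle of saving should not leave a half-written checkpoint behind.

```python
import json
import os
from pathlib import Path


def load_done(checkpoint: Path) -> set:
    """Return the set of item ids completed by a previous run, if any."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def save_done(checkpoint: Path, done: set) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    tmp = checkpoint.with_name(checkpoint.name + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    os.replace(tmp, checkpoint)


def run_batch(items, process, checkpoint: Path) -> list:
    """Process each item once, skipping items a previous run finished."""
    done = load_done(checkpoint)
    processed = []
    for item in items:
        if item in done:
            continue
        process(item)  # if this raises, the checkpoint is left unchanged
        done.add(item)
        save_done(checkpoint, done)
        processed.append(item)
    return processed
```

If a run fails partway through, the next run skips the items already recorded and starts with the one that failed. This only works cleanly when processing a single item is itself all-or-nothing, which is an assumption of the sketch.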

Transparency means that humans (or even other programs) can tell what’s going on with the program. At the opposite end of the spectrum, a totally opaque program would not display any output on the screen, print out any reports, write anything to a log, report any errors or problems, send any emails or text messages to administrators, or report on how long it’s been running or how far along it is in the process. All of these are qualities and features that a designer must go to particular effort to build into a program. Without features to enable transparency, an unattended system is both hard to monitor and hard to support.

To monitor the status and health of an unattended system, we need to know when the program started, when it finished, what the status was when it finished, and (especially for a long-running process) at which step, or how far along, in the process it is at any given point. To support the system, technical administrators need access to logs, error messages, and contextual information so that failures, problems, and bugs can be detected and/or fixed after the fact. And if you don’t have (or would prefer not to pay for) technical resources to monitor and support the system, you will have to communicate clearly to the designer of your unattended system that you, as a less technical person, want the system to clearly communicate its status, history, and results.
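A simple way to expose this information is a status file that the program updates as it runs. The sketch below is one possible shape, not a prescribed design: the file format (JSON), the field names, and the function names are all assumptions for illustration.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def write_status(path: Path, **fields) -> None:
    """Merge fields into the status file and replace it in one step."""
    status = json.loads(path.read_text()) if path.exists() else {}
    status.update(fields, updated=_now())
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(status, indent=2))
    os.replace(tmp, path)


def run_with_status(steps, status_path: Path) -> None:
    """Run each (name, callable) step, recording progress and outcome."""
    write_status(status_path, state="running", started=_now(), finished=None,
                 step=None, steps_done=0, steps_total=len(steps), error=None)
    try:
        for i, (name, action) in enumerate(steps):
            write_status(status_path, step=name)  # which step is in progress
            action()
            write_status(status_path, steps_done=i + 1)
    except Exception as exc:
        # Record what failed and where, then let the failure propagate.
        write_status(status_path, state="failed", finished=_now(), error=repr(exc))
        raise
    write_status(status_path, state="succeeded", finished=_now())
```

A person or a monitoring program can read the file at any time to see when the run started, which step it is on, how many steps are done, and how the last run ended.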

Notice how defensiveness, robustness, and transparency work together to achieve independence.

If a program is not defensive, it will fail more often because of the unpredictability of external dependencies and data.

If it is not transparent, then when it does fail, the human operators (if they know about the failure at all) may have a hard time figuring out what went wrong, because the system did not communicate clearly about which of its assumptions failed.

And if it is not robust, it may leave behind corrupted data or half-done work, and may not recover well the next time the program starts.

The independence of your unattended systems, achieved through the combination of defensiveness, robustness, and transparency, will enable you to sleep well.
