Failing Gracefully

Let’s say you have a high-performance system, already operating beyond its design capacity, and the workload is increasing. What would you like it to do?

  • Keep running, but slow down. This is the default behaviour for most applications, as in, what you get if you don’t actually plan for this scenario†. If the load increase is temporary, and the responsiveness of the system isn’t critical, this may be acceptable. But there is still a limit: if a system goes into a death spiral, it’s not useful to anyone; it may be impossible to get any useful diagnostics out of it; and the contagion may spread to any connected system. Not good!
  • Shed load, either by refusing new work, or by killing or suspending existing work. But this in turn raises a host of questions. Is there some work that cannot be refused? Can the tasks even be suspended (or rather, can they be resumed again), or must they be killed (and can work done so far be undone)? In what order should they be killed or suspended? Are there any dependency relationships between jobs that need to be considered? What happens to these jobs: are they abandoned, resumed, or restarted? Can the rules governing all of this be modified in-flight? (A minimal shedding sketch follows this list.)
  • Give up, as in, just crash. This might be the safest option: assuming the system was written to be fail-safe, it is at least now in a known state, rather than responding erratically or non-deterministically. This raises similar questions, but for the app as a whole. Is there another program that will restart it (and if so, under what conditions)? Is there any cleaning up that needs to be done? What happens to work that was in progress at the time? And what about resources locked on external systems: what frees them, or notifies those systems, if they need to be notified? (A supervisor sketch follows the shedding one below.)
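To make the shedding questions concrete, here is a minimal sketch in Python of the simplest possible admission policy: accept work while a bounded queue has room, and refuse it outright when it doesn’t. The queue size and single worker are illustrative, not recommendations.

```python
import queue
import threading

MAX_PENDING = 100  # illustrative capacity, not a tuned value

work_queue = queue.Queue(maxsize=MAX_PENDING)

def submit(job):
    """Admit a job if there is room; shed it (refuse) if there isn't."""
    try:
        work_queue.put_nowait(job)
        return True
    except queue.Full:
        # The shedding point: say "no" immediately rather than queueing
        # work we have no realistic prospect of finishing promptly.
        return False

def worker():
    while True:
        job = work_queue.get()
        try:
            job()  # do the actual work
        finally:
            work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()
```

All the interesting decisions live outside this snippet: what the caller does with a refusal, and whether some classes of job get a queue of their own so they can never be refused.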
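And if crashing is the policy, something has to answer the “who restarts it?” question. In practice that is usually init, systemd, or a platform supervisor, but the shape of the thing is simple enough to sketch; the worker command and back-off delay here are placeholders.

```python
import subprocess
import time

CMD = ["python", "worker.py"]  # hypothetical worker to supervise

while True:
    proc = subprocess.run(CMD)  # blocks until the worker exits
    print(f"worker exited with status {proc.returncode}; restarting")
    time.sleep(5)  # back off, so a crash loop doesn't spin the CPU
```

A real supervisor also has to decide when to give up: restarting forever turns one crashed app into a machine-wide busy loop.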

Ultimately, this is a question about the consequences of the system failing under load, which is likely to happen at the worst possible moment: when it is doing the most work because you are doing the most business! What is the least-worst option? Obviously this varies from app to app. Maybe there is no least-worst option, so you spend the money to cope with any conceivable load. That’s assuming you can: that the hardware even exists, that your code can be written to exploit it, that your algorithms themselves remain stable. Some apps are written to push hardware to its limits under a normal workload, all or nothing.

But we have to be aware of what the options are, how close every system is to experiencing this scenario, and how to make systems that are aware of their own responsiveness, or at least able to report it to another system which is in turn empowered to act in some way. You may, for example, choose to provoke a crash and restart in the event of a memory leak, rather than allowing responsiveness to degrade, and so achieve a higher overall throughput‡. Or take a hybrid strategy of tolerating some slowdown and shedding load. Not choosing is choosing too!
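That crash-on-leak policy is easy to sketch: a watchdog thread checks the process’s own resident set size and exits deliberately once it crosses a threshold, leaving the restart to a supervisor like the one above. The limit and interval below are illustrative; note that ru_maxrss is the peak RSS (kilobytes on Linux), which only ever grows, but that is fine here since a leak grows it monotonically anyway.

```python
import os
import resource  # Unix-only
import sys
import threading
import time

RSS_LIMIT_MB = 512      # illustrative threshold
CHECK_INTERVAL_S = 30   # illustrative check interval

def memory_watchdog():
    while True:
        rss_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        if rss_mb > RSS_LIMIT_MB:
            print(f"RSS {rss_mb:.0f} MB over limit; exiting for restart",
                  file=sys.stderr)
            os._exit(1)  # die abruptly; the supervisor brings us back
        time.sleep(CHECK_INTERVAL_S)

threading.Thread(target=memory_watchdog, daemon=True).start()
```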

† So, most applications then!
‡ According to an old Zed Shaw article, some production Ruby apps required 400 restarts/day (!)
