Sunday, 12 August 2012

When Linux load averages lie

To follow on from my last post, having data showing the load on your server throughout the day is great, but what do you do with it and how do you interpret the information?

Looking through the data I could see clear short periods where the load average would skyrocket, into double figures and into the 20s. Every article and post I'd seen said that if the load divided by the number of cores/threads was more than one, you have a problem. No ifs, no buts, you have a problem. I was seeing 20+ on a dual-CPU server... oh dear!
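As a rough illustration of that rule of thumb, here is a minimal Python sketch that divides each load average by the number of logical CPUs. The helper name load_per_core is my own invention for this example; os.getloadavg() and os.cpu_count() are standard library calls, and os.getloadavg() is only available on Unix-like systems.

```python
import os


def load_per_core(load, cores):
    """Return a load average divided by the number of logical CPUs."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    one, five, fifteen = os.getloadavg()
    # os.cpu_count() can return None, so fall back to a single CPU.
    cores = os.cpu_count() or 1
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        per_core = load_per_core(value, cores)
        print(f"{label}: load {value:.2f}, per core {per_core:.2f}")
```

By this measure, a load of 20 on a machine with two logical CPUs works out at 10 per core, which is why the numbers looked so alarming.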

The problem was I couldn't see an obvious cause. CPU usage wasn't high, the idle time was good, memory was fine with the swap file not being used, the disk queue wasn't long, and the process list didn't show anything to indicate an issue.

Fortunately I came across a few fantastic explanations of load averages that explained where I (and it seems many others) had been going wrong. I've linked to them all below, and I recommend reading them for more info, but the upshot is that it's not as cut and dried as people make out.

To quote from Jon Emmons' blog, the load average "is the average sum of the number of processes waiting in the run-queue plus the number currently executing over 1, 5, and 15 minute time periods".

The load average is far more complex than many people make out, and while it can be a good initial indicator of a problem, it must be examined in conjunction with other factors. The single figure doesn't tell you whether a process is waiting for the CPU or for something else, such as disk or network IO, and it takes no account of the priority of the running and waiting processes in the queue.
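One way to look behind the number is to count processes by state. The sketch below is only an illustration, and the function names parse_state and count_states are hypothetical. It assumes a Linux system with /proc mounted, where the state letter in /proc/<pid>/stat is the first field after the process name in parentheses. On Linux, R means running or runnable and D means uninterruptible sleep (usually waiting on disk), and both are counted towards the load average.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Extract the one-letter state from a /proc/<pid>/stat line."""
    # The process name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST closing parenthesis.
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state letter (R, D, S, ...)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                counts[parse_state(handle.read())] += 1
        except (OSError, IndexError):
            continue  # the process exited or the line was unreadable
    return counts


if __name__ == "__main__":
    states = count_states()
    print(f"runnable (R): {states['R']}, uninterruptible (D): {states['D']}")
```

If the load is high but almost nothing is in state R, the CPU is probably not the bottleneck, and the D count is the next thing to look at.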

If you have a long-running, low-priority process, for instance, it will always make way for more urgent, time-critical requests. In the meantime that process will sit in the queue and cause the load average to increase. Add some more of these low-priority processes, for instance a backup job, and the load average will increase further, which appears to indicate a problem. Higher-priority processes like email and websites will still be handled immediately, however, causing no delay for users, and as such the reported high load isn't really an issue.

So the key is that it's fine to track the load average to indicate a possible problem, but don't rely on it for proof that you have one. Always remember to check the other figures provided to see IF you have a problem, not necessarily WHAT the problem is.
