Tuesday, 7 August 2012

Tracking Linux server load and processes over time

Tracking down the cause of high load on a server can be a challenge, especially when the problem occurs at times when you can't keep an eye on it, and it's a challenge I've been facing recently. The server remained responsive, but I discovered that at certain times (often in the evenings) the load average became high enough to make Exim briefly pause processing of the mail queue.

After hunting around for an easy way to track the server's state, I came across a simple script from Craig Edmonds that did the job. It gathers a variety of status information, most importantly the process list from top, and emails it to you. By scheduling the script to run every minute you get a snapshot of the server's state at regular intervals. Since the subject line includes the load average, you can easily look through the messages, spot the times with a high load, and see what the server was doing.

In my case I adjusted the script slightly to include $todaydate in the subject line, since the Exim issue described above meant I couldn't always rely on the messages arriving in the correct order.
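To illustrate the idea of a subject line that carries both the load average and a timestamp, here is a minimal shell sketch. It is not the original script: the function names build_subject and current_load, the subject layout, and the host name in the example are all assumptions made for illustration. Putting a year-month-day timestamp first means the messages can be sorted by subject even if they arrive out of order.

```shell
#!/bin/sh
# Hypothetical sketch, not the original script: build a mail subject that
# carries a sortable timestamp and the load average.

# build_subject TIMESTAMP LOAD HOST
# Prints a subject such as "[2012-08-07 21:15:00] load 4.27 on web1".
build_subject() {
    printf '[%s] load %s on %s\n' "$1" "$2" "$3"
}

# current_load: print the 1-minute load average.
# Assumes a Linux system where /proc/loadavg is available.
current_load() {
    cut -d' ' -f1 /proc/loadavg
}

# Example call, for instance from a script that cron runs every minute:
# build_subject "$(date '+%Y-%m-%d %H:%M:%S')" "$(current_load)" "$(hostname)"
```

With subjects in this form, sorting the mailbox by subject restores the order in which the snapshots were taken, whatever order the mail server delivered them in.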

There was one problem with this solution. The script runs a single iteration of top and inserts its output into the email. However, as the man page for top explains:

       The  top command calculates Cpu(s) by looking at the change in CPU time
       values between samples. When you first run it, it has no previous  sam-
       ple  to  compare  to, so these initial values are the percentages since
       boot. It means you need at least two loops or you have to  ignore  sum-
       mary output from the first loop.  This is problem for example for batch
       mode. There is a possible workaround if you define the CPULOOP=1  envi-
       ronment variable. The top command will be run one extra hidden loop for
       CPU data before standard output.

Because of this, each email I received contained the same CPU figures: the averages since boot rather than the current values. While I could tell the server's load was high, I couldn't see what the processor was doing at that time. I didn't fancy messing around with environment variables, so I opted instead for a solution found in the unix.com forum thread listed in the references, and adjusted the line calling top as follows:

    $process_list = shell_exec('top -b -n2 | awk "/^top/{i++}i==2"');
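To show what the awk filter in that line does, here is a small shell sketch that can be tried without running top at all. Every iteration of top in batch mode starts with a line beginning with "top", so the filter counts those lines and prints only from the second one onwards, discarding the first sample with its since-boot CPU figures. The function names and the sample text are made up for illustration; real top output has many more lines.

```shell
#!/bin/sh
# keep_second_sample: read "top -b -n2" style output on stdin and print only
# the second iteration. The counter i goes up on every line starting with
# "top"; lines are printed only while i equals 2.
keep_second_sample() {
    awk '/^top/ { i++ } i == 2'
}

# sample_output: invented two-iteration input standing in for real top output.
sample_output() {
    cat <<'EOF'
top - 21:15:00 up 10 days,  load average: 4.27, 3.90, 2.10
Cpu(s):  5.0%us,  1.0%sy, 94.0%id
top - 21:15:03 up 10 days,  load average: 4.31, 3.92, 2.11
Cpu(s): 88.0%us,  9.0%sy,  3.0%id
EOF
}

# Try it on the sample:   sample_output | keep_second_sample
# Use it on a live system: top -b -n2 | keep_second_sample
```

Run on the sample, the filter prints only the last two lines, the iteration whose Cpu(s) line reflects the interval between the two samples. The cost is that top has to wait one refresh interval between iterations, so each snapshot takes a few seconds longer to produce.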

So I found a simple way to track what's happening on the server, though of course with an email a minute it's not something I'll be running long term.

I wish I could say that was the end of it, but unfortunately this turned out to be the beginning of my struggle and confusion, due in no small part to the number of confused explanations of how load average works. I'll discuss that in my next post.


References :
http://www.unix.com/gentoo/77494-top-batch-mode-cpu-info-wrong.html
