A while back I wrote a post, Analyzing Linux System Performance and Finding Bottlenecks. I didn't really give a good explanation of how to determine whether you are CPU bound, so I am writing this post to clear that up.
As I noted previously, the sysstat package can provide a wealth of information. One of my favorite utilities is sar. By default sar outputs a CPU utilization report. Let's take a look at some example output.
07:55:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
08:05:01 AM       all      3.95      0.00      2.22      0.13      0.00     93.70
08:15:01 AM       all      4.32      0.00      2.37      0.08      0.00     93.23
08:25:01 AM       all      5.70      0.00      2.79      0.06      0.00     91.44
08:35:01 AM       all      6.13      0.00      3.85      0.07      0.00     89.94
08:45:01 AM       all     10.27      0.00      6.07      0.09      0.00     83.56
08:55:01 AM       all     13.50      0.00      8.15      0.21      0.00     78.14
09:05:01 AM       all     14.53      0.00     10.39      0.30      0.00     74.78
09:15:01 AM       all     12.24      0.00      9.02      0.44      0.00     78.30
09:25:01 AM       all     12.66      0.00      8.80      0.50      0.00     78.03
09:35:01 AM       all     12.67      0.00      9.43      0.49      0.00     77.40
09:45:01 AM       all     13.51      0.00      9.62      0.44      0.00     76.43
09:55:01 AM       all     13.67      0.00     10.66      0.59      0.00     75.08
10:05:01 AM       all     14.47      0.00     10.99      0.66      0.00     73.88
10:15:01 AM       all     12.32      0.00      9.15      0.44      0.00     78.09
10:25:01 AM       all     25.71      0.00     14.17      0.51      0.00     59.61
10:35:01 AM       all     13.65      0.00     10.04      1.30      0.00     75.01
10:45:01 AM       all     12.36      0.00      8.86      0.65      0.00     78.12
10:55:03 AM       all     25.34      0.00     19.41      0.56      0.00     54.69
11:05:01 AM       all     24.10      0.00     19.04      0.67      0.00     56.19
11:15:01 AM       all     16.63      0.00     12.51      0.60      0.00     70.26
11:25:01 AM       all     25.83      0.00     22.10      1.73      0.00     50.34
11:35:01 AM       all     22.80      0.00     16.90      1.06      0.00     59.24
11:45:01 AM       all     31.48      0.00     21.74      1.08      0.00     45.69
11:55:01 AM       all     18.10      0.00     13.53      0.82      0.00     67.55
12:05:02 PM       all     19.07      0.00     14.74      0.94      0.00     65.26
12:15:01 PM       all     20.48      0.00     16.32      1.00      0.00     62.19
12:25:01 PM       all     23.83      0.00     20.03      0.80      0.00     55.33
12:35:01 PM       all     22.97      0.00     18.57      1.43      0.00     57.03
12:45:02 PM       all     25.65      0.00     20.55      0.67      0.00     53.12
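For reference, a report like that comes straight out of the daily sysstat data files. On my box the commands look roughly like the following; the sa file path and day number are just examples and vary by distro (/var/log/sa/ on Red Hat-style systems, /var/log/sysstat/ on Debian-style ones):

sar -u                      # today's CPU report from the running data collector
sar -u -f /var/log/sa/sa15  # CPU report for a specific day's file (example path)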
The quickest way to see if you need more horsepower is by checking out the idle column. You can see in my example output that as the work day starts the server starts to get a bit stressed, and as the day rolls on toward midday I've only got about 50% idle.
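If you want a single number instead of eyeballing the column, a quick awk pass over the report does the trick. This is just a sketch: the field positions assume the 12-hour AM/PM layout shown above, and the sa file path is again only an example.

sar -u -f /var/log/sa/sa15 | awk '$3 == "all" { idle += $9; n++ } END { if (n) printf "average %%idle: %.2f\n", idle / n }'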
You might think 50% idle is a bit low. Really, though, most servers are underutilized. I'm sure you've heard the oft-quoted claim that over 80% of production servers run at less than 20% capacity. That is a good reason to look into server consolidation with a virtualization platform, perhaps something like Xen.
Since this is a virtual machine and I have spare cores that could be assigned, let's keep investigating whether I have a CPU bottleneck.
We have already established that an idle time of 50% isn't too bad. An idle time that is habitually less than 20% indicates that the system will not be able to handle a much higher load. If your idle time is regularly less than 5%, you might need to add some processing power.
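If you would rather watch it live than dig through the history, sar can also sample on an interval. Something like this (the interval and count are arbitrary) prints a fresh CPU line every five seconds for a minute:

sar -u 5 12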
A low idle time in and of itself does not mean you need shiny new CPUs (or, in this case, additional virtual CPUs). In addition to idle you should look at several other metrics to get a clear picture. First off, if your iowait is high you can stop looking for a CPU bottleneck until you have tracked down your I/O bottleneck (faster hard drives, more drives in your RAID, or moving the disk-heavy service to a faster box). An iowait consistently higher than about 15% likely indicates an I/O bottleneck. (I have seen some badly neglected servers with iowait in the high 60s. Boy were they pokey, especially the IMAP server that was in that state.)
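If iowait does look high, the same sysstat package can usually point you at the busy device. A couple of commands I tend to reach for, sketched from memory (adjust the interval, count, and sa file path to taste):

iostat -x 5 3               # extended per-device stats; watch the utilization column
sar -d -f /var/log/sa/sa15  # historical per-device activity from the same daily file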
Moving on past the quick iowait check. Even if your idle time is really low, you can check whether the run queue is actually backing up with sar -q.
07:55:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
08:05:01 AM         1        52      0.08      0.04      0.00
08:15:01 AM         1        53      0.06      0.02      0.00
08:25:01 AM         1        64      0.00      0.00      0.00
08:35:01 AM         1        55      0.09      0.05      0.01
08:45:01 AM         1        52      0.06      0.07      0.01
08:55:01 AM         1        52      0.00      0.00      0.00
09:05:01 AM         1        54      0.01      0.05      0.01
09:15:01 AM         0        53      0.01      0.06      0.01
09:25:01 AM         1        55      0.01      0.03      0.00
09:35:01 AM         1        54      0.04      0.05      0.00
09:45:01 AM         1        53      0.00      0.00      0.00
09:55:01 AM         1        80      0.16      0.08      0.01
10:05:01 AM         1        61      0.07      0.06      0.00
10:15:01 AM         3        62      0.06      0.10      0.05
10:25:01 AM         1        55      0.06      0.13      0.10
10:35:01 AM         1        59      0.17      0.12      0.09
10:45:01 AM         1        55      0.00      0.01      0.04
10:55:01 AM         1        74      0.23      0.17      0.09
11:05:01 AM         1        61      0.08      0.07      0.07
11:15:01 AM         1        56      0.16      0.09      0.07
11:25:01 AM         0        53      0.12      0.12      0.08
11:35:01 AM         1        53      0.01      0.04      0.06
11:45:01 AM         1        59      0.01      0.06      0.07
11:55:01 AM         1        60      0.11      0.08      0.07
12:05:01 PM         1        79      0.30      0.16      0.11
12:15:01 PM         1        63      0.03      0.07      0.08
12:25:01 PM         1        71      0.10      0.06      0.07
12:35:01 PM         1        58      0.01      0.03      0.03
12:45:01 PM         1        66      0.07      0.08      0.03
Here is the run queue for the same time period. We are looking for runq-sz (the number of tasks waiting for run time) consistently greater than 2. In my case I look OK. I'm starting to see an increase in demand for CPU time, but it's being handled well. But since I have some extra cores, and since I know that this server has not yet seen the busy part of the year, I will probably attach an extra vCPU so I have less to worry about in the coming months.
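A quick and dirty way to put a number on that is to compare the waiting tasks against the cores available to run them. This is a rough sketch; the awk field positions assume the AM/PM layout above, and the sa file path is again just an example.

nproc   # how many cores the run queue is spread across
sar -q -f /var/log/sa/sa15 | awk '($2 == "AM" || $2 == "PM") && $3 + 0 > 2 { busy++ } END { print busy + 0, "intervals with runq-sz > 2" }'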
Whew, it took a bit to spit all that out; hope you enjoyed it!
Note:
A friend of mine (thanks, Scott) commented on something that I neglected to point out: you might want to do a bit of application profiling. Something like a database that has not had OPTIMIZE TABLE or VACUUM run on it in a long time may present as high iowait. Moving the service might very well alleviate your problem, because in the act of moving it you may defragment the database as you put it on the shiny new hardware. So when looking for performance bottlenecks, remember that metrics can't really tell you whether some service or code you have written is just inefficient, whether from poor code or from lack of maintenance. Just be careful not to sell the farm for a new server when the old one still works just fine.
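For what it's worth, that kind of database housekeeping doesn't require a migration to new hardware. A couple of hedged examples of the maintenance Scott is talking about; run them in a quiet window since they can lock tables while they work:

mysqlcheck --optimize --all-databases -u root -p   # MySQL: defragment/rebuild every table
vacuumdb --all --analyze                           # PostgreSQL: reclaim dead space and refresh planner stats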