Meet the stats - today: mpstat

In this installment of the “Meet the Stats” series I want to talk about mpstat. In my opinion, mpstat is one of the most useful tools for finding out what your processors are really doing.

Using mpstat

Let’s execute mpstat on a system. I used my fileserver for this task on a Saturday morning. It’s a system with four cores, so mpstat reports four lines per sample.

$ mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    4   0   28   679  154  573    3   12    8    0   767    1   2   0  96
  1    4   0   22   504  145  485    2    9    4    0   661    1   2   0  97
  2    4   0   30   579   81  425    3   12    6    0   519    1   2   0  97
  3    5   0   25   505  250  517    3   12    5    0   758    1   3   0  96
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    9   567  182  372    0    2    0    0   338    0   1   0  99
  1    0   0   21   454  174  468    1    1    2    0   468    1   1   0  98
  2    0   0   12   480   15  304    1    4    1    0   249    2   1   0  97
  3   27   0   15   157   68  147    1    2    0    0   422    0   1   0  99
jmoekamp@hivemind:~$ 

Internals

You need some knowledge about the inner workings of an operating system to really understand the output of this command, but a basic understanding is relatively easy to reach.

A recommended reading

In the following description I sacrificed complete correctness for understandability by simplifying some of the dependencies. To understand the full implications of all the numbers presented by all the *stat commands, you should start to gather some knowledge about the internals of Solaris. There is an excellent book about it: “Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture” by Richard McDougall and Jim Mauro. The ISBN of this great book is 0-13-148209-2.

DTrace examples

The DTrace examples are from the standard cheat sheet I use at customer sites. However, they aren’t mine; I gathered them from Prefetch.net’s DTrace Cookbook.

The numbers and their meaning

sys, usr and idl have obvious meanings: they tell you the percentage of time the processor spends in kernelland, in userland or idling. When you really want to know something about the load on your system, look at these values and forget about the load average. The other values are a little more difficult to explain. Basically you can divide columns 2-11 into four groups: the virtual memory part (columns 2 and 3: minf, mjf), the interrupt part (columns 4-6: xcal, intr, ithr), the scheduling part (columns 7-9: csw, icsw, migr) and the locks part (columns 10 and 11: smtx, srw). Column 12, syscl, counts system calls.
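To make the usr, sys and idl columns a bit more tangible, here is a minimal Python sketch that parses mpstat output and adds usr and sys into a “busy” percentage per CPU. The function names parse_mpstat and busy_percent are made up for this illustration, and the sample text is simply the second block of the output shown above (on Solaris the first block summarizes activity since boot, so the later blocks are usually the interesting ones).

```python
# Sketch only: parse Solaris-style mpstat output and derive a busy percentage.
# parse_mpstat() and busy_percent() are hypothetical helper names.
SAMPLE = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    9   567  182  372    0    2    0    0   338    0   1   0  99
  1    0   0   21   454  174  468    1    1    2    0   468    1   1   0  98
  2    0   0   12   480   15  304    1    4    1    0   249    2   1   0  97
  3   27   0   15   157   68  147    1    2    0    0   422    0   1   0  99
"""


def parse_mpstat(text):
    """Return a list of samples; each sample is a list of {column: int} rows."""
    samples, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # Every header line starts a new sample.
            header = fields
            samples.append([])
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            samples[-1].append(dict(zip(header, map(int, fields))))
        # Anything else (for example a shell prompt) is ignored.
    return samples


def busy_percent(row):
    """Time spent in userland plus kernelland, in percent."""
    return row["usr"] + row["sys"]


if __name__ == "__main__":
    for row in parse_mpstat(SAMPLE)[-1]:
        print(f"CPU {row['CPU']}: busy {busy_percent(row)}%  idle {row['idl']}%")
```

The parser keys every value by its column name instead of its position, so the same code can be used to look at any of the other columns.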

The virtual memory part

To understand the meaning of the two columns regarding the virtual memory subsystem you should have some knowledge of the concept of virtual memory. But I will try to give you some insight into this part without getting overly complex. First, it’s important to know that memory is organized in pages. Those pages are chunks of memory. The possible page sizes are hardware dependent, but let’s assume that we have pages of 8 kilobytes. As you may know, a modern operating system doesn’t allocate real memory when your application requests memory. Instead it allocates virtual memory. When you access a memory page for the first time, a page fault occurs. This page fault leads to the mapping of a physical memory page to the virtual memory page. The mapping is done by adding an entry to the Hash Page Table. And here minor and major page faults differ: a minor fault (minf) can be resolved without disk access, because the page only has to be mapped, while a major fault (mjf) has to read the page’s content from disk first.

Obviously major faults have a bigger impact on system performance than minor faults, as minor faults don’t need to access a rotating rust device, aka a hard disk. However, even minor faults can have a significant impact on performance. But that’s enough material for an article of its own, or an evening with the book mentioned above.
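If you want to see the “first access triggers a page fault” behaviour for yourself, the following Python sketch may help. It is not from the original cheat sheet and it is not Solaris-specific: it assumes a Unix-like system where the standard resource module reports the minor fault counter (ru_minflt) of the process. It maps a region of anonymous memory, writes one byte into every page and reports how many minor faults that caused. The helper names are made up for this example.

```python
import mmap
import resource


def minor_faults():
    """Minor page faults this process has taken so far (ru_minflt)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_fresh_memory(size_bytes):
    """Map anonymous memory, write to every page, return the minor-fault delta."""
    region = mmap.mmap(-1, size_bytes)  # virtual memory only, nothing touched yet
    before = minor_faults()
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1              # first access to a page needs a mapping
    after = minor_faults()
    region.close()
    return after - before


if __name__ == "__main__":
    size = 64 * 1024 * 1024
    print("pages touched:", size // mmap.PAGESIZE)
    print("minor faults :", touch_fresh_memory(size))
```

On many systems the two numbers are close to each other, but the exact count depends on the operating system, the page size and large-page settings, so treat the result as an indication rather than a measurement.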

The interrupt part

Interrupts are important for the operation of the system, but they interrupt (they are called “interrupts” for a reason) the application running on the processor. Thus a high number of interrupts can significantly slow down the application. There are some tricks to reduce this interruption. For example, you can confine the interrupts to a subset of all processors by declaring most of the CPUs as “non-interrupt”.
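Before fencing off processors it helps to look at how the interrupts are distributed right now. The Python sketch below takes the intr column of the second sample shown earlier, computes each CPU’s share, and builds (but does not run) a psradm -i call, which is the Solaris command for setting processors to no-intr. The policy used here, keeping the CPUs that already handle the most interrupts, is only an illustration and not a recommendation, and both function names are hypothetical.

```python
def interrupt_shares(intr_by_cpu):
    """Map CPU id -> share (0..1) of all interrupts seen in the sample."""
    total = sum(intr_by_cpu.values())
    if total == 0:
        return {cpu: 0.0 for cpu in intr_by_cpu}
    return {cpu: count / total for cpu, count in intr_by_cpu.items()}


def no_intr_command(intr_by_cpu, keep):
    """Build (not run) a psradm call that leaves `keep` CPUs interruptible.

    Illustrative policy: the CPUs with the highest intr count stay
    interruptible, all others are listed for no-intr.
    """
    ranked = sorted(intr_by_cpu, key=intr_by_cpu.get, reverse=True)
    fenced = sorted(ranked[keep:])
    if not fenced:
        return []  # nothing to fence off
    return ["psradm", "-i"] + [str(cpu) for cpu in fenced]


if __name__ == "__main__":
    intr = {0: 567, 1: 454, 2: 480, 3: 157}  # intr column, second sample
    for cpu, share in interrupt_shares(intr).items():
        print(f"CPU {cpu}: {share:.0%} of interrupts")
    print(" ".join(no_intr_command(intr, keep=1)))
```

The command is only printed so that you can review it first; changing the interrupt state of processors on a production system deserves a deliberate decision.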

The scheduling part

The locks part

Do you want to learn more?

man pages
docs.sun.com: intrstat
docs.sun.com: mpstat

Misc
Prefetch.net: DTrace Cookbook