The deliverer's knowledge: Some comments on performance data for capped CPUs with Solaris 10 zones

(An introduction by Jörg: This is the second blog entry written by Heiko. It’s an interesting article about a weirdness you can stumble upon while working with Solaris 10.) Solaris 10 Update 5 aka Solaris 10 5/08 introduced a feature to limit the absolute CPU usage of zones or projects to a configurable percentage. The relevant resource control is zone.cpu-cap. While this is nothing really new, you should be aware of the interactions (and the possible misunderstandings arising out of them) between this capping feature and values that show the utilization of a system, like the LoadAvg.

A simple example

Here is a simple example. As a test system I used a virtual machine with 4 CPUs and 2 running zones.

#>zoneadm list -icv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   9 zone1            running    /zones/zone1                   native   shared
  10 zone2            running    /zones/zone2                   native   shared
   - zone3            installed  /zones/zone3                   native   shared
   - zone4            installed  /zones/zone4                   native   shared

The availability of the 4 CPUs can be shown by the output of psrinfo:

# >psrinfo
0       on-line   since 11/03/2009 08:19:02
1       on-line   since 11/03/2009 08:19:08
2       on-line   since 11/03/2009 08:19:08
3       on-line   since 11/03/2009 08:19:10

Let’s take a short look at the CPU utilization. In essence, the system is just running itself at the moment. A very relaxing situation. :)

# >vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd f0 s0   in   sy   cs us sy id
 0 0 0 978940 596200 116 439 95 0  0  0 79 10  2 -0 -0  701 3321 1456  4 14 82
 0 0 0 812240 438140  8  30  0  0  0  0  0  0  0  0  0  430  173  232  0  1 98
 0 0 0 812240 438140  0   6  0  0  0  0  0 16  0  0  0  715  142  350  0  2 98
 0 0 0 812240 438140  0   6  0  0  0  0  0  0  0  0  0  442  157  229  0  1 99
 0 0 0 812240 438140  0   6  0  0  0  0  0  0  0  0  0  433  200  217  0  1 99
 0 0 0 812240 438140  0   6  0  0  0  0  0  0  0  0  0  437  214  233  0  1 99
 0 0 0 812240 438140  0   6  0  0  0  0  0  0  0  0  0  438  162  243  0  1 99
^C

# >prstat -mLZ 1
 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  2203 root     0.2 3.0 0.0 0.0 0.0 0.0  97 0.0  37   0 525   0 prstat/1
   122 root     0.1 0.1 0.0 0.0 0.0 0.0 100 0.0  36   0 218   0 nscd/3
  2137 noaccess 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0   1   0 java/15
 …
     9 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 svc.configd/1
     7 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 svc.startd/323
     7 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 svc.startd/316
     7 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 svc.startd/66
     7 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 svc.startd/10
ZONEID     NLWP  SWAP   RSS MEMORY      TIME  CPU ZONE
     0      183  140M  212M    21%   0:01:16 0.3% global
     2      123  135M  204M    20%   0:00:52 0.0% zone2
     1      126  138M  208M    20%   0:00:50 0.0% zone1
Total: 102 processes, 432 lwps, load averages: 0.04, 0.39, 0.49

Configuration of the CPU capping

Next, the capping for the local zone called zone1 is activated and set to 10%. Keep in mind: the activation via prctl is not boot-persistent. The capping value is expressed in percent of a single CPU, so the calculation for a 4-CPU system looks like this: 4 CPUs = 400 → 400 = 100% → 40 = 10%.

# >prctl -t privileged -n zone.cpu-cap -s -v 40 -i zone zone1
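
By the way, if you want the cap to survive a reboot, you can put it into the zone configuration instead of (or in addition to) setting it with prctl. A minimal sketch, assuming the standard zonecfg rctl syntax; the setting takes effect at the next boot of the zone:

# >zonecfg -z zone1
zonecfg:zone1> add rctl
zonecfg:zone1:rctl> set name=zone.cpu-cap
zonecfg:zone1:rctl> add value (priv=privileged,limit=40,action=deny)
zonecfg:zone1:rctl> end
zonecfg:zone1> commit
zonecfg:zone1> exit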

Let’s check the configuration:

# >prctl -P -i zone zone1
zone: 1: zone1
zone.max-swap system 18446744073709551615 max deny -
zone.max-locked-memory system 18446744073709551615 max deny -
zone.max-shm-memory system 18446744073709551615 max deny -
zone.max-shm-ids system 16777216 max deny -
zone.max-sem-ids system 16777216 max deny -
zone.max-msg-ids system 16777216 max deny -
zone.max-lwps system 2147483647 max deny -
zone.cpu-cap privileged 40 - deny -
zone.cpu-cap system 4294967295 inf deny -
zone.cpu-shares privileged 1 - none -
zone.cpu-shares system 65535 max none -

BTW, there is also a corresponding kstat module for the CPU capping. The module is called caps, and the name of the relevant statistic set is cpucaps_zone_<zoneid>.

#kstat -m caps -n cpucaps_zone_`zoneadm list -icv | grep zone1 | awk '{print $1}'`
module: caps                            instance: 1
name:   cpucaps_zone_1                  class:    zone_caps
        above_sec                       0
        below_sec                       1335
        crtime                          4801.079480706
        maxusage                        12
        nwait                           0
        snaptime                        6135.748493858
        usage                           1
        value                           40
        zonename                        zone1

Pedal to the metal - or testing the cap

To test the capping, just use your favorite CPU hog. I wrote a small, albeit not very elegant, C program to fulfill this purpose.

# >vi load.c
#include <stdio.h>
#include <pthread.h>
#include <math.h>
#include <unistd.h>

int C,I;

/* Busy loop: keeps its thread permanently hungry for CPU time. */
void *loop1(void *arg)
{
double x=0, m;
int i=0;
while(i == 0)
        {
        m = acos(x);
        x++;
        }
return NULL;
}

int main()
{
pthread_attr_t attr;
pthread_attr_init(&attr);

/* Spawn 2000 threads that all want to run at the same time. */
for (C=0; C<2000; C++)
        {
        pthread_t THREAD;
        pthread_create(&THREAD, NULL, loop1, NULL);
        }

/* Keep the process (and with it all the threads) alive. */
while(I == 0)
        {
        sleep(1);
        }
return 0;
}

Okay … since the C source alone doesn’t help us, we have to compile it first. Compile, copy, finished.

#> gcc -lm -o load.bin load.c
#> cp load.bin /zones/zone1/root

Now we can log in to the zone and put some load on our system.

# >zlogin zone1
# >./load.bin & ./load.bin & ./load.bin &
^D

Now wait a bit, until the cake rises. And after a short moment:

#> prstat -mLZ 1
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
  3662 root     0.8 8.7 0.0 0.0 0.0 0.0  90 0.0  16   2 439   0 prstat/1
  3690 root     2.1 0.0 0.0 0.0 0.0 0.0 0.0  98   0   2   0   0 load.bin/166
  ...
  3691 root     1.0 0.0 0.0 0.0 0.0 0.0 0.0  99   0   1   0   0 load.bin/214
  3690 root     1.0 0.0 0.0 0.0 0.0 0.0 0.0  99   0   1   0   0 load.bin/505
ZONEID     NLWP  SWAP   RSS MEMORY      TIME  CPU ZONE
     1     1907  153M  225M    22%   0:01:22 10% zone1
     0      191  147M  219M    21%   0:03:32 2.4% global
     2      122  135M  205M    20%   0:01:21 0.0% zone2
Total: 110 processes, 2220 lwps, load averages: 2164.75, 1980.04, 1527.83

As shown by the CPU column, the capping works fine: 10% CPU utilization for zone1, just as configured.

Contradictory numbers - and their reason

But wait … the LoadAvg is extremely high.

# >uptime
  3:58pm  up  2:57,  2 users,  load average: 2176.48, 1985.47, 1532.13

If the LoadAvg were a relevant indicator in this situation, the system would be in real trouble now. But it isn’t … however, applications that depend on the LoadAvg are indeed in trouble now. One example is sendmail: it contains a mechanism to mitigate the risk of a mail storm. When the load average of the system rises above a configurable level, sendmail stops accepting mail. So sendmail would stop accepting mail right now, despite there being more than enough capacity available on the system.
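
For illustration, these are the two sendmail.cf options involved; the values below are just placeholders, not a recommendation:

# sendmail only queues (instead of delivering) incoming mail above this load average ...
O QueueLA=8
# ... and refuses SMTP connections entirely above this one
O RefuseLA=12

Okay, let’s look at the system from a different perspective: what is the state of the system as reported by vmstat?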

# >vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr cd cd f0 s0   in   sy   cs us sy id
 0 0 0 721984 358208  4   6  0  0  0  0  0  0  0  0  0  509  142  312 11  1 88
 0 0 0 721984 358208  4   6  0  0  0  0  0  0  0  0  0  660  174  298 10  2 87
 0 0 0 721984 358208  4   6  0  0  0  0  0  0  0  0  0  503  157  316 10  1 89
 0 0 0 721916 358140  4   6  0  0  0  0  0  0  0  0  0  512  206  321 10  2 88
 0 0 0 721916 358140  4   6  0  0  0  0  0  0  0  0  0  504  143  299 10  1 88
 0 0 0 721916 358140  4   6  0  0  0  0  0  0  0  0  0  498  156  307 11  1 88
 0 0 0 721916 358140  4   6  0  0  0  0  0  0  0  0  0  493  195  309 10  1 88
 0 0 0 721916 358140  4   6  0  0  0  0  0  0  0  0  0  505  140  298 11  1 88
 0 0 0 721916 358140  4   6  0  0  0  0  0  0  0  0  0  516  379  309 11  1 88
 0 0 0 721916 357944  4  54  0  0  0  0  0  0  0  0  0  507  149  305 10  2 88
^C

88% idle. That’s the number you would expect when you just use 10% of your system, plus a little bit of load from all the processes running on a freshly installed system.
Why is the LoadAvg that high? Especially since the first column in vmstat shows 0, this number seems counterintuitive: vmstat just shows the number of kernel threads in the run queue in this column. Is it possible that those threads are in the kernel run queue, but not in the state “on proc”? To understand this, you have to dig into the mechanism that implements CPU capping. CPU capping is done by leveraging the scheduling subsystem: the threads in the run queues and the dispatch queues of the CPUs are monitored by the system. When a group of threads reaches its capping limit, the threads are set to wait, and thus they are not run on a CPU. The load generator spawns more and more threads, so the number of threads keeps increasing, but they are put into the wait state almost immediately.
This can be substantiated by a quick look into the ps man page:

#> man ps
…
   S (l)               The state of the process:
...
                         W            Waiting: process is waiting
                                      for  CPU  usage  to drop to
                                      the CPU-caps enforced  lim-
                                      its.
...

Let’s use this knowledge on our system:

#> ps -o s=state -o comm=command -aelfL | grep load | more
    W ./load.bin
    W ./load.bin
    W ./load.bin
...
    W ./load.bin
    W ./load.bin
    W ./load.bin
    W ./load.bin

# > ps -o s=state -o comm=command -aelfL | grep load | grep W | wc -l
    2135

The number of waiting kernel threads can also be checked via kstat. The value of interest in this situation is “nwait”.

# >kstat -m caps -n cpucaps_zone_`zoneadm list -icv | grep zone1 | awk '{print $1}'`
module: caps                            instance: 1
name:   cpucaps_zone_1                  class:    zone_caps
        above_sec                       2141
        below_sec                       3753
        crtime                          4801.079480706
        maxusage                        135
        nwait                           2156
        snaptime                        10694.627372473
        usage                           40
        value                           40
        zonename                        zone1

What’s the reason for this difference between the LoadAvg and the data displayed by the other tools? Commands such as prstat, uptime, w, etc. use the syscall getloadavg(), which apparently evaluates the number of entries in the run queues, but isn’t aware of the wait flag. We can check this with a short DTrace one-liner:

# >dtrace -n 'syscall::: /probefunc=="getloadavg"/ {trace(execname)}' 
dtrace: description 'syscall::: ' matched 466 probes 
CPU     ID                    FUNCTION:NAME 
  2   2626                 getloadavg:entry   uptime                           
  2   2627                getloadavg:return   uptime                           
  2   2626                 getloadavg:entry   prstat                           
  2   2627                getloadavg:return   prstat                           
  2   2626                 getloadavg:entry   prstat                           
  2   2627                getloadavg:return   prstat                           
  2   2626                 getloadavg:entry   prstat                           
  2   2627                getloadavg:return   prstat                           
  2   2626                 getloadavg:entry   prstat                           
  2   2627                getloadavg:return   prstat                           
  2   2626                 getloadavg:entry   w                                
  2   2627                getloadavg:return   w     
^C
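
For reference, this is roughly how such a tool obtains its numbers. A minimal sketch, assuming the getloadavg() interface as declared in <sys/loadavg.h> on Solaris:

#include <stdio.h>
#include <sys/loadavg.h>        /* getloadavg(), LOADAVG_* indices */

int main(void)
{
double avg[LOADAVG_NSTATS];

/* Fills avg[] with the 1-, 5- and 15-minute load averages,
   the same numbers that uptime, w and prstat print. */
if (getloadavg(avg, LOADAVG_NSTATS) == -1)
        {
        perror("getloadavg");
        return 1;
        }

printf("load averages: %.2f %.2f %.2f\n",
    avg[LOADAVG_1MIN], avg[LOADAVG_5MIN], avg[LOADAVG_15MIN]);
return 0;
}

At this level there is no way to tell whether the threads behind these numbers are actually runnable or just parked in the cpu-caps wait state.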

Conclusion

There is an interesting interaction between getloadavg() and zone.cpu-cap that leads to misleading, but perfectly correct numbers. You should keep this in mind when you try to make sense of a system with an extremely high load average that is still perfectly responsive.

Do you want to learn more?

Misc
opensolaris.org: The implementation of CPU caps