The deliverer's knowledge: Some comments on performance data for capped CPUs with Solaris 10 zones
(An introduction by Jörg: This is the second blog entry written by Heiko. It’s an interesting article about a weirdness you can stumble upon while working with Solaris 10.)
Solaris 10 Update 5 aka Solaris 10 5/08 introduced a feature to limit the absolute CPU usage of zones or projects to a defined percentage. The relevant resource control is zone.cpu-cap. While this is nothing really new, you should be aware of the interactions (and the possible misunderstandings arising from them) between this capping feature and values that report the utilization of a system, like the LoadAvg.
A simple example
Here is a simple example. As a test system I used a virtual machine with 4 CPUs and 2 running zones.
The availability of 4 CPUs can be shown with the output of psrinfo:
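On the test system, psrinfo reports four virtual processors (output is illustrative; your timestamps will differ):

    # psrinfo
    0       on-line   since 05/12/2008 10:02:01
    1       on-line   since 05/12/2008 10:02:06
    2       on-line   since 05/12/2008 10:02:06
    3       on-line   since 05/12/2008 10:02:06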
We take a short look at the CPU utilization. In essence, the system just runs itself at the moment. A very relaxing situation. :)
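A quick snapshot with vmstat (the numbers are illustrative, and the disk columns depend on your hardware) shows an almost completely idle box:

    # vmstat 5 2
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr cd cd cd cd   in   sy   cs us sy id
     0 0 0 1802040 921334  8  21  0  0  0  0  0  0  0  0  0  312  418  203  0  1 99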
Configuration of the CPU capping
Next, the capping for a local zone called zone1 is activated and set to 10%. Keep in mind: the activation via prctl is not boot persistent. The capping value can be calculated like this: 4 CPUs = 400, and 400 = 100%, thus 40 = 10%.
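A sketch of both variants, assuming the syntax documented in prctl(1) and zonecfg(1M) for Solaris 10 5/08 (check the man pages of your release):

    # prctl -n zone.cpu-cap -t privileged -v 40 -r -e deny -i zone zone1

    # zonecfg -z zone1
    zonecfg:zone1> add capped-cpu
    zonecfg:zone1:capped-cpu> set ncpus=0.40
    zonecfg:zone1:capped-cpu> end
    zonecfg:zone1> commit
    zonecfg:zone1> exit

The prctl command takes effect immediately but is gone after a reboot; the zonecfg variant (ncpus=0.40 corresponds to the value 40 of zone.cpu-cap) survives a reboot, but takes effect at the next zone boot.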
Let’s check the configuration:
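Querying the resource control for the zone should show the privileged cap of 40 (output layout is illustrative):

    # prctl -n zone.cpu-cap -i zone zone1
    zone: 1: zone1
    NAME    PRIVILEGE       VALUE    FLAG   ACTION        RECIPIENT
    zone.cpu-cap
            privileged         40       -   deny                  -
            system          4.29G     max   deny                  -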
BTW, there is also a corresponding kstat for the CPU capping in the module caps. The relevant kstat is named cpucaps_zone_<zoneid>.
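Assuming zone1 got the zone id 1, the cap and its current usage can be read like this (values are illustrative):

    # kstat -m caps -n cpucaps_zone_1
    module: caps                            instance: 1
      name:   cpucaps_zone_1                class:    zone_caps
        nwait                           0
        usage                           1
        value                           40
        zonename                        zone1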
Pedal to the metal - or testing the cap
To test the capping, just use your favorite CPU hog. I wrote a small - albeit not very elegant - C program to fulfill this purpose.
Okay … as the C source alone doesn't help us, we have to compile it first. Compile, copy, finished.
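Heiko's original program isn't reproduced here; a minimal stand-in that simply burns CPU would look like this (the zonepath /zones/zone1 is an assumption of this sketch):

    # cat > hog.c <<'EOF'
    /* minimal CPU hog: spin forever */
    int main(void)
    {
            for (;;)
                    ;
    }
    EOF
    # cc -o hog hog.c
    # cp hog /zones/zone1/root/var/tmp/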
Now we can log in to the zone and put some load on our system.
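For example (assuming the binary was copied to /var/tmp inside the zone; eight hog processes are an arbitrary choice):

    # zlogin zone1
    # cd /var/tmp
    # for i in 1 2 3 4 5 6 7 8; do ./hog & done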
Now, wait a bit - until the cake rises. And after a short moment:
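A look at prstat -Z from the global zone might show something roughly like this (per-process lines omitted, values illustrative):

    # prstat -Z
    ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE
         1       38   52M   41M   1.1%   0:01:17  10% zone1
         0       44  148M  119M   3.1%   0:00:40 0.1% global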
As shown by the CPU column, the capping works fine: 10% CPU utilization for zone1, just as configured.
Contradictory numbers - and their reason
But wait … the LoadAvg is extremely high.
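uptime, for instance, might report something like this (illustrative values):

    # uptime
     12:35pm  up  1:23,  2 users,  load average: 8.53, 6.21, 3.14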
If the LoadAvg were a relevant indicator in this situation, the system would be in real trouble now. But it isn't … however, applications with dependencies on the LoadAvg are indeed in trouble now. One example is sendmail. There is a mechanism in sendmail to mitigate the risk of a mail storm: when the load average of the system rises above a configurable level (the QueueLA and RefuseLA options in sendmail.cf), sendmail stops accepting mail. So sendmail would stop accepting mail now, despite there being more than enough capacity available on the system. Okay, let's look at the system from a different perspective: what is the state of the system as reported by vmstat?
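With illustrative values matching the situation described here, vmstat reports something like this:

    # vmstat 5 2
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr cd cd cd cd   in   sy   cs us sy id
     0 0 0 1799992 918804 12  30  0  0  0  0  0  0  0  0  0  330  520  240 11  1 88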
88% idle. That's the number you would expect when you just use 10% of your system, plus a little bit of load from all the processes running on a freshly installed system.
Why is the LoadAvg that high? Especially since the first column of vmstat shows 0, this number seems counterintuitive. vmstat shows the number of kthreads in the run queues in this column. Is it possible that those threads are in the kthread run queue, but not in the state "on_proc"? To understand this, you have to dig into the mechanism that enables CPU capping. CPU capping is done by leveraging the scheduling subsystem: threads in the run queue and dispatch queue of a CPU are monitored by the system, and when a group of threads reaches its capping limit, the threads are set to wait, thus they are not run on a CPU. The load generator spawns more and more threads, so the number of threads increases, but they are set to the wait state almost immediately.
This can be substantiated by a quick look into the ps man page:
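On Solaris 10, ps(1) documents a process state code for exactly this case under the s output format character (quoted from memory, check your release):

    W    Waiting: process is waiting for CPU usage to drop to the
         CPU-caps enforced limits.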
Let’s use this knowledge on our system:
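For example, counting all processes currently in this wait state (the s format character of ps -o prints the state; the count is illustrative and fluctuates):

    # ps -e -o s -o comm | grep '^W' | wc -l
           8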
The number of waiting kthreads can also be checked via kstat. The value of interest in this situation is "nwait".
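For example, in the parseable output format of kstat (a zone id of 1 is assumed again for zone1):

    # kstat -p caps::cpucaps_zone_1:nwait
    caps:1:cpucaps_zone_1:nwait     8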
What's the reason for this difference between the LoadAvg and the data displayed by other tools? Commands such as prstat, uptime, w, etc. use the syscall getloadavg(), which apparently evaluates the number of entries in the run queues, but isn't aware of the wait flag. We can check this with a short dtrace one-liner:
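A sketch of such a one-liner, assuming getloadavg is visible to the syscall provider on your release: start it, run uptime or prstat in a second terminal, and stop it with Ctrl-C to see which executables called getloadavg() (counts are illustrative):

    # dtrace -n 'syscall::getloadavg:entry { @[execname] = count(); }'
    dtrace: description 'syscall::getloadavg:entry' matched 1 probe
    ^C
      uptime                                                            1
      prstat                                                            2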
Conclusion
There is an interesting interaction between getloadavg() and zone.cpu-cap that leads to misleading, but perfectly correct, numbers. You should keep this in mind when you try to make sense of a system with an extremely high load average that is nevertheless still responsive.