Measurement artifact

Recently i was working at a customer tuning an application. That was one of the reasons why i was so silent here in the blog because the customer was in Spain and i had not much time to write here.

Thinking about it

The application was throughoutly instrumented by means of something i will call “middleware” and so the customer thought he was able to see how long a job on the system took and how much CPU was used. However the numbers were strange: High throughput, short time to complete the job, however the CPU time used for a job looked strange. Something in there was something wrong with the CPU times. Just a guess. A feeling. Not really knowing. The application used the times() system call in order to deliver the numbers how much CPU time was used by a process so far. This system call was executed quite frequently by the application. The times() system call is returning the amount of SYS CPU and USR CPU time so far. You was able to see the quite significant number of times() sys calls be trussing the application. You can see the documentation of the sys call at times(). So i was sitting in my hotel room and was thinking about it … besides thinking how it’s possible to create such horribly tasting scrambled eggs. The transactions computed by the application obviously ended in the measured time, the number of executed transactions was undeniable correct (counting is easy and two ways to measure it showed the same number at separate points) So the measurement of used CPU time was the only number in question. What if the used CPU time is just a measuring artifact? While thinking, that the coffee was even worse (never had an acceptable hotel coffee) i had an theory what really happened. It’s all about what Solaris is considering as used CPU. By doing some research (and asking some people to validate my theory) i quickly found out, that Solaris is counting this CPU time by summing up the time a process is dispatched on a CPU. However, and that’s important, it doesn’t mean that the process is really executing something at the moment nowadays out of a simple reason: Imagine you have a processing core. This core can execute one stream of commands at a time. Okay there is pipelining so in fact it’s executing several things in parallel but at the end only one result can fall out of CPU. CPU designers found out long ago, that CPU have to wait a lot of time on slower components of a computer system and thus were just waiting on stuff. So it was thought that a CPU should have several streams of commands in order to enable them to choose from other streams of commands, when one stream of command was in pause because they waited on the hard disk or memory. There is a multitude of names for such a concept but the consequence was almost always the same. The single physical existent core was looking to the OS like several CPUs. You have a quite sophisticated and broad implementation of the concept in CMT cpus where a single core is looking like 8 CPUs to the OS or you have implementations where a single x86 core is looking like 2 CPUs, Intel coined the name Hyperthreading for it. However this has one implication: You can schedule more processes to a core than it can execute in parallel. This its almost universally a good thing, because most of the time the execution units are just waiting for data. Don’t believe me … just consult the CPU counters. For the 8 CPU visible to the OS per core, you have 2 pipelines, so you can execute at max two instructions. This can range from executing 1 command from each CPU visible to the OS on two pipelines in 4 cycles to 8 instructions from a CPU visible to the OS However the processor can choose from 8 streams of instructions without context switching to let the processor do something useful instead of just twiddling fingers very fast. After thinking through all this, i had a theory: The times() result could be the result of measuring something different then you initially think. And indeed: Solaris (i didn’t checked it on other operating systems) measures CPU time for the process by measuring the time the process is scheduled on a CPU. But given the the considerations about modern CPU technology this time is not necessarily the amount the clock cycles spend for the process. Other processes scheduled to the other CPUs visible to the OS residing on the same core had their share. This leads to a interesting situation: When a process tells you it got 1 second of CPU time, the 1 second could have been really used by 8 other processes alltogether.

Making it visible

But how to proof that to the customer. Out of this reason i wrote a small perl script.

<b>cpuhog_testing.pl</b>:
#! /usr/bin/perl

use English;

eval {
$SIG{ALRM} = sub {
my ($systemtime, $usertime) = times();
my $cputime = $systemtime + $usertime;
print " $PID - cputime : $cputime seconds\n";
die "endofexecution ";
};
alarm(20);
};

$cpu = $ARGV[0];

print "PID : $PID\n";

system("pbind -b $cpu $PID");
while (1) { my $res = ( 3.3333 / 3.14 ) }

This script is fairly easy: When it starts a time bomb is initialized, after 20 seconds it terminates itself, but prints the used CPU time to the screen before that. It binds itself to a processor at the beginning and then starts to do a division in an endless loop. Until the the time bomb goes up. At first i wrote a small shell script to start 8 instances of the script on one hw thread:

<b>starthogs1.sh</b>:
#!/bin/bash
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 248 &

Starting it up gave me an expected result. The slightly shortened output:

bash-3.2# ./starthogs1.sh
PID : 21977
PID : 21976
PID : 21980
PID : 21981
PID : 21978
PID : 21979
PID : 21983
PID : 21982
process id 21976: was not bound, now 248
process id 21980: was not bound, now 248
process id 21977: was not bound, now 248
process id 21981: was not bound, now 248
process id 21979: was not bound, now 248
process id 21983: was not bound, now 248
process id 21978: was not bound, now 248
process id 21982: was not bound, now 248
21977 - cputime : 2.603 seconds
21976 - cputime : 2.529 seconds
21978 - cputime : 2.514 seconds
21982 - cputime : 2.516 seconds
21981 - cputime : 2.496 seconds
21979 - cputime : 2.535 seconds
21983 - cputime : 2.52 seconds
21980 - cputime : 2.547 seconds

Each process got roughly 2.5 seconds. 2.5 times 8 is 20 seconds. Something you would expect. Okay … the numbers are not exactly 2.5 seconds, but you have to keep in mind that you start up the interpreter and it takes some time until you start the clock of the time bomb. So i repeated it with starting the process on the 8 hw threads visible as CPUs to the OS:

<b>starthogs2.sh:</b>
#!/bin/bash
./cpuhog_testing.pl 248 &
./cpuhog_testing.pl 249 &
./cpuhog_testing.pl 250 &
./cpuhog_testing.pl 251 &
./cpuhog_testing.pl 252 &
./cpuhog_testing.pl 253 &
./cpuhog_testing.pl 254 &
./cpuhog_testing.pl 255 &

Okay, starting it up again, but now on processors 248 to 255, and just not 248. Keep in mind that this are different sides of the same medal. 248 to 255 are the hw-threads visible to the OS of the eight core of the fourth processor in a T4-4.

bash-3.2# ./starthogs.sh
PID : 16145
PID : 16144
PID : 16146
PID : 16149
PID : 16148
PID : 16150
PID : 16147
PID : 16151
process id 16145: was not bound, now 249
process id 16144: was not bound, now 248
process id 16146: was not bound, now 250
process id 16148: was not bound, now 252
process id 16150: was not bound, now 254
process id 16149: was not bound, now 253
process id 16151: was not bound, now 255
process id 16147: was not bound, now 251
16144 - cputime : 20.01 seconds
16145 - cputime : 20.009 seconds
16146 - cputime : 20.01 seconds
16149 - cputime : 20.007 seconds
16150 - cputime : 20.008 seconds
16148 - cputime : 20.01 seconds
16147 - cputime : 20.009 seconds
16151 - cputime : 20.01 seconds

Whoops … no matter what you have expected. 160 seconds CPU time in a time frame that could give you 40 seconds CPU time at maximum (2 integer pipelines)

Conclusion

You have to take very precicely what Solaris is really measuring when it gives you the CPU times with the times() sys call. It’s not “amount of used CPU cycles for the processes”, it’s “time being scheduled on a CPU”. And this can differ quite significantly, it’s something completely different nowadays. The assumption that the CPU usage() is really telling you something about where the cycles of your CPU are going to is just incorrect, when you have no HW threads on a CPU. And even then there are problems. CPUs are pipelining for some time. That means that a CPU visible from the OS could execute instructions from several processes in parallel, just on different stages of their execution. What thread is using the CPU. I’m a fan of the “the result matters” so i would assign the CPU time to the instruction that completed its way through the pipeline. I didn’t checked it, but it must be even worse with virtualization that works with emulated CPUs (and not like CMT by assigning HW threads to a domain). The operating system can hardly know how much CPU time it got from the real CPUs to execute the stuff scheduled to the virtual CPUs in assigned to a guest OS. In theory this problem should appear on all OS using the times() syscall and having some kind of hardware multithreading . But so far i had no time to check that with my hardware available and some other operating systems. In order to really measure the CPU usage, you need more sophisticated tools like Solaris is offering with cpustat, cputrack or dtrace using the CPU Performance Counters. And at the end there are two numbers you can really use: Elapsed time of the transaction and throughput. Or to say it differently: How long has the user to wait and how many users can execute their requests in a given timeframe.