One flew over the allocator's nest

Sometimes even an experienced guide in the land of Solaris runs into a trap, even when the guide is aware of it. I have a standard tactic when I'm trying to diagnose performance problems: first … prstat -mL to get an overview of what the threads are really doing:

 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 26235 app   4.0 0.1 0.0 0.0 0.0  96 0.2 0.2 643  10 894   0 SomeApp/5
 18107 app   4.0 0.1 0.0 0.0 0.0  95 1.0 0.1  1K   7  1K   0 SomeApp/8
 26235 app   3.7 0.1 0.0 0.0 0.0  96 0.2 0.4 584  36 836   0 SomeApp/4
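If you want to keep an eye on just one of the suspicious processes, prstat can be restricted to it (the PID below is taken from the output above; the trailing 5 is simply a five-second sampling interval):

# prstat -mL -p 26235 5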

The first thing that springs to mind is the high percentage of time spent in the LCK state. So you start the next tool to get some additional information about that high LCK rate.
Obviously the weapon of choice is plockstat -A -p 18107:

<small>
<pre>Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
   78    69427 libc.so.1`libc_malloc_lock   libzip.so`Java_java_util_zip_Deflater_deflateBytes+0x29c
   44    99867 libc.so.1`libc_malloc_lock   libzip.so`Java_java_util_zip_Deflater_deflateBytes+0x36c
   21    35373 libc.so.1`libc_malloc_lock   libzip.so`Java_java_util_zip_Deflater_deflateBytes+0x29c
   11    58851 libc.so.1`libc_malloc_lock   libzip.so`Java_java_util_zip_Deflater_deflateBytes+0x2dc
    9    71014 libc.so.1`libc_malloc_lock   libzip.so`Java_java_util_zip_Deflater_deflateBytes+0x36c
[... ca. 25 lines deleted ...]
    1    10483 libc.so.1`libc_malloc_lock   libnsl.so.1`netconfig_free+0x10
    1     4401 libc.so.1`libc_malloc_lock   libnsl.so.1`netconfig_free+0x10
</pre>
</small>

Hey, libc_malloc_lock, my old friend. The wait time in the locks summed up to 15126580 ns. Looks like this application is in need of a thread-hot memory allocator. You can get one by preloading a different allocator library. As the application was started by SMF, it was easy to just use SMF to enable the preloading:

# svccfg -s application/someapp setenv LD_PRELOAD libumem.so
# svcadm refresh someapp
# svcadm restart someapp
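To make sure the preloading really took effect after the restart, a quick look at the libraries mapped into the new process doesn't hurt. A small sanity check (the pgrep pattern SomeApp is just a placeholder for the real process name):

# pldd $(pgrep SomeApp) | grep libumem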

This procedure had the intended effect. An additional run of plockstat yielded much more favorable results:

<small>
<pre>Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
    5    37929 0x3b400                      libumem.so.1`umem_cache_free+0x6c
    2    76420 0x3b400                      libumem.so.1`umem_cache_alloc+0x50
   12     5828 0x3b400                      libumem.so.1`umem_cache_free+0x6c
    2    31008 libumem.so.1`vmem0+0x4a8     libumem.so.1`vmem_alloc+0xf0
    8     6141 0x3b400                      libumem.so.1`umem_cache_alloc+0x50
    1    40411 libumem.so.1`vmem0+0x4a8     libumem.so.1`vmem_xalloc+0x140
    1     5922 0x37080                      libumem.so.1`umem_cache_alloc+0x50</pre>
</small>
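By the way, the totals quoted above and below are simply Count × nsec summed over all lines of the plockstat output, as nsec is the average time per event; for the seven lines above that gives exactly the 569898 ns mentioned in the next paragraph. A little awk sketch to save you the mental arithmetic (plockstat.out is just a stand-in for wherever you captured the plockstat output):

# awk '/^ +[0-9]/ { total += $1 * $2 } END { printf("%d ns\n", total) }' plockstat.out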

All mutex blocks summed up to 569898 ns. That's just roughly 3.768% of the time needed with the libc malloc. There were just two things nagging at me: 15126580 ns was high, but not high enough to explain threads spending 96% of their time in LCK. And even more puzzling: the LCK time rose to 100% most of the time for many threads where it had been at 90% before.

It's good that I prefer trains to get to my customers, otherwise my car insurance would be unpayable by now, because I'm somewhere else with my thoughts. Nevertheless the pencil connecting the dots was rather slow yesterday. My brain experienced several majf. The first dot was a measurement from the day before, where I had gone after the source of the high number of syscalls and the reasons for the high number of context switches. In hindsight this was the first dot:

# dtrace -n 'sched:::off-cpu /execname == "SomeApp"/{@[ustack()] = count()}END{trunc(@,5)}'
dtrace: description 'sched:::off-cpu' matched 3 probes
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END


              libc.so.1`___lwp_cond_wait+0x4
[...]
              528

              libc.so.1`___lwp_cond_wait+0x4
[...]
              553

              libc.so.1`___lwp_cond_wait+0x4
[...]
              570

              libc.so.1`___lwp_cond_wait+0x4
[...]
              638

              libc.so.1`___lwp_cond_wait+0x4
[...]
              860

Hmmm … that was interesting, but not unexpected. After that I looked at the number of system calls. And there a second dot appeared:

 # dtrace -n 'syscall:::entry/execname == "SomeApp"/{@[execname,probefunc] = count()}'
dtrace: description 'syscall:::entry' matched 234 probes
^C
  SomeApp                                    lwp_cond_wait                                                 15683

The third dot was the strange behaviour of the LCK time after changing the memory allocator. At 17:30 I went home. The tip of the pencil moved at a whopping 1 attoparsec per week. There was something ringing in my head, but no idea where this ringing came from. Reaching home I made some tea … a glacier attempted to overtake the tip of the pencil in the fast lane. Still this ringing. It drove me nuts. Should I call the mental asylum? Even Gregory House M.D. was there, and he slept with a character played by Franka Potente … not that bad ;) Then I watched my newly purchased "District 9" DVD. And now to cite my thought in the middle of the movie: "Fuck! That prstat/cond_wait issue". Suddenly the pencil connecting the dots moved at approximately light speed (as a pencil has mass, it can never quite reach light speed).

The reason for the counterintuitive numbers in the LCK column of prstat is the known issue "sleeping on CVs is 'LCK' time, should be 'SLP'". When a thread sleeps while waiting on a condition variable, this time isn't recorded as SLP time, it's recorded as LCK time, although SLP would be correct. There is a good explanation of the issue, and of the problems with fixing it, in the bug description. The customer's application uses vast numbers of threads ("Oh my god, it's full of threads"), so the conventional interpretation of the LCK column was totally misleading. The strange increase in LCK time when load decreased was logical, too: the less work reaches the threads, the more time they wait in lwp_cond_wait. And as that time counts towards LCK, it's only natural that your system seems to burn a lot of cycles in locks.
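If you ever want to check how much of such LCK time is really just threads sleeping on condition variables, you can let DTrace sum up the time the threads spend inside lwp_cond_wait. Just a sketch, with the same execname placeholder as above:

# dtrace -n 'syscall::lwp_cond_wait:entry /execname == "SomeApp"/ { self->ts = timestamp; } syscall::lwp_cond_wait:return /self->ts/ { @[pid, tid] = sum(timestamp - self->ts); self->ts = 0; }'

If the per-thread sums come close to the wall-clock time of the measurement interval, the LCK column is dominated by CV sleeps rather than by real lock contention, which is exactly what the prstat output above was showing.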