One flew over the allocator's nest
Sometimes even an experienced guide in the land of Solaris walks into a trap, even one the guide already knows about.
I have a standard tactic when I’m trying to diagnose performance problems: first … prstat -mL
to get an overview of what the threads are really doing:
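For readers who don’t use it every day: -m switches prstat to microstate accounting and -L shows one line per LWP, so the percentage columns add up to roughly 100 per thread. A sketch of the invocation (the PID is just the one from this session; substitute your own):

```shell
# -m: microstate accounting, -L: one line per LWP (thread)
prstat -mL -p 18107
# Columns of interest (percentage of elapsed time per thread):
#   USR/SYS  running in user/system mode
#   LCK      waiting on user-level locks
#   SLP      sleeping
#   LAT      waiting for a CPU
```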
The first thing that springs to mind is the high percentage of time in the LCK state. So you reach for the next tool to get some additional information about that high LCK rate.
Obviously the weapon of choice is plockstat -A -p 18107:
Hey, libc_malloc_lock, my old friend. The wait time in the locks summed up to 15126580ns. Looks like this application is in need of a thread-hot memory allocator. You can get one by preloading a different allocator library. As the application was started by SMF, it was easy to use SMF to enable the preloading:
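A sketch of how that can look. The service FMRI here is made up for the example, and I’m assuming libumem as the replacement allocator (a common choice for lock-contended malloc on Solaris, with its per-thread magazine caches); substitute whatever service and allocator apply in your case:

```shell
# Hypothetical service FMRI; look up the real one with svcs.
# Preload libumem so malloc/free no longer funnel through libc_malloc_lock.
svccfg -s svc:/application/myapp:default setenv LD_PRELOAD /usr/lib/libumem.so.1
# Push the new environment into the repository and restart the service.
svcadm refresh svc:/application/myapp:default
svcadm restart svc:/application/myapp:default
```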
This procedure had the intended effect. An additional run of plockstat yielded much more favorable results:
All mutex blocks summed up to 569898ns. That’s just roughly 3.768% of the time needed with the libc malloc. But two things kept nagging at me: 15126580ns was high, but not that high. And even more puzzling: the LCK time rose to 100% most of the time for many threads where it had been at 90% before. It’s good that I prefer trains to get to my customers, otherwise my car insurance would be unpayable by now, because my thoughts are somewhere else. Nevertheless, the pencil connecting the dots was rather slow yesterday. My brain experienced several majf yesterday. The first dot was a measurement from the day before, where I had gone after the source of the high number of syscalls and the reasons for the high number of context switches. In hindsight this was the first dot:
Hmmm … that was interesting, but not unexpected. After that I looked at the number of system calls, and there a second dot appeared:
The third dot was the strange behaviour of the LCK time after changing the memory allocator. At 17:30 I went home. The tip of the pencil moved at a whopping 1 attoparsec per week. Something was ringing in my head, but I had no idea where the ringing came from. Back home I made some tea … a glacier attempted to overtake the tip of the pencil in the fast lane. Still that ringing. It drove me nuts. Should I call the mental asylum? Even Gregory House M.D. was there, and he slept with a character played by Franka Potente … not that bad ;) I watched my newly purchased “District 9” DVD. And now to cite my thought in the middle of the movie: “Fuck! That prstat/cond_wait issue”. Suddenly the tip of the pencil connecting the dots moved at approximately light speed (as a pencil has mass, it can never actually reach light speed).
The reason for the counterintuitive numbers in the LCK column of prstat is the bug “sleeping on CVs is ‘LCK’ time, should be ‘SLP’”. When a thread sleeps waiting on a condition variable, this time isn’t recorded as SLP time, it’s recorded as LCK time, although SLP would be correct. There is a good explanation of the issue, and of the difficulty of fixing it, in the bug description.
The customer’s application uses a vast number of threads (“Oh my god, it’s full of threads”), so the conventional interpretation of the LCK column was totally misleading. The strange increase in LCK time when load decreased was logical, too: the less work reaches the threads, the more time they spend waiting in lwp_cond_wait. And as this time counts toward LCK, it’s only natural that your system seems to burn a lot of cycles on locks.