Oh my god, it's full of threads ... and out of memory

This one is for people with really large systems (although, thread-wise, even a T4-1 or T4-2 can be a really large system). Imagine you have dozens to hundreds of zones, all with thousands of threads. Or you have an extreme number of ZFS pools … with all their zpool processes, plus a lot of zones with a lot of processes. You know that you have a lot of processes and just don’t want to think about it, so you’ve set the relevant parameters to their maximum. You know … by default you can only have up to 30,000 processes on a Solaris system (because resources have to be reserved for each possible process, and it makes no sense to allocate resources for a number of processes most systems will never see). So you’ve set in /etc/system:

set pidmax = 999999

This sets the maximum number of process IDs in the system.

set maxuprc = 999990

With this parameter you set the maximum number of processes that any one user can create on the system.

set max_nprocs = 999999

This defines the maximum number of processes on the system. All these parameters depend on each other, and some of their defaults are derived from other parameters. For 11.1 you can look up the dependencies in the “Process-Sizing Parameters” section of the “Oracle Solaris 11.1 Tunable Parameters Reference Manual”.

Okay, now you start to load your system. More and more processes are starting. You reach roughly 65,000 processes and for some strange reason everything gets slower: starting a script, or even just a shell command, is really sluggish. One of my colleagues was faced with such a problem and found out what the problem is. I’m simplifying things a little bit here to keep the blog entry short.

At startup the kernel allocates a certain class of memory called “kernel pageable memory” (it’s not entirely correct to say that it’s pageable, it’s a little bit more complex, just think of it as segkp ;) ). This memory is mainly used for the stacks of kernel threads. On a Solaris 10 64-bit system (as it was in this case) the default size of this memory is 2 GB and the stack size is 24 KByte. However, you have to add the redzone to this value. This is quite an elegant way to protect the following stack from a buffer overrun: a specially configured page is inserted between the stacks, and whenever you try to read from or write to it, you get a page protection fault. With the redzone added, each thread takes a 32 KByte share of this cake, so at most you can have 2 GB / 32 KByte = 65536 threads (okay, a little bit less) on a system with this default setting.

The parameter to control this is segkpsize. It sets the amount of kernel pageable memory. The maximum in Solaris 10 is 24 GB, and the unit is 8 KByte pages. So if you want to set it to the maximum, you write into /etc/system:

set segkpsize=0x300000

0x300000 * 8192 bytes = 0x600000000 bytes, which is 25769803776 bytes in decimal, which is 24 GByte, as the trained eye will surely have noticed. However, you should pick a value better matched to the maximum number of processes you expect.

Okay, but why does the system react so sluggishly? When a process is started, memory is acquired from the kp segment, and this can be done in two ways. The first way is to tell the allocator not to wait. In this case, when there is no free memory in the segment, process creation fails immediately. This is clearly visible. However, there is a second way, and it is the default: the allocation waits until some memory becomes free because a different thread has terminated and returned its allocation in this memory area. It’s really reasonable to do so, as you don’t create threads just for fun. But now you have the problem that even simple commands may take significantly longer. Not at execution, but at startup (however, it looks as if everything takes longer when you are just starting some shell scripts, as it’s sometimes hard to see whether a command takes longer to start or to run afterwards). This waiting doesn’t show up in any error message.

So, how do you detect such a situation? You can use kstat to get the statistics of the segment driver for kernel pageable memory, which is the part responsible for this situation.

jmoekamp@hivemind:~# kstat -n segkp
module: vmem                            instance: 35    
name:   segkp                           class:    vmem
        alloc                           11659
        contains                        0
        contains_search                 0
        crtime                          0
        fail                            0
        free                            10896
        lookup                          498
        mem_import                      0
        mem_inuse                       18956288
        mem_total                       2147483648
        populate_fail                   0
        populate_wait                   0
        search                          289723
        snaptime                        16443,626850679
        vmem_source                     0
        wait                            0
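If you want to watch these numbers from a script instead of by eye, a minimal Python sketch could look like this. It only parses text in the format shown above. The function names and the 90 % usage threshold are my own assumptions for illustration, not something Solaris prescribes.

```python
# Shortened sample in the format printed by `kstat -n segkp` above.
SAMPLE = """\
module: vmem                            instance: 35
name:   segkp                           class:    vmem
        alloc                           11659
        fail                            0
        mem_inuse                       18956288
        mem_total                       2147483648
        wait                            0
"""


def parse_kstat(text):
    """Collect the integer-valued statistics from the kstat text output."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Statistic rows are "name value"; the two header rows have four
        # fields, and non-integer values such as snaptime are skipped.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def segkp_report(stats, usage_limit=0.9):
    """Return a list of warnings. usage_limit is an arbitrary assumption."""
    warnings = []
    usage = stats["mem_inuse"] / stats["mem_total"]
    if usage >= usage_limit:
        warnings.append(f"segkp is {usage:.0%} full")
    for counter in ("fail", "wait"):
        if stats.get(counter, 0) > 0:
            warnings.append(f"{counter} = {stats[counter]}")
    return warnings


if __name__ == "__main__":
    print(segkp_report(parse_kstat(SAMPLE)) or "segkp looks fine")
```

On a real system you would feed it the captured output of the kstat command instead of the sample string, and run it periodically to see whether the counters are climbing.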

The interesting parts are, first, mem_inuse and mem_total. If the “in use” amount is almost as large as the “total” amount, the kernel pageable memory area could surely use an increase. Even more interesting are the fail and wait rows. They should be zero or close to zero. So when you see values other than zero here, increasing this memory area with set segkpsize= is the right thing to do, especially when these numbers are trending upwards.

In Solaris 11.1 this value saw an increase: the 2 GB default for SPARC was changed with 11.1. As described in the “Oracle Solaris 11.1 Tunable Parameters Reference Manual”, it’s now computed as

2 GB x the smaller result of nCPUs / 128 or the amount of physical memory / 256 GB
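To get a feeling for what this formula yields, here is a small Python sketch that applies it literally. The function name and the example machines are hypothetical. Whether the kernel rounds the ratios or enforces a lower bound is not covered here, so treat the results as an approximation and check the manual for the exact rules.

```python
GB = 1024 ** 3


def segkp_default_bytes(ncpus, physmem_bytes):
    """Literal reading: 2 GB x min(nCPUs / 128, physical memory / 256 GB)."""
    factor = min(ncpus / 128, physmem_bytes / (256 * GB))
    return 2 * GB * factor


if __name__ == "__main__":
    # Hypothetical machines, chosen only to show both branches of min().
    for ncpus, mem_gb in ((128, 256), (256, 1024), (512, 512)):
        size = segkp_default_bytes(ncpus, mem_gb * GB)
        print(f"{ncpus} CPUs, {mem_gb} GB RAM -> {size / GB:.1f} GB segkp")
```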

Long blog entry, short conclusion: If you have many threads, keep an eye on the segkpsize value.
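If you would rather pick a segkpsize that matches your expected thread count than just set the maximum, the arithmetic from this entry can be sketched in a few lines of Python. It assumes the Solaris 10 numbers used above (24 KByte stack plus an 8 KByte redzone, 8 KByte pages, 24 GB maximum). The function names and the 25 % headroom are my own assumptions.

```python
KB = 1024
PAGE = 8 * KB          # unit of segkpsize
STACK = 24 * KB        # default kernel thread stack, Solaris 10 64-bit
REDZONE = 8 * KB       # guard page following each stack
MAX_PAGES = 0x300000   # 24 GB, the Solaris 10 maximum


def max_threads(segkp_bytes):
    """Upper bound on threads whose stacks fit into a segkp of this size."""
    return segkp_bytes // (STACK + REDZONE)


def segkpsize_for(threads, headroom=1.25):
    """segkpsize in pages for a thread count; headroom is an assumption."""
    needed = int(threads * headroom) * (STACK + REDZONE)
    pages = -(-needed // PAGE)  # ceiling division
    return min(pages, MAX_PAGES)


if __name__ == "__main__":
    print(max_threads(2 * 1024 ** 3))  # the 2 GB default
    print(f"set segkpsize={segkpsize_for(200000):#x}")
```

The second print produces a line in /etc/system syntax for a hypothetical target of 200,000 threads. Verify the result against the kstat numbers of your own system before using it.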