I really like reading our internal aliases, because sometimes you stumble upon a piece of information that reminds you it could be an article for the blog. In this case it was about the topic of virtual memory translation. I will be simplifying things a bit, so dear experts, I know that the situation is more complex ;) For people interested in more details, chapters 9-12 and chapter 3, section 3.11.7 of “Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)” are a good start.
Whenever you use some virtual memory, there has to be a mapping from the virtual addresses to the physical addresses. However, to prevent the CPU from looking again and again into memory areas that are distant from a CPU-cycle perspective, modern processors contain small caches called translation lookaside buffers, or TLBs for short. These TLBs are rather small; a T4 core has 128 entries. While that sounds small, a TLB with such a number of entries has an astonishingly high hit rate.
By the way: you can increase the hit rate by using large pages. When you use 8K pages, those 128 entries cover 1024 kilobytes. If you use 2-gigabyte pages (really, a SPARC T4 supports pages that large), you can theoretically cover 256 GB of memory within the TLB (theoretically, because there are always mappings with smaller page sizes). So you can cover larger areas of memory with entries fitting into the TLB, and the rest that doesn’t fit into the TLB is much smaller. When you want to know what page sizes a process is using, the command pmap -s <pid> is your source of insight. When you want to get some insight into the effectiveness of the TLB with your workload, you can use the trapstat -T command on SPARC systems. Many applications already use large pages, albeit sometimes you have to force the application to use them.
But back to the topic: it’s really helpful to have a TLB to speed things up. A well-known proof was the TLB bug in the AMD Opteron: when you switched off the TLB as a reaction to Erratum 298, you lost 10-20% of the chip’s performance, as on any modern OS the mapping from virtual to physical memory is a quite frequent task.
However, as usual, speeding things up on one side opens challenges on another side, as with caches and the coherency of caches. And indeed you have exactly a cache coherency challenge here, because the TLB is a cache for a special kind of data: when you unmap some virtual memory, you have to clean up after your processes. There may be a mapping in the TLB that isn’t valid any longer, and you have to remove it. On an SMP system this means removing it on possibly all CPUs visible to the OS, because a now-stale entry may sit in the TLB of every core the process has ever run on. Otherwise you would have a stale entry in the TLBs of other cores, and that is something you can’t allow: the mapping could point to another process’s data, and you would simply call this corruption.
So it doesn’t suffice to execute this TLB cleanup code just on the core running the process that calls munmap at that moment; you have to clean up on all cores that may contain the stale mapping. In Solaris this is done via a cross-call. By means of a cross-call, a processor can ask other processors to interrupt their work in order to execute some code - in this case, the removal of TLB entries on the unmap.
So in this case it’s not the creation of something that is the expensive task, it’s the removal of it on systems with more than one core.
The simplest thing would be to execute this code on all processors, and under certain circumstances this happens (unmapping of kernel memory). However, for application processes there is an important optimization: on munmap, only those CPUs are cross-called that have actually run the process, because for all other CPUs you can be sure there are no stale TLB entries in their cores. A bitmap is kept in the “metadata” of a process, tracking which CPUs have seen it.
So this leads to an important hint: you can reduce the cross-calling by binding a process to a CPU or a set of CPUs. However - and this is important - you have to do it at startup of the process. The reasoning is quite simple: while the internal mechanisms prefer to use a CPU that has already been used by a process (processes like a warm cache), it’s not guaranteed that it will always run on the same CPU. Obviously, when you allow a process to run on all processors, the probability that it has run on all processors since boot-up approaches 1 over time, and then even with the optimization all CPUs will be cross-called.
There is another implication: let’s assume you ran your load on a single-processor, single-core machine and now put the load on a 4-socket, 32-core system with 256 virtual CPUs - or imagine a T3 with 4 sockets, i.e. 512 virtual CPUs. You can run into the situation that munmaps take much longer than before, as you have to clean the stale entries out of many more TLBs.
That’s a situation where processor binding or CPU pools are a really handy feature of Solaris.
So when you have migrated to a new system with many more processors visible to the OS, your application runs slower than before, and you see a lot of cross-calls and a lot of munmap calls (you can observe this with DTrace), it is perhaps a good idea to limit the process to a set of CPUs or a single CPU. There are applications using mmap and munmap so frequently that you wonder if they are doing anything else. They are seldom, but they exist, and then you see such slowdowns on multiprocessor systems. Obviously the next thing you should check is why the process issues so many munmap calls. I assume much of this code was written on or for systems with just one core: no other cores, no need to cross-call - delete the entry from your local TLB and you are done. And then you don’t see the challenges you impose on the system by calling munmap. However, I assume that is the reason why I see many applications saying “I’m doing memory management on my own, just give me a large heap of memory and I’m not giving it back until the admin forces me.”
PS: This article is based on a situation on SPARC and Solaris; however, the challenge exists on all multicore systems that use a TLB to accelerate translation, and while the exact tactics to ease the challenge differ, they boil down to the same thing: try to get rid of as many TLB cleanups as possible.