This article was originally written by Adam Abrevaya, VP of Engineering at NuoDB.
In our latest release cycle, we hit a problem in extremely load-intensive, long-running (multi-day) tests. The symptom was a slow but steady increase in Resident Set Size (RSS) in both our Transaction Engines (TEs) and Storage Managers (SMs). On large machines you wouldn't notice it unless you were watching "ps" stats, but on typical cloud hardware the OOM killer would eventually take out the NuoDB processes.
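For readers who want to watch for this kind of growth from inside a process rather than through "ps", here is a minimal sketch in C. It assumes a Linux /proc filesystem; the function names parse_vmrss_kb and current_rss_kb are made up for illustration and are not part of NuoDB or JEMalloc.

```c
#include <stdio.h>
#include <string.h>

/* Parse a "VmRSS:   123456 kB" line as found in /proc/<pid>/status.
 * Returns the value in kB, or -1 if the line is not a VmRSS line. */
long parse_vmrss_kb(const char *line)
{
    long kb;
    if (strncmp(line, "VmRSS:", 6) != 0)
        return -1;
    if (sscanf(line + 6, "%ld", &kb) != 1)
        return -1;
    return kb;
}

/* Read the calling process's resident set size in kB (Linux only).
 * Returns -1 if /proc/self/status is unreadable or has no VmRSS line. */
long current_rss_kb(void)
{
    char line[256];
    long kb = -1;
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        kb = parse_vmrss_kb(line);
        if (kb >= 0)
            break;
    }
    fclose(f);
    return kb;
}
```

Logging current_rss_kb() at a fixed interval during a long test run is usually enough to make a slow upward drift visible.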
Since our system is written in C++, the obvious thought was that we had a memory leak. That turned out not to be the case: several attempts to find a leak using Valgrind and the JEMalloc heap profiler came up empty. The investigation then turned to memory fragmentation, which could have been causing more pages to be held in memory with lots of holes that the memory allocator couldn't use. JEMalloc does a great job of minimizing memory fragmentation, and we confirmed that it was doing so here. So memory fragmentation wasn't the issue either.
What our engineering team discovered (thanks Oleg) was that there were cases where JEMalloc was releasing pages back to the operating system and the release didn't seem to have any effect. The problem turned out to be due to a somewhat new feature in the Linux kernel called Transparent Huge Pages (THPs). THPs prevented pages marked with madvise(...,..., MADV_DONTNEED) from being purged from resident memory. The quick description of THPs is that Linux will automatically create a "huge" page when a virtual memory allocation is above a certain size. A single huge page needs far less bookkeeping than lots of little 4K pages, and that bookkeeping affects how well the virtual memory translation lookaside buffer (TLB) performs. Many more details about THPs can be found here.
JEMalloc uses madvise(...,..., MADV_DONTNEED) to discard pages it no longer needs. Since that doesn't work with THPs, our engineering team (thanks Tommy) patched JEMalloc to turn off huge page allocations using madvise(...,..., MADV_NOHUGEPAGE). Doing that fixed the memory consumption issues we were seeing without any noticeable impact on our performance. Tommy is submitting the change back to the JEMalloc community.
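To make the two madvise calls concrete, here is a minimal sketch in C of the general pattern. It is not the actual JEMalloc patch: the helper names map_without_thp and discard_pages are hypothetical, and the code assumes Linux with glibc.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map an anonymous, private region and ask the kernel not to back it with
 * transparent huge pages. Returns NULL if the mapping itself fails.
 * *thp_disabled is set to 1 if the MADV_NOHUGEPAGE hint was accepted and
 * to 0 if the kernel rejected it or the headers do not define the flag. */
void *map_without_thp(size_t len, int *thp_disabled)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_NOHUGEPAGE
    *thp_disabled = (madvise(p, len, MADV_NOHUGEPAGE) == 0);
#else
    *thp_disabled = 0;
#endif
    return p;
}

/* Tell the kernel that the contents of [p, p + len) are no longer needed.
 * On Linux, later reads of private anonymous memory discarded this way
 * see zero-filled pages. Returns 0 on success, -1 on error. */
int discard_pages(void *p, size_t len)
{
    return madvise(p, len, MADV_DONTNEED);
}
```

A caller would map a region with map_without_thp, use it, and call discard_pages when the contents are dead. When the hint is rejected, the process can still run, but the discarded pages may stay resident as described above.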
Unfortunately, the story doesn't end there. Kernel versions prior to 2.6.38 don't support madvise(...,..., MADV_NOHUGEPAGE). On those kernels, our TEs and SMs (in the upcoming 2.0.4 release) log warnings telling you to turn off Transparent Huge Pages. To turn off THPs, run the following as root:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Note: On some systems (CentOS 6.3), the name of the directory will be redhat_transparent_hugepage. Also note that although THPs are part of the Linux kernel, they are turned off by default in the Ubuntu kernel (so no worries there!).
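To check the current setting programmatically, a process can read the same sysfs file. The sketch below, in C, assumes the usual file format in which the active value is the one in square brackets, for example "always madvise [never]". The function names are hypothetical; the two paths are the ones mentioned above.

```c
#include <stdio.h>
#include <string.h>

/* Extract the active setting from sysfs text such as
 * "always madvise [never]": the active value is the bracketed one.
 * Copies it into out (capacity outsz) and returns 0, or returns -1 if no
 * bracketed token is found or it does not fit in out. */
int thp_active_mode(const char *text, char *out, size_t outsz)
{
    const char *lb = strchr(text, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t n;
    if (!rb)
        return -1;
    n = (size_t)(rb - lb - 1);
    if (n + 1 > outsz)
        return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the THP "enabled" setting, trying the standard directory first and
 * then the redhat_transparent_hugepage variant. Returns 0 on success, or
 * -1 if neither file could be read and parsed. */
int read_thp_enabled(char *out, size_t outsz)
{
    static const char *paths[] = {
        "/sys/kernel/mm/transparent_hugepage/enabled",
        "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    };
    char buf[128];
    size_t i;
    for (i = 0; i < sizeof paths / sizeof paths[0]; i++) {
        FILE *f = fopen(paths[i], "r");
        int ok;
        if (!f)
            continue;
        ok = fgets(buf, sizeof buf, f) != NULL &&
             thp_active_mode(buf, out, outsz) == 0;
        fclose(f);
        if (ok)
            return 0;
    }
    return -1;
}
```

If read_thp_enabled succeeds and the result is anything other than "never", THPs are still active in some form and the echo commands above apply.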
If you run NuoDB on a Linux kernel with Transparent Huge Pages enabled, we strongly recommend turning them off. Anyone running distributions such as CentOS 6.3, 6.4, and 6.5, which run kernel revisions below 2.6.38, needs to pay particular attention to this.
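As a rough way to tell which side of the 2.6.38 line a machine falls on, the kernel release string can be compared against that version. The C sketch below uses hypothetical function names and only looks at the version numbers. Distribution kernels often carry backported features, so treat the result as a hint; calling madvise with MADV_NOHUGEPAGE and checking whether it fails is the more direct test.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Return 1 if a release string such as "2.6.32-431.el6.x86_64" names a
 * version of at least 2.6.38, 0 if it is older, and -1 if the string
 * cannot be parsed. A missing patch level is treated as 0. */
int release_supports_nohugepage(const char *release)
{
    int major = 0, minor = 0, patch = 0;
    if (sscanf(release, "%d.%d.%d", &major, &minor, &patch) < 2)
        return -1;
    if (major != 2)
        return major > 2;
    if (minor != 6)
        return minor > 6;
    return patch >= 38;
}

/* Apply the same check to the running kernel, as reported by uname(2).
 * Returns -1 if uname fails or the release string cannot be parsed. */
int running_kernel_supports_nohugepage(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return release_supports_nohugepage(u.release);
}
```

A release string like "2.6.32-431.el6.x86_64" yields 0 here, which is the case where turning THPs off system-wide is the available workaround.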