When I ask my colleagues why mmap is faster than system calls, the answer is inevitably “system call overhead”: the cost of crossing the boundary between the user space and the kernel. It turns out that this overhead is more nuanced than I used to think, so let’s look under the hood to understand the performance differences.
Background (skip if you are OS expert):
System calls. A system call is a special function that lets you cross protection domains. When a program executes in user mode (an unprivileged protection domain) it is not allowed to do things that are permitted for the code executing in the kernel mode (a privileged protection domain). For example, a program running in user space typically cannot read files without help from the kernel. When a user program asks a service from an operating system, the system protects itself from malicious or buggy programs via system calls. A system call executes a special hardware instruction, often called “trap”, that transfers control into the kernel. Then the kernel can decide whether it will honour the request.
While this protection is super useful, it has a cost. When we cross from user space into the kernel, we have to save the hardware registers, because the kernel might need to use them. Further, since it is unsafe to directly dereference user-level pointers (what if they are null — that’ll crash the kernel!) the data referred to by these pointers must be copied into the kernel.
When we return from the system call, we have to repeat the sequence in the reverse order: copy out any data that the user requested (because we can’t just give user programs pointers into kernel memory), restore the registers and jump to user mode.
Page faults. The operating system and the hardware together translate the addresses that are written down in your program’s executable (these are called virtual addresses) to the addresses in the actual physical memory (physical addresses). It would be pretty inconvenient for the compiler to generate physical addresses directly, because it doesn’t know on what machine you might run your program, how much memory it has and what other programs might be using physical memory at the time your program runs. Hence the need for this virtual-to-physical address translation. The translations, or mappings, are set up in your program’s page table. When your program begins to run, none of these mappings are set up. So when your program tries to access a virtual address, it generates a page fault, which signals the kernel to go set up the mapping. The kernel is notified that it needs to handle a page fault via a trap, so in this way it is a bit similar to a system call. The difference is that the system call is explicit and the page fault is implicit.
Buffer cache. Buffer cache is a part of kernel memory that is used to keep recently accessed chunks of files (these chunks are called blocks or pages). When a user program requests to read a file, the page from the file is (usually) first put into the buffer cache. Then the data is copied from the buffer cache out to the user-supplied buffer during the return from the system call.
Mmap. Mmap stands for memory-mapped files. It is a way to read and write files without invoking system calls. The operating system reserves a chunk of a program virtual addresses to “map” directly to a chunk in a file. So if the program reads the data from that part of the address space, it will obtain the data that resides in the corresponding part of the file. If that part of the file happens to reside in the buffer cache, the virtual addresses of the mapped chunk will simply be mapped to the physical addresses of the corresponding buffer cache pages upon the first access, and no system calls or other traps will be invoked later on. If the file data is not in the buffer cache, accessing the mapped area will generate a page fault, prompting the kernel to go fetch the corresponding data from disk.
Why mmap should be faster
Let us begin by formulating the hypothesis. Why do we expect mmap to be faster? There are two obvious reasons. First, it requires no explicit crossing of protection domains, though there is still implicit crossing when we have page faults. That said, if a given range in the file is accessed more than once, chances are we won’t incur page faults after the first access. That, however, did not occur in my experiments, so I did expect to hit a page fault every time I read a new block of the file.
Second, if the application is written such that it can access the data directly in the mapped region, we don’t have to perform a memory copy. In my experiments, though, I was interested in measuring the scenario where the application has the separate target buffer for the data it reads. So even though the file is mmapped, the application will still copy the data from the mapped area into the target buffer.
Therefore, in my experimental environment, I expected mmap to be slightly faster than system calls, because I thought the code for handling page faults would be a bit more streamlined than that for system calls.
I set up my experiment in the following way. I create a 4GB file and then read it either sequentially or randomly using a block size of 4KB, 8KB or 16KB. I read the file using either a read system call or mmap. In the case of mmap, the data is copied from the mapped area into a separate “destination” buffer. I run these tests using either a cold buffer cache, meaning that the file is not cached there, or a warm buffer cache, meaning that the file is there in kernel memory. The storage medium is an SSD that you might expect to find in a typical server. All reads are performed using a single thread. Source code of a benchmark demonstrating this is found here.
The following charts show the throughput of the read benchmark for the sequential/warm, sequential/cold, random/warm and random/cold runs.
Barring few exceptions, mmap is 2–6 times faster than system calls. Let’s analyze what happens in the warm experiments, since there mmap provides a more consistent improvement.
The following figure shows the CPU profile collected during the sequential/warm syscall experiment with 16KB block size. During this experiment the CPU utilization is 100%, so the CPU profile tells us the whole story.
We see that ~60% of the time is spent in copy_user_enhanced_fast_string — a function that copies data out to user space. About 15% is spent on other work that occurs on crossing the system call boundary (functions do_syscall_64, entry_SYSCALL_64 and syscall_return_via_sysret), and about 6% in functions that find the data in the buffer cache (find_get_entry and generic_file_buffered_read).
Now let’s look at what happens during the mmap test with the same parameters:
This profile is vastly different. About 60% of the time is spent in __memmove_avx_unaligned_erms, and a bunch of time in various functions that set up page mappings.
We will come back to __memmove_avx_unaligned_erms in a moment, but for the time being let’s try to figure out exactly how much time is spent mapping pages. I have a neat trick up my sleeve to do that. On Linux, the mmap system call can accept a MAP_POPULATE flag. What this flag does is it forces mmap to pre-populate all the page mappings during the actual system call, so none of the page-mapping work would be done when my test actually runs. So I changed my test to invoke mmap with MAP_POPULATE and learned that the experiment completes about 36% faster. (I only measure the timing of the main loop, and not that of the mmap system call). Therefore, I assume that in the above profile all those mapping functions take up about 36%.
Let’s summarize what we have so far. Using the CPU profile we were able to explain about 82% of the execution time for the syscall experiment (60% spent on copyout, 15% on other domain crossing operations, and 8% on actually reading the file from the buffer cache), and about 96% for the mmap experiment: 60% was spent in user-level memory copy and about 36% in mapping pages.
I also have to tell you that if I tweak the experiment to run for a very long time, accessing the same file in a loop over and over again, the mmap experiment is completely dominated by the memory copy:
The profile of the syscall experiment, if I run it much longer, stays largely the same:
Here is where things get very interesting. A huge portion of time — at least 60% — is spent in copying data. However, the functions used for syscall and mmap are very different, and not only in the name.
__memmove_avx_unaligned_erms, called in the mmap experiment, is implemented using Advanced Vector Extensions (AVX) (here is the source code of the functions that it relies on). The implementation of copy_user_enhanced_fast_string, on the other hand, is much more modest. That, in my opinion, is the huge reason why mmap is faster. Using wide vector instructions for data copying effectively utilizes the memory bandwidth, and combined with CPU pre-fetching makes mmap really really fast.
Why can’t the kernel implementation use AVX? Well, if it did, then it would have to save and restore those registers on each system call, and that would make domain crossing even more expensive. So this was a conscious decision in the Linux kernel.
In the meantime, converting your application to use mmap rather than system calls could make it run faster. That said, mmap is not always convenient to program with, but that’s a subject for another post…