Make your QEMU 10 times faster with this one weird trick
The work on QEMU and this post was paid for by Determinate Systems, and the post was co-published on the Determinate Systems blog.
NixOS uses virtual machines based on QEMU extensively for running its test suite. In order to avoid generating a disk image for every test, the test driver usually boots using a Plan 9 File Protocol (9p) share (server implemented by QEMU) for the Nix store, which contains all the programs and config necessary for the test.
I was working on a VM test that copied a fairly large amount of data (~278k files totaling ~5.3GiB) out of the 9p-mounted Nix store, and was surprised at how long copying this data took. On NVMe devices, I would expect this to take a matter of seconds or minutes, but the test actually ended up taking over 2 hours, most of which was spent copying files from 9p. Since this is untenably long for incremental work, I decided to dig into it a bit, and was able to reduce the duration of the test to only 7 minutes. In this post, I’ll describe the whole journey.
As a preface: I don’t have much experience in debugging performance issues! Most of what I used was novel to me. The first thing I wanted to do was find out where a lot of time was being spent. My guess was that it was in QEMU and not in the guest, though this guess being correct was a matter of pure luck. This stack overflow question described a problem mostly but not entirely unlike mine. This led me down the wonky path of trying to use the poor man’s profiler, a little hack composed of gdb, some shell, and awk.
Stories of surprises and failure: Poor man’s profiler
I immediately ran into a minor roadblock with this approach. gdb said:
warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable. Connect to gdbserver inside the container.
Nix uses Linux namespaces to provide builds with some isolation from the system running the build in order to reduce the effects that the environment of a particular machine can have on the result of the build (”purity”). This includes PID namespaces, which prevent processes within the namespace from touching any processes outside the namespace. gdb was unhappy with being in a different PID namespace from the process it was targeting! I first attempted to get my gdb inside the sandbox using nsenter
. The first surprise I encountered here was that entering a PID namespace does not cause utilities from procps, such as ps
, pgrep
and top
, to report only on processes inside the new namespace:
[root@oak:~]# pgrep -a qemu
1678991 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm [...]
[root@oak:~]# nsenter --target 1678991 --pid
🗣 This spawned a new shell within the build's PID namespace
[root@oak:~]# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
root 1 0 0 Sep02 ? Ss 1:24 /run/current-system/systemd/lib/systemd/systemd
What!? That’s not the PID1 I’m expecting! And I certainly haven’t been running this build since the 2nd of September.
I’ll omit the details of the hour of frustration that ensued, but it turns out that the /proc
which ps
and friends were reading from was still that of the root PID namespace — even though they were no longer running in it! This might cause some funny unexpected behaviour when using tools like pkill
… But that’s a problem for another day.
With my newfound knowledge, I was able to work around this issue by also creating a new mount namespace and mounting the desired proc
filesystem on top of the /proc
we inherited from the root namespace.
[root@oak:~]# nsenter -t 1684606 -p -- unshare -m
🗣 Now inside the build's PID namespace (through nsenter) and a new mount namespace (created by unshare)
[root@oak:~]# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
root 1 0 0 Sep02 ? Ss 1:24 /run/current-system/systemd/lib/systemd/systemd
[root@oak:~]# mount -t proc proc /proc
[root@oak:~]# ps -f 1
UID PID PPID C STIME TTY STAT TIME CMD
nixbld1 1 0 0 12:27 ? Ss 0:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
[root@oak:~]# ps -ef
UID PID PPID C STIME TTY TIME CMD
nixbld1 1 0 0 12:27 ? 00:00:00 bash -e /nix/store/9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh
nixbld1 6 1 0 12:27 ? 00:00:00 /nix/store/pn7863n7s2p66b0gazcylm6cccdwpzaf-python3-3.9.13/bin/python3.9 /nix/store/kdi82vgfixayxaql77j3nj7
nixbld1 7 6 99 12:27 ? 00:04:00 /nix/store/6shk4z9ip57p6vffm5n9imnkwiks9fsa-qemu-host-cpu-only-for-vm-tests-7.0.0/bin/qemu-kvm -cpu max -na
root 46 0 0 12:29 pts/5 00:00:00 -bash
root 79 46 0 12:30 pts/5 00:00:00 ps -ef
🗣 That's better!
[root@oak:~]# pid=7 ./pprof
1500 __futex_abstimed_wait_common,__new_sem_wait_slow64.constprop.1,qemu_sem_timedwait,worker_thread,qemu_thread_start,start_thread,clone3
743 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_read_continue,flatview_read,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
100 syscall,qemu_event_wait,call_rcu_thread,qemu_thread_start,start_thread,clone3
53 ioctl,kvm_vcpu_ioctl,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
45 get_fid,v9fs_read,coroutine_trampoline,__correctly_grouped_prefixwc,??
15 get_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
11 alloc_fid,v9fs_walk,coroutine_trampoline,__correctly_grouped_prefixwc,??
8 get_fid,v9fs_getattr,coroutine_trampoline,__correctly_grouped_prefixwc,??
5 clunk_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
5 alloc_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
4 __lll_lock_wait,pthread_mutex_lock@@GLIBC_2.2.5,qemu_mutex_lock_impl,qemu_mutex_lock_iothread_impl,flatview_write_continue,flatview_write,address_space_rw,kvm_cpu_exec,kvm_vcpu_thread_fn,qemu_thread_start,start_thread,clone3
3 get_fid,v9fs_xattrwalk,coroutine_trampoline,__correctly_grouped_prefixwc,??
3 get_fid,v9fs_open,coroutine_trampoline,__correctly_grouped_prefixwc,??
2 clunk_fid,v9fs_clunk,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 get_fid,v9fs_readlink,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 get_fid,v9fs_readdir,coroutine_trampoline,__correctly_grouped_prefixwc,??
1 address_space_translate_internal,flatview_do_translate.isra,address_space_map,virtqueue_map_desc,virtqueue_split_pop,virtio_blk_handle_vq,virtio_queue_notify_vq.part,aio_dispatch_handler,aio_dispatch,aio_ctx_dispatch,g_main_context_dispatch,main_loop_wait,qemu_main_loop,main
1
The most common points in the code were:
- Locking resources, at 1500 and 743 thread samples. Since the top result was on worker threads, I guessed that that wasn’t interesting and they were just waiting for work to become available.
- Waiting for things to happen inside the VM, at 100 and 53 thread samples. That didn’t seem like a relevant issue though — I’d expect a VM monitor to be waiting for its guest to need something most of the time.
The rest of the numbers were small enough that I (erroneously!) considered them uninteresting too. At this point I became frustrated enough with this branch of my investigation that I gave up on it.
I moved on to quickstack, packaged it, and observed that it couldn’t get me any information on threads, before going all the way back to the Stack Overflow question to follow the other link provided in the answer.
Flame graphs with perf
This was what really got me somewhere. After recording performance data using perf record -F max -a -g -- sleep 20
while the build was running, I was able to generate a flame graph which made the source of the performance problems quite apparent. The following command:
# taken from the flamegraph page linked above
perf script | stackcollapse-perf.pl |
# We're only interested in the qemu process
grep ^qemu |
# Make the graph a lot less tall by collapsing multiple consecutive unknown stack frames together
awk '/unknown/ { gsub("(\\[unknown];){1,}", "[unknown...];", $0) } { print }' |
flamegraph.pl > flamegraph.svg # and generate a flamegraph
produced this nifty interactive SVG graph:
The prevalence of “fid”-related functions is quite apparent in this graph. So I jumped into the 9p docs and QEMU source code to find out that fids are numbers which refer to open files in a 9p connection, similar to file descriptors in the POSIX file API — so there is one fid for every file that the guest has open.
Let’s look at the previous implementation of get_fid
, which QEMU’s 9p server uses in implementations of stat
(getting file metadata), open
(opening a file), and read
(reading data from an open file), amongst others:
static V9fsFidState *coroutine_fn get_fid(V9fsPDU *pdu, int32_t fid)
{
int err;
*f;
V9fsFidState *s = pdu->s;
V9fsState
// Blog note: I've omitted some parts that are irrelevant to performance here.
(f, &s->fid_list, next) {
QSIMPLEQ_FOREACHif (f->fid == fid) {
return f;
}
}
return NULL;
}
QEMU iterates over fid_list
to find the desired fid’s data. Finding an entry in a list by iterating over has a time complexity of O(n) where n is the size of the list — in this case, the list of all the files the guest has open — so this is expensive! Moreover, some inspection of QSIMPLEQ (QEMU simple queue) reveals that it’s implemented as a linked list, a data structure which tends to exhibit poor cache locality.
Since my test copies many files from the 9p filesystem, as well as being booted from it, this lookup happens very frequently:
- One
stat
for getting a file’s metadata (permissions in particular) - One
open
to get a handle on a file - One
read
for small files, or many for larger files, to get the contents of the file
That makes at least 3 operations which perform the lookup, for each of the 278000 files, which bring the inefficient lookup into a hot code path. This was the reason for the slowness.
What we really want is a structure where we can look up entries by fid more cheaply. We can’t just use an array-based vector, which would consistently give us O(1) lookup, because fids are chosen by the client: we can’t rely on every newly allocated fid just being the smallest unoccupied one, and need to support arbitrary 32-bit integers. I opted for a hash table, as conveniently implemented by glib, which QEMU already depends on. That provides us with O(1) best-case complexity, while keeping the worst case at O(n). The exact real-world performance characteristics are significantly more complex than with the linked list, and there may be marginally more suitable data structures (or hash table implementations) out there, but we’re looking for a big easy win and not micro-optimisation here.
Rewriting the relevant code was surprisingly simple and uneventful, to the point that once I had it compiling, it just worked (an experience I’m not sure I’ve ever had before with C!). The results were spectacular: my previously >2h test finished in 7 minutes. It also reduces the build time of NixOS’s ZFS AWS image from 19 minutes to 1. It was pretty clear to me that this needed to go upstream.
QEMU uses an email-based workflow, where patches are sent as emails to a mailing list which maintainers are subscribed to. I hadn’t done this before, and it went somewhat chaotically, as you can see by looking at the threads
(v1
v3
v1a
v5
v6
) in the list archives. There’s a lot of space for mistakes in email-based patch submission, and I made several:
- Forgetting to reply-all when answering review comments, so that the reply is visible to everyone interested and not just the reviewer
- Missing the version tag on a resubmission (in my case, this was because of misleading docs)
- Sending a resubmission as a reply to the previous thread (that was just me failing to read the docs)
- Losing a bunch of patch-documenting work while using git-publish because
git send-email
failed (nixpkgs does not enable Git’s email functionality in the default package, and git-publish unconditionally deletes the working files once the send command has run).
But the reviewers, especially Christian Schoenebeck (thanks!) were helpful and patient, and in the end the patch made it (though an apparently unrelated bug was discovered) and will soon be finding its way to a QEMU version near you!
If you can’t wait and you use Nix, you can pull the patch in with this overlay:
final: prev: {
qemu = prev.qemu.overrideAttrs (o: {
patches = o.patches ++ [ (prev.fetchpatch {
name = "qemu-9p-performance-fix.patch";
url = "https://gitlab.com/lheckemann/qemu/-/commit/8ab70b8958a8f9cb9bd316eecd3ccbcf05c06614.patch";
sha256 = "sha256-PSOv0dhiEq9g6B1uIbs6vbhGr7BQWCtAoLHnk4vnvVg=";
}) ];
});
}
While the core of what I changed wasn’t hugely complex, this adventure had many collateral (positive!) outcomes for me:
- I wrote Nix packages for quickstack (including an upstream PR that makes the build more generic) and git-publish.
- I used
perf
to work on performance issues for the first time, learning how useful it is and basics of how to use it. - I learnt that different methods of profiling can yield wildly different results — the results from perf painted a very different picture from the poor man’s profiler; I don’t understand why yet, but I’ll spend some more time learning about them soon.
- I submitted my first patch to QEMU, getting to know the code a little.
- I submitted my second patch to QEMU (trying to fix the patch submission docs).
- I learnt how to submit patches by email, and which mistakes to avoid.
- I learnt about the 9p protocol.
- My tests now run much faster!
- And I got the warm fuzzy feeling of having made a valuable contribution.
My takeaway from it? Digging into a frustrating problem — even if I’m totally unfamiliar with the code involved and the technology I’m using — can be hugely rewarding!
Not only have I made my own work on the tests a lot more pleasant by reducing turnaround time, but this will go out to all QEMU users, and significantly reduce the load generated by installer tests on NixOS’s build farm.
This is what I love about working on open source software: if something’s wrong, I have the means to go in there and fix it, and can pass the benefits on to everyone else who uses it.
It doesn’t always go as well as this, but this kind of experience can make a great many less successful adventures worth it.