Notes on Virtualization Stack

QEMU, KVM, IOMMU, and more

Yizhou Shan

ys@ucsd.edu

Created: Jan 25, 2020

Last Updated: Jun 30, 2021

Table of Contents

1. Introduction

2. List of Open Source Projects

3. QEMU Source Code Study

3.1. References

3.2. Code Layout

3.3. High-Level Summary

4. I/O Device Virtualization

4.1. Models

4.2. QEMU Device Emulation Design

4.3. QEMU Device Emulation Implementation (Code)

4.4. QEMU + KVM Implementation Code Flow

4.5. virtio/vhost (Paravirtualization for device drivers)

4.6. IOMMU

4.6.1. dma_map

4.6.2. QEMU vIOMMU Emulation

4.6.3. Nested Translation in IOMMU

4.6.4. Device-TLB

4.6.5. References

4.7. VFIO

4.8. Samecore vs. Sidecore Emulation

4.9. AWS Nitro/Microsoft

5. KVM

6. libvirt and virsh

7. Misc Knowledge

7.1. Timer Interrupt and IPI delivery to VMs


​1.​ Introduction

These are rough, raw notes on QEMU and KVM. They include links to various posts, some sentences that try to explain how QEMU interacts with KVM, and code snippets from both QEMU and the Linux kernel KVM modules.

Honestly, I didn’t fully understand the whole QEMU/KVM thing until I decided to take a deep tour recently. At the end of the day, I’m mostly satisfied. I now know how QEMU invokes KVM, how KVM launches a guest, how KVM handles a vmexit, and how a vmexit translates to the userspace-visible exit reason (`KVM_EXIT_*`), all in concrete code. More importantly, I dove deep into how devices are emulated, specifically I/O port and MMIO emulation. It's super interesting and really fundamental. I strongly recommend the QEMU blog post explaining device emulation.

Device emulation is costly. That’s why Amazon now uses Nitro to offload it (not many technical details are available online)! A VM can use the IOMMU to get exclusive access to a device: the IOMMU ensures memory safety and optimizes interrupt delivery to the CPU. SR-IOV improves scalability by making one physical device appear as multiple virtual devices. Combining SR-IOV and IOMMU, each running VM can have exclusive access to a virtual function with no VMM involvement, even if there is just one physical PCIe device. Note that a) the IOMMU can be used without SR-IOV, in which case a physical device can be used by only one VM, and b) in theory, an SR-IOV-capable device can be used without an IOMMU, as long as the guest VM can see host physical addresses. However, the usual practice is to always use SR-IOV with an IOMMU, so that guest physical addresses are translated to host physical addresses.

​2.​ List of Open Source Projects

(If you want to develop a hypervisor or a VMM, you can most likely find code and references in the following projects.)

​3.​ QEMU Source Code Study

​3.1.​ References

​3.2.​ Code Layout

​3.3.​ High-Level Summary

  1. QEMU has many pieces, including dynamic binary translation (the Tiny Code Generator, or TCG), KVM acceleration, device emulation, and more helpers. From my understanding, TCG and KVM are two mutually exclusive ways to run guest code: it’s either TCG or KVM.
  2. The QEMU runtime is mostly event-driven. QEMU uses one “main” thread to run a repeated main loop. Inside the loop, QEMU runs the guest code, either via TCG or via KVM. A single “main” thread can run one vCPU or multiple ones, depending on whether CONFIG_IOTHREAD is enabled (a build option in older QEMU versions). Normal practice is one “main” thread per vCPU.
  3. Conceptually, the “main” thread loop alternates between two contexts: the QEMU context and the guest context. They are mutually exclusive. The goal is to run in guest mode as much as possible, thus dedicating the whole pCPU to the vCPU.
     1. **QEMU->Guest context transition**: When the “main” thread starts (`vl.c`), we run in QEMU context, where we first allocate and prepare miscellaneous state. Next, QEMU calls KVM to allocate a VM. Then the “main” thread asks KVM to start running the guest via an ioctl (i.e., `ioctl(vcpufd, KVM_RUN, NULL)`). *During this particular ioctl call, we transition from the user-mode QEMU context to kernel mode, and the kernel then transitions to the guest context*.
     2. **Guest->QEMU context transition**: Whenever the guest context incurs a vmexit (e.g., an MMIO read/write), the CPU exits to the kernel-mode KVM handlers first. The KVM module then determines whether this particular vmexit should be handled by userspace (note that the kernel KVM module handles some vmexits itself rather than exposing them to userspace). If so, the `ioctl()` syscall that caused the QEMU->Guest transition returns to userspace, and we are back in QEMU context! And repeat.
     3. The whole thing is demonstrated using real code in the QEMU + KVM Implementation Code Flow section.
  4. Other than those “main” threads, QEMU has other worker threads. The motivation is simple: many device operations are asynchronous to the guest, i.e., interrupt-based. The “main” threads offload some tasks to those worker threads, mostly asynchronous ones. For instance, the “main” thread may offload VNC computation or disk operations to worker threads. Upon completion, the worker threads can either send signals or use file descriptors to notify the main thread.
  5. The main loop waits for events: some from the guest, some from worker threads, some from timer expirations. Overall, QEMU is like the Linux kernel in that it needs opportunities to gain control and run its code. The “main” thread cannot always run in guest mode; what if the guest is running spinning code, right? So QEMU uses either timers or signals to give the QEMU context a chance to run.
     1. Actually, I’m not sure how this happens. The user-mode QEMU cannot just regain control. It has to be the kernel-side KVM that helps, right?
  6. To me, with the help of KVM, the main thing left for QEMU is to emulate all the devices. As this blog said, QEMU catches all the IO and MMIO accesses and emulates the effects they would have on a bare-metal machine.
     1. “With QEMU, one thing to remember is that we are trying to emulate what an Operating System (OS) would see on bare-metal hardware”
     2. “And at the end of the day, all virtualization really means is running a particular set of assembly instructions (the guest OS) to manipulate locations within a giant memory map for causing a particular set of side effects, where QEMU is just a user-space application providing a memory map and mimicking the same side effects you would get when executing those guest instructions on the appropriate bare metal hardware.”
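The QEMU<->guest transition described in this list can be sketched as a loop around one blocking "run" call. This is a toy model, not QEMU code: `fake_vcpu`, `fake_kvm_run()`, the `EXIT_*` constants, and `run_vcpu_loop()` are hypothetical stand-ins for the vCPU fd, `ioctl(vcpufd, KVM_RUN, NULL)`, and the `KVM_EXIT_*` reasons.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exit reasons; the real ones are the KVM_EXIT_* constants. */
enum exit_reason { EXIT_IO, EXIT_MMIO, EXIT_HLT };

/* Hypothetical stand-in for a vCPU: a script of exits the "guest" will cause. */
struct fake_vcpu {
    const enum exit_reason *script;
    size_t len;
    size_t pos;
    enum exit_reason exit_reason;   /* plays the role of run->exit_reason */
};

/* Stand-in for ioctl(vcpufd, KVM_RUN, NULL): "runs the guest" until the next
 * exit that userspace must handle, then returns to the caller. */
int fake_kvm_run(struct fake_vcpu *vcpu)
{
    if (vcpu->pos >= vcpu->len)
        return -1;                          /* nothing left to run */
    vcpu->exit_reason = vcpu->script[vcpu->pos++];
    return 0;
}

struct exit_stats { int io; int mmio; };

/* QEMU context: enter the guest, handle the exit, repeat until the guest halts. */
struct exit_stats run_vcpu_loop(struct fake_vcpu *vcpu)
{
    struct exit_stats stats = { 0, 0 };

    while (fake_kvm_run(vcpu) == 0) {       /* QEMU -> guest -> QEMU */
        switch (vcpu->exit_reason) {
        case EXIT_IO:   stats.io++;   break;    /* port I/O would be emulated here */
        case EXIT_MMIO: stats.mmio++; break;    /* MMIO would be emulated here */
        case EXIT_HLT:  return stats;           /* guest is done */
        }
    }
    return stats;
}

/* Runs a fixed script; encodes the handled exits as io * 10 + mmio. */
int demo_handled_exits(void)
{
    static const enum exit_reason script[] = {
        EXIT_IO, EXIT_MMIO, EXIT_MMIO, EXIT_HLT, EXIT_IO
    };
    struct fake_vcpu vcpu = { script, sizeof(script) / sizeof(script[0]), 0, EXIT_HLT };
    struct exit_stats s = run_vcpu_loop(&vcpu);

    return s.io * 10 + s.mmio;
}
```

The exit scripted after `EXIT_HLT` is never reached, which mirrors the point above: QEMU only regains control when the run call returns.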

​4.​ I/O Device Virtualization

This section walks through various bits of I/O virtualization.

​4.1.​ Models

  1. Traditional Device Emulation. In this case, the guest device drivers are not aware of the virtualization environment. At runtime, the VMM (QEMU/KVM) traps all IO and MMIO accesses and emulates the device behavior. The VMM emulates the I/O device to ensure compatibility and then processes I/O operations before passing them on to the physical device (which may be a different device). The downside is obvious: there will be A LOT of vmexits!
  2. Paravirtualized Device Emulation, or virtio. In this case, the guest device drivers are aware of the virtualization environment. This approach uses a front-end driver in the guest that works in concert with a back-end driver in the VMM. These drivers are optimized for sharing and have the benefit of not needing to emulate an entire device. The back-end driver communicates with the physical device. A lot of vmexits can be coalesced, so performance improves.
  3. Direct Assignment. Let a VM talk to the device directly. The guest device drivers can directly access the device configuration space to, e.g., launch a DMA operation. The device can DMA to physical memory in a safe manner via the IOMMU. This is enabled by Intel VT-d. Drawback: direct assignment has limited scalability, because a physical device can only be assigned to one VM.
  4. SR-IOV and Direct Assignment. Incremental to the one above. With SR-IOV, each physical device can appear as multiple virtual ones. Each virtual one can be directly assigned to one VM, and this direct assignment uses the VT-d/IOMMU feature.
  5. AWS Nitro. Details?

  1. Yes, you have to! Read the virtio section. Reasons: 1) there is no complete virtio backend driver in kernel space. Even vhost only has the data path; the control path is still handled by userspace.
  2. That being said, if you want to implement a new VMM to run the Linux kernel, you will have to implement either raw emulation or virtio backends.
  3. Also, I think that’s why every new VMM needs to deal with virtio and vhost. If you check QEMU, cloud-hypervisor, ACRN, etc., they all handle these.
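The gap between the first two models above (trap on every device access vs. post requests in shared memory and notify once) can be sketched with a toy ring. This is only an illustration under simplifying assumptions: `toy_ring`, `guest_post()`, `guest_kick()`, and `backend_drain()` are hypothetical names, and real virtio uses separate descriptor, available, and used rings.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 16

/* Hypothetical shared ring between a guest front-end and a VMM back-end. */
struct toy_ring {
    uint32_t buf[RING_SIZE];
    unsigned head;       /* advanced by the guest front-end */
    unsigned tail;       /* advanced by the back-end */
    int exits;           /* guest -> VMM notifications; each one costs a vmexit */
    uint64_t consumed;   /* sum of the values the back-end has processed */
};

/* Front-end: publish a request in shared memory. No trap is needed for this. */
int guest_post(struct toy_ring *r, uint32_t value)
{
    if (r->head - r->tail == RING_SIZE)
        return -1;                          /* ring full */
    r->buf[r->head % RING_SIZE] = value;
    r->head++;
    return 0;
}

/* Back-end: drain everything that has been posted so far. */
void backend_drain(struct toy_ring *r)
{
    while (r->tail != r->head) {
        r->consumed += r->buf[r->tail % RING_SIZE];
        r->tail++;
    }
}

/* Front-end: one notification ("kick") covers every request posted before it. */
void guest_kick(struct toy_ring *r)
{
    r->exits++;
    backend_drain(r);
}

/* Posts n requests (n <= RING_SIZE), kicks once, and returns the exit count,
 * or -1 if the back-end did not see every request. */
int demo_batched_exits(int n)
{
    struct toy_ring r = { {0}, 0, 0, 0, 0 };

    for (int i = 0; i < n; i++)
        guest_post(&r, (uint32_t)i + 1);
    guest_kick(&r);
    return (r.consumed == (uint64_t)n * (n + 1) / 2) ? r.exits : -1;
}
```

With a fully emulated device, each of those n requests would typically cost at least one trapped register access; here the whole batch costs a single notification.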

Further Reading

​4.2.​ QEMU Device Emulation Design

  1. QEMU declares a memory region.
  2. The guest’s first access to an MMIO address causes an EPT violation vmexit.
  3. KVM constructs the EPT page table entry and marks the PTE with a special (intentionally misconfigured) value.
  4. Later guest accesses to this MMIO range are handled by the EPT misconfig VM-exit handler.
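Once an MMIO exit reaches userspace, the remaining work is to find the memory region that covers the faulting address and call the device's callbacks. A minimal sketch of that dispatch follows; `mmio_region`, `mmio_dispatch()`, and the toy device are hypothetical and much simpler than QEMU's real MemoryRegion machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callbacks, loosely modeled on the read/write hooks a device registers. */
typedef uint64_t (*mmio_read_fn)(void *opaque, uint64_t offset);
typedef void (*mmio_write_fn)(void *opaque, uint64_t offset, uint64_t value);

struct mmio_region {
    uint64_t base;
    uint64_t size;
    mmio_read_fn read;
    mmio_write_fn write;
    void *opaque;
};

/* Toy device: a scratch register at offset 0 and a read-only ID at other offsets. */
struct toy_dev { uint64_t scratch; };

uint64_t toy_read(void *opaque, uint64_t offset)
{
    struct toy_dev *d = opaque;
    return offset == 0 ? d->scratch : 0x1234;
}

void toy_write(void *opaque, uint64_t offset, uint64_t value)
{
    struct toy_dev *d = opaque;
    if (offset == 0)
        d->scratch = value;     /* writes to the ID register are ignored */
}

/* Find the region covering addr and forward the access.
 * Returns 0 on success, -1 if no region claims the address. */
int mmio_dispatch(struct mmio_region *regions, size_t n, uint64_t addr,
                  uint64_t *data, int is_write)
{
    for (size_t i = 0; i < n; i++) {
        struct mmio_region *r = &regions[i];

        if (addr >= r->base && addr - r->base < r->size) {
            if (is_write)
                r->write(r->opaque, addr - r->base, *data);
            else
                *data = r->read(r->opaque, addr - r->base);
            return 0;
        }
    }
    return -1;
}

/* Guest store followed by a guest load of the same register. */
uint64_t demo_mmio_roundtrip(void)
{
    struct toy_dev dev = { 0 };
    struct mmio_region regions[] = {
        { 0xfe000000u, 16, toy_read, toy_write, &dev },
    };
    uint64_t v = 0xabcd;

    mmio_dispatch(regions, 1, 0xfe000000u, &v, 1);
    v = 0;
    mmio_dispatch(regions, 1, 0xfe000000u, &v, 0);
    return v;
}

/* An access outside every region is reported as unmapped. */
int demo_mmio_unmapped(void)
{
    uint64_t v = 0;
    return mmio_dispatch(NULL, 0, 0x1000, &v, 0);
}
```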

​4.3.​ QEMU Device Emulation Implementation (Code)

In this section, I want to take a tour of how QEMU works without accelerators like KVM. I think at this point we already know the basic designs, so this section focuses on code details. (Added on Jun 19, 2021.)

References

  1. QEMU's instance_init() vs. realize() (redhat.com) 

​4.4.​ QEMU + KVM Implementation Code Flow

For QEMU, KVM is a kind of accelerator. By default, QEMU uses binary translation. With KVM, QEMU is able to run guest instructions natively. To do so, QEMU must interact with the Linux KVM module via ioctl. Also, with KVM, the device emulation flow is slightly different, as an access traps to the host kernel first and then bounces back to userspace QEMU.

Also note there are many examples demonstrating the KVM flow. QEMU-KVM is one, the rust-virt vmm is another. Generally, this is where you should start if you want to write your own virtualization stack.

qemu: accel/kvm/kvm-all.c

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            DPRINTF("handle_io\n");
            /* Called outside BQL */
            kvm_handle_io(run->io.port, attrs,
                          (uint8_t *)run + run->io.data_offset,
                          run->io.direction,
                          run->io.size,
                          run->io.count);
            ret = 0;
            break;
        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            /* Called outside BQL */
            address_space_rw(&address_space_memory,
                             run->mmio.phys_addr, attrs,
                             run->mmio.data,
                             run->mmio.len,
                             run->mmio.is_write);
            ret = 0;
            break;

kvm_set_user_memory_region

                /* KVM_EXIT_MMIO */
                struct {
                        __u64 phys_addr;
                        __u8  data[8];
                        __u32 len;
                        __u8  is_write;
                } mmio;

If exit_reason is KVM_EXIT_MMIO, then the vcpu has executed a memory-mapped I/O instruction which could not be satisfied by kvm.  The 'data' member contains the written data if 'is_write' is true, and should be filled by application code otherwise.

(Linux)

x86_emulate_instruction()
  -> x86_emulate_insn
       -> exec() -> IO/MMIO -> fill in the  KVM_EXIT_IO info etc

(Linux)

kvm_mmu_page_fault (this is called from the VM EXIT handler array)
  -> x86_emulate_instruction

(QEMU -> Linux -> QEMU)

QEMU kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
 -> Linux kvm_arch_vcpu_ioctl_run
     -> vcpu_run
          -> vcpu_enter_guest
                 -> kvm_x86_ops->run(vcpu); (run the guest!)
                 -> handle_exit_irqoff()
                 -> handle_exit() which is vmx_handle_exit
                     -> handle all the vmexit, fill in the KVM_EXIT reasons.
                         (kvm_vmx_exit_handlers[exit_reason](vcpu))
                         -> handle_ept_misconfig (just one of many handlers!)
                             -> kvm_mmu_page_fault
                                 -> x86_emulate_instruction
                                     -> x86_emulate_insn
                                          -> exec() -> IO/MMIO -> fill KVM_EXIT_IO

 

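(If vcpu_enter_guest returns a value <= 0, vcpu_run breaks out of its loop and returns to userspace, where the QEMU code above can inspect the KVM_EXIT reason. A return value of 1 means the exit was handled inside the kernel and the guest is simply re-entered.)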
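The kernel side of this flow, a handler table indexed by the hardware exit reason where each handler decides whether to re-enter the guest or go back to userspace, can be modeled in a few lines. The names below (`toy_vcpu`, `handle_cpuid()`, the `verdict` enum, and so on) are hypothetical; only the shape follows `kvm_vmx_exit_handlers[exit_reason](vcpu)`.

```c
#include <assert.h>

/* Hypothetical hardware exit reasons and handler verdicts (illustrative names). */
enum hw_exit { HW_EXIT_CPUID, HW_EXIT_EPT_MISCONFIG, HW_EXIT_MAX };
enum verdict { RESUME_GUEST, EXIT_TO_USERSPACE };
enum user_exit { USER_EXIT_NONE, USER_EXIT_MMIO };

struct toy_vcpu {
    enum user_exit exit_reason;   /* what userspace sees after the run call returns */
    int in_kernel_exits;          /* exits absorbed without leaving the kernel */
};

enum verdict handle_cpuid(struct toy_vcpu *vcpu)
{
    vcpu->in_kernel_exits++;              /* fully emulated in the kernel */
    return RESUME_GUEST;
}

enum verdict handle_ept_misconfig(struct toy_vcpu *vcpu)
{
    vcpu->exit_reason = USER_EXIT_MMIO;   /* needs a device model in userspace */
    return EXIT_TO_USERSPACE;
}

/* Handler table indexed by hardware exit reason. */
typedef enum verdict (*exit_handler)(struct toy_vcpu *);
static const exit_handler handlers[HW_EXIT_MAX] = {
    [HW_EXIT_CPUID] = handle_cpuid,
    [HW_EXIT_EPT_MISCONFIG] = handle_ept_misconfig,
};

/* Kernel-side loop: keep re-entering the guest until a handler asks for
 * userspace. Returns the number of guest entries that were made. */
int toy_vcpu_run(struct toy_vcpu *vcpu, const enum hw_exit *script, int n)
{
    for (int i = 0; i < n; i++) {         /* one iteration = enter guest, take one exit */
        if (handlers[script[i]](vcpu) == EXIT_TO_USERSPACE)
            return i + 1;
    }
    return n;
}

/* Encodes the outcome as entries * 10 + in-kernel exits, or -1 on surprise. */
int demo_in_kernel_exits(void)
{
    const enum hw_exit script[] = {
        HW_EXIT_CPUID, HW_EXIT_CPUID, HW_EXIT_EPT_MISCONFIG, HW_EXIT_CPUID
    };
    struct toy_vcpu vcpu = { USER_EXIT_NONE, 0 };
    int entries = toy_vcpu_run(&vcpu, script, 4);

    return vcpu.exit_reason == USER_EXIT_MMIO
               ? entries * 10 + vcpu.in_kernel_exits
               : -1;
}
```

In this script two exits never reach userspace at all, which is the point made earlier: KVM handles some vmexits itself and only surfaces the ones that need QEMU.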

​4.5.​ virtio/vhost (Paravirtualization for device drivers)

Notes and references

I think when you start QEMU, you can choose whether to use `virtio` or `vhost` for the network/block/etc. devices. If you choose `vhost`, QEMU uses the host’s `/dev/vhost-xxx` interface.

Source code:

​4.6.​ IOMMU

IOMMU can be used for virtualization or non-virtualization cases. If a device is directly assigned to a guest, IOMMU must be used, because we need to prevent the guest drivers from corrupting arbitrary hypervisor memory.

The BIOS presents IOMMU-related information via the ACPI DMAR table. The Linux IOMMU driver parses the table and builds the necessary data structures. The whole Linux intel-iommu.c follows the Intel VT-d specification, i.e., it sets up the Root Table and Context Entries and talks to the IOMMU via MMIO register accesses. The IOMMU driver is involved in every DMA-related operation, because it needs to prepare the IOMMU page table entries.

The IOMMU can also be emulated. QEMU can emulate an IOMMU for the guest, so the guest Linux can also run its intel-iommu code. The emulation is done in the same way as other MMIO emulation. (See the QEMU vIOMMU link below for why we need this.)

The Intel IOMMU is very flexible. Each device can have more than one associated address space, so presumably each client VM can install its own address space onto a device. In addition, the IOMMU can translate not only from gPA but also from gVA (think about an RNIC in virtualized usage!), among others. This is described in the Intel VT-d spec:

​4.6.1.​ dma_map

Drivers call dma_map and dma_unmap before and after each DMA operation.

Remember dma_map? I implemented it in LegoOS, mainly for IB’s needs. At the time, I only implemented a pci-nommu version, which means dma_map just does a “kernel virtual address to physical address” translation and nothing else, no additional setup. The code is here: https://github.com/WukLab/LegoOS/blob/master/arch/x86/kernel/pci-nommu.c#L50

Linux has the real IOMMU-based dma_map and dma_unmap. Ultimately, they use the dma_map_ops from intel-iommu.c in torvalds/linux. I took a brief read of the code, and my understanding is that it follows the Intel VT-d specification. More specifically, the intel-iommu.c source file will:

  1. Allocate the Root Table and write its address into the IOMMU registers
  2. Allocate context pages
  3. Set up the IOMMU page tables during dma_map
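The three steps above can be sketched with a toy model of the VT-d lookup structures: a root table indexed by bus number, a context table indexed by devfn, and a per-device page table filled in at map time. This is a heavily simplified illustration, not the intel-iommu.c code: all `toy_*` names are hypothetical, and the single-level 16-entry page table stands in for the real multi-level one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12
#define TOY_PAGES  16            /* toy single-level table covering 16 IOVA pages */
#define TOY_FAULT  UINT64_MAX    /* returned when a translation is missing */

struct toy_pgtable { uint64_t pfn[TOY_PAGES]; uint8_t present[TOY_PAGES]; };
struct toy_context { struct toy_pgtable *pgtable; };     /* one entry per devfn */
struct toy_iommu   { struct toy_context *root[256]; };   /* root[bus] -> 256 context entries */

/* Driver side: install iova -> pa for one device, allocating tables lazily. */
int toy_dma_map(struct toy_iommu *iommu, uint8_t bus, uint8_t devfn,
                uint64_t iova, uint64_t pa)
{
    uint64_t idx = iova >> PAGE_SHIFT;
    struct toy_context *ctx;

    if (idx >= TOY_PAGES)
        return -1;
    if (!iommu->root[bus]) {                 /* context table for this bus */
        iommu->root[bus] = calloc(256, sizeof(struct toy_context));
        if (!iommu->root[bus])
            return -1;
    }
    ctx = &iommu->root[bus][devfn];
    if (!ctx->pgtable) {                     /* page table for this device */
        ctx->pgtable = calloc(1, sizeof(struct toy_pgtable));
        if (!ctx->pgtable)
            return -1;
    }
    ctx->pgtable->pfn[idx] = pa >> PAGE_SHIFT;
    ctx->pgtable->present[idx] = 1;
    return 0;
}

/* Hardware side: walk root -> context -> page table for a DMA from bus:devfn. */
uint64_t toy_translate(const struct toy_iommu *iommu, uint8_t bus, uint8_t devfn,
                       uint64_t iova)
{
    const struct toy_context *table = iommu->root[bus];
    uint64_t idx = iova >> PAGE_SHIFT;

    if (!table || !table[devfn].pgtable || idx >= TOY_PAGES ||
        !table[devfn].pgtable->present[idx])
        return TOY_FAULT;
    return (table[devfn].pgtable->pfn[idx] << PAGE_SHIFT) |
           (iova & ((1ull << PAGE_SHIFT) - 1));
}

/* Maps one page for device 00:05.0 (devfn 0x28), then translates as `devfn`. */
uint64_t demo_translate_as(uint8_t devfn)
{
    static struct toy_iommu iommu;           /* zero-initialized root table */

    toy_dma_map(&iommu, 0, 0x28, 0x3000, 0x7f000);
    return toy_translate(&iommu, 0, devfn, 0x3abc);
}
```

A DMA from any other device on the same bus faults, because its context entry has no page table: that per-device isolation is what makes direct assignment safe.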

For Linux, there are multiple dma_ops: nommu, iommu, and swiotlb. If an Intel IOMMU is not present in the ACPI table, swiotlb is usually used by default. SWIOTLB is essentially a bounce buffer; I don’t really get why it is the default. LegoOS has the nommu version, which really does nothing.
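To make the "bounce buffer" idea concrete, here is a minimal sketch of what a bounce-style map/unmap pair does: the device is handed a buffer it can reach, and data is copied between that buffer and the driver's real buffer. The names `toy_bounce`, `toy_bounce_map()`, and `toy_bounce_unmap()` are hypothetical and do not correspond to the kernel's swiotlb functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_SIZE 64

/* Hypothetical bounce slot: memory the device can reach, plus bookkeeping. */
struct toy_bounce {
    unsigned char pool[BOUNCE_SIZE];
    void *orig;
    size_t len;
};

/* "Map": give the device the bounce slot instead of the driver's buffer.
 * If the device is going to read the data, copy it in first. */
void *toy_bounce_map(struct toy_bounce *b, void *buf, size_t len, int to_device)
{
    if (len > BOUNCE_SIZE)
        return NULL;
    b->orig = buf;
    b->len = len;
    if (to_device)
        memcpy(b->pool, buf, len);
    return b->pool;
}

/* "Unmap": if the device wrote data, copy it back to the driver's buffer. */
void toy_bounce_unmap(struct toy_bounce *b, int from_device)
{
    if (from_device)
        memcpy(b->orig, b->pool, b->len);
}

/* Simulates a receive: the "device" DMAs into the slot, the driver sees the
 * data only after unmap. Returns 1 if the data arrived intact. */
int demo_bounce_receive(void)
{
    struct toy_bounce slot = { {0}, NULL, 0 };
    char driver_buf[8] = "";
    char *dma_addr = toy_bounce_map(&slot, driver_buf, sizeof(driver_buf), 0);

    if (!dma_addr)
        return 0;
    memcpy(dma_addr, "packet", 7);           /* stands in for the device's DMA write */
    toy_bounce_unmap(&slot, 1);
    return strcmp(driver_buf, "packet") == 0;
}
```

The extra copy is the cost of this scheme, which is why it contrasts with the nommu version that only computes an address.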

So when you enable pass-through and expose a device to guest this is what happens:

One thing I still don’t understand: (Answer below, in VFIO section)

​4.6.2.​ QEMU vIOMMU Emulation

​4.6.3.​ Nested Translation in IOMMU

The IOMMU can also do gVA -> hPA translation using its two-level (nested) translation. The device needs to work with the physical IOMMU: it can send requests to the IOMMU to ask for address translation.

To utilize nested translation in the IOMMU, QEMU must expose a companion vIOMMU to the guest VM. QEMU needs to intercept and record whatever changes the guest makes to the vIOMMU, i.e., the gVA->gPA mappings the guest is trying to set up. QEMU then installs them onto the physical IOMMU.
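The nested walk itself is just two lookups chained together: a first stage owned by the guest (gVA -> gPA) and a second stage owned by the host (gPA -> hPA). The sketch below uses flat tables and hypothetical names (`toy_stage`, `nested_translate()`); real hardware also translates the first-stage page-table pointers through the second stage, which this toy ignores.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     8
#define NO_MAPPING UINT64_MAX

/* One translation stage as a flat table: index = input page, value = output page. */
struct toy_stage { uint64_t out_pfn[NPAGES]; };

void stage_init(struct toy_stage *s)
{
    for (int i = 0; i < NPAGES; i++)
        s->out_pfn[i] = NO_MAPPING;
}

uint64_t stage_lookup(const struct toy_stage *s, uint64_t addr)
{
    uint64_t pfn = addr >> PAGE_SHIFT;

    if (pfn >= NPAGES || s->out_pfn[pfn] == NO_MAPPING)
        return NO_MAPPING;
    return (s->out_pfn[pfn] << PAGE_SHIFT) | (addr & ((1ull << PAGE_SHIFT) - 1));
}

/* Nested walk: guest-owned first stage, then host-owned second stage. */
uint64_t nested_translate(const struct toy_stage *first,
                          const struct toy_stage *second, uint64_t gva)
{
    uint64_t gpa = stage_lookup(first, gva);

    return gpa == NO_MAPPING ? NO_MAPPING : stage_lookup(second, gpa);
}

uint64_t demo_nested(uint64_t gva)
{
    struct toy_stage first, second;

    stage_init(&first);
    stage_init(&second);
    first.out_pfn[1] = 3;         /* guest maps gVA page 1 -> gPA page 3 */
    second.out_pfn[3] = 0x500;    /* host maps gPA page 3 -> hPA page 0x500 */
    return nested_translate(&first, &second, gva);
}
```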

[RFC Design Doc v3] Enable Shared Virtual Memory feature in pass-through scenarios 

Shared Virtual Memory in KVM 

“... virtualization. It is to let application programs(running in guest)share their virtual address with assigned device(e.g. graphics processors or accelerators)”

QUESTION (see paragraph above): It seems RDMA/GPU cards can both use this feature, right? But I’ve never seen anyone mention this before. My impression has always been that RDMA/GPU cards do their own gVA to gPA translation and do not care whether an IOMMU is present.

Now the question is to find out who is actually using this nested IOMMU translation.

​4.6.4.​ Device-TLB

The Intel VT-d spec talks about Device-TLB. As its name suggests, devices can cache some translation entries on chip! There is a protocol between the device and the IOMMU (PCIe Address Translation Services, ATS).

Do note, this is totally different from RDMA/GPU’s own VA->PA translation facility.

This one is IOMMU-specific and generic to all devices that want to use it. And it seems it is used for the nested translation above (gVA->hPA).
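A rough sketch of the Device-TLB idea: the device caches translations it got from the IOMMU, only asks again on a miss, and drops an entry when told to invalidate it. Everything here is a toy under stated assumptions: `toy_device`, `device_translate()`, `device_tlb_invalidate()`, and the fixed-offset `toy_iommu_translate_pfn()` are hypothetical and do not model the actual wire protocol.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define TLB_SLOTS  4

struct tlb_entry { uint64_t iova_pfn; uint64_t pa_pfn; int valid; };

struct toy_device {
    struct tlb_entry tlb[TLB_SLOTS];
    int translation_requests;     /* how often the device had to ask the IOMMU */
};

/* Hypothetical IOMMU-side answer to a translation request (fixed offset). */
uint64_t toy_iommu_translate_pfn(uint64_t iova_pfn)
{
    return iova_pfn + 0x100;
}

/* Device side: use the cached entry on a hit, ask the IOMMU on a miss. */
uint64_t device_translate(struct toy_device *d, uint64_t iova)
{
    uint64_t pfn = iova >> PAGE_SHIFT;
    struct tlb_entry *e = &d->tlb[pfn % TLB_SLOTS];

    if (!e->valid || e->iova_pfn != pfn) {
        d->translation_requests++;
        e->iova_pfn = pfn;
        e->pa_pfn = toy_iommu_translate_pfn(pfn);
        e->valid = 1;
    }
    return (e->pa_pfn << PAGE_SHIFT) | (iova & ((1ull << PAGE_SHIFT) - 1));
}

/* Invalidation sent to the device when the mapping for iova changes. */
void device_tlb_invalidate(struct toy_device *d, uint64_t iova)
{
    uint64_t pfn = iova >> PAGE_SHIFT;
    struct tlb_entry *e = &d->tlb[pfn % TLB_SLOTS];

    if (e->valid && e->iova_pfn == pfn)
        e->valid = 0;
}

/* Miss, hit, invalidate, miss again: returns the number of IOMMU requests. */
int demo_requests(void)
{
    struct toy_device dev = { { {0, 0, 0} }, 0 };

    device_translate(&dev, 0x2000);
    device_translate(&dev, 0x2008);
    device_tlb_invalidate(&dev, 0x2000);
    device_translate(&dev, 0x2010);
    return dev.translation_requests;
}
```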

​4.6.5.​ References:

​4.7.​ VFIO

Funny that I missed vfio in the first place, since it is such a critical piece. vfio exposes a device’s configuration spaces to user space (via mmap, of course), so that people can run user-level device drivers!

To understand that, you first need to understand how a driver talks to a device: PIO, MMIO, interrupts, and DMA. The most important one is of course MMIO (assuming PCIe devices): interrupts and DMA happen because the driver touched the configuration space in the first place. vfio exposes the PCIe configuration space to userspace, so that people can write a device driver just as they would in kernel space.

But a DMA-capable device is able to write anywhere. That’s why we need the IOMMU, and that’s why, within the kernel, the vfio subsystem is closely bound to the IOMMU part.


The use cases of vfio are straightforward. Both of them need to directly access the device:

QEMU uses VFIO to directly assign physical devices to a guest OS, and the command line option is `-device vfio-pci,host=00:05.0`. In fact, a virsh PCIe device passthrough configuration eventually translates to the above QEMU option. (Check /var/log/libvirt/qemu/$vmname)

Note that QEMU itself has a valid VA->PA mapping to access the physical device’s configuration spaces, established via mmap and ioctl. When QEMU launches the guest, it exposes this part of its VA space to the guest OS, so when the guest OS accesses the physical device’s configuration space, the access just goes through without any EPT VMEXIT. (The mechanism should be: QEMU registers the PCI device’s memory region via its memory_region_init_*() API, which finally asks KVM to set up the EPT page tables.)

As for the IOMMU mapping, QEMU will use ioctl to ask vfio driver to install the mappings. I think QEMU will simply map all guest’s memory (gPA -> hPA), so whatever gPA is used by guest device driver, IOMMU has a valid PTE. Wow, this actually solved my concern!

(The assumption is that QEMU allocates the guest memory at startup, i.e., it eagerly allocates all memory. Is this true?)

The usage of the IOMMU is really smart here. Even though the user app can use ioctl to ask the vfio kernel driver to set up mappings, the user app is only allowed to map pages that belong to itself! Thus the IOMMU is safe, thus DMA is safe, thus the whole “userspace device driver” is safe.
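Why the mapping request is safe can be shown with a toy model. In the real interface (the `VFIO_IOMMU_MAP_DMA` ioctl), the caller passes a virtual address from its own address space plus an IOVA, and the kernel resolves that virtual address itself. The sketch below imitates only that one property; `toy_proc`, `toy_map_dma()`, and the page list are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical view of a process: the pages it has mapped (virtual -> physical). */
struct toy_page { uint64_t va_pfn; uint64_t pa_pfn; };
struct toy_proc { const struct toy_page *pages; size_t npages; };

struct toy_iommu_entry { uint64_t iova_pfn; uint64_t pa_pfn; int valid; };

/* Stand-in for the kernel side of a DMA-map request. The caller names one of
 * its own virtual addresses; the kernel resolves it through the caller's page
 * list, so the IOMMU entry can only point at memory the caller owns. */
int toy_map_dma(const struct toy_proc *caller, uint64_t vaddr, uint64_t iova,
                struct toy_iommu_entry *out)
{
    for (size_t i = 0; i < caller->npages; i++) {
        if (caller->pages[i].va_pfn == (vaddr >> PAGE_SHIFT)) {
            out->iova_pfn = iova >> PAGE_SHIFT;
            out->pa_pfn = caller->pages[i].pa_pfn;
            out->valid = 1;
            return 0;
        }
    }
    return -1;      /* vaddr is not backed by the caller's memory: refuse */
}

/* Returns the physical page number installed for vaddr, or -1 if refused. */
int demo_map(uint64_t vaddr)
{
    static const struct toy_page pages[] = { { 0x400, 0x9000 }, { 0x401, 0x9abc } };
    struct toy_proc proc = { pages, 2 };
    struct toy_iommu_entry e = { 0, 0, 0 };
    int ret = toy_map_dma(&proc, vaddr, 0x10000, &e);

    return ret == 0 ? (int)e.pa_pfn : ret;
}
```

Because the caller never names a physical address, there is no way for it to point the device at someone else's memory.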

As for interrupts, it says they are implemented with file descriptors plus eventfd. Sure, shouldn’t be too hard. :p

References:

​4.8.​ Samecore v.s. Sidecore Emulation

Well, most QEMU/KVM device emulation is samecore, right? For instance, the serial device is definitely samecore, as you can see from the above QEMU/KVM code flow.

We also know that QEMU has some worker threads. The “main” thread will offload jobs to them, and the jobs include I/O related stuff, right? So maybe the state-of-the-art QEMU is already doing both samecore and sidecore emulation?

Well, it depends on how you define “sidecore emulation”, I guess. If it means avoiding vmexits altogether, then it’s slightly different.

​4.9.​ AWS Nitro/Microsoft

The latest in this design space is to use dedicated hardware for IO virtualization, like AWS Nitro and Microsoft SmartNIC.

The device exposes a unique partition/MMIO region to each guest. The guest sees a raw passthrough device. Within the device itself, there is a software stack handling the necessary virtualization tasks. These tasks are similar to what QEMU was doing.

​5.​ KVM


Source code:

References:

How to use KVM APIs:

​6.​ libvirt and virsh

​7.​ Misc Knowledge

​7.1.​ Timer Interrupt and IPI delivery to VMs

So today (Apr 2, 2020) I came across the VEE’20 Directvisor paper, which tries to deploy a very thin hypervisor on bare-metal cloud to regain some manageability. The general approach is to reduce VM exits (by disabling the vt-d feature) and to deliver interrupts to the VM directly without a VM exit. It describes the flow of the current timer interrupt and IPI delivery mechanism. I found it useful, so I am posting it here: