An overview of NVMe and its support on Maestro


Summary

A technical overview of implementing an NVMe driver for the Maestro operating system, covering PCIe interface, queues, and memory-mapped I/O details.



# Non-Volatile Memory Express (NVMe) Source: [https://blog.lenot.re/a/nvme](https://blog.lenot.re/a/nvme) As stated in my[previous blog article](https://blog.lenot.re/a/2025-retrospective), I started an implementation of an**NVMe driver**\. It is now functional, and this article is an overview of it\! NVMe \(Non\-Volatile Memory Express\) is the modern standard for SSD drives\. Its specification is defined by the[NVM Express Consortium](https://nvmexpress.org/)\. Now you may ask: Maestro supports[PATA](https://en.wikipedia.org/wiki/Parallel_ATA)and NVMe, why not implement SATA first? The answer is pretty straightforward:**NVMe support is easy to implement\!** Now you may have noticed that it took a few months before this article came out\. While the implementation of the driver was easy, my kernel needed a bit of redesign in some places to fix design flaws that prevented the driver from working correctly\. We will talk about this later\. ## The NVMe interface > For the implementation of this driver, I base myself on the[version 1\.3d](https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3d-2019.03.20-Ratified.pdf)of the specification \(mainly because it is easy to read\) From the point of view of a kernel, the**NVMe controller**is exposed as a device on the**PCIe**bus\. The PCIe provides**BAR**s \(Base Address Register\) that are addresses in memory where the NVMe controller’s registers are located\. Reading or writing to memory at the address of a BAR performs I/O with the NVMe controller\. Note that depending on cases, it may be necessary to**disable the cache**on BAR memory or enable**Write\-Through**or**Write\-Combining**\. > The**Write\-Through**flag on a virtual memory map tells the CPU writing to memory should be directly forwarded to the underlying device \(RAM, or other\)\. **Write\-Combining**works the same, except the CPU waits for the whole cache line to be written before forwarding \(which usually improves performance relative to Write\-Through\)\. 
The NVMe exposes a set of registers on BAR memory (non-exhaustive):

| Offset | Name | Description |
|---|---|---|
| `0x00-0x07` | CAP | Controller capabilities |
| `0x08-0x0B` | VS | Version |
| `0x0C-0x0F` | INTMS | Interrupt mask set |
| `0x10-0x13` | INTMC | Interrupt mask clear |
| `0x14-0x17` | CC | Controller configuration |
| `0x1C-0x1F` | CSTS | Controller status |
| `0x24-0x27` | AQA | Admin queue attributes |
| `0x28-0x2F` | ASQ | Admin submission queue |
| `0x30-0x37` | ACQ | Admin completion queue |
| `0x1000+(2X)*Y` | SQxTDBL | Submission queue X tail doorbell |
| `0x1000+(2X+1)*Y` | CQxHDBL | Completion queue X head doorbell |

> This table has been shamelessly stolen from [osdev.org](https://wiki.osdev.org/NVMe)

`Y` is determined by reading the `CAP` register.

In the NVMe nomenclature, a disk is called a **namespace**. Namespaces are attached to a controller, with which the OS communicates.

> Under Linux, in the `/dev` directory, you may see devices under the form `nvmeX`, `nvmeXnY` or `nvmeXnYpZ`. `X` represents the ID of the controller, `Y` is the ID of the namespace on the controller, and `Z` is the ID of the partition on the namespace.

The specific details of the initialisation procedure of the NVMe controller are not very interesting and won’t be covered here. This is an overview rather than a tutorial.

### Submission and completion queues

NVMe works with asynchronous I/O, through **submission** and **completion** queues. The OS writes commands in the submission queue, and the NVMe controller writes command results back into the completion queue.

> If you are familiar with Linux’s low-level interfaces, this should remind you about [io_uring](https://en.wikipedia.org/wiki/Io_uring).

One of the main advantages of this approach is that the NVMe controller can choose the order in which it processes commands, allowing optimisations.

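
Queues are driven through the doorbell registers listed in the table above, whose offsets depend on the stride `Y`. As a small sketch: in version 1.3 of the specification, the stride is encoded in the `DSTRD` field of `CAP` (bits 35:32) and is `4 << DSTRD` bytes. The function names below are made up for this example.

```rust
/// Doorbell stride in bytes (`Y` in the register table), computed from the
/// 64-bit value of the CAP register.
/// CAP.DSTRD occupies bits 35:32 and the stride is 2^(2 + DSTRD) bytes.
pub fn doorbell_stride(cap: u64) -> usize {
    let dstrd = ((cap >> 32) & 0xF) as usize;
    4usize << dstrd
}

/// Byte offset (from the start of the BAR) of the tail doorbell of
/// submission queue `qid`: 0x1000 + (2X) * Y.
pub fn sq_tail_doorbell(cap: u64, qid: usize) -> usize {
    0x1000 + (2 * qid) * doorbell_stride(cap)
}

/// Byte offset (from the start of the BAR) of the head doorbell of
/// completion queue `qid`: 0x1000 + (2X + 1) * Y.
pub fn cq_head_doorbell(cap: u64, qid: usize) -> usize {
    0x1000 + (2 * qid + 1) * doorbell_stride(cap)
}
```

With `DSTRD` set to zero, the doorbells are packed every 4 bytes: the admin queues (ID 0) use offsets `0x1000` and `0x1004`, and the first I/O queue pair uses `0x1008` and `0x100C`.
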
A submission queue contains structures with the following layout:

| Bytes | Field | Description |
|---|---|---|
| 3:0 | CDW0 | Opcode [7:0], fused operation [9:8], PSDT [15:14], command ID [31:16] |
| 7:4 | NSID | Namespace identifier |
| 15:8 | - | Reserved |
| 23:16 | MPTR | Metadata pointer |
| 31:24 | PRP1 | Physical Region Page entry 1 (source/destination in physical memory) |
| 39:32 | PRP2 | Physical Region Page entry 2, or pointer to a PRP list |
| 43:40 | CDW10 | Command-specific |
| 47:44 | CDW11 | Command-specific |
| 51:48 | CDW12 | Command-specific |
| 55:52 | CDW13 | Command-specific |
| 59:56 | CDW14 | Command-specific |
| 63:60 | CDW15 | Command-specific |

When a command is completed, the NVMe controller writes a structure in the completion queue with the following layout:

| Bytes | Field | Description |
|---|---|---|
| 3:0 | DW0 | Command-specific result |
| 7:4 | DW1 | Reserved |
| 9:8 | SQHD | Submission queue head pointer (updated by controller after processing) |
| 11:10 | SQID | Submission queue identifier |
| 13:12 | CID | Command identifier (matches the submission entry’s CDW0 [31:16]) |
| 15:14 | - | Phase tag [0], status field [15:1] |

After writing this structure, the controller issues an interrupt to signal the OS that there is something to read.

### Command submission and completion

When submitting a command, the OS:

- Locks the queue’s semaphore (see below)
- Writes the submission entry to the submission queue
- Writes a pointer to the current process in a side table to link the submission to its corresponding process
- Updates the submission queue’s doorbell register to signal the NVMe controller that a new entry is available to read
- Makes the current process sleep

> **Note**: Maestro uses semaphores with a number of permits that corresponds to the number of entries in the submission queue. Before submitting a command, we take a permit and only release it when completion arrives. This ensures we don’t overflow the queue.

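
The entry layouts and the permit-based submission described above could look roughly like this in Rust. This is a sketch, not Maestro's code: `SubmissionEntry`, `CompletionEntry`, `SubmissionQueue` and `cdw0` are hypothetical names, a plain counter stands in for the semaphore, and the side table, the doorbell write and the sleeping are left out. The sketch also hands out one permit fewer than the queue length, on the assumption (taken from the specification) that a queue whose tail catches up with its head is considered empty.

```rust
/// 64-byte submission queue entry, following the layout table above.
#[repr(C)]
#[derive(Clone, Copy, Default)]
pub struct SubmissionEntry {
    pub cdw0: u32,
    pub nsid: u32,
    pub reserved: u64,
    pub mptr: u64,
    pub prp1: u64,
    pub prp2: u64,
    /// CDW10 to CDW15 (command-specific).
    pub cdw10_15: [u32; 6],
}

/// 16-byte completion queue entry, following the layout table above.
#[repr(C)]
#[derive(Clone, Copy, Default)]
pub struct CompletionEntry {
    pub dw0: u32,
    pub dw1: u32,
    pub sq_head: u16,
    pub sq_id: u16,
    pub cid: u16,
    /// Phase tag in bit 0, status field in the remaining bits.
    pub status: u16,
}

/// Packs CDW0 from an opcode [7:0] and a command ID [31:16].
/// The fused operation and PSDT fields are left at zero.
pub fn cdw0(opcode: u8, cid: u16) -> u32 {
    (opcode as u32) | ((cid as u32) << 16)
}

/// Hypothetical single-threaded stand-in for a submission queue.
pub struct SubmissionQueue {
    entries: Vec<SubmissionEntry>,
    tail: usize,
    /// Stand-in for the semaphore's permits.
    permits: usize,
}

impl SubmissionQueue {
    /// Creates a queue of `len` entries (`len` must be at least 2).
    pub fn new(len: usize) -> Self {
        assert!(len >= 2);
        Self {
            entries: vec![SubmissionEntry::default(); len],
            tail: 0,
            permits: len - 1,
        }
    }

    /// Takes a permit and writes `entry` at the tail. Returns the new tail,
    /// which the caller would write to the doorbell, or `None` when no
    /// permit is left (a real semaphore would block instead).
    pub fn submit(&mut self, entry: SubmissionEntry) -> Option<usize> {
        if self.permits == 0 {
            return None;
        }
        self.permits -= 1;
        self.entries[self.tail] = entry;
        self.tail = (self.tail + 1) % self.entries.len();
        Some(self.tail)
    }

    /// Releases one permit; to be called when a completion arrives.
    pub fn complete(&mut self) {
        self.permits += 1;
    }
}
```

`#[repr(C)]` keeps the fields in declaration order, so the two structures are 64 and 16 bytes long and can be overlaid directly on queue memory.
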
Once the NVMe controller has completed the command, the following happens:

- The NVMe controller writes an entry in the completion queue
- The NVMe controller sends an interrupt (see the **Message Signaled Interrupts** chapter)
- The interrupt is handled by the OS, which reads the new completion queue entry
- The OS uses the ID in the completion entry to find the matching submission entry, and then uses the side table to find the matching process
- The OS copies the completion entry to the side table so that it may be retrieved by the process that submitted the command
- The OS wakes the process up
- The OS updates the completion queue’s doorbell register to signal the NVMe controller that the completion queue entry has been processed
- Upon resuming, the process retrieves the completion queue entry from the side table to know if the command succeeded

### Admin queues & Identification

At the start, the NVMe controller only has one submission/completion queue pair. Those queues are the **admin queues**, used to **identify** the controller and namespaces, and to set up **I/O queues**.

The **Identify** command returns a big structure that contains a lot of information about the controller, the list of namespaces, or a namespace. We first use it to retrieve the information about the controller, then retrieve the list of attached namespaces, then retrieve information about each namespace.

With this information, we are able to use the **Create_IO_Completion_Queue** and **Create_IO_Submission_Queue** commands to create a pair of I/O queues.

### Read and write

Read and write operations on disk each have their associated command. To read or write data on disk, the OS simply specifies the **size** (in blocks), the **LBA** (Logical Block Address) and a **PRP** (Physical Region Page) or a PRP list. A PRP is an address in physical memory, where the data is read from or written to (depending on the I/O direction).

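
To make the description of read and write commands more concrete, here is a sketch of how such a command could be filled in. It relies on the NVM command set as I understand it from the specification: opcode `0x01` is Write, `0x02` is Read, CDW10 and CDW11 hold the starting LBA, and bits 15:0 of CDW12 hold the number of blocks minus one. `IoCommand` and `rw_command` are hypothetical names, and only the fields needed here are modelled.

```rust
/// Hypothetical, trimmed-down view of a submission entry: only the fields
/// that a simple read or write needs.
#[derive(Debug, Default, PartialEq)]
pub struct IoCommand {
    pub opcode: u8,
    pub nsid: u32,
    pub prp1: u64,
    pub cdw10: u32,
    pub cdw11: u32,
    pub cdw12: u32,
}

pub const OPCODE_WRITE: u8 = 0x01;
pub const OPCODE_READ: u8 = 0x02;

/// Builds a read or write command for `blocks` blocks starting at `lba`,
/// using the physical address `prp1` as source or destination.
/// Returns `None` if the block count does not fit the command.
pub fn rw_command(
    opcode: u8,
    nsid: u32,
    lba: u64,
    blocks: u32,
    prp1: u64,
) -> Option<IoCommand> {
    // The block count field is 16 bits wide and zero-based (0 means 1 block).
    if blocks == 0 || blocks > 0x1_0000 {
        return None;
    }
    Some(IoCommand {
        opcode,
        nsid,
        prp1,
        cdw10: lba as u32,         // Starting LBA, low 32 bits
        cdw11: (lba >> 32) as u32, // Starting LBA, high 32 bits
        cdw12: blocks - 1,
    })
}
```

A single PRP entry covers at most one memory page, so a transfer that spans several pages would also need PRP2 or a PRP list, which this sketch ignores.
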
Passing a PRP list allows implementing scatter-gather I/O (similar to the `readv(2)`/`writev(2)` system calls).

## Message Signaled Interrupts

Message Signaled Interrupts are supported by PCI/PCIe. They allow passing interrupts to the CPU without using dedicated pins. Instead, a message is sent on the bus from the device (in our case, the NVMe controller) to the CPU.

In the case of PCIe, this message is materialised by a write operation of a DWORD (4 bytes) at a specific memory address. This feature is called **MSI-X** (or simply **MSI** for the legacy version, but we will not describe it here, as NVMe requires MSI-X anyway).

It turns out that x86 CPUs have a dedicated memory address that triggers an interrupt when a DWORD is written to it.

Upon scanning the PCIe bus for device discovery, the OS can retrieve a BAR pointing to a table used to map the device’s interrupts to the CPU’s interrupt table. It has the following layout:

| Bits 127-96 | Bits 95-64 | Bits 63-32 | Bits 31-0 |
|---|---|---|---|
| Vector Control (0) | Message Data (0) | Message Address High (0) | Message Address Low (0) |
| Vector Control (1) | Message Data (1) | Message Address High (1) | Message Address Low (1) |
| … | … | … | … |
| Vector Control (N - 1) | Message Data (N - 1) | Message Address High (N - 1) | Message Address Low (N - 1) |

> This table has been shamelessly stolen from [osdev.org](https://wiki.osdev.org/PCI)

The **Message Address** is the address where the message is written in memory. The **Message Data** is the value of the DWORD written there. **Vector Control** contains flags.

Each entry (line in the table) corresponds to an interrupt ID on the device’s side. The format of Message Data is specific to the CPU architecture, and it contains the interrupt ID on the CPU side.

## Maestro’s design flaws

A few design issues in the kernel made implementing the NVMe driver a bit harder than it should have been. This chapter is an overview of those.

### Sleeping in a page fault handler triggered in `execve`

When using `execve(2)` to execute a programme, the kernel creates a new virtual memory space to build the new programme image. To populate this new virtual memory space, the kernel temporarily switches to it. While doing so, it also entered a **critical section** because we could not allow switching context to another process, while being in a memory context that is different from the one bound to the current process. Otherwise, the process would resume with its original memory context instead of the temporary one, causing memory corruption.

> For more information about critical sections, see the blog [article about SMP](https://blog.lenot.re/a/smp).

To populate the memory space, the kernel memory-maps (`mmap(2)`) the ELF file. Memory pages are lazy-populated, so they are not loaded from disk immediately. Then, the kernel sometimes needs to zero the end of some ELF segments. Doing so triggers a page fault (since the page is not present yet), which in turn reads the file from the disk to populate memory.

The NVMe driver sends the read command and then puts the current process to sleep, which, in turn, triggers a context switch to immediately continue executing any other process waiting for CPU time.

**However**, we said earlier that we entered a critical section, right? The precise goal of a critical section is to disallow context switching. This design was valid before NVMe because my Parallel ATA implementation is only polling (looping until data is ready, instead of sleeping until an interrupt comes).

To fix this issue, I modified the `Process` structure to contain one more field (`active_mem_space`) that points to the currently bound memory space (that may differ from the process’s own memory space). The temporary switch changes the value in `active_mem_space`. Then, when the process resumes, it uses that memory space instead of the process’s memory space.

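
The idea behind this fix can be sketched as follows. Only the `Process` and `active_mem_space` names come from the description above. The `MemSpace` type, the `mem_space` field, the use of `Option` and `Arc`, and the `mem_space_to_bind` method are assumptions made for this example and may not match Maestro's actual code.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a virtual memory space.
pub struct MemSpace {
    pub id: u32,
}

pub struct Process {
    /// The process's own memory space.
    pub mem_space: Arc<MemSpace>,
    /// The memory space currently bound, when it differs from `mem_space`
    /// (for instance during the temporary switch in `execve`).
    pub active_mem_space: Option<Arc<MemSpace>>,
}

impl Process {
    /// Returns the memory space to bind when the process is resumed:
    /// the temporary one if there is one, else the process's own.
    pub fn mem_space_to_bind(&self) -> &Arc<MemSpace> {
        self.active_mem_space.as_ref().unwrap_or(&self.mem_space)
    }
}
```

With this in place, a process that goes to sleep in the middle of the temporary switch resumes in the temporary memory space, and clearing `active_mem_space` at the end of the switch restores the usual behaviour.
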
So the critical section is not needed any more.

### CPU tasks rebalance during an NVMe sleep

To start sleeping, the NVMe driver has the following code:

```
// Put the process into sleeping state
process::set_state(State::Sleeping);
// Reschedule
schedule();
```

There is a small gap between the two function calls, during which another CPU core might run the task that rebalances tasks across CPU cores. Between the moment we set the process into `Sleeping` state and the moment we reschedule (effectively saving the process’s state), another CPU core could attempt to resume the task if it received the completion interrupt before `schedule` was called. This made the process run on two CPU cores at once, with an invalid register state on one core.

To fix this issue, the process state now has a flag that locks it until the context has effectively been switched, preventing other CPUs from picking the task until `schedule` is called.

### Disabling Write-Protect outside a critical section

Under x86, the `CR0` register has a **Write-Protect** flag that, when disabled, allows the kernel to write on read-only pages. The state of this register is **NOT** saved by Maestro on context switch.

In `execve(2)` (again), the kernel was temporarily disabling this flag to write in the ELF’s read-only segments. We cannot wrap this in a critical section either since, again, writing in memory might read from disk, which will put the process to sleep, which in turn will trigger a context switch.

When putting the process to sleep, the scheduler sometimes migrated the process to another CPU (to rebalance the load between cores). Since the process is resumed on a different core, with the Write-Protect flag enabled (since `CR0` is not saved on context switch), the process resumes trying to write on a read-only page, which triggers a kernel panic.

As a fix, I stopped disabling Write-Protect here. Instead, I implemented the `mprotect(2)` system call so that `execve(2)` can reuse its logic to make the segments read-only once initialisation is over.

## Future optimisations

The current implementation only has one I/O queue pair. On systems that have a lot of CPU cores, this might cause [contention](https://en.wikipedia.org/wiki/Resource_contention). One way to reduce this is to create as many I/O queue pairs as available (but no more than the number of CPU cores) and assign them evenly across cores. As such, if the system has `N` CPUs and `M` I/O queue pairs available, each I/O queue pair gets `N / M` cores assigned.

Some NVMe commands are available to make some operations faster (non-exhaustive):

- **Write Zeros**: marks blocks as full of zeros (also making the NVMe drive last longer since it avoids write cycles)
- **Dataset Management**: gives hints to the controller about the frequency of access to certain blocks
- **Copy**: copies data from/to the same namespace or different namespaces without involving the CPU

## What’s next?

Now that the NVMe side-quest is over, I will get back to implementing support for a desktop environment. This is ongoing and progressing well!
