In June, the NVMe standard received an update that defines a software interface for reading from and writing to drives in a way that matches how SSDs and NAND flash actually work.

Instead of emulating the traditional block device model that SSDs inherited from hard drives and earlier storage technologies, the new NVMe Zoned Namespaces optional feature allows SSDs to implement a different storage abstraction over flash memory. This is quite similar to the extensions SAS and SATA have added to accommodate Shingled Magnetic Recording (SMR) hard drives, with a few extras for SSDs. ‘Zoned’ SSDs with this new feature can offer better performance than regular SSDs, with less overprovisioning and less DRAM. The downside is that applications and operating systems have to be updated to support zoned storage, but that work is well underway.

The NVMe Zoned Namespaces (ZNS) specification has been ratified and published as a Technical Proposal. It builds on top of the current NVMe 1.4a specification, in preparation for NVMe 2.0. The upcoming NVMe 2.0 specification will incorporate all the approved Technical Proposals, but also reorganize that same functionality into multiple smaller component documents: a base specification, plus one for each command set (block, zoned, key-value, and potentially more in the future), and separate specifications for each transport protocol (PCIe, RDMA, TCP). The standardization of Zoned Namespaces clears the way for broader commercialization and adoption of this technology, which so far has been held back by vendor-specific zoned storage interfaces and very limited hardware choices.

Zoned Storage: An Overview

The fundamental challenge of using flash memory for a solid state drive is that all of our computers are built around how hard drives work, and flash memory doesn't behave like a hard drive. Flash is organized very differently from a hard drive, and the performance it offers makes it worth the trouble to optimize our computers for how flash actually behaves.

Magnetic platters are a fairly analog storage medium, with no inherent structure to dictate features like sector sizes. The long-lived standard of 512-byte sectors was chosen merely for convenience, and enterprise drives now support 4kB sectors as we reach drive capacities in the multi-TB range. By contrast, a flash memory chip has several levels of structure baked into the design. The most important numbers are the page size and erase block size. Data can be read with page size granularity (typically on the order of several kB) and an empty page can be written to with a program operation, but erase operations clear an entire multi-MB block. The substantial size mismatch between read/program operations and erase operations is a complication that ordinary mechanical hard drives don't have to deal with. The limited program/erase cycle endurance of flash memory also adds to the challenge, since every write that can be avoided extends the drive's lifespan.
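The mismatch between program and erase granularity can be made concrete with a toy model. The Python sketch below is illustrative only: the EraseBlock class, the update_in_place helper and the four-page block size are assumptions made for this example, not anything taken from a flash datasheet or the NVMe specification. It shows that changing one page "in place" forces the entire block to be erased and reprogrammed.

```python
class EraseBlock:
    """Toy model of one NAND erase block (hypothetical, for illustration)."""

    def __init__(self, pages_per_block=4):
        self.pages = [None] * pages_per_block  # None means erased and writable
        self.erase_count = 0

    def program(self, page_index, data):
        # A page can only be programmed while it is in the erased state.
        if self.pages[page_index] is not None:
            raise ValueError("page already programmed; erase the block first")
        self.pages[page_index] = data

    def read(self, page_index):
        # Reads have page granularity and no ordering restrictions.
        return self.pages[page_index]

    def erase(self):
        # Erase has block granularity: every page is cleared together.
        self.pages = [None] * len(self.pages)
        self.erase_count += 1


def update_in_place(block, page_index, new_data):
    """Emulate overwriting one page; return how many pages were programmed."""
    saved = list(block.pages)        # copy the whole block out
    saved[page_index] = new_data     # apply the one-page change
    block.erase()                    # the only way to make pages writable again
    programmed = 0
    for i, data in enumerate(saved):
        if data is not None:
            block.program(i, data)
            programmed += 1
    return programmed


block = EraseBlock(pages_per_block=4)
for i in range(4):
    block.program(i, f"v1-{i}")
pages_written = update_in_place(block, 2, "v2-2")
```

Here a one-page update costs one block erase and four page programs. Real SSDs avoid this worst case by writing the new data somewhere else and cleaning up later, which is where the flash translation layer and garbage collection come in.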

Almost all SSDs today are presented to software as an abstraction of a simple HDD-like block storage device with 512-byte or 4kB sectors. This hides all the complexities of SSDs that we’ve covered in detail over the years, such as page and erase block sizes, wear leveling and garbage collection. This abstraction is also part of why SSD controllers and firmware are so much bigger and more complicated (and more bug-prone) than hard drive controllers. For most purposes, the block device abstraction is still the right compromise, because it allows unmodified software to enjoy most of the performance benefits of flash memory, and the downsides like write amplification are manageable.
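To see what the block device abstraction has to hide, here is a minimal Python sketch of a flash translation layer. It is a toy under stated assumptions, not how any real controller works: the ToyFTL class, its greedy "fewest live pages" victim selection, and the 3-block by 4-page geometry are all invented for the example, and wear leveling is left out entirely. The host sees plain overwritable logical block addresses while the FTL writes out of place, tracks a mapping table, and relocates live data during garbage collection.

```python
class ToyFTL:
    """Toy FTL: out-of-place writes plus greedy garbage collection (hypothetical)."""

    def __init__(self, num_blocks=3, pages_per_block=4):
        self.pages_per_block = pages_per_block
        # Each physical page holds None (erased) or a (lba, data) tuple.
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.mapping = {}                     # lba -> (block, page) of the live copy
        self.free_blocks = list(range(1, num_blocks))
        self.active, self.next_page = 0, 0    # where the next program operation lands
        self.host_writes = 0
        self.flash_writes = 0

    def _is_valid(self, block, page):
        entry = self.blocks[block][page]
        return entry is not None and self.mapping.get(entry[0]) == (block, page)

    def _valid_pages(self, block):
        return [p for p in range(self.pages_per_block) if self._is_valid(block, p)]

    def _open_new_block(self):
        survivors = []
        if not self.free_blocks:
            # Every block is full: reclaim the one with the fewest live pages.
            victim = min(range(len(self.blocks)),
                         key=lambda b: len(self._valid_pages(b)))
            live = self._valid_pages(victim)
            if len(live) == self.pages_per_block:
                raise RuntimeError("no stale pages to reclaim; device is full")
            survivors = [self.blocks[victim][p] for p in live]
            self.blocks[victim] = [None] * self.pages_per_block  # block erase
            self.free_blocks.append(victim)
        self.active, self.next_page = self.free_blocks.pop(0), 0
        for lba, data in survivors:
            # Relocation: flash writes the host never asked for.
            self._program(lba, data)

    def _program(self, lba, data):
        if self.next_page == self.pages_per_block:
            self._open_new_block()
        self.blocks[self.active][self.next_page] = (lba, data)
        self.mapping[lba] = (self.active, self.next_page)  # old copy becomes stale
        self.next_page += 1
        self.flash_writes += 1

    def write(self, lba, data):
        self.host_writes += 1
        self._program(lba, data)

    def read(self, lba):
        block, page = self.mapping[lba]
        return self.blocks[block][page][1]


ftl = ToyFTL(num_blocks=3, pages_per_block=4)
for lba in range(8):                          # fill 8 of the 12 physical pages
    ftl.write(lba, f"cold-{lba}")
for n, lba in enumerate([0, 4, 0, 4, 0]):     # then keep overwriting two hot LBAs
    ftl.write(lba, f"hot-{lba}-{n}")
write_amplification = ftl.flash_writes / ftl.host_writes
```

In this run the 13 host writes turn into 15 flash writes, because the last overwrite triggers garbage collection and two still-live pages have to be copied before their block can be erased. The four spare physical pages play the role of overprovisioning: without them the toy would have nowhere to put new data.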

For years, the storage industry has been exploring alternatives to the block storage abstraction. There have been several proposals for Open Channel SSDs, which expose many of the gory details of flash memory directly to the host system, moving many of the responsibilities of SSD firmware over to software running on the host CPU. The various open channel SSD standards that have been promoted have struck different balances along the spectrum, from a typical SSD with a fully drive-managed flash translation layer (FTL) to a fully software-managed solution. The industry consensus was that some of the earliest standards, like the LightNVM 1.x specification, exposed too many details, requiring software to handle some differences between different vendors' flash memory, or between SLC, MLC, TLC, etc. Newer standards have sought to find a better balance and a level of abstraction that will allow for easier mass adoption while still allowing software to bypass the inefficiencies of a typical SSD.

Tackling the problem from the other direction, the NVMe standard has been gaining features that allow drives to share more information with the host about optimal patterns for data access and layout. For the most part, these are hints and optional features that software can take advantage of. This works because software that isn't aware of these features will still function as usual. Directives and Streams, NVM Sets, Predictable Latency Mode, and various alignment and granularity hints have all been added over the past few revisions of the NVMe specification to make it possible for software and SSDs to better cooperate.

Lately, a third approach has been gaining momentum, influenced by the hard drive market. Shingled Magnetic Recording (SMR) is a technique for increasing storage density by partially overlapping tracks on hard drive platters. The downside of this approach is that it's no longer possible to directly modify arbitrary bytes of data without corrupting adjacent overlapping tracks, so SMR hard drives group tracks into zones and only allow sequential writes within a zone. This has severe performance implications for workloads that include random writes, which is part of why drive-managed SMR hard drives have seen a mixed reception at best in the marketplace. However, in the server storage market, host-managed SMR is also a viable option: it requires the OS, filesystem and potentially the application software to be directly aware of zones, but making the necessary software changes is not an insurmountable challenge when working with a controlled environment.
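The sequential-write rule is easy to sketch. The Python below is a simplified illustration, not the SMR or NVMe command set: the Zone class and its method names are assumptions made for this example, and real zoned devices track more state than a single write pointer. It models a zone that accepts a write only at its current write pointer, and that must be reset as a whole before any block in it can be rewritten.

```python
class Zone:
    """Toy zone: writes must land exactly at the write pointer (hypothetical)."""

    def __init__(self, start_lba, num_blocks):
        self.start_lba = start_lba
        self.end_lba = start_lba + num_blocks   # exclusive upper bound
        self.write_pointer = start_lba
        self.data = {}

    def write(self, lba, payload):
        if self.write_pointer >= self.end_lba:
            raise ValueError("zone is full; reset it before writing again")
        if lba != self.write_pointer:
            raise ValueError(
                f"non-sequential write: expected LBA {self.write_pointer}, got {lba}"
            )
        self.data[lba] = payload
        self.write_pointer += 1

    def read(self, lba):
        # Reads of already-written blocks can happen in any order.
        return self.data[lba]

    def reset(self):
        # Discards the whole zone and rewinds the write pointer.
        self.data.clear()
        self.write_pointer = self.start_lba


zone = Zone(start_lba=100, num_blocks=4)
zone.write(100, "a")
zone.write(101, "b")
try:
    zone.write(100, "A")            # attempt to modify data in place
    overwrite_rejected = False
except ValueError:
    overwrite_rejected = True
zone.reset()                        # the only way to rewrite LBA 100
zone.write(100, "A")
```

The in-place overwrite is rejected, and the data can only be replaced after the whole zone is reset. Software that is aware of zones has to plan for this, for example by appending new versions of data and reclaiming entire zones at once.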

The zoned storage model used for SMR hard drives turns out to also be a good fit for use with flash, and is a precursor to NVMe Zoned Namespaces. The zone-like structure of SMR hard drives mirrors the page and erase block structure of an SSD. The restrictions on writes aren't an exact match, but they come close enough.

In this article, we’ll cover what NVMe Zoned Namespaces are, and why they matter.

Comments

  • FreckledTrout - Thursday, August 6, 2020 - link

    Like most things, it's the cost. I bet the testing alone is prohibitive to backport this into older SSDs.
  • xenol - Thursday, August 6, 2020 - link

    Bingo. Testing and support costs something. Though I suppose they could release it for older drives under a no-support provision.

    Except depending on who tries this, I'm sure it's inevitable someone will break something and complain that they're not getting support.
  • DigitalFreak - Thursday, August 6, 2020 - link

    Why spend the money to make a retroactive firmware, when you can just sell the user a new drive with the updated spec? If someone cares enough about this, they'll shell out the $$$ for a new drive.
  • IT Mamba - Monday, December 14, 2020 - link

    Easier said than done.
  • Grizzlebee11 - Thursday, August 6, 2020 - link

    I wonder how this will affect Optane performance.
  • Billy Tallis - Thursday, August 6, 2020 - link

    Optane has no reason to adopt a zoned model, because the underlying 3D XPoint memory supports in-place modification of data.
  • name99 - Saturday, August 8, 2020 - link

    Does it really? I know Intel made a big deal about this, but isn't the reality (not that it changes your point, but getting the technical details right)
    - the minimum Optane granularity unit is a 64B line (which, admittedly, is the effective same as DRAM, but DRAM could be smaller if necessary, Optane???)

    - the PRACTICAL Optane granularity unit (which is what I am getting at in terms of "in-place"), giving 4x the bandwidth, is 256B.

    Yeah, I'm right. Looking around I found this
    which says "the 3D-XPoint physical media access granularity is 256 bytes" with everything that flows from that: need for write combining buffers, RMW if you can't write-combine, write amplification power/lifetime concerns, etc etc.

    So, sure, you don't have BIG zones/pages like flash -- but it's also incorrect (both technically, and for optimal use of the technology) to suggest that it's "true" random access, as much so as DRAM.

    It remains unclear to me how much of the current disappointment around Optane DIMM performance, eg
    derives from this. Certainly the Optane-targeted algorithms and new file systems I was reading say 5 years ago, when Intel was promising essentially "flash density, RAM performance" seemed very much optimized for "true" random access with no attempts at clustering larger than a cache line.
    Wouldn't be the first time Intel advertising department's lies landed up tanking a technology because of the ultimate gap between what was promised (and designed for) vs what was delivered...
  • MFinn3333 - Sunday, August 9, 2020 - link

    Um... Optane DIMM's have not disappointed anybody in their performance. Shows just how
  • brucethemoose - Thursday, August 6, 2020 - link

    Optane is byte addressable like DRAM and fairly durable, isn't it? I don't think this "multi kilobyte zoned storage" approach would be any more appropriate than the spinning rust block/sector model.

    Then again, running Optane over PCIe/NVMe always seemed like a waste to me.
  • FunBunny2 - Friday, August 7, 2020 - link

    "Optane is byte addressable like DRAM and fairly durable, isn't it?"

    yes, and my first notion was that Optane would *replace* DRAM/HDD/SSD in a 'true' 64 bit address single level storage space. although slower than DRAM, such an architecture would write program variables as they change direct to 'storage' without all that data migration. completely forgot that current cpu use many levels of buffers between registers and durable storage. iow, there's really no byte addressed update in today's machines.

    back in the 70s and early 80s, TI (and some others, I think) built machines that had no data registers in/on the cpu, all instructions happened in main memory and all data was written directly in memory and then to disc. the morphing to load/store architectures with scads of buffering means that optimum use of an Optane store with such an architecture looks to be a waste of time until/if cpu architecture writes data based on transaction scope of applications, not buffer fill.
