===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace helper library

nvdimm@lists.linux.dev

Version 13

.. contents:

	Glossary
	Overview
	    Supporting Documents
	    Git Trees
	LIBNVDIMM PMEM
	    PMEM-REGIONs, Atomic Sectors, and DAX
	Example NVDIMM Platform
	LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
	    LIBNDCTL: Context
	        libndctl: instantiate a new library context example
	    LIBNVDIMM/LIBNDCTL: Bus
	        libnvdimm: control class device in /sys/class
	        libnvdimm: bus
	        libndctl: bus enumeration example
	    LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
	        libnvdimm: DIMM (NMEM)
	        libndctl: DIMM enumeration example
	    LIBNVDIMM/LIBNDCTL: Region
	        libnvdimm: region
	        libndctl: region enumeration example
	        Why Not Encode the Region Type into the Region Name?
	        How Do I Determine the Major Type of a Region?
	    LIBNVDIMM/LIBNDCTL: Namespace
	        libnvdimm: namespace
	        libndctl: namespace enumeration example
	        libndctl: namespace creation example
	        Why the Term "namespace"?
	    LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
	        libnvdimm: btt layout
	        libndctl: btt creation example
	Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes are persistent.  A
  block device composed of PMEM is capable of DAX.  A PMEM address range
  may span an interleave of several DIMMs.

DPA:
  DIMM Physical Address: a DIMM-relative offset.  With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added, a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address.

DAX:
  File system extensions to bypass the page cache and block layer to
  mmap persistent memory, from a PMEM block device, directly into a
  process address space.

DSM:
  Device Specific Method: an ACPI method to control a specific
  device, in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
  It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT:
  Block Translation Table: Persistent memory is byte addressable.
  Existing software may have an expectation that the power-fail-atomicity
  of writes is at least one sector, 512 bytes.  The BTT is an indirection
  table with atomic update semantics to front a PMEM block device
  driver and present arbitrary atomic sector sizes.

LABEL:
  Metadata stored on a DIMM device that partitions and identifies
  (persistently names) capacity allocated to different PMEM namespaces. It
  also indicates whether an address abstraction like a BTT is applied to
  the namespace.  Note that traditional partition tables, GPT/MBR, are
  layered on top of a PMEM namespace, or an address abstraction like BTT
  if present, but partition support is deprecated going forward.


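To make the DPA entry above concrete, here is a minimal sketch of an
interleave decode.  It assumes a plain round-robin interleave in which
consecutive fixed-size lines of a region rotate across the participating
DIMMs.  The function name, the structure, and the line size are
hypothetical; real memory controllers may hash or reorder lines, and a
per-DIMM base offset would still need to be added to the result.

```c
#include <assert.h>
#include <stdint.h>

/* Result of decoding one offset within an interleaved region. */
struct dpa_decode {
	unsigned int dimm;	/* position of the DIMM in the interleave set */
	uint64_t dpa;		/* DIMM-relative offset, without any per-DIMM base */
};

/*
 * Hypothetical round-robin decode: consecutive 'line_size' byte lines of
 * the region rotate across 'ways' DIMMs.  Illustration only.
 */
struct dpa_decode decode_region_offset(uint64_t offset, unsigned int ways,
		uint64_t line_size)
{
	struct dpa_decode res;
	uint64_t line;

	assert(ways > 0 && line_size > 0);

	line = offset / line_size;		/* which interleave line */
	res.dimm = (unsigned int)(line % ways);	/* DIMM that owns the line */
	res.dpa = (line / ways) * line_size + offset % line_size;
	return res;
}
```

With ways set to 1 the decode degenerates to the 1:1 association described
in the glossary: the returned dpa equals the input offset.
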
Overview
========

The LIBNVDIMM subsystem provides support for PMEM described by platform
firmware or a device driver. On ACPI based systems the platform firmware
conveys persistent memory resources via the ACPI NFIT "NVDIMM Firmware
Interface Table" in ACPI 6. While the LIBNVDIMM subsystem implementation
is generic and supports pre-NFIT platforms, it was guided by the
superset of capabilities needed to support this ACPI 6 definition for
NVDIMM resources. The original implementation supported the
block-window-aperture capability described in the NFIT, but that
capability never shipped in a product and support for it has since been
abandoned.

Supporting Documents
--------------------

ACPI 6:
	https://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
	https://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
	https://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
	https://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
---------

LIBNVDIMM:
	https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git
LIBNDCTL:
	https://github.com/pmem/ndctl.git


LIBNVDIMM PMEM
==============

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways.  Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss.  Now, the NFIT
specification standardizes not only the description of PMEM, but also
platform message-passing entry points for control and configuration.

PMEM (nd_pmem.ko): Drives a system-physical-address range.  This range is
contiguous in system memory and may be interleaved (hardware memory controller
striped) across multiple DIMMs.  When interleaved, the platform may optionally
provide details of which DIMMs are participating in the interleave.

It is worth noting that when the labeling capability is detected (an EFI
namespace label index block is found), no block device is created
by default, as userspace needs to do at least one allocation of DPA to
the PMEM range.  In contrast, ND_NAMESPACE_IO ranges, once registered,
can be immediately attached to nd_pmem. This latter mode is called
label-less or "legacy".

PMEM-REGIONs, Atomic Sectors, and DAX
-------------------------------------

For the cases where an application or filesystem still needs atomic sector
update guarantees, it can register a BTT on a PMEM device or partition.  See
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".


Example NVDIMM Platform
=======================

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::


                               (a)               (b)           DIMM
            +-------------------+--------+--------+--------+
  +------+  |       pm0.0       |  free  | pm1.0  |  free  |    0
  | imc0 +--+- - - region0- - - +--------+        +--------+
  +--+---+  |       pm0.0       |  free  | pm1.0  |  free  |    1
     |      +-------------------+--------v        v--------+
  +--+---+                               |                 |
  | cpu0 |                                     region1
  +--+---+                               |                 |
     |      +----------------------------^        ^--------+
  +--+---+  |           free             | pm1.0  |  free  |    2
  | imc1 +--+----------------------------|        +--------+
  +------+  |           free             | pm1.0  |  free  |    3
            +----------------------------+--------+--------+

In this platform we have four DIMMs and two memory controllers in one
socket.  Each PMEM interleave set is identified by a region device with
a dynamically assigned id.

    1. The first portions of DIMM0 and DIMM1 are interleaved as REGION0. A
       single PMEM namespace is created in the REGION0-SPA-range that spans most
       of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
       interleaved system-physical-address range is left free for
       another PMEM namespace to be defined.

    2. In the last portion of DIMM0 and DIMM1 we have an interleaved
       system-physical-address range, REGION1, that spans those two DIMMs as
       well as DIMM2 and DIMM3.  Some of REGION1 is allocated to a PMEM namespace
       named "pm1.0".

    This bus is provided by the kernel under the device
    /sys/devices/platform/nfit_test.0 when the nfit_test.ko module from
    tools/testing/nvdimm is loaded. This module is a unit test for
    LIBNVDIMM and the acpi_nfit.ko driver.


LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API.  The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state.  The library is
based on the libabc template:

	https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

	struct ndctl_ctx *ctx;

	if (ndctl_new(&ctx) == 0)
		return ctx;
	else
		return NULL;

LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT.  The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it.  The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.

LIBNVDIMM: control class device in /sys/class
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::

	/sys/class/nd/ndctl0
	|-- dev
	|-- device -> ../../../ndbus0
	`-- subsystem -> ../../../../../../../class/nd



LIBNVDIMM: bus
^^^^^^^^^^^^^^

::

	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
	       struct nvdimm_bus_descriptor *nfit_desc);

::

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- commands
	|-- nd
	|-- nfit
	|-- nmem0
	|-- nmem1
	|-- nmem2
	|-- nmem3
	|-- power
	|-- provider
	|-- region0
	|-- region1
	|-- region2
	|-- region3
	|-- region4
	|-- region5
	|-- uevent
	`-- wait_probe

LIBNDCTL: bus enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Find the bus handle that describes the bus from Example NVDIMM Platform::

	#include <string.h>
	#include <ndctl/libndctl.h>

	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
			const char *provider)
	{
		struct ndctl_bus *bus;

		/* walk every bus known to this library context */
		ndctl_bus_foreach(ctx, bus)
			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
				return bus;

		return NULL;
	}

	bus = get_bus_by_provider(ctx, "nfit_test.0");


LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs.  If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX".  The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)
^^^^^^^^^^^^^^^^^^^^^^

::

	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
			const struct attribute_group **groups, unsigned long flags,
			unsigned long *dsm_mask);

::

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- nmem0
	|   |-- available_slots
	|   |-- commands
	|   |-- dev
	|   |-- devtype
	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
	|   |-- modalias
	|   |-- nfit
	|   |   |-- device
	|   |   |-- format
	|   |   |-- handle
	|   |   |-- phys_id
	|   |   |-- rev_id
	|   |   |-- serial
	|   |   `-- vendor
	|   |-- state
	|   |-- subsystem -> ../../../../../bus/nd
	|   `-- uevent
	|-- nmem1
	[..]


LIBNDCTL: DIMM enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note, in this example we are assuming NFIT-defined DIMMs which are
identified by an "nfit_handle", a 32-bit value where:

   - Bit 3:0 DIMM number within the memory channel
   - Bit 7:4 memory channel number
   - Bit 11:8 memory controller ID
   - Bit 15:12 socket ID (within scope of a Node controller if node
     controller is present)
   - Bit 27:16 Node Controller ID
   - Bit 31:28 Reserved

::

	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
	       unsigned int handle)
	{
		struct ndctl_dimm *dimm;

		ndctl_dimm_foreach(bus, dimm)
			if (ndctl_dimm_get_handle(dimm) == handle)
				return dimm;

		return NULL;
	}

	/* node, socket, memory controller, channel, dimm; arguments are
	 * parenthesized so that expressions can be passed safely */
	#define DIMM_HANDLE(n, s, i, c, d) \
		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
		 | (((c) & 0xf) << 4) | ((d) & 0xf))

	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));

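The DIMM_HANDLE() macro above packs the fields into a handle.  Going the
other way can be sketched as follows, using only the bit layout listed
above.  The structure and function names are hypothetical and are not part
of LIBNDCTL.

```c
#include <stdint.h>

/* Fields of an nfit_handle, per the bit layout listed above. */
struct nfit_handle_fields {
	unsigned int dimm;	/* bits 3:0, DIMM number within the channel */
	unsigned int channel;	/* bits 7:4, memory channel number */
	unsigned int imc;	/* bits 11:8, memory controller ID */
	unsigned int socket;	/* bits 15:12, socket ID */
	unsigned int node;	/* bits 27:16, node controller ID */
};

/* Split a 32-bit handle into its fields; bits 31:28 are reserved. */
struct nfit_handle_fields decode_nfit_handle(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.dimm = handle & 0xf;
	f.channel = (handle >> 4) & 0xf;
	f.imc = (handle >> 8) & 0xf;
	f.socket = (handle >> 12) & 0xf;
	f.node = (handle >> 16) & 0xfff;
	return f;
}
```

For example, a handle of 0x12345 decodes to node 1, socket 2, memory
controller 3, channel 4, DIMM 5.
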
LIBNVDIMM/LIBNDCTL: Region
--------------------------

A generic REGION device is registered for each PMEM interleave-set /
range. Per the example there are 2 PMEM regions on the "nfit_test.0"
bus. The primary role of a region is to be a container of "mappings".  A
mapping is a tuple of <DIMM, DPA-start-offset, length>.

LIBNVDIMM provides a built-in driver for REGION devices.  This driver
is responsible for parsing all LABELs, if present, and then emitting NAMESPACE
devices for the nd_pmem driver to consume.

In addition to the generic attributes of "mapping"s, "interleave_ways"
and "size" the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.

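The mapping tuple described above can be pictured as a small data
structure.  This is only a sketch: the type and function names are
hypothetical, and it assumes that the capacity of an interleaved region is
the sum of the lengths that each DIMM contributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One <DIMM, DPA-start-offset, length> tuple, as described above. */
struct toy_mapping {
	unsigned int dimm;	/* index of the backing DIMM (nmemX) */
	uint64_t dpa_start;	/* DIMM-relative start offset */
	uint64_t length;	/* bytes contributed by this DIMM */
};

/* Assumed relationship: region capacity is the sum of its mappings. */
uint64_t toy_region_size(const struct toy_mapping *maps, size_t count)
{
	uint64_t total = 0;
	size_t i;

	assert(maps != NULL || count == 0);

	for (i = 0; i < count; i++)
		total += maps[i].length;
	return total;
}
```

A two-way interleave in which each DIMM contributes 0x1000 bytes would
then describe a region of 0x2000 bytes.
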
LIBNVDIMM: region
^^^^^^^^^^^^^^^^^

::

	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
			struct nd_region_desc *ndr_desc);

::

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- region0
	|   |-- available_size
	|   |-- btt0
	|   |-- btt_seed
	|   |-- devtype
	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
	|   |-- init_namespaces
	|   |-- mapping0
	|   |-- mapping1
	|   |-- mappings
	|   |-- modalias
	|   |-- namespace0.0
	|   |-- namespace_seed
	|   |-- numa_node
	|   |-- nfit
	|   |   `-- spa_index
	|   |-- nstype
	|   |-- set_cookie
	|   |-- size
	|   |-- subsystem -> ../../../../../bus/nd
	|   `-- uevent
	|-- region1
	[..]

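Attributes such as "size" in the layout above can also be read without
LIBNDCTL.  The helper below is a minimal sketch with a hypothetical name;
it assumes the attribute holds a single integer on its first line and
accepts either decimal or 0x-prefixed hexadecimal text.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success or a negative errno value on failure.
 */
int read_attr_u64(const char *path, unsigned long long *value)
{
	char buf[64];
	char *end;
	FILE *f;

	assert(path != NULL && value != NULL);

	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	errno = 0;
	*value = strtoull(buf, &end, 0);	/* base 0: decimal or 0x hex */
	if (errno || end == buf)
		return -EINVAL;
	return 0;
}
```

On the example platform a caller might pass
"/sys/devices/platform/nfit_test.0/ndbus0/region0/size" as the path.
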
LIBNDCTL: region enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sample region retrieval routines based on NFIT-unique data like
"spa_index" (interleave set id).

::

	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
			unsigned int spa_index)
	{
		struct ndctl_region *region;

		ndctl_region_foreach(bus, region) {
			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
				continue;
			if (ndctl_region_get_spa_index(region) == spa_index)
				return region;
		}
		return NULL;
	}


LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries, surfaces
one or more "namespace" devices.  The arrival of a "namespace" device currently
triggers the nd_pmem driver to load and register a disk/block device.

LIBNVDIMM: namespace
^^^^^^^^^^^^^^^^^^^^

Here is a sample layout from the two major types of NAMESPACE, where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute), and namespace1.0 represents an anonymous PMEM namespace (note that
it has no 'uuid' attribute because it does not support a LABEL).

::

	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
	|-- alt_name
	|-- devtype
	|-- dpa_extents
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	|-- uevent
	`-- uuid
	/sys/devices/platform/nfit_test.1/ndbus1/region1/namespace1.0
	|-- block
	|   `-- pmem0
	|-- devtype
	|-- driver -> ../../../../../../bus/nd/drivers/pmem
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	`-- uevent

LIBNDCTL: namespace enumeration example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Namespaces are indexed relative to their parent region; see the example
below.  These indexes are mostly static from boot to boot, but the
subsystem makes no guarantees in this regard.  For a static namespace
identifier use its 'uuid' attribute.

::

  static struct ndctl_namespace
  *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
  {
          struct ndctl_namespace *ndns;

          ndctl_namespace_foreach(region, ndns)
                  if (ndctl_namespace_get_id(ndns) == id)
                          return ndns;

          return NULL;
  }

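Because namespaces are indexed relative to their parent region, device
names such as "namespace0.0" carry both the region id and the namespace
id.  The helper below is a minimal sketch of splitting such a name; the
function name is hypothetical and the "namespace<region>.<id>" pattern is
inferred from the sysfs layouts shown in this document.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Split "namespaceX.Y" into region id X and namespace id Y.
 * Returns 0 on success, -1 if the name does not match the pattern.
 */
int parse_namespace_name(const char *name, unsigned int *region_id,
		unsigned int *ns_id)
{
	char trailing;

	assert(name != NULL && region_id != NULL && ns_id != NULL);

	/* exactly two conversions must succeed; a third means trailing junk */
	if (sscanf(name, "namespace%u.%u%c", region_id, ns_id, &trailing) != 2)
		return -1;
	return 0;
}
```

As noted above, these ids are not guaranteed to be stable across boots,
so persistent references should use the 'uuid' attribute instead.
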
LIBNDCTL: namespace creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Idle namespaces are automatically created by the kernel if a given
region has enough available capacity to create a new namespace.
Namespace instantiation involves finding an idle namespace and
configuring it.  For the most part the setting of namespace attributes
can occur in any order; the only constraint is that 'uuid' must be set
before 'size'.  This enables the kernel to track DPA allocations
internally with a static identifier::

  static int configure_namespace(struct ndctl_region *region,
                  struct ndctl_namespace *ndns,
                  struct namespace_parameters *parameters)
  {
          char devname[50];

          snprintf(devname, sizeof(devname), "namespace%d.%d",
                          ndctl_region_get_id(region), parameters->id);

          ndctl_namespace_set_alt_name(ndns, devname);
          /* 'uuid' must be set prior to setting size! */
          ndctl_namespace_set_uuid(ndns, parameters->uuid);
          ndctl_namespace_set_size(ndns, parameters->size);
          /* unlike pmem namespaces, blk namespaces have a sector size */
          if (parameters->lbasize)
                  ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
          /* report the result of the enable call to the caller */
          return ndctl_namespace_enable(ndns);
  }


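The ordering constraint described above can be illustrated with a toy
model that involves neither the kernel nor LIBNDCTL.  Every name here is
hypothetical, and the -ENXIO error code is an assumption chosen for the
illustration rather than a statement about what the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Toy stand-in for a namespace that is being configured. */
struct toy_namespace {
	unsigned char uuid[16];
	int has_uuid;
	unsigned long long size;
};

int toy_set_uuid(struct toy_namespace *ns, const unsigned char *uuid)
{
	assert(ns != NULL && uuid != NULL);

	memcpy(ns->uuid, uuid, sizeof(ns->uuid));
	ns->has_uuid = 1;
	return 0;
}

/* Capacity can only be accounted against an identified namespace. */
int toy_set_size(struct toy_namespace *ns, unsigned long long size)
{
	assert(ns != NULL);

	if (!ns->has_uuid)
		return -ENXIO;	/* assumed error code, illustration only */
	ns->size = size;
	return 0;
}
```

In this model a size request on a zero-initialized namespace fails until
toy_set_uuid() has been called, mirroring the 'uuid' before 'size' rule.
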
Why the Term "namespace"?
^^^^^^^^^^^^^^^^^^^^^^^^^

    1. Why not "volume" for instance?  "volume" ran the risk of confusing
       ND (libnvdimm subsystem) with a volume manager like device-mapper.

    2. The term originated to describe the sub-devices that can be created
       within an NVME controller (see the nvme specification:
       https://www.nvmexpress.org/specifications/), and NFIT namespaces are
       meant to parallel the capabilities and configurability of
       NVME-namespaces.


LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
-------------------------------------------------

A BTT (design document: https://pmem.io/2014/09/23/btt.html) is a
personality driver for a namespace that fronts the entire namespace as an
'address abstraction'.

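The idea behind the indirection can be sketched with a toy in-memory
model: new data is written out of place into a spare block, and a single
map update then publishes it.  This is a deliberately simplified
illustration with hypothetical names; the real BTT layout (arenas, info
blocks, and its logging scheme) is described in the design document
linked above and is not modeled here.

```c
#include <assert.h>
#include <string.h>

#define TOY_SECTOR_SIZE 512
#define TOY_NLBA 4

struct toy_btt {
	/* one more internal block than external LBAs: the spare */
	unsigned char blocks[TOY_NLBA + 1][TOY_SECTOR_SIZE];
	unsigned int map[TOY_NLBA];	/* external LBA -> internal block */
	unsigned int free_block;	/* block not referenced by the map */
};

void toy_btt_init(struct toy_btt *btt)
{
	unsigned int i;

	memset(btt, 0, sizeof(*btt));
	for (i = 0; i < TOY_NLBA; i++)
		btt->map[i] = i;
	btt->free_block = TOY_NLBA;
}

/* Step 1: write out of place; readers still see the old sector. */
void toy_btt_stage(struct toy_btt *btt, const void *data)
{
	memcpy(btt->blocks[btt->free_block], data, TOY_SECTOR_SIZE);
}

/* Step 2: one map update publishes the staged sector for 'lba'. */
void toy_btt_commit(struct toy_btt *btt, unsigned int lba)
{
	unsigned int old;

	assert(lba < TOY_NLBA);
	old = btt->map[lba];
	btt->map[lba] = btt->free_block;
	btt->free_block = old;	/* the old block becomes the spare */
}

const unsigned char *toy_btt_read(const struct toy_btt *btt, unsigned int lba)
{
	assert(lba < TOY_NLBA);
	return btt->blocks[btt->map[lba]];
}
```

If a failure happens between the two steps, the map still points at the
old block, so a reader sees either the complete old sector or the
complete new one, never a mix.
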
LIBNVDIMM: btt layout
^^^^^^^^^^^^^^^^^^^^^

Every region will start out with at least one BTT device which is the
seed device.  To activate it set the "namespace", "uuid", and
"sector_size" attributes and then bind the device to the nd_pmem or
nd_blk driver depending on the region type::

	/sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
	|-- namespace
	|-- delete
	|-- devtype
	|-- modalias
	|-- numa_node
	|-- sector_size
	|-- subsystem -> ../../../../../bus/nd
	|-- uevent
	`-- uuid

LIBNDCTL: btt creation example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Similar to namespaces, an idle BTT device is automatically created per
region.  Each time this "seed" btt device is configured and enabled, a new
seed is created.  Creating a BTT configuration involves two steps:
finding an idle BTT and assigning it to consume a namespace.

::

	static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
	{
		struct ndctl_btt *btt;

		ndctl_btt_foreach(region, btt)
			if (!ndctl_btt_is_enabled(btt)
					&& !ndctl_btt_is_configured(btt))
				return btt;

		return NULL;
	}

	static int configure_btt(struct ndctl_region *region,
			struct btt_parameters *parameters)
	{
		struct ndctl_btt *btt = get_idle_btt(region);

		if (!btt)
			return -ENXIO;	/* no idle BTT; needs <errno.h> */

		ndctl_btt_set_uuid(btt, parameters->uuid);
		ndctl_btt_set_sector_size(btt, parameters->sector_size);
		ndctl_btt_set_namespace(btt, parameters->ndns);
		/* turn off raw mode device */
		ndctl_namespace_disable(parameters->ndns);
		/* turn on btt access */
		return ndctl_btt_enable(btt);
	}

Once instantiated, a new inactive btt seed device will appear underneath
the region.

Once a "namespace" is removed from a BTT, that instance of the BTT device
will be deleted or otherwise reset to default values.  This deletion is
only at the device model level.  In order to destroy a BTT the "info
block" needs to be destroyed.  Note that to destroy a BTT the media
needs to be written in raw mode.  By default, the kernel will autodetect
the presence of a BTT and disable raw mode.  This autodetect behavior
can be suppressed by enabling raw mode for the namespace via the
ndctl_namespace_set_raw_mode() API.


Summary LIBNDCTL Diagram
========================

For the given example above, here is the view of the objects as seen by the
LIBNDCTL API::

              +---+
              |CTX|
              +-+-+
                |
  +-------+     |
  | DIMM0 <-+   |      +---------+   +--------------+  +---------------+
  +-------+ |   |    +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
  | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+
  +-------+ +-+BUS0+-| +---------+   +--------------+  +----------------------+
  | DIMM2 <-+ +----+ +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | BTT1 |
  +-------+ |        | +---------+   +--------------+  +---------------+------+
  | DIMM3 <-+
  +-------+