========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10 TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.
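As a rough check of the capacity overhead quoted above, the following
Python sketch counts the zones on such a disk. It assumes that 10 TB
means 10 * 10^12 bytes and that 256 MB means 256 * 2^20 bytes; neither
convention is stated here, and other readings shift the result slightly.

```python
# ASSUMPTION: decimal terabytes for the disk, binary megabytes for the zones.
DISK_BYTES = 10 * 10**12
ZONE_BYTES = 256 * 2**20

total_zones = DISK_BYTES // ZONE_BYTES     # whole zones on the disk
internal_zones = 5                         # lower bound quoted in the text
overhead = internal_zones / total_zones    # fraction of zones not exposed

print(f"zones on disk:     {total_zones}")
print(f"capacity overhead: {overhead:.4%}")
```

Under these assumptions the disk has about 37,000 zones, so 5 internal
zones amount to roughly 0.013% of the capacity.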

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
   Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
   sequential zones used exclusively to store user data. The
   conventional zones of the device may also be used for buffering user
   random writes. Data in these zones may be directly mapped to the
   conventional zone, but later moved to a sequential zone so that the
   conventional zone can be reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).
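The effect of the 4096-byte block size on metadata can be illustrated
with a short Python sketch. It assumes one validity bit per block and
reuses the 256 MB zone size from the example earlier in this document,
read here as 256 * 2^20 bytes; both are assumptions made for
illustration only.

```python
ZONE_BYTES = 256 * 2**20   # ASSUMPTION: zone size from the 10 TB disk example


def bitmap_bytes_per_zone(block_size: int) -> int:
    """Size in bytes of a one-bit-per-block validity bitmap for one zone."""
    blocks_per_zone = ZONE_BYTES // block_size
    return blocks_per_zone // 8


for block_size in (512, 4096):
    size = bitmap_bytes_per_zone(block_size)
    print(f"{block_size:>4}-byte blocks: {size} bytes of bitmap per zone")
```

With these assumptions a zone needs 8 KiB of bitmap with 4096-byte
blocks, against 64 KiB with 512-byte blocks.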

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
   super block, which describes the number and on-disk position of
   metadata blocks.

2) Following the super block, a set of blocks is used to describe the
   mapping of the logical device blocks. The mapping is done per chunk
   of blocks, with the chunk size equal to the zone size of the zoned
   block device. The mapping table is indexed by chunk number and each
   mapping entry indicates the zone number of the device storing the
   chunk of data. Each mapping entry may also indicate the zone number
   of a conventional zone used to buffer random modifications to the
   data zone.

3) A set of blocks used to store bitmaps indicating the validity of
   blocks in the data zones follows the mapping table. A valid block is
   defined as a block that was written and not discarded. For a buffered
   data chunk, a block is only ever valid in one place: either in the
   data zone mapping the chunk or in the buffer zone of the chunk.

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
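The write rules above can be sketched as a small in-memory model in
Python. The ``Zone`` and ``Chunk`` classes, the free lists and the
one-block granularity are hypothetical simplifications. Invalidating
the buffered copy on a direct write is an assumption derived from the
rule that a block is valid in only one zone.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Zone:
    sequential: bool
    write_pointer: int = 0                     # next block writable in place
    valid: set = field(default_factory=set)    # chunk offsets holding valid data


@dataclass
class Chunk:
    data_zone: Zone
    buffer_zone: Optional[Zone] = None


def write_block(chunk: Chunk, offset: int,
                free_conventional: List[Zone],
                free_sequential: List[Zone]) -> str:
    """Process a one-block write at `offset`; return the path taken."""
    dzone, bzone = chunk.data_zone, chunk.buffer_zone

    if not dzone.sequential or offset == dzone.write_pointer:
        # Conventional zone, or write aligned on the write pointer.
        dzone.valid.add(offset)
        if dzone.sequential:
            dzone.write_pointer += 1
        if bzone is not None:
            bzone.valid.discard(offset)        # keep the block valid in one zone only
        return "direct"

    # Unaligned write to a sequential zone: use a conventional buffer zone.
    if bzone is None:
        if not free_conventional:
            raise RuntimeError("no free conventional zone; reclaim must run first")
        bzone = chunk.buffer_zone = free_conventional.pop()
    bzone.valid.add(offset)
    dzone.valid.discard(offset)                # buffered copy supersedes the old one

    if not dzone.valid:
        # Nothing valid remains in the sequential zone: free it and
        # promote the buffer zone to primary zone of the chunk.
        dzone.write_pointer = 0
        free_sequential.append(dzone)
        chunk.data_zone, chunk.buffer_zone = bzone, None
    return "buffered"
```

In this model, two aligned writes followed by rewrites of the same two
blocks first take the direct path, then the buffered path, and finally
leave the chunk mapped to the conventional zone alone.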

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation is terminated.
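A minimal Python sketch of this read rule is shown below. Each zone is
modelled as a dictionary holding only its valid blocks, and ``None``
stands for "no zone assigned"; both conventions are hypothetical.

```python
BLOCK_SIZE = 4096   # dm-zoned logical block size, from the text


def read_block(offset, data_zone, buffer_zone=None):
    """Return the block at `offset` of a chunk, or zeroes if no valid copy exists."""
    # A block is valid in at most one zone, so the lookup order does not matter.
    for zone in (buffer_zone, data_zone):
        if zone is not None and offset in zone:
            return zone[offset]
    return bytes(BLOCK_SIZE)    # unmapped chunk or invalid block: zero-filled
```

For example, ``read_block(0, None)`` returns 4096 zero bytes because the
chunk has no mapping.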

After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.
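One reclaim step can be sketched in Python as follows. The data
structures are hypothetical: zones are dictionaries of valid blocks and
an ``OrderedDict`` stands in for the least-recently-used ordering. The
sketch also carries over blocks still valid in the old sequential zone
and returns that zone to the free list, which the text leaves implicit
and is therefore an assumption.

```python
from collections import OrderedDict


def reclaim_lru(mapping, lru, free_sequential, free_conventional):
    """Reclaim the least recently used buffer zone; return its chunk number.

    mapping: chunk number -> {"data": zone, "buffer": zone or None}
    lru:     OrderedDict keyed by buffered chunk number, oldest first
    Zones are dicts {block offset: data} holding only valid blocks.
    """
    if not lru or not free_sequential:
        return None
    chunk, _ = lru.popitem(last=False)         # least recently used entry
    entry = mapping[chunk]
    old_data, buf = entry["data"], entry["buffer"]

    target = free_sequential.pop()
    # Copy in increasing offset order, as a sequential zone requires.
    # ASSUMPTION: blocks still valid in the old data zone are copied too.
    for offset in sorted(old_data.keys() | buf.keys()):
        target[offset] = buf[offset] if offset in buf else old_data[offset]

    # Remap only once the copy is complete, then release both old zones.
    entry["data"], entry["buffer"] = target, None
    buf.clear()
    free_conventional.append(buf)
    old_data.clear()
    free_sequential.append(old_data)
    return chunk
```

Updating the mapping only after the copy has finished mirrors the
ordering given in the text, so an interrupted copy leaves the old
mapping usable.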

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place metadata block
updates can be done in the primary metadata set. This ensures that one
of the sets is always consistent (all modifications committed or none at
all). Flush operations are used as a commit point. Upon reception of a
flush request, metadata modification activity is temporarily blocked
(for both incoming BIO processing and the reclaim process) and all dirty
metadata blocks are staged and updated. Normal operation is then
resumed. Flushing metadata thus only temporarily delays write and
discard requests. Read requests can be processed concurrently while a
metadata flush is being executed.
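The two-step commit can be modelled in a few lines of Python. This is a
sketch under stated assumptions, not the actual implementation: the
class, the ``crash_after`` test hook and the recovery rule (trust the
set with the highest generation) are all hypothetical illustrations of
the scheme described above.

```python
class MetadataSets:
    """Two metadata copies, each validated by a generation counter."""

    def __init__(self):
        self.primary = {"gen": 0, "blocks": {}}
        self.secondary = {"gen": 0, "blocks": {}}

    def flush(self, dirty, crash_after=None):
        """Commit `dirty` blocks; `crash_after` simulates a power loss."""
        new_gen = self.primary["gen"] + 1

        # Step 1: stage the dirty blocks in the secondary set.
        self.secondary["blocks"].update(dirty)
        if crash_after == "stage":
            return
        # Step 2: validate the staged blocks via the secondary super block.
        self.secondary["gen"] = new_gen
        if crash_after == "validate":
            return
        # Step 3: update the primary set in place.
        self.primary["blocks"].update(dirty)
        if crash_after == "in_place":
            return
        self.primary["gen"] = new_gen

    def recover(self):
        """Return the set to trust after a restart."""
        if self.secondary["gen"] > self.primary["gen"]:
            return self.secondary
        return self.primary
```

In this model a crash during the in-place update leaves the secondary
set with the higher generation, so recovery uses it. A crash before the
secondary super block is updated leaves both generations equal, and the
untouched primary set is used.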

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex::

	dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex::

	echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
	dmsetup create dmz-`basename ${dev}`