.. SPDX-License-Identifier: GPL-2.0-only

========
dm-clone
========

Introduction
============

dm-clone is a device mapper target which produces a one-to-one copy of an
existing, read-only source device into a writable destination device: It
presents a virtual block device which makes all data appear immediately, and
redirects reads and writes accordingly.

The main use case of dm-clone is to clone a potentially remote, high-latency,
read-only, archival-type block device into a writable, fast, primary-type device
for fast, low-latency I/O. The cloned device is visible/mountable immediately
and the copy of the source device to the destination device happens in the
background, in parallel with user I/O.

For example, one could restore an application backup from a read-only copy,
accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
etc.), into a local SSD or NVMe device, and start using the device immediately,
without waiting for the restore to complete.

When the cloning completes, the dm-clone table can be removed altogether and be
replaced, e.g., by a linear table, mapping directly to the destination device.

The dm-clone target reuses the metadata library used by the thin-provisioning
target.

Glossary
========

   Hydration
     The process of filling a region of the destination device with data from
     the same region of the source device, i.e., copying the region from the
     source to the destination device.

Once a region gets hydrated we redirect all I/O regarding it to the destination
device.

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with other
parameters detailed later):

1. A source device - the read-only device that gets cloned and source of the
   hydration.

2. A destination device - the destination of the hydration, which will become a
   clone of the source device.

3. A small metadata device - it records which regions are already valid in the
   destination device, i.e., which regions have already been hydrated, or have
   been written to directly, via user I/O.

The size of the destination device must be at least equal to the size of the
source device.
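
The size requirement above can be checked before the target is created. The
following is only a minimal sketch and not part of dm-clone or dmsetup:
``dest_large_enough`` is a hypothetical helper name, and the commented usage
assumes that ``blockdev --getsz`` (which reports a device size in 512-byte
sectors) is available and that ``$source_dev`` and ``$dest_dev`` name real
block devices.

```shell
# Hypothetical helper (not part of dm-clone or dmsetup): succeeds when the
# destination is at least as large as the source.
# Both sizes are given in 512-byte sectors.
dest_large_enough() {
    src_sectors=$1
    dest_sectors=$2
    [ "$dest_sectors" -ge "$src_sectors" ]
}

# Usage sketch; assumes $source_dev and $dest_dev name real block devices:
#   src=$(blockdev --getsz "$source_dev")
#   dst=$(blockdev --getsz "$dest_dev")
#   dest_large_enough "$src" "$dst" || echo "destination device is too small" >&2
```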

Regions
-------

dm-clone divides the source and destination devices into fixed-size regions.
Regions are the unit of hydration, i.e., the minimum amount of data copied from
the source to the destination device.

The region size is configurable when you first create the dm-clone device. The
recommended region size is the same as the file system block size, which usually
is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors
(1GB) and a power of two.

Reads and writes from/to hydrated regions are serviced from the destination
device.

A read to a not yet hydrated region is serviced directly from the source device.

A write to a not yet hydrated region starts the hydration of that region
immediately and is delayed until the hydration has completed.

Note that a write request with size equal to region size will skip copying of
the corresponding region from the source device and overwrite the region of the
destination device directly.
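
The region size rules above can be sketched as a small pre-flight check. This
is an illustration only: ``region_size_ok`` and ``nr_regions`` are hypothetical
helper names, sizes are in 512-byte sectors, and rounding the region count up
for a partial final region is an assumption rather than something stated in
this document.

```shell
# Hypothetical helpers illustrating the region-size rules described above.

# Succeeds if the region size (in 512-byte sectors) is a power of two
# between 8 sectors (4KB) and 2097152 sectors (1GB).
region_size_ok() {
    rs=$1
    [ "$rs" -ge 8 ] && [ "$rs" -le 2097152 ] && [ $((rs & (rs - 1))) -eq 0 ]
}

# Prints the number of regions needed to cover a device of the given length.
# Assumption: a partial final region still counts as one region (round up).
nr_regions() {
    len_sectors=$1
    rs=$2
    echo $(( (len_sectors + rs - 1) / rs ))
}
```

For a device of 1048576000 sectors with a region size of 8 sectors,
``nr_regions 1048576000 8`` prints ``131072000``.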

Discards
--------

dm-clone interprets a discard request to a range that hasn't been hydrated yet
as a hint to skip hydration of the regions covered by the request, i.e., it
skips copying the region's data from the source to the destination device, and
only updates its metadata.

If the destination device supports discards, then by default dm-clone will pass
down discard requests to it.

Background Hydration
--------------------

dm-clone copies continuously from the source to the destination device, until
all of the device has been copied.

Copying data from the source to the destination device uses bandwidth. The user
can set a throttle to prevent more than a certain amount of copying occurring at
any one time. Moreover, dm-clone takes into account user I/O traffic going to
the devices and pauses the background hydration when there is I/O in-flight.

A message `hydration_threshold <#regions>` can be used to set the maximum number
of regions being copied, the default being 1 region.

dm-clone employs dm-kcopyd for copying portions of the source device to the
destination device. By default, we issue copy requests of size equal to the
region size. A message `hydration_batch_size <#regions>` can be used to tune the
size of these copy requests. Increasing the hydration batch size results in
dm-clone trying to batch together contiguous regions, so we copy the data in
batches of this many regions.

When the hydration of the destination device finishes, a dm event will be sent
to user space.

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written. If no
such requests are made then commits will occur every second. This means the
dm-clone device behaves like a physical disk that has a volatile write cache. If
power is lost you may lose some recent writes. The metadata should always be
consistent in spite of any crash.
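
As a rough aid for choosing the hydration tuning values described under
Background Hydration, the upper bound on the size of one background copy
request can be estimated from the region size and the batch size. This is a
sketch under stated assumptions: ``copy_request_bytes`` is a hypothetical
helper name, and the commented commands assume an existing dm-clone device
named ``clone`` with a region size of 8 sectors.

```shell
# Hypothetical helper: upper bound, in bytes, on one background copy request,
# given the region size in 512-byte sectors and the batch size in regions.
copy_request_bytes() {
    region_sectors=$1
    batch_regions=$2
    echo $(( region_sectors * 512 * batch_regions ))
}

# Usage sketch; assumes an existing dm-clone device named "clone" with a
# region size of 8 sectors (4KB):
#   copy_request_bytes 8 256
#   dmsetup message clone 0 hydration_batch_size 256
#   dmsetup message clone 0 hydration_threshold 256
```

With 4KB regions and a batch size of 256, ``copy_request_bytes 8 256`` prints
``1048576``, i.e., copy requests of up to 1MB.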

Target Interface
================

Constructor
-----------

  ::

   clone <metadata dev> <destination dev> <source dev> <region size>
         [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]]

 ================ ==============================================================
 metadata dev     Fast device holding the persistent metadata
 destination dev  The destination device, where the source will be cloned
 source dev       Read-only device containing the data that gets cloned
 region size      The size of a region in sectors

 #feature args    Number of feature arguments passed
 feature args     no_hydration or no_discard_passdown

 #core args       An even number of arguments corresponding to key/value pairs
                  passed to dm-clone
 core args        Key/value pairs passed to dm-clone, e.g. `hydration_threshold
                  256`
 ================ ==============================================================

Optional feature arguments are:

 ==================== =========================================================
 no_hydration         Create a dm-clone instance with background hydration
                      disabled
 no_discard_passdown  Disable passing down discards to the destination device
 ==================== =========================================================

Optional core arguments are:

 ================================ ==============================================
 hydration_threshold <#regions>   Maximum number of regions being copied from
                                  the source to the destination device at any
                                  one time, during background hydration.
 hydration_batch_size <#regions>  During background hydration, try to batch
                                  together contiguous regions, so we copy data
                                  from the source to the destination device in
                                  batches of this many regions.
 ================================ ==============================================

Status
------

  ::

   <metadata block size> <#used metadata blocks>/<#total metadata blocks>
   <region size> <#hydrated regions>/<#total regions> <#hydrating regions>
   <#feature args> <feature args>* <#core args> <core args>*
   <clone metadata mode>

 ======================= =======================================================
 metadata block size     Fixed block size for each metadata block in sectors
 #used metadata blocks   Number of metadata blocks used
 #total metadata blocks  Total number of metadata blocks
 region size             Configurable region size for the device in sectors
 #hydrated regions       Number of regions that have finished hydrating
 #total regions          Total number of regions to hydrate
 #hydrating regions      Number of regions currently hydrating
 #feature args           Number of feature arguments to follow
 feature args            Feature arguments, e.g. `no_hydration`
 #core args              Even number of core arguments to follow
 core args               Key/value pairs for tuning the core, e.g.
                         `hydration_threshold 256`
 clone metadata mode     ro if read-only, rw if read-write

                         In serious cases where even a read-only mode is deemed
                         unsafe no further I/O will be permitted and the status
                         will just contain the string 'Fail'. If the metadata
                         mode changes, a dm event will be sent to user space.
 ======================= =======================================================

Messages
--------

  `disable_hydration`
      Disable the background hydration of the destination device.

  `enable_hydration`
      Enable the background hydration of the destination device.

  `hydration_threshold <#regions>`
      Set background hydration threshold.

  `hydration_batch_size <#regions>`
      Set background hydration batch size.

Examples
========

Clone a device containing a file system
---------------------------------------

1. Create the dm-clone device.

   ::

    dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \
      $source_dev 8 1 no_hydration"

2. Mount the device and trim the file system. dm-clone interprets the discards
   sent by the file system and it will not hydrate the unused space.

   ::

    mount /dev/mapper/clone /mnt/cloned-fs
    fstrim /mnt/cloned-fs

3. Enable background hydration of the destination device.

   ::

    dmsetup message clone 0 enable_hydration

4. When the hydration finishes, we can replace the dm-clone table with a linear
   table.

   ::

    dmsetup suspend clone
    dmsetup load clone --table "0 1048576000 linear $dest_dev 0"
    dmsetup resume clone

   The metadata device is no longer needed and can be safely discarded or reused
   for other purposes.

Known issues
============

1. Reads to not-yet-hydrated regions are redirected to the source device. If
   reading the source device has high latency and the user repeatedly reads from
   the same regions, this behaviour could degrade performance. We should use
   these reads as hints to hydrate the relevant regions sooner. Currently, we
   rely on the page cache to cache these regions, so we hopefully don't end up
   reading them multiple times from the source device.

2. We should release in-core resources, i.e., the bitmaps tracking which regions
   are hydrated, after the hydration has finished.

3. During background hydration, if we fail to read the source or write to the
   destination device, we print an error message, but the hydration process
   continues indefinitely, until it succeeds. We should stop the background
   hydration after a number of failures and emit a dm event for user space to
   notice.

Why not...?
===========

We explored the following alternatives before implementing dm-clone:

1. Use dm-cache with cache size equal to the source device and implement a new
   cloning policy:

   * The resulting cache device is not a one-to-one mirror of the source device
     and thus we cannot remove the cache device once cloning completes.

   * dm-cache writes to the source device, which violates our requirement that
     the source device must be treated as read-only.

   * Caching is semantically different from cloning.

2. Use dm-snapshot with a COW device equal to the source device:

   * dm-snapshot stores its metadata in the COW device, so the resulting device
     is not a one-to-one mirror of the source device.

   * No background copying mechanism.

   * dm-snapshot needs to commit its metadata whenever a pending exception
     completes, to ensure snapshot consistency. In the case of cloning, we don't
     need to be so strict and can rely on committing metadata every time a FLUSH
     or FUA bio is written, or periodically, like dm-thin and dm-cache do. This
     improves the performance significantly.

3. Use dm-mirror: The mirror target has a background copying/mirroring
   mechanism, but it writes to all mirrors, thus violating our requirement that
   the source device must be treated as read-only.

4. Use dm-thin's external snapshot functionality. This approach is the most
   promising among all alternatives, as the thinly-provisioned volume is a
   one-to-one mirror of the source device and handles reads and writes to
   un-provisioned/not-yet-cloned areas the same way as dm-clone does.

   Still:

   * There is no background copying mechanism, though one could be implemented.

   * Most importantly, we want to support arbitrary block devices as the
     destination of the cloning process and not restrict ourselves to
     thinly-provisioned volumes. Thin-provisioning has an inherent metadata
     overhead, for maintaining the thin volume mappings, which significantly
     degrades performance.

   Moreover, cloning a device shouldn't force the use of thin-provisioning. On
   the other hand, if we wish to use thin provisioning, we can just use a thin
   LV as dm-clone's destination device.