.. SPDX-License-Identifier: GPL-2.0-only

========
dm-clone
========

Introduction
============

dm-clone is a device mapper target which produces a one-to-one copy of an
existing, read-only source device into a writable destination device. It
presents a virtual block device which makes all data appear immediately, and
redirects reads and writes accordingly.

The main use case of dm-clone is to clone a potentially remote, high-latency,
read-only, archival-type block device into a writable, fast, primary-type device
for fast, low-latency I/O. The cloned device is visible/mountable immediately
and the copy of the source device to the destination device happens in the
background, in parallel with user I/O.

For example, one could restore an application backup from a read-only copy,
accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE,
etc.), into a local SSD or NVMe device, and start using the device immediately,
without waiting for the restore to complete.

When the cloning completes, the dm-clone table can be removed altogether and be
replaced, e.g., by a linear table, mapping directly to the destination device.

The dm-clone target reuses the metadata library used by the thin-provisioning
target.

Glossary
========

   Hydration
     The process of filling a region of the destination device with data from
     the same region of the source device, i.e., copying the region from the
     source to the destination device.

Once a region gets hydrated, we redirect all I/O regarding it to the
destination device.

Design
======

Sub-devices
-----------

The target is constructed by passing three devices to it (along with other
parameters detailed later):

1. A source device - the read-only device that gets cloned and the source of
   the hydration.

2. A destination device - the destination of the hydration, which will become a
   clone of the source device.

3. A small metadata device - it records which regions are already valid in the
   destination device, i.e., which regions have already been hydrated, or have
   been written to directly, via user I/O.

The size of the destination device must be at least equal to the size of the
source device.

Regions
-------

dm-clone divides the source and destination devices into fixed-size regions.
Regions are the unit of hydration, i.e., the minimum amount of data copied from
the source to the destination device.

The region size is configurable when you first create the dm-clone device. The
recommended region size is the same as the file system block size, which usually
is 4KB. The region size must be a power of two between 8 sectors (4KB) and
2097152 sectors (1GB).
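
As a rough illustration, the region size constraints above can be checked with a
few lines of Python. This is a sketch, not the kernel's implementation: the
function names are hypothetical, sectors are taken to be 512 bytes (as implied
by "8 sectors (4KB)"), and rounding the region count up for a device that is
not a multiple of the region size is an assumption.

```python
MIN_REGION_SIZE = 8        # sectors (4KB)
MAX_REGION_SIZE = 2097152  # sectors (1GB)


def is_valid_region_size(sectors):
    """Return True if `sectors` is a power of two within the allowed range."""
    in_range = MIN_REGION_SIZE <= sectors <= MAX_REGION_SIZE
    power_of_two = sectors > 0 and (sectors & (sectors - 1)) == 0
    return in_range and power_of_two


def region_count(device_sectors, region_sectors):
    """Number of regions covering the device (assumption: round up)."""
    if not is_valid_region_size(region_sectors):
        raise ValueError(f"invalid region size: {region_sectors} sectors")
    return -(-device_sectors // region_sectors)  # ceiling division
```

For instance, a 1048576000-sector device with 8-sector regions is tracked as
131072000 regions.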

Reads and writes from/to hydrated regions are serviced from the destination
device.

A read to a not-yet-hydrated region is serviced directly from the source device.

A write to a not-yet-hydrated region is delayed until the corresponding region
has been hydrated; the hydration of that region starts immediately.

Note that a write request with size equal to region size will skip copying of
the corresponding region from the source device and overwrite the region of the
destination device directly.
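
The routing rules above can be summarised as a small decision function. The
sketch below is only a model of the description, assuming each request falls
within a single region; `Action` and `route` are hypothetical names, not kernel
symbols.

```python
from enum import Enum


class Action(Enum):
    DESTINATION = "service from the destination device"
    SOURCE = "service from the source device"
    OVERWRITE = "overwrite the destination region, skip the copy"
    HYDRATE_THEN_WRITE = "hydrate the region now, then apply the write"


def route(is_write, hydrated, io_sectors, region_sectors):
    """Decide how a request to a single region is handled."""
    if hydrated:
        return Action.DESTINATION          # reads and writes alike
    if not is_write:
        return Action.SOURCE               # read of a not-yet-hydrated region
    if io_sectors == region_sectors:
        return Action.OVERWRITE            # whole-region write
    return Action.HYDRATE_THEN_WRITE       # partial write waits for hydration
```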

Discards
--------

dm-clone interprets a discard request to a range that hasn't been hydrated yet
as a hint to skip hydration of the regions covered by the request, i.e., it
skips copying the regions' data from the source to the destination device, and
only updates its metadata.

If the destination device supports discards, then by default dm-clone will pass
down discard requests to it.
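
To make "the regions covered by the request" concrete, the sketch below maps a
discard range to region indices. It assumes that only regions lying entirely
inside the discarded range are affected; the text above does not state this, so
treat it as an assumption of the example. The function names and the `valid`
set (standing in for the metadata) are hypothetical.

```python
def regions_fully_covered(start_sector, nr_sectors, region_sectors):
    """Indices of regions lying entirely inside the discarded sector range."""
    first = -(-start_sector // region_sectors)            # round start up
    end = (start_sector + nr_sectors) // region_sectors   # round end down
    return range(first, max(first, end))


def apply_discard(valid, start_sector, nr_sectors, region_sectors):
    """Metadata-only update: covered regions no longer need hydration."""
    for region in regions_fully_covered(start_sector, nr_sectors, region_sectors):
        valid.add(region)
    return valid
```

A discard of sectors 4 to 19 with 8-sector regions fully covers only region 1,
so under this assumption only that region would be skipped by hydration.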

Background Hydration
--------------------

dm-clone copies continuously from the source to the destination device, until
all of the device has been copied.

Copying data from the source to the destination device uses bandwidth. The user
can set a throttle to prevent more than a certain amount of copying occurring at
any one time. Moreover, dm-clone takes into account user I/O traffic going to
the devices and pauses the background hydration when there is I/O in-flight.

A message `hydration_threshold <#regions>` can be used to set the maximum number
of regions being copied at any one time, the default being 1 region.

dm-clone employs dm-kcopyd for copying portions of the source device to the
destination device. By default, we issue copy requests of size equal to the
region size. A message `hydration_batch_size <#regions>` can be used to tune the
size of these copy requests. Increasing the hydration batch size results in
dm-clone trying to batch together contiguous regions, so we copy the data in
batches of this many regions.
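
The effect of the batch size can be pictured with a toy planner that groups
contiguous not-yet-hydrated regions into copy requests. This is an illustration
of the idea only, not the algorithm the kernel uses; `plan_batches` is a
hypothetical name and the list of booleans stands in for the hydration bitmap.

```python
def plan_batches(hydrated, batch_size):
    """Group contiguous non-hydrated regions into (first_region, count) batches.

    `hydrated` holds one boolean per region; no batch exceeds `batch_size`.
    """
    if batch_size < 1:
        raise ValueError("batch size must be at least 1 region")
    batches = []
    start = None  # first region of the batch being built, if any
    for region, done in enumerate(hydrated):
        if done:
            if start is not None:          # a hydrated region ends the run
                batches.append((start, region - start))
                start = None
        elif start is None:
            start = region                 # begin a new run
        elif region - start == batch_size:
            batches.append((start, batch_size))  # batch is full, start another
            start = region
    if start is not None:
        batches.append((start, len(hydrated) - start))
    return batches
```

With a batch size of 1 every region becomes its own copy request, which matches
the default described above.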

When the hydration of the destination device finishes, a dm event will be sent
to user space.

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written. If no
such requests are made then commits will occur every second. This means the
dm-clone device behaves like a physical disk that has a volatile write cache. If
power is lost you may lose some recent writes. The metadata should always be
consistent in spite of any crash.

Target Interface
================

Constructor
-----------

  ::

   clone <metadata dev> <destination dev> <source dev> <region size>
         [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]]

 ================ ==============================================================
 metadata dev     Fast device holding the persistent metadata
 destination dev  The destination device, where the source will be cloned
 source dev       Read only device containing the data that gets cloned
 region size      The size of a region in sectors

 #feature args    Number of feature arguments passed
 feature args     no_hydration or no_discard_passdown

 #core args       An even number of arguments corresponding to key/value pairs
                  passed to dm-clone
 core args        Key/value pairs passed to dm-clone, e.g. `hydration_threshold
                  256`
 ================ ==============================================================

Optional feature arguments are:

 ==================== =========================================================
 no_hydration         Create a dm-clone instance with background hydration
                      disabled
 no_discard_passdown  Disable passing down discards to the destination device
 ==================== =========================================================

Optional core arguments are:

 ================================ ==============================================
 hydration_threshold <#regions>   Maximum number of regions being copied from
                                  the source to the destination device at any
                                  one time, during background hydration.
 hydration_batch_size <#regions>  During background hydration, try to batch
                                  together contiguous regions, so we copy data
                                  from the source to the destination device in
                                  batches of this many regions.
 ================================ ==============================================
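
As a sketch of how the constructor arguments fit together, the helper below
assembles a table line following the syntax shown above. `clone_table` is a
hypothetical helper, the device paths in the usage note are placeholders, and
the leading start sector and length are the usual fields of a device-mapper
table line.

```python
FEATURES = {"no_hydration", "no_discard_passdown"}
CORE_KEYS = {"hydration_threshold", "hydration_batch_size"}


def clone_table(length, metadata_dev, dest_dev, source_dev, region_size,
                features=(), core_args=None):
    """Build '<start> <length> clone ...' for a dm-clone target at sector 0."""
    features = list(features)
    core_args = dict(core_args or {})
    if not set(features) <= FEATURES:
        raise ValueError(f"unknown feature argument in {features}")
    if not set(core_args) <= CORE_KEYS:
        raise ValueError(f"unknown core argument in {sorted(core_args)}")
    fields = ["0", str(length), "clone",
              metadata_dev, dest_dev, source_dev, str(region_size)]
    if features or core_args:
        # Core arguments may only follow the feature count, even if it is 0.
        fields.append(str(len(features)))
        fields.extend(features)
    if core_args:
        fields.append(str(2 * len(core_args)))  # key/value pairs: even count
        for key, value in core_args.items():
            fields.extend([key, str(value)])
    return " ".join(fields)
```

Calling ``clone_table(1048576000, "/dev/meta", "/dev/dst", "/dev/src", 8,
["no_hydration"])`` yields a string that could be passed to ``dmsetup create
clone --table``.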

Status
------

  ::

   <metadata block size> <#used metadata blocks>/<#total metadata blocks>
   <region size> <#hydrated regions>/<#total regions> <#hydrating regions>
   <#feature args> <feature args>* <#core args> <core args>*
   <clone metadata mode>

 ======================= =======================================================
 metadata block size     Fixed block size for each metadata block in sectors
 #used metadata blocks   Number of metadata blocks used
 #total metadata blocks  Total number of metadata blocks
 region size             Configurable region size for the device in sectors
 #hydrated regions       Number of regions that have finished hydrating
 #total regions          Total number of regions to hydrate
 #hydrating regions      Number of regions currently hydrating
 #feature args           Number of feature arguments to follow
 feature args            Feature arguments, e.g. `no_hydration`
 #core args              Even number of core arguments to follow
 core args               Key/value pairs for tuning the core, e.g.
                         `hydration_threshold 256`
 clone metadata mode     ro if read-only, rw if read-write

                         In serious cases where even a read-only mode is deemed
                         unsafe no further I/O will be permitted and the status
                         will just contain the string 'Fail'. If the metadata
                         mode changes, a dm event will be sent to user space.
 ======================= =======================================================
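
A minimal parser for the status fields listed above might look as follows. It
is a sketch under stated assumptions: `parse_clone_status` is a hypothetical
helper, it expects only the dm-clone specific part of the line (the start
sector, length and target name that ``dmsetup status`` prints first must
already be removed), and the sample line in the usage note is made up.

```python
def parse_clone_status(status):
    """Parse the dm-clone specific part of a status line into a dict."""
    tokens = status.split()
    if tokens == ["Fail"]:
        return {"failed": True}
    it = iter(tokens)

    def ratio(token):
        left, right = token.split("/")
        return int(left), int(right)

    parsed = {"failed": False}
    parsed["metadata_block_size"] = int(next(it))
    parsed["used_metadata_blocks"], parsed["total_metadata_blocks"] = ratio(next(it))
    parsed["region_size"] = int(next(it))
    parsed["hydrated_regions"], parsed["total_regions"] = ratio(next(it))
    parsed["hydrating_regions"] = int(next(it))
    nr_features = int(next(it))
    parsed["features"] = [next(it) for _ in range(nr_features)]
    nr_core = int(next(it))                 # even: key/value pairs
    core = [next(it) for _ in range(nr_core)]
    parsed["core_args"] = {k: int(v) for k, v in zip(core[0::2], core[1::2])}
    parsed["metadata_mode"] = next(it)      # "ro" or "rw"
    return parsed
```

For a made-up line such as ``8 16/1024 8 100/200 1 1 no_hydration 4
hydration_threshold 1 hydration_batch_size 1 rw``, the result reports 100 of
200 regions hydrated and a read-write metadata mode.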

Messages
--------

  `disable_hydration`
      Disable the background hydration of the destination device.

  `enable_hydration`
      Enable the background hydration of the destination device.

  `hydration_threshold <#regions>`
      Set background hydration threshold.

  `hydration_batch_size <#regions>`
      Set background hydration batch size.
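
For scripting, the messages above can be turned into ``dmsetup message``
command lines. The helper below only builds the argument list and does not run
anything; `message_argv` is a hypothetical name, and the sector argument is
fixed to 0 as in the example later in this document.

```python
MESSAGES_WITHOUT_ARG = {"enable_hydration", "disable_hydration"}
MESSAGES_WITH_REGIONS = {"hydration_threshold", "hydration_batch_size"}


def message_argv(device, message, regions=None):
    """Return the argv list for sending a dm-clone message with dmsetup."""
    argv = ["dmsetup", "message", device, "0", message]
    if message in MESSAGES_WITHOUT_ARG:
        if regions is not None:
            raise ValueError(f"{message} takes no argument")
        return argv
    if message in MESSAGES_WITH_REGIONS:
        if regions is None:
            raise ValueError(f"{message} needs a number of regions")
        return argv + [str(regions)]
    raise ValueError(f"unknown dm-clone message: {message}")
```

The resulting list can be handed to ``subprocess.run(argv, check=True)`` by a
sufficiently privileged process.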

Examples
========

Clone a device containing a file system
---------------------------------------

1. Create the dm-clone device.

   ::

    dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \
      $source_dev 8 1 no_hydration"

2. Mount the device and trim the file system. dm-clone interprets the discards
   sent by the file system and it will not hydrate the unused space.

   ::

    mount /dev/mapper/clone /mnt/cloned-fs
    fstrim /mnt/cloned-fs

3. Enable background hydration of the destination device.

   ::

    dmsetup message clone 0 enable_hydration

4. When the hydration finishes, we can replace the dm-clone table with a linear
   table.

   ::

    dmsetup suspend clone
    dmsetup load clone --table "0 1048576000 linear $dest_dev 0"
    dmsetup resume clone

   The metadata device is no longer needed and can be safely discarded or reused
   for other purposes.

Known issues
============

1. We redirect reads to not-yet-hydrated regions to the source device. If
   reading the source device has high latency and the user repeatedly reads from
   the same regions, this behaviour could degrade performance. We should use
   these reads as hints to hydrate the relevant regions sooner. Currently, we
   rely on the page cache to cache these regions, so we hopefully don't end up
   reading them multiple times from the source device.

2. We should release in-core resources, i.e., the bitmaps tracking which
   regions are hydrated, after the hydration has finished.

3. During background hydration, if we fail to read the source or write to the
   destination device, we print an error message, but the hydration process
   continues indefinitely, until it succeeds. We should stop the background
   hydration after a number of failures and emit a dm event for user space to
   notice.

Why not...?
===========

We explored the following alternatives before implementing dm-clone:

1. Use dm-cache with cache size equal to the size of the source device and
   implement a new cloning policy:

   * The resulting cache device is not a one-to-one mirror of the source device
     and thus we cannot remove the cache device once cloning completes.

   * dm-cache writes to the source device, which violates our requirement that
     the source device must be treated as read-only.

   * Caching is semantically different from cloning.

2. Use dm-snapshot with a COW device equal in size to the source device:

   * dm-snapshot stores its metadata in the COW device, so the resulting device
     is not a one-to-one mirror of the source device.

   * No background copying mechanism.

   * dm-snapshot needs to commit its metadata whenever a pending exception
     completes, to ensure snapshot consistency. In the case of cloning, we don't
     need to be so strict and can rely on committing metadata every time a FLUSH
     or FUA bio is written, or periodically, like dm-thin and dm-cache do. This
     improves the performance significantly.

3. Use dm-mirror: The mirror target has a background copying/mirroring
   mechanism, but it writes to all mirrors, thus violating our requirement that
   the source device must be treated as read-only.

4. Use dm-thin's external snapshot functionality. This approach is the most
   promising among all alternatives, as the thinly-provisioned volume is a
   one-to-one mirror of the source device and handles reads and writes to
   un-provisioned/not-yet-cloned areas the same way as dm-clone does.

   Still:

   * There is no background copying mechanism, though one could be implemented.

   * Most importantly, we want to support arbitrary block devices as the
     destination of the cloning process and not restrict ourselves to
     thinly-provisioned volumes. Thin-provisioning has an inherent metadata
     overhead, for maintaining the thin volume mappings, which significantly
     degrades performance.

   Moreover, cloning a device shouldn't force the use of thin-provisioning. On
   the other hand, if we wish to use thin provisioning, we can just use a thin
   LV as dm-clone's destination device.