=================
Thin provisioning
=================

Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume.  This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...).  The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth).  This new
implementation uses a single data structure to avoid this degradation
with depth.  Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are considered safe for production use.  But different use
cases will have different performance characteristics, for example due
to fragmentation of the data volume.

If you find this software is not performing as expected, please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata have been fully
developed and are available as 'thin_check' and 'thin_repair'.  The name
of the package that provides these utilities varies by distribution (on
a Red Hat distribution it is named 'device-mapper-persistent-data').

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly.  End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device and a
data device.  If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata::

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots).  If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller.  If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.

The largest size supported is 16GB: if the device is larger,
a warning will be issued and the excess space will not be used.
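
For example, a small shell sketch of this rule of thumb (the variable
names are placeholders; $data_block_size is in 512-byte sectors, as
elsewhere in this document)::

    data_dev_bytes=$(blockdev --getsize64 "$data_dev")
    block_bytes=$(( data_block_size * 512 ))
    meta_bytes=$(( 48 * data_dev_bytes / block_bytes ))   # ~48 bytes per data block
    min_bytes=$(( 2 * 1024 * 1024 ))                      # 2MB floor
    max_bytes=$(( 16 * 1024 * 1024 * 1024 ))              # 16GB ceiling (see above)
    [ "$meta_bytes" -lt "$min_bytes" ] && meta_bytes=$min_bytes
    [ "$meta_bytes" -gt "$max_bytes" ] && meta_bytes=$max_bytes
    echo "suggested metadata device size: $meta_bytes bytes"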

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space.  (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
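
As a rough sketch (the pool name and the shell variables are placeholders,
$new_data_dev_size is the enlarged data device size in 512-byte sectors,
and the table format is described in the next section), growing the pool
after extending the underlying data device looks like::

    dmsetup suspend pool
    dmsetup reload pool \
        --table "0 $new_data_dev_size thin-pool $metadata_dev $data_dev \
                 $data_block_size $low_water_mark"
    dmsetup resume pool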

Using an existing pool device
-----------------------------

::

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
                 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.
$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
multiple of 128 (64KB).  $data_block_size cannot be changed after the
thin-pool is created.  People primarily interested in thin provisioning
may want to use a value such as 1024 (512KB).  People doing lots of
snapshotting may want a smaller value such as 128 (64KB).  If you are
not zeroing newly-allocated data, a larger $data_block_size in the
region of 256000 (128MB) is suggested.

$low_water_mark is expressed in blocks of size $data_block_size.  If
free space on the data device drops below this level then a dm event
will be triggered, which a userspace daemon should catch, allowing it to
extend the pool device.  Only one such event will be sent.

No special event is triggered if a just-resumed device's free space is below
the low water mark.  However, resuming a device always triggers an
event; a userspace daemon should verify that free space exceeds the low
water mark when handling this event.

A low water mark for the metadata device is maintained in the kernel and
will trigger a dm event if free space on the metadata device drops below
it.
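
A minimal sketch of such a daemon loop using only dmsetup follows; the
pool name is assumed to be 'pool', and a production setup would normally
use a volume manager's own monitoring (e.g. dmeventd) instead::

    event_nr=0
    while :; do
        dmsetup wait pool "$event_nr"    # block until the next dm event
        event_nr=$(dmsetup info -c --noheadings -o events pool)
        dmsetup status pool              # check used/total data and metadata blocks;
                                         # if a low water mark has been crossed, extend
                                         # the relevant device and reload the pool table
    done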

Updating on-disk metadata
-------------------------

On-disk metadata is committed every time a FLUSH or FUA bio is written.
If no such requests are made then commits will occur every second.  This
means the thin-provisioning target behaves like a physical disk that has
a volatile write cache.  If power is lost you may lose some recent
writes.  The metadata should always be consistent in spite of any crash.

If data space is exhausted the pool will either error or queue IO
according to the configuration (see: error_if_no_space).  If metadata
space is exhausted or a metadata operation fails, the pool will error IO
until the pool is taken offline and repair is performed to 1) fix any
potential inconsistencies and 2) clear the flag that imposes repair.
Once the pool's metadata device is repaired it may be resized, which
will allow the pool to return to normal operation.  Note that if a pool
is flagged as needing repair, the pool's data and metadata devices
cannot be resized until repair is performed.  It should also be noted
that when the pool's metadata space is exhausted the current metadata
transaction is aborted.  Given that the pool will cache IO whose
completion may have already been acknowledged to upper IO layers
(e.g. the filesystem) it is strongly suggested that consistency checks
(e.g. fsck) be performed on those layers when repair of the pool is
required.
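
As an illustration only (device names are placeholders), an offline check
and repair with the userspace tools mentioned earlier might look like::

    dmsetup remove thin                # deactivate every thin device first
    dmsetup remove pool                # then the pool itself
    thin_check $metadata_dev           # report any inconsistencies
    thin_repair -i $metadata_dev -o $new_metadata_dev
    # re-create the pool on the repaired (and, if necessary, larger) metadata
    # device, and run fsck on any filesystems stacked on top as suggested above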

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume you must send a message to an
  active pool device, /dev/mapper/pool in this example::

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

  Here '0' is an identifier for the volume, a 24-bit number.  It's up
  to the caller to allocate and manage these identifiers.  If the
  identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated using the 'thin' target::

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

  The last parameter is the identifier for the thinp device.

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message to the pool.

  N.B.  If the origin device that you wish to snapshot is active, you
  must suspend it before creating the snapshot to avoid corruption.
  This is NOT enforced at the moment, so please be careful!

  ::

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a 24-bit number.  '0' is the
  identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry about any connection
  between the origin and the snapshot.  Indeed the snapshot is no
  different from any other thinly-provisioned device and can be
  snapshotted itself via the same method.  It's perfectly legal to
  have only one of them active, and there's no ordering requirement on
  activating or removing them both.  (This differs from conventional
  device-mapper snapshots.)

  Activate it exactly the same way as any other thinly-provisioned volume::

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

External snapshots
------------------

You can use an external **read only** device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.
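
One simple precaution (not enforced by the kernel) is to mark the origin
read-only before using it as an external origin::

    blockdev --setro /dev/image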

i) Creating a snapshot of an external device.

  This is the same as creating a thin device.
  You don't mention the origin at this stage.

  ::

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

  Append an extra parameter to the thin target specifying the origin::

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

  N.B. All descendants (internal snapshots) of this snapshot require the
  same extra origin parameter.
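
  For illustration, an internal snapshot of this device and its activation
  therefore look like this (device names are placeholders)::

    dmsetup suspend snap
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume snap
    dmsetup create snap2 --table "0 2097152 thin /dev/mapper/pool 1 /dev/image"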

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

::

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    ::

      thin-pool <metadata dev> <data dev> <data block size (sectors)> \
                <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

      skip_block_zeroing:
        Skip the zeroing of newly-provisioned blocks.

      ignore_discard:
        Disable discard support.

      no_discard_passdown:
        Don't pass discards down to the underlying
        data device, but just remove the mapping.

      read_only:
        Don't allow any changes to be made to the pool
        metadata.  This mode is only available after the
        thin-pool has been created and first used in full
        read/write mode.  It cannot be specified on initial
        thin-pool creation.

      error_if_no_space:
        Error IOs, instead of queueing, if no space.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
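
    For example, a pool table line passing two feature arguments (the
    argument count precedes the arguments; devices and sizes are as in the
    cookbook above) might be::

      0 20971520 thin-pool $metadata_dev $data_dev 128 32768 \
                 2 skip_block_zeroing no_discard_passdown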

ii) Status

    ::

      <transaction id> <used metadata blocks>/<total metadata blocks>
      <used data blocks>/<total data blocks> <held metadata root>
      ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
      needs_check|- metadata_low_watermark

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water mark,
        a dm event will be sent to userspace.  This event is edge-triggered
        and it will occur only once after each resume, so volume manager
        writers should register for the event and then check the target's
        status.

    held metadata root:
        The location, in blocks, of the metadata root that has been
        'held' for userspace read access.  '-' indicates there is no
        held root.

    discard_passdown|no_discard_passdown:
        Whether or not discards are actually being passed down to the
        underlying device.  If this is enabled when loading the table,
        it can get disabled if the underlying device doesn't support it.

    ro|rw|out_of_data_space:
        If the pool encounters certain types of device failures it will
        drop into a read-only metadata mode in which no changes to
        the pool metadata (like allocating new blocks) are permitted.

        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'.  The userspace recovery tools
        should then be used.

    error_if_no_space|queue_if_no_space:
        If the pool runs out of data or metadata space, the pool will
        either queue or error the IO destined to the data device.  The
        default is to queue the IO until more space is added or the
        'no_space_timeout' expires.  The 'no_space_timeout' dm-thin-pool
        module parameter can be used to change this timeout -- it
        defaults to 60 seconds but may be disabled using a value of 0.

    needs_check:
        A metadata operation has failed, resulting in the needs_check
        flag being set in the metadata's superblock.  The metadata
        device must be deactivated and checked/repaired before the
        thin-pool can be made fully operational again.  '-' indicates
        needs_check is not set.

    metadata_low_watermark:
        Value of the metadata low watermark in blocks.  The kernel sets this
        value internally but userspace needs to know this value to
        determine if an event was caused by crossing this threshold.
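
    As a purely illustrative example, 'dmsetup status' for a healthy pool
    built with the table shown above might report::

      0 20971520 thin-pool 1 164/4096 8192/163840 - rw discard_passdown queue_if_no_space - 1024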

iii) Messages

    create_thin <dev id>
        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>
        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>
        Deletes a thin device.  Irreversible.

    set_transaction_id <current id> <new id>
        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata of the
        pool target.  The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line.  To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.

    reserve_metadata_snap
        Reserve a copy of the data mapping btree for use by userland.
        This allows userland to inspect the mappings as they were when
        this message was executed.  Use the pool's status command to
        get the root block associated with the metadata snapshot.

    release_metadata_snap
        Release a previously reserved copy of the data mapping btree.
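
    All of these messages are sent to an active pool with 'dmsetup message',
    as in the cookbook above.  For example (ids and values are illustrative)::

      dmsetup message /dev/mapper/pool 0 "set_transaction_id 1 2"
      dmsetup message /dev/mapper/pool 0 reserve_metadata_snap
      dmsetup status /dev/mapper/pool     # now also reports the held metadata root
      thin_dump --metadata-snap $metadata_dev   # e.g. inspect the mappings from userland
      dmsetup message /dev/mapper/pool 0 release_metadata_snap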

'thin' target
-------------

i) Constructor

    ::

        thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices.  If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end.  If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
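
For example, to grow a thin device you simply reload its table with a larger
length (the numbers and the device name are illustrative); the extra blocks
are provisioned on demand::

    dmsetup suspend thin
    dmsetup reload thin --table "0 4194304 thin /dev/mapper/pool 0"
    dmsetup resume thin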

ii) Status

    <nr mapped sectors> <highest mapped sector>
        If the pool has encountered device errors and failed, the status
        will just contain the string 'Fail'.  The userspace recovery
        tools should then be used.

    In the case where <nr mapped sectors> is 0, there is no highest
    mapped sector and the value of <highest mapped sector> is unspecified.