driver-api/md/md-cluster.rst

5 The cluster MD is a shared-device RAID for a cluster; it supports
9 1. On-disk format
12 Separate write-intent-bitmaps are used for each cluster node.
14 and may not yet have finished. The on-disk layout is::
17   -------------------------------------------------------------------
26  - set the appropriate bit (if not already set)
27  - commit the write to all mirrors
28  - schedule the bit to be cleared after a timeout.
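
The three write steps above can be modelled with a small in-memory sketch. This is an illustration only, not the md implementation: the class name ``WriteIntentBitmap``, the ``chunk_sectors`` and ``clear_delay`` parameters, and the use of plain dicts as "mirrors" are all assumptions made for the example.

```python
import time


class WriteIntentBitmap:
    """Toy model of one node's write-intent bitmap (hypothetical, not md code)."""

    def __init__(self, chunk_sectors=1024, clear_delay=5.0):
        self.chunk_sectors = chunk_sectors  # sectors covered by one bit (assumed)
        self.clear_delay = clear_delay      # seconds before a bit may be cleared (assumed)
        self.dirty = set()                  # chunk numbers whose bit is set
        self.clear_at = {}                  # chunk number -> earliest clear time

    def write(self, sector, data, mirrors, now=None):
        now = time.monotonic() if now is None else now
        chunk = sector // self.chunk_sectors
        # Step 1: set the appropriate bit (adding to a set is a no-op if already set).
        self.dirty.add(chunk)
        # Step 2: commit the write to all mirrors (each mirror is a dict here).
        for mirror in mirrors:
            mirror[sector] = data
        # Step 3: schedule the bit to be cleared after a timeout.
        self.clear_at[chunk] = now + self.clear_delay

    def expire(self, now=None):
        """Clear every bit whose timeout has passed."""
        now = time.monotonic() if now is None else now
        for chunk, deadline in list(self.clear_at.items()):
            if now >= deadline:
                self.dirty.discard(chunk)
                del self.clear_at[chunk]
```

A later write to the same chunk simply pushes the clearing deadline back, so a busy region keeps its bit set without being rewritten for every request.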
35 2. DLM Locks for management
38 There are three groups of locks for managing the device:
41 -------------------------------------
44  the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a
52  The LVB of the bitmap lock for a particular node records the range
53  of sectors that are being re-synced by that node.  No other
58 -------------------------
61  resync, and for metadata superblock updates.  This communication is
62  managed through three locks: "token", "message", and "ack", together
65 2.3 new-device management
66 -------------------------
68  A single lock, "no-new-dev", is used to coordinate the addition of
69  new devices; this must be synchronized across the array.
70  Normally all nodes hold a concurrent-read lock on this lock resource.
75  Messages can be broadcast to all nodes, and the sender waits for all
80 -----------------
88    been updated, and the node must re-read the md superblock. This is
99    time per-node.
105    the array. Message contains an identifier for that device.  See
106    below for further details.
112    array. The slot-number of the device is included in the message.
116    A failed device is being re-activated - the assumption
126 ---------------------------
129  are three resources used for the purpose:
141 3.2.3 ack
151  1. receive status - all nodes have concurrent-reader lock on "ack"::
154 	"ack":CR                       "ack":CR                 "ack":CR
160 	"token":EX                    "ack":CR                 "ack":CR
162 	"ack":CR
165     received or other events that happened while waiting for the
170     sender down-converts "message" from EX to CW
172     sender tries to get EX of "ack"
176       [ wait until all receivers have *processed* the "message" ]
178                                        [ triggered by bast of "ack" ]
182                                        [ wait finish ]
183                                        receiver releases "ack"
189      "ack":EX
191  4. triggered by grant of EX on "ack" (indicating all receivers
194     sender down-converts "ack" from EX to CR
203                                  receiver gets CR of "ack"
207      "ack":CR                    "ack":CR                   "ack":CR
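
The handshake above relies on the standard DLM lock-mode compatibility rules. The sketch below encodes that matrix so the key transitions can be checked by hand. The helper name ``can_grant`` is hypothetical, and the model ignores queueing, conversion ordering and blocking callbacks (basts).

```python
# Standard DLM lock modes, weakest to strongest.
MODES = ("NL", "CR", "CW", "PR", "PW", "EX")

# COMPATIBLE[held] is the set of modes that may be granted to another
# holder while `held` is held on the same resource.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def can_grant(requested, held_by_others):
    """Return True if `requested` is compatible with every mode already held."""
    return all(requested in COMPATIBLE[held] for held in held_by_others)
```

For example, ``can_grant("EX", ["CR", "CR"])`` is false, which is why the sender's EX request on "ack" completes only after every receiver has released its CR lock. ``can_grant("CR", ["CW"])`` is true, so a CR request on "message" can be granted once the sender has down-converted it from EX to CW.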
214 ----------------
220 	- acquires the bitmap<number> lock of the failed node
221 	- opens the bitmap
222 	- reads the bitmap of the failed node
223 	- copies the set bits to the local node
224 	- cleans the bitmap of the failed node
225 	- releases bitmap<number> lock of the failed node
226 	- initiates resync of the bitmap on the current node
228 	  then md_check_recovery -> metadata_update_start/finish,
244  A helper function, ->area_resyncing(), can be used to check if a
256 ----------------------
258  For adding a new device, it is necessary that all nodes "see" the new
259  device to be added. For this, the following algorithm is used:
261    1.  Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
266    4.  In userspace, the node searches for the disk, perhaps
267        using blkid -t SUB_UUID=""
273    6.  Other nodes drop lock on "no-new-dev" (CR) if device is found
274    7.  Node 1 attempts EX lock on "no-new-dev"
277    9.  If it does not get the "no-new-dev" lock, it fails the operation and sends
285  There are 17 call-backs which the md core can make to the cluster
290 ---------------------------
298 -----------------
301  Range is from 0 to nodes-1.
304 ------------------------
308  end point is always the end of the array.
312 -----------------------------------
323 -------------------------------------------------------------------------------
332 --------------------
338  then the caller will avoid writing or read-balancing in that
342  all areas are resyncing for READ requests.  This avoids races
343  between the cluster-filesystem and the cluster-RAID handling
347 ---------------------------------------------------------------
349  These are used to manage the new-disk protocol described above.
359 -----------------
365 --------------------
369  bitmap is then used to recover the re-added device.
372 ------------------------------------------------
385 - change array_sectors.