=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths.  The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but no simple pattern that would
allow a compact representation of the mapping, such as the one
dm-stripe relies on.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture.  In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters.  When a LUN
is created, it is spread across multiple members.  The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used.  When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member.  Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member, the I/O will be forwarded as required.
This forwarding is invisible to the initiator.  The storage layout is
also dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators.  In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth.  An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets.  However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths.  We also build a
non-preferred priority group containing paths to other array members,
to be used for failover.

The upper tier consists of a single dm-switch device.  This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O.  By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us).  This is a much denser representation than the dm table
b-tree can achieve.
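
The density argument can be illustrated with a small sketch.  The Python
below is not the kernel implementation; it is a minimal model that assumes
entries are rounded up to 1, 2, 4 or 8 bits so that none straddles a byte.
The names ``PackedRegionTable`` and ``path_for_sector`` are hypothetical.

```python
class PackedRegionTable:
    """Illustrative region -> path table packed into a bytearray."""

    def __init__(self, num_regions, num_paths):
        if not 1 <= num_paths <= 256:
            raise ValueError("this sketch supports 1..256 paths")
        # Smallest bit width able to hold path numbers 0 .. num_paths - 1.
        bits = 1
        while (1 << bits) < num_paths:
            bits += 1
        # Assumption of this sketch: round up to 1, 2, 4 or 8 bits so that
        # an entry never crosses a byte boundary.
        while 8 % bits != 0:
            bits += 1
        self.bits = bits
        self.per_byte = 8 // bits
        self.mask = (1 << bits) - 1
        self.data = bytearray((num_regions + self.per_byte - 1) // self.per_byte)

    def set(self, region, path):
        byte, slot = divmod(region, self.per_byte)
        shift = slot * self.bits
        cleared = self.data[byte] & ~(self.mask << shift) & 0xFF
        self.data[byte] = cleared | (path << shift)

    def get(self, region):
        byte, slot = divmod(region, self.per_byte)
        return (self.data[byte] >> (slot * self.bits)) & self.mask


def path_for_sector(table, sector, region_size):
    """Pick the path for a sector; region_size is in 512-byte sectors."""
    return table.get(sector // region_size)
```

Under these assumptions, a 16 member group needs 4 bits per region, so a
table covering one million regions occupies 500,000 bytes.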

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
	<num_paths>
	    The number of paths across which to distribute the I/O.

	<region_size>
	    The number of 512-byte sectors in a region. Each region can be redirected
	    to any of the available paths.

	<num_optional_args>
	    The number of optional arguments. Currently, no optional arguments
	    are supported and so this must be zero.

	<dev_path>
	    The block device that represents a specific path to the device.

	<offset>
	    The offset of the start of data on the specific <dev_path> (in units
	    of 512-byte sectors). This number is added to the sector number when
	    forwarding the request to the specific path. Typically it is zero.
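
As a rough illustration of how these parameters fit together, the sketch
below assembles a table line in Python.  The helper name ``switch_table``
is hypothetical, and the leading ``<start> <length> switch`` fields are the
usual device-mapper table prefix rather than switch-specific parameters.

```python
def switch_table(length_sectors, region_size, devices, start=0):
    """Build a dm table line for the switch target.

    devices is a list of (dev_path, offset) pairs; lengths, region sizes
    and offsets are all in 512-byte sectors.
    """
    fields = [
        str(start),
        str(length_sectors),
        "switch",
        str(len(devices)),   # <num_paths>
        str(region_size),    # <region_size>
        "0",                 # <num_optional_args>, must currently be zero
    ]
    for dev_path, offset in devices:
        fields += [dev_path, str(offset)]
    return " ".join(fields)


# Example values are assumptions: a 1 GiB device and 64 KiB regions.
line = switch_table(
    2097152,
    128,
    [("/dev/vg1/switch0", 0), ("/dev/vg1/switch1", 0), ("/dev/vg1/switch2", 0)],
)
```

The resulting string could then be passed to ``dmsetup create --table``.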

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (the region size was specified in the constructor
    parameters). If <index> is omitted, the next region (previous index + 1)
    is used. Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers (WITHOUT any prefix like 0x). The last <n>
    mappings are repeated in the next <m> slots.
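
The argument syntax can be modelled with a short Python sketch that expands
a message into explicit (region, path) pairs.  This is an illustration of
the rules described above, not the kernel parser; ``expand_mappings`` is a
hypothetical helper, and rejecting an omitted index on the first argument
is a choice made by this sketch.

```python
def expand_mappings(args):
    """Expand set_region_mappings arguments into (region, path) pairs."""
    result = []
    next_index = None
    for arg in args:
        if arg.startswith("R"):
            # R<n>,<m>: repeat the last n mappings over the next m slots.
            n, m = (int(x, 16) for x in arg[1:].split(","))
            if n == 0 or n > len(result):
                raise ValueError("not enough previous mappings for " + arg)
            pattern = [path for _, path in result[-n:]]
            for i in range(m):
                result.append((next_index, pattern[i % n]))
                next_index += 1
        else:
            index, path = arg.split(":")
            if index:
                next_index = int(index, 16)   # hexadecimal, no 0x prefix
            elif next_index is None:
                raise ValueError("first mapping needs an explicit index")
            result.append((next_index, int(path, 16)))
            next_index += 1
    return result


pairs = expand_mappings("1000:1 :2 R2,10".split())
```

Here ``R2,10`` repeats the last 2 mappings over the next 0x10 (16) slots, so
``pairs`` holds 18 entries covering regions 0x1000 through 0x1011.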

Status
======

No status line is reported.

Example
=======

Assume that you have three volumes of the same size: vg1/switch0,
vg1/switch1 and vg1/switch2.

Create a switch device with a 64kB region size (128 sectors of 512 bytes)::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` \
	switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping. This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
	:1 :2 :1 :2 :1 :2 :1 :2 :1 :2