=========
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but no simple pattern, such as the
one dm-stripe uses, that would allow for a compact representation of the
mapping.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters. When a LUN
is created it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an eth port on a single member.
Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple
round-robin algorithm to send I/O across all paths and let the storage
array members forward it as necessary, but there is a performance
advantage to sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets. However, in this architecture the LUN is
spread with an address region size on the order of tens of MB, which
means the resulting table could have more than a million entries and
consume far too much memory.
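
As a rough illustration of the scale involved, the shell arithmetic below
counts how many entries a plain device-mapper table would need if every
address region had its own table line. The 16 TiB LUN and 16 MiB region
size are assumed figures picked for the example, not values taken from any
particular array, and ``region_count`` is a hypothetical helper.

```shell
#!/bin/sh
# region_count LUN_BYTES REGION_BYTES
# Hypothetical helper: one table entry would be needed per address region.
region_count() {
    echo $(( $1 / $2 ))
}

# Assumed sizes, for illustration only.
lun_bytes=$(( 16 * 1024 * 1024 * 1024 * 1024 ))   # 16 TiB
region_bytes=$(( 16 * 1024 * 1024 ))              # 16 MiB

region_count "$lun_bytes" "$region_bytes"         # prints 1048576
```

With these assumed sizes the count already exceeds one million, and
smaller regions or larger LUNs push it higher.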

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier - Determine which array member the I/O should be sent to.
    Lower Tier - Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover reasons.

The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+
        <num_paths>
            The number of paths across which to distribute the I/O.

        <region_size>
            The number of 512-byte sectors in a region. Each region can be redirected
            to any of the available paths.

        <num_optional_args>
            The number of optional arguments. Currently, no optional arguments
            are supported and so this must be zero.

        <dev_path>
            The block device that represents a specific path to the device.

        <offset>
            The offset of the start of data on the specific <dev_path> (in units
            of 512-byte sectors). This number is added to the sector number when
            forwarding the request to the specific path. Typically it is zero.

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

R<n>,<m>
    This parameter allows repetitive patterns to be loaded quickly. <n> and <m>
    are hexadecimal numbers. The last <n> mappings are repeated in the next <m>
    slots.

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2,
all of the same size.

Create a switch device with a 64 KiB region size (128 sectors of 512 bytes)::

    dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1::

    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1

Set repetitive mapping. This command::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10

is equivalent to::

    dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \
        :1 :2 :1 :2 :1 :2 :1 :2 :1 :2
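
The hexadecimal conventions used by set_region_mappings are easy to get
wrong by hand, so the following sketch shows one way to derive the
arguments in plain shell. Both helpers, ``region_index_hex`` and
``expand_repeat``, are hypothetical names introduced here and are not part
of dmsetup. The script only prints strings and never touches a device.

```shell
#!/bin/sh
# region_index_hex SECTOR REGION_SECTORS
# Hypothetical helper: number of the region containing SECTOR, printed in
# bare hexadecimal (no 0x prefix), as <index> expects.
region_index_hex() {
    printf '%x\n' $(( $1 / $2 ))
}

# expand_repeat M_HEX PATH_NR...
# Hypothetical helper: print the mappings that R<n>,<m> stands for, where
# the PATH_NR arguments are the <n> most recent path numbers and M_HEX is
# <m> in hexadecimal.
expand_repeat() {
    m=$(( 0x$1 ))
    shift
    n=$#
    out=""
    i=0
    while [ "$i" -lt "$m" ]; do
        idx=$(( i % n + 1 ))          # 1-based position within the pattern
        eval "p=\${$idx}"
        out="$out${out:+ }:$p"
        i=$(( i + 1 ))
    done
    echo "$out"
}

region_index_hex 524288 128    # sector 524288, 128-sector regions: 1000
expand_repeat 10 1 2           # sixteen mappings, :1 :2 repeated eight times
```

Appending the second output line to ``1000:1 :2`` gives the long form of
the R2,10 command shown in the example.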