****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an rtrs
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server. Those are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.

When processing an incoming write or read request, the rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the
server. The server uses the immediate field to tell the client which request
is being acknowledged and to carry the errno.
The client uses the immediate field to tell the server
which of the memory chunks has been accessed and at which offset the message
can be found.

Module parameter always_invalidate is introduced for the security problem
discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we
invalidate each rdma buffer before we hand it over to RNBD server and
then pass it to the block layer. A new rkey is generated and registered for the
buffer after it returns back from the block layer and RNBD server.
The new rkey is sent back to the client along with the IO result.
The procedure is the default behaviour of the driver. This invalidation and
registration on each IO causes a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N), if they understand and can take the risk of a malicious
client being able to corrupt memory of a server it is connected to. This might
be a reasonable option in a scenario where all the clients and all the servers
are located within a secure datacenter.


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include uuid of the session and uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, total number of connections per session
(as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve the situations where
client is trying to reconnect a path, while server is still destroying the old
one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CON_RSP messages to the rdma_accept.
Apart from magic and
protocol version, the messages include error code, queue depth supported by
the server (number of memory chunks which are going to be allocated for that
session) and the maximum size of one io. The RTRS_MSG_NEW_RKEY_F flag is set
when always_invalidate=Y.

3. After all connections of a path are established client sends to server the
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field) which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                  <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                  <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                  -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK] <-------------------
[RTRS_HB_MSG_IMM] <-------------------
                  -------------------> [RTRS_HB_MSG_ACK]

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Write (always_invalidate=Y) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk; when the invalidation finishes, it passes the IO to the RNBD
server module.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code. The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.
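
The way a single 32 bit immediate field can carry both the id of the inflight IO and the error code, as described in the confirmation steps above, can be sketched in C. This is only an illustration under stated assumptions: the split into 4 type bits and 28 payload bits, the 9/19 bit split between errno and id, the numeric type values and all `sketch_` names are hypothetical and are not taken from this document.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (illustration only): [ 4 bit type | 9 bit errno | 19 bit id ] */
#define SKETCH_IMM_PAYL_BITS 28u
#define SKETCH_IMM_PAYL_MASK ((1u << SKETCH_IMM_PAYL_BITS) - 1u)
#define SKETCH_ID_BITS       19u
#define SKETCH_ID_MASK       ((1u << SKETCH_ID_BITS) - 1u)
#define SKETCH_ERRNO_BITS    9u
#define SKETCH_ERRNO_MASK    ((1u << SKETCH_ERRNO_BITS) - 1u)

/* Hypothetical type values; the real numbering may differ. */
enum sketch_imm_type {
	SKETCH_IO_REQ_IMM = 0,
	SKETCH_IO_RSP_IMM = 1,
};

/* Decoded view of an IO response immediate. */
struct sketch_io_rsp {
	uint32_t type; /* one of enum sketch_imm_type */
	uint32_t id;   /* id of the inflight IO being acknowledged */
	int err;       /* 0 on success, negative errno otherwise */
};

/* Server side: pack the IO id and a (negative) errno into one immediate. */
static uint32_t sketch_pack_io_rsp(uint32_t id, int err)
{
	uint32_t abs_err = (uint32_t)(err < 0 ? -err : err);

	/* Both values must fit into their assumed bit ranges. */
	assert(id <= SKETCH_ID_MASK);
	assert(abs_err <= SKETCH_ERRNO_MASK);

	uint32_t payload = (abs_err << SKETCH_ID_BITS) | id;

	return ((uint32_t)SKETCH_IO_RSP_IMM << SKETCH_IMM_PAYL_BITS) | payload;
}

/* Client side: recover type, IO id and errno from a received immediate. */
static struct sketch_io_rsp sketch_unpack_io_rsp(uint32_t imm)
{
	struct sketch_io_rsp rsp;
	uint32_t payload = imm & SKETCH_IMM_PAYL_MASK;

	rsp.type = imm >> SKETCH_IMM_PAYL_BITS;
	rsp.id = payload & SKETCH_ID_MASK;
	rsp.err = -(int)(payload >> SKETCH_ID_BITS);
	return rsp;
}
```

In this sketch, packing id 42 together with error -5 and unpacking the result gives back the response type, id 42 and error -5. Keeping the whole acknowledgement inside the immediate is what allows the confirming rdma message to be "empty".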

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

* Read (always_invalidate=Y) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk; when
the invalidation finishes, it passes the IO to the RNBD server module.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.
The new rkey is sent back using a
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO and then posts
the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>