.. SPDX-License-Identifier: GPL-2.0

=============================
Kernel Connection Multiplexor
=============================

Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.

KCM implements an NxM multiplexor in the kernel as diagrammed below::

    +------------+   +------------+   +------------+   +------------+
    | KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
    +------------+   +------------+   +------------+   +------------+
          |                |                |                |
          +----------+     |                |     +----------+
                     |     |                |     |
                  +----------------------------------+
                  |           Multiplexor            |
                  +----------------------------------+
                   |   |             |             | |
         +---------+   |             |             | +-----------+
         |             |             |             |             |
    +----------+  +----------+  +----------+  +----------+  +----------+
    |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |  |  Psock   |
    +----------+  +----------+  +----------+  +----------+  +----------+
         |             |             |             |             |
    +----------+  +----------+  +----------+  +----------+  +----------+
    | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |  | TCP sock |
    +----------+  +----------+  +----------+  +----------+  +----------+

KCM sockets
===========

The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.

Multiplexor
===========

The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.

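The receive-path behaviour described above (bytes accumulate on a TCP stream
and only complete messages are passed on) can be modelled in userspace. The
following is a minimal sketch and not the kernel implementation. It assumes
each message is framed by a 4-byte big-endian length prefix, and the names
``frame_payload_len``, ``next_message_len`` and ``count_complete_messages``
are hypothetical helpers invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace illustration only; KCM does this work in the kernel.
 * Assumed framing: a 4-byte big-endian length, then that many payload bytes.
 */
#define FRAME_HDR_LEN 4

/* Read the 4-byte big-endian length field at the start of a frame. */
uint32_t frame_payload_len(const unsigned char *buf)
{
	return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

/*
 * Return the total size (header plus payload) of the first complete
 * message in buf, or 0 if more bytes are needed.
 */
size_t next_message_len(const unsigned char *buf, size_t avail)
{
	uint32_t payload;

	if (avail < FRAME_HDR_LEN)
		return 0;
	payload = frame_payload_len(buf);
	/* Compare without adding, so the sum cannot overflow size_t. */
	if (payload > avail - FRAME_HDR_LEN)
		return 0;
	return (size_t)payload + FRAME_HDR_LEN;
}

/*
 * Count the complete messages at the front of a stream buffer and
 * report in *consumed how many bytes they occupy. Any trailing
 * partial message is left for a later call.
 */
unsigned int count_complete_messages(const unsigned char *buf, size_t avail,
				     size_t *consumed)
{
	unsigned int n = 0;
	size_t off = 0, len;

	assert(buf != NULL && consumed != NULL);

	while ((len = next_message_len(buf + off, avail - off)) != 0) {
		off += len;
		n++;
	}
	*consumed = off;
	return n;
}
```

In this model a partial trailing message is never counted, which mirrors the
rule that only complete messages are steered to a KCM socket.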

TCP sockets & Psocks
====================

TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket; this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.

Connected mode semantics
========================

Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
transmitting. The normal send and recv calls (including sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.

Socket types
============

KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.

Message delineation
-------------------

Messages are sent over a TCP stream with some application protocol message
format that typically includes a header which frames the messages. The length
of a received message can be deduced from the application protocol header
(often just a simple length field).

A TCP stream must be parsed to determine message boundaries. Berkeley Packet
Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
BPF program must be specified. The program is called at the start of receiving
a new message and is given an skbuff that contains the bytes received so far.
It parses the message header and returns the length of the message. Given this
information, KCM will construct the message of the stated length and deliver it
to a KCM socket.

TCP socket management
---------------------

When a TCP socket is attached to a KCM multiplexor, data ready (POLLIN) and
write space available (POLLOUT) events are handled by the multiplexor. If there
is a state change (disconnection) or other error on a TCP socket, an error is
posted on the TCP socket so that a POLLERR event happens and KCM discontinues
using the socket.
When the application gets the error notification for a
TCP socket, it should unattach the socket from KCM and then handle the error
condition (the typical response is to close the socket and create a new
connection if necessary).

KCM limits the maximum receive message size to be the size of the receive
socket buffer on the attached TCP socket (the socket buffer size can be set by
SO_RCVBUF). If the length of a new message reported by the BPF program is
greater than this limit, a corresponding error (EMSGSIZE) is posted on the TCP
socket. The BPF program may also enforce a maximum message size and report an
error when it is exceeded.

A timeout may be set for assembling messages on a receive socket. The timeout
value is taken from the receive timeout of the attached TCP socket (this is set
by SO_RCVTIMEO). If the timer expires before assembly is complete, an error
(ETIMEDOUT) is posted on the socket.

User interface
==============

Creating a multiplexor
----------------------

A new multiplexor and initial KCM socket are created by a socket call::

  socket(AF_KCM, type, protocol)

- type is either SOCK_DGRAM or SOCK_SEQPACKET
- protocol is KCMPROTO_CONNECTED

Cloning KCM sockets
-------------------

After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
a KCM socket. This is accomplished by an ioctl on a KCM socket::

  /* From linux/kcm.h */
  struct kcm_clone {
        int fd;
  };

  struct kcm_clone info;

  memset(&info, 0, sizeof(info));

  err = ioctl(kcmfd, SIOCKCMCLONE, &info);

  if (!err)
    newkcmfd = info.fd;

Attach transport sockets
------------------------

Attaching of transport sockets to a multiplexor is performed by calling an
ioctl on a KCM socket for the multiplexor.
e.g.::

  /* From linux/kcm.h */
  struct kcm_attach {
        int fd;
        int bpf_fd;
  };

  struct kcm_attach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;
  info.bpf_fd = bpf_prog_fd;

  ioctl(kcmfd, SIOCKCMATTACH, &info);

The kcm_attach structure contains:

  - fd: file descriptor for the TCP socket being attached
  - bpf_fd: file descriptor for the compiled BPF program that was loaded

Unattach transport sockets
--------------------------

Unattaching a transport socket from a multiplexor is straightforward. An
"unattach" ioctl is done with the kcm_unattach structure as the argument::

  /* From linux/kcm.h */
  struct kcm_unattach {
        int fd;
  };

  struct kcm_unattach info;

  memset(&info, 0, sizeof(info));

  info.fd = tcpfd;	/* TCP socket to unattach */

  ioctl(kcmfd, SIOCKCMUNATTACH, &info);

Disabling receive on KCM socket
-------------------------------

A setsockopt is used to disable or enable receiving on a KCM socket.
When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
while. Example use::

  int val = 1;

  setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val));

BPF programs for message delineation
------------------------------------

BPF programs can be compiled using the BPF LLVM backend. For example,
the BPF program for parsing Thrift is::

  #include "bpf.h" /* for __sk_buff */
  #include "bpf_helpers.h" /* for load_word intrinsic */

  SEC("socket_kcm")
  int bpf_prog1(struct __sk_buff *skb)
  {
       /* Length field at offset 0, plus 4 bytes for the field itself */
       return load_word(skb, 0) + 4;
  }

  char _license[] SEC("license") = "GPL";

Use in applications
===================

KCM accelerates application layer protocols.
Specifically, it allows
applications to use a message based interface for sending and receiving
messages. The kernel provides necessary assurances that messages are sent
and received atomically. This relieves much of the burden applications have
in mapping a message based protocol onto the TCP stream. KCM also makes
application layer messages a unit of work in the kernel for the purposes of
steering and scheduling, which in turn allows a simpler networking model in
multithreaded applications.

Configurations
--------------

In an Nx1 configuration, KCM logically provides multiple socket handles
to the same TCP connection. This allows parallelism between I/O
operations on the TCP socket (for instance copyin and copyout of data is
parallelized). In an application, a KCM socket can be opened for each
processing thread and added to an epoll set (similar to how SO_REUSEPORT
is used to allow multiple listener sockets on the same port).

In an MxN configuration, multiple connections are established to the
same destination. These are used for simple load balancing.

Message batching
----------------

The primary purpose of KCM is load balancing between KCM sockets and hence
threads in a nominal use case. Perfect load balancing, that is steering
each received message to a different KCM socket or steering each sent
message to a different TCP socket, can negatively impact performance
since this doesn't allow for affinities to be established. Balancing
based on groups, or batches of messages, can be beneficial for performance.

On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.

  1) Send multiple messages in a single sendmmsg.
  2) Send a group of messages each with a sendmsg call, where all messages
     except the last have MSG_BATCH in the flags of the sendmsg call.
  3) Create a "super message" composed of multiple messages and send this
     with a single sendmsg.

On receive, the KCM module attempts to queue messages received on the
same KCM socket during each TCP ready callback. The targeted KCM socket
changes at each receive ready callback on the KCM socket. The application
does not need to configure this.

Error handling
--------------

An application should include a thread to monitor errors raised on
the TCP connection. Normally, this will be done by placing each
TCP socket attached to a KCM multiplexor in an epoll set for the POLLERR
event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
on the socket, thus waking up the application thread. When the application
sees the error (which may just be a disconnect) it should unattach the
socket from KCM and then close it. It is assumed that once an error is
posted on the TCP socket the data stream is unrecoverable (i.e. an error
may have occurred in the middle of receiving a message).

TCP connection monitoring
-------------------------

In KCM there is no means to correlate a message to the TCP socket that
was used to send or receive the message (except in the case there is
only one attached TCP socket). However, the application does retain
an open file descriptor to the socket so it will be able to get statistics
from the socket which can be used in detecting issues (such as high
retransmissions on the socket).