============
MSG_ZEROCOPY
============

Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
The feature is currently implemented for TCP sockets.


Opportunity and Caveats
-----------------------

Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
such as sendpage and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.

Copy avoidance is not a free lunch. As implemented, with page pinning,
it replaces per byte copy cost with page accounting and completion
notification overhead. As a result, MSG_ZEROCOPY is generally only
effective at writes over around 10 KB.

Page pinning also changes system call semantics. It temporarily shares
the buffer between process and network stack. Unlike with copying, the
process cannot immediately overwrite the buffer after system call
return without possibly modifying the data in flight. Kernel integrity
is not affected, but a buggy program can possibly corrupt its own data
stream.

The kernel returns a notification when it is safe to modify data.
Converting an existing application to MSG_ZEROCOPY is not always as
trivial as just passing the flag, then.


More Info
---------

Much of this document was derived from a longer paper presented at
netdev 2.1. For more in-depth information see that paper and talk,
the excellent reporting over at LWN.net or read the original code.

  paper, slides, video
    https://netdevconf.org/2.1/session.html?debruijn

  LWN article
    https://lwn.net/Articles/726917/

  patchset
    [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
    http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com


Interface
=========

Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
avoidance, but not the only one.


Socket Setup
------------

The kernel is permissive when applications pass undefined flags to the
send system call. By default it simply ignores these. To avoid enabling
copy avoidance mode for legacy processes that accidentally already pass
this flag, a process must first signal intent by setting a socket option:

::

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		error(1, errno, "setsockopt zerocopy");

Setting the socket option only works when the socket is in its initial
(TCP_CLOSE) state. Trying to set the option for a socket returned by accept(),
for example, will lead to an EBUSY error. In this case, the option should be set
on the listening socket and it will be inherited by the accepted sockets.

Transmission
------------

The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
Pass the new flag.

::

	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);

A zerocopy failure will return -1 with errno ENOBUFS. This happens if
the socket option was not set, the socket exceeds its optmem limit or
the user exceeds its ulimit on locked pages.


Mixing copy avoidance and copying
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many workloads have a mixture of large and small buffers. Because copy
avoidance is more expensive than copying for small packets, the
feature is implemented as a flag. It is safe to mix calls with the flag
with those without.


Notifications
-------------

The kernel has to notify the process when it is safe to reuse a
previously passed buffer. It queues completion notifications on the
socket error queue, akin to the transmit timestamping interface.

The notification itself is a simple scalar value. Each socket
maintains an internal unsigned 32-bit counter. Each send call with
MSG_ZEROCOPY that successfully sends data increments the counter.
The counter is not incremented on failure or if called with length zero.
The counter counts system call invocations, not bytes. It wraps after
UINT_MAX calls.


Notification Reception
~~~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates the API. In the simplest case, each
send syscall is followed by a poll and recvmsg on the error queue.

Reading from the error queue is always a non-blocking operation. The
poll call is there to block until an error is outstanding. It will set
POLLERR in its output flags. That flag does not have to be set in the
events field. Errors are signaled unconditionally.

::

	pfd.fd = fd;
	pfd.events = 0;
	if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
		error(1, errno, "poll");

	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
	if (ret == -1)
		error(1, errno, "recvmsg");

	read_notification(&msg);

The example is for demonstration purposes only. In practice, it is more
efficient to not wait for notifications, but read without blocking
every couple of send calls.

Notifications can be processed out of order with other operations on
the socket. A socket that has an error queued would normally block
other operations until the error is read. Zerocopy notifications have
a zero error code, however, to not block send and recv calls.


Notification Batching
~~~~~~~~~~~~~~~~~~~~~

Multiple outstanding packets can be read at once using the recvmmsg
call. This is often not needed. In each message the kernel returns not
a single value, but a range. It coalesces consecutive notifications
while one is outstanding for reception on the error queue.

When a new notification is about to be queued, the kernel checks whether
the new value extends the range of the notification at the tail of the
queue.
If so, it drops the new notification packet and instead increases
the range upper value of the outstanding notification.

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.


Notification Parsing
~~~~~~~~~~~~~~~~~~~~

The snippet below demonstrates how to parse the control message: the
read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
specific, IP_RECVERR or IPV6_RECVERR.

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
the socket.

The 32-bit notification range is encoded as [ee_info, ee_data]. This
range is inclusive. Other fields in the struct must be treated as
undefined, except for ee_code, as discussed below.

::

	struct sock_extended_err *serr;
	struct cmsghdr *cm;

	cm = CMSG_FIRSTHDR(msg);
	if (cm->cmsg_level != SOL_IP ||
	    cm->cmsg_type != IP_RECVERR)
		error(1, 0, "cmsg");

	serr = (void *) CMSG_DATA(cm);
	if (serr->ee_errno != 0 ||
	    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
		error(1, 0, "serr");

	printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);


Deferred copies
~~~~~~~~~~~~~~~

Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
avoidance, and a contract that the kernel will queue a completion
notification. It is not a guarantee that the copy is elided.

Copy avoidance is not always feasible.
Devices that do not support
scatter-gather I/O cannot send packets made up of kernel generated
protocol headers plus zerocopy user data. A packet may need to be
converted to a private copy of data deep in the stack, say to compute
a checksum.

In all these cases, the kernel returns a completion notification when
it releases its hold on the shared pages. That notification may arrive
before the (copied) data is fully transmitted. A zerocopy completion
notification is not a transmit completion notification, therefore.

Deferred copies can be more expensive than a copy immediately in the
system call, if the data is no longer warm in the cache. The process
also incurs notification processing cost for no benefit. For this
reason, the kernel signals if data was completed with a copy, by
setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
A process may use this signal to stop passing flag MSG_ZEROCOPY on
subsequent requests on the same socket.


Implementation
==============

Loopback
--------

Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbounded notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.


Testing
=======

More realistic example code can be found in the kernel source under
tools/testing/selftests/net/msg_zerocopy.c.

Be cognizant of the loopback constraint. The test can be run between
a pair of hosts. But if run between a local pair of processes, for
instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement.
For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.