.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

 * Software crypto mode (``TLS_SW``) - the CPU handles the cryptography.
   In the most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context the CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
 * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
   the NIC driver and firmware replace the kernel networking stack
   with their own TCP handling. It is not usable in production environments
   that make use of the Linux networking stack, for example any firewalling
   abilities or QoS and packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on a per-connection basis is not currently supported.
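
As a hedged illustration of the user-space step mentioned at the start of
this section, the sketch below enables the TLS ULP and installs TX state for
TLS 1.2 with AES-GCM-128, using the socket options documented in
``Documentation/networking/tls.rst``. The helper name ``ktls_enable_tx()``
is hypothetical, the fallback ``#define`` values are only there for older
libc headers, and real code would take the key material from a completed
TLS handshake.

.. code-block:: c

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <linux/tls.h>

    /* Fallbacks for libc headers that predate kTLS support. */
    #ifndef TCP_ULP
    #define TCP_ULP 31
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282
    #endif

    /*
     * Hypothetical helper: attach the "tls" ULP to an established TCP
     * socket and install the transmit-side crypto state.
     * Returns 0 on success, -1 on failure (errno set by setsockopt()).
     */
    int ktls_enable_tx(int fd, const unsigned char *key,
                       const unsigned char *iv, const unsigned char *salt,
                       const unsigned char *rec_seq)
    {
        struct tls12_crypto_info_aes_gcm_128 ci;

        if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* The RX direction is installed separately with TLS_RX. */
        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }

The caller does not choose between ``TLS_SW`` and ``TLS_HW`` here; as noted
above, the kernel selects the mode based on device configuration.
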

TX
--

At a high level user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to the decryption
routine. If the device decrypted all the segments of the record the decryption
is skipped, otherwise the software path handles decryption.

.. kernel-figure:: tls-offload-layers.svg
   :alt: TLS offload layers
   :align: center
   :figwidth: 28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload.
In case the offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

The offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

    int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                       enum tls_offload_ctx_dir direction,
                       struct tls_crypto_info *crypto_info,
                       u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. The driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, iv, salt
as well as the TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
the sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction the local networking stack has little control over
the segmentation, so the initial record's TCP sequence number may be anywhere
inside the segment.
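
Because the first offloaded record may begin in the middle of a received
segment, RX logic has to locate ``start_offload_tcp_sn`` relative to
the segment it is looking at. The following user-space sketch shows one
wrap-safe way to do that arithmetic; the function name
``first_record_offset()`` is hypothetical and the code is an illustration,
not taken from any driver.

.. code-block:: c

    #include <stdint.h>

    /*
     * Return the byte offset inside the segment [seg_seq, seg_seq + seg_len)
     * at which the first offloaded record starts, or -1 if start_sn does
     * not fall inside this segment. Unsigned subtraction is performed
     * modulo 2^32, so TCP sequence number wrap-around is handled naturally.
     */
    int first_record_offset(uint32_t seg_seq, uint32_t seg_len,
                            uint32_t start_sn)
    {
        uint32_t off = start_sn - seg_seq;

        if (off >= seg_len)
            return -1;
        return (int)off;
    }

For example, a 1460 byte segment starting at sequence number 1000 with
``start_offload_tcp_sn`` equal to 1200 yields offset 200; the bytes before
that offset belong to data that was never part of the offloaded stream.
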

Normal operation
================

At the minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data had been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.

TX
--

The kernel stack performs record framing, reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If a connection is matched the device confirms that the TCP
sequence number is the expected one and proceeds to TLS handling (record
delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified, the stack takes care of
record decapsulation. The device indicates successful handling of TLS offload
in the per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. The networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In the presence of packet drops or network packet reordering, the device may
lose synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such a connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.
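
The in-order check that decides whether a received segment is decrypted,
and whose repeated failure is what eventually leads to a resync, can be
modeled roughly as follows. This is a deliberately simplified user-space
sketch: the type and function names are hypothetical, and a real device
would also track record boundaries and decide when to declare itself out
of sync, which is not modeled here.

.. code-block:: c

    #include <stdint.h>

    enum rx_verdict {
        RX_PASSTHROUGH = 0, /* hand the packet to the host as received */
        RX_DECRYPTED = 1,   /* driver may set the decrypted mark */
    };

    /* Hypothetical per-connection RX state kept by the device. */
    struct rx_conn_state {
        uint32_t expected_seq; /* next in-order TCP sequence number */
        int in_sync;           /* nonzero while record boundaries are known */
    };

    /*
     * Decide what to do with one segment of a matched connection.
     * Only segments with a valid checksum that arrive exactly at the
     * expected sequence number are handled by the offload.
     */
    enum rx_verdict rx_handle_segment(struct rx_conn_state *st,
                                      uint32_t seq, uint32_t len,
                                      int csum_ok)
    {
        if (!csum_ok || !st->in_sync || seq != st->expected_seq)
            return RX_PASSTHROUGH;

        st->expected_seq = seq + len; /* wraps modulo 2^32 */
        return RX_DECRYPTED;
    }
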

TX
--

Segments transmitted from an offloaded socket can get out of sync
in ways similar to the receive side: retransmissions and local drops
are possible, though network reorders are not.

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This most likely means that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move on to handling
the actual packet.

In this mode, depending on the implementation, the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection, therefore it is the recommended method until
such time as it is proven inefficient.

RX
--

A small amount of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when the record boundary can be recovered:

.. kernel-figure:: tls-offload-reorder-good.svg
   :alt: reorder of non-header segment
   :align: center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In the above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of the expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of the record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure:: tls-offload-reorder-bad.svg
   :alt: reorder of header segment
   :align: center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
The device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once the pattern is matched
the device continues attempting to parse headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.
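
The header scan described above can be illustrated with a small user-space
sketch. It looks for five bytes that resemble a TLS record header: a content
type, the ``0x03 0x03`` version bytes, and a length field. The content type
range (20 to 23) and the length bound (2^14 + 2048, the TLS 1.2 ciphertext
limit) are assumptions based on the TLS record format rather than anything
mandated by this document, and the function names are hypothetical.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #define TLS_HDR_LEN        5
    #define TLS_MAX_CIPHERTEXT (16384 + 2048) /* assumed TLS 1.2 bound */

    /* Return nonzero if the 5 bytes at p look like a TLS record header. */
    int tls_header_plausible(const uint8_t *p)
    {
        unsigned int len = ((unsigned int)p[3] << 8) | p[4];

        if (p[0] < 20 || p[0] > 23)       /* assumed content type range */
            return 0;
        if (p[1] != 0x03 || p[2] != 0x03) /* version field pattern */
            return 0;
        return len > 0 && len <= TLS_MAX_CIPHERTEXT;
    }

    /* Scan buf for the first plausible header; return its offset or -1. */
    long tls_scan_for_header(const uint8_t *buf, size_t len)
    {
        size_t i;

        for (i = 0; i + TLS_HDR_LEN <= len; i++)
            if (tls_header_plausible(buf + i))
                return (long)i;
        return -1;
    }

A match found this way is only a guess. The next header would be expected
``TLS_HDR_LEN`` plus the parsed length bytes later, and a mismatch at that
location restarts the scan.
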

When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.
The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device had been parsing
and counting all records since the just-confirmed one; it adds the number
of records it had seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at the next
segment boundary.

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart the scan. Given how unlikely a falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and a record may get processed
by the kernel before the confirmation request.

Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such a condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted, such a packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words either all records in the packet
had been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (recovering the original packet in
the driver if the device provides a precise error is sufficient).

The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors; packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions;
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections the device can support can be exposed via
the ``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
a significant performance impact on non-offloaded streams.

Statistics
==========

The following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted`` - number of successfully decrypted TLS segments
 * ``tx_tls_encrypted`` - number of in-order TLS segments passed to device
   for encryption
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order
 * ``tx_tls_drop_no_sync_data`` - number of TX packets dropped because
   they arrived out of order and associated record could not be found
   (see also :ref:`pre_tls_data`)

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. The current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

The device should not modify any packet headers for the purpose
of simplifying TLS offload.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.

Known bugs
==========

skb_orphan() leaks clear text
-----------------------------

Currently drivers depend on the :c:member:`sk` member of
:c:type:`struct sk_buff <sk_buff>` to identify segments requiring
encryption. Any operation which removes or does not preserve the socket
association such as :c:func:`skb_orphan` or :c:func:`skb_clone`
will cause the driver to miss the packets and lead to clear text leaks.

Redirects leak clear text
-------------------------

In the RX direction, if a segment has already been decrypted by the device
and it gets redirected or mirrored - clear text will be transmitted out.

.. _pre_tls_data:

Transmission of pre-TLS data
----------------------------

User can enqueue some already encrypted and framed records before enabling
``ktls`` on the socket. Those records have to get sent as they are. This is
perfectly easy to handle in the software case - such data will be waiting
in the TCP layer, TLS ULP won't see it. In the offloaded case when a pre-queued
segment reaches the transmission point it appears to be out of order (before
the expected TCP sequence number) and the stack does not have record
information associated with it.

Segments without record information cannot, however, all be assumed to be
pre-queued data, because a race condition exists between the TCP stack queuing
a retransmission, the driver seeing the retransmission and the TCP ACK arriving
for the retransmitted data.
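
One possible way for a driver to tell the two cases apart is sketched
below as a user-space model. It is an illustration under stated
assumptions, not a description of what any existing driver does: the names
are hypothetical, ``have_record`` stands for the result of looking up
kernel record information for the segment, and comparing against
``start_offload_tcp_sn`` is assumed to be an acceptable test for pre-TLS
data.

.. code-block:: c

    #include <stdint.h>

    enum tx_ooo_action {
        TX_SEND_AS_IS, /* pre-TLS data, transmit without crypto */
        TX_RESYNC,     /* record known, replay record start to device */
        TX_DROP,       /* no record information, cannot be encrypted */
    };

    /* Wrap-safe "a is before b" for 32-bit TCP sequence numbers. */
    int seq_before(uint32_t a, uint32_t b)
    {
        return (int32_t)(a - b) < 0;
    }

    /* Classify a TX segment that did not arrive at the expected sequence. */
    enum tx_ooo_action tx_classify_ooo(uint32_t seg_seq,
                                       uint32_t start_offload_tcp_sn,
                                       int have_record)
    {
        if (have_record)
            return TX_RESYNC;
        if (seq_before(seg_seq, start_offload_tcp_sn))
            return TX_SEND_AS_IS;
        /* Likely a retransmission whose record was already released. */
        return TX_DROP;
    }

In this model the ``TX_DROP`` outcome corresponds to the situation counted
by the ``tx_tls_drop_no_sync_data`` statistic.
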