# OpenBMC in-kernel MCTP

Author: Jeremy Kerr `<jk@codeconstruct.com.au>`

Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
description, background and requirements.

This document describes a kernel-based implementation of MCTP infrastructure,
providing a sockets-based API for MCTP communication within an OpenBMC-based
platform.

# Requirements for a kernel implementation

- The MCTP messaging API should be an obvious application of the existing POSIX
  socket interface

- Configuration should be simple for a straightforward MCTP endpoint: a single
  network with a single local endpoint id (EID).

- Infrastructure should be flexible enough to allow for more complex MCTP
  networks, allowing:

  - each MCTP network (as defined by section 3.2.31 of DSP0236) may consist of
    multiple local physical interfaces, and/or multiple EIDs;

  - multiple distinct (ie., non-bridged) networks, possibly containing
    duplicated EIDs between networks;

  - multiple local EIDs on a single interface, and

  - customisable routing/bridging configurations within a network.

# Proposed Design

The design contains several components:

- An interface for userspace applications to send and receive MCTP messages: A
  mapping of the sockets API to MCTP usage

- Infrastructure for control and configuration of the MCTP network(s),
  consisting of a configuration utility, and a kernel messaging facility for
  this utility to use.

- Kernel drivers for physical interface bindings.

In general, the kernel components cover the transport functionality of MCTP,
such as message assembly/disassembly, packet forwarding, and physical interface
implementations.

Higher-level protocols (such as PLDM) are implemented in userspace, through the
introduced socket API. This also includes the majority of the MCTP Control
Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
have a specific process to request and respond to control protocol messages.
However, the kernel will include a small subset of control protocol code to
allow very simple endpoints, with static EID allocations, to run without this
process. MCTP endpoints that require more than just single-endpoint
functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would
include the control message protocol process.

A new driver is introduced to handle each physical interface binding. These
drivers expose the appropriate `struct net_device` to handle transmission and
reception of MCTP packets on their associated hardware channels. Under Linux,
the namespace for these interfaces is separate from other network interfaces -
such as those for ethernet.

## Structure: interfaces & networks

The kernel models the local MCTP topology through two items: interfaces and
networks.

An interface (or "link") is an instance of an MCTP physical transport binding
(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
device. This is represented as a `struct net_device`, and has a user-visible
name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
allow local loopback and/or virtual interfaces.

A network defines a unique address space for MCTP endpoints by endpoint-ID
(described by DSP0236, section 3.2.31). A network has a user-visible identifier
to allow references from userspace. Route definitions are specific to one
network.

Interfaces are associated with one network. A network may be associated with one
or more interfaces.

If multiple networks are present, each may contain EIDs that are also present on
other networks.

## Sockets API

### Protocol definitions

We define a new address family (and corresponding protocol family) for MCTP:

```c
    #define AF_MCTP /* TBD */
    #define PF_MCTP AF_MCTP
```

MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
the domain. Currently, only a `SOCK_DGRAM` socket type is defined.

```c
    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
```

The only (current) value for the `protocol` argument is 0. Future protocol
implementations may be added later.

MCTP sockets opened with a protocol value of 0 will communicate directly at the
transport layer; message buffers received by the application will consist of
message data from reassembled MCTP packets, and will include the full message
including message type byte and optional message integrity check (IC).
Individual packet headers are not included; they may be accessible through a
future `SOCK_RAW` socket type.

As with all socket address families, source and destination addresses are
specified with a new `sockaddr` type:

```c
    struct sockaddr_mctp {
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
    };

    struct mctp_addr {
            uint8_t             s_addr;
    };

    /* MCTP network values */
    #define MCTP_NET_ANY        0

    /* MCTP EID values */
    #define MCTP_ADDR_ANY       0xff
    #define MCTP_ADDR_BCAST     0xff

    /* MCTP type values. Only the least-significant 7 bits of
     * smctp_type are used for matching; the specification defines
     * the type to be 7 bits.
     */
    #define MCTP_TYPE_MASK      0x7f

    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
    /* MCTP-spec-defined fields */
    #define MCTP_TAG_MASK    0x07
    #define MCTP_TAG_OWNER   0x08
    /* Others: reserved */

    /* Helpers */
    /* response to a request: clear TO, keep value */
    #define MCTP_TAG_RSP(x) ((x) & MCTP_TAG_MASK)
```

### Syscall behaviour

The following sections describe the MCTP-specific behaviours of the standard
socket system calls. These behaviours have been chosen to map closely to the
existing sockets APIs.

#### `bind()`: set local socket address

Sockets that receive incoming request packets will bind to a local address,
using the `bind()` syscall.

```c
    struct sockaddr_mctp addr;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
    addr.smctp_type = MCTP_TYPE_PLDM;
    addr.smctp_tag = MCTP_TAG_OWNER;

    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the local address of the socket. Incoming MCTP messages that
match the network, address, and message type will be received by this socket.
The reference to 'incoming' is important here; a bound socket will only receive
messages with the TO bit set, to indicate an incoming request message, rather
than a response.

The `smctp_tag` value will configure the tags accepted from the remote side of
this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
will result in remotely "owned" tags being routed to this socket. Since
`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are not
used; callers must set them to zero. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `bind()` will
fail with an errno of `EINVAL`.

A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
incoming packets from any locally-connected network. A specific network value
will cause the socket to only receive incoming messages from that network.

The `smctp_addr` field specifies a local address to bind to. A value of
`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any local
destination EID.

The `smctp_type` field specifies which message types to receive. Only the lower
7 bits of the type are matched on incoming messages (ie., the most-significant
IC bit is not part of the match). This results in the socket receiving packets
with and without a message integrity check footer.

#### `connect()`: set remote socket address

Applications may specify a socket's remote address with the `connect()` syscall:

```c
    struct sockaddr_mctp addr;
    int rc;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_PLDM;

    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the remote address of a socket, used for future message
transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
traffic directly, but just sets the default destination for messages sent from
this socket.

The `smctp_network` field may specify a locally-attached network, or the value
`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
This is guaranteed to work for single-network configurations, but may require
additional routing definitions for endpoints attached to multiple distinct
networks. See the [Addressing](#addressing) section for details.

The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
the MCTP broadcast EID (0xff).

The `smctp_type` field specifies the type field of messages transferred over
this socket.

The `smctp_tag` value will configure the tag used for the local side of this
socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
"owned" tag being allocated for this socket. The tag remains allocated for all
future outgoing messages, until either the socket is closed, or `connect()` is
called again. If a tag cannot be allocated, `connect()` will report an error,
with an errno value of `EAGAIN`. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `connect()`
will fail with an errno of `EINVAL`.

Requesters which connect to a single responder will typically use `connect()` to
specify the peer address and tag for future outgoing messages.

#### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message

An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`,
`send()` or `write()` syscalls. Using `sendto()` as the primary example:

```c
    struct sockaddr_mctp addr;
    char buf[14];
    ssize_t len;

    /* set message destination */
    addr.smctp_family = AF_MCTP;
    addr.smctp_network = 0;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_ECHO;

    /* arbitrary message to send, with message-type header */
    buf[0] = MCTP_TYPE_ECHO;
    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);

    len = sendto(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, sizeof(addr));
```

The address argument is treated the same way as for `connect()`: the network and
address fields define the remote address to send to. If `smctp_tag` has the
`MCTP_TAG_OWNER` bit set, the kernel will ignore any bits set in the
`MCTP_TAG_MASK` portion of the field, and generate a tag value suitable for the
destination EID. If `MCTP_TAG_OWNER` is not set, the message will be sent with
the tag value as specified. If a tag value cannot be allocated, the system call
will report an errno of `EAGAIN`.

The application must provide the message type byte as the first byte of the
message buffer passed to `sendto()`. If a message integrity check is to be
included in the transmitted message, it must also be provided in the message
buffer, and the most-significant bit of the message type byte must be 1.

If the first byte of the message does not match the message type value, then the
system call will return an error of `EPROTO`.
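
As a rough illustration of the type-byte rule, the check might look like the sketch below. This is an assumption-laden model rather than the kernel's implementation: `check_type_byte()`, `has_integrity_check()` and `MCTP_TYPE_IC` are hypothetical names, the comparison is assumed to ignore the IC bit (mirroring the `bind()` matching rule), and the handling of an empty buffer is a guess.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCTP_TYPE_MASK 0x7f /* from the definitions in this document */
#define MCTP_TYPE_IC   0x80 /* hypothetical name for the IC bit */

/* Returns 0 if the buffer's first byte agrees with smctp_type, else -EPROTO.
 * Assumption: only the low 7 bits are compared. */
static int check_type_byte(uint8_t smctp_type, const uint8_t *buf, size_t len)
{
        assert(buf != NULL);

        if (len < 1)
                return -EPROTO; /* assumption: no type byte at all is an error */
        if ((buf[0] ^ smctp_type) & MCTP_TYPE_MASK)
                return -EPROTO;
        return 0;
}

/* True if the sender has flagged a trailing message integrity check */
static bool has_integrity_check(const uint8_t *buf, size_t len)
{
        return len > 0 && (buf[0] & MCTP_TYPE_IC) != 0;
}
```

An application could run the same check before calling `sendto()` to catch a mismatched buffer early.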

The `send()` and `write()` system calls behave in a similar way, but do not
specify a remote address. Therefore, `connect()` must be called beforehand; if
not, these calls will return an error of `EDESTADDRREQ` (Destination address
required).

Using `sendto()` or `sendmsg()` on a connected socket may override the remote
socket address specified in `connect()`. The `connect()` address and tag will
remain associated with the socket, for future unaddressed sends. The tag
allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
subject to the same invalidation logic as on an unconnected socket: It is
expired either by timeout or by a subsequent `sendto()`.

The `sendmsg()` system call allows a more compact argument interface, and the
message buffer to be specified as a scatter-gather list. At present no ancillary
message types (used for the `msg_control` data passed to `sendmsg()`) are
defined.
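
A scatter-gather send lets the application keep the type byte separate from the payload. The helper below is a minimal sketch using only the standard `struct msghdr` and `struct iovec`; `mctp_fill_msghdr()` is a hypothetical name, and the destination is passed as a generic `struct sockaddr *` so that the sketch does not depend on the (still to be assigned) `AF_MCTP` definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Describe a message as two segments: the type byte, then the payload.
 * Returns the total message length that sendmsg() would transmit. */
static size_t mctp_fill_msghdr(struct msghdr *mh, struct iovec iov[2],
                               struct sockaddr *dest, socklen_t destlen,
                               uint8_t *type, void *payload,
                               size_t payload_len)
{
        assert(mh != NULL && iov != NULL && type != NULL);

        iov[0].iov_base = type;
        iov[0].iov_len = 1;
        iov[1].iov_base = payload;
        iov[1].iov_len = payload_len;

        /* no ancillary data: msg_control stays NULL after the memset */
        memset(mh, 0, sizeof(*mh));
        mh->msg_name = dest;
        mh->msg_namelen = destlen;
        mh->msg_iov = iov;
        mh->msg_iovlen = 2;

        return iov[0].iov_len + iov[1].iov_len;
}
```

The filled structure would then be passed as `sendmsg(sd, &mh, 0)`, with `dest` pointing at a populated `struct sockaddr_mctp` for an unconnected socket, or `NULL` for a connected one.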

Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
will cause an allocation of a tag, if no valid tag is already allocated for that
destination. The (destination-eid,tag) tuple acts as an implicit local socket
address, to allow the socket to receive responses to this outgoing message. If
any previous allocation has been performed (for a different remote EID), that
allocation is lost. This tag behaviour can be controlled through the
`MCTP_TAG_CONTROL` socket option.

Sockets will only receive responses to requests they have sent (with TO=1) and
may only respond (with TO=0) to requests they have received.

#### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message

An MCTP message can be received by an application using one of the `recvfrom()`,
`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
primary example:

```c
    struct sockaddr_mctp addr;
    socklen_t addrlen;
    char buf[14];
    ssize_t len;

    addrlen = sizeof(addr);

    len = recvfrom(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, &addrlen);

    /* We can expect addr to describe an MCTP address */
    assert(addrlen >= sizeof(addr));
    assert(addr.smctp_family == AF_MCTP);

    printf("received %zd bytes from remote EID %d\n", len,
           addr.smctp_addr.s_addr);
```

The address argument to `recvfrom` and `recvmsg` is populated with the remote
address of the incoming message, including tag value (this will be needed in
order to reply to the message).

The first byte of the message buffer will contain the message type byte. If an
integrity check follows the message, it will be included in the received buffer.

The `recv()` and `read()` system calls behave in a similar way, but do not
provide a remote address to the application. Therefore, these are only useful if
the remote address is already known, or the message does not require a reply.

Like the send calls, sockets will only receive responses to requests they have
sent (TO=1) and may only respond (TO=0) to requests they have received.

#### `getsockname()` & `getpeername()`: query local/remote socket address

The `getsockname()` system call returns the `struct sockaddr_mctp` value for the
local side of this socket, `getpeername()` for the remote (ie, that used in a
`connect()`). Since the tag value is a property of the remote address,
`getpeername()` may be used to retrieve a kernel-allocated tag value.

Calling `getpeername()` on an unconnected socket will result in an error of
`ENOTCONN`.

#### Socket options

The following socket options are defined for MCTP sockets:

##### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg

Enabling this socket option allows an application to specify extended addressing
information on transmitted packets, and access the same on received packets.

When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify
an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls.
This is defined as:

```c
    struct sockaddr_mctp_ext {
            /* fields exactly match struct sockaddr_mctp */
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
            /* extended addressing */
            int                 smctp_ifindex;
            uint8_t             smctp_halen;
            unsigned char       smctp_haddr[/* TBD */];
    };
```

If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to
contain this larger structure, then the extended addressing fields are consumed
/ populated respectively.

##### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour

The set/getsockopt argument is a `mctp_tagctl` structure:

    struct mctp_tagctl {
        bool            retain;
        struct timespec timeout;
    };

This allows an application to control the behaviour of allocated tags for
non-connected sockets when transferring messages to multiple different
destinations (ie., where a `struct sockaddr_mctp` is provided for individual
messages, and the `smctp_addr` destination for those sockets may vary across
calls).

The `retain` flag indicates to the kernel that the socket should not release tag
allocations when a message is sent to a new destination EID. This causes the
socket to continue to receive incoming messages to the old (dest,tag) tuple, in
addition to the new tuple.

The `timeout` value specifies a maximum amount of time to retain tag values.
This should be based on the reply timeout for any upper-level protocol.

The kernel may reject a request to set values that would cause excessive tag
allocation by this socket. The kernel may also reject subsequent tag-allocation
requests (through send or connect syscalls) which would cause excessive tags to
be consumed by the socket, even though the tag control settings were accepted in
the setsockopt operation.

Changing the default tag control behaviour should only be required when:

- the socket is sending messages with TO=1 (ie, is a requester); and
- messages are sent to multiple different destination EIDs from the one socket.
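
As a small sketch of preparing the option argument, the helper below fills a `struct mctp_tagctl` from a reply timeout expressed in milliseconds. The structure layout is the one given above; `tagctl_init()` is a hypothetical helper, and the socket level and option number to pass to `setsockopt()` are not defined by this document, so the call itself is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Layout as described in this section */
struct mctp_tagctl {
        bool            retain;
        struct timespec timeout;
};

/* Build a tag-control argument from an upper-level protocol's reply
 * timeout, given in milliseconds. */
static struct mctp_tagctl tagctl_init(bool retain, long timeout_ms)
{
        struct mctp_tagctl ctl;

        assert(timeout_ms >= 0);

        ctl.retain = retain;
        ctl.timeout.tv_sec = timeout_ms / 1000;
        ctl.timeout.tv_nsec = (timeout_ms % 1000) * 1000000L;
        return ctl;
}
```

A requester that talks to several EIDs from one socket, with a 2.5 second reply timeout, would pass the result of `tagctl_init(true, 2500)` as the `MCTP_TAG_CONTROL` option value.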

#### Syscalls not implemented

The following system calls are not implemented for MCTP, primarily as they are
not used in `SOCK_DGRAM`-type sockets:

- `listen()`
- `accept()`
- `ioctl()`
- `shutdown()`
- `mmap()`

### Userspace examples

These examples cover three general use-cases:

- **requester**: sends requests to a particular (EID, type) target, and receives
  responses to those packets

  This is similar to a typical UDP client

- **responder**: receives all locally-addressed messages of a specific
  message-type, and responds to the requester immediately.

  This is similar to a typical UDP server

- **controller**: a specific service for a bus owner; may send broadcast
  messages, manage EID allocations, update local MCTP stack state. Will need
  low-level packet data.

  This is similar to a DHCP server.

#### Requester

"Client"-side implementation to send requests to a responder, and receive a
response. This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.

```c
    #include <assert.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    /* plus the AF_MCTP, struct sockaddr_mctp and MCTP_* definitions above */

    int main(void) {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            struct {
                uint8_t type;
                uint8_t data[14];
            } msg;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
            if (sd < 0)
                    err(EXIT_FAILURE, "socket");

            memset(&addr, 0, sizeof(addr));
            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
            addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
            addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
            addrlen = sizeof(addr);

            /* set message type and payload */
            msg.type = MCTP_TYPE_ECHO;
            strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));

            /* send message */
            rc = sendto(sd, &msg, sizeof(msg), 0,
                            (struct sockaddr *)&addr, addrlen);

            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Receive reply. This will block until a reply arrives,
             * which may never happen. Actual code would need a timeout
             * here. */
            rc = recvfrom(sd, &msg, sizeof(msg), 0,
                        (struct sockaddr *)&addr, &addrlen);
            if (rc < 0)
                    err(EXIT_FAILURE, "recvfrom");

            assert(msg.type == MCTP_TYPE_ECHO);
            /* ensure we're nul-terminated */
            msg.data[sizeof(msg.data)-1] = '\0';

            printf("reply: %s\n", (char *)msg.data);

            return EXIT_SUCCESS;
    }
```

#### Responder

"Server"-side implementation to receive requests and respond. Like the client,
this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the
`struct sockaddr_mctp`; only messages matching this type will be received.

```c
    #include <assert.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    /* plus the AF_MCTP, struct sockaddr_mctp and MCTP_* definitions above */

    int main(void) {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
            if (sd < 0)
                    err(EXIT_FAILURE, "socket");

            memset(&addr, 0, sizeof(addr));
            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
            addr.smctp_type = MCTP_TYPE_ECHO;
            addr.smctp_tag = MCTP_TAG_OWNER;
            addrlen = sizeof(addr);

            rc = bind(sd, (struct sockaddr *)&addr, addrlen);
            if (rc)
                    err(EXIT_FAILURE, "bind");

            for (;;) {
                    struct {
                        uint8_t type;
                        uint8_t data[14];
                    } msg;

                    /* addrlen is a value-result argument; reset it each time */
                    addrlen = sizeof(addr);

                    rc = recvfrom(sd, &msg, sizeof(msg), 0,
                                    (struct sockaddr *)&addr, &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");
                    if (rc < 1) {
                            warnx("not enough data for a message type");
                            continue;
                    }

                    assert(addrlen == sizeof(addr));
                    assert(msg.type == MCTP_TYPE_ECHO);

                    printf("%d bytes from EID %d\n", rc, addr.smctp_addr.s_addr);

                    /* Reply to requester; this macro just clears the TO-bit.
                     * Other addr fields will describe the remote endpoint,
                     * so use those as-is.
                     */
                    addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);

                    rc = sendto(sd, &msg, rc, 0,
                                (struct sockaddr *)&addr, addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "sendto");
            }

            return EXIT_SUCCESS;
    }
```

#### Broadcast request

Sends a request to a broadcast EID, and receives (unicast) replies. Typical
control protocol pattern.

```c
    #include <assert.h>
    #include <err.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>
    /* plus the AF_MCTP, struct sockaddr_mctp and MCTP_* definitions above */

    int main(void) {
            struct sockaddr_mctp txaddr, rxaddr;
            struct timespec start, cur;
            struct pollfd pollfds[1];
            socklen_t addrlen;
            uint8_t buf[2];
            int sd, rc, timeout;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
            if (sd < 0)
                    err(EXIT_FAILURE, "socket");

            /* destination address setup */
            memset(&txaddr, 0, sizeof(txaddr));
            txaddr.smctp_family = AF_MCTP;
            txaddr.smctp_network = 1; /* specific network required for broadcast */
            txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
            txaddr.smctp_type = MCTP_TYPE_CONTROL;
            txaddr.smctp_tag = MCTP_TAG_OWNER;

            buf[0] = MCTP_TYPE_CONTROL;
            buf[1] = 'a';

            /* We're doing a sendto() to a broadcast address here. If we were
             * sending more than one broadcast message, we'd be better off
             * doing connect(); sendto();, in order to retain the tag
             * reservation across all transmitted messages. However, since this
             * is a single transmit, that makes no difference in this
             * particular case.
             */
            rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr,
                            sizeof(txaddr));
            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Set up poll behaviour, and record our starting time for
             * reply timeouts */
            pollfds[0].fd = sd;
            pollfds[0].events = POLLIN;
            clock_gettime(CLOCK_MONOTONIC, &start);

            for (;;) {
                    /* Calculate the amount of time left for replies.
                     * calculate_timeout() is an application-provided helper
                     * returning the milliseconds remaining of the 1000ms
                     * reply window. */
                    clock_gettime(CLOCK_MONOTONIC, &cur);
                    timeout = calculate_timeout(&start, &cur, 1000);

                    rc = poll(pollfds, 1, timeout);
                    if (rc < 0)
                        err(EXIT_FAILURE, "poll");

                    /* timeout receiving a reply? */
                    if (rc == 0)
                        break;

                    /* sanity check that we have a message to receive */
                    if (!(pollfds[0].revents & POLLIN))
                        break;

                    addrlen = sizeof(rxaddr);

                    rc = recvfrom(sd, buf, 2, 0, (struct sockaddr *)&rxaddr,
                            &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");

                    assert(addrlen >= sizeof(rxaddr));
                    assert(rxaddr.smctp_family == AF_MCTP);

                    printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
            }

            return EXIT_SUCCESS;
    }
```
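
The broadcast example calls a `calculate_timeout()` helper that the example itself does not define. One possible implementation is sketched below, assuming the third argument is the total reply window in milliseconds and the return value is the time remaining, suitable for passing to `poll()`. The clamp to zero is a design choice for this sketch: a zero timeout makes `poll()` return immediately, which ends the receive loop.

```c
#include <assert.h>
#include <time.h>

/* Milliseconds left of a total_ms window that began at *start, as seen
 * at time *cur. Never negative: an expired window yields 0. */
static int calculate_timeout(const struct timespec *start,
                             const struct timespec *cur, int total_ms)
{
        long long elapsed_ns, remaining_ms;

        assert(start != NULL && cur != NULL && total_ms >= 0);

        /* the nanosecond difference may be negative; the sum is still right */
        elapsed_ns = (long long)(cur->tv_sec - start->tv_sec) * 1000000000LL
                        + (cur->tv_nsec - start->tv_nsec);
        remaining_ms = total_ms - elapsed_ns / 1000000LL;

        return remaining_ms > 0 ? (int)remaining_ms : 0;
}
```

Using `CLOCK_MONOTONIC` timestamps, as the example does, keeps the calculation immune to wall-clock adjustments.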

### Implementation notes

#### Addressing

Transmitted messages (through `sendto()` and related system calls) specify their
destination via the `smctp_network` and `smctp_addr` fields of a
`struct sockaddr_mctp`.

The `smctp_addr` field maps directly to the destination endpoint's EID.

The `smctp_network` field specifies a locally defined network identifier. To
simplify situations where there is only one network defined, the special value
`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
network for transmission.

This selection is entirely user-configured; one specific network may be defined
as the system default, in which case it will be used for all message
transmission where `MCTP_NET_ANY` is used as the destination network.

In particular, the destination EID is never used to select a destination
network.

MCTP responders should use the EID and network values of an incoming request to
specify the destination for any responses.
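
The network selection rule is simple enough to state as a function. The sketch below is illustrative only: `resolve_tx_network()` is a hypothetical name, network identifiers are assumed to be positive integers, and the error returned when no default has been configured (`ENETUNREACH` here) is an assumption, since this document does not specify one.

```c
#include <assert.h>
#include <errno.h>

#define MCTP_NET_ANY 0 /* from the definitions in this document */

/* Pick the network for an outgoing message. Note that the destination
 * EID is deliberately not a parameter: it never influences the choice.
 * default_net is MCTP_NET_ANY when no system default is configured. */
static int resolve_tx_network(int requested_net, int default_net)
{
        assert(requested_net >= 0 && default_net >= 0);

        if (requested_net != MCTP_NET_ANY)
                return requested_net;
        if (default_net != MCTP_NET_ANY)
                return default_net;
        return -ENETUNREACH; /* assumed error value */
}
```

A responder avoids this lookup entirely by reusing the `smctp_network` value reported for the incoming request.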

#### Bridging/routing

The network and interface structure allows multiple interfaces to share a common
network. By default, packets are not forwarded between interfaces.

A network can be configured for "forwarding" mode. In this mode, packets may be
forwarded if their destination EID is non-local, and matches a route for another
interface on the same network.

As per DSP0236, packet reassembly does not occur during the forwarding process.
If the packet is larger than the MTU for the destination interface/route, then
the packet is dropped.
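
The forwarding conditions above can be summarised as a per-packet decision. The following sketch uses hypothetical names throughout (`enum fwd_action`, `struct fwd_route`, `forward_decision()`) and models only the rules stated in this section; it is not the kernel's data path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fwd_action { FWD_DELIVER_LOCAL, FWD_FORWARD, FWD_DROP };

/* Hypothetical result of a route lookup for the destination EID */
struct fwd_route {
        int    net;     /* network the route belongs to */
        int    ifindex; /* outgoing interface */
        size_t mtu;     /* MTU of that interface/route */
};

static enum fwd_action forward_decision(bool dest_is_local, bool forwarding,
                                        int in_net, int in_ifindex,
                                        const struct fwd_route *rt,
                                        size_t pkt_len)
{
        assert(pkt_len > 0);

        if (dest_is_local)
                return FWD_DELIVER_LOCAL;
        /* forwarding is off by default, and needs a matching route */
        if (!forwarding || rt == NULL)
                return FWD_DROP;
        /* only to another interface on the same network */
        if (rt->net != in_net || rt->ifindex == in_ifindex)
                return FWD_DROP;
        /* packets are never reassembled or re-fragmented when forwarding */
        if (pkt_len > rt->mtu)
                return FWD_DROP;
        return FWD_FORWARD;
}
```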

#### Tag behaviour for transmitted messages

On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
must allocate a tag that will uniquely identify responses over a (destination
EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.

To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
field will cause the kernel to allocate a unique tag for subsequent replies from
that specific remote EID.

This allocation will expire when any of the following occur:

- the socket is closed
- a new message is sent to a new destination EID
- an implementation-defined timeout expires

Because the "tag space" is limited, it may not be possible for the kernel to
allocate a unique tag for the outgoing message. In this case, the `sendto()`
call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
a local port cannot be allocated for an outgoing message.

The implementation-defined timeout value shall be chosen to reasonably cover
standard reply timeouts. If necessary, this timeout may be modified through the
`MCTP_TAG_CONTROL` socket option.

For applications that expect to perform an ongoing message exchange with a
particular destination address, they may use the `connect()` call to set a
persistent remote address. In this case, the tag will be allocated during
`connect()`, and remain reserved for this socket until any of the following
occur:

- the socket is closed
- the remote address is changed through another call to `connect()`.

In particular, calling `sendto()` with a different address does not release the
tag reservation.

Broadcast messages are particularly onerous for tag reservations. When a message
is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
reserve the tag across the entire range of possible EIDs. Therefore, a
particular tag value must be currently-unused across all EIDs to allow a
`sendto()` to a broadcast address. Additionally, this reservation is not cleared
when a reply is received, as there may be multiple replies to a broadcast.

For this reason, applications wanting to send to the broadcast address should
use the `connect()` system call to reserve a tag, and guarantee its availability
for future message transmission. Note that this will remove the tag value for
use with _any other EID_. Sending to the broadcast address should be avoided; we
expect few applications will need this functionality.

#### MCTP Control Protocol implementation

Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
implementation would exist as a userspace process, `mctpd`. This process is
responsible for responding to incoming control protocol messages, any dynamic
EID allocations (for bus owner devices) and maintaining the MCTP route table
(for bridging devices).

This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
data on incoming control protocol requests. It would interact with the kernel's
route table via a netlink interface - the same as that implemented for the
[Utility and configuration interfaces](#utility-and-configuration-interfaces).
755
### Neighbour and routing implementation

The packet-transmission behaviour of the MCTP infrastructure relies on a single
routing table to look up both route and neighbour information. Entries in this
table are of the format:

| EID range | interface | physical address | metric | MTU | flags | expiry |
| --------- | --------- | ---------------- | ------ | --- | ----- | ------ |

This table can be updated from two sources:

- From userspace, via a netlink interface (see the
  [Utility and configuration interfaces](#utility-and-configuration-interfaces)
  section).

- Directly within the kernel, when basic neighbour information is discovered.
  Kernel-originated routes are marked as such in the flags field, and have a
  maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

- physical-to-EID mappings discovered through received packets

- explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the
routing table, the kernel may attempt to resolve the physical address of that
endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
(section 12.9 of DSP0236). The response message will be used to add a
kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.
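
As an illustration of the table format and lookup described above, here is a
minimal userspace sketch. The structure layout, the `ROUTE_F_KERNEL` flag, the
`HWADDR_MAX` limit and the function names are all hypothetical, and the rule
that the lowest metric wins is an assumption borrowed from IP routing rather
than something this design specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROUTE_F_KERNEL 0x1 /* hypothetical flag: kernel-originated entry */
#define HWADDR_MAX 8       /* hypothetical maximum physical address size */

/* One row of the routing table; field names are illustrative only. */
struct route_entry {
    uint8_t eid_min, eid_max;   /* EID range, inclusive */
    int ifindex;                /* interface */
    uint8_t hwaddr[HWADDR_MAX]; /* physical address */
    uint8_t hwaddr_len;
    unsigned int metric;
    unsigned int mtu;
    unsigned int flags;
    uint64_t expiry;            /* only checked for kernel-originated entries */
};

/* An entry is usable if it covers the EID and has not expired. */
bool route_usable(const struct route_entry *r, uint8_t eid, uint64_t now)
{
    if (eid < r->eid_min || eid > r->eid_max)
        return false;
    if ((r->flags & ROUTE_F_KERNEL) && now >= r->expiry)
        return false;
    return true;
}

/* Return the usable entry with the lowest metric, or NULL if none match. */
const struct route_entry *route_lookup(const struct route_entry *tbl,
                                       size_t n, uint8_t eid, uint64_t now)
{
    const struct route_entry *best = NULL;

    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct route_entry *r = &tbl[i];

        if (!route_usable(r, eid, now))
            continue;
        if (!best || r->metric < best->metric)
            best = r;
    }
    return best;
}
```

A `NULL` result corresponds to the case where the kernel may fall back to a
Resolve Endpoint ID request for the destination EID.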

## Utility and configuration interfaces

A small utility will be developed to control the state of the kernel MCTP stack.
This will be similar in design to the 'iproute2' tools, which perform the
equivalent function for the IPv4 and IPv6 protocols.

The utility will be invoked as `mctp`, and provide subcommands for managing
different aspects of the kernel stack.

### `mctp link`: manage interfaces

```sh
    mctp link set <link> <up|down>
    mctp link set <link> network <network-id>
    mctp link set <link> mtu <mtu>
    mctp link set <link> bus-owner <hwaddr>
```

### `mctp network`: manage networks

```sh
    mctp network create <network-id>
    mctp network set <network-id> forwarding <on|off>
    mctp network set <network-id> default [<true|false>]
    mctp network delete <network-id>
```

### `mctp address`: manage local EID assignments

```sh
    mctp address add <eid> dev <link>
    mctp address del <eid> dev <link>
```

### `mctp route`: manage routing tables

```sh
    mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route show [net <network-id>]
```

### `mctp stat`: query socket status

```sh
    mctp stat
```

A set of netlink message formats will be defined to support these control
functions.

# Design points & alternatives considered

## Including message-type byte in send/receive buffers

This design specifies that message buffers passed to the kernel in send syscalls
and from the kernel in receive syscalls will have the message type byte as the
first byte of the buffer. This corresponds to the definition of an MCTP message
payload in DSP0236.

This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's
superficially possible for the kernel to prepend this byte on send, and remove
it on receive.

However, the exact format of the MCTP message payload is not precisely defined
by the specification. Particularly, any message integrity check data (which
would also need to be appended / stripped in conjunction with the type byte) is
defined by the type specification, not DSP0236. The kernel would need knowledge
of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any
modification by the kernel.
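
To make the buffer layout concrete, the following sketch shows how an
application might assemble a send buffer and read the type back from a received
one. The helper names are hypothetical, and the sketch leaves the integrity
check (IC) bit clear; any integrity check data would be added by the
application according to the relevant type specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* In DSP0236, the first payload byte carries the IC bit in bit 7 and the
 * message type in bits 6:0. */
#define MSG_TYPE_MASK 0x7f

/* Hypothetical helper: write the type byte followed by the message body.
 * Returns the total buffer length to pass to sendto(), or 0 if buf is too
 * small. */
size_t mctp_msg_build(uint8_t *buf, size_t buf_len, uint8_t type,
                      const void *body, size_t body_len)
{
    assert(buf != NULL && body != NULL);
    if (body_len >= buf_len)
        return 0;

    buf[0] = type & MSG_TYPE_MASK; /* IC bit left clear in this sketch */
    memcpy(buf + 1, body, body_len);
    return body_len + 1;
}

/* Hypothetical helper: extract the message type from a received buffer. */
uint8_t mctp_msg_type(const uint8_t *buf)
{
    return buf[0] & MSG_TYPE_MASK;
}
```

The same buffer layout applies in both directions, so a receiver can inspect
`buf[0]` directly without any kernel-side translation.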

## MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol

This design specifies message-types to be passed in the `smctp_type` field of
`struct sockaddr_mctp`. An alternative would be to pass it in the `protocol`
argument of the `socket()` system call:

```c
    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
```

The `smctp_type` implementation was chosen as it better matches the "addressing"
model of the message type; sockets are bound to an incoming message type,
similar to the IP protocol's model of binding UDP sockets to a local port
number.

There is no kernel behaviour that depends on the specific type (particularly
given the design choice above), so the protocol argument is not a good fit
here.

Future additions that perform protocol-specific message handling, and so alter
the send/receive buffer format, may use a new protocol argument.

## Networks referenced by index rather than UUID

This design proposes referencing networks by an integer index. The MCTP standard
does optionally associate an RFC 4122 UUID with a network; it would be possible
to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel.
Given that the Get Network ID message in the MCTP Control Protocol is
implemented entirely in userspace, the kernel does not need to be aware of
network UUIDs, and requiring UUIDs for network references (for example, making
the `smctp_network` field of `struct sockaddr_mctp` a `uuid_t`) would
complicate assignment.

Instead, an integer index is used, in a similar fashion to the integer index
used to reference `struct net_device`s elsewhere in the network stack.

## Tag behaviour alternatives

We considered _several_ different designs for the tag handling behaviour. A
brief overview of the more feasible of those, and why they were rejected:

### Each socket is allocated a unique tag value on creation

We could allocate a tag for each socket on creation, and use that value when a
tag is required. This, however:

- needlessly consumes a tag on non-tag-owning sockets (ie, those which send with
  TO=0 - responders); and

- limits us to 8 sockets per network, as the tag field is only 3 bits wide.

### Tags only used for message packetisation / reassembly

An alternative would be to completely dissociate tag allocation from sockets,
and only allocate a tag for the (short-lived) task of packetising a message, and
sending those packets. Tags would be released when the last packet has been
sent.

However, this removes any facility to correlate responses with the correct
socket, which is the purpose of the TO bit in DSP0236. In order for the sending
application to receive the response, we would either need to:

- limit the system to one socket of each message type (which, for example,
  precludes running a requester and a responder of the same type); or

- forward all incoming messages of a specific message-type to all sockets
  listening on that type, making it trivial to eavesdrop on MCTP data of other
  applications.

### Allocate a tag for one request/response pair

Another alternative would be to allocate a tag on each outgoing TO=1 message,
and then release that allocation after the incoming response to that tag (TO=0)
is observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to
requests - more than one response may be valid for a given request message. For
example, in response to a request, an NVMe-MI implementation may send an
in-progress reply before the final reply. In this case, we would release the tag
after the first response is received, and then have no way to correlate the
second message with the socket.

Similarly, broadcast MCTP request messages may have multiple replies from
multiple endpoints, meaning we cannot release the tag allocation on the first
reply.
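
The failure mode of this alternative can be shown with a toy model of
tag-to-socket ownership. All names here are hypothetical and sockets are
represented by plain integers; the `release_on_response` flag selects between
the rejected behaviour (release after the first reply) and a reservation that
persists across replies.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_TAGS 8      /* the MCTP tag field is 3 bits wide */
#define NO_SOCKET (-1)

/* Hypothetical model: which socket currently owns each tag value. */
struct tag_owners {
    int owner[NUM_TAGS];
};

void tag_owners_init(struct tag_owners *t)
{
    for (int i = 0; i < NUM_TAGS; i++)
        t->owner[i] = NO_SOCKET;
}

/* Allocate a tag for an outgoing TO=1 request; returns the tag or -1. */
int request_sent(struct tag_owners *t, int sock)
{
    assert(sock != NO_SOCKET);
    for (int tag = 0; tag < NUM_TAGS; tag++) {
        if (t->owner[tag] == NO_SOCKET) {
            t->owner[tag] = sock;
            return tag;
        }
    }
    return -1;
}

/* Find the socket for an incoming TO=0 response, optionally releasing the
 * tag. Returns NO_SOCKET if nothing owns the tag any more. */
int response_received(struct tag_owners *t, int tag, bool release_on_response)
{
    assert(tag >= 0 && tag < NUM_TAGS);

    int sock = t->owner[tag];

    if (release_on_response)
        t->owner[tag] = NO_SOCKET;
    return sock;
}
```

With `release_on_response` set, a second reply on the same tag finds no owning
socket, which is the case described above for an in-progress reply followed by
a final reply.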