# OpenBMC in-kernel MCTP

Author: Jeremy Kerr <jk@codeconstruct.com.au>

Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
description, background and requirements.

This document describes a kernel-based implementation of MCTP infrastructure,
providing a sockets-based API for MCTP communication within an OpenBMC-based
platform.

## Requirements for a kernel implementation

- The MCTP messaging API should be an obvious application of the existing POSIX
  socket interface.

- Configuration should be simple for a straightforward MCTP endpoint: a single
  network with a single local endpoint ID (EID).

- Infrastructure should be flexible enough to allow for more complex MCTP
  networks, allowing:
  - each MCTP network (as defined by section 3.2.31 of DSP0236) to consist of
    multiple local physical interfaces, and/or multiple EIDs;

  - multiple distinct (i.e., non-bridged) networks, possibly containing
    duplicated EIDs between networks;

  - multiple local EIDs on a single interface; and

  - customisable routing/bridging configurations within a network.

## Proposed Design

The design contains several components:

- An interface for userspace applications to send and receive MCTP messages: a
  mapping of the sockets API to MCTP usage.

- Infrastructure for control and configuration of the MCTP network(s),
  consisting of a configuration utility, and a kernel messaging facility for
  this utility to use.

- Kernel drivers for physical interface bindings.

In general, the kernel components cover the transport functionality of MCTP,
such as message assembly/disassembly, packet forwarding, and physical interface
implementations.

Higher-level protocols (such as PLDM) are implemented in userspace, through the
introduced socket API. This also includes the majority of the MCTP Control
Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
have a specific process to request and respond to control protocol messages.
However, the kernel will include a small subset of control protocol code to
allow very simple endpoints, with static EID allocations, to run without this
process. MCTP endpoints that require more than just single-endpoint
functionality (bus owners, bridges, etc.), and/or dynamic EID allocation, would
include the control message protocol process.

A new driver is introduced to handle each physical interface binding. These
drivers expose the appropriate `struct net_device` to handle transmission and
reception of MCTP packets on their associated hardware channels. Under Linux,
the namespace for these interfaces is separate from other network interfaces -
such as those for Ethernet.

### Structure: interfaces & networks

The kernel models the local MCTP topology through two items: interfaces and
networks.

An interface (or "link") is an instance of an MCTP physical transport binding
(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
device. This is represented as a `struct net_device`, and has a user-visible
name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
allow local loopback and/or virtual interfaces.

A network defines a unique address space for MCTP endpoints by endpoint-ID
(described by DSP0236, section 3.2.31). A network has a user-visible identifier
to allow references from userspace. Route definitions are specific to one
network.

Each interface is associated with exactly one network. A network may be
associated with one or more interfaces.

If multiple networks are present, each may contain EIDs that are also present on
other networks.

### Sockets API

#### Protocol definitions

We define a new address family (and corresponding protocol family) for MCTP:

```c
    #define AF_MCTP /* TBD */
    #define PF_MCTP AF_MCTP
```

MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
the domain. Currently, only a `SOCK_DGRAM` socket type is defined.

```c
    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
```

The only value currently defined for the `protocol` argument is 0. Further
protocol values may be added in future.

MCTP sockets opened with a protocol value of 0 will communicate directly at the
transport layer; message buffers received by the application will consist of
message data from reassembled MCTP packets, and will include the full message
including message type byte and optional message integrity check (IC).
Individual packet headers are not included; they may be accessible through a
future `SOCK_RAW` socket type.

As with all socket address families, source and destination addresses are
specified with a new `sockaddr` type:

```c
    struct mctp_addr {
            uint8_t             s_addr;
    };

    struct sockaddr_mctp {
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
    };

    /* MCTP network values */
    #define MCTP_NET_ANY        0

    /* MCTP EID values */
    #define MCTP_ADDR_ANY       0xff
    #define MCTP_ADDR_BCAST     0xff

    /* MCTP type values. Only the least-significant 7 bits of
     * smctp_type are used for type matches; the specification defines
     * the type to be 7 bits.
     */
    #define MCTP_TYPE_MASK      0x7f

    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
    /* MCTP-spec-defined fields */
    #define MCTP_TAG_MASK    0x07
    #define MCTP_TAG_OWNER   0x08
    /* Others: reserved */

    /* Helpers */
    /* response to a request: clear TO, keep value */
    #define MCTP_TAG_RSP(x) ((x) & MCTP_TAG_MASK)
```

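As a quick illustration of the tag definitions above, the following sketch shows how the `smctp_tag` bits split into a tag-owner (TO) flag and a 3-bit tag value. The macros are copied from the definitions above; the helper names `tag_is_owner()` and `tag_value()` are illustrative only and are not part of the proposed API. For a request tag of `MCTP_TAG_OWNER | 5`, `MCTP_TAG_RSP()` yields 5 with the TO bit cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Copied from the definitions in this document */
#define MCTP_TAG_MASK    0x07
#define MCTP_TAG_OWNER   0x08
#define MCTP_TAG_RSP(x)  ((x) & MCTP_TAG_MASK)

/* The TO bit must not overlap the 3-bit tag value */
static_assert((MCTP_TAG_OWNER & MCTP_TAG_MASK) == 0,
              "tag-owner bit overlaps tag value");

/* Hypothetical helper: is the tag-owner (TO) bit set? */
static bool tag_is_owner(uint8_t tag)
{
        return (tag & MCTP_TAG_OWNER) != 0;
}

/* Hypothetical helper: extract the 3-bit tag value */
static uint8_t tag_value(uint8_t tag)
{
        return tag & MCTP_TAG_MASK;
}
```
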
#### Syscall behaviour

The following sections describe the MCTP-specific behaviours of the standard
socket system calls. These behaviours have been chosen to map closely to the
existing sockets APIs.

##### `bind()`: set local socket address

Sockets that receive incoming request packets will bind to a local address,
using the `bind()` syscall.

```c
    struct sockaddr_mctp addr;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
    addr.smctp_type = MCTP_TYPE_PLDM;
    addr.smctp_tag = MCTP_TAG_OWNER;

    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the local address of the socket. Incoming MCTP messages that
match the network, address, and message type will be received by this socket.
The reference to 'incoming' is important here; a bound socket will only receive
messages with the TO bit set, to indicate an incoming request message, rather
than a response.

The `smctp_tag` value will configure the tags accepted from the remote side of
this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
will result in remotely "owned" tags being routed to this socket. Since
`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are not
used; callers must set them to zero. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `bind()` will
fail with an errno of `EINVAL`.

A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
incoming packets from any locally-connected network. A specific network value
will cause the socket to only receive incoming messages from that network.

The `smctp_addr` field specifies a local address to bind to. A value of
`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any local
destination EID.

The `smctp_type` field specifies which message types to receive. Only the lower
7 bits of the type are matched on incoming messages (i.e., the most-significant
IC bit is not part of the match). This results in the socket receiving messages
with and without a message integrity check footer.

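The matching rules above can be summarised in a short sketch. This is an illustration of the described behaviour, not kernel code: `struct bound_addr`, `struct incoming_msg` and `bind_matches()` are hypothetical names, and the constants are copied from the definitions earlier in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the definitions in this document */
#define MCTP_NET_ANY    0
#define MCTP_ADDR_ANY   0xff
#define MCTP_TYPE_MASK  0x7f

/* Hypothetical: the address a socket was bound to */
struct bound_addr {
        int     network;
        uint8_t eid;
        uint8_t type;
};

/* Hypothetical: the relevant fields of a reassembled incoming message */
struct incoming_msg {
        int     network;        /* network the message arrived on */
        uint8_t dest_eid;       /* destination (local) EID */
        uint8_t type_byte;      /* first byte of the message: IC bit + type */
        bool    tag_owner;      /* TO bit from the packet header */
};

/* Would message 'm' be delivered to a socket bound to 'b'? */
static bool bind_matches(const struct bound_addr *b,
                         const struct incoming_msg *m)
{
        assert(b != NULL && m != NULL);

        /* bound sockets only receive requests (TO=1), never responses */
        if (!m->tag_owner)
                return false;
        if (b->network != MCTP_NET_ANY && b->network != m->network)
                return false;
        if (b->eid != MCTP_ADDR_ANY && b->eid != m->dest_eid)
                return false;
        /* the IC bit (bit 7) is excluded from the type comparison */
        return ((b->type ^ m->type_byte) & MCTP_TYPE_MASK) == 0;
}
```
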
##### `connect()`: set remote socket address

Applications may specify a socket's remote address with the `connect()` syscall:

```c
    struct sockaddr_mctp addr;
    int rc;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_PLDM;

    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the remote address of a socket, used for future message
transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
traffic directly, but just sets the default destination for messages sent from
this socket.

The `smctp_network` field may specify a locally-attached network, or the value
`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
This is guaranteed to work for single-network configurations, but may require
additional routing definitions for endpoints attached to multiple distinct
networks. See the [Addressing](#addressing) section for details.

The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
the MCTP broadcast EID (0xff).

The `smctp_type` field specifies the type field of messages transferred over
this socket.

The `smctp_tag` value will configure the tag used for the local side of this
socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
"owned" tag being allocated for this socket. The tag will remain allocated for
all future outgoing messages, until either the socket is closed, or `connect()`
is called again. If a tag cannot be allocated, `connect()` will report an error,
with an errno value of `EAGAIN`. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `connect()`
will fail with an errno of `EINVAL`.

Requesters which connect to a single responder will typically use `connect()` to
specify the peer address and tag for future outgoing messages.

##### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message

An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`,
`send()` or `write()` syscalls. Using `sendto()` as the primary example:

```c
    struct sockaddr_mctp addr;
    char buf[14];
    ssize_t len;

    /* set message destination */
    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_ECHO;

    /* arbitrary message to send, with message-type header */
    buf[0] = MCTP_TYPE_ECHO;
    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);

    len = sendto(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, sizeof(addr));
```

The address argument is treated the same way as for `connect()`: the network and
address fields define the remote address to send to. If `smctp_tag` has the
`MCTP_TAG_OWNER` bit set, the kernel will ignore any bits set in the
`MCTP_TAG_MASK` portion of the tag, and generate a tag value suitable for the
destination EID. If `MCTP_TAG_OWNER` is not set, the message will be sent with
the tag value as specified. If a tag value cannot be allocated, the system call
will report an errno of `EAGAIN`.

The application must provide the message type byte as the first byte of the
message buffer passed to `sendto()`. If a message integrity check is to be
included in the transmitted message, it must also be provided in the message
buffer, and the most-significant bit of the message type byte must be 1.

If the first byte of the message does not match the message type value, then the
system call will return an error of `EPROTO`.

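To make the buffer layout concrete, here is a small sketch of assembling a message buffer and of the type-byte consistency check described above. The helper names `mctp_build_msg()` and `mctp_check_type()` and the `MCTP_TYPE_IC` macro are hypothetical. Comparing only the lower 7 bits is an assumption borrowed from the type-matching rule for `bind()`, since the text does not say whether the IC bit takes part in the check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MCTP_TYPE_MASK  0x7f    /* from the definitions in this document */
#define MCTP_TYPE_IC    0x80    /* assumed name for the IC bit (bit 7) */

/*
 * Hypothetical helper: write the type byte followed by the payload into
 * 'buf'. Returns the total message length, or 0 if 'buf' is too small.
 * If 'has_ic' is set, the payload must already end with the integrity
 * check value; this helper only sets the IC bit in the type byte.
 */
static size_t mctp_build_msg(uint8_t *buf, size_t buflen, uint8_t type,
                             bool has_ic, const void *payload, size_t len)
{
        assert(buf != NULL && payload != NULL);

        if (len >= buflen)
                return 0;
        buf[0] = (uint8_t)((type & MCTP_TYPE_MASK) |
                           (has_ic ? MCTP_TYPE_IC : 0));
        memcpy(buf + 1, payload, len);
        return len + 1;
}

/*
 * Hypothetical model of the send-side check: returns 0 if the first
 * byte of the message agrees with the sockaddr type, -EPROTO otherwise.
 */
static int mctp_check_type(const uint8_t *msg, size_t len, uint8_t smctp_type)
{
        if (len < 1)
                return -EPROTO;
        if ((msg[0] ^ smctp_type) & MCTP_TYPE_MASK)
                return -EPROTO;
        return 0;
}
```
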
The `send()` and `write()` system calls behave in a similar way, but do not
specify a remote address. Therefore, `connect()` must be called beforehand; if
not, these calls will return an error of `EDESTADDRREQ` (Destination address
required).

Using `sendto()` or `sendmsg()` on a connected socket may override the remote
socket address specified in `connect()`. The `connect()` address and tag will
remain associated with the socket, for future unaddressed sends. The tag
allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
subject to the same invalidation logic as on an unconnected socket: it is
expired either by timeout or by a subsequent `sendto()`.

The `sendmsg()` system call allows a more compact argument interface, and the
message buffer to be specified as a scatter-gather list. At present no ancillary
message types (used for the `msg_control` data passed to `sendmsg()`) are
defined.

Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
will cause an allocation of a tag, if no valid tag is already allocated for that
destination. The (destination EID, tag) tuple acts as an implicit local socket
address, to allow the socket to receive responses to this outgoing message. If
any previous allocation has been performed (for a different remote EID), that
allocation is lost. This tag behaviour can be controlled through the
`MCTP_TAG_CONTROL` socket option.

Sockets will only receive responses to requests they have sent (with TO=1) and
may only respond (with TO=0) to requests they have received.

##### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message

An MCTP message can be received by an application using one of the `recvfrom()`,
`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
primary example:

```c
    struct sockaddr_mctp addr;
    socklen_t addrlen;
    char buf[14];
    ssize_t len;

    addrlen = sizeof(addr);

    len = recvfrom(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, &addrlen);

    /* We can expect addr to describe an MCTP address */
    assert(addrlen >= sizeof(addr));
    assert(addr.smctp_family == AF_MCTP);

    printf("received %zd bytes from remote EID %d\n", len,
           addr.smctp_addr.s_addr);
```

The address argument to `recvfrom()` and `recvmsg()` is populated with the
remote address of the incoming message, including tag value (this will be needed
in order to reply to the message).

The first byte of the message buffer will contain the message type byte. If an
integrity check follows the message, it will be included in the received buffer.

The `recv()` and `read()` system calls behave in a similar way, but do not
provide a remote address to the application. Therefore, these are only useful if
the remote address is already known, or the message does not require a reply.

Like the send calls, sockets will only receive responses to requests they have
sent (TO=1) and may only respond (TO=0) to requests they have received.

##### `getsockname()` & `getpeername()`: query local/remote socket address

The `getsockname()` system call returns the `struct sockaddr_mctp` value for the
local side of this socket, `getpeername()` for the remote (i.e., that used in a
`connect()` call). Since the tag value is a property of the remote address,
`getpeername()` may be used to retrieve a kernel-allocated tag value.

Calling `getpeername()` on an unconnected socket will result in an error of
`ENOTCONN`.

##### Socket options

The following socket options are defined for MCTP sockets:

###### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg

Enabling this socket option allows an application to specify extended addressing
information on transmitted packets, and access the same on received packets.

When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify
an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls.
This is defined as:

```c
    struct sockaddr_mctp_ext {
            /* fields exactly match struct sockaddr_mctp */
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
            /* extended addressing */
            int                 smctp_ifindex;
            uint8_t             smctp_halen;
            unsigned char       smctp_haddr[/* TBD */];
    };
```

If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to
contain this larger structure, then the extended addressing fields are consumed
/ populated respectively.

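The `addrlen` rule can be sketched as a simple size comparison. The structures below are local copies of the definitions in this document. Because the hardware address size is still TBD, `EXAMPLE_HADDR_MAX` is a placeholder chosen only so that the example compiles, and `mctp_use_ext_addr()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/* Placeholder only: the hardware address size is TBD in this design */
#define EXAMPLE_HADDR_MAX 32

struct mctp_addr {
        uint8_t s_addr;
};

struct sockaddr_mctp {
        sa_family_t      smctp_family;
        int              smctp_network;
        struct mctp_addr smctp_addr;
        uint8_t          smctp_type;
        uint8_t          smctp_tag;
};

struct sockaddr_mctp_ext {
        /* fields exactly match struct sockaddr_mctp */
        sa_family_t      smctp_family;
        int              smctp_network;
        struct mctp_addr smctp_addr;
        uint8_t          smctp_type;
        uint8_t          smctp_tag;
        /* extended addressing */
        int              smctp_ifindex;
        uint8_t          smctp_halen;
        unsigned char    smctp_haddr[EXAMPLE_HADDR_MAX];
};

static_assert(sizeof(struct sockaddr_mctp_ext) > sizeof(struct sockaddr_mctp),
              "extended address must be larger than the base address");

/* Hypothetical helper: should the extended fields be consumed/populated? */
static bool mctp_use_ext_addr(bool addr_ext_enabled, socklen_t addrlen)
{
        return addr_ext_enabled &&
               addrlen >= (socklen_t)sizeof(struct sockaddr_mctp_ext);
}
```
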
###### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour

The set/getsockopt argument is a `mctp_tagctl` structure:

```c
    struct mctp_tagctl {
        bool            retain;
        struct timespec timeout;
    };
```

This allows an application to control the behaviour of allocated tags for
non-connected sockets when transferring messages to multiple different
destinations (i.e., where a `struct sockaddr_mctp` is provided for individual
messages, and the `smctp_addr` destination for those messages may vary across
calls).

The `retain` flag indicates to the kernel that the socket should not release tag
allocations when a message is sent to a new destination EID. This causes the
socket to continue to receive incoming messages to the old (dest,tag) tuple, in
addition to the new tuple.

The `timeout` value specifies a maximum amount of time to retain tag values.
This should be based on the reply timeout for any upper-level protocol.

The kernel may reject a request to set values that would cause excessive tag
allocation by this socket. The kernel may also reject subsequent tag-allocation
requests (through send or connect syscalls) which would cause excessive tags to
be consumed by the socket, even though the tag control settings were accepted in
the setsockopt operation.

Changing the default tag control behaviour should only be required when:

- the socket is sending messages with TO=1 (i.e., is a requester); and
- messages are sent to multiple different destination EIDs from the one socket.

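The effect of the `retain` flag can be modelled with a small per-socket list of held (dest, tag) tuples. This is a sketch of the semantics described above, not of the kernel implementation: all names are hypothetical, `EXAMPLE_MAX_HELD` is an arbitrary stand-in for whatever limit the kernel would apply, and timeouts are ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_HELD 4      /* arbitrary limit, for illustration only */

struct example_held_tag {
        uint8_t dest_eid;
        uint8_t tag;
};

/* Hypothetical per-socket state for an unconnected requester socket */
struct example_sock {
        bool                    retain; /* mirrors mctp_tagctl.retain */
        size_t                  n_held;
        struct example_held_tag held[EXAMPLE_MAX_HELD];
};

/*
 * Record that 'tag' was newly allocated for a send to 'dest'. Returns
 * false if holding another tag would exceed the illustrative limit.
 */
static bool example_sock_hold(struct example_sock *sk, uint8_t dest,
                              uint8_t tag)
{
        assert(sk != NULL);

        if (!sk->retain)
                sk->n_held = 0; /* default: earlier allocations are lost */
        if (sk->n_held == EXAMPLE_MAX_HELD)
                return false;
        sk->held[sk->n_held].dest_eid = dest;
        sk->held[sk->n_held].tag = tag;
        sk->n_held++;
        return true;
}

/* Would a response (TO=0) from 'src' carrying 'tag' reach this socket? */
static bool example_sock_accepts(const struct example_sock *sk, uint8_t src,
                                 uint8_t tag)
{
        size_t i;

        for (i = 0; i < sk->n_held; i++)
                if (sk->held[i].dest_eid == src && sk->held[i].tag == tag)
                        return true;
        return false;
}
```
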
##### Syscalls not implemented

The following system calls are not implemented for MCTP, primarily as they are
not used in `SOCK_DGRAM`-type sockets:

- `listen()`
- `accept()`
- `ioctl()`
- `shutdown()`
- `mmap()`

#### Userspace examples

These examples cover three general use-cases:

- **requester**: sends requests to a particular (EID, type) target, and receives
  responses to those packets.

  This is similar to a typical UDP client.

- **responder**: receives all locally-addressed messages of a specific
  message-type, and responds to the requester immediately.

  This is similar to a typical UDP server.

- **controller**: a specific service for a bus owner; may send broadcast
  messages, manage EID allocations, update local MCTP stack state. Will need
  low-level packet data.

  This is similar to a DHCP server.

##### Requester

"Client"-side implementation to send requests to a responder, and receive a
response. This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.

```c
    #include <assert.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>

    /* AF_MCTP, struct sockaddr_mctp and the MCTP_* constants are as
     * defined earlier in this document */

    int main(void) {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            struct {
                uint8_t type;
                uint8_t data[14];
            } msg;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
            if (sd < 0)
                    err(EXIT_FAILURE, "socket");

            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
            addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
            addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
            addrlen = sizeof(addr);

            /* set message type and payload */
            msg.type = MCTP_TYPE_ECHO;
            strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));

            /* send message */
            rc = sendto(sd, &msg, sizeof(msg), 0,
                            (struct sockaddr *)&addr, addrlen);

            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Receive reply. This will block until a reply arrives,
             * which may never happen. Actual code would need a timeout
             * here. */
            rc = recvfrom(sd, &msg, sizeof(msg), 0,
                        (struct sockaddr *)&addr, &addrlen);
            if (rc < 0)
                    err(EXIT_FAILURE, "recvfrom");

            assert(msg.type == MCTP_TYPE_ECHO);
            /* ensure we're nul-terminated */
            msg.data[sizeof(msg.data)-1] = '\0';

            printf("reply: %s\n", (char *)msg.data);

            return EXIT_SUCCESS;
    }
```

##### Responder

"Server"-side implementation to receive requests and respond. Like the client,
this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the
`struct sockaddr_mctp`; only messages matching this type will be received.

```c
    #include <assert.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>

    /* AF_MCTP, struct sockaddr_mctp and the MCTP_* constants are as
     * defined earlier in this document */

    int main(void) {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            ssize_t len;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
            if (sd < 0)
                    err(EXIT_FAILURE, "socket");

            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
            addr.smctp_type = MCTP_TYPE_ECHO;
            addr.smctp_tag = MCTP_TAG_OWNER;

            rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
            if (rc)
                    err(EXIT_FAILURE, "bind");

            for (;;) {
                    struct {
                        uint8_t type;
                        uint8_t data[14];
                    } msg;

                    /* addrlen is a value-result argument; reset it for
                     * each receive */
                    addrlen = sizeof(addr);

                    len = recvfrom(sd, &msg, sizeof(msg), 0,
                                    (struct sockaddr *)&addr, &addrlen);
                    if (len < 0)
                            err(EXIT_FAILURE, "recvfrom");
                    if (len < 1) {
                            warnx("not enough data for a message type");
                            continue;
                    }

                    assert(addrlen == sizeof(addr));
                    assert(msg.type == MCTP_TYPE_ECHO);

                    printf("%zd bytes from EID %d\n", len,
                           addr.smctp_addr.s_addr);

                    /* Reply to requester; this macro just clears the TO-bit.
                     * Other addr fields will describe the remote endpoint,
                     * so use those as-is.
                     */
                    addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);

                    len = sendto(sd, &msg, len, 0,
                                (struct sockaddr *)&addr, addrlen);
                    if (len < 0)
                            err(EXIT_FAILURE, "sendto");
            }

            return EXIT_SUCCESS;
    }
```

##### Broadcast request

Sends a request to a broadcast EID, and receives (unicast) replies. Typical
control protocol pattern.

```c
    #include <assert.h>
    #include <err.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <time.h>

    /* AF_MCTP, struct sockaddr_mctp and the MCTP_* constants are as
     * defined earlier in this document */

    int main(void) {
            struct sockaddr_mctp txaddr, rxaddr;
            struct timespec start, cur;
            struct pollfd pollfds[1];
            socklen_t addrlen;
            uint8_t buf[2];
            int sd, rc, timeout;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
            if (sd < 0)
                    err(EXIT_FAILURE, "socket");

            /* destination address setup */
            txaddr.smctp_family = AF_MCTP;
            txaddr.smctp_network = 1; /* specific network required for broadcast */
            txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
            txaddr.smctp_type = MCTP_TYPE_CONTROL;
            txaddr.smctp_tag = MCTP_TAG_OWNER;

            buf[0] = MCTP_TYPE_CONTROL;
            buf[1] = 'a';

            /* We're doing a sendto() to a broadcast address here. If we were
             * sending more than one broadcast message, we'd be better off
             * doing connect(); sendto();, in order to retain the tag
             * reservation across all transmitted messages. However, since this
             * is a single transmit, that makes no difference in this
             * particular case.
             */
            rc = sendto(sd, buf, sizeof(buf), 0, (struct sockaddr *)&txaddr,
                            sizeof(txaddr));
            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Set up poll behaviour, and record our starting time for
             * reply timeouts */
            pollfds[0].fd = sd;
            pollfds[0].events = POLLIN;
            clock_gettime(CLOCK_MONOTONIC, &start);

            for (;;) {
                    /* Calculate the amount of time left for replies.
                     * calculate_timeout() is an application-provided helper
                     * returning the milliseconds remaining of a 1000ms
                     * window. */
                    clock_gettime(CLOCK_MONOTONIC, &cur);
                    timeout = calculate_timeout(&start, &cur, 1000);

                    rc = poll(pollfds, 1, timeout);
                    if (rc < 0)
                        err(EXIT_FAILURE, "poll");

                    /* timeout receiving a reply? */
                    if (rc == 0)
                        break;

                    /* sanity check that we have a message to receive */
                    if (!(pollfds[0].revents & POLLIN))
                        break;

                    addrlen = sizeof(rxaddr);

                    rc = recvfrom(sd, buf, sizeof(buf), 0,
                            (struct sockaddr *)&rxaddr, &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");

                    assert(addrlen >= sizeof(rxaddr));
                    assert(rxaddr.smctp_family == AF_MCTP);

                    printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
            }

            return EXIT_SUCCESS;
    }
```

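The broadcast example calls a `calculate_timeout()` helper that this document does not define. One possible implementation is sketched below, assuming the third argument is the total reply window in milliseconds and the return value is the time remaining, suitable for passing to `poll()`. The result is clamped at zero so that `poll()` is never handed a negative (infinite) timeout.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Assumed semantics: return the number of milliseconds left of a
 * 'limit_ms' window that began at 'start', as observed at 'cur'.
 */
static int calculate_timeout(const struct timespec *start,
                             const struct timespec *cur, int limit_ms)
{
        long long elapsed_ns, elapsed_ms;

        assert(start != NULL && cur != NULL);

        elapsed_ns = (long long)(cur->tv_sec - start->tv_sec) * 1000000000LL +
                     (cur->tv_nsec - start->tv_nsec);
        elapsed_ms = elapsed_ns / 1000000LL;

        if (elapsed_ms >= limit_ms)
                return 0;               /* window has expired */
        if (elapsed_ms < 0)
                return limit_ms;        /* defensive: cur precedes start */
        return limit_ms - (int)elapsed_ms;
}
```
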
#### Implementation notes

##### Addressing

Transmitted messages (through `sendto()` and related system calls) specify their
destination via the `smctp_network` and `smctp_addr` fields of a
`struct sockaddr_mctp`.

The `smctp_addr` field maps directly to the destination endpoint's EID.

The `smctp_network` field specifies a locally defined network identifier. To
simplify situations where there is only one network defined, the special value
`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
network for transmission.

This selection is entirely user-configured; one specific network may be defined
as the system default, in which case it will be used for all message
transmission where `MCTP_NET_ANY` is used as the destination network.

In particular, the destination EID is never used to select a destination
network.

MCTP responders should use the EID and network values of an incoming request to
specify the destination for any responses.

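The network selection rule is small enough to sketch directly. Note that the destination EID is deliberately not an input. `struct example_net_config` and `select_network()` are hypothetical names, and returning `-ENETUNREACH` when no default network is configured is an assumption; this document does not specify the error for that case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MCTP_NET_ANY 0  /* from the definitions in this document */

/* Hypothetical: user-configured default, or MCTP_NET_ANY if none is set */
struct example_net_config {
        int default_net;
};

/* Resolve the network to transmit on for a given smctp_network value */
static int select_network(const struct example_net_config *cfg,
                          int smctp_network)
{
        assert(cfg != NULL);

        /* an explicit network is always used as-is */
        if (smctp_network != MCTP_NET_ANY)
                return smctp_network;
        /* assumed error code: no default network has been configured */
        if (cfg->default_net == MCTP_NET_ANY)
                return -ENETUNREACH;
        return cfg->default_net;
}
```
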
##### Bridging/routing

The network and interface structure allows multiple interfaces to share a common
network. By default, packets are not forwarded between interfaces.

A network can be configured for "forwarding" mode. In this mode, packets may be
forwarded if their destination EID is non-local, and matches a route for another
interface on the same network.

As per DSP0236, packet reassembly does not occur during the forwarding process.
If the packet is larger than the MTU for the destination interface/route, then
the packet is dropped.

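The forwarding rules can be read as a short decision function. The sketch below uses hypothetical names throughout, and it assumes that a non-local packet which cannot be forwarded is simply dropped, which the text above implies but does not state outright.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum fwd_action {
        FWD_DELIVER_LOCAL,
        FWD_FORWARD,
        FWD_DROP,
};

/* Hypothetical inputs to the forwarding decision for one packet */
struct fwd_input {
        bool   dest_is_local;   /* destination EID is one of our own */
        bool   net_forwarding;  /* network is in "forwarding" mode */
        bool   have_route;      /* route via another interface, same network */
        size_t pkt_len;         /* length of this packet */
        size_t route_mtu;       /* MTU of the destination interface/route */
};

static enum fwd_action forward_decision(const struct fwd_input *in)
{
        assert(in != NULL);

        if (in->dest_is_local)
                return FWD_DELIVER_LOCAL;
        /* assumed: non-local packets are dropped unless forwardable */
        if (!in->net_forwarding || !in->have_route)
                return FWD_DROP;
        /* no reassembly or re-fragmentation while forwarding */
        if (in->pkt_len > in->route_mtu)
                return FWD_DROP;
        return FWD_FORWARD;
}
```
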
##### Tag behaviour for transmitted messages

On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
must allocate a tag that will uniquely identify responses over a (destination
EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.

To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
field will cause the kernel to allocate a unique tag for subsequent replies from
that specific remote EID.

This allocation will expire when any of the following occur:

- the socket is closed
- a new message is sent to a new destination EID
- an implementation-defined timeout expires

Because the "tag space" is limited, it may not be possible for the kernel to
allocate a unique tag for the outgoing message. In this case, the `sendto()`
call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
a local port cannot be allocated for an outgoing message.

The implementation-defined timeout value shall be chosen to reasonably cover
standard reply timeouts. If necessary, this timeout may be modified through the
`MCTP_TAG_CONTROL` socket option.

Applications that expect to perform an ongoing message exchange with a
particular destination address may use the `connect()` call to set a persistent
remote address. In this case, the tag will be allocated during `connect()`, and
remain reserved for this socket until either of the following occurs:

- the socket is closed
- the remote address is changed through another call to `connect()`.

In particular, calling `sendto()` with a different address does not release the
tag reservation.

Broadcast messages are particularly onerous for tag reservations. When a message
is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
reserve the tag across the entire range of possible EIDs. Therefore, a
particular tag value must be currently-unused across all EIDs to allow a
`sendto()` to a broadcast address. Additionally, this reservation is not cleared
when a reply is received, as there may be multiple replies to a broadcast.

For this reason, applications wanting to send to the broadcast address should
use the `connect()` system call to reserve a tag, and guarantee its availability
for future message transmission. Note that this makes the tag value unavailable
for use with _any other EID_. Sending to the broadcast address should be
avoided; we expect few applications will need this functionality.

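A toy allocator illustrates why the tag space is tight and why broadcast reservations are expensive. This is a simplified model rather than the kernel's data structure: it assumes a single local source EID (so the tuple reduces to destination EID and tag), ignores timeouts, and uses hypothetical names throughout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MCTP_ADDR_BCAST  0xff   /* from the definitions in this document */
#define EXAMPLE_NUM_TAGS 8      /* the tag value is 3 bits */
#define EXAMPLE_NUM_EIDS 256

/* Bit t of in_use[eid] is set while tag t is reserved for that EID */
struct example_tag_table {
        uint8_t in_use[EXAMPLE_NUM_EIDS];
};

static void example_tag_table_init(struct example_tag_table *t)
{
        assert(t != NULL);
        memset(t, 0, sizeof(*t));
}

/* Reserve a tag for 'dest'; returns the tag value (0-7) or -EAGAIN */
static int example_tag_alloc(struct example_tag_table *t, uint8_t dest)
{
        uint8_t busy = 0;
        int tag, eid;

        if (dest == MCTP_ADDR_BCAST) {
                /* a broadcast tag must be unused across every EID */
                for (eid = 0; eid < EXAMPLE_NUM_EIDS; eid++)
                        busy |= t->in_use[eid];
        } else {
                /* unicast: also avoid tags held by a broadcast reservation */
                busy = t->in_use[dest] | t->in_use[MCTP_ADDR_BCAST];
        }

        for (tag = 0; tag < EXAMPLE_NUM_TAGS; tag++) {
                if (!(busy & (1u << tag))) {
                        t->in_use[dest] |= (uint8_t)(1u << tag);
                        return tag;
                }
        }
        return -EAGAIN; /* tag space exhausted for this destination */
}

/* Release a reservation, e.g. on socket close or expiry */
static void example_tag_release(struct example_tag_table *t, uint8_t dest,
                                int tag)
{
        t->in_use[dest] &= (uint8_t)~(1u << tag);
}
```
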
##### MCTP Control Protocol implementation

Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
implementation would exist as a userspace process, `mctpd`. This process is
responsible for responding to incoming control protocol messages, any dynamic
EID allocations (for bus owner devices) and maintaining the MCTP route table
(for bridging devices).

This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
data on incoming control protocol requests. It would interact with the kernel's
route table via a netlink interface - the same as that implemented for the
[Utility and configuration interfaces](#utility-and-configuration-interfaces).

#### Neighbour and routing implementation

The packet-transmission behaviour of the MCTP infrastructure relies on a single
routing table to look up both route and neighbour information. Entries in this
table are of the format:

| EID range | interface | physical address | metric | MTU | flags | expiry |
| --------- | --------- | ---------------- | ------ | --- | ----- | ------ |

This table can be updated from two sources:

- From userspace, via a netlink interface (see the
  [Utility and configuration interfaces](#utility-and-configuration-interfaces)
  section).

- Directly within the kernel, when basic neighbour information is discovered.
  Kernel-originated routes are marked as such in the flags field, and have a
  maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

- physical-to-EID mappings discovered through received packets

- explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the
routing table, the kernel may attempt to resolve the physical address of that
endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
(section 12.9 of DSP0236). The response message will be used to add a
kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.

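A lookup over such a table might be sketched as follows. The structure layout,
the `ROUTE_F_KERNEL` flag name, the `MAX_HWADDR` bound, and the "lowest metric
wins" tie-break are all assumptions made for illustration; the design above
only fixes the set of columns.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define ROUTE_F_KERNEL 0x01 /* hypothetical flag: kernel-originated entry */
#define MAX_HWADDR     32   /* hypothetical bound on physical address size */

/* One row of the combined route/neighbour table. */
struct route_entry {
    uint8_t      eid_min, eid_max; /* inclusive EID range */
    int          ifindex;          /* outgoing interface */
    uint8_t      hwaddr[MAX_HWADDR];
    uint8_t      hwaddr_len;       /* 0 if no physical address is known */
    unsigned int metric;
    unsigned int mtu;
    unsigned int flags;
    time_t       expiry;           /* only checked for kernel-originated rows */
};

/*
 * Return the best entry for 'eid' at time 'now': the lowest-metric row
 * whose range contains the EID, skipping kernel-originated rows that have
 * expired. Returns NULL when there is no usable entry.
 */
const struct route_entry *route_lookup(const struct route_entry *tbl,
                                       size_t n, uint8_t eid, time_t now)
{
    const struct route_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct route_entry *r = &tbl[i];

        if (eid < r->eid_min || eid > r->eid_max)
            continue;
        if ((r->flags & ROUTE_F_KERNEL) && r->expiry <= now)
            continue;
        if (!best || r->metric < best->metric)
            best = r;
    }
    return best;
}
```

A `NULL` result corresponds to the case described above where no entry exists
for the destination, at which point the kernel may issue a Resolve Endpoint ID
request.
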
### Utility and configuration interfaces

A small utility will be developed to control the state of the kernel MCTP stack.
This will be similar in design to the 'iproute2' tools, which perform a similar
function for the IPv4 and IPv6 protocols.

The utility will be invoked as `mctp`, and provide subcommands for managing
different aspects of the kernel stack.

#### `mctp link`: manage interfaces

```sh
    mctp link set <link> <up|down>
    mctp link set <link> network <network-id>
    mctp link set <link> mtu <mtu>
    mctp link set <link> bus-owner <hwaddr>
```

#### `mctp network`: manage networks

```sh
    mctp network create <network-id>
    mctp network set <network-id> forwarding <on|off>
    mctp network set <network-id> default [<true|false>]
    mctp network delete <network-id>
```

#### `mctp address`: manage local EID assignments

```sh
    mctp address add <eid> dev <link>
    mctp address del <eid> dev <link>
```

#### `mctp route`: manage routing tables

```sh
    mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route show [net <network-id>]
```

#### `mctp stat`: query socket status

```sh
    mctp stat
```

A set of netlink message formats will be defined to support these control
functions.

## Design points & alternatives considered

### Including message-type byte in send/receive buffers

This design specifies that message buffers passed to the kernel in send syscalls
and from the kernel in receive syscalls will have the message type byte as the
first byte of the buffer. This corresponds to the definition of an MCTP message
payload in DSP0236.

This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's
superficially possible for the kernel to prepend this byte on send, and remove
it on receive.

However, the exact format of the MCTP message payload is not precisely defined
by the specification. In particular, any message integrity check data (which
would also need to be appended / stripped in conjunction with the type byte) is
defined by the type specification, not DSP0236. The kernel would need knowledge
of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any
modification by the kernel.

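From the application's side, this buffer layout could be handled with helpers
along these lines. The function names are hypothetical, and the sketch only
covers the leading type byte; per DSP0236 the top bit of that byte is the
integrity check (IC) flag, which is left clear here because the check data
itself is protocol-specific.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MCTP_MSG_TYPE_MASK 0x7f /* bits 6:0 of the first byte: message type */

/*
 * Hypothetical helper: lay out a message as it would be passed to a send
 * call, with the type byte first and the type-specific body after it.
 * Returns the total length, or 0 if 'out' is too small.
 */
size_t mctp_msg_pack(uint8_t *out, size_t out_len, uint8_t type,
                     const void *body, size_t body_len)
{
    if (out_len < 1 || body_len > out_len - 1)
        return 0;

    out[0] = type & MCTP_MSG_TYPE_MASK;
    if (body_len)
        memcpy(out + 1, body, body_len);
    return body_len + 1;
}

/*
 * Hypothetical helper: read the message type from a received buffer.
 * Returns -1 for an empty buffer.
 */
int mctp_msg_type(const uint8_t *buf, size_t len)
{
    return len ? (buf[0] & MCTP_MSG_TYPE_MASK) : -1;
}
```

The packed buffer is what an application would hand to `sendto()`, and a
buffer filled by a receive call can be passed to `mctp_msg_type()` unchanged.
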
### MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol

This design specifies message-types to be passed in the `smctp_type` field of
`struct sockaddr_mctp`. An alternative would be to pass it in the `protocol`
argument of the `socket()` system call:

```c
    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
```

The `smctp_type` implementation was chosen as it better matches the "addressing"
model of the message type; sockets are bound to an incoming message type,
similar to the IP protocol's model of binding UDP sockets to a local port
number.

There is no kernel behaviour that depends on the specific type (particularly
given the design choice above), so the protocol argument is not a good fit
here.

Future additions that perform protocol-specific message handling, and so alter
the send/receive buffer format, may use a new protocol argument.

### Networks referenced by index rather than UUID

This design proposes referencing networks by an integer index. The MCTP standard
does optionally associate an RFC 4122 UUID with a network; it would be possible
to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel.
Given that the Get Network ID message in the MCTP Control Protocol is
implemented entirely in userspace, the kernel does not need to be aware of
network UUIDs, and requiring network references (for example, the
`smctp_network` field of `struct sockaddr_mctp`) to be of type `uuid_t` would
complicate assignment.

Instead, an integer index is used, in a similar fashion to the integer index
used to reference `struct net_device`s elsewhere in the network stack.

### Tag behaviour alternatives

We considered _several_ different designs for the tag handling behaviour. A
brief overview of the more feasible of those, and why they were rejected:

#### Each socket is allocated a unique tag value on creation

We could allocate a tag for each socket on creation, and use that value when a
tag is required. This, however:

- needlessly consumes a tag on non-tag-owning sockets (ie, those which send with
  TO=0 - responders); and

- limits us to 8 sockets per network.

#### Tags only used for message packetisation / reassembly

An alternative would be to completely dissociate tag allocation from sockets,
and only allocate a tag for the (short-lived) task of packetising a message and
sending those packets. Tags would be released when the last packet has been
sent.

However, this removes any facility to correlate responses with the correct
socket, which is the purpose of the TO bit in DSP0236. In order for the sending
application to receive the response, we would either need to:

- limit the system to one socket of each message type (which, for example,
  precludes running a requester and a responder of the same type); or

- forward all incoming messages of a specific message-type to all sockets
  listening on that type, making it trivial to eavesdrop on MCTP data of other
  applications.

#### Allocate a tag for one request/response pair

Another alternative would be to allocate a tag on each outgoing TO=1 message,
and then release that allocation after the incoming response to that tag (TO=0)
is observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to
requests - more than one response may be valid for a given request message. For
example, in response to a request, an NVMe-MI implementation may send an
in-progress reply before the final reply. In this case, we would release the tag
after the first response is received, and then have no way to correlate the
second message with the socket.

Similarly, broadcast MCTP request messages may have multiple replies from
multiple endpoints, meaning we cannot release the tag allocation on the first
reply.

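The failure mode can be shown with a toy model of a single tag. Everything
here (`struct tag_slot`, the policy names, the integer socket ids) is
hypothetical and exists only to contrast the rejected policy with one that
holds the reservation until the socket gives it up.

```c
#include <stdbool.h>

enum tag_policy {
    RELEASE_ON_FIRST_REPLY,     /* the rejected alternative */
    HOLD_UNTIL_SOCKET_RELEASES, /* reservation outlives individual replies */
};

/* Hypothetical state for one tag value towards one remote endpoint. */
struct tag_slot {
    bool reserved; /* is the tag currently tied to a socket? */
    int  owner;    /* socket id that sent the TO=1 request */
};

/* An outgoing TO=1 message reserves the tag for the sending socket. */
void send_request(struct tag_slot *s, int sock)
{
    s->reserved = true;
    s->owner = sock;
}

/*
 * An incoming TO=0 message arrives on this tag. Returns the socket id it
 * is delivered to, or -1 if it can no longer be correlated with a socket.
 */
int recv_response(struct tag_slot *s, enum tag_policy policy)
{
    if (!s->reserved)
        return -1;
    if (policy == RELEASE_ON_FIRST_REPLY)
        s->reserved = false;
    return s->owner;
}
```

With `RELEASE_ON_FIRST_REPLY`, a second call to `recv_response()` for the same
request returns -1, which is the lost in-progress/final reply pair described
above; with the holding policy both replies reach the owning socket.
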