1# OpenBMC in-kernel MCTP
2
3Author: Jeremy Kerr `<jk@codeconstruct.com.au>`
4
5Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
6description, background and requirements.
7
8This document describes a kernel-based implementation of MCTP infrastructure,
9providing a sockets-based API for MCTP communication within an OpenBMC-based
10platform.
11
12# Requirements for a kernel implementation
13
14 * The MCTP messaging API should be an obvious application of the existing POSIX
15   socket interface
16
17 * Configuration should be simple for a straightforward MCTP endpoint: a single
18   network with a single local endpoint id (EID).
19
20 * Infrastructure should be flexible enough to allow for more complex MCTP
21   networks, allowing:
22
23    - each MCTP network (as defined by section 3.2.31 of DSP0236) may
24      consist of multiple local physical interfaces, and/or multiple EIDs;
25
26    - multiple distinct (ie., non-bridged) networks, possibly containing
27      duplicated EIDs between networks;
28
29    - multiple local EIDs on a single interface, and
30
31    - customisable routing/bridging configurations within a network.
32
33
34# Proposed Design #
35
36The design contains several components:
37
38 * An interface for userspace applications to send and receive MCTP messages: A
39   mapping of the sockets API to MCTP usage
40
41 * Infrastructure for control and configuration of the MCTP network(s),
42   consisting of a configuration utility, and a kernel messaging facility for
43   this utility to use.
44
45 * Kernel drivers for physical interface bindings.
46
47In general, the kernel components cover the transport functionality of MCTP,
48such as message assembly/disassembly, packet forwarding, and physical interface
49implementations.
50
51Higher-level protocols (such as PLDM) are implemented in userspace, through the
52introduced socket API. This also includes the majority of the MCTP Control
53Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
54have a specific process to request and respond to control protocol messages.
55However, the kernel will include a small subset of control protocol code to
56allow very simple endpoints, with static EID allocations, to run without this
57process. MCTP endpoints that require more than just single-endpoint
58functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would
59include the control message protocol process.
60
61A new driver is introduced to handle each physical interface binding. These
62drivers expose the appropriate `struct net_device` to handle transmission and
63reception of MCTP packets on their associated hardware channels. Under Linux,
64the namespace for these interfaces is separate from other network interfaces -
65such as those for ethernet.
66
## Structure: interfaces & networks ##
68
69The kernel models the local MCTP topology through two items: interfaces and
70networks.
71
72An interface (or "link") is an instance of an MCTP physical transport binding
73(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
device. This is represented as a `struct net_device`, and has a user-visible
75name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
76allow local loopback and/or virtual interfaces.
77
78A network defines a unique address space for MCTP endpoints by endpoint-ID
79(described by DSP0236, section 3.2.31). A network has a user-visible identifier
80to allow references from userspace. Route definitions are specific to one
81network.
82
83Interfaces are associated with one network. A network may be associated with one
84or more interfaces.
85
86If multiple networks are present, each may contain EIDs that are also present on
87other networks.
88
89## Sockets API ##
90
91### Protocol definitions ###
92
93We define a new address family (and corresponding protocol family) for MCTP:
94
95```c
96    #define AF_MCTP /* TBD */
97    #define PF_MCTP AF_MCTP
98```
99
100MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
101the domain. Currently, only a `SOCK_DGRAM` socket type is defined.
102
103```c
104    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
105```
106
The only value currently defined for the `protocol` argument is 0. Other
protocol values may be added later.
109
110MCTP Sockets opened with a protocol value of 0 will communicate directly at the
111transport layer; message buffers received by the application will consist of
112message data from reassembled MCTP packets, and will include the full message
113including message type byte and optional message integrity check (IC).
114Individual packet headers are not included; they may be accessible through a
115future `SOCK_RAW` socket type.
116
117As with all socket address families, source and destination addresses are
118specified with a new `sockaddr` type:
119
120```c
121    struct sockaddr_mctp {
122            sa_family_t         smctp_family; /* = AF_MCTP */
123            int                 smctp_network;
124            struct mctp_addr    smctp_addr;
125            uint8_t             smctp_type;
126            uint8_t             smctp_tag;
127    };
128
129    struct mctp_addr {
130            uint8_t             s_addr;
131    };
132
133    /* MCTP network values */
134    #define MCTP_NET_ANY        0
135
136    /* MCTP EID values */
137    #define MCTP_ADDR_ANY       0xff
138    #define MCTP_ADDR_BCAST     0xff
139
140    /* MCTP type values. Only the least-significant 7 bits of
     * smctp_type are used for type matches; the specification defines
142     * the type to be 7 bits.
143     */
144    #define MCTP_TYPE_MASK      0x7f
145
    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
147    /* MCTP-spec-defined fields */
148    #define MCTP_TAG_MASK    0x07
149    #define MCTP_TAG_OWNER   0x08
150    /* Others: reserved */
151
152    /* Helpers */
    #define MCTP_TAG_RSP(x) ((x) & MCTP_TAG_MASK) /* response to a request: clear TO, keep value */
154```
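As a small illustration of how the masks above combine, the following sketch re-declares the constants locally (no `AF_MCTP` header is assumed to exist) and adds two helpers. Both `mctp_tag_is_owner()` and `mctp_type_matches()` are hypothetical names used only for this example; they are not part of the proposed API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants copied from the definitions above. */
#define MCTP_TYPE_MASK   0x7f
#define MCTP_TAG_MASK    0x07
#define MCTP_TAG_OWNER   0x08
#define MCTP_TAG_RSP(x)  ((x) & MCTP_TAG_MASK)

/* Hypothetical helper: true if the tag-owner (TO) bit is set. */
static bool mctp_tag_is_owner(uint8_t tag)
{
        return (tag & MCTP_TAG_OWNER) != 0;
}

/* Hypothetical helper: compare two type bytes, ignoring the IC bit. */
static bool mctp_type_matches(uint8_t a, uint8_t b)
{
        return (a & MCTP_TYPE_MASK) == (b & MCTP_TYPE_MASK);
}
```

A responder would typically apply `MCTP_TAG_RSP()` to the tag it received, which keeps the 3-bit tag value and clears the owner bit.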
155
156### Syscall behaviour ###
157
158The following sections describe the MCTP-specific behaviours of the standard
159socket system calls. These behaviours have been chosen to map closely to the
160existing sockets APIs.
161
162#### `bind()`: set local socket address ####
163
164Sockets that receive incoming request packets will bind to a local address,
165using the `bind()` syscall.
166
167```c
168    struct sockaddr_mctp addr;
169
170    addr.smctp_family = AF_MCTP;
171    addr.smctp_network = MCTP_NET_ANY;
172    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
173    addr.smctp_type = MCTP_TYPE_PLDM;
174    addr.smctp_tag = MCTP_TAG_OWNER;
175
176    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
177```
178
179This establishes the local address of the socket. Incoming MCTP messages that
180match the network, address, and message type will be received by this socket.
181The reference to 'incoming' is important here; a bound socket will only receive
182messages with the TO bit set, to indicate an incoming request message, rather
183than a response.
184
185The `smctp_tag` value will configure the tags accepted from the remote side of
186this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
187will result in remotely "owned" tags being routed to this socket. Since
188`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are
189not used; callers must set them to zero. See the [Tag behaviour for transmitted
190messages](#tag-behaviour-for-transmitted-messages) section for more details. If
191the `MCTP_TAG_OWNER` bit is not set, `bind()` will fail with an errno of
192`EINVAL`.
193
194A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
195incoming packets from any locally-connected network. A specific network value
196will cause the socket to only receive incoming messages from that network.
197
198The `smctp_addr` field specifies a local address to bind to. A value of
199`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any
200local destination EID.
201
The `smctp_type` field specifies which message types to receive. Only the lower
7 bits of the type are matched on incoming messages (ie., the most-significant IC
204bit is not part of the match). This results in the socket receiving packets with
205and without a message integrity check footer.
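The matching rules for a bound socket can be summarised in code. This is only a sketch of the behaviour described in this section, not kernel code: `struct bound_addr` and `bind_matches()` are hypothetical names, and the constants are re-declared locally.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCTP_NET_ANY    0
#define MCTP_ADDR_ANY   0xff
#define MCTP_TYPE_MASK  0x7f
#define MCTP_TAG_OWNER  0x08

/* Hypothetical, simplified view of a bound socket's local address. */
struct bound_addr {
        int     network;
        uint8_t eid;
        uint8_t type;
};

/* Would an incoming message be delivered to a socket bound to 'b'? */
static bool bind_matches(const struct bound_addr *b, int net,
                         uint8_t dest_eid, uint8_t type_byte, uint8_t tag)
{
        /* bound sockets only see requests (TO bit set), never responses */
        if (!(tag & MCTP_TAG_OWNER))
                return false;
        if (b->network != MCTP_NET_ANY && b->network != net)
                return false;
        if (b->eid != MCTP_ADDR_ANY && b->eid != dest_eid)
                return false;
        /* the IC bit does not take part in the type match */
        return (b->type & MCTP_TYPE_MASK) == (type_byte & MCTP_TYPE_MASK);
}
```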
206
207#### `connect()`: set remote socket address ####
208
Applications may specify a socket's remote address with the `connect()` syscall:
210
211```c
212    struct sockaddr_mctp addr;
213    int rc;
214
215    addr.smctp_family = AF_MCTP;
216    addr.smctp_network = MCTP_NET_ANY;
217    addr.smctp_addr.s_addr = 8;
218    addr.smctp_tag = MCTP_TAG_OWNER;
219    addr.smctp_type = MCTP_TYPE_PLDM;
220
221    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
222```
223
224This establishes the remote address of a socket, used for future message
225transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
226traffic directly, but just sets the default destination for messages sent from
227this socket.
228
229The `smctp_network` field may specify a locally-attached network, or the value
230`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
231This is guaranteed to work for single-network configurations, but may require
232additional routing definitions for endpoints attached to multiple distinct
233networks. See the [Addressing](#addressing) section for details.
234
The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
the MCTP broadcast EID (0xff).
237
238The `smctp_type` field specifies the type field of messages transferred over
239this socket.
240
241The `smctp_tag` value will configure the tag used for the local side of this
socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
"owned" tag being allocated for this socket. The tag remains allocated for all
future outgoing messages, until either the socket is closed, or `connect()` is
245called again. If a tag cannot be allocated, `connect()` will report an error,
246with an errno value of `EAGAIN`. See the [Tag behaviour for transmitted
247messages](#tag-behaviour-for-transmitted-messages) section for more details. If
248the `MCTP_TAG_OWNER` bit is not set, `connect()` will fail with an errno of
249`EINVAL`.
250
251Requesters which connect to a single responder will typically use `connect()` to
252specify the peer address and tag for future outgoing messages.
253
254#### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message ####
255
256An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`, `send()`
257or `write()` syscalls. Using `sendto()` as the primary example:
258
259```c
260    struct sockaddr_mctp addr;
261    char buf[14];
262    ssize_t len;
263
264    /* set message destination */
265    addr.smctp_family = AF_MCTP;
266    addr.smctp_network = 0;
267    addr.smctp_addr.s_addr = 8;
268    addr.smctp_tag = MCTP_TAG_OWNER;
269    addr.smctp_type = MCTP_TYPE_ECHO;
270
271    /* arbitrary message to send, with message-type header */
272    buf[0] = MCTP_TYPE_ECHO;
273    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);
274
275    len = sendto(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, sizeof(addr));
277```
278
279The address argument is treated the same way as for `connect()`: The network and
address fields define the remote address to send to. If `smctp_tag` has the
`MCTP_TAG_OWNER` bit set, the kernel will ignore any bits set in `MCTP_TAG_MASK`, and
282generate a tag value suitable for the destination EID. If `MCTP_TAG_OWNER` is
283not set, the message will be sent with the tag value as specified. If a tag
284value cannot be allocated, the system call will report an errno of `EAGAIN`.
285
286The application must provide the message type byte as the first byte of the
287message buffer passed to `sendto()`. If a message integrity check is to be
288included in the transmitted message, it must also be provided in the message
289buffer, and the most-significant bit of the message type byte must be 1.
290
291If the first byte of the message does not match the message type value, then the
292system call will return an error of `EPROTO`.
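The type-byte check described above might look something like the following. This is a sketch under two assumptions that the text does not state: that the comparison ignores the IC bit (consistent with the 7-bit `MCTP_TYPE_MASK`), and that an empty buffer is rejected with `EINVAL`. The name `check_msg_type()` is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MCTP_TYPE_MASK  0x7f

/* Returns 0 if the buffer is acceptable, or a negative errno value. */
static int check_msg_type(const uint8_t *buf, size_t len, uint8_t smctp_type)
{
        /* assumption: a message must at least carry its type byte */
        if (len < 1)
                return -EINVAL;

        /* assumption: the IC bit (bit 7) is not part of the comparison */
        if ((buf[0] & MCTP_TYPE_MASK) != (smctp_type & MCTP_TYPE_MASK))
                return -EPROTO;

        return 0;
}
```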
293
294The `send()` and `write()` system calls behave in a similar way, but do not
295specify a remote address. Therefore, `connect()` must be called beforehand; if
296not, these calls will return an error of `EDESTADDRREQ` (Destination address
297required).
298
299Using `sendto()` or `sendmsg()` on a connected socket may override the remote
300socket address specified in `connect()`. The `connect()` address and tag will
301remain associated with the socket, for future unaddressed sends. The tag
302allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
303subject to the same invalidation logic as on an unconnected socket: It is
304expired either by timeout or by a subsequent `sendto()`.
305
306The `sendmsg()` system call allows a more compact argument interface, and the
307message buffer to be specified as a scatter-gather list. At present no
308ancillary message types (used for the `msg_control` data passed to `sendmsg()`)
309are defined.
310
311Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
312will cause an allocation of a tag, if no valid tag is already allocated for that
313destination. The (destination-eid,tag) tuple acts as an implicit local socket
314address, to allow the socket to receive responses to this outgoing message. If
315any previous allocation has been performed (to for a different remote EID), that
316allocation is lost. This tag behaviour can be controlled through the
317`MCTP_TAG_CONTROL` socket option.
318
319Sockets will only receive responses to requests they have sent (with TO=1) and may
320only respond (with TO=0) to requests they have received.
321
322#### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message ####
323
324An MCTP message can be received by an application using one of the `recvfrom()`,
325`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
326primary example:
327
328```c
329    struct sockaddr_mctp addr;
330    socklen_t addrlen;
331    char buf[14];
332    ssize_t len;
333
334    addrlen = sizeof(addr);
335
336    len = recvfrom(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, &addrlen);

    /* We can expect addr to describe an MCTP address */
    assert(addrlen >= sizeof(addr));
    assert(addr.smctp_family == AF_MCTP);

    printf("received %zd bytes from remote EID %d\n", len,
                    addr.smctp_addr.s_addr);
344```
345
346The address argument to `recvfrom` and `recvmsg` is populated with the remote
347address of the incoming message, including tag value (this will be needed in
348order to reply to the message).
349
350The first byte of the message buffer will contain the message type byte. If an
351integrity check follows the message, it will be included in the received buffer.
352
353The `recv()` and `read()` system calls behave in a similar way, but do not
354provide a remote address to the application. Therefore, these are only useful if
355the remote address is already known, or the message does not require a reply.
356
357Like the send calls, sockets will only receive responses to requests they have
358sent (TO=1) and may only respond (TO=0) to requests they have received.
359
360#### `getsockname()` & `getpeername()`: query local/remote socket address ####
361
The `getsockname()` system call returns the `struct sockaddr_mctp` value for the
local side of this socket; `getpeername()` returns that of the remote side (ie,
that used in a `connect()`). Since the tag value is a property of the remote address,
365`getpeername()` may be used to retrieve a kernel-allocated tag value.
366
367Calling `getpeername()` on an unconnected socket will result in an error of
368`ENOTCONN`.
369
370#### Socket options ####
371
372The following socket options are defined for MCTP sockets:
373
374##### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg #####
375
376Enabling this socket option allows an application to specify extended addressing
377information on transmitted packets, and access the same on received packets.
378
379When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify
380an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls.
This is defined as:
382
383```c
384    struct sockaddr_mctp_ext {
385            /* fields exactly match struct sockaddr_mctp */
386            sa_family_t         smctp_family; /* = AF_MCTP */
387            int                 smctp_network;
388            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
            /* extended addressing */
            int                 smctp_ifindex;
            uint8_t             smctp_halen;
            unsigned char       smctp_haddr[/* TBD */];
    };
395```
396
397If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to
398contain this larger structure, then the extended addressing fields are consumed
399/ populated respectively.
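To make the `addrlen` rule concrete, the sketch below declares both structures locally and tests whether a caller-supplied length is large enough for the extended form. The hardware address size is still TBD in this design, so `MCTP_HADDR_MAX` is a placeholder chosen only for illustration, and `addr_has_ext()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

#define MCTP_HADDR_MAX 32   /* placeholder: the real size is TBD */

struct mctp_addr {
        uint8_t s_addr;
};

struct sockaddr_mctp {
        sa_family_t      smctp_family;
        int              smctp_network;
        struct mctp_addr smctp_addr;
        uint8_t          smctp_type;
        uint8_t          smctp_tag;
};

struct sockaddr_mctp_ext {
        /* leading fields mirror struct sockaddr_mctp */
        sa_family_t      smctp_family;
        int              smctp_network;
        struct mctp_addr smctp_addr;
        uint8_t          smctp_type;
        uint8_t          smctp_tag;
        /* extended addressing */
        int              smctp_ifindex;
        uint8_t          smctp_halen;
        unsigned char    smctp_haddr[MCTP_HADDR_MAX];
};

/* Extended fields are only used if the option is on and the buffer fits. */
static bool addr_has_ext(bool addr_ext_enabled, socklen_t addrlen)
{
        return addr_ext_enabled &&
               addrlen >= sizeof(struct sockaddr_mctp_ext);
}
```

Because the leading fields are shared, an application that passes only `sizeof(struct sockaddr_mctp)` keeps working unchanged when the option is enabled.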
400
401
402##### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour #####
403
404The set/getsockopt argument is a `mctp_tagctl` structure:
405
406    struct mctp_tagctl {
407        bool            retain;
408        struct timespec timeout;
409    };
410
411This allows an application to control the behaviour of allocated tags for
412non-connected sockets when transferring messages to multiple different
413destinations (ie., where a `struct sockaddr_mctp` is provided for individual
414messages, and the `smctp_addr` destination for those sockets may vary across
415calls).
416
417The `retain` flag indicates to the kernel that the socket should not release tag
418allocations when a message is sent to a new destination EID. This causes the
419socket to continue to receive incoming messages to the old (dest,tag) tuple, in
420addition to the new tuple.
421
422The `timeout` value specifies a maximum amount of time to retain tag values.
423This should be based on the reply timeout for any upper-level protocol.
424
425The kernel may reject a request to set values that would cause excessive tag
426allocation by this socket. The kernel may also reject subsequent tag-allocation
427requests (through send or connect syscalls) which would cause excessive tags to
428be consumed by the socket, even though the tag control settings were accepted in
429the setsockopt operation.
430
431Changing the default tag control behaviour should only be required when:
432
433 * the socket is sending messages with TO=1 (ie, is a requester); and
434 * messages are sent to multiple different destination EIDs from the one
435   socket.
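An application that does need to change the defaults might derive the `timeout` from its upper-level protocol's reply timeout. The sketch below only builds the structure; `tagctl_from_ms()` is a hypothetical helper, and the `setsockopt()` call itself is omitted because the option level and number are not defined in this document.

```c
#include <stdbool.h>
#include <time.h>

struct mctp_tagctl {
        bool            retain;
        struct timespec timeout;
};

/* Build a tag-control value from a reply timeout given in milliseconds. */
static struct mctp_tagctl tagctl_from_ms(bool retain, long timeout_ms)
{
        struct mctp_tagctl ctl = {
                .retain = retain,
                .timeout = {
                        .tv_sec = timeout_ms / 1000,
                        .tv_nsec = (timeout_ms % 1000) * 1000000L,
                },
        };

        return ctl;
}
```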
436
437
438#### Syscalls not implemented ####
439
440The following system calls are not implemented for MCTP, primarily as they are
441not used in `SOCK_DGRAM`-type sockets:
442
443 * `listen()`
444 * `accept()`
445 * `ioctl()`
446 * `shutdown()`
447 * `mmap()`
448
449### Userspace examples ###
450
451These examples cover three general use-cases:
452
453 - **requester**: sends requests to a particular (EID, type) target, and
454   receives responses to those packets
455
456   This is similar to a typical UDP client
457
458 - **responder**: receives all locally-addressed messages of a specific
459   message-type, and responds to the requester immediately.
460
461   This is similar to a typical UDP server
462
463 - **controller**: a specific service for a bus owner; may send broadcast
464   messages, manage EID allocations, update local MCTP stack state. Will
465   need low-level packet data.
466
467   This is similar to a DHCP server.
468
469#### Requester ####
470
471"Client"-side implementation to send requests to a responder, and receive a response.
472This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.
473
474```c
475    int main() {
476            struct sockaddr_mctp addr;
477            socklen_t addrlen;
478            struct {
479                uint8_t type;
480                uint8_t data[14];
481            } msg;
482            int sd, rc;
483
484            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
485
            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
            addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
            addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
            addrlen = sizeof(addr);

            /* set message type and payload */
            msg.type = MCTP_TYPE_ECHO;
            strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));
496
497            /* send message */
498            rc = sendto(sd, &msg, sizeof(msg), 0,
499                            (struct sockaddr *)&addr, addrlen);
500
501            if (rc < 0)
502                    err(EXIT_FAILURE, "sendto");
503
504            /* Receive reply. This will block until a reply arrives,
505             * which may never happen. Actual code would need a timeout
506             * here. */
507            rc = recvfrom(sd, &msg, sizeof(msg), 0,
508                        (struct sockaddr *)&addr, &addrlen);
509            if (rc < 0)
510                    err(EXIT_FAILURE, "recvfrom");
511
512            assert(msg.type == MCTP_TYPE_ECHO);
513            /* ensure we're nul-terminated */
514            msg.data[sizeof(msg.data)-1] = '\0';
515
516            printf("reply: %s\n", msg.data);
517
518            return EXIT_SUCCESS;
519    }
520```
521
522#### Responder ####
523
"Server"-side implementation to receive requests and respond. Like the client,
this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the `struct
526sockaddr_mctp`; only messages matching this type will be received.
527
528```c
529    int main() {
530            struct sockaddr_mctp addr;
531            socklen_t addrlen;
532            int sd, rc;
533
534            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
535
            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
539            addr.smctp_type = MCTP_TYPE_ECHO;
540            addr.smctp_tag = MCTP_TAG_OWNER;
541            addrlen = sizeof(addr);
542
543            rc = bind(sd, (struct sockaddr *)&addr, addrlen);
544            if (rc)
545                    err(EXIT_FAILURE, "bind");
546
547            for (;;) {
548                    struct {
549                        uint8_t type;
550                        uint8_t data[14];
551                    } msg;
552
                    /* reset addrlen; recvfrom() updates it on each call */
                    addrlen = sizeof(addr);

                    rc = recvfrom(sd, &msg, sizeof(msg), 0,
                                    (struct sockaddr *)&addr, &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");
                    if (rc < 1) {
                            warnx("not enough data for a message type");
                            continue;
                    }

                    assert(addrlen == sizeof(addr));
                    assert(msg.type == MCTP_TYPE_ECHO);

                    printf("%d bytes from EID %d\n", rc,
                                    addr.smctp_addr.s_addr);
564
565                    /* Reply to requester; this macro just clears the TO-bit.
566                     * Other addr fields will describe the remote endpoint,
567                     * so use those as-is.
568                     */
569                    addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);
570
571                    rc = sendto(sd, &msg, rc, 0,
572                                (struct sockaddr *)&addr, addrlen);
573                    if (rc < 0)
574                            err(EXIT_FAILURE, "sendto");
575            }
576
577            return EXIT_SUCCESS;
578    }
579```
580
581#### Broadcast request ####
582
Sends a request to a broadcast EID, and receives (unicast) replies. This is a
typical control protocol pattern.
585
586```c
587    int main() {
588            struct sockaddr_mctp txaddr, rxaddr;
589            struct timespec start, cur;
590            struct pollfd pollfds[1];
591            socklen_t addrlen;
592            uint8_t buf[2];
            int sd, rc, timeout;
594
595            sd = socket(AF_MCTP, SOCK_DGRAM, 0);
596
597            /* destination address setup */
            txaddr.smctp_family = AF_MCTP;
            txaddr.smctp_network = 1; /* specific network required for broadcast */
            txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
601            txaddr.smctp_type = MCTP_TYPE_CONTROL;
602            txaddr.smctp_tag = MCTP_TAG_OWNER;
603
604            buf[0] = MCTP_TYPE_CONTROL;
605            buf[1] = 'a';
606
607            /* We're doing a sendto() to a broadcast address here. If we were
608             * sending more than one broadcast message, we'd be better off
609             * doing connect(); sendto();, in order to retain the tag
610             * reservation across all transmitted messages. However, since this
611             * is a single transmit, that makes no difference in this
612             * particular case.
613             */
614            rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr,
615                            sizeof(txaddr));
616            if (rc < 0)
617                    err(EXIT_FAILURE, "sendto");
618
619            /* Set up poll behaviour, and record our starting time for
620             * reply timeouts */
621            pollfds[0].fd = sd;
622            pollfds[0].events = POLLIN;
623            clock_gettime(CLOCK_MONOTONIC, &start);
624
625            for (;;) {
626                    /* Calculate the amount of time left for replies */
627                    clock_gettime(CLOCK_MONOTONIC, &cur);
628                    timeout = calculate_timeout(&start, &cur, 1000);
629
                    rc = poll(pollfds, 1, timeout);
631                    if (rc < 0)
632                        err(EXIT_FAILURE, "poll");
633
634                    /* timeout receiving a reply? */
635                    if (rc == 0)
636                        break;
637
638                    /* sanity check that we have a message to receive */
639                    if (!(pollfds[0].revents & POLLIN))
640                        break;
641
642                    addrlen = sizeof(rxaddr);
643
644                    rc = recvfrom(sd, &buf, 2, 0, (struct sockaddr *)&rxaddr,
645                            &addrlen);
646                    if (rc < 0)
647                            err(EXIT_FAILURE, "recvfrom");
648
649                    assert(addrlen >= sizeof(rxaddr));
650                    assert(rxaddr.smctp_family == AF_MCTP);
651
                    printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
653            }
654
655            return EXIT_SUCCESS;
656    }
657```
658
659### Implementation notes ###
660
661#### Addressing ####
662
663Transmitted messages (through `sendto()` and related system calls) specify their
664destination via the `smctp_network` and `smctp_addr` fields of a `struct
665sockaddr_mctp`.
666
667The `smctp_addr` field maps directly to the destination endpoint's EID.
668
669The `smctp_network` field specifies a locally defined network identifier. To
670simplify situations where there is only one network defined, the special value
671`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
672network for transmission.
673
674This selection is entirely user-configured; one specific network may be defined
675as the system default, in which case it will be used for all message
676transmission where `MCTP_NET_ANY` is used as the destination network.
677
678In particular, the destination EID is never used to select a destination
679network.
680
681MCTP responders should use the EID and network values of an incoming request to
682specify the destination for any responses.
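The network selection rule can be sketched as a small function. Note that it takes no EID argument, reflecting that the destination EID plays no part in choosing a network. The name `resolve_tx_network()` and the `ENETUNREACH` error for the "no default configured" case are assumptions made for this example only.

```c
#include <errno.h>

#define MCTP_NET_ANY 0

/* Pick the network for an outgoing message.
 * default_net is the user-configured system default, or MCTP_NET_ANY if
 * none has been configured. Returns a network id, or a negative errno. */
static int resolve_tx_network(int requested_net, int default_net)
{
        if (requested_net != MCTP_NET_ANY)
                return requested_net;

        /* assumption: the choice of error code is illustrative */
        if (default_net == MCTP_NET_ANY)
                return -ENETUNREACH;

        return default_net;
}
```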
683
684#### Bridging/routing ####
685
686The network and interface structure allows multiple interfaces to share a common
687network. By default, packets are not forwarded between interfaces.
688
689A network can be configured for "forwarding" mode. In this mode, packets may be
690forwarded if their destination EID is non-local, and matches a route for another
691interface on the same network.
692
693As per DSP0236, packet reassembly does not occur during the forwarding process.
694If the packet is larger than the MTU for the destination interface/route, then
695the packet is dropped.
696
697#### Tag behaviour for transmitted messages ####
698
699On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
700must allocate a tag that will uniquely identify responses over a (destination
701EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.
702
703To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
704field will cause the kernel to allocate a unique tag for subsequent replies from
705that specific remote EID.
706
707This allocation will expire when any of the following occur:
708
709 * the socket is closed
710 * a new message is sent to a new destination EID
711 * an implementation-defined timeout expires
712
713Because the "tag space" is limited, it may not be possible for the kernel to
714allocate a unique tag for the outgoing message. In this case, the `sendto()`
715call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
716a local port cannot be allocated for an outgoing message.
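A toy model of the allocation shows why `EAGAIN` can occur. It is deliberately simplified: it tracks tags per destination EID only (ignoring the source EID, timeouts and per-socket ownership), and `struct tag_table`, `tag_alloc()` and `tag_release()` are hypothetical names.

```c
#include <errno.h>
#include <stdint.h>

#define MCTP_NUM_TAGS   8       /* the tag value is 3 bits wide */
#define MCTP_NUM_EIDS   256

/* One bitmask of in-use tags per destination EID. */
struct tag_table {
        uint8_t in_use[MCTP_NUM_EIDS];
};

/* Allocate a free tag for dest_eid; returns the tag (0..7) or -EAGAIN. */
static int tag_alloc(struct tag_table *t, uint8_t dest_eid)
{
        for (int tag = 0; tag < MCTP_NUM_TAGS; tag++) {
                if (!(t->in_use[dest_eid] & (1u << tag))) {
                        t->in_use[dest_eid] |= (uint8_t)(1u << tag);
                        return tag;
                }
        }

        /* all eight tags are outstanding for this destination */
        return -EAGAIN;
}

/* Release a tag, eg. on socket close or timeout expiry. */
static void tag_release(struct tag_table *t, uint8_t dest_eid, int tag)
{
        t->in_use[dest_eid] &= (uint8_t)~(1u << tag);
}
```

With only eight tags per destination, a ninth outstanding request to the same EID fails until an earlier allocation expires or is released.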
717
718The implementation-defined timeout value shall be chosen to reasonably cover
719standard reply timeouts. If necessary, this timeout may be modified through the
720`MCTP_TAG_CONTROL` socket option.
721
722For applications that expect to perform an ongoing message exchange with a
723particular destination address, they may use the `connect()` call to set a
persistent remote address. In this case, the tag will be allocated during
725connect(), and remain reserved for this socket until any of the following occur:
726
727 * the socket is closed
728 * the remote address is changed through another call to `connect()`.
729
730In particular, calling `sendto()` with a different address does not release the
731tag reservation.
732
733Broadcast messages are particularly onerous for tag reservations. When a message
734is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
735reserve the tag across the entire range of possible EIDs. Therefore, a
736particular tag value must be currently-unused across all EIDs to allow a
737`sendto()` to a broadcast address. Additionally, this reservation is not cleared
738when a reply is received, as there may be multiple replies to a broadcast.
739
740For this reason, applications wanting to send to the broadcast address should
741use the `connect()` system call to reserve a tag, and guarantee its availability
for future message transmission. Note that this makes the tag value unavailable
for use with *any other EID*. Sending to the broadcast address should be avoided; we
744expect few applications will need this functionality.
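The cost of a broadcast reservation can be seen in a simplified model where each EID has a bitmask of in-use tags. A broadcast tag must be free for every EID, and once reserved it is marked in use for all of them. This is an illustrative sketch only; `bcast_tag_alloc()` is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

#define MCTP_NUM_TAGS   8
#define MCTP_NUM_EIDS   256

/* in_use[eid] is a bitmask of the tags currently allocated for that EID.
 * Returns the reserved tag (0..7), or -EAGAIN if no tag is free everywhere. */
static int bcast_tag_alloc(uint8_t in_use[MCTP_NUM_EIDS])
{
        uint8_t any = 0;

        /* collect every tag that is busy for at least one EID */
        for (int eid = 0; eid < MCTP_NUM_EIDS; eid++)
                any |= in_use[eid];

        for (int tag = 0; tag < MCTP_NUM_TAGS; tag++) {
                if (any & (1u << tag))
                        continue;

                /* reserve the tag across the whole EID range */
                for (int eid = 0; eid < MCTP_NUM_EIDS; eid++)
                        in_use[eid] |= (uint8_t)(1u << tag);
                return tag;
        }

        return -EAGAIN;
}
```

A single EID with all eight tags outstanding is enough to make a broadcast send fail in this model.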
745
746
747#### MCTP Control Protocol implementation ####
748
Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
750implementation would exist as a userspace process, `mctpd`. This process is
751responsible for responding to incoming control protocol messages, any dynamic
752EID allocations (for bus owner devices) and maintaining the MCTP route table
753(for bridging devices).
754
755This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
756the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
757data on incoming control protocol requests. It would interact with the kernel's
758route table via a netlink interface - the same as that implemented for the
759[Utility and configuration interfaces](#utility-and-configuration-interfaces).

### Neighbour and routing implementation ###

The packet-transmission behaviour of the MCTP infrastructure relies on a single
routing table to look up both route and neighbour information. Entries in this
table are of the format:

 | EID range | interface | physical address | metric | MTU | flags | expiry |
 |-----------|-----------|------------------|--------|-----|-------|--------|

This table can be updated from two sources:

  * From userspace, via a netlink interface (see the
    [Utility and configuration interfaces](#utility-and-configuration-interfaces)
    section).

  * Directly within the kernel, when basic neighbour information is discovered.
    Kernel-originated routes are marked as such in the flags field, and have a
    maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

  * physical-to-EID mappings discovered through received packets

  * explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the
routing table, the kernel may attempt to resolve the physical address of that
endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
(section 12.9 of DSP0236). The response message will be used to add a
kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.
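As an illustration of how a single table can serve both route and neighbour lookups, the following sketch models one table entry and a lookup by destination EID. The field names, the `ROUTE_F_KERNEL` flag, the physical-address size and the "lowest metric wins" rule are assumptions made for this example; the design above only specifies the columns of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROUTE_F_KERNEL 0x1      /* hypothetical flag: kernel-originated entry */

/* One row of the combined route/neighbour table described above. */
struct mctp_route_entry {
    uint8_t  eid_min, eid_max;  /* EID range, inclusive */
    int      ifindex;           /* interface */
    uint8_t  hwaddr[8];         /* physical address (size is an assumption) */
    uint8_t  hwaddr_len;
    unsigned metric;
    unsigned mtu;
    unsigned flags;
    uint64_t expiry;            /* only checked for ROUTE_F_KERNEL entries */
};

/* Return the entry covering `eid` with the lowest metric, ignoring
 * kernel-originated entries whose expiry time has passed. Returns NULL
 * when no entry matches; at that point the kernel may choose to issue a
 * Resolve Endpoint ID request. */
static const struct mctp_route_entry *
route_lookup(const struct mctp_route_entry *tbl, size_t n,
             uint8_t eid, uint64_t now)
{
    const struct mctp_route_entry *best = NULL;

    assert(tbl != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct mctp_route_entry *r = &tbl[i];

        if (eid < r->eid_min || eid > r->eid_max)
            continue;
        if ((r->flags & ROUTE_F_KERNEL) && r->expiry <= now)
            continue;
        if (!best || r->metric < best->metric)
            best = r;
    }
    return best;
}
```

A userspace-configured range route and a kernel-discovered single-EID entry can then coexist: the kernel entry is preferred while it is valid and has the lower metric, and the lookup falls back to the range route once it expires.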

## Utility and configuration interfaces ##

A small utility will be developed to control the state of the kernel MCTP stack.
This will be similar in design to the 'iproute2' tools, which perform a similar
function for the IPv4 and IPv6 protocols.

The utility will be invoked as `mctp`, and provide subcommands for managing
different aspects of the kernel stack.

### `mctp link`: manage interfaces ###

```sh
    mctp link set <link> <up|down>
    mctp link set <link> network <network-id>
    mctp link set <link> mtu <mtu>
    mctp link set <link> bus-owner <hwaddr>
```

### `mctp network`: manage networks ###

```sh
    mctp network create <network-id>
    mctp network set <network-id> forwarding <on|off>
    mctp network set <network-id> default [<true|false>]
    mctp network delete <network-id>
```

### `mctp address`: manage local EID assignments ###

```sh
    mctp address add <eid> dev <link>
    mctp address del <eid> dev <link>
```

### `mctp route`: manage routing tables ###

```sh
    mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route show [net <network-id>]
```

### `mctp stat`: query socket status ###

```sh
    mctp stat
```

A set of netlink message formats will be defined to support these control
functions.


# Design points & alternatives considered #

## Including message-type byte in send/receive buffers ##

This design specifies that message buffers passed to the kernel in send syscalls
and from the kernel in receive syscalls will have the message type byte as the
first byte of the buffer. This corresponds to the definition of an MCTP message
payload in DSP0236.

This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's
superficially possible for the kernel to prepend this byte on send, and remove
it on receive.

However, the exact format of the MCTP message payload is not precisely defined
by the specification. Particularly, any message integrity check data (which
would also need to be appended / stripped in conjunction with the type byte) is
defined by the type specification, not DSP0236. The kernel would need knowledge
of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any
modification by the kernel.
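From the application's side, this means the buffer handed to a send call already starts with the type byte. The helpers below sketch that layout; the function and macro names are hypothetical, and the only details taken from DSP0236 are that the first byte carries the message type in its low seven bits and the integrity check (IC) flag in its top bit. Any integrity check data itself is part of the type-specific body and is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MCTP_MSG_TYPE_MASK 0x7f /* low 7 bits: message type */
#define MCTP_MSG_IC        0x80 /* top bit: integrity check present */

/* Build a message buffer as passed to a send syscall: the type byte
 * first, followed by the type-specific body, unmodified. Returns the
 * total message length, or 0 if `buf` is too small. */
static size_t mctp_msg_build(uint8_t *buf, size_t buf_len, uint8_t type,
                             int ic, const uint8_t *body, size_t body_len)
{
    assert(buf != NULL);

    if (buf_len < 1 || body_len > buf_len - 1)
        return 0;

    buf[0] = (uint8_t)((type & MCTP_MSG_TYPE_MASK) | (ic ? MCTP_MSG_IC : 0));
    if (body_len)
        memcpy(buf + 1, body, body_len);
    return body_len + 1;
}

/* Extract the message type from a received buffer. */
static uint8_t mctp_msg_type(const uint8_t *buf)
{
    return buf[0] & MCTP_MSG_TYPE_MASK;
}

/* Report whether the received message carries integrity check data. */
static int mctp_msg_has_ic(const uint8_t *buf)
{
    return (buf[0] & MCTP_MSG_IC) != 0;
}
```

Because the kernel passes the buffer through untouched, the same helpers apply to buffers returned by receive calls.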

## MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol ##

This design specifies message-types to be passed in the `smctp_type` field of
`struct sockaddr_mctp`. An alternative would be to pass it in the `protocol`
argument of the `socket()` system call:

```c
    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
```

The `smctp_type` implementation was chosen as it better matches the "addressing"
model of the message type; sockets are bound to an incoming message type,
similar to the IP protocol's model of binding UDP sockets to a local port number.

There is no kernel behaviour that depends on the specific type (particularly
given the design choice above), so the `protocol` argument is not a good fit
for it.

Future additions that perform protocol-specific message handling, and so alter
the send/receive buffer format, may use a new protocol argument.


## Networks referenced by index rather than UUID ##

This design proposes referencing networks by an integer index. The MCTP standard
does optionally associate an RFC 4122 UUID with a network; it would be possible
to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel.
Given that the Get Network ID message in the MCTP Control Protocol is
implemented entirely in userspace, the kernel does not need to be aware of
network UUIDs, and requiring network references (for example, the
`smctp_network` field of `struct sockaddr_mctp`) to be of type `uuid_t`
complicates assignment.

Instead, an integer index is used, in a similar fashion to the integer
index used to reference `struct net_device`s elsewhere in the network stack.


## Tag behaviour alternatives ##

We considered *several* different designs for the tag handling behaviour. A
brief overview of the more feasible of those, and why they were rejected,
follows.

### Each socket is allocated a unique tag value on creation ###

We could allocate a tag for each socket on creation, and use that value when a
tag is required. This, however:

 * needlessly consumes a tag on non-tag-owning sockets (ie, those which send
   with TO=0 - responders); and

 * limits us to 8 sockets per network.

### Tags only used for message packetisation / reassembly ###

An alternative would be to completely dissociate tag allocation from sockets,
and only allocate a tag for the (short-lived) task of packetising a message, and
sending those packets. Tags would be released when the last packet has been sent.

However, this removes any facility to correlate responses with the correct
socket, which is the purpose of the TO bit in DSP0236. In order for the sending
application to receive the response, we would either need to:

 * limit the system to one socket of each message type (which, for example,
   precludes running a requester and a responder of the same type); or

 * forward all incoming messages of a specific message-type to all sockets
   listening on that type, making it trivial to eavesdrop on MCTP data of
   other applications.

### Allocate a tag for one request/response pair ###

Another alternative would be to allocate a tag on each outgoing TO=1 message,
and then release that allocation after the incoming response to that tag (TO=0) is
observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to
requests - more than one response may be valid for a given request message. For
example, in response to a request, an NVMe-MI implementation may send an
in-progress reply before the final reply. In this case, we would release the tag
after the first response is received, and then have no way to correlate the
second message with the socket.

Broadcast MCTP request messages may have multiple replies from multiple
endpoints, meaning we cannot release the tag allocation on the first reply.
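The problem with releasing a tag on the first response can be shown with a small model of the tag-to-socket mapping for a single peer. This is an illustrative sketch only; `struct tag_owner_map` and its helpers are hypothetical names, and socket ids are plain integers.

```c
#include <assert.h>

#define NUM_TAGS  8
#define NO_SOCKET (-1)

/* Hypothetical model: which socket owns each tag towards one peer EID. */
struct tag_owner_map {
    int owner[NUM_TAGS];
};

static void map_init(struct tag_owner_map *m)
{
    for (int i = 0; i < NUM_TAGS; i++)
        m->owner[i] = NO_SOCKET;
}

/* Allocate the lowest free tag for `sock`; returns the tag or -1. */
static int map_alloc(struct tag_owner_map *m, int sock)
{
    for (int i = 0; i < NUM_TAGS; i++) {
        if (m->owner[i] == NO_SOCKET) {
            m->owner[i] = sock;
            return i;
        }
    }
    return -1;
}

/* Route an incoming TO=0 response carrying `tag`. Returns the socket
 * that receives it, or NO_SOCKET if the tag has no owner. When
 * `release_on_first` is set, the tag is freed after one delivery, as in
 * the rejected alternative. */
static int map_deliver(struct tag_owner_map *m, int tag, int release_on_first)
{
    assert(tag >= 0 && tag < NUM_TAGS);

    int sock = m->owner[tag];
    if (sock != NO_SOCKET && release_on_first)
        m->owner[tag] = NO_SOCKET;
    return sock;
}
```

With `release_on_first` set, the in-progress reply reaches the requesting socket but the final reply finds no owner for the tag. Keeping the reservation until the socket is closed or reconnected, as the proposed design does, delivers both.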