# OpenBMC in-kernel MCTP Author: Jeremy Kerr `` Please refer to the [MCTP Overview](mctp.md) document for general MCTP design description, background and requirements. This document describes a kernel-based implementation of MCTP infrastructure, providing a sockets-based API for MCTP communication within an OpenBMC-based platform. # Requirements for a kernel implementation * The MCTP messaging API should be an obvious application of the existing POSIX socket interface * Configuration should be simple for a straightforward MCTP endpoint: a single network with a single local endpoint id (EID). * Infrastructure should be flexible enough to allow for more complex MCTP networks, allowing: - each MCTP network (as defined by section 3.2.31 of DSP0236) may consist of multiple local physical interfaces, and/or multiple EIDs; - multiple distinct (ie., non-bridged) networks, possibly containing duplicated EIDs between networks; - multiple local EIDs on a single interface, and - customisable routing/bridging configurations within a network. # Proposed Design # The design contains several components: * An interface for userspace applications to send and receive MCTP messages: A mapping of the sockets API to MCTP usage * Infrastructure for control and configuration of the MCTP network(s), consisting of a configuration utility, and a kernel messaging facility for this utility to use. * Kernel drivers for physical interface bindings. In general, the kernel components cover the transport functionality of MCTP, such as message assembly/disassembly, packet forwarding, and physical interface implementations. Higher-level protocols (such as PLDM) are implemented in userspace, through the introduced socket API. This also includes the majority of the MCTP Control Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically have a specific process to request and respond to control protocol messages. However, the kernel will include a small subset of control protocol code to allow very simple endpoints, with static EID allocations, to run without this process. MCTP endpoints that require more than just single-endpoint functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would include the control message protocol process. A new driver is introduced to handle each physical interface binding. These drivers expose the appropriate `struct net_device` to handle transmission and reception of MCTP packets on their associated hardware channels. Under Linux, the namespace for these interfaces is separate from other network interfaces - such as those for ethernet. ## Structure: interfaces & networks # The kernel models the local MCTP topology through two items: interfaces and networks. An interface (or "link") is an instance of an MCTP physical transport binding (as defined by DSP0236, section 3.2.47), likely connected to a specific hardware device. This is represented as a `struct netdevice`, and has a user-visible name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to allow local loopback and/or virtual interfaces. A network defines a unique address space for MCTP endpoints by endpoint-ID (described by DSP0236, section 3.2.31). A network has a user-visible identifier to allow references from userspace. Route definitions are specific to one network. Interfaces are associated with one network. A network may be associated with one or more interfaces. If multiple networks are present, each may contain EIDs that are also present on other networks. ## Sockets API ## ### Protocol definitions ### We define a new address family (and corresponding protocol family) for MCTP: ```c #define AF_MCTP /* TBD */ #define PF_MCTP AF_MCTP ``` MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as the domain. Currently, only a `SOCK_DGRAM` socket type is defined. ```c int sd = socket(AF_MCTP, SOCK_DGRAM, 0); ``` The only (current) value for the `protocol` argument is 0. Future protocol implementations may be added later. MCTP Sockets opened with a protocol value of 0 will communicate directly at the transport layer; message buffers received by the application will consist of message data from reassembled MCTP packets, and will include the full message including message type byte and optional message integrity check (IC). Individual packet headers are not included; they may be accessible through a future `SOCK_RAW` socket type. As with all socket address families, source and destination addresses are specified with a new `sockaddr` type: ```c struct sockaddr_mctp { sa_family_t smctp_family; /* = AF_MCTP */ int smctp_network; struct mctp_addr smctp_addr; uint8_t smctp_type; uint8_t smctp_tag; }; struct mctp_addr { uint8_t s_addr; }; /* MCTP network values */ #define MCTP_NET_ANY 0 /* MCTP EID values */ #define MCTP_ADDR_ANY 0xff #define MCTP_ADDR_BCAST 0xff /* MCTP type values. Only the least-significant 7 bits of * smctp_type are used for tag matches; the specification defines * the type to be 7 bits. */ #define MCTP_TYPE_MASK 0x7f /* MCTP tag defintions; used for smcp_tag field of sockaddr_mctp */ /* MCTP-spec-defined fields */ #define MCTP_TAG_MASK 0x07 #define MCTP_TAG_OWNER 0x08 /* Others: reserved */ /* Helpers */ #define MCTP_TAG_RSP(x) (x & MCTP_TAG_MASK) /* response to a request: clear TO, keep value */ ``` ### Syscall behaviour ### The following sections describe the MCTP-specific behaviours of the standard socket system calls. These behaviours have been chosen to map closely to the existing sockets APIs. #### `bind()`: set local socket address #### Sockets that receive incoming request packets will bind to a local address, using the `bind()` syscall. ```c struct sockaddr_mctp addr; addr.smctp_family = AF_MCTP; addr.smctp_network = MCTP_NET_ANY; addr.smctp_addr.s_addr = MCTP_ADDR_ANY; addr.smctp_type = MCTP_TYPE_PLDM; addr.smctp_tag = MCTP_TAG_OWNER; int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr)); ``` This establishes the local address of the socket. Incoming MCTP messages that match the network, address, and message type will be received by this socket. The reference to 'incoming' is important here; a bound socket will only receive messages with the TO bit set, to indicate an incoming request message, rather than a response. The `smctp_tag` value will configure the tags accepted from the remote side of this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which will result in remotely "owned" tags being routed to this socket. Since `MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are not used; callers must set them to zero. See the [Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages) section for more details. If the `MCTP_TAG_OWNER` bit is not set, `bind()` will fail with an errno of `EINVAL`. A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive incoming packets from any locally-connected network. A specific network value will cause the socket to only receive incoming messages from that network. The `smctp_addr` field specifies a local address to bind to. A value of `MCTP_ADDR_ANY` configures the socket to receive messages addressed to any local destination EID. The `smctp_type` field specifies which message types to receive. Only the lower 7 bits of the type is matched on incoming messages (ie., the most-significant IC bit is not part of the match). This results in the socket receiving packets with and without a message integrity check footer. #### `connect()`: set remote socket address #### Sockets may specify a socket's remote address with the `connect()` syscall: ```c struct sockaddr_mctp addr; int rc; addr.smctp_family = AF_MCTP; addr.smctp_network = MCTP_NET_ANY; addr.smctp_addr.s_addr = 8; addr.smctp_tag = MCTP_TAG_OWNER; addr.smctp_type = MCTP_TYPE_PLDM; rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr)); ``` This establishes the remote address of a socket, used for future message transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP traffic directly, but just sets the default destination for messages sent from this socket. The `smctp_network` field may specify a locally-attached network, or the value `MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network. This is guaranteed to work for single-network configurations, but may require additional routing definitions for endpoints attached to multiple distinct networks. See the [Addressing](#addressing) section for details. The `smctp_addr` field specifies a remote EID. This may be the `MCTP_ADDR_BCAST` the MCTP broadcast EID (0xff). The `smctp_type` field specifies the type field of messages transferred over this socket. The `smctp_tag` value will configure the tag used for the local side of this socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an "owned" tag to be allocated for this socket, and will remain allocated for all future outgoing messages, until either the socket is closed, or `connect()` is called again. If a tag cannot be allocated, `connect()` will report an error, with an errno value of `EAGAIN`. See the [Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages) section for more details. If the `MCTP_TAG_OWNER` bit is not set, `connect()` will fail with an errno of `EINVAL`. Requesters which connect to a single responder will typically use `connect()` to specify the peer address and tag for future outgoing messages. #### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message #### An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`, `send()` or `write()` syscalls. Using `sendto()` as the primary example: ```c struct sockaddr_mctp addr; char buf[14]; ssize_t len; /* set message destination */ addr.smctp_family = AF_MCTP; addr.smctp_network = 0; addr.smctp_addr.s_addr = 8; addr.smctp_tag = MCTP_TAG_OWNER; addr.smctp_type = MCTP_TYPE_ECHO; /* arbitrary message to send, with message-type header */ buf[0] = MCTP_TYPE_ECHO; memcpy(buf + 1, "hello, world!", sizeof(buf) - 1); len = sendto(sd, buf, sizeof(buf), 0, (struct sockaddr_mctp *)&addr, sizeof(addr)); ``` The address argument is treated the same way as for `connect()`: The network and address fields define the remote address to send to. If `smctp_tag` has the `MCTP_TAG_OWNER`, the kernel will ignore any bits set in `MCTP_TAG_VALUE`, and generate a tag value suitable for the destination EID. If `MCTP_TAG_OWNER` is not set, the message will be sent with the tag value as specified. If a tag value cannot be allocated, the system call will report an errno of `EAGAIN`. The application must provide the message type byte as the first byte of the message buffer passed to `sendto()`. If a message integrity check is to be included in the transmitted message, it must also be provided in the message buffer, and the most-significant bit of the message type byte must be 1. If the first byte of the message does not match the message type value, then the system call will return an error of `EPROTO`. The `send()` and `write()` system calls behave in a similar way, but do not specify a remote address. Therefore, `connect()` must be called beforehand; if not, these calls will return an error of `EDESTADDRREQ` (Destination address required). Using `sendto()` or `sendmsg()` on a connected socket may override the remote socket address specified in `connect()`. The `connect()` address and tag will remain associated with the socket, for future unaddressed sends. The tag allocated through a call to `sendto()` or `sendmsg()` on a connected socket is subject to the same invalidation logic as on an unconnected socket: It is expired either by timeout or by a subsequent `sendto()`. The `sendmsg()` system call allows a more compact argument interface, and the message buffer to be specified as a scatter-gather list. At present no ancillary message types (used for the `msg_control` data passed to `sendmsg()`) are defined. Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified will cause an allocation of a tag, if no valid tag is already allocated for that destination. The (destination-eid,tag) tuple acts as an implicit local socket address, to allow the socket to receive responses to this outgoing message. If any previous allocation has been performed (to for a different remote EID), that allocation is lost. This tag behaviour can be controlled through the `MCTP_TAG_CONTROL` socket option. Sockets will only receive responses to requests they have sent (with TO=1) and may only respond (with TO=0) to requests they have received. #### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message #### An MCTP message can be received by an application using one of the `recvfrom()`, `recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the primary example: ```c struct sockaddr_mctp addr; socklen_t addrlen; char buf[14]; ssize_t len; addrlen = sizeof(addr); len = recvfrom(sd, buf, sizeof(buf), 0, (struct sockaddr_mctp *)&addr, &addrlen); /* We can expect addr to describe an MCTP address */ assert(addrlen >= sizeof(buf)); assert(addr.smctp_family == AF_MCTP); printf("received %zd bytes from remote EID %d\n", rc, addr.smctp_addr); ``` The address argument to `recvfrom` and `recvmsg` is populated with the remote address of the incoming message, including tag value (this will be needed in order to reply to the message). The first byte of the message buffer will contain the message type byte. If an integrity check follows the message, it will be included in the received buffer. The `recv()` and `read()` system calls behave in a similar way, but do not provide a remote address to the application. Therefore, these are only useful if the remote address is already known, or the message does not require a reply. Like the send calls, sockets will only receive responses to requests they have sent (TO=1) and may only respond (TO=0) to requests they have received. #### `getsockname()` & `getpeername()`: query local/remote socket address #### The `getsockname()` system call returns the `struct sockaddr_mctp` value for the local side of this socket, `getpeername()` for the remote (ie, that used in a connect()). Since the tag value is a property of the remote address, `getpeername()` may be used to retrieve a kernel-allocated tag value. Calling `getpeername()` on an unconnected socket will result in an error of `ENOTCONN`. #### Socket options #### The following socket options are defined for MCTP sockets: ##### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg ##### Enabling this socket option allows an application to specify extended addressing information on transmitted packets, and access the same on received packets. When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls. This as defined as: ```c struct sockaddr_mctp_ext { /* fields exactly match struct sockaddr_mctp */ sa_family_t smctp_family; /* = AF_MCTP */ int smctp_network; struct mctp_addr smctp_addr; uint8_t smcp_tag; /* extended addressing */ int smctp_ifindex; uint8_t smctp_halen; unsigned char smctp_haddr[/* TBD */]; } ``` If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to contain this larger structure, then the extended addressing fields are consumed / populated respectively. ##### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour ##### The set/getsockopt argument is a `mctp_tagctl` structure: struct mctp_tagctl { bool retain; struct timespec timeout; }; This allows an application to control the behaviour of allocated tags for non-connected sockets when transferring messages to multiple different destinations (ie., where a `struct sockaddr_mctp` is provided for individual messages, and the `smctp_addr` destination for those sockets may vary across calls). The `retain` flag indicates to the kernel that the socket should not release tag allocations when a message is sent to a new destination EID. This causes the socket to continue to receive incoming messages to the old (dest,tag) tuple, in addition to the new tuple. The `timeout` value specifies a maximum amount of time to retain tag values. This should be based on the reply timeout for any upper-level protocol. The kernel may reject a request to set values that would cause excessive tag allocation by this socket. The kernel may also reject subsequent tag-allocation requests (through send or connect syscalls) which would cause excessive tags to be consumed by the socket, even though the tag control settings were accepted in the setsockopt operation. Changing the default tag control behaviour should only be required when: * the socket is sending messages with TO=1 (ie, is a requester); and * messages are sent to multiple different destination EIDs from the one socket. #### Syscalls not implemented #### The following system calls are not implemented for MCTP, primarily as they are not used in `SOCK_DGRAM`-type sockets: * `listen()` * `accept()` * `ioctl()` * `shutdown()` * `mmap()` ### Userspace examples ### These examples cover three general use-cases: - **requester**: sends requests to a particular (EID, type) target, and receives responses to those packets This is similar to a typical UDP client - **responder**: receives all locally-addressed messages of a specific message-type, and responds to the requester immediately. This is similar to a typical UDP server - **controller**: a specific service for a bus owner; may send broadcast messages, manage EID allocations, update local MCTP stack state. Will need low-level packet data. This is similar to a DHCP server. #### Requester #### "Client"-side implementation to send requests to a responder, and receive a response. This uses a (fictitious) message type of `MCTP_TYPE_ECHO`. ```c int main() { struct sockaddr_mctp addr; socklen_t addrlen; struct { uint8_t type; uint8_t data[14]; } msg; int sd, rc; sd = socket(AF_MCTP, SOCK_DGRAM, 0); addr.sa_family = AF_MCTP; addr.smctp_network = MCTP_NET_ANY; /* any network */ addr.smctp_addr.s_addr = 9; /* remote eid 9 */ addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */ addr.smctp_type = MCTP_TYPE_ECHO; /* ficticious message type */ addrlen = sizeof(addr); /* set message type and payload */ msg.type = MCTP_TYPE_ECHO; strncpy(msg.data, "hello, world!", sizeof(msg.data)); /* send message */ rc = sendto(sd, &msg, sizeof(msg), 0, (struct sockaddr *)&addr, addrlen); if (rc < 0) err(EXIT_FAILURE, "sendto"); /* Receive reply. This will block until a reply arrives, * which may never happen. Actual code would need a timeout * here. */ rc = recvfrom(sd, &msg, sizeof(msg), 0, (struct sockaddr *)&addr, &addrlen); if (rc < 0) err(EXIT_FAILURE, "recvfrom"); assert(msg.type == MCTP_TYPE_ECHO); /* ensure we're nul-terminated */ msg.data[sizeof(msg.data)-1] = '\0'; printf("reply: %s\n", msg.data); return EXIT_SUCCESS; } ``` #### Responder #### "Server"-side implementation to receive requests and respond. Like the client, This uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the `struct sockaddr_mctp`; only messages matching this type will be received. ```c int main() { struct sockaddr_mctp addr; socklen_t addrlen; int sd, rc; sd = socket(AF_MCTP, SOCK_DGRAM, 0); addr.sa_family = AF_MCTP; addr.smctp_network = MCTP_NET_ANY; /* any network */ addr.smctp_addr.s_addr = MCTP_EID_ANY; addr.smctp_type = MCTP_TYPE_ECHO; addr.smctp_tag = MCTP_TAG_OWNER; addrlen = sizeof(addr); rc = bind(sd, (struct sockaddr *)&addr, addrlen); if (rc) err(EXIT_FAILURE, "bind"); for (;;) { struct { uint8_t type; uint8_t data[14]; } msg; rc = recvfrom(sd, &msg, sizeof(msg), 0, (struct sockaddr *)&addr, &addrlen); if (rc < 0) err(EXIT_FAILURE, "recvfrom"); if (rc < 1) warnx("not enough data for a message type"); assert(addrlen == sizeof(addr)); assert(msg.type == MCTP_TYPE_ECHO); printf("%zd bytes from EID %d\n", rc, addr.smctp_addr); /* Reply to requester; this macro just clears the TO-bit. * Other addr fields will describe the remote endpoint, * so use those as-is. */ addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag); rc = sendto(sd, &msg, rc, 0, (struct sockaddr *)&addr, addrlen); if (rc < 0) err(EXIT_FAILURE, "sendto"); } return EXIT_SUCCESS; } ``` #### Broadcast request #### Sends a request to a broadcast EID, and receives (unicast) replies. Typical control protocol pattern. ```c int main() { struct sockaddr_mctp txaddr, rxaddr; struct timespec start, cur; struct pollfd pollfds[1]; socklen_t addrlen; uint8_t buf[2]; int timeout; sd = socket(AF_MCTP, SOCK_DGRAM, 0); /* destination address setup */ txaddr.sa_family = AF_MCTP; txaddr.smctp_network = 1; /* specific network required for broadcast */ txaddr.smctp_addr.s_addr = MCTP_TAG_BCAST; /* broadcast dest */ txaddr.smctp_type = MCTP_TYPE_CONTROL; txaddr.smctp_tag = MCTP_TAG_OWNER; buf[0] = MCTP_TYPE_CONTROL; buf[1] = 'a'; /* We're doing a sendto() to a broadcast address here. If we were * sending more than one broadcast message, we'd be better off * doing connect(); sendto();, in order to retain the tag * reservation across all transmitted messages. However, since this * is a single transmit, that makes no difference in this * particular case. */ rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr, sizeof(txaddr)); if (rc < 0) err(EXIT_FAILURE, "sendto"); /* Set up poll behaviour, and record our starting time for * reply timeouts */ pollfds[0].fd = sd; pollfds[0].events = POLLIN; clock_gettime(CLOCK_MONOTONIC, &start); for (;;) { /* Calculate the amount of time left for replies */ clock_gettime(CLOCK_MONOTONIC, &cur); timeout = calculate_timeout(&start, &cur, 1000); rc = poll(pollfds, 1, timeout) if (rc < 0) err(EXIT_FAILURE, "poll"); /* timeout receiving a reply? */ if (rc == 0) break; /* sanity check that we have a message to receive */ if (!(pollfds[0].revents & POLLIN)) break; addrlen = sizeof(rxaddr); rc = recvfrom(sd, &buf, 2, 0, (struct sockaddr *)&rxaddr, &addrlen); if (rc < 0) err(EXIT_FAILURE, "recvfrom"); assert(addrlen >= sizeof(rxaddr)); assert(rxaddr.smctp_family == AF_MCTP); printf("response from EID %d\n", rxaddr.smctp_addr); } return EXIT_SUCCESS; } ``` ### Implementation notes ### #### Addressing #### Transmitted messages (through `sendto()` and related system calls) specify their destination via the `smctp_network` and `smctp_addr` fields of a `struct sockaddr_mctp`. The `smctp_addr` field maps directly to the destination endpoint's EID. The `smctp_network` field specifies a locally defined network identifier. To simplify situations where there is only one network defined, the special value `MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific network for transmission. This selection is entirely user-configured; one specific network may be defined as the system default, in which case it will be used for all message transmission where `MCTP_NET_ANY` is used as the destination network. In particular, the destination EID is never used to select a destination network. MCTP responders should use the EID and network values of an incoming request to specify the destination for any responses. #### Bridging/routing #### The network and interface structure allows multiple interfaces to share a common network. By default, packets are not forwarded between interfaces. A network can be configured for "forwarding" mode. In this mode, packets may be forwarded if their destination EID is non-local, and matches a route for another interface on the same network. As per DSP0236, packet reassembly does not occur during the forwarding process. If the packet is larger than the MTU for the destination interface/route, then the packet is dropped. #### Tag behaviour for transmitted messages #### On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel must allocate a tag that will uniquely identify responses over a (destination EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size. To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag` field will cause the kernel to allocate a unique tag for subsequent replies from that specific remote EID. This allocation will expire when any of the following occur: * the socket is closed * a new message is sent to a new destination EID * an implementation-defined timeout expires Because the "tag space" is limited, it may not be possible for the kernel to allocate a unique tag for the outgoing message. In this case, the `sendto()` call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when a local port cannot be allocated for an outgoing message. The implementation-defined timeout value shall be chosen to reasonably cover standard reply timeouts. If necessary, this timeout may be modified through the `MCTP_TAG_CONTROL` socket option. For applications that expect to perform an ongoing message exchange with a particular destination address, they may use the `connect()` call to set a persistent remote address. In this case, the tag will be allocated during connect(), and remain reserved for this socket until any of the following occur: * the socket is closed * the remote address is changed through another call to `connect()`. In particular, calling `sendto()` with a different address does not release the tag reservation. Broadcast messages are particularly onerous for tag reservations. When a message is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must reserve the tag across the entire range of possible EIDs. Therefore, a particular tag value must be currently-unused across all EIDs to allow a `sendto()` to a broadcast address. Additionally, this reservation is not cleared when a reply is received, as there may be multiple replies to a broadcast. For this reason, applications wanting to send to the broadcast address should use the `connect()` system call to reserve a tag, and guarantee its availability for future message transmission. Note that this will remove the tag value for use with *any other EID*. Sending to the broadcast address should be avoided; we expect few applications will need this functionality. #### MCTP Control Protocol implementation #### Aside from the "Resolve endpoint EID" message, the MCTP control protocol implementation would exist as a userspace process, `mctpd`. This process is responsible for responding to incoming control protocol messages, any dynamic EID allocations (for bus owner devices) and maintaining the MCTP route table (for bridging devices). This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing data on incoming control protocol requests. It would interact with the kernel's route table via a netlink interface - the same as that implemented for the [Utility and configuration interfaces](#utility-and-configuration-interfaces). ### Neighbour and routing implementation ### The packet-transmission behaviour of the MCTP infrastructure relies on a single routing table to lookup both route and neighbour information. Entries in this table are of the format: | EID range | interface | physical address | metric | MTU | flags | expiry | |-----------|-----------|------------------|--------|-----|-------|--------| This table can be updated from two sources: * From userspace, via a netlink interface (see the [Utility and configuration interfaces](#utility-and-configuration-interfaces) section). * Directly within the kernel, when basic neighbour information is discovered. Kernel-originated routes are marked as such in the flags field, and have a maximum validity age, indicated by the expiry field. Kernel-discovered routing information can originate from two sources: * physical-to-EID mappings discovered through received packets * explicit endpoint physical-address resolution requests When a packet is to be transmitted to an EID that does not have an entry in the routing table, the kernel may attempt to resolve the physical address of that endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol (section 12.9 of DSP0236). The response message will be used to add a kernel-originated route into the routing table. This is the only kernel-internal usage of MCTP Control Protocol messages. ## Utility and configuration interfaces ## A small utility will be developed to control the state of the kernel MCTP stack. This will be similar in design to the 'iproute2' tools, which perform a similar function for the IPv4 and IPv6 protocols. The utility will be invoked as `mctp`, and provide subcommands for managing different aspects of the kernel stack. ### `mctp link`: manage interfaces ### ```sh mctp link set mctp link set network mctp link set mtu mctp link set bus-owner ``` ### `mctp network`: manage networks ### ```sh mctp network create mctp network set forwarding mctp network set default [] mctp network delete ``` ### `mctp address`: manage local EID assignments ### ```sh mctp address add dev mctp address del dev ``` ### `mctp route`: manage routing tables ### ```sh mctp route add net eid via [hwaddr ] [mtu ] [metric ] mctp route del net eid via [hwaddr ] [mtu ] [metric ] mctp route show [net ] ``` ### `mctp stat`: query socket status ### ```sh mctp stat ``` A set of netlink message formats will be defined to support these control functions. # Design points & alternatives considered # ## Including message-type byte in send/receive buffers ## This design specifies that message buffers passed to the kernel in send syscalls and from the kernel in receive syscalls will have the message type byte as the first byte of the buffer. This corresponds to the definition of a MCTP message payload in DSP0236. This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's superficially possible for the kernel to prepend this byte on send, and remove it on receive. However, the exact format of the MCTP message payload is not precisely defined by the specification. Particularly, any message integrity check data (which would also need to be appended / stripped in conjunction with the type byte) is defined by the type specification, not DSP0236. The kernel would need knowledge of all protocols in order to correctly deconstruct the payload data. Therefore, we transfer the message payload as-is to userspace, without any modification by the kernel. ## MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol ## This design specifies message-types to be passed in the `smctp_type` field of `struct sockaddr_mctp`. An alternative would be to pass it in the `protocol` argument of the `socket()` system call: ```c int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol); ``` The `smctp_type` implementation was chosen as it better matches the "addressing" model of the message type; sockets are bound to an incoming message type, similar to the IP protocol's model of binding UDP sockets to a local port number. There is no kernel behaviour that depends on the specific type (particularly given the design choice above), so it is not suited to use the protocol argument here. Future additions that perform protocol-specific message handling, and so alter the send/receive buffer format, may use a new protocol argument. ## Networks referenced by index rather than UUID ## This design proposes referencing networks by an integer index. The MCTP standard does optionally associate a RFC4122 UUID with a networks; it would be possible to use this UUID where we pass a network identifier. This approach does not incorporate knowledge of network UUIDs in the kernel. Given that the Get Network ID message in the MCTP Control Protocol is implemented entirely via userspace, it does not need to be aware of network UUIDs, and requiring network references (for example, the `smctp_network` field of `struct sockaddr_mctp`, as type `uuid_t`) complicates assignment. Instead, the index integer is used instead, in a similar fashion to the integer index used to reference `struct netdevice`s elsewhere in the network stack. ## Tag behaviour alternatives ## We considered *several* different designs for the tag handling behaviour. A brief overview of the more-feasible of those, and why they were rejected: ### Each socket is allocated a unique tag value on creation ### We could allocate a tag for each socket on creation, and use that value when a tag is required. This, however: * needlessly consumes a tag on non-tag-owning sockets (ie, those which send with TO=0 - responders); and * limits us to 8 sockets per network. ### Tags only used for message packetisation / reassembly ### An alternative would be to completely dissociate tag allocation from sockets; and only allocate a tag for the (short-lived) task of packetising a message, and sending those packets. Tags would be released when the last packet has been sent. However, this removes any facility to correlate responses with the correct socket, which is the purpose of the TO bit in DSP0236. In order for the sending application to receive the response, we would either need to: * limit the system to one socket of each message type (which, for example, precludes running a requester and a responder of the same type); or * forward all incoming messages of a specific message-type to all sockets listening on that type, making it trivial to eavesdrop on MCTP data of other applications ### Allocate a tag for one request/response pair ### Another alternative would be to allocate a tag on each outgoing TO=1 message, and then release that allocation after the incoming response to that tag (TO=0) is observed. However, MCTP protocols exist that do not have a 1:1 mapping of responses to requests - more than one response may be valid for a given request message. For example, in response to a request, a NVMe-MI implementation may send an in-progress reply before the final reply. In this case, we would release the tag after the first response is received, and then have no way to correlate the second message with the socket. Broadcast MCTP request messages may have multiple replies from multiple endpoints, meaning we cannot release the tag allocation on the first reply.