# OpenBMC in-kernel MCTP

Author: Jeremy Kerr <jk@codeconstruct.com.au>

Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
description, background and requirements.

This document describes a kernel-based implementation of MCTP infrastructure,
providing a sockets-based API for MCTP communication within an OpenBMC-based
platform.

## Requirements for a kernel implementation

- The MCTP messaging API should be an obvious application of the existing POSIX
  socket interface

- Configuration should be simple for a straightforward MCTP endpoint: a single
  network with a single local endpoint id (EID).

- Infrastructure should be flexible enough to allow for more complex MCTP
  networks, allowing:
  - each MCTP network (as defined by section 3.2.31 of DSP0236) may consist of
    multiple local physical interfaces, and/or multiple EIDs;

  - multiple distinct (ie., non-bridged) networks, possibly containing
    duplicated EIDs between networks;

  - multiple local EIDs on a single interface, and

  - customisable routing/bridging configurations within a network.

## Proposed Design

The design contains several components:

- An interface for userspace applications to send and receive MCTP messages: A
  mapping of the sockets API to MCTP usage

- Infrastructure for control and configuration of the MCTP network(s),
  consisting of a configuration utility, and a kernel messaging facility for
  this utility to use.

- Kernel drivers for physical interface bindings.

In general, the kernel components cover the transport functionality of MCTP,
such as message assembly/disassembly, packet forwarding, and physical interface
implementations.

Higher-level protocols (such as PLDM) are implemented in userspace, through the
introduced socket API.
This also includes the majority of the MCTP Control
Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
have a specific process to request and respond to control protocol messages.
However, the kernel will include a small subset of control protocol code to
allow very simple endpoints, with static EID allocations, to run without this
process. MCTP endpoints that require more than just single-endpoint
functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would
include the control message protocol process.

A new driver is introduced to handle each physical interface binding. These
drivers expose the appropriate `struct net_device` to handle transmission and
reception of MCTP packets on their associated hardware channels. Under Linux,
the namespace for these interfaces is separate from other network interfaces -
such as those for Ethernet.

### Structure: interfaces & networks

The kernel models the local MCTP topology through two items: interfaces and
networks.

An interface (or "link") is an instance of an MCTP physical transport binding
(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
device. This is represented as a `struct net_device`, and has a user-visible
name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
allow local loopback and/or virtual interfaces.

A network defines a unique address space for MCTP endpoints by endpoint-ID
(described by DSP0236, section 3.2.31). A network has a user-visible identifier
to allow references from userspace. Route definitions are specific to one
network.

Each interface is associated with exactly one network. A network may be
associated with one or more interfaces.

If multiple networks are present, each may contain EIDs that are also present on
other networks.

### Sockets API

#### Protocol definitions

We define a new address family (and corresponding protocol family) for MCTP:

```c
    #define AF_MCTP /* TBD */
    #define PF_MCTP AF_MCTP
```

MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
the domain. Currently, only a `SOCK_DGRAM` socket type is defined.

```c
    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
```

The only (current) value for the `protocol` argument is 0. Further protocol
values may be added later.

MCTP sockets opened with a protocol value of 0 will communicate directly at the
transport layer; message buffers received by the application will consist of
message data from reassembled MCTP packets, and will include the full message
including message type byte and optional message integrity check (IC).
Individual packet headers are not included; they may be accessible through a
future `SOCK_RAW` socket type.

As with all socket address families, source and destination addresses are
specified with a new `sockaddr` type:

```c
    /* Defined first, as struct sockaddr_mctp embeds it by value */
    struct mctp_addr {
        uint8_t             s_addr;
    };

    struct sockaddr_mctp {
        sa_family_t         smctp_family; /* = AF_MCTP */
        int                 smctp_network;
        struct mctp_addr    smctp_addr;
        uint8_t             smctp_type;
        uint8_t             smctp_tag;
    };

    /* MCTP network values */
    #define MCTP_NET_ANY        0

    /* MCTP EID values */
    #define MCTP_ADDR_ANY       0xff
    #define MCTP_ADDR_BCAST     0xff

    /* MCTP type values. Only the least-significant 7 bits of
     * smctp_type are used for tag matches; the specification defines
     * the type to be 7 bits.
     */
    #define MCTP_TYPE_MASK      0x7f

    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
    /* MCTP-spec-defined fields */
    #define MCTP_TAG_MASK       0x07
    #define MCTP_TAG_OWNER      0x08
    /* Others: reserved */

    /* Helpers */
    /* response to a request: clear TO, keep value */
    #define MCTP_TAG_RSP(x)     ((x) & MCTP_TAG_MASK)
```

#### Syscall behaviour

The following sections describe the MCTP-specific behaviours of the standard
socket system calls. These behaviours have been chosen to map closely to the
existing sockets APIs.

##### `bind()`: set local socket address

Sockets that receive incoming request packets will bind to a local address,
using the `bind()` syscall.

```c
    struct sockaddr_mctp addr;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
    addr.smctp_type = MCTP_TYPE_PLDM;
    addr.smctp_tag = MCTP_TAG_OWNER;

    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the local address of the socket. Incoming MCTP messages that
match the network, address, and message type will be received by this socket.
The reference to 'incoming' is important here; a bound socket will only receive
messages with the TO bit set, to indicate an incoming request message, rather
than a response.

The `smctp_tag` value will configure the tags accepted from the remote side of
this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
will result in remotely "owned" tags being routed to this socket. Since
`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are not
used; callers must set them to zero. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `bind()` will
fail with an errno of `EINVAL`.
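
The tag rule for `bind()` can be illustrated with a small, self-contained check.
This is only a sketch of the rule as described, not kernel code:
`mctp_bind_tag_check()` is a hypothetical helper name, and rejecting non-zero
value bits with `EINVAL` is an assumption (the text only requires callers to
zero them).

```c
#include <errno.h>
#include <stdint.h>

/* Tag definitions as given in this document */
#define MCTP_TAG_MASK   0x07
#define MCTP_TAG_OWNER  0x08

/* Hypothetical helper: returns 0 if smctp_tag is acceptable for bind(),
 * or -EINVAL otherwise. */
static int mctp_bind_tag_check(uint8_t smctp_tag)
{
    /* bound sockets only receive requests, so TO must be set */
    if (!(smctp_tag & MCTP_TAG_OWNER))
        return -EINVAL;

    /* assumption: non-zero tag value bits are rejected rather than ignored */
    if (smctp_tag & MCTP_TAG_MASK)
        return -EINVAL;

    return 0;
}
```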

A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
incoming packets from any locally-connected network. A specific network value
will cause the socket to only receive incoming messages from that network.

The `smctp_addr` field specifies a local address to bind to. A value of
`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any local
destination EID.

The `smctp_type` field specifies which message types to receive. Only the lower
7 bits of the type are matched on incoming messages (ie., the most-significant
IC bit is not part of the match). This results in the socket receiving packets
both with and without a message integrity check footer.

##### `connect()`: set remote socket address

Applications may specify a socket's remote address with the `connect()` syscall:

```c
    struct sockaddr_mctp addr;
    int rc;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_PLDM;

    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the remote address of a socket, used for future message
transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
traffic directly, but just sets the default destination for messages sent from
this socket.

The `smctp_network` field may specify a locally-attached network, or the value
`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
This is guaranteed to work for single-network configurations, but may require
additional routing definitions for endpoints attached to multiple distinct
networks. See the [Addressing](#addressing) section for details.

The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
the MCTP broadcast EID (0xff).

The `smctp_type` field specifies the type field of messages transferred over
this socket.

The `smctp_tag` value will configure the tag used for the local side of this
socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
"owned" tag being allocated for this socket. That tag remains allocated for all
future outgoing messages, until either the socket is closed, or `connect()` is
called again. If a tag cannot be allocated, `connect()` will report an error,
with an errno value of `EAGAIN`. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `connect()`
will fail with an errno of `EINVAL`.

Requesters which connect to a single responder will typically use `connect()` to
specify the peer address and tag for future outgoing messages.

##### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message

An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`,
`send()` or `write()` syscalls. Using `sendto()` as the primary example:

```c
    struct sockaddr_mctp addr;
    char buf[14];
    ssize_t len;

    /* set message destination */
    addr.smctp_family = AF_MCTP;
    addr.smctp_network = 0;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_ECHO;

    /* arbitrary message to send, with message-type header */
    buf[0] = MCTP_TYPE_ECHO;
    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);

    len = sendto(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, sizeof(addr));
```

The address argument is treated the same way as for `connect()`: The network and
address fields define the remote address to send to. If `smctp_tag` has the
`MCTP_TAG_OWNER` bit set, the kernel will ignore any tag value bits (those
covered by `MCTP_TAG_MASK`), and generate a tag value suitable for the
destination EID.
If `MCTP_TAG_OWNER` is
not set, the message will be sent with the tag value as specified. If a tag
value cannot be allocated, the system call will report an errno of `EAGAIN`.

The application must provide the message type byte as the first byte of the
message buffer passed to `sendto()`. If a message integrity check is to be
included in the transmitted message, it must also be provided in the message
buffer, and the most-significant bit of the message type byte must be 1.

If the first byte of the message does not match the message type value, then the
system call will return an error of `EPROTO`.

The `send()` and `write()` system calls behave in a similar way, but do not
specify a remote address. Therefore, `connect()` must be called beforehand; if
not, these calls will return an error of `EDESTADDRREQ` (Destination address
required).

Using `sendto()` or `sendmsg()` on a connected socket may override the remote
socket address specified in `connect()`. The `connect()` address and tag will
remain associated with the socket, for future unaddressed sends. The tag
allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
subject to the same invalidation logic as on an unconnected socket: It is
expired either by timeout or by a subsequent `sendto()`.

The `sendmsg()` system call allows a more compact argument interface, and the
message buffer to be specified as a scatter-gather list. At present no ancillary
message types (used for the `msg_control` data passed to `sendmsg()`) are
defined.

Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
will cause an allocation of a tag, if no valid tag is already allocated for that
destination. The (destination-eid,tag) tuple acts as an implicit local socket
address, to allow the socket to receive responses to this outgoing message.
If
any previous allocation has been performed (for a different remote EID), that
allocation is lost. This tag behaviour can be controlled through the
`MCTP_TAG_CONTROL` socket option.

Sockets will only receive responses to requests they have sent (with TO=1) and
may only respond (with TO=0) to requests they have received.

##### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message

An MCTP message can be received by an application using one of the `recvfrom()`,
`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
primary example:

```c
    struct sockaddr_mctp addr;
    socklen_t addrlen;
    char buf[14];
    ssize_t len;

    addrlen = sizeof(addr);

    len = recvfrom(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, &addrlen);

    /* We can expect addr to describe an MCTP address */
    assert(addrlen >= sizeof(addr));
    assert(addr.smctp_family == AF_MCTP);

    printf("received %zd bytes from remote EID %d\n", len,
            addr.smctp_addr.s_addr);
```

The address argument to `recvfrom()` and `recvmsg()` is populated with the
remote address of the incoming message, including tag value (this will be needed
in order to reply to the message).

The first byte of the message buffer will contain the message type byte. If an
integrity check follows the message, it will be included in the received buffer.

The `recv()` and `read()` system calls behave in a similar way, but do not
provide a remote address to the application. Therefore, these are only useful if
the remote address is already known, or the message does not require a reply.

Like the send calls, sockets will only receive responses to requests they have
sent (TO=1) and may only respond (TO=0) to requests they have received.
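
As an illustration of how a responder might turn the address filled in by
`recvfrom()` into a reply destination, the following sketch copies the request
address and clears the TO bit while keeping the tag value. The structures and
tag macros repeat the definitions proposed in this document so that the snippet
stands alone; `mctp_reply_addr()` is a hypothetical helper name, not part of the
proposed API.

```c
#include <stdint.h>
#include <sys/socket.h>     /* sa_family_t */

/* Definitions repeated from the protocol definitions in this document */
struct mctp_addr {
    uint8_t             s_addr;
};

struct sockaddr_mctp {
    sa_family_t         smctp_family;
    int                 smctp_network;
    struct mctp_addr    smctp_addr;
    uint8_t             smctp_type;
    uint8_t             smctp_tag;
};

#define MCTP_TAG_MASK   0x07
#define MCTP_TAG_OWNER  0x08

/* Hypothetical helper: build the destination address for a response.
 * Network, EID and type are reused as-is; only the TO bit is cleared,
 * so the requester can match the reply against its allocated tag. */
static struct sockaddr_mctp mctp_reply_addr(const struct sockaddr_mctp *req)
{
    struct sockaddr_mctp rsp = *req;

    rsp.smctp_tag = req->smctp_tag & MCTP_TAG_MASK;
    return rsp;
}
```

The returned structure would then be passed to `sendto()` as the destination
address for the response.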

##### `getsockname()` & `getpeername()`: query local/remote socket address

The `getsockname()` system call returns the `struct sockaddr_mctp` value for the
local side of this socket, `getpeername()` for the remote (ie, that used in a
`connect()`). Since the tag value is a property of the remote address,
`getpeername()` may be used to retrieve a kernel-allocated tag value.

Calling `getpeername()` on an unconnected socket will result in an error of
`ENOTCONN`.

##### Socket options

The following socket options are defined for MCTP sockets:

###### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg

Enabling this socket option allows an application to specify extended addressing
information on transmitted packets, and access the same on received packets.

When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify
an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls.
This is defined as:

```c
    struct sockaddr_mctp_ext {
        /* fields exactly match struct sockaddr_mctp */
        sa_family_t         smctp_family; /* = AF_MCTP */
        int                 smctp_network;
        struct mctp_addr    smctp_addr;
        uint8_t             smctp_type;
        uint8_t             smctp_tag;
        /* extended addressing */
        int                 smctp_ifindex;
        uint8_t             smctp_halen;
        unsigned char       smctp_haddr[/* TBD */];
    };
```

If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to
contain this larger structure, then the extended addressing fields are consumed
/ populated respectively.
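
The length-based selection between the basic and extended address can be
sketched as follows. This is an illustration only: the hardware address size is
still TBD in this design, so `EXAMPLE_HADDR_MAX` is an arbitrary placeholder,
`mctp_addr_has_ext()` is a hypothetical helper, and the leading fields of the
extended structure are assumed to mirror `struct sockaddr_mctp` exactly.

```c
#include <stdint.h>
#include <sys/socket.h>     /* sa_family_t, socklen_t */

#define EXAMPLE_HADDR_MAX 32 /* placeholder: real size is TBD in this design */

struct mctp_addr {
    uint8_t             s_addr;
};

struct sockaddr_mctp {
    sa_family_t         smctp_family;
    int                 smctp_network;
    struct mctp_addr    smctp_addr;
    uint8_t             smctp_type;
    uint8_t             smctp_tag;
};

struct sockaddr_mctp_ext {
    /* assumed to match struct sockaddr_mctp field-for-field */
    sa_family_t         smctp_family;
    int                 smctp_network;
    struct mctp_addr    smctp_addr;
    uint8_t             smctp_type;
    uint8_t             smctp_tag;
    /* extended addressing */
    int                 smctp_ifindex;
    uint8_t             smctp_halen;
    unsigned char       smctp_haddr[EXAMPLE_HADDR_MAX];
};

/* Hypothetical helper: non-zero if an address buffer of addrlen bytes is
 * large enough to carry the extended addressing fields. */
static int mctp_addr_has_ext(socklen_t addrlen)
{
    return addrlen >= sizeof(struct sockaddr_mctp_ext);
}
```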

###### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour

The set/getsockopt argument is a `mctp_tagctl` structure:

```c
    struct mctp_tagctl {
        bool            retain;
        struct timespec timeout;
    };
```

This allows an application to control the behaviour of allocated tags for
non-connected sockets when transferring messages to multiple different
destinations (ie., where a `struct sockaddr_mctp` is provided for individual
messages, and the `smctp_addr` destination for those sockets may vary across
calls).

The `retain` flag indicates to the kernel that the socket should not release tag
allocations when a message is sent to a new destination EID. This causes the
socket to continue to receive incoming messages to the old (dest,tag) tuple, in
addition to the new tuple.

The `timeout` value specifies a maximum amount of time to retain tag values.
This should be based on the reply timeout for any upper-level protocol.

The kernel may reject a request to set values that would cause excessive tag
allocation by this socket. The kernel may also reject subsequent tag-allocation
requests (through send or connect syscalls) which would cause excessive tags to
be consumed by the socket, even though the tag control settings were accepted in
the setsockopt operation.

Changing the default tag control behaviour should only be required when:

- the socket is sending messages with TO=1 (ie, is a requester); and
- messages are sent to multiple different destination EIDs from the one socket.
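
As a small illustration, an application that knows its upper-level protocol's
reply timeout in milliseconds could populate the option argument as below.
`mctp_tagctl_from_ms()` is a hypothetical helper, and the structure repeats the
definition above so that the sketch is self-contained.

```c
#include <stdbool.h>
#include <time.h>

struct mctp_tagctl {
    bool            retain;
    struct timespec timeout;
};

/* Hypothetical helper: build a tag-control argument from a reply timeout
 * expressed in milliseconds. */
static struct mctp_tagctl mctp_tagctl_from_ms(bool retain, long timeout_ms)
{
    struct mctp_tagctl ctl;

    ctl.retain = retain;
    ctl.timeout.tv_sec = timeout_ms / 1000;
    ctl.timeout.tv_nsec = (timeout_ms % 1000) * 1000000L;
    return ctl;
}
```

The result would be passed to `setsockopt()` with the `MCTP_TAG_CONTROL` option
name; the socket option level is not defined in this document, so that call is
omitted here.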

##### Syscalls not implemented

The following system calls are not implemented for MCTP, primarily as they are
not used in `SOCK_DGRAM`-type sockets:

- `listen()`
- `accept()`
- `ioctl()`
- `shutdown()`
- `mmap()`

#### Userspace examples

These examples cover three general use-cases:

- **requester**: sends requests to a particular (EID, type) target, and receives
  responses to those packets

  This is similar to a typical UDP client

- **responder**: receives all locally-addressed messages of a specific
  message-type, and responds to the requester immediately.

  This is similar to a typical UDP server

- **controller**: a specific service for a bus owner; may send broadcast
  messages, manage EID allocations, update local MCTP stack state. Will need
  low-level packet data.

  This is similar to a DHCP server.

##### Requester

"Client"-side implementation to send requests to a responder, and receive a
response. This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.

```c
    #include <assert.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    /* plus the MCTP definitions from "Protocol definitions" above */

    int main(void) {
        struct sockaddr_mctp addr;
        socklen_t addrlen;
        struct {
            uint8_t type;
            uint8_t data[14];
        } msg;
        int sd, rc;

        sd = socket(AF_MCTP, SOCK_DGRAM, 0);
        if (sd < 0)
            err(EXIT_FAILURE, "socket");

        addr.smctp_family = AF_MCTP;
        addr.smctp_network = MCTP_NET_ANY; /* any network */
        addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
        addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
        addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
        addrlen = sizeof(addr);

        /* set message type and payload */
        msg.type = MCTP_TYPE_ECHO;
        strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));

        /* send message */
        rc = sendto(sd, &msg, sizeof(msg), 0,
                        (struct sockaddr *)&addr, addrlen);

        if (rc < 0)
            err(EXIT_FAILURE, "sendto");

        /* Receive reply.
This will block until a reply arrives,
         * which may never happen. Actual code would need a timeout
         * here. */
        rc = recvfrom(sd, &msg, sizeof(msg), 0,
                    (struct sockaddr *)&addr, &addrlen);
        if (rc < 0)
            err(EXIT_FAILURE, "recvfrom");

        assert(msg.type == MCTP_TYPE_ECHO);
        /* ensure we're nul-terminated */
        msg.data[sizeof(msg.data)-1] = '\0';

        printf("reply: %s\n", msg.data);

        return EXIT_SUCCESS;
    }
```

##### Responder

"Server"-side implementation to receive requests and respond. Like the client,
this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the
`struct sockaddr_mctp`; only messages matching this type will be received.

```c
    #include <assert.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    /* plus the MCTP definitions from "Protocol definitions" above */

    int main(void) {
        struct sockaddr_mctp addr;
        socklen_t addrlen;
        int sd, rc;

        sd = socket(AF_MCTP, SOCK_DGRAM, 0);
        if (sd < 0)
            err(EXIT_FAILURE, "socket");

        addr.smctp_family = AF_MCTP;
        addr.smctp_network = MCTP_NET_ANY; /* any network */
        addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
        addr.smctp_type = MCTP_TYPE_ECHO;
        addr.smctp_tag = MCTP_TAG_OWNER;
        addrlen = sizeof(addr);

        rc = bind(sd, (struct sockaddr *)&addr, addrlen);
        if (rc)
            err(EXIT_FAILURE, "bind");

        for (;;) {
            struct {
                uint8_t type;
                uint8_t data[14];
            } msg;

            /* recvfrom() updates addrlen, so reset it for each message */
            addrlen = sizeof(addr);

            rc = recvfrom(sd, &msg, sizeof(msg), 0,
                            (struct sockaddr *)&addr, &addrlen);
            if (rc < 0)
                err(EXIT_FAILURE, "recvfrom");
            if (rc < 1) {
                warnx("not enough data for a message type");
                continue;
            }

            assert(addrlen == sizeof(addr));
            assert(msg.type == MCTP_TYPE_ECHO);

            printf("%d bytes from EID %d\n", rc, addr.smctp_addr.s_addr);

            /* Reply to requester; this macro just clears the TO-bit.
             * Other addr fields will describe the remote endpoint,
             * so use those as-is.
             */
            addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);

            rc = sendto(sd, &msg, rc, 0,
                        (struct sockaddr *)&addr, addrlen);
            if (rc < 0)
                err(EXIT_FAILURE, "sendto");
        }

        return EXIT_SUCCESS;
    }
```

##### Broadcast request

Sends a request to a broadcast EID, and receives (unicast) replies. Typical
control protocol pattern.

```c
    #include <assert.h>
    #include <err.h>
    #include <poll.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <time.h>
    /* plus the MCTP definitions from "Protocol definitions" above */

    int main(void) {
        struct sockaddr_mctp txaddr, rxaddr;
        struct timespec start, cur;
        struct pollfd pollfds[1];
        socklen_t addrlen;
        uint8_t buf[2];
        int sd, rc, timeout;

        sd = socket(AF_MCTP, SOCK_DGRAM, 0);
        if (sd < 0)
            err(EXIT_FAILURE, "socket");

        /* destination address setup */
        txaddr.smctp_family = AF_MCTP;
        txaddr.smctp_network = 1; /* specific network required for broadcast */
        txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
        txaddr.smctp_type = MCTP_TYPE_CONTROL;
        txaddr.smctp_tag = MCTP_TAG_OWNER;

        buf[0] = MCTP_TYPE_CONTROL;
        buf[1] = 'a';

        /* We're doing a sendto() to a broadcast address here. If we were
         * sending more than one broadcast message, we'd be better off
         * doing connect(); sendto();, in order to retain the tag
         * reservation across all transmitted messages. However, since this
         * is a single transmit, that makes no difference in this
         * particular case.
         */
        rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr,
                        sizeof(txaddr));
        if (rc < 0)
            err(EXIT_FAILURE, "sendto");

        /* Set up poll behaviour, and record our starting time for
         * reply timeouts */
        pollfds[0].fd = sd;
        pollfds[0].events = POLLIN;
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (;;) {
            /* Calculate the amount of time left for replies;
             * calculate_timeout() is a helper that is not shown here */
            clock_gettime(CLOCK_MONOTONIC, &cur);
            timeout = calculate_timeout(&start, &cur, 1000);

            rc = poll(pollfds, 1, timeout);
            if (rc < 0)
                err(EXIT_FAILURE, "poll");

            /* timeout receiving a reply?
*/
            if (rc == 0)
                break;

            /* sanity check that we have a message to receive */
            if (!(pollfds[0].revents & POLLIN))
                break;

            addrlen = sizeof(rxaddr);

            rc = recvfrom(sd, buf, sizeof(buf), 0,
                            (struct sockaddr *)&rxaddr, &addrlen);
            if (rc < 0)
                err(EXIT_FAILURE, "recvfrom");

            assert(addrlen >= sizeof(rxaddr));
            assert(rxaddr.smctp_family == AF_MCTP);

            printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
        }

        return EXIT_SUCCESS;
    }
```

#### Implementation notes

##### Addressing

Transmitted messages (through `sendto()` and related system calls) specify their
destination via the `smctp_network` and `smctp_addr` fields of a
`struct sockaddr_mctp`.

The `smctp_addr` field maps directly to the destination endpoint's EID.

The `smctp_network` field specifies a locally defined network identifier. To
simplify situations where there is only one network defined, the special value
`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
network for transmission.

This selection is entirely user-configured; one specific network may be defined
as the system default, in which case it will be used for all message
transmission where `MCTP_NET_ANY` is used as the destination network.

In particular, the destination EID is never used to select a destination
network.

MCTP responders should use the EID and network values of an incoming request to
specify the destination for any responses.

##### Bridging/routing

The network and interface structure allows multiple interfaces to share a common
network. By default, packets are not forwarded between interfaces.

A network can be configured for "forwarding" mode. In this mode, packets may be
forwarded if their destination EID is non-local, and matches a route for another
interface on the same network.

As per DSP0236, packet reassembly does not occur during the forwarding process.
If the packet is larger than the MTU for the destination interface/route, then
the packet is dropped.

##### Tag behaviour for transmitted messages

On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
must allocate a tag that will uniquely identify responses over a (destination
EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.

To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
field will cause the kernel to allocate a unique tag for subsequent replies from
that specific remote EID.

This allocation will expire when any of the following occur:

- the socket is closed
- a new message is sent to a new destination EID
- an implementation-defined timeout expires

Because the "tag space" is limited, it may not be possible for the kernel to
allocate a unique tag for the outgoing message. In this case, the `sendto()`
call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
a local port cannot be allocated for an outgoing message.

The implementation-defined timeout value shall be chosen to reasonably cover
standard reply timeouts. If necessary, this timeout may be modified through the
`MCTP_TAG_CONTROL` socket option.

Applications that expect to perform an ongoing message exchange with a
particular destination address may use the `connect()` call to set a persistent
remote address. In this case, the tag will be allocated during `connect()`, and
remain reserved for this socket until any of the following occur:

- the socket is closed
- the remote address is changed through another call to `connect()`.

In particular, calling `sendto()` with a different address does not release the
tag reservation.
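
The limited tag space can be modelled with a very small allocator. The sketch
below is a simplified illustration, not the proposed kernel implementation: it
assumes a single local EID (so tags are tracked per destination EID only),
ignores timeouts and broadcast, and all names (`struct tag_table`,
`tag_alloc()`, `tag_release()`) are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define EXAMPLE_TAG_COUNT 8 /* the tag value is 3 bits in size */

/* One bitmask of in-use tags per destination EID */
struct tag_table {
    uint8_t in_use[256];
};

/* Allocate the lowest free tag for dest_eid; returns the tag value (0-7),
 * or -EAGAIN if all eight tags for that destination are in use. */
static int tag_alloc(struct tag_table *t, uint8_t dest_eid)
{
    for (int tag = 0; tag < EXAMPLE_TAG_COUNT; tag++) {
        if (!(t->in_use[dest_eid] & (1u << tag))) {
            t->in_use[dest_eid] |= (uint8_t)(1u << tag);
            return tag;
        }
    }
    return -EAGAIN;
}

/* Release a tag, for example on socket close or timeout expiry */
static void tag_release(struct tag_table *t, uint8_t dest_eid, int tag)
{
    t->in_use[dest_eid] &= (uint8_t)~(1u << tag);
}
```

With this model, a ninth outstanding request to the same destination fails with
`-EAGAIN`, mirroring the `sendto()` failure described above, while requests to
other destinations are unaffected.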

Broadcast messages are particularly onerous for tag reservations. When a message
is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
reserve the tag across the entire range of possible EIDs. Therefore, a
particular tag value must be currently-unused across all EIDs to allow a
`sendto()` to a broadcast address. Additionally, this reservation is not cleared
when a reply is received, as there may be multiple replies to a broadcast.

For this reason, applications wanting to send to the broadcast address should
use the `connect()` system call to reserve a tag, and guarantee its availability
for future message transmission. Note that this will remove the tag value from
use with _any other EID_. Sending to the broadcast address should be avoided; we
expect few applications will need this functionality.

##### MCTP Control Protocol implementation

Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
implementation would exist as a userspace process, `mctpd`. This process is
responsible for responding to incoming control protocol messages, any dynamic
EID allocations (for bus owner devices) and maintaining the MCTP route table
(for bridging devices).

This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
data on incoming control protocol requests. It would interact with the kernel's
route table via a netlink interface - the same as that implemented for the
[Utility and configuration interfaces](#utility-and-configuration-interfaces).

#### Neighbour and routing implementation

The packet-transmission behaviour of the MCTP infrastructure relies on a single
routing table to look up both route and neighbour information.
Entries in this
table are of the format:

| EID range | interface | physical address | metric | MTU | flags | expiry |
| --------- | --------- | ---------------- | ------ | --- | ----- | ------ |

This table can be updated from two sources:

- From userspace, via a netlink interface (see the
  [Utility and configuration interfaces](#utility-and-configuration-interfaces)
  section).

- Directly within the kernel, when basic neighbour information is discovered.
  Kernel-originated routes are marked as such in the flags field, and have a
  maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

- physical-to-EID mappings discovered through received packets

- explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the
routing table, the kernel may attempt to resolve the physical address of that
endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
(section 12.9 of DSP0236). The response message will be used to add a
kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.

### Utility and configuration interfaces

A small utility will be developed to control the state of the kernel MCTP stack.
This will be similar in design to the 'iproute2' tools, which perform a similar
function for the IPv4 and IPv6 protocols.

The utility will be invoked as `mctp`, and provide subcommands for managing
different aspects of the kernel stack.

#### `mctp link`: manage interfaces

```sh
    mctp link set <link> <up|down>
    mctp link set <link> network <network-id>
    mctp link set <link> mtu <mtu>
    mctp link set <link> bus-owner <hwaddr>
```

#### `mctp network`: manage networks

```sh
    mctp network create <network-id>
    mctp network set <network-id> forwarding <on|off>
    mctp network set <network-id> default [<true|false>]
    mctp network delete <network-id>
```

#### `mctp address`: manage local EID assignments

```sh
    mctp address add <eid> dev <link>
    mctp address del <eid> dev <link>
```

#### `mctp route`: manage routing tables

```sh
    mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
    mctp route show [net <network-id>]
```

#### `mctp stat`: query socket status

```sh
    mctp stat
```

A set of netlink message formats will be defined to support these control
functions.

## Design points & alternatives considered

### Including message-type byte in send/receive buffers

This design specifies that message buffers passed to the kernel in send syscalls
and from the kernel in receive syscalls will have the message type byte as the
first byte of the buffer. This corresponds to the definition of an MCTP message
payload in DSP0236.

This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's
superficially possible for the kernel to prepend this byte on send, and remove
it on receive.

However, the exact format of the MCTP message payload is not precisely defined
by the specification.
In particular, any message integrity check data (which
would also need to be appended / stripped in conjunction with the type byte) is
defined by the type specification, not DSP0236. The kernel would need knowledge
of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any
modification by the kernel.

### MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol

This design specifies message-types to be passed in the `smctp_type` field of
`struct sockaddr_mctp`. An alternative would be to pass it in the `protocol`
argument of the `socket()` system call:

```c
int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
```

The `smctp_type` implementation was chosen as it better matches the "addressing"
model of the message type; sockets are bound to an incoming message type,
similar to the IP protocol's model of binding UDP sockets to a local port
number.

There is no kernel behaviour that depends on the specific type (particularly
given the design choice above), so the `protocol` argument is not well suited
here.

Future additions that perform protocol-specific message handling, and so alter
the send/receive buffer format, may use a new protocol argument.

### Networks referenced by index rather than UUID

This design proposes referencing networks by an integer index. The MCTP standard
does optionally associate an RFC 4122 UUID with a network; it would be possible
to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel.
Given that the Get Network ID message in the MCTP Control Protocol is
implemented entirely in userspace, the kernel does not need to be aware of
network UUIDs, and requiring network references (for example, the
`smctp_network` field of `struct sockaddr_mctp`) to be of type `uuid_t`
complicates assignment.

Instead, an integer index is used, in a similar fashion to the integer index
used to reference `struct net_device`s elsewhere in the network stack.

### Tag behaviour alternatives

We considered _several_ different designs for the tag handling behaviour. A
brief overview of the more-feasible of those, and why they were rejected:

#### Each socket is allocated a unique tag value on creation

We could allocate a tag for each socket on creation, and use that value when a
tag is required. This, however:

- needlessly consumes a tag on non-tag-owning sockets (ie, those which send with
  TO=0 - responders); and

- limits us to 8 sockets per network.

#### Tags only used for message packetisation / reassembly

An alternative would be to completely dissociate tag allocation from sockets,
and only allocate a tag for the (short-lived) task of packetising a message and
sending those packets. Tags would be released when the last packet has been
sent.

However, this removes any facility to correlate responses with the correct
socket, which is the purpose of the TO bit in DSP0236.
In order for the sending
application to receive the response, we would need to either:

- limit the system to one socket of each message type (which, for example,
  precludes running a requester and a responder of the same type); or

- forward all incoming messages of a specific message-type to all sockets
  listening on that type, making it trivial to eavesdrop on MCTP data of other
  applications.

#### Allocate a tag for one request/response pair

Another alternative would be to allocate a tag on each outgoing TO=1 message,
and then release that allocation after the incoming response to that tag (TO=0)
is observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to
requests - more than one response may be valid for a given request message. For
example, in response to a request, an NVMe-MI implementation may send an
in-progress reply before the final reply. In this case, we would release the tag
after the first response is received, and then have no way to correlate the
second message with the socket.

Similarly, broadcast MCTP request messages may have multiple replies from
multiple endpoints, meaning we cannot release the tag allocation on the first
reply.
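
The failure mode of the "one request/response pair" alternative can be shown
with a small C model. This is only a sketch of the rejected policy, not
proposed kernel code: the `tag_table` structure, the integer socket ids, and
all function names are hypothetical. The eight-entry table reflects the limit
of 8 tag values noted above.

```c
#include <assert.h>
#include <stddef.h>

#define MCTP_NUM_TAGS 8 /* 8 distinct tag values are available */

/* Hypothetical model: which socket (if any) currently owns each tag. */
struct tag_table {
    int owner[MCTP_NUM_TAGS]; /* socket id, or -1 when the tag is free */
};

void tag_table_init(struct tag_table *t)
{
    for (size_t i = 0; i < MCTP_NUM_TAGS; i++)
        t->owner[i] = -1;
}

/* Allocate a free tag for an outgoing TO=1 message; -1 if none is free. */
int tag_alloc(struct tag_table *t, int sock)
{
    for (int tag = 0; tag < MCTP_NUM_TAGS; tag++) {
        if (t->owner[tag] < 0) {
            t->owner[tag] = sock;
            return tag;
        }
    }
    return -1;
}

/*
 * Handle an incoming TO=0 response under the rejected policy: find the
 * owning socket, then release the tag immediately. Returns the socket id,
 * or -1 if no socket owns the tag.
 */
int tag_deliver_and_release(struct tag_table *t, int tag)
{
    int sock;

    if (tag < 0 || tag >= MCTP_NUM_TAGS)
        return -1;

    sock = t->owner[tag];
    t->owner[tag] = -1;
    return sock;
}

/* Demo: a second response to the same request can no longer be matched. */
void demo_lost_second_response(void)
{
    struct tag_table t;
    int tag;

    tag_table_init(&t);
    tag = tag_alloc(&t, 42);
    assert(tag >= 0);
    assert(tag_deliver_and_release(&t, tag) == 42); /* in-progress reply */
    assert(tag_deliver_and_release(&t, tag) == -1); /* final reply: lost */
}
```

The second call returns -1 because the allocation was dropped after the first
reply, which is the correlation loss described for the NVMe-MI and broadcast
cases.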