# OpenBMC in-kernel MCTP

Author: Jeremy Kerr `<jk@codeconstruct.com.au>`

Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
description, background and requirements.

This document describes a kernel-based implementation of MCTP infrastructure,
providing a sockets-based API for MCTP communication within an OpenBMC-based
platform.

# Requirements for a kernel implementation

- The MCTP messaging API should be an obvious application of the existing POSIX
  socket interface

- Configuration should be simple for a straightforward MCTP endpoint: a single
  network with a single local endpoint id (EID).

- Infrastructure should be flexible enough to allow for more complex MCTP
  networks, allowing:

  - each MCTP network (as defined by section 3.2.31 of DSP0236) may consist of
    multiple local physical interfaces, and/or multiple EIDs;

  - multiple distinct (ie., non-bridged) networks, possibly containing
    duplicated EIDs between networks;

  - multiple local EIDs on a single interface, and

  - customisable routing/bridging configurations within a network.

# Proposed Design

The design contains several components:

- An interface for userspace applications to send and receive MCTP messages: A
  mapping of the sockets API to MCTP usage

- Infrastructure for control and configuration of the MCTP network(s),
  consisting of a configuration utility, and a kernel messaging facility for
  this utility to use.

- Kernel drivers for physical interface bindings.

In general, the kernel components cover the transport functionality of MCTP,
such as message assembly/disassembly, packet forwarding, and physical interface
implementations.

Higher-level protocols (such as PLDM) are implemented in userspace, through the
introduced socket API.
This also includes the majority of the MCTP Control
Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
have a specific process to request and respond to control protocol messages.
However, the kernel will include a small subset of control protocol code to
allow very simple endpoints, with static EID allocations, to run without this
process. MCTP endpoints that require more than just single-endpoint
functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would
include the control message protocol process.

A new driver is introduced to handle each physical interface binding. These
drivers expose the appropriate `struct net_device` to handle transmission and
reception of MCTP packets on their associated hardware channels. Under Linux,
the namespace for these interfaces is separate from other network interfaces -
such as those for ethernet.

## Structure: interfaces & networks

The kernel models the local MCTP topology through two items: interfaces and
networks.

An interface (or "link") is an instance of an MCTP physical transport binding
(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
device. This is represented as a `struct net_device`, and has a user-visible
name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
allow local loopback and/or virtual interfaces.

A network defines a unique address space for MCTP endpoints by endpoint-ID
(described by DSP0236, section 3.2.31). A network has a user-visible identifier
to allow references from userspace. Route definitions are specific to one
network.

Interfaces are associated with one network. A network may be associated with one
or more interfaces.

If multiple networks are present, each may contain EIDs that are also present on
other networks.
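
The interface/network relationship can be illustrated with a small data model.
This is only a sketch for discussion: the type and function names below
(`struct mctp_link_model`, `struct mctp_endpoint_model`,
`mctp_endpoint_equal()`, `mctp_net_link_count()`) are hypothetical and are not
part of the proposed kernel interface. It shows that an endpoint is identified
by the (network, EID) pair, so the same EID on two networks refers to two
different endpoints, and that a network may have several links attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an interface ("link"): attached to exactly one network. */
struct mctp_link_model {
    int ifindex;       /* user-visible interface index */
    unsigned int net;  /* identifier of the network this link belongs to */
};

/* Hypothetical endpoint identity: an EID is only unique within its network. */
struct mctp_endpoint_model {
    unsigned int net;
    uint8_t eid;
};

/* Two endpoints are the same only if both the network and the EID match. */
static bool mctp_endpoint_equal(struct mctp_endpoint_model a,
                                struct mctp_endpoint_model b)
{
    return a.net == b.net && a.eid == b.eid;
}

/* Count the links attached to a network; a network may have one or more. */
static unsigned int mctp_net_link_count(const struct mctp_link_model *links,
                                        unsigned int n_links, unsigned int net)
{
    unsigned int i, count = 0;

    assert(links != NULL || n_links == 0);
    for (i = 0; i < n_links; i++)
        if (links[i].net == net)
            count++;
    return count;
}
```

In this model, EID 9 on network 1 and EID 9 on network 2 compare as different
endpoints, which is why route definitions are kept per network.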

## Sockets API

### Protocol definitions

We define a new address family (and corresponding protocol family) for MCTP:

```c
#define AF_MCTP /* TBD */
#define PF_MCTP AF_MCTP
```

MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
the domain. Currently, only a `SOCK_DGRAM` socket type is defined.

```c
int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
```

The only (current) value for the `protocol` argument is 0. Future protocol
implementations may be added later.

MCTP sockets opened with a protocol value of 0 will communicate directly at the
transport layer; message buffers received by the application will consist of
message data from reassembled MCTP packets, and will include the full message
including message type byte and optional message integrity check (IC).
Individual packet headers are not included; they may be accessible through a
future `SOCK_RAW` socket type.

As with all socket address families, source and destination addresses are
specified with a new `sockaddr` type:

```c
struct mctp_addr {
    uint8_t s_addr;
};

struct sockaddr_mctp {
    sa_family_t      smctp_family; /* = AF_MCTP */
    int              smctp_network;
    struct mctp_addr smctp_addr;
    uint8_t          smctp_type;
    uint8_t          smctp_tag;
};

/* MCTP network values */
#define MCTP_NET_ANY 0

/* MCTP EID values */
#define MCTP_ADDR_ANY   0xff
#define MCTP_ADDR_BCAST 0xff

/* MCTP type values. Only the least-significant 7 bits of
 * smctp_type are used for tag matches; the specification defines
 * the type to be 7 bits.
 */
#define MCTP_TYPE_MASK 0x7f

/* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
/* MCTP-spec-defined fields */
#define MCTP_TAG_MASK  0x07
#define MCTP_TAG_OWNER 0x08
/* Others: reserved */

/* Helpers */
/* response to a request: clear TO, keep value */
#define MCTP_TAG_RSP(x) ((x) & MCTP_TAG_MASK)
```

### Syscall behaviour

The following sections describe the MCTP-specific behaviours of the standard
socket system calls. These behaviours have been chosen to map closely to the
existing sockets APIs.

#### `bind()`: set local socket address

Sockets that receive incoming request packets will bind to a local address,
using the `bind()` syscall.

```c
struct sockaddr_mctp addr;

addr.smctp_family = AF_MCTP;
addr.smctp_network = MCTP_NET_ANY;
addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
addr.smctp_type = MCTP_TYPE_PLDM;
addr.smctp_tag = MCTP_TAG_OWNER;

int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the local address of the socket. Incoming MCTP messages that
match the network, address, and message type will be received by this socket.
The reference to 'incoming' is important here; a bound socket will only receive
messages with the TO bit set, to indicate an incoming request message, rather
than a response.

The `smctp_tag` value will configure the tags accepted from the remote side of
this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
will result in remotely "owned" tags being routed to this socket. Since
`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are not
used; callers must set them to zero. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `bind()` will
fail with an errno of `EINVAL`.
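
The tag rules for `bind()` and for replies can be expressed as two small
helpers. This is a hedged sketch rather than the kernel's implementation: the
function names `mctp_bind_tag_check()` and `mctp_reply_tag()` are hypothetical,
the macro values are copied from the definitions above, and rejecting non-zero
value bits with `EINVAL` is an assumption (the text only requires callers to
set them to zero).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values copied from the sockaddr_mctp definitions in this document. */
#define MCTP_TAG_MASK  0x07
#define MCTP_TAG_OWNER 0x08

/* Hypothetical check of the smctp_tag value passed to bind().
 * Returns 0 if acceptable, or a negative errno value otherwise. */
static int mctp_bind_tag_check(uint8_t tag)
{
    /* The tag-owner bit must be set: bound sockets receive requests only. */
    if (!(tag & MCTP_TAG_OWNER))
        return -EINVAL;

    /* Assumption: non-zero tag value bits are rejected as well. */
    if (tag & MCTP_TAG_MASK)
        return -EINVAL;

    return 0;
}

/* Build the tag for a reply: keep the 3-bit value, clear the tag-owner bit.
 * Equivalent to the MCTP_TAG_RSP() helper macro. */
static uint8_t mctp_reply_tag(uint8_t request_tag)
{
    /* A received request is expected to carry TO=1. */
    assert(request_tag & MCTP_TAG_OWNER);
    return request_tag & MCTP_TAG_MASK;
}
```

A responder would apply `mctp_reply_tag()` (or `MCTP_TAG_RSP()`) to the tag
reported by `recvfrom()` before passing the address back to `sendto()`.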

A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
incoming packets from any locally-connected network. A specific network value
will cause the socket to only receive incoming messages from that network.

The `smctp_addr` field specifies a local address to bind to. A value of
`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any local
destination EID.

The `smctp_type` field specifies which message types to receive. Only the lower
7 bits of the type are matched on incoming messages (ie., the most-significant
IC bit is not part of the match). This results in the socket receiving packets
with and without a message integrity check footer.

#### `connect()`: set remote socket address

Applications may specify a socket's remote address with the `connect()` syscall:

```c
struct sockaddr_mctp addr;
int rc;

addr.smctp_family = AF_MCTP;
addr.smctp_network = MCTP_NET_ANY;
addr.smctp_addr.s_addr = 8;
addr.smctp_tag = MCTP_TAG_OWNER;
addr.smctp_type = MCTP_TYPE_PLDM;

rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the remote address of a socket, used for future message
transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
traffic directly, but just sets the default destination for messages sent from
this socket.

The `smctp_network` field may specify a locally-attached network, or the value
`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
This is guaranteed to work for single-network configurations, but may require
additional routing definitions for endpoints attached to multiple distinct
networks. See the [Addressing](#addressing) section for details.

The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
the MCTP broadcast EID (0xff).

The `smctp_type` field specifies the type field of messages transferred over
this socket.

The `smctp_tag` value will configure the tag used for the local side of this
socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
"owned" tag being allocated for this socket. The tag will remain allocated for
all future outgoing messages, until either the socket is closed, or `connect()`
is called again. If a tag cannot be allocated, `connect()` will report an error,
with an errno value of `EAGAIN`. See the
[Tag behaviour for transmitted messages](#tag-behaviour-for-transmitted-messages)
section for more details. If the `MCTP_TAG_OWNER` bit is not set, `connect()`
will fail with an errno of `EINVAL`.

Requesters which connect to a single responder will typically use `connect()` to
specify the peer address and tag for future outgoing messages.

#### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message

An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`,
`send()` or `write()` syscalls. Using `sendto()` as the primary example:

```c
struct sockaddr_mctp addr;
char buf[14];
ssize_t len;

/* set message destination */
addr.smctp_family = AF_MCTP;
addr.smctp_network = MCTP_NET_ANY;
addr.smctp_addr.s_addr = 8;
addr.smctp_tag = MCTP_TAG_OWNER;
addr.smctp_type = MCTP_TYPE_ECHO;

/* arbitrary message to send, with message-type header */
buf[0] = MCTP_TYPE_ECHO;
memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);

len = sendto(sd, buf, sizeof(buf), 0,
             (struct sockaddr *)&addr, sizeof(addr));
```

The address argument is treated the same way as for `connect()`: The network and
address fields define the remote address to send to. If `smctp_tag` has the
`MCTP_TAG_OWNER` bit set, the kernel will ignore any tag value bits (those
covered by `MCTP_TAG_MASK`), and generate a tag value suitable for the
destination EID.
If `MCTP_TAG_OWNER` is
not set, the message will be sent with the tag value as specified. If a tag
value cannot be allocated, the system call will report an errno of `EAGAIN`.

The application must provide the message type byte as the first byte of the
message buffer passed to `sendto()`. If a message integrity check is to be
included in the transmitted message, it must also be provided in the message
buffer, and the most-significant bit of the message type byte must be 1.

If the first byte of the message does not match the message type value, then the
system call will return an error of `EPROTO`.

The `send()` and `write()` system calls behave in a similar way, but do not
specify a remote address. Therefore, `connect()` must be called beforehand; if
not, these calls will return an error of `EDESTADDRREQ` (Destination address
required).

Using `sendto()` or `sendmsg()` on a connected socket may override the remote
socket address specified in `connect()`. The `connect()` address and tag will
remain associated with the socket, for future unaddressed sends. The tag
allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
subject to the same invalidation logic as on an unconnected socket: It is
expired either by timeout or by a subsequent `sendto()`.

The `sendmsg()` system call allows a more compact argument interface, and the
message buffer to be specified as a scatter-gather list. At present no ancillary
message types (used for the `msg_control` data passed to `sendmsg()`) are
defined.

Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
will cause an allocation of a tag, if no valid tag is already allocated for that
destination. The (destination-eid,tag) tuple acts as an implicit local socket
address, to allow the socket to receive responses to this outgoing message.
If
any previous allocation has been performed (for a different remote EID), that
allocation is lost. This tag behaviour can be controlled through the
`MCTP_TAG_CONTROL` socket option.

Sockets will only receive responses to requests they have sent (with TO=1) and
may only respond (with TO=0) to requests they have received.

#### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message

An MCTP message can be received by an application using one of the `recvfrom()`,
`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
primary example:

```c
struct sockaddr_mctp addr;
socklen_t addrlen;
char buf[14];
ssize_t len;

addrlen = sizeof(addr);

len = recvfrom(sd, buf, sizeof(buf), 0,
               (struct sockaddr *)&addr, &addrlen);

/* We can expect addr to describe an MCTP address */
assert(addrlen >= sizeof(addr));
assert(addr.smctp_family == AF_MCTP);

printf("received %zd bytes from remote EID %d\n",
       len, addr.smctp_addr.s_addr);
```

The address argument to `recvfrom()` and `recvmsg()` is populated with the
remote address of the incoming message, including tag value (this will be needed
in order to reply to the message).

The first byte of the message buffer will contain the message type byte. If an
integrity check follows the message, it will be included in the received buffer.

The `recv()` and `read()` system calls behave in a similar way, but do not
provide a remote address to the application. Therefore, these are only useful if
the remote address is already known, or the message does not require a reply.

Like the send calls, sockets will only receive responses to requests they have
sent (TO=1) and may only respond (TO=0) to requests they have received.
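
Both the send and receive paths treat the first byte of the buffer as the
message type byte, with the most-significant bit indicating an integrity check.
The following sketch shows how that byte could be checked against `smctp_type`.
It is illustrative only: `mctp_check_tx_buf()`, `mctp_msg_has_ic()` and the
`MCTP_TYPE_IC` name are hypothetical, comparing only the lower 7 bits on
transmit is an assumption, and the error returned for an empty buffer is not
specified by this document.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MCTP_TYPE_MASK 0x7f /* from the sockaddr_mctp definitions */
#define MCTP_TYPE_IC   0x80 /* hypothetical name for the integrity-check bit */

/* Hypothetical transmit-side check: the first byte of the message must carry
 * the same message type as the destination address.
 * Returns 0 on success, or a negative errno value. */
static int mctp_check_tx_buf(const uint8_t *buf, size_t len, uint8_t smctp_type)
{
    assert(buf != NULL || len == 0);

    /* Assumption: a message without a type byte is rejected with EINVAL. */
    if (len < 1)
        return -EINVAL;

    /* Assumption: only the 7-bit type is compared; the IC bit is ignored. */
    if ((buf[0] & MCTP_TYPE_MASK) != (smctp_type & MCTP_TYPE_MASK))
        return -EPROTO;

    return 0;
}

/* Returns non-zero if the message claims to carry an integrity check. */
static int mctp_msg_has_ic(const uint8_t *buf, size_t len)
{
    return len >= 1 && (buf[0] & MCTP_TYPE_IC) != 0;
}
```

An application could run the same check on its own buffers before calling
`sendto()`, to catch a mismatched type byte before the kernel reports `EPROTO`.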

#### `getsockname()` & `getpeername()`: query local/remote socket address

The `getsockname()` system call returns the `struct sockaddr_mctp` value for the
local side of this socket, `getpeername()` for the remote (ie, that used in a
`connect()`). Since the tag value is a property of the remote address,
`getpeername()` may be used to retrieve a kernel-allocated tag value.

Calling `getpeername()` on an unconnected socket will result in an error of
`ENOTCONN`.

#### Socket options

The following socket options are defined for MCTP sockets:

##### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg

Enabling this socket option allows an application to specify extended addressing
information on transmitted packets, and access the same on received packets.

When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify
an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls.
This is defined as:

```c
struct sockaddr_mctp_ext {
    /* fields exactly match struct sockaddr_mctp */
    sa_family_t      smctp_family; /* = AF_MCTP */
    int              smctp_network;
    struct mctp_addr smctp_addr;
    uint8_t          smctp_type;
    uint8_t          smctp_tag;
    /* extended addressing */
    int              smctp_ifindex;
    uint8_t          smctp_halen;
    unsigned char    smctp_haddr[/* TBD */];
};
```

If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to
contain this larger structure, then the extended addressing fields are consumed
/ populated respectively.

##### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour

The set/getsockopt argument is a `mctp_tagctl` structure:

    struct mctp_tagctl {
        bool            retain;
        struct timespec timeout;
    };

This allows an application to control the behaviour of allocated tags for
non-connected sockets when transferring messages to multiple different
destinations (ie., where a `struct sockaddr_mctp` is provided for individual
messages, and the `smctp_addr` destination for those sockets may vary across
calls).

The `retain` flag indicates to the kernel that the socket should not release tag
allocations when a message is sent to a new destination EID. This causes the
socket to continue to receive incoming messages to the old (dest,tag) tuple, in
addition to the new tuple.

The `timeout` value specifies a maximum amount of time to retain tag values.
This should be based on the reply timeout for any upper-level protocol.

The kernel may reject a request to set values that would cause excessive tag
allocation by this socket. The kernel may also reject subsequent tag-allocation
requests (through send or connect syscalls) which would cause excessive tags to
be consumed by the socket, even though the tag control settings were accepted in
the setsockopt operation.

Changing the default tag control behaviour should only be required when:

- the socket is sending messages with TO=1 (ie, is a requester); and
- messages are sent to multiple different destination EIDs from the one socket.
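
As an illustration of how an application might derive the `timeout` field from
an upper-level protocol's reply timeout, the sketch below fills in a
`struct mctp_tagctl` from a value in milliseconds. The helper name
`mctp_tagctl_init()` is hypothetical, and the structure is repeated here only
so that the example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Structure as proposed in this document. */
struct mctp_tagctl {
    bool            retain;
    struct timespec timeout;
};

/* Hypothetical helper: populate a tag control structure from a protocol
 * reply timeout given in milliseconds. */
static void mctp_tagctl_init(struct mctp_tagctl *ctl, bool retain,
                             long timeout_ms)
{
    assert(ctl != NULL && timeout_ms >= 0);

    ctl->retain = retain;
    /* Split milliseconds into whole seconds and remaining nanoseconds. */
    ctl->timeout.tv_sec = timeout_ms / 1000;
    ctl->timeout.tv_nsec = (timeout_ms % 1000) * 1000000L;
}
```

The populated structure would then be passed to `setsockopt()` for the
`MCTP_TAG_CONTROL` option. The option level and numeric value are not defined
in this document, so that call is omitted here.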

#### Syscalls not implemented

The following system calls are not implemented for MCTP, primarily as they are
not used in `SOCK_DGRAM`-type sockets:

- `listen()`
- `accept()`
- `ioctl()`
- `shutdown()`
- `mmap()`

### Userspace examples

These examples cover three general use-cases:

- **requester**: sends requests to a particular (EID, type) target, and receives
  responses to those packets

  This is similar to a typical UDP client

- **responder**: receives all locally-addressed messages of a specific
  message-type, and responds to the requester immediately.

  This is similar to a typical UDP server

- **controller**: a specific service for a bus owner; may send broadcast
  messages, manage EID allocations, update local MCTP stack state. Will need
  low-level packet data.

  This is similar to a DHCP server.

#### Requester

"Client"-side implementation to send requests to a responder, and receive a
response. This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.

```c
int main(void) {
    struct sockaddr_mctp addr;
    socklen_t addrlen;
    struct {
        uint8_t type;
        uint8_t data[14];
    } msg;
    int sd, rc;

    sd = socket(AF_MCTP, SOCK_DGRAM, 0);
    if (sd < 0)
        err(EXIT_FAILURE, "socket");

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY; /* any network */
    addr.smctp_addr.s_addr = 9;        /* remote eid 9 */
    addr.smctp_tag = MCTP_TAG_OWNER;   /* kernel will allocate an owned tag */
    addr.smctp_type = MCTP_TYPE_ECHO;  /* fictitious message type */
    addrlen = sizeof(addr);

    /* set message type and payload */
    msg.type = MCTP_TYPE_ECHO;
    strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));

    /* send message */
    rc = sendto(sd, &msg, sizeof(msg), 0,
                (struct sockaddr *)&addr, addrlen);
    if (rc < 0)
        err(EXIT_FAILURE, "sendto");

    /* Receive reply. This will block until a reply arrives,
     * which may never happen. Actual code would need a timeout
     * here. */
    rc = recvfrom(sd, &msg, sizeof(msg), 0,
                  (struct sockaddr *)&addr, &addrlen);
    if (rc < 0)
        err(EXIT_FAILURE, "recvfrom");

    assert(msg.type == MCTP_TYPE_ECHO);
    /* ensure we're nul-terminated */
    msg.data[sizeof(msg.data) - 1] = '\0';

    printf("reply: %s\n", (char *)msg.data);

    return EXIT_SUCCESS;
}
```

#### Responder

"Server"-side implementation to receive requests and respond. Like the client,
this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the
`struct sockaddr_mctp`; only messages matching this type will be received.

```c
int main(void) {
    struct sockaddr_mctp addr;
    socklen_t addrlen;
    int sd, rc;

    sd = socket(AF_MCTP, SOCK_DGRAM, 0);
    if (sd < 0)
        err(EXIT_FAILURE, "socket");

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY; /* any network */
    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
    addr.smctp_type = MCTP_TYPE_ECHO;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addrlen = sizeof(addr);

    rc = bind(sd, (struct sockaddr *)&addr, addrlen);
    if (rc)
        err(EXIT_FAILURE, "bind");

    for (;;) {
        struct {
            uint8_t type;
            uint8_t data[14];
        } msg;

        /* addrlen is updated by recvfrom(), so reset it each time */
        addrlen = sizeof(addr);

        rc = recvfrom(sd, &msg, sizeof(msg), 0,
                      (struct sockaddr *)&addr, &addrlen);
        if (rc < 0)
            err(EXIT_FAILURE, "recvfrom");
        if (rc < 1) {
            warnx("not enough data for a message type");
            continue;
        }

        assert(addrlen == sizeof(addr));
        assert(msg.type == MCTP_TYPE_ECHO);

        printf("%d bytes from EID %d\n", rc, addr.smctp_addr.s_addr);

        /* Reply to requester; this macro just clears the TO-bit.
         * Other addr fields will describe the remote endpoint,
         * so use those as-is.
         */
        addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);

        rc = sendto(sd, &msg, rc, 0,
                    (struct sockaddr *)&addr, addrlen);
        if (rc < 0)
            err(EXIT_FAILURE, "sendto");
    }

    return EXIT_SUCCESS;
}
```

#### Broadcast request

Sends a request to a broadcast EID, and receives (unicast) replies. Typical
control protocol pattern.

```c
int main(void) {
    struct sockaddr_mctp txaddr, rxaddr;
    struct timespec start, cur;
    struct pollfd pollfds[1];
    socklen_t addrlen;
    uint8_t buf[2];
    int sd, rc, timeout;

    sd = socket(AF_MCTP, SOCK_DGRAM, 0);
    if (sd < 0)
        err(EXIT_FAILURE, "socket");

    /* destination address setup */
    txaddr.smctp_family = AF_MCTP;
    txaddr.smctp_network = 1; /* specific network required for broadcast */
    txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
    txaddr.smctp_type = MCTP_TYPE_CONTROL;
    txaddr.smctp_tag = MCTP_TAG_OWNER;

    buf[0] = MCTP_TYPE_CONTROL;
    buf[1] = 'a';

    /* We're doing a sendto() to a broadcast address here. If we were
     * sending more than one broadcast message, we'd be better off
     * doing connect(); sendto();, in order to retain the tag
     * reservation across all transmitted messages. However, since this
     * is a single transmit, that makes no difference in this
     * particular case.
     */
    rc = sendto(sd, buf, 2, 0, (struct sockaddr *)&txaddr,
                sizeof(txaddr));
    if (rc < 0)
        err(EXIT_FAILURE, "sendto");

    /* Set up poll behaviour, and record our starting time for
     * reply timeouts */
    pollfds[0].fd = sd;
    pollfds[0].events = POLLIN;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        /* Calculate the amount of time left for replies.
         * calculate_timeout() is a helper that is not shown here. */
        clock_gettime(CLOCK_MONOTONIC, &cur);
        timeout = calculate_timeout(&start, &cur, 1000);

        rc = poll(pollfds, 1, timeout);
        if (rc < 0)
            err(EXIT_FAILURE, "poll");

        /* timeout receiving a reply? */
        if (rc == 0)
            break;

        /* sanity check that we have a message to receive */
        if (!(pollfds[0].revents & POLLIN))
            break;

        addrlen = sizeof(rxaddr);

        rc = recvfrom(sd, buf, 2, 0, (struct sockaddr *)&rxaddr,
                      &addrlen);
        if (rc < 0)
            err(EXIT_FAILURE, "recvfrom");

        assert(addrlen >= sizeof(rxaddr));
        assert(rxaddr.smctp_family == AF_MCTP);

        printf("response from EID %d\n", rxaddr.smctp_addr.s_addr);
    }

    return EXIT_SUCCESS;
}
```

### Implementation notes

#### Addressing

Transmitted messages (through `sendto()` and related system calls) specify their
destination via the `smctp_network` and `smctp_addr` fields of a
`struct sockaddr_mctp`.

The `smctp_addr` field maps directly to the destination endpoint's EID.

The `smctp_network` field specifies a locally defined network identifier. To
simplify situations where there is only one network defined, the special value
`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
network for transmission.

This selection is entirely user-configured; one specific network may be defined
as the system default, in which case it will be used for all message
transmission where `MCTP_NET_ANY` is used as the destination network.

In particular, the destination EID is never used to select a destination
network.

MCTP responders should use the EID and network values of an incoming request to
specify the destination for any responses.

#### Bridging/routing

The network and interface structure allows multiple interfaces to share a common
network. By default, packets are not forwarded between interfaces.

A network can be configured for "forwarding" mode. In this mode, packets may be
forwarded if their destination EID is non-local, and matches a route for another
interface on the same network.

As per DSP0236, packet reassembly does not occur during the forwarding process.
If the packet is larger than the MTU for the destination interface/route, then
the packet is dropped.

#### Tag behaviour for transmitted messages

On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
must allocate a tag that will uniquely identify responses over a (destination
EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.

To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
field will cause the kernel to allocate a unique tag for subsequent replies from
that specific remote EID.

This allocation will expire when any of the following occur:

- the socket is closed
- a new message is sent to a new destination EID
- an implementation-defined timeout expires

Because the "tag space" is limited, it may not be possible for the kernel to
allocate a unique tag for the outgoing message. In this case, the `sendto()`
call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
a local port cannot be allocated for an outgoing message.

The implementation-defined timeout value shall be chosen to reasonably cover
standard reply timeouts. If necessary, this timeout may be modified through the
`MCTP_TAG_CONTROL` socket option.

For applications that expect to perform an ongoing message exchange with a
particular destination address, they may use the `connect()` call to set a
persistent remote address. In this case, the tag will be allocated during
`connect()`, and remain reserved for this socket until any of the following
occur:

- the socket is closed
- the remote address is changed through another call to `connect()`.

In particular, calling `sendto()` with a different address does not release the
tag reservation.
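
The limited tag space can be illustrated with a toy allocator that tracks
in-use tags per destination EID. This is a simplified sketch and not the
proposed kernel implementation: it assumes a single network and a single local
EID, ignores timeouts and socket ownership, and the names
`struct mctp_tag_table`, `mctp_tag_alloc()` and `mctp_tag_release()` are
hypothetical. It demonstrates why allocation fails with `EAGAIN` once all
eight tag values for a destination are in use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MCTP_TAG_COUNT 8   /* the tag value is 3 bits: 0..7 */
#define MCTP_EID_COUNT 256 /* EIDs are 8 bits wide */

/* Hypothetical allocator state: one bitmap of in-use tags per destination. */
struct mctp_tag_table {
    uint8_t in_use[MCTP_EID_COUNT];
};

/* Allocate the lowest free tag for a destination EID.
 * Returns the tag value (0..7), or -EAGAIN if none is available. */
static int mctp_tag_alloc(struct mctp_tag_table *t, uint8_t dest)
{
    int tag;

    assert(t != NULL);
    for (tag = 0; tag < MCTP_TAG_COUNT; tag++) {
        if (!(t->in_use[dest] & (1u << tag))) {
            t->in_use[dest] |= (uint8_t)(1u << tag);
            return tag;
        }
    }
    return -EAGAIN;
}

/* Release a tag, eg. on socket close or timeout expiry. */
static void mctp_tag_release(struct mctp_tag_table *t, uint8_t dest, int tag)
{
    assert(t != NULL && tag >= 0 && tag < MCTP_TAG_COUNT);
    t->in_use[dest] &= (uint8_t)~(1u << tag);
}
```

With this model, a ninth outstanding request to the same destination fails
until an earlier tag is released, while requests to other destinations are
unaffected.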

Broadcast messages are particularly onerous for tag reservations. When a message
is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
reserve the tag across the entire range of possible EIDs. Therefore, a
particular tag value must be currently-unused across all EIDs to allow a
`sendto()` to a broadcast address. Additionally, this reservation is not cleared
when a reply is received, as there may be multiple replies to a broadcast.

For this reason, applications wanting to send to the broadcast address should
use the `connect()` system call to reserve a tag, and guarantee its availability
for future message transmission. Note that this will remove the tag value for
use with _any other EID_. Sending to the broadcast address should be avoided; we
expect few applications will need this functionality.

#### MCTP Control Protocol implementation

Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
implementation would exist as a userspace process, `mctpd`. This process is
responsible for responding to incoming control protocol messages, any dynamic
EID allocations (for bus owner devices) and maintaining the MCTP route table
(for bridging devices).

This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
data on incoming control protocol requests. It would interact with the kernel's
route table via a netlink interface - the same as that implemented for the
[Utility and configuration interfaces](#utility-and-configuration-interfaces).

### Neighbour and routing implementation

The packet-transmission behaviour of the MCTP infrastructure relies on a single
routing table to look up both route and neighbour information.
Entries in this
table are of the format:

| EID range | interface | physical address | metric | MTU | flags | expiry |
| --------- | --------- | ---------------- | ------ | --- | ----- | ------ |

This table can be updated from two sources:

- From userspace, via a netlink interface (see the
  [Utility and configuration interfaces](#utility-and-configuration-interfaces)
  section).

- Directly within the kernel, when basic neighbour information is discovered.
  Kernel-originated routes are marked as such in the flags field, and have a
  maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

- physical-to-EID mappings discovered through received packets

- explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the
routing table, the kernel may attempt to resolve the physical address of that
endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
(section 12.9 of DSP0236). The response message will be used to add a
kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.

## Utility and configuration interfaces

A small utility will be developed to control the state of the kernel MCTP stack.
This will be similar in design to the 'iproute2' tools, which perform a similar
function for the IPv4 and IPv6 protocols.

The utility will be invoked as `mctp`, and provide subcommands for managing
different aspects of the kernel stack.

### `mctp link`: manage interfaces

```sh
mctp link set <link> <up|down>
mctp link set <link> network <network-id>
mctp link set <link> mtu <mtu>
mctp link set <link> bus-owner <hwaddr>
```

### `mctp network`: manage networks

```sh
mctp network create <network-id>
mctp network set <network-id> forwarding <on|off>
mctp network set <network-id> default [<true|false>]
mctp network delete <network-id>
```

### `mctp address`: manage local EID assignments

```sh
mctp address add <eid> dev <link>
mctp address del <eid> dev <link>
```

### `mctp route`: manage routing tables

```sh
mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>]
mctp route show [net <network-id>]
```

### `mctp stat`: query socket status

```sh
mctp stat
```

A set of netlink message formats will be defined to support these control
functions.

# Design points & alternatives considered

## Including message-type byte in send/receive buffers

This design specifies that message buffers passed to the kernel in send syscalls
and from the kernel in receive syscalls will have the message type byte as the
first byte of the buffer. This corresponds to the definition of a MCTP message
payload in DSP0236.

This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's
superficially possible for the kernel to prepend this byte on send, and remove
it on receive.

However, the exact format of the MCTP message payload is not precisely defined
by the specification. Particularly, any message integrity check data (which
would also need to be appended / stripped in conjunction with the type byte) is
defined by the type specification, not DSP0236.
The kernel would need knowledge
of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any
modification by the kernel.

## MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol

This design specifies message-types to be passed in the `smctp_type` field of
`struct sockaddr_mctp`. An alternative would be to pass them in the `protocol`
argument of the `socket()` system call:

```c
    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
```

The `smctp_type` implementation was chosen as it better matches the "addressing"
model of the message type; sockets are bound to an incoming message type,
similar to the IP protocol's model of binding UDP sockets to a local port
number.

There is no kernel behaviour that depends on the specific type (particularly
given the design choice above), so the message type is not a good fit for the
protocol argument.

Future additions that perform protocol-specific message handling, and so alter
the send/receive buffer format, may use a new protocol argument.

## Networks referenced by index rather than UUID

This design proposes referencing networks by an integer index. The MCTP standard
does optionally associate an RFC 4122 UUID with a network; it would be possible
to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel.
Given that the Get Network ID message in the MCTP Control Protocol is
implemented entirely in userspace, the kernel does not need to be aware of
network UUIDs, and requiring UUIDs for network references (for example, the
`smctp_network` field of `struct sockaddr_mctp`, as type `uuid_t`) would
complicate assignment.

Instead of a UUID, an integer index is used, in a similar fashion to the integer
index used to reference `struct net_device`s elsewhere in the network stack.

## Tag behaviour alternatives

We considered _several_ different designs for the tag handling behaviour. Below
is a brief overview of the more feasible of those, and why each was rejected:

### Each socket is allocated a unique tag value on creation

We could allocate a tag for each socket on creation, and use that value when a
tag is required. This, however:

- needlessly consumes a tag on non-tag-owning sockets (i.e., those which send
  with TO=0 - responders); and

- limits us to 8 sockets per network.

### Tags only used for message packetisation / reassembly

An alternative would be to completely dissociate tag allocation from sockets,
and only allocate a tag for the (short-lived) task of packetising a message and
sending those packets. Tags would be released when the last packet has been
sent.

However, this removes any facility to correlate responses with the correct
socket, which is the purpose of the TO bit in DSP0236. In order for the sending
application to receive the response, we would either need to:

- limit the system to one socket of each message type (which, for example,
  precludes running a requester and a responder of the same type); or

- forward all incoming messages of a specific message-type to all sockets
  listening on that type, making it trivial to eavesdrop on MCTP data of other
  applications.

### Allocate a tag for one request/response pair

Another alternative would be to allocate a tag on each outgoing TO=1 message,
and then release that allocation after the incoming response to that tag (TO=0)
is observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to
requests - more than one response may be valid for a given request message.
For example, in response to a request, an NVMe-MI implementation may send an
in-progress reply before the final reply. In this case, we would release the tag
after the first response is received, and then have no way to correlate the
second message with the socket.

Similarly, broadcast MCTP request messages may have multiple replies from
multiple endpoints, meaning we cannot release the tag allocation on the first
reply.

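The failure mode of this release-on-first-response scheme can be sketched with a small userspace model. This is only an illustration, not kernel code: `struct tag_table`, `tag_alloc()`, `tag_deliver_release()` and the integer socket ids are hypothetical names invented for this sketch, and the table assumes the 8 tag values implied by the "8 sockets per network" limit above.

```c
#include <assert.h>

#define MCTP_NUM_TAGS 8    /* assumed: 8 distinct tag values */
#define NO_SOCKET     (-1) /* marks a tag with no owning socket */

/* Hypothetical tag table: maps each tag value to an owning socket id. */
struct tag_table {
    int owner[MCTP_NUM_TAGS];
};

/* Mark every tag as free. */
static void tag_table_init(struct tag_table *t)
{
    for (int i = 0; i < MCTP_NUM_TAGS; i++)
        t->owner[i] = NO_SOCKET;
}

/*
 * Allocate the lowest free tag for an outgoing TO=1 message from `sock`.
 * Returns the tag value, or -1 if all tags are in use.
 */
static int tag_alloc(struct tag_table *t, int sock)
{
    assert(sock >= 0);
    for (int i = 0; i < MCTP_NUM_TAGS; i++) {
        if (t->owner[i] == NO_SOCKET) {
            t->owner[i] = sock;
            return i;
        }
    }
    return -1;
}

/*
 * Handle an incoming TO=0 message under the rejected scheme: look up the
 * owning socket, then release the tag immediately. Returns the socket id
 * to deliver to, or NO_SOCKET if the tag has no owner.
 */
static int tag_deliver_release(struct tag_table *t, int tag)
{
    assert(tag >= 0 && tag < MCTP_NUM_TAGS);
    int sock = t->owner[tag];
    t->owner[tag] = NO_SOCKET; /* released after the first response */
    return sock;
}
```

In this model, a first call to `tag_deliver_release()` for an allocated tag returns the requesting socket, but a second call for the same tag (the final reply following an in-progress reply, or a second broadcast responder) returns `NO_SOCKET`, so that message cannot be routed to the requester.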