# OpenBMC in-kernel MCTP

Author: Jeremy Kerr `<jk@codeconstruct.com.au>`

Please refer to the [MCTP Overview](mctp.md) document for general MCTP design
description, background and requirements.

This document describes a kernel-based implementation of MCTP infrastructure,
providing a sockets-based API for MCTP communication within an OpenBMC-based
platform.

# Requirements for a kernel implementation

 * The MCTP messaging API should be an obvious application of the existing POSIX
   socket interface.

 * Configuration should be simple for a straightforward MCTP endpoint: a single
   network with a single local endpoint id (EID).

 * Infrastructure should be flexible enough to allow for more complex MCTP
   networks, allowing for:

    - MCTP networks (as defined by section 3.2.31 of DSP0236) that consist of
      multiple local physical interfaces, and/or multiple EIDs;

    - multiple distinct (i.e., non-bridged) networks, possibly containing
      duplicated EIDs between networks;

    - multiple local EIDs on a single interface; and

    - customisable routing/bridging configurations within a network.


# Proposed Design #

The design contains several components:

 * An interface for userspace applications to send and receive MCTP messages: a
   mapping of the sockets API to MCTP usage.

 * Infrastructure for control and configuration of the MCTP network(s),
   consisting of a configuration utility, and a kernel messaging facility for
   this utility to use.

 * Kernel drivers for physical interface bindings.

In general, the kernel components cover the transport functionality of MCTP,
such as message assembly/disassembly, packet forwarding, and physical interface
implementations.

Higher-level protocols (such as PLDM) are implemented in userspace, through the
introduced socket API. This also includes the majority of the MCTP Control
Protocol implementation (DSP0236, section 11) - MCTP endpoints will typically
have a specific process to request and respond to control protocol messages.
However, the kernel will include a small subset of control protocol code to
allow very simple endpoints, with static EID allocations, to run without this
process. MCTP endpoints that require more than just single-endpoint
functionality (bus owners, bridges, etc), and/or dynamic EID allocation, would
include the control message protocol process.

A new driver is introduced to handle each physical interface binding. These
drivers expose the appropriate `struct net_device` to handle transmission and
reception of MCTP packets on their associated hardware channels. Under Linux,
the namespace for these interfaces is separate from other network interfaces -
such as those for ethernet.

## Structure: interfaces & networks ##

The kernel models the local MCTP topology through two items: interfaces and
networks.

An interface (or "link") is an instance of an MCTP physical transport binding
(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
device. This is represented as a `struct net_device`, and has a user-visible
name and index (`ifindex`). Non-hardware-attached interfaces are permitted, to
allow local loopback and/or virtual interfaces.

A network defines a unique address space for MCTP endpoints by endpoint-ID
(described by DSP0236, section 3.2.31). A network has a user-visible identifier
to allow references from userspace. Route definitions are specific to one
network.

Each interface is associated with exactly one network. A network may be
associated with one or more interfaces.

If multiple networks are present, each may contain EIDs that are also present on
other networks.

## Sockets API ##

### Protocol definitions ###

We define a new address family (and corresponding protocol family) for MCTP:

```c
    #define AF_MCTP /* TBD */
    #define PF_MCTP AF_MCTP
```

MCTP sockets are created with the `socket()` syscall, specifying `AF_MCTP` as
the domain. Currently, only a `SOCK_DGRAM` socket type is defined.

```c
    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
```

The only (current) value for the `protocol` argument is 0. Future protocol
implementations may be added later.

MCTP sockets opened with a protocol value of 0 will communicate directly at the
transport layer; message buffers received by the application will consist of
message data from reassembled MCTP packets, and will include the full message
including message type byte and optional message integrity check (IC).
Individual packet headers are not included; they may be accessible through a
future `SOCK_RAW` socket type.

As with all socket address families, source and destination addresses are
specified with a new `sockaddr` type:

```c
    struct sockaddr_mctp {
            sa_family_t         smctp_family; /* = AF_MCTP */
            int                 smctp_network;
            struct mctp_addr    smctp_addr;
            uint8_t             smctp_type;
            uint8_t             smctp_tag;
    };

    struct mctp_addr {
            uint8_t             s_addr;
    };

    /* MCTP network values */
    #define MCTP_NET_ANY        0

    /* MCTP EID values */
    #define MCTP_ADDR_ANY       0xff
    #define MCTP_ADDR_BCAST     0xff

    /* MCTP type values. Only the least-significant 7 bits of
     * smctp_type are used for tag matches; the specification defines
     * the type to be 7 bits.
     */
    #define MCTP_TYPE_MASK      0x7f

    /* MCTP tag definitions; used for smctp_tag field of sockaddr_mctp */
    /* MCTP-spec-defined fields */
    #define MCTP_TAG_MASK       0x07
    #define MCTP_TAG_OWNER      0x08
    /* Others: reserved */

    /* Helpers */
    /* response to a request: clear TO, keep value */
    #define MCTP_TAG_RSP(x)     ((x) & MCTP_TAG_MASK)
```

### Syscall behaviour ###

The following sections describe the MCTP-specific behaviours of the standard
socket system calls. These behaviours have been chosen to map closely to the
existing sockets APIs.

#### `bind()`: set local socket address ####

Sockets that receive incoming request packets will bind to a local address,
using the `bind()` syscall.

```c
    struct sockaddr_mctp addr;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
    addr.smctp_type = MCTP_TYPE_PLDM;
    addr.smctp_tag = MCTP_TAG_OWNER;

    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the local address of the socket. Incoming MCTP messages that
match the network, address, and message type will be received by this socket.
The reference to 'incoming' is important here; a bound socket will only receive
messages with the TO bit set, to indicate an incoming request message, rather
than a response.

The `smctp_tag` value will configure the tags accepted from the remote side of
this socket. Given the above, the only valid value is `MCTP_TAG_OWNER`, which
will result in remotely "owned" tags being routed to this socket. Since
`MCTP_TAG_OWNER` is set, the 3 least-significant bits of `smctp_tag` are
not used; callers must set them to zero. See the [Tag behaviour for transmitted
messages](#tag-behaviour-for-transmitted-messages) section for more details. If
the `MCTP_TAG_OWNER` bit is not set, `bind()` will fail with an errno of
`EINVAL`.
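
The tag rule for `bind()` can be illustrated with a small, standalone check. This is only a sketch: `mctp_bind_tag_check()` is a hypothetical helper, not part of the proposed API, and rejecting non-zero tag value bits with `EINVAL` is an assumption (the text above only says callers must set them to zero).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values copied from the proposed definitions in this document. */
#define MCTP_TAG_MASK  0x07
#define MCTP_TAG_OWNER 0x08

/* Hypothetical helper: returns 0 if `tag` is acceptable as the
 * smctp_tag value passed to bind(), or -EINVAL otherwise. */
int mctp_bind_tag_check(uint8_t tag)
{
    /* bind() only accepts remotely-owned tags, so TO must be set */
    if (!(tag & MCTP_TAG_OWNER))
        return -EINVAL;

    /* Assumption: with TO set, the 3 tag value bits must be zero,
     * and non-zero values are rejected rather than ignored. */
    if (tag & MCTP_TAG_MASK)
        return -EINVAL;

    return 0;
}
```

An application could run such a check before calling `bind()` to catch a mis-built `smctp_tag` early, rather than relying on the `EINVAL` result from the kernel.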

A `smctp_network` value of `MCTP_NET_ANY` will configure the socket to receive
incoming packets from any locally-connected network. A specific network value
will cause the socket to only receive incoming messages from that network.

The `smctp_addr` field specifies a local address to bind to. A value of
`MCTP_ADDR_ANY` configures the socket to receive messages addressed to any
local destination EID.

The `smctp_type` field specifies which message types to receive. Only the lower
7 bits of the type are matched on incoming messages (i.e., the most-significant
IC bit is not part of the match). This results in the socket receiving packets
with and without a message integrity check footer.

#### `connect()`: set remote socket address ####

Applications may specify a socket's remote address with the `connect()` syscall:

```c
    struct sockaddr_mctp addr;
    int rc;

    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_PLDM;

    rc = connect(sd, (struct sockaddr *)&addr, sizeof(addr));
```

This establishes the remote address of a socket, used for future message
transmission. Like other `SOCK_DGRAM` behaviour, this does not generate any MCTP
traffic directly, but just sets the default destination for messages sent from
this socket.

The `smctp_network` field may specify a locally-attached network, or the value
`MCTP_NET_ANY`, in which case the kernel will select a suitable MCTP network.
This is guaranteed to work for single-network configurations, but may require
additional routing definitions for endpoints attached to multiple distinct
networks. See the [Addressing](#addressing) section for details.

The `smctp_addr` field specifies a remote EID. This may be `MCTP_ADDR_BCAST`,
the MCTP broadcast EID (0xff).

The `smctp_type` field specifies the type field of messages transferred over
this socket.

The `smctp_tag` value will configure the tag used for the local side of this
socket. The only valid value is `MCTP_TAG_OWNER`, which will result in an
"owned" tag being allocated for this socket. The tag will remain allocated for
all future outgoing messages, until either the socket is closed, or `connect()`
is called again. If a tag cannot be allocated, `connect()` will report an error,
with an errno value of `EAGAIN`. See the [Tag behaviour for transmitted
messages](#tag-behaviour-for-transmitted-messages) section for more details. If
the `MCTP_TAG_OWNER` bit is not set, `connect()` will fail with an errno of
`EINVAL`.

Requesters which connect to a single responder will typically use `connect()` to
specify the peer address and tag for future outgoing messages.

#### `sendto()`, `sendmsg()`, `send()` & `write()`: transmit an MCTP message ####

An MCTP message is transmitted using one of the `sendto()`, `sendmsg()`, `send()`
or `write()` syscalls. Using `sendto()` as the primary example:

```c
    struct sockaddr_mctp addr;
    char buf[14];
    ssize_t len;

    /* set message destination */
    addr.smctp_family = AF_MCTP;
    addr.smctp_network = MCTP_NET_ANY;
    addr.smctp_addr.s_addr = 8;
    addr.smctp_tag = MCTP_TAG_OWNER;
    addr.smctp_type = MCTP_TYPE_ECHO;

    /* arbitrary message to send, with message-type header; the
     * 13-character payload is copied without its NUL terminator */
    buf[0] = MCTP_TYPE_ECHO;
    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);

    len = sendto(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, sizeof(addr));
```

The address argument is treated the same way as for `connect()`: the network and
address fields define the remote address to send to. If `smctp_tag` has the
`MCTP_TAG_OWNER` bit set, the kernel will ignore any bits set in the tag value
(`MCTP_TAG_MASK`), and generate a tag value suitable for the destination EID. If
`MCTP_TAG_OWNER` is not set, the message will be sent with the tag value as
specified. If a tag value cannot be allocated, the system call will report an
errno of `EAGAIN`.

The application must provide the message type byte as the first byte of the
message buffer passed to `sendto()`. If a message integrity check is to be
included in the transmitted message, it must also be provided in the message
buffer, and the most-significant bit of the message type byte must be 1.

If the first byte of the message does not match the message type value, then the
system call will return an error of `EPROTO`.

The `send()` and `write()` system calls behave in a similar way, but do not
specify a remote address. Therefore, `connect()` must be called beforehand; if
not, these calls will return an error of `EDESTADDRREQ` (Destination address
required).

Using `sendto()` or `sendmsg()` on a connected socket may override the remote
socket address specified in `connect()`. The `connect()` address and tag will
remain associated with the socket, for future unaddressed sends. The tag
allocated through a call to `sendto()` or `sendmsg()` on a connected socket is
subject to the same invalidation logic as on an unconnected socket: it is
expired either by timeout or by a subsequent `sendto()`.

The `sendmsg()` system call allows a more compact argument interface, and the
message buffer to be specified as a scatter-gather list. At present no
ancillary message types (used for the `msg_control` data passed to `sendmsg()`)
are defined.

Transmitting a message on an unconnected socket with `MCTP_TAG_OWNER` specified
will cause an allocation of a tag, if no valid tag is already allocated for that
destination. The (destination-eid,tag) tuple acts as an implicit local socket
address, to allow the socket to receive responses to this outgoing message. If
any previous allocation has been performed (for a different remote EID), that
allocation is lost. This tag behaviour can be controlled through the
`MCTP_TAG_CONTROL` socket option.

Sockets will only receive responses to requests they have sent (with TO=1) and may
only respond (with TO=0) to requests they have received.

#### `recvfrom()`, `recvmsg()`, `recv()` & `read()`: receive an MCTP message ####

An MCTP message can be received by an application using one of the `recvfrom()`,
`recvmsg()`, `recv()` or `read()` system calls. Using `recvfrom()` as the
primary example:

```c
    struct sockaddr_mctp addr;
    socklen_t addrlen;
    char buf[14];
    ssize_t len;

    addrlen = sizeof(addr);

    len = recvfrom(sd, buf, sizeof(buf), 0,
                    (struct sockaddr *)&addr, &addrlen);

    /* We can expect addr to describe an MCTP address */
    assert(addrlen >= sizeof(addr));
    assert(addr.smctp_family == AF_MCTP);

    printf("received %zd bytes from remote EID %d\n", len,
                    addr.smctp_addr.s_addr);
```

The address argument to `recvfrom` and `recvmsg` is populated with the remote
address of the incoming message, including tag value (this will be needed in
order to reply to the message).

The first byte of the message buffer will contain the message type byte. If an
integrity check follows the message, it will be included in the received buffer.

The `recv()` and `read()` system calls behave in a similar way, but do not
provide a remote address to the application. Therefore, these are only useful if
the remote address is already known, or the message does not require a reply.

Like the send calls, sockets will only receive responses to requests they have
sent (TO=1) and may only respond (TO=0) to requests they have received.
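
Because received buffers start with the raw message type byte, an application typically splits that byte into the 7-bit type and the IC flag before looking at the payload. The following is a minimal sketch of that step; `MCTP_TYPE_IC`, `struct mctp_type_info` and `mctp_parse_type_byte()` are illustrative names assumed here, not definitions from this design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCTP_TYPE_MASK 0x7f /* from the definitions in this document */
#define MCTP_TYPE_IC   0x80 /* assumed name for the IC bit */

/* Hypothetical result type for the sketch below. */
struct mctp_type_info {
    uint8_t type;   /* 7-bit message type */
    bool    has_ic; /* true if an integrity check trails the message */
};

/* Hypothetical helper: decode the first byte of a received message.
 * Returns 0 on success, or -1 if the buffer holds no type byte. */
int mctp_parse_type_byte(const uint8_t *buf, size_t len,
                         struct mctp_type_info *info)
{
    if (len < 1)
        return -1;

    info->type = buf[0] & MCTP_TYPE_MASK;
    info->has_ic = (buf[0] & MCTP_TYPE_IC) != 0;
    return 0;
}
```

The length and layout of the integrity check itself are defined by the message type's own specification, so the sketch only reports whether one is present.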
359 360#### `getsockname()` & `getpeername()`: query local/remote socket address #### 361 362The `getsockname()` system call returns the `struct sockaddr_mctp` value for the 363local side of this socket, `getpeername()` for the remote (ie, that used in a 364connect()). Since the tag value is a property of the remote address, 365`getpeername()` may be used to retrieve a kernel-allocated tag value. 366 367Calling `getpeername()` on an unconnected socket will result in an error of 368`ENOTCONN`. 369 370#### Socket options #### 371 372The following socket options are defined for MCTP sockets: 373 374##### `MCTP_ADDR_EXT`: Use extended addressing information in sendmsg/recvmsg ##### 375 376Enabling this socket option allows an application to specify extended addressing 377information on transmitted packets, and access the same on received packets. 378 379When the `MCTP_ADDR_EXT` socket option is enabled, the application may specify 380an expanded `struct sockaddr` to the `recvfrom()` and `sendto()` system calls. 381This as defined as: 382 383```c 384 struct sockaddr_mctp_ext { 385 /* fields exactly match struct sockaddr_mctp */ 386 sa_family_t smctp_family; /* = AF_MCTP */ 387 int smctp_network; 388 struct mctp_addr smctp_addr; 389 uint8_t smcp_tag; 390 /* extended addressing */ 391 int smctp_ifindex; 392 uint8_t smctp_halen; 393 unsigned char smctp_haddr[/* TBD */]; 394 } 395``` 396 397If the `addrlen` specified to `sendto()` or `recvfrom()` is sufficient to 398contain this larger structure, then the extended addressing fields are consumed 399/ populated respectively. 
400 401 402##### `MCTP_TAG_CONTROL`: manage outgoing tag allocation behaviour ##### 403 404The set/getsockopt argument is a `mctp_tagctl` structure: 405 406 struct mctp_tagctl { 407 bool retain; 408 struct timespec timeout; 409 }; 410 411This allows an application to control the behaviour of allocated tags for 412non-connected sockets when transferring messages to multiple different 413destinations (ie., where a `struct sockaddr_mctp` is provided for individual 414messages, and the `smctp_addr` destination for those sockets may vary across 415calls). 416 417The `retain` flag indicates to the kernel that the socket should not release tag 418allocations when a message is sent to a new destination EID. This causes the 419socket to continue to receive incoming messages to the old (dest,tag) tuple, in 420addition to the new tuple. 421 422The `timeout` value specifies a maximum amount of time to retain tag values. 423This should be based on the reply timeout for any upper-level protocol. 424 425The kernel may reject a request to set values that would cause excessive tag 426allocation by this socket. The kernel may also reject subsequent tag-allocation 427requests (through send or connect syscalls) which would cause excessive tags to 428be consumed by the socket, even though the tag control settings were accepted in 429the setsockopt operation. 430 431Changing the default tag control behaviour should only be required when: 432 433 * the socket is sending messages with TO=1 (ie, is a requester); and 434 * messages are sent to multiple different destination EIDs from the one 435 socket. 


#### Syscalls not implemented ####

The following system calls are not implemented for MCTP, primarily as they are
not used in `SOCK_DGRAM`-type sockets:

 * `listen()`
 * `accept()`
 * `ioctl()`
 * `shutdown()`
 * `mmap()`

### Userspace examples ###

These examples cover three general use-cases:

 - **requester**: sends requests to a particular (EID, type) target, and
   receives responses to those packets.

   This is similar to a typical UDP client.

 - **responder**: receives all locally-addressed messages of a specific
   message-type, and responds to the requester immediately.

   This is similar to a typical UDP server.

 - **controller**: a specific service for a bus owner; may send broadcast
   messages, manage EID allocations, update local MCTP stack state. Will
   need low-level packet data.

   This is similar to a DHCP server.

#### Requester ####

"Client"-side implementation to send requests to a responder, and receive a response.
This uses a (fictitious) message type of `MCTP_TYPE_ECHO`.

```c
    int main() {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            struct {
                uint8_t type;
                uint8_t data[14];
            } msg;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);

            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = 9;    /* remote eid 9 */
            addr.smctp_tag = MCTP_TAG_OWNER; /* kernel will allocate an owned tag */
            addr.smctp_type = MCTP_TYPE_ECHO; /* fictitious message type */
            addrlen = sizeof(addr);

            /* set message type and payload */
            msg.type = MCTP_TYPE_ECHO;
            strncpy((char *)msg.data, "hello, world!", sizeof(msg.data));

            /* send message */
            rc = sendto(sd, &msg, sizeof(msg), 0,
                            (struct sockaddr *)&addr, addrlen);

            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Receive reply. This will block until a reply arrives,
             * which may never happen. Actual code would need a timeout
             * here. */
            rc = recvfrom(sd, &msg, sizeof(msg), 0,
                        (struct sockaddr *)&addr, &addrlen);
            if (rc < 0)
                    err(EXIT_FAILURE, "recvfrom");

            assert(msg.type == MCTP_TYPE_ECHO);
            /* ensure we're nul-terminated */
            msg.data[sizeof(msg.data)-1] = '\0';

            printf("reply: %s\n", msg.data);

            return EXIT_SUCCESS;
    }
```

#### Responder ####

"Server"-side implementation to receive requests and respond. Like the client,
this uses a (fictitious) message type of `MCTP_TYPE_ECHO` in the `struct
sockaddr_mctp`; only messages matching this type will be received.

```c
    int main() {
            struct sockaddr_mctp addr;
            socklen_t addrlen;
            int sd, rc;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);

            addr.smctp_family = AF_MCTP;
            addr.smctp_network = MCTP_NET_ANY; /* any network */
            addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
            addr.smctp_type = MCTP_TYPE_ECHO;
            addr.smctp_tag = MCTP_TAG_OWNER;
            addrlen = sizeof(addr);

            rc = bind(sd, (struct sockaddr *)&addr, addrlen);
            if (rc)
                    err(EXIT_FAILURE, "bind");

            for (;;) {
                    struct {
                        uint8_t type;
                        uint8_t data[14];
                    } msg;

                    /* addrlen is a value-result argument; reset it
                     * before every receive */
                    addrlen = sizeof(addr);

                    rc = recvfrom(sd, &msg, sizeof(msg), 0,
                                    (struct sockaddr *)&addr, &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");
                    if (rc < 1) {
                            warnx("not enough data for a message type");
                            continue;
                    }

                    assert(addrlen == sizeof(addr));
                    assert(msg.type == MCTP_TYPE_ECHO);

                    printf("%d bytes from EID %d\n", rc,
                                    addr.smctp_addr.s_addr);

                    /* Reply to requester; this macro just clears the TO-bit.
                     * Other addr fields will describe the remote endpoint,
                     * so use those as-is.
                     */
                    addr.smctp_tag = MCTP_TAG_RSP(addr.smctp_tag);

                    rc = sendto(sd, &msg, rc, 0,
                                (struct sockaddr *)&addr, addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "sendto");
            }

            return EXIT_SUCCESS;
    }
```

#### Broadcast request ####

Sends a request to a broadcast EID, and receives (unicast) replies. This is a
typical control protocol pattern.

```c
    int main() {
            struct sockaddr_mctp txaddr, rxaddr;
            struct timespec start, cur;
            struct pollfd pollfds[1];
            socklen_t addrlen;
            uint8_t buf[2];
            int sd, rc, timeout;

            sd = socket(AF_MCTP, SOCK_DGRAM, 0);

            /* destination address setup */
            txaddr.smctp_family = AF_MCTP;
            txaddr.smctp_network = 1; /* specific network required for broadcast */
            txaddr.smctp_addr.s_addr = MCTP_ADDR_BCAST; /* broadcast dest */
            txaddr.smctp_type = MCTP_TYPE_CONTROL;
            txaddr.smctp_tag = MCTP_TAG_OWNER;

            buf[0] = MCTP_TYPE_CONTROL;
            buf[1] = 'a';

            /* We're doing a sendto() to a broadcast address here. If we were
             * sending more than one broadcast message, we'd be better off
             * doing connect(); sendto();, in order to retain the tag
             * reservation across all transmitted messages. However, since this
             * is a single transmit, that makes no difference in this
             * particular case.
             */
            rc = sendto(sd, buf, sizeof(buf), 0, (struct sockaddr *)&txaddr,
                            sizeof(txaddr));
            if (rc < 0)
                    err(EXIT_FAILURE, "sendto");

            /* Set up poll behaviour, and record our starting time for
             * reply timeouts */
            pollfds[0].fd = sd;
            pollfds[0].events = POLLIN;
            clock_gettime(CLOCK_MONOTONIC, &start);

            for (;;) {
                    /* Calculate the amount of time left for replies;
                     * calculate_timeout() is an application-provided
                     * helper, not shown here */
                    clock_gettime(CLOCK_MONOTONIC, &cur);
                    timeout = calculate_timeout(&start, &cur, 1000);

                    rc = poll(pollfds, 1, timeout);
                    if (rc < 0)
                            err(EXIT_FAILURE, "poll");

                    /* timeout receiving a reply? */
                    if (rc == 0)
                            break;

                    /* sanity check that we have a message to receive */
                    if (!(pollfds[0].revents & POLLIN))
                            break;

                    addrlen = sizeof(rxaddr);

                    rc = recvfrom(sd, buf, sizeof(buf), 0,
                                    (struct sockaddr *)&rxaddr, &addrlen);
                    if (rc < 0)
                            err(EXIT_FAILURE, "recvfrom");

                    assert(addrlen >= sizeof(rxaddr));
                    assert(rxaddr.smctp_family == AF_MCTP);

                    printf("response from EID %d\n",
                                    rxaddr.smctp_addr.s_addr);
            }

            return EXIT_SUCCESS;
    }
```

### Implementation notes ###

#### Addressing ####

Transmitted messages (through `sendto()` and related system calls) specify their
destination via the `smctp_network` and `smctp_addr` fields of a `struct
sockaddr_mctp`.

The `smctp_addr` field maps directly to the destination endpoint's EID.

The `smctp_network` field specifies a locally defined network identifier. To
simplify situations where there is only one network defined, the special value
`MCTP_NET_ANY` is allowed. This will allow the kernel to select a specific
network for transmission.

This selection is entirely user-configured; one specific network may be defined
as the system default, in which case it will be used for all message
transmission where `MCTP_NET_ANY` is used as the destination network.

In particular, the destination EID is never used to select a destination
network.

MCTP responders should use the EID and network values of an incoming request to
specify the destination for any responses.

#### Bridging/routing ####

The network and interface structure allows multiple interfaces to share a common
network. By default, packets are not forwarded between interfaces.

A network can be configured for "forwarding" mode. In this mode, packets may be
forwarded if their destination EID is non-local, and matches a route for another
interface on the same network.
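
The forwarding rule can be sketched as a pure function over a simplified route list. Everything here is hypothetical and for illustration only: `struct mctp_route_sketch` and `mctp_forward_ifindex()` are not proposed kernel structures, and real routes carry more state than an EID range and an interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified route: an EID range reachable through one
 * interface of one network. */
struct mctp_route_sketch {
    int     net;
    uint8_t eid_min, eid_max;
    int     ifindex;
};

/* Returns the ifindex to forward a packet on, or -1 if the packet is
 * not forwarded. `in_ifindex` is the interface the packet arrived on. */
int mctp_forward_ifindex(bool forwarding, int net, int in_ifindex,
                         uint8_t dest,
                         const uint8_t *local_eids, size_t n_local,
                         const struct mctp_route_sketch *routes,
                         size_t n_routes)
{
    size_t i;

    /* forwarding is off by default for a network */
    if (!forwarding)
        return -1;

    /* locally-addressed packets are delivered, not forwarded */
    for (i = 0; i < n_local; i++)
        if (local_eids[i] == dest)
            return -1;

    /* forward only via another interface on the same network */
    for (i = 0; i < n_routes; i++) {
        const struct mctp_route_sketch *rt = &routes[i];

        if (rt->net == net && rt->ifindex != in_ifindex &&
            dest >= rt->eid_min && dest <= rt->eid_max)
            return rt->ifindex;
    }

    return -1;
}
```

The sketch takes the first matching route; it does not model metrics or MTU limits.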

As per DSP0236, packet reassembly does not occur during the forwarding process.
If the packet is larger than the MTU for the destination interface/route, then
the packet is dropped.

#### Tag behaviour for transmitted messages ####

On every message sent with the tag-owner bit set ("TO" in DSP0236), the kernel
must allocate a tag that will uniquely identify responses over a (destination
EID, source EID, tag-owner, tag) tuple. The tag value is 3 bits in size.

To allow this, a `sendto()` with the `MCTP_TAG_OWNER` bit set in the `smctp_tag`
field will cause the kernel to allocate a unique tag for subsequent replies from
that specific remote EID.

This allocation will expire when any of the following occur:

 * the socket is closed
 * a new message is sent to a new destination EID
 * an implementation-defined timeout expires

Because the "tag space" is limited, it may not be possible for the kernel to
allocate a unique tag for the outgoing message. In this case, the `sendto()`
call will fail with errno `EAGAIN`. This is analogous to the UDP behaviour when
a local port cannot be allocated for an outgoing message.

The implementation-defined timeout value shall be chosen to reasonably cover
standard reply timeouts. If necessary, this timeout may be modified through the
`MCTP_TAG_CONTROL` socket option.

Applications that expect to perform an ongoing message exchange with a
particular destination address may use the `connect()` call to set a
persistent remote address. In this case, the tag will be allocated during
`connect()`, and remain reserved for this socket until any of the following
occur:

 * the socket is closed
 * the remote address is changed through another call to `connect()`.

In particular, calling `sendto()` with a different address does not release the
tag reservation.
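
The limited tag space described in this section can be illustrated with a small allocator sketch. It is a simplification under stated assumptions: tags are tracked per destination EID only (ignoring source EID, network and timeouts), and `struct mctp_tag_table`, `mctp_tag_alloc()` and `mctp_tag_release()` are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MCTP_NUM_TAGS 8 /* the tag value is 3 bits wide */

/* Hypothetical bookkeeping: one bitmask of in-use tags per
 * destination EID. */
struct mctp_tag_table {
    uint8_t in_use[256];
};

/* Allocate the lowest free tag for `dest_eid`. Returns the tag value
 * (0-7), or -EAGAIN if all tags for that destination are in use. */
int mctp_tag_alloc(struct mctp_tag_table *t, uint8_t dest_eid)
{
    int tag;

    for (tag = 0; tag < MCTP_NUM_TAGS; tag++) {
        uint8_t bit = (uint8_t)(1u << tag);

        if (!(t->in_use[dest_eid] & bit)) {
            t->in_use[dest_eid] |= bit;
            return tag;
        }
    }

    return -EAGAIN;
}

/* Release a tag, e.g. on socket close or allocation expiry. */
void mctp_tag_release(struct mctp_tag_table *t, uint8_t dest_eid, int tag)
{
    t->in_use[dest_eid] &= (uint8_t)~(1u << tag);
}
```

With at most eight outstanding owned tags per destination in this model, a ninth concurrent request to the same EID fails with `EAGAIN` until an earlier allocation is released.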

Broadcast messages are particularly onerous for tag reservations. When a message
is transmitted with TO=1 and dest=0xff (the broadcast EID), the kernel must
reserve the tag across the entire range of possible EIDs. Therefore, a
particular tag value must be currently-unused across all EIDs to allow a
`sendto()` to a broadcast address. Additionally, this reservation is not cleared
when a reply is received, as there may be multiple replies to a broadcast.

For this reason, applications wanting to send to the broadcast address should
use the `connect()` system call to reserve a tag, and guarantee its availability
for future message transmission. Note that this will remove the tag value for
use with *any other EID*. Sending to the broadcast address should be avoided; we
expect few applications will need this functionality.


#### MCTP Control Protocol implementation ####

Aside from the "Resolve Endpoint ID" message, the MCTP control protocol
implementation would exist as a userspace process, `mctpd`. This process is
responsible for responding to incoming control protocol messages, any dynamic
EID allocations (for bus owner devices) and maintaining the MCTP route table
(for bridging devices).

This process would create a socket bound to the type `MCTP_TYPE_CONTROL`, with
the `MCTP_ADDR_EXT` socket option enabled in order to access physical addressing
data on incoming control protocol requests. It would interact with the kernel's
route table via a netlink interface - the same as that implemented for the
[Utility and configuration interfaces](#utility-and-configuration-interfaces).

### Neighbour and routing implementation ###

The packet-transmission behaviour of the MCTP infrastructure relies on a single
routing table to look up both route and neighbour information. Entries in this
table are of the format:

 | EID range | interface | physical address | metric | MTU | flags | expiry |
 |-----------|-----------|------------------|--------|-----|-------|--------|

This table can be updated from two sources:

 * From userspace, via a netlink interface (see the
   [Utility and configuration interfaces](#utility-and-configuration-interfaces)
   section).

 * Directly within the kernel, when basic neighbour information is discovered.
   Kernel-originated routes are marked as such in the flags field, and have a
   maximum validity age, indicated by the expiry field.

Kernel-discovered routing information can originate from two sources:

 * physical-to-EID mappings discovered through received packets

 * explicit endpoint physical-address resolution requests

When a packet is to be transmitted to an EID that does not have an entry in the
routing table, the kernel may attempt to resolve the physical address of that
endpoint using the Resolve Endpoint ID command of the MCTP Control Protocol
(section 12.9 of DSP0236). The response message will be used to add a
kernel-originated route into the routing table.

This is the only kernel-internal usage of MCTP Control Protocol messages.

## Utility and configuration interfaces ##

A small utility will be developed to control the state of the kernel MCTP stack.
This will be similar in design to the 'iproute2' tools, which perform a similar
function for the IPv4 and IPv6 protocols.

The utility will be invoked as `mctp`, and provide subcommands for managing
different aspects of the kernel stack.
802 803### `mctp link`: manage interfaces ### 804 805```sh 806 mctp link set <link> <up|down> 807 mctp link set <link> network <network-id> 808 mctp link set <link> mtu <mtu> 809 mctp link set <link> bus-owner <hwaddr> 810``` 811 812### `mctp network`: manage networks ### 813 814```sh 815 mctp network create <network-id> 816 mctp network set <network-id> forwarding <on|off> 817 mctp network set <network-id> default [<true|false>] 818 mctp network delete <network-id> 819``` 820 821### `mctp address`: manage local EID assignments ### 822 823```sh 824 mctp address add <eid> dev <link> 825 mctp address del <eid> dev <link> 826``` 827 828### `mctp route`: manage routing tables ### 829 830```sh 831 mctp route add net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>] 832 mctp route del net <network-id> eid <eid|eid-range> via <link> [hwaddr <addr>] [mtu <mtu>] [metric <metric>] 833 mctp route show [net <network-id>] 834``` 835 836### `mctp stat`: query socket status ### 837 838```sh 839 mctp stat 840``` 841 842A set of netlink message formats will be defined to support these control 843functions. 844 845 846# Design points & alternatives considered # 847 848## Including message-type byte in send/receive buffers ## 849 850This design specifies that message buffers passed to the kernel in send syscalls 851and from the kernel in receive syscalls will have the message type byte as the 852first byte of the buffer. This corresponds to the definition of a MCTP message 853payload in DSP0236. 854 855This somewhat duplicates the type data provided in `struct sockaddr_mctp`; it's 856superficially possible for the kernel to prepend this byte on send, and remove 857it on receive. 858 859However, the exact format of the MCTP message payload is not precisely defined 860by the specification. 
In particular, any message integrity check data (which
would also need to be appended / stripped in conjunction with the type byte) is
defined by the type specification, not DSP0236. The kernel would need knowledge
of all protocols in order to correctly deconstruct the payload data.

Therefore, we transfer the message payload as-is to userspace, without any
modification by the kernel.

## MCTP message-type specification: using `sockaddr_mctp.smctp_type` rather than protocol ##

This design specifies that message types are passed in the `smctp_type` field of
`struct sockaddr_mctp`. An alternative would be to pass them in the `protocol`
argument of the `socket()` system call:

```c
    int socket(int domain /* = AF_MCTP */, int type /* = SOCK_DGRAM */, int protocol);
```

The `smctp_type` implementation was chosen as it better matches the "addressing"
model of the message type; sockets are bound to an incoming message type,
similar to the IP protocol's model of binding UDP sockets to a local port number.

There is no kernel behaviour that depends on the specific type (particularly
given the design choice above), so the protocol argument is not a good fit
here.

Future additions that perform protocol-specific message handling, and so alter
the send/receive buffer format, may use a new protocol argument.


## Networks referenced by index rather than UUID ##

This design proposes referencing networks by an integer index. The MCTP standard
does optionally associate an RFC4122 UUID with a network; it would be possible
to use this UUID where we pass a network identifier.

This approach does not incorporate knowledge of network UUIDs in the kernel.
Given that the Get Network ID message in the MCTP Control Protocol is
implemented entirely in userspace, the kernel does not need to be aware of
network UUIDs, and requiring network references (for example, the
`smctp_network` field of `struct sockaddr_mctp`) to be of type `uuid_t`
complicates assignment.

Instead, an integer index is used, in a similar fashion to the integer
index used to reference `struct net_device`s elsewhere in the network stack.


## Tag behaviour alternatives ##

We considered *several* different designs for the tag handling behaviour. A
brief overview of the more feasible of those, and why they were rejected:

### Each socket is allocated a unique tag value on creation ###

We could allocate a tag for each socket on creation, and use that value when a
tag is required. This, however:

 * needlessly consumes a tag on non-tag-owning sockets (i.e., those which send
   with TO=0 - responders); and

 * limits us to 8 sockets per network.

### Tags only used for message packetisation / reassembly ###

An alternative would be to completely dissociate tag allocation from sockets,
and only allocate a tag for the (short-lived) task of packetising a message and
sending those packets. Tags would be released when the last packet has been sent.

However, this removes any facility to correlate responses with the correct
socket, which is the purpose of the TO bit in DSP0236.
In order for the sending
application to receive the response, we would either need to:

 * limit the system to one socket of each message type (which, for example,
   precludes running a requester and a responder of the same type); or

 * forward all incoming messages of a specific message-type to all sockets
   listening on that type, making it trivial to eavesdrop on MCTP data of
   other applications.

### Allocate a tag for one request/response pair ###

Another alternative would be to allocate a tag on each outgoing TO=1 message,
and then release that allocation after the incoming response to that tag (TO=0) is
observed.

However, MCTP protocols exist that do not have a 1:1 mapping of responses to
requests - more than one response may be valid for a given request message. For
example, in response to a request, an NVMe-MI implementation may send an
in-progress reply before the final reply. In this case, we would release the tag
after the first response is received, and then have no way to correlate the
second message with the socket.

Similarly, broadcast MCTP request messages may have multiple replies from
multiple endpoints, meaning we cannot release the tag allocation on the first
reply.
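
The failure mode of this last alternative can be shown with a small model. The
following is a minimal sketch, not the proposed implementation: `struct
tag_table`, `tag_alloc()` and `tag_deliver()` are hypothetical names, sockets
are represented by plain integer ids, and the table covers a single peer. The
only fact taken from DSP0236 is that the tag field is 3 bits wide, giving the 8
tag values mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

#define MCTP_NUM_TAGS 8 /* the message tag field is 3 bits wide */

/* Hypothetical tag table for one peer: which socket owns each tag value. */
struct tag_table {
    int owner[MCTP_NUM_TAGS]; /* socket id, or -1 if the tag is free */
};

static void tag_table_init(struct tag_table *t)
{
    for (int i = 0; i < MCTP_NUM_TAGS; i++)
        t->owner[i] = -1;
}

/* Allocate a free tag for an outgoing TO=1 message; -1 if none is free. */
static int tag_alloc(struct tag_table *t, int sock)
{
    for (int i = 0; i < MCTP_NUM_TAGS; i++) {
        if (t->owner[i] < 0) {
            t->owner[i] = sock;
            return i;
        }
    }
    return -1;
}

/*
 * Look up the socket that should receive an incoming TO=0 message carrying
 * the given tag. Returns the socket id, or -1 if no socket owns the tag.
 *
 * When release_on_first is true, this models the rejected policy: the tag
 * is freed as soon as the first response has been matched.
 */
static int tag_deliver(struct tag_table *t, int tag, bool release_on_first)
{
    int sock;

    assert(tag >= 0 && tag < MCTP_NUM_TAGS);

    sock = t->owner[tag];
    if (sock >= 0 && release_on_first)
        t->owner[tag] = -1;

    return sock;
}
```

If a socket with id 42 allocates a tag and the peer then sends two responses
with that tag (for example, an in-progress reply followed by a final reply),
calling `tag_deliver()` with `release_on_first` set returns 42 for the first
response and -1 for the second: the second response can no longer be matched
to a socket.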