Rocker Network Switch Register Programming Guide
************************************************

..
   Copyright (c) Scott Feldman <sfeldma@gmail.com>
   Copyright (c) Neil Horman <nhorman@tuxdriver.com>
   Version 0.11, 12/29/2014

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

Introduction
============

Overview
--------

This document describes the hardware/software interface for the Rocker switch
device. The intended audience is authors of OS drivers and device emulation
software.

Notations and Conventions
-------------------------

* In register descriptions, [n:m] indicates a range from bit n to bit m,
  inclusive.
* Use of leading 0x indicates a hexadecimal number.
* Use of leading 0b indicates a binary number.
* The use of RSVD or Reserved indicates that a bit or field is reserved for
  future use.
* Field width is in bytes, unless otherwise noted.
* Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
  on read.
* TLV values in network-byte-order are designated with (N).

PCI Configuration Registers
===========================

PCI Configuration Space
-----------------------

Each switch instance registers as a PCI device with PCI configuration space::

   offset   width  description            value
   ---------------------------------------------
   0x0      2      Vendor ID              0x1b36
   0x2      2      Device ID              0x0006
   0x4      4      Command/Status
   0x8      1      Revision ID            0x01
   0x9      3      Class code             0x2800
   0xC      1      Cache line size
   0xD      1      Latency timer
   0xE      1      Header type
   0xF      1      Built-in self test
   0x10     4      Base address low
   0x14     4      Base address high
   0x18-28         Reserved
   0x2C     2      Subsystem vendor ID    *
   0x2E     2      Subsystem ID           *
   0x30-38         Reserved
   0x3C     1      Interrupt line
   0x3D     1      Interrupt pin          0x00
   0x3E     1      Min grant              0x00
   0x3F     1      Max latency            0x00
   0x40     1      TRDY timeout
   0x41     1      Retry count
   0x42     2      Reserved

   * Assigned by sub-system implementation

Memory-Mapped Register Space
============================

There are two memory-mapped BARs. BAR0 maps device register space and is
0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
size, allowing for 256 MSI-X vectors.

All registers are 4 or 8 bytes long. It is assumed host software will access
4-byte registers with one 4-byte access, and 8-byte registers with either two
4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses,
the lower 4 bytes must be accessed first and then the upper 4 bytes, in that
order.

BAR0 device register space is organized as follows::

   offset          description
   ------------------------------------------------------
   0x0000-0x000f   Bogus registers to catch misbehaving
                   drivers.  Writes do nothing.  Reads
                   back as 0xDEADBABE.
   0x0010-0x00ff   Test registers
   0x0300-0x03ff   General purpose registers
   0x1000-0x1fff   Descriptor control

Holes in register space are reserved. Writes to reserved registers do nothing.
Reads of reserved registers read back as 0.
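
The two-access rule for 8-byte registers can be sketched as follows. This is a
minimal illustration, not driver code: a plain ``bytearray`` stands in for the
mapped BAR0 region, the helper names ``write64_split`` and ``read64_split`` are
hypothetical, and the little-endian layout is the one given in the Endianness
section of this document. Offset 0x1000 (DMA_DESC_BASE_ADDR of ring 0) is used
only because it is a 64-bit R/W register.

```python
import struct

BAR0_SIZE = 0x2000           # BAR0 size, from the text
RING0_BASE_ADDR = 0x1000     # a 64-bit (R/W) register, used as the example target


def write64_split(bar, offset, value):
    """Write a 64-bit register as two 4-byte LE accesses, lower half first.

    Returns the list of offsets touched, in access order.
    """
    lo = value & 0xFFFFFFFF
    hi = (value >> 32) & 0xFFFFFFFF
    accesses = []
    for off, word in ((offset, lo), (offset + 4, hi)):   # lower, then upper
        struct.pack_into("<I", bar, off, word)
        accesses.append(off)
    return accesses


def read64_split(bar, offset):
    """Read a 64-bit register as two 4-byte LE accesses, lower half first."""
    (lo,) = struct.unpack_from("<I", bar, offset)
    (hi,) = struct.unpack_from("<I", bar, offset + 4)
    return (hi << 32) | lo


bar0 = bytearray(BAR0_SIZE)  # hypothetical stand-in for the mapped BAR0
order = write64_split(bar0, RING0_BASE_ADDR, 0x1122334455667788)
value = read64_split(bar0, RING0_BASE_ADDR)
```

A real driver would issue the same two accesses through its platform's MMIO
helpers; the point of the sketch is only the ordering and the byte layout.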

No fancy stuff like write-combining is enabled on any of the registers.

BAR1 MSI-X register space is organized as follows::

   offset          description
   ------------------------------------------------------
   0x0000-0x0fff   MSI-X vector table (256 vectors total)
   0x1000-0x1fff   MSI-X PBA table


Interrupts, DMA, and Endianness
===============================

PCI Interrupts
--------------

The device supports only MSI-X interrupts. The BAR1 memory-mapped region
contains the MSI-X vector and PBA tables, with support for up to 256 MSI-X
vectors.

The vector assignment is::

   vector   description
   -----------------------------------------------------
   0        Command descriptor ring completion
   1        Event descriptor ring completion
   2        Test operation completion
   3        RSVD
   4-255    Tx and Rx descriptor ring completion
              Tx vector is even
              Rx vector is odd

An MSI-X vector table entry is 16 bytes::

   field        offset  width  description
   -------------------------------------------------------------
   lower_addr   0x0     4      [31:2]  message address[31:2]
                               [1:0]   Rsvd (4 byte alignment
                                       required)
   upper_addr   0x4     4      [31:19] Rsvd
                               [14:0]  message address[46:32]
   data         0x8     4      message data[31:0]
   control      0xc     4      [31:1]  Rsvd
                               [0]     mask (0 = enable,
                                       1 = masked)

Software should install the Interrupt Service Routine (ISR) before any ports
are enabled or any commands are issued on the command ring.

DMA Operations
--------------

DMA operations are used for packet DMA to/from the CPU, command and event
processing. Command processing includes statistical counters and table dumps,
table insertion/deletion, and more. Event processing provides an async
notification method for device-originating events. Each DMA operation has a
set of control registers to manage a descriptor ring.
The descriptor rings are
allocated from contiguous host DMA-able memory and registers specify the ring's
base address, size and current head and tail indices. Software always writes
the head, and hardware always writes the tail.

The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
of a descriptor. Software will clear this bit when posting a descriptor to the
ring, and hardware will set this bit when the descriptor is complete.

Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
Descriptor rings' base address must be 8-byte aligned. Descriptors must be
packed within the ring. Each descriptor in each ring must also be aligned on an
8-byte boundary. Each descriptor ring will have these registers::

   DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
   DMA_DESC_xxx_SIZE,      offset 0x1008 + (x * 32), 32-bit, (R/W)
   DMA_DESC_xxx_HEAD,      offset 0x100c + (x * 32), 32-bit, (R/W)
   DMA_DESC_xxx_TAIL,      offset 0x1010 + (x * 32), 32-bit, (R)
   DMA_DESC_xxx_CTRL,      offset 0x1014 + (x * 32), 32-bit, (W)
   DMA_DESC_xxx_CREDITS,   offset 0x1018 + (x * 32), 32-bit, (R/W)
   DMA_DESC_xxx_RSVD1,     offset 0x101c + (x * 32), 32-bit, (R/W)

Where x is descriptor ring index::

   index    ring
   --------------------
   0        CMD
   1        EVENT
   2        TX (port 0)
   3        RX (port 0)
   4        TX (port 1)
   5        RX (port 1)
   .
   .
   .
   124      TX (port 61)
   125      RX (port 61)
   126      Resv
   127      Resv

Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero. HEAD cannot be
written past TAIL. To do so would wrap the ring. An empty ring is when HEAD
== TAIL. A full ring is when HEAD is one position behind TAIL. Both HEAD and
TAIL increment and modulo wrap at the ring size.
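
The register layout and ring rules above reduce to a few lines of arithmetic.
The sketch below is illustrative only; all helper names (``ring_reg``,
``tx_ring``, ``rx_ring``, ``is_valid_ring_size``, ``ring_empty``,
``ring_full``) are hypothetical, and ``port`` is the zero-based port number
used in the ring index table above.

```python
DMA_DESC_BASE = 0x1000       # first descriptor ring register block
DMA_DESC_STRIDE = 32         # bytes of register space per ring

# Per-ring register offsets relative to the start of a ring's block.
RING_REG_OFFSETS = {
    "BASE_ADDR": 0x00,
    "SIZE": 0x08,
    "HEAD": 0x0C,
    "TAIL": 0x10,
    "CTRL": 0x14,
    "CREDITS": 0x18,
}


def ring_reg(ring_index, name):
    """BAR0 offset of register `name` for descriptor ring `ring_index`."""
    return DMA_DESC_BASE + RING_REG_OFFSETS[name] + ring_index * DMA_DESC_STRIDE


def tx_ring(port):
    """Ring index of the TX ring for a zero-based port (ring 2 is TX port 0)."""
    return 2 + 2 * port


def rx_ring(port):
    """Ring index of the RX ring for a zero-based port (ring 3 is RX port 0)."""
    return 3 + 2 * port


def is_valid_ring_size(entries):
    """Ring sizes must be a power of 2 between 2 and 64K entries."""
    return 2 <= entries <= 65536 and (entries & (entries - 1)) == 0


def ring_empty(head, tail):
    """An empty ring is when HEAD == TAIL."""
    return head == tail


def ring_full(head, tail, size):
    """A full ring is when HEAD is one position behind TAIL (modulo size)."""
    return (head + 1) % size == tail
```

For example, ``ring_reg(rx_ring(0), "HEAD")`` gives the offset of the HEAD
register for the first port's RX ring.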

CTRL register bits::

   bit      name            description
   ------------------------------------------------------------------------
   [0]      CTRL_RESET      Reset the descriptor ring
   [31:1]   Reserved

All descriptor types share some common fields::

   field               width   description
   -------------------------------------------------------------------
   DMA_DESC_BUF_ADDR   8       Phys addr of desc payload, 8-byte
                               aligned
   DMA_DESC_COOKIE     8       Desc cookie for completion matching,
                               upper-most bit is reserved
   DMA_DESC_BUF_SIZE   2       Desc payload size in bytes
   DMA_DESC_TLV_SIZE   2       Desc payload total size in bytes
                               used for TLVs.  Must be <=
                               DMA_DESC_BUF_SIZE.
   DMA_DESC_COMP_ERR   2       Completion status of associated
                               desc payload.  High order bit is
                               clear on new descs, toggled by
                               hw for completed items.

To support forward- and backward-compatibility, descriptor and completion
payloads are specified in TLV format. Fields are packed with Type=field name,
Length=field length, and Value=field value. Software will ignore unknown fields
filled in by the switch. Likewise, the switch will ignore unknown fields
filled in by software.

Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned. The
value within a TLV is also 8-byte aligned. The (packed, 8 byte) TLV header is::

   field   width   description
   -----------------------------
   type    4       TLV type
   len     2       TLV value length
   pad     2       Reserved

The alignment requirements for descriptors and TLVs are to avoid unaligned
access exceptions in software. Note that the payload for each TLV is also
8-byte aligned.
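
The TLV layout described above can be sketched as a small pack/unpack pair.
This is an illustration under stated assumptions, not the device's reference
encoding: the names ``pack_tlv``, ``unpack_tlvs`` and ``align8`` are
hypothetical, the header is taken to be little-endian (see the Endianness
section), and ``len`` is assumed to count only the value bytes, as the
"TLV value length" wording in the header table suggests.

```python
import struct

# type (4 bytes), len (2 bytes), pad (2 bytes): packed, 8 bytes, little-endian
TLV_HDR = struct.Struct("<IHH")


def align8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def pack_tlv(tlv_type, value):
    """Return header + value, with the value zero-padded to 8-byte alignment."""
    header = TLV_HDR.pack(tlv_type, len(value), 0)
    padding = bytes(align8(len(value)) - len(value))
    return header + value + padding


def unpack_tlvs(buf, tlv_size):
    """Walk the first tlv_size bytes of buf and return [(type, value), ...]."""
    tlvs = []
    offset = 0
    while offset + TLV_HDR.size <= tlv_size:
        tlv_type, length, _pad = TLV_HDR.unpack_from(buf, offset)
        start = offset + TLV_HDR.size
        tlvs.append((tlv_type, bytes(buf[start:start + length])))
        offset = start + align8(length)   # next header is 8-byte aligned
    return tlvs


# Two TLVs with value lengths 22 and 2; the type numbers are arbitrary.
payload = pack_tlv(1, bytes(range(22))) + pack_tlv(2, b"\x01\x02")
decoded = unpack_tlvs(payload, len(payload))
```

With these lengths the first TLV occupies 8 + 24 bytes and the second 8 + 8
bytes, so the value that would go into DMA_DESC_TLV_SIZE is 48.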

Figure 1 shows an example descriptor buffer with two TLVs::

                        <------- 8 bytes ------->

   8-byte  +----+    +-----------+-----+-----+                        +-+
   align             |   type    | len | pad |   TLV#1 hdr              |
                     +-----------+-----+-----+   (len=22)               |
                     |                       |                          |
                     |  value                |   TLV#1 value            |
                     |                       |   (padded to 8-byte      |
                     |                 +-----+    alignment)            |
                     |                 |/////|                          |
   8-byte  +----+    +-----------+-----------+                          |
   align             |   type    | len | pad |   TLV#2 hdr        DESC_BUF_SIZE
                     +-----+-----+-----+-----+   (len=2)                |
                     |value|/////////////////|   TLV#2 value            |
                     +-----+/////////////////|                          |
                     |///////////////////////|                          |
                     |///////////////////////|                          |
                     |///////////////////////|                          |
                     |////////unused/////////|                          |
                     |////////space//////////|                          |
                     |///////////////////////|                          |
                     |///////////////////////|                          |
                     |///////////////////////|                          |
                     +-----------------------+                        +-+

                               fig. 1

TLVs can be nested within the NEST TLV type.

Interrupt credits
^^^^^^^^^^^^^^^^^

MSI-X vectors used for descriptor ring completions use a credit mechanism for
efficient device, PCIe bus, OS and driver operations. Each descriptor ring has
a credit count which represents the number of outstanding descriptors to be
processed by the driver. As the device marks descriptors complete, the credit
count is incremented. As the driver processes those outstanding descriptors,
it returns credits back to the device. This way, the device knows the driver's
progress and can decide when to fire the next interrupt.
When the credit count is zero, and the first descriptors are posted for the
driver, a single interrupt is fired. Once the interrupt is fired, the
interrupt is disabled (auto-masked*). In response to the interrupt, the driver
will process descriptors and PIO write a returned credit value for that
descriptor ring.
If the driver returns all credits (the driver caught up with
the device and there is no outstanding work), then the interrupt is unmasked,
but not fired. If only partial credits are returned, the interrupt remains
masked but the device generates an interrupt, signaling the driver that more
outstanding work is available.

(* this masking is unrelated to the MSI-X interrupt mask register)

Endianness
----------

Device registers are hard-coded to little-endian (LE). The driver should
convert to/from host endianness to LE for device register accesses.

Descriptors are LE. Descriptor buffer TLVs will have LE type and length
fields, but the value field can either be LE or network-byte-order, depending
on context. TLV values containing network packet data will be in network-byte
order. A TLV value containing a field or mask used to compare against network
packet data is network-byte order. For example, flow match fields (and masks)
are network-byte-order since they're matched directly, byte-by-byte, against
network packet data. All non-network-packet TLV multi-byte values will be LE.

TLV values in network-byte-order are designated with (N).


Test Registers
==============

Rocker has several test registers to support troubleshooting register access,
interrupt generation, and DMA operations::

   TEST_REG,      offset 0x0010, 32-bit (R/W)
   TEST_REG64,    offset 0x0018, 64-bit (R/W)
   TEST_IRQ,      offset 0x0020, 32-bit (R/W)
   TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
   TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
   TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)

Reads to TEST_REG and TEST_REG64 will read a value equal to twice the last
value written to the register. The 32-bit and 64-bit versions are for testing
32-bit and 64-bit host accesses.

A vector can be written to TEST_IRQ and the device will generate an interrupt
for that vector.
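
A register-access self-test built on the TEST_REG read-back rule might look
like the sketch below. ``FakeRocker`` is a hypothetical in-memory stand-in for
the device, written only so the check can run; ``selftest_regs`` is likewise a
made-up name. Truncating the doubled value to the register width is an
assumption, since the text above does not say what a read returns when
doubling overflows.

```python
TEST_REG = 0x0010      # 32-bit (R/W)
TEST_REG64 = 0x0018    # 64-bit (R/W)


class FakeRocker:
    """Toy model of the two test registers: reads return twice the last write."""

    WIDTH_BITS = {TEST_REG: 32, TEST_REG64: 64}

    def __init__(self):
        self._last = {TEST_REG: 0, TEST_REG64: 0}

    def write(self, offset, value):
        mask = (1 << self.WIDTH_BITS[offset]) - 1
        self._last[offset] = value & mask

    def read(self, offset):
        # Assumption: the result is truncated to the register width.
        mask = (1 << self.WIDTH_BITS[offset]) - 1
        return (2 * self._last[offset]) & mask


def selftest_regs(dev):
    """Write a pattern to each test register and expect twice the value back."""
    patterns = ((TEST_REG, 0x12345678), (TEST_REG64, 0x0123456789ABCDEF))
    for offset, pattern in patterns:
        dev.write(offset, pattern)
        if dev.read(offset) != 2 * pattern:
            return False
    return True


dev = FakeRocker()
ok = selftest_regs(dev)
```

The patterns are chosen so that doubling does not overflow the register width,
which keeps the check independent of the overflow assumption.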

To test basic DMA operations, allocate a DMA-able host buffer and put the
buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE. Then, write to
TEST_DMA_CTRL to manipulate the buffer contents. TEST_DMA_CTRL operations are::

   operation              value   description
   -----------------------------------------------------------
   TEST_DMA_CTRL_CLEAR    1       clear buffer
   TEST_DMA_CTRL_FILL     2       fill buffer bytes with 0x96
   TEST_DMA_CTRL_INVERT   4       invert bytes in buffer

Various buffer addresses and sizes should be tested to verify that no address
boundary issues exist. In particular, buffers that start on an odd 8-byte
boundary and/or span multiple PAGE sizes should be tested.


Ports
=====

Physical and Logical Ports
--------------------------

The switch supports up to 62 physical (front-panel) ports. Register
PORT_PHYS_COUNT returns the actual number of physical ports available::

   PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)

In addition to front-panel ports, the switch supports logical ports for
tunnels.

Front-panel ports and logical tunnel ports are mapped into a single 32-bit port
space. A special CPU port is assigned port 0. The front-panel ports are
mapped to ports 1-62. A special loopback port is assigned port 63. Logical
tunnel ports are assigned ports 0x00010000-0x0001ffff.
To summarize the port assignments::

   port                    mapping
   -------------------------------------------------------
   0                       CPU port (for packets to/from host CPU)
   1-62                    front-panel physical ports
   63                      loopback port
   64-0x0000ffff           RSVD
   0x00010000-0x0001ffff   logical tunnel ports
   0x00020000-0xffffffff   RSVD

Physical Port Mode
------------------

Switch front-panel ports operate in a mode. Currently, the only mode is
OF-DPA. OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA)
Abstract Switch Specification, Version 1.0, from Broadcom Corporation.
To set/get the mode for front-panel ports, see Port Settings, below.

Port Settings
-------------

Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS::

   PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)

   Value is port bitmap.  Bits 0 and 63 always read 0.  Bits 1-62
   read 1 for link UP and 0 for link DOWN for respective front-panel ports.

Other properties for front-panel ports are available via DMA CMD descriptors::

   Get PORT_SETTINGS descriptor:

      field           width   description
      ----------------------------------------------
      PORT_SETTINGS   2       CMD_GET
      PPORT           4       Physical port #

   Get PORT_SETTINGS completion:

      field           width   description
      ----------------------------------------------
      PPORT           4       Physical port #
      SPEED           4       Current port interface speed, in Mbps
      DUPLEX          1       1 = Full, 0 = Half
      AUTONEG         1       1 = enabled, 0 = disabled
      MACADDR         6       Port MAC address
      MODE            1       0 = OF-DPA
      LEARNING        1       MAC address learning on port
                                1 = enabled
                                0 = disabled
      PHYS_NAME       <var>   Physical port name (string)

   Set PORT_SETTINGS descriptor:

      field           width   description
      ----------------------------------------------
      PORT_SETTINGS   2       CMD_SET
      PPORT           4       Physical port #
      SPEED           4       Port interface speed, in Mbps
      DUPLEX          1       1 = Full, 0 = Half
      AUTONEG         1       1 = enabled, 0 = disabled
      MACADDR         6       Port MAC address
      MODE            1       0 = OF-DPA

Port Enable
-----------

Front-panel ports are initially disabled, which means port ingress and egress
packets will be dropped. To enable or disable a port, use PORT_PHYS_ENABLE::

   PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)

   Value is bitmap of first 64 ports.  Bits 0 and 63 are ignored
   and always read as 0.  Write 1 to enable port; write 0 to disable it.
   Default is 0.


Switch Control
==============

This section covers switch-wide register settings.

Control
-------

This register is used for low level control of the switch::

   CONTROL: offset 0x0300, 32-bit, (W)

   bit      name            description
   ------------------------------------------------------------------------
   [0]      CONTROL_RESET   If set, device will perform reset
   [31:1]   Reserved

Switch ID
---------

The switch has a SWITCH_ID to be used by software to uniquely identify the
switch::

   SWITCH_ID: offset 0x0320, 64-bit, (R)

   Value is opaque to switch software and no special encoding is implied.


Events
======

The device notifies the host of non-I/O asynchronous events using the
event ring. The TLV structure for events is::

   field   width    description
   ---------------------------------------------------
   TYPE    4        Event type, one of:
                      1: LINK_CHANGED
                      2: MAC_VLAN_SEEN
   INFO    <nest>   Event info (details below)

Link Changed Event
------------------

When link status changes on a physical port, this event is generated::

   field      width    description
   ---------------------------------------------------
   INFO       <nest>
     PPORT    4        Physical port
     LINKUP   1        Link status:
                         0: down
                         1: up

MAC VLAN Seen Event
-------------------

When a packet ingresses on a port and the source MAC/VLAN isn't known to the
device, the device will generate this event. In response to the event, the
driver should install the MAC/VLAN on the port into the device's bridge
table. Once installed, the MAC/VLAN is known on the port and this event will
no longer be generated.

::

   field     width    description
   ---------------------------------------------------
   INFO      <nest>
     PPORT   4        Physical port
     MAC     6        MAC address
     VLAN    2        VLAN ID


CPU Packet Processing
=====================

Ingress packets directed to the host CPU for further processing are delivered
in the DMA RX ring.
Likewise, host CPU originating packets destined to egress
on switch ports are scheduled by software using the DMA TX ring.

Tx Packet Processing
--------------------

Software schedules packets for egress on switch ports using the DMA TX ring. A
TX descriptor buffer describes the packet location and size in host DMA-able
memory, the destination port, and any hardware-offload functions (such as L3
payload checksum offload). Software then bumps the descriptor head to signal
hardware of new Tx work. In response, hardware will DMA read Tx descriptors up
to head, DMA read descriptor buffer and packet data, perform offloading
functions, and finally frame packet on wire (network). Once packet processing
is complete, hardware will write back status to descriptor(s) to signal to
software that Tx is complete and software resources (e.g. skb) backing packet
can be released.

Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A
TLV is used for each packet fragment::

                                            pkt frag 1
                                            +-------+   +-+
                                        +---+       |     |
                          desc buf      |   |       |     |
                         +--------+     |   |       |     |
         Tx ring     +---+        +-----+   |       |     |
       +---------+   |   |  TLVs  |         +-------+     |
       |         +---+   +--------+         pkt frag 2    |
       | desc 0  |       |        +-----+   +-------+     |
       +---------+       |  TLVs  |     +---+       |     |
   head+-+       |       +--------+         |       |     |
       | desc 1  |       |        +-----+   +-------+     |pkt
       +---------+       |  TLVs  |     |                 |
       |         |       +--------+     |   pkt frag 3    |
       |         |                      |   +-------+     |
       +---------+                      +---+       |     |
       |         |                          |       |     |
       |         |                          |       |     |
       +---------+                          |       |     |
       |         |                          |       |     |
       |         |                          |       |     |
       +---------+                          |       |     |
       |         |                          +-------+   +-+
       |         |
       +---------+

                               fig 2.

The TLVs for Tx descriptor buffer are::

   field              width     description
   ---------------------------------------------------------------------
   PPORT              4         Destination physical port #
   TX_OFFLOAD         1         Hardware offload modes:
                                  0: no offload
                                  1: insert IP csum (ipv4 only)
                                  2: insert TCP/UDP csum
                                  3: L3 csum calc and insert
                                     into csum offset (TX_L3_CSUM_OFF)
                                     16-bit 1's complement csum value.
                                     IPv4 pseudo-header and IP
                                     already calculated by OS
                                     and inserted.
                                  4: TSO (TCP Segmentation Offload)
   TX_L3_CSUM_OFF     2         For L3 csum offload mode, the offset,
                                from the beginning of the packet,
                                of the csum field in the L3 header
   TX_TSO_MSS         2         For TSO offload mode, the
                                Maximum Segment Size in bytes
   TX_TSO_HDR_LEN     2         For TSO offload mode, the
                                length of ethernet, IP, and
                                TCP/UDP headers, including IP
                                and TCP options.
   TX_FRAGS           <array>   Packet fragments
     TX_FRAG          <nest>    Packet fragment
       TX_FRAG_ADDR   8         DMA address of packet fragment
       TX_FRAG_LEN    2         Packet fragment length

Possible status return codes in descriptor on completion are::

   DESC_COMP_ERR     reason
   --------------------------------------------------------------------
   0                 OK
   -ROCKER_ENXIO     address or data read err on desc buf or packet
                     fragment
   -ROCKER_EINVAL    bad pport or TSO or csum offloading error
   -ROCKER_ENOMEM    no memory for internal staging tx fragment

Rx Packet Processing
--------------------

Packets ingressing on switch ports that are not forwarded by the switch, but
rather directed to the host CPU for further processing, are delivered in the
DMA RX ring. Rx descriptor buffers are allocated by software and placed on the
ring. Hardware will fill Rx descriptor buffers with packet data, write the
completion, and signal to software that a new packet is ready. Since Rx packet
size is not known a priori, the Rx descriptor buffer must be allocated for
worst-case packet size.
A single Rx descriptor will contain the entire Rx
packet data in one RX_FRAG. Other Rx TLVs describe any hardware offloads
performed on the packet, such as checksum validation.

The TLVs for Rx descriptor buffer are::

   field             width   description
   ---------------------------------------------------
   PPORT             4       Source physical port #
   RX_FLAGS          2       Packet parsing flags:
                               (1 << 0): IPv4 packet
                               (1 << 1): IPv6 packet
                               (1 << 2): csum calculated
                               (1 << 3): IPv4 csum good
                               (1 << 4): IP fragment
                               (1 << 5): TCP packet
                               (1 << 6): UDP packet
                               (1 << 7): TCP/UDP csum good
                               (1 << 8): Offload forward
   RX_CSUM           2       IP calculated checksum:
                               IPv4: IP payload csum
                               IPv6: header and payload csum
                             (Only valid if RX_FLAGS:csum calc is set)
   RX_FRAG_ADDR      8       DMA address of packet fragment
   RX_FRAG_MAX_LEN   2       Packet maximum fragment length
   RX_FRAG_LEN       2       Actual packet fragment length after receive

Offload forward RX_FLAG indicates the device has already forwarded the packet
so the host CPU should not also forward the packet.

Possible status return codes in descriptor on completion are::

   DESC_COMP_ERR      reason
   --------------------------------------------------------------------
   0                  OK
   -ROCKER_ENXIO      address or data read err on desc buf
   -ROCKER_ENOMEM     no memory for internal staging desc buf
   -ROCKER_EMSGSIZE   Rx descriptor buffer wasn't big enough to contain
                      packet data TLV and other TLVs.


OF-DPA Mode
===========

OF-DPA mode allows the switch to offload flow packet processing functions to
hardware. An OpenFlow controller would communicate with an OpenFlow agent
installed on the switch. The OpenFlow agent would (directly or indirectly)
communicate with the Rocker switch driver, which in turn would program switch
hardware with flow functionality, as defined in OF-DPA.
The block diagram is::

   +----------------------+
   |         OF           |
   |  Remote Controller   |
   +--------+-------------+
            |
            |
   +--------+---------+
   |       OF         |
   |   Local Agent    |
   +------------------+
   |                  |
   |   Rocker Driver  |
   +------------------+
       <this spec>
   +------------------+
   |                  |
   |   Rocker Switch  |
   +------------------+

To participate in flow functions, ports must be configured for OF-DPA mode
during switch initialization.

OF-DPA Flow Table Interface
---------------------------

There are commands to add, modify, delete, and get stats of flow table entries.
The commands are issued using the DMA CMD descriptor ring. The following
commands are defined::

   CMD_ADD:        add an entry to flow table
   CMD_MOD:        modify an entry in flow table
   CMD_DEL:        delete an entry from flow table
   CMD_GET_STATS:  get stats for flow entry

TLVs for add and modify commands are::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_CMD                2      CMD_[ADD|MOD]
   OF_DPA_TBL                2      Flow table ID
                                      0: ingress port
                                      10: vlan
                                      20: termination mac
                                      30: unicast routing
                                      40: multicast routing
                                      50: bridging
                                      60: ACL policy
   OF_DPA_PRIORITY           4      Flow priority
   OF_DPA_HARDTIME           4      Hard timeout for flow
   OF_DPA_IDLETIME           4      Idle timeout for flow
   OF_DPA_COOKIE             8      Cookie

Additional TLVs based on flow table ID:

Table ID 0: ingress port::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_IN_PPORT           4      ingress physical port number
   OF_DPA_GOTO_TBL           2      goto table ID; zero to drop

Table ID 10: vlan::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_IN_PPORT           4      ingress physical port number
   OF_DPA_VLAN_ID            2      (N) vlan ID
   OF_DPA_VLAN_ID_MASK       2      (N) vlan ID mask
   OF_DPA_GOTO_TBL           2      goto table ID; zero to drop
   OF_DPA_NEW_VLAN_ID        2      (N) new vlan ID

Table ID 20: termination mac::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_IN_PPORT           4      ingress physical port number
   OF_DPA_IN_PPORT_MASK      4      ingress physical port number mask
   OF_DPA_ETHERTYPE          2      (N) must be either 0x0800 or 0x86dd
   OF_DPA_DST_MAC            6      (N) destination MAC
   OF_DPA_DST_MAC_MASK       6      (N) destination MAC mask
   OF_DPA_VLAN_ID            2      (N) vlan ID
   OF_DPA_VLAN_ID_MASK       2      (N) vlan ID mask
   OF_DPA_GOTO_TBL           2      only acceptable values are
                                    unicast or multicast routing
                                    table IDs
   OF_DPA_OUT_PPORT          2      if specified, must be
                                    controller, set zero otherwise

Table ID 30: unicast routing::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_ETHERTYPE          2      (N) must be either 0x0800 or 0x86dd
   OF_DPA_DST_IP             4      (N) destination IPv4 address.
                                    Must be unicast address
   OF_DPA_DST_IP_MASK        4      (N) IP mask.  Must be prefix mask
   OF_DPA_DST_IPV6           16     (N) destination IPv6 address.
                                    Must be unicast address
   OF_DPA_DST_IPV6_MASK      16     (N) IPv6 mask.  Must be prefix mask
   OF_DPA_GOTO_TBL           2      goto table ID; zero to drop
   OF_DPA_GROUP_ID           4      data for GROUP action must
                                    be an L3 Unicast group entry

Table ID 40: multicast routing::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_ETHERTYPE          2      (N) must be either 0x0800 or 0x86dd
   OF_DPA_VLAN_ID            2      (N) vlan ID
   OF_DPA_SRC_IP             4      (N) source IPv4.  Optional,
                                    can contain IPv4 address,
                                    must be completely masked
                                    if not used
   OF_DPA_SRC_IP_MASK        4      (N) IP Mask
   OF_DPA_DST_IP             4      (N) destination IPv4 address.
                                    Must be multicast address
   OF_DPA_SRC_IPV6           16     (N) source IPv6 Address.  Optional.
                                    Can contain IPv6 address,
                                    must be completely masked
                                    if not used
   OF_DPA_SRC_IPV6_MASK      16     (N) IPv6 mask.
   OF_DPA_DST_IPV6           16     (N) destination IPv6 Address.
                                    Must be multicast address
   OF_DPA_GOTO_TBL           2      goto table ID; zero to drop
   OF_DPA_GROUP_ID           4      data for GROUP action must
                                    be an L3 multicast group entry

Table ID 50: bridging::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_VLAN_ID            2      (N) vlan ID
   OF_DPA_TUNNEL_ID          4      tunnel ID
   OF_DPA_DST_MAC            6      (N) destination MAC
   OF_DPA_DST_MAC_MASK       6      (N) destination MAC mask
   OF_DPA_GOTO_TBL           2      goto table ID; zero to drop
   OF_DPA_GROUP_ID           4      data for GROUP action must
                                    be a L2 Interface, L2
                                    Multicast, L2 Flood,
                                    or L2 Overlay group entry
                                    as appropriate
   OF_DPA_TUNNEL_LPORT       4      unicast Tenant Bridging
                                    flows specify a tunnel
                                    logical port ID
   OF_DPA_OUT_PPORT          2      data for OUTPUT action,
                                    restricted to CONTROLLER,
                                    set to 0 otherwise

Table ID 60: acl policy::

   field                     width  description
   ----------------------------------------------------
   OF_DPA_IN_PPORT           4      ingress physical port number
   OF_DPA_IN_PPORT_MASK      4      ingress physical port number mask
   OF_DPA_ETHERTYPE          2      (N) ethertype
   OF_DPA_VLAN_ID            2      (N) vlan ID
   OF_DPA_VLAN_ID_MASK       2      (N) vlan ID mask
   OF_DPA_VLAN_PCP           2      (N) vlan Priority Code Point
   OF_DPA_VLAN_PCP_MASK      2      (N) vlan Priority Code Point mask
   OF_DPA_SRC_MAC            6      (N) source MAC
   OF_DPA_SRC_MAC_MASK       6      (N) source MAC mask
   OF_DPA_DST_MAC            6      (N) destination MAC
   OF_DPA_DST_MAC_MASK       6      (N) destination MAC mask
   OF_DPA_TUNNEL_ID          4      tunnel ID
   OF_DPA_SRC_IP             4      (N) source IPv4.  Optional,
                                    can contain IPv4 address,
                                    must be completely masked
                                    if not used
   OF_DPA_SRC_IP_MASK        4      (N) IP Mask
   OF_DPA_DST_IP             4      (N) destination IPv4 address.
                                    Must be multicast address
   OF_DPA_DST_IP_MASK        4      (N) IP Mask
   OF_DPA_SRC_IPV6           16     (N) source IPv6 Address.  Optional.
                                    Can contain IPv6 address,
                                    must be completely masked
                                    if not used
   OF_DPA_SRC_IPV6_MASK      16     (N) IPv6 mask
   OF_DPA_DST_IPV6           16     (N) destination IPv6 Address.  Must
                                    be multicast address.
   OF_DPA_DST_IPV6_MASK      16     (N) IPv6 mask
   OF_DPA_SRC_ARP_IP         4      (N) source IPv4 address in the ARP
                                    payload.  Only used if ethertype
                                    == 0x0806.
   OF_DPA_SRC_ARP_IP_MASK    4      (N) IP Mask
   OF_DPA_IP_PROTO           1      IP protocol
   OF_DPA_IP_PROTO_MASK      1      IP protocol mask
   OF_DPA_IP_DSCP            1      DSCP
   OF_DPA_IP_DSCP_MASK       1      DSCP mask
   OF_DPA_IP_ECN             1      ECN
   OF_DPA_IP_ECN_MASK        1      ECN mask
   OF_DPA_L4_SRC_PORT        2      (N) L4 source port, only for
                                    TCP, UDP, or SCTP
   OF_DPA_L4_SRC_PORT_MASK   2      (N) L4 source port mask
   OF_DPA_L4_DST_PORT        2      (N) L4 destination port, only for
                                    TCP, UDP, or SCTP
   OF_DPA_L4_DST_PORT_MASK   2      (N) L4 destination port mask
   OF_DPA_ICMP_TYPE          1      ICMP type, only if IP
                                    protocol is 1
   OF_DPA_ICMP_TYPE_MASK     1      ICMP type mask
   OF_DPA_ICMP_CODE          1      ICMP code
   OF_DPA_ICMP_CODE_MASK     1      ICMP code mask
   OF_DPA_IPV6_LABEL         4      (N) IPv6 flow label
   OF_DPA_IPV6_LABEL_MASK    4      (N) IPv6 flow label mask
   OF_DPA_GROUP_ID           4      data for GROUP action
   OF_DPA_QUEUE_ID_ACTION    1      write the queue ID
   OF_DPA_NEW_QUEUE_ID       1      queue ID
   OF_DPA_VLAN_PCP_ACTION    1      write the VLAN priority
   OF_DPA_NEW_VLAN_PCP       1      VLAN priority
   OF_DPA_IP_DSCP_ACTION     1      write the DSCP
   OF_DPA_NEW_IP_DSCP        1      new DSCP
   OF_DPA_TUNNEL_LPORT       4      restrict to valid tunnel
                                    logical port, set to 0
                                    otherwise.
   OF_DPA_OUT_PPORT          2      data for OUTPUT action,
                                    restricted to CONTROLLER,
                                    set to 0 otherwise
   OF_DPA_CLEAR_ACTIONS      4      if 1 packets matching flow are
                                    dropped (all other instructions
                                    ignored)

TLVs for flow delete and get stats command are::

   field                     width  description
   ---------------------------------------------------
   OF_DPA_CMD                2      CMD_[DEL|GET_STATS]
   OF_DPA_COOKIE             8      Cookie

On completion of get stats command, the descriptor buffer is written back with
the following TLVs::

   field                     width  description
   ---------------------------------------------------
   OF_DPA_STAT_DURATION      4      Flow duration
   OF_DPA_STAT_RX_PKTS       8      Received packets
   OF_DPA_STAT_TX_PKTS       8      Transmit packets

Possible status return codes in descriptor on completion are::

   DESC_COMP_ERR      command             reason
   --------------------------------------------------------------------
   0                  all                 OK
   -ROCKER_EFAULT     all                 head or tail index outside
                                          of ring
   -ROCKER_ENXIO      all                 address or data read err on
                                          desc buf
   -ROCKER_EMSGSIZE   GET_STATS           cmd descriptor buffer wasn't
                                          big enough to contain write-back
                                          TLVs
   -ROCKER_EINVAL     all                 invalid parameters passed in
   -ROCKER_EEXIST     ADD                 entry already exists
   -ROCKER_ENOSPC     ADD                 no space left in flow table
   -ROCKER_ENOENT     MOD|DEL|GET_STATS   cookie invalid

Group Table Interface
---------------------

There are commands to add, modify, delete, and get stats of group table
entries. The commands are issued using the DMA CMD descriptor ring.
The following commands are defined::

   CMD_ADD:        add an entry to group table
   CMD_MOD:        modify an entry in group table
   CMD_DEL:        delete an entry from group table
   CMD_GET_STATS:  get stats for group entry

TLVs for add and modify commands are::

   field                 width  description
   -----------------------------------------------------------
   FLOW_GROUP_CMD        2      CMD_[ADD|MOD]
   FLOW_GROUP_ID         2      Flow group ID
   FLOW_GROUP_TYPE       1      Group type:
                                  0: L2 interface
                                  1: L2 rewrite
                                  2: L3 unicast
                                  3: L2 multicast
                                  4: L2 flood
                                  5: L3 interface
                                  6: L3 multicast
                                  7: L3 ECMP
                                  8: L2 overlay
   FLOW_VLAN_ID          2      Vlan ID (types 0, 3, 4, 6)
   FLOW_L2_PORT          2      Port (type 0)
   FLOW_INDEX            4      Index (all types but 0)
   FLOW_OVERLAY_TYPE     1      Overlay sub-type (type 8):
                                  0: Flood unicast tunnel
                                  1: Flood multicast tunnel
                                  2: Multicast unicast tunnel
                                  3: Multicast multicast tunnel
   FLOW_GROUP_ACTION     nest
     FLOW_GROUP_ID       2      next group ID in chain (all
                                types except 0)
     FLOW_OUT_PORT       4      egress port (types 0, 8)
     FLOW_POP_VLAN_TAG   1      strip outer VLAN tag (type 1
                                only)
     FLOW_VLAN_ID        2      (types 1, 5)
     FLOW_SRC_MAC        6      (types 1, 2, 5)
     FLOW_DST_MAC        6      (types 1, 2)

TLVs for flow delete and get stats command are::

   field                 width  description
   -----------------------------------------------------------
   FLOW_GROUP_CMD        2      CMD_[DEL|GET_STATS]
   FLOW_GROUP_ID         2      Flow group ID

On completion of get stats command, the descriptor buffer is written back with
the following TLVs::

   field                    width  description
   ---------------------------------------------------
   FLOW_GROUP_ID            2      Flow group ID
   FLOW_STAT_DURATION       4      Flow duration
   FLOW_STAT_REF_COUNT      4      Flow reference count
   FLOW_STAT_BUCKET_COUNT   4      Flow bucket count

Possible status return codes in descriptor on completion are::

   DESC_COMP_ERR     command             reason
   --------------------------------------------------------------------
   0                 all                 OK
   -ROCKER_EFAULT    all                 head or tail index outside
                                         of ring
   -ROCKER_ENXIO     all                 address or data read err on
                                         desc buf
   -ROCKER_ENOSPC    GET_STATS           cmd descriptor buffer wasn't
                                         big enough to contain write-back
                                         TLVs
   -ROCKER_EINVAL    ADD|MOD             invalid parameters passed in
   -ROCKER_EEXIST    ADD                 entry already exists
   -ROCKER_ENOSPC    ADD                 no space left in flow table
   -ROCKER_ENOENT    MOD|DEL|GET_STATS   group ID invalid
   -ROCKER_EBUSY     DEL                 group reference count non-zero
   -ROCKER_ENODEV    ADD                 next group ID doesn't exist



References
==========

[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
Version 1.0, from Broadcom Corporation, February 21, 2014.