Rocker Network Switch Register Programming Guide
************************************************

..
   Copyright (c) Scott Feldman <sfeldma@gmail.com>
   Copyright (c) Neil Horman <nhorman@tuxdriver.com>
   Version 0.11, 12/29/2014

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 2 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
   GNU General Public License for more details.

Introduction
============

Overview
--------

This document describes the hardware/software interface for the Rocker switch
device.  The intended audience is authors of OS drivers and device emulation
software.

Notations and Conventions
-------------------------

* In register descriptions, [n:m] indicates a range from bit n to bit m,
  inclusive.
* Use of leading 0x indicates a hexadecimal number.
* Use of leading 0b indicates a binary number.
* The use of RSVD or Reserved indicates that a bit or field is reserved for
  future use.
* Field width is in bytes, unless otherwise noted.
* Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
  on read.
* TLV values in network-byte-order are designated with (N).


PCI Configuration Registers
===========================

PCI Configuration Space
-----------------------

Each switch instance registers as a PCI device with PCI configuration space::

	offset	width	description		value
	---------------------------------------------
	0x0	2	Vendor ID		0x1b36
	0x2	2	Device ID		0x0006
	0x4	4	Command/Status
	0x8	1	Revision ID		0x01
	0x9	3	Class code		0x2800
	0xC	1	Cache line size
	0xD	1	Latency timer
	0xE	1	Header type
	0xF	1	Built-in self test
	0x10	4	Base address low
	0x14	4	Base address high
	0x18-28		Reserved
	0x2C	2	Subsystem vendor ID	*
	0x2E	2	Subsystem ID		*
	0x30-38		Reserved
	0x3C	1	Interrupt line
	0x3D	1	Interrupt pin		0x00
	0x3E	1	Min grant		0x00
	0x3F	1	Max latency		0x00
	0x40	1	TRDY timeout
	0x41	1	Retry count
	0x42	2	Reserved

	* Assigned by sub-system implementation

Memory-Mapped Register Space
============================

There are two memory-mapped BARs.  BAR0 maps device register space and is
0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
size, allowing for 256 MSI-X vectors.

All registers are 4 or 8 bytes long.  It is assumed host software will access
4-byte registers with one 4-byte access, and 8-byte registers with either two
4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses,
the lower 4 bytes must be accessed first and then the upper 4 bytes, in that
order.
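
The access rule above can be sketched in Python.  This is a minimal model
rather than driver code: ``FakeBar0`` is a hypothetical stand-in for the mapped
BAR0 region, and only the ordering (lower 4 bytes first, then upper) and the
little-endian register layout are taken from this document.

```python
import struct


class FakeBar0:
    """Hypothetical stand-in for the 0x2000-byte BAR0 register region."""

    def __init__(self, size=0x2000):
        self.mem = bytearray(size)
        self.log = []  # (offset, width) of every access, in order

    def write32(self, offset, value):
        self.log.append((offset, 4))
        struct.pack_into("<I", self.mem, offset, value & 0xFFFFFFFF)

    def read32(self, offset):
        self.log.append((offset, 4))
        return struct.unpack_from("<I", self.mem, offset)[0]


def write64_split(bar, offset, value):
    """Write an 8-byte register as two 4-byte accesses, lower half first."""
    bar.write32(offset, value & 0xFFFFFFFF)
    bar.write32(offset + 4, (value >> 32) & 0xFFFFFFFF)


def read64_split(bar, offset):
    """Read an 8-byte register as two 4-byte accesses, lower half first."""
    lo = bar.read32(offset)
    hi = bar.read32(offset + 4)
    return (hi << 32) | lo
```

The model stores plain bytes and does not emulate any register side effects.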

BAR0 device register space is organized as follows::

	offset		description
	------------------------------------------------------
	0x0000-0x000f	Bogus registers to catch misbehaving
			drivers.  Writes do nothing.  Reads
			back as 0xDEADBABE.
	0x0010-0x00ff	Test registers
	0x0300-0x03ff	General purpose registers
	0x1000-0x1fff	Descriptor control

Holes in register space are reserved.  Writes to reserved registers do nothing.
Reads of reserved registers return 0.

No fancy stuff like write-combining is enabled on any of the registers.

BAR1 MSI-X register space is organized as follows::

	offset		description
	------------------------------------------------------
	0x0000-0x0fff	MSI-X vector table (256 vectors total)
	0x1000-0x1fff	MSI-X PBA table


Interrupts, DMA, and Endianness
===============================

PCI Interrupts
--------------

The device supports only MSI-X interrupts.  BAR1 memory-mapped region contains
the MSI-X vector and PBA tables, with support for up to 256 MSI-X vectors.

The vector assignment is::

	vector		description
	-----------------------------------------------------
	0		Command descriptor ring completion
	1		Event descriptor ring completion
	2		Test operation completion
	3		RSVD
	4-255		Tx and Rx descriptor ring completion
			  Tx vector is even
			  Rx vector is odd
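
The vector table above can be expressed as a small lookup.  The sketch below
only classifies a vector number; the function name and the returned labels are
illustrative choices, and it does not assume any particular mapping from port
number to vector beyond the even/odd rule stated above.

```python
def classify_vector(vector):
    """Return the role of an MSI-X vector number per the assignment table."""
    if not 0 <= vector <= 255:
        raise ValueError("MSI-X vector must be in 0..255")
    fixed = {0: "cmd", 1: "event", 2: "test", 3: "rsvd"}
    if vector in fixed:
        return fixed[vector]
    # Vectors 4-255: Tx vectors are even, Rx vectors are odd.
    return "tx" if vector % 2 == 0 else "rx"
```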

An MSI-X vector table entry is 16 bytes::

	field		offset	width	description
	-------------------------------------------------------------
	lower_addr	0x0	4	[31:2] message address[31:2]
					[1:0] Rsvd (4 byte alignment
						    required)
	upper_addr	0x4	4	[31:19] Rsvd
					[14:0] message address[46:32]
	data		0x8	4	message data[31:0]
	control		0xc	4	[31:1] Rsvd
					[0] mask (0 = enable,
						  1 = masked)

Software should install the Interrupt Service Routine (ISR) before any ports
are enabled or any commands are issued on the command ring.

DMA Operations
--------------

DMA operations are used for packet DMA to/from the CPU, command and event
processing.  Command processing includes statistical counters and table dumps,
table insertion/deletion, and more.  Event processing provides an async
notification method for device-originating events.  Each DMA operation has a
set of control registers to manage a descriptor ring.  The descriptor rings are
allocated from contiguous host DMA-able memory and registers specify the ring's
base address, size and current head and tail indices.  Software always writes
the head, and hardware always writes the tail.

The high-order bit of DMA_DESC_COMP_ERR is used to mark hardware completion
of a descriptor.  Software will clear this bit when posting a descriptor to the
ring, and hardware will set this bit when the descriptor is complete.

Descriptor ring sizes must be a power of 2 and range from 2 to 64K entries.
Descriptor rings' base address must be 8-byte aligned.  Descriptors must be
packed within the ring.  Each descriptor in each ring must also be aligned on
an 8-byte boundary.  Each descriptor ring will have these registers::

	DMA_DESC_xxx_BASE_ADDR, offset 0x1000 + (x * 32), 64-bit, (R/W)
	DMA_DESC_xxx_SIZE, offset 0x1008 + (x * 32), 32-bit, (R/W)
	DMA_DESC_xxx_HEAD, offset 0x100c + (x * 32), 32-bit, (R/W)
	DMA_DESC_xxx_TAIL, offset 0x1010 + (x * 32), 32-bit, (R)
	DMA_DESC_xxx_CTRL, offset 0x1014 + (x * 32), 32-bit, (W)
	DMA_DESC_xxx_CREDITS, offset 0x1018 + (x * 32), 32-bit, (R/W)
	DMA_DESC_xxx_RSVD1, offset 0x101c + (x * 32), 32-bit, (R/W)

Where x is descriptor ring index::

	index		ring
	--------------------
	0		CMD
	1		EVENT
	2		TX (port 0)
	3		RX (port 0)
	4		TX (port 1)
	5		RX (port 1)
	.
	.
	.
	124		TX (port 61)
	125		RX (port 61)
	126		Resv
	127		Resv
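
The register offsets and ring indices above follow a regular pattern, which
the sketch below computes.  The helper names are hypothetical; the base offset,
the 32-byte stride, the per-register offsets, the ring size limits and the
zero-based port numbering are taken from the register list and index table in
this section.

```python
DMA_DESC_BASE = 0x1000   # offset of ring 0 registers in BAR0
DMA_DESC_STRIDE = 32     # bytes of register space per ring

# Offsets of each register within one ring's 32-byte block.
RING_REGS = {
    "BASE_ADDR": 0x00,
    "SIZE": 0x08,
    "HEAD": 0x0C,
    "TAIL": 0x10,
    "CTRL": 0x14,
    "CREDITS": 0x18,
}


def ring_reg_offset(ring_index, reg):
    """BAR0 offset of register `reg` for descriptor ring `ring_index`."""
    if not 0 <= ring_index <= 127:
        raise ValueError("ring index must be in 0..127")
    return DMA_DESC_BASE + ring_index * DMA_DESC_STRIDE + RING_REGS[reg]


def tx_ring_index(port):
    """Ring index of the TX ring for zero-based port 0..61."""
    if not 0 <= port <= 61:
        raise ValueError("port must be in 0..61")
    return 2 + 2 * port


def rx_ring_index(port):
    """Ring index of the RX ring for zero-based port 0..61."""
    return tx_ring_index(port) + 1


def valid_ring_size(entries):
    """True if `entries` is a power of 2 between 2 and 64K."""
    return 2 <= entries <= 65536 and (entries & (entries - 1)) == 0
```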

Writing BASE_ADDR or SIZE will reset HEAD and TAIL to zero.  HEAD cannot be
written past TAIL.  To do so would wrap the ring.  An empty ring is when HEAD
== TAIL.  A full ring is when HEAD is one position behind TAIL.  Both HEAD and
TAIL increment and modulo wrap at the ring size.
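
The empty and full conditions can be checked with modular arithmetic.  The
class below is a software-only model of the two indices, written as a sketch:
the class and method names are hypothetical, and it tracks only HEAD (written
by software) and TAIL (written by hardware) as described above.

```python
class RingIndices:
    """Model of a descriptor ring's HEAD and TAIL indices."""

    def __init__(self, size):
        if size < 2 or size & (size - 1):
            raise ValueError("ring size must be a power of 2, at least 2")
        self.size = size
        self.head = 0  # advanced by software when posting descriptors
        self.tail = 0  # advanced by hardware when completing descriptors

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # HEAD is one position behind TAIL.
        return (self.head + 1) % self.size == self.tail

    def pending(self):
        """Number of descriptors posted but not yet completed."""
        return (self.head - self.tail) % self.size

    def post(self):
        """Software posts one descriptor; HEAD may not pass TAIL."""
        if self.is_full():
            raise BufferError("ring is full")
        self.head = (self.head + 1) % self.size

    def complete(self):
        """Hardware completes one descriptor."""
        if self.is_empty():
            raise BufferError("ring is empty")
        self.tail = (self.tail + 1) % self.size
```

Note that one slot is always left unused, so a ring of N entries holds at most
N - 1 outstanding descriptors in this model.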

CTRL register bits::

	bit	name		description
	------------------------------------------------------------------------
	[0]	CTRL_RESET	Reset the descriptor ring
	[31:1]	Reserved

All descriptor types share some common fields::

	field			width	description
	-------------------------------------------------------------------
	DMA_DESC_BUF_ADDR	8	Phys addr of desc payload, 8-byte
					aligned
	DMA_DESC_COOKIE		8	Desc cookie for completion matching,
					upper-most bit is reserved
	DMA_DESC_BUF_SIZE	2	Desc payload size in bytes
	DMA_DESC_TLV_SIZE	2	Desc payload total size in bytes
					used for TLVs.  Must be <=
					DMA_DESC_BUF_SIZE.
	DMA_DESC_COMP_ERR	2	Completion status of associated
					desc payload.  High order bit is
					clear on new descs, toggled by
					hw for completed items.

To support forward- and backward-compatibility, descriptor and completion
payloads are specified in TLV format.  Fields are packed with Type=field name,
Length=field length, and Value=field value.  Software will ignore unknown fields
filled in by the switch.  Likewise, the switch will ignore unknown fields
filled in by software.

Descriptor payload buffer is 8-byte aligned and TLVs are 8-byte aligned.  The
value within a TLV is also 8-byte aligned.  The (packed, 8 byte) TLV header is::

	field	width	description
	-----------------------------
	type	4	TLV type
	len	2	TLV value length
	pad	2	Reserved

The alignment requirements for descriptors and TLVs are to avoid unaligned
access exceptions in software.  Note that the payload for each TLV is also
8 byte aligned.

Figure 1 shows an example descriptor buffer with two TLVs::

                  <------- 8 bytes ------->

  8-byte  +––––+  +–––––––––––+–––––+–––––+                     +–+
  align           |   type    | len | pad |    TLV#1 hdr          |
                  +–––––––––––+–––––+–––––+    (len=22)           |
                  |                       |                       |
                  |  value                |    TLV#1 value        |
                  |                       |    (padded to 8-byte  |
                  |                 +–––––+     alignment)        |
                  |                 |/////|                       |
   8-byte +––––+  +–––––––––––+–––––––––––+                       |
   align          |   type    | len | pad |    TLV#2 hdr    DESC_BUF_SIZE
                  +–––––+–––––+–––––+–––––+    (len=2)            |
                  |value|/////////////////|    TLV#2 value        |
                  +–––––+/////////////////|                       |
                  |///////////////////////|                       |
                  |///////////////////////|                       |
                  |///////////////////////|                       |
                  |////////unused/////////|                       |
                  |////////space//////////|                       |
                  |///////////////////////|                       |
                  |///////////////////////|                       |
                  |///////////////////////|                       |
                  +–––––––––––––––––––––––+                     +–+

				fig. 1

TLVs can be nested within the NEST TLV type.
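
The TLV layout in Figure 1 can be reproduced with a few lines of Python.  This
is a sketch under stated assumptions: the TLV type numbers used in the usage
note are placeholders (this section does not assign numeric type values), and
``len`` is taken to be the unpadded value length, as Figure 1 suggests.

```python
import struct

# 8-byte packed TLV header: type (4), len (2), pad (2), little-endian.
TLV_HDR = struct.Struct("<IHH")


def pad8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def pack_tlv(tlv_type, value):
    """Return header + value, with the value zero-padded to 8-byte alignment."""
    header = TLV_HDR.pack(tlv_type, len(value), 0)
    padding = bytes(pad8(len(value)) - len(value))
    return header + value + padding


def pack_tlvs(items):
    """Pack a sequence of (type, value) pairs back to back."""
    return b"".join(pack_tlv(t, v) for t, v in items)
```

With placeholder types 1 and 2, ``pack_tlvs([(1, bytes(22)), (2, b"\x01\x02")])``
yields 48 bytes: 8 + 24 for the first TLV and 8 + 8 for the second, matching
the two TLVs drawn in Figure 1.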

Interrupt credits
^^^^^^^^^^^^^^^^^

MSI-X vectors used for descriptor ring completions use a credit mechanism for
efficient device, PCIe bus, OS and driver operations.  Each descriptor ring has
a credit count which represents the number of outstanding descriptors to be
processed by the driver.  As the device marks descriptors complete, the credit
count is incremented.  As the driver processes those outstanding descriptors,
it returns credits back to the device.  This way, the device knows the driver's
progress and can make decisions about when to fire the next interrupt or not.
When the credit count is zero, and the first descriptors are posted for the
driver, a single interrupt is fired.  Once the interrupt is fired, the
interrupt is disabled (auto-masked*).  In response to the interrupt, the driver
will process descriptors and PIO write a returned credit value for that
descriptor ring.  If the driver returns all credits (the driver caught up with
the device and there is no outstanding work), then the interrupt is unmasked,
but not fired.  If only partial credits are returned, the interrupt remains
masked but the device generates an interrupt, signaling the driver that more
outstanding work is available.

(* this masking is unrelated to the MSI-X interrupt mask register)
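
One way to read the credit rules above is as a small state machine on the
device side.  The model below is an interpretation for illustration, not the
device's implementation: the class and attribute names are hypothetical, and
``fired`` simply counts how many interrupts the rules would generate.

```python
class CreditedVector:
    """Device-side model of one ring's interrupt credit state."""

    def __init__(self):
        self.credits = 0     # descriptors completed but not yet returned
        self.masked = False  # auto-mask state (not the MSI-X mask bit)
        self.fired = 0       # number of interrupts generated so far

    def _fire(self):
        self.fired += 1
        self.masked = True

    def device_complete(self, count=1):
        """Device marks `count` descriptors complete."""
        was_idle = self.credits == 0
        self.credits += count
        if was_idle:
            # First completions after the driver caught up: one interrupt.
            self._fire()

    def driver_return(self, count):
        """Driver writes `count` returned credits for this ring."""
        if not 0 < count <= self.credits:
            raise ValueError("cannot return more credits than outstanding")
        self.credits -= count
        if self.credits == 0:
            self.masked = False  # unmasked, but no interrupt is fired
        else:
            self._fire()         # partial return: stay masked, fire again
```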

Endianness
----------

Device registers are hard-coded to little-endian (LE).  The driver should
convert to/from host endianness to LE for device register accesses.

Descriptors are LE.  Descriptor buffer TLVs will have LE type and length
fields, but the value field can either be LE or network-byte-order, depending
on context.  TLV values containing network packet data will be in
network-byte-order.  A TLV value containing a field or mask used to compare
against network packet data is network-byte-order.  For example, flow match
fields (and masks) are network-byte-order since they're matched directly,
byte-by-byte, against network packet data.  All non-network-packet TLV
multi-byte values will be LE.

TLV values in network-byte-order are designated with (N).
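
As a short illustration of the two byte orders, the sketch below encodes one
LE value and one (N) value with Python's ``struct`` module.  The helper names
are hypothetical; the field widths (a 4-byte LE port number and a 2-byte (N)
VLAN ID) follow the TLV tables in this document.

```python
import struct


def le32(value):
    """Encode a 4-byte little-endian TLV value (e.g. a port number)."""
    return struct.pack("<I", value)


def be16(value):
    """Encode a 2-byte network-byte-order TLV value (e.g. a VLAN ID)."""
    return struct.pack(">H", value)


pport_value = le32(5)     # least significant byte first
vlan_id_value = be16(100)  # most significant byte first
```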


Test Registers
==============

Rocker has several test registers to support troubleshooting register access,
interrupt generation, and DMA operations::

	TEST_REG, offset 0x0010, 32-bit (R/W)
	TEST_REG64, offset 0x0018, 64-bit (R/W)
	TEST_IRQ, offset 0x0020, 32-bit (R/W)
	TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
	TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
	TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)

Reads of TEST_REG and TEST_REG64 return a value equal to twice the last
value written to the register.  The 32-bit and 64-bit versions are for testing
32-bit and 64-bit host accesses.
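
The doubling behaviour can be modelled for emulation or used as a driver
self-check.  In the sketch below the names are hypothetical, and truncating the
doubled value to the register width on overflow is an assumption; this document
does not say what happens when the doubled value does not fit.

```python
class DoublingReg:
    """Model of TEST_REG (bits=32) or TEST_REG64 (bits=64)."""

    def __init__(self, bits):
        self.mask = (1 << bits) - 1
        self.value = 0

    def write(self, value):
        self.value = value & self.mask

    def read(self):
        # Assumption: the doubled value is truncated to the register width.
        return (self.value * 2) & self.mask


def check_test_reg(reg, pattern):
    """Write `pattern`, then confirm the register reads back twice the value."""
    reg.write(pattern)
    return reg.read() == ((pattern * 2) & reg.mask)
```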

A vector can be written to TEST_IRQ and the device will generate an interrupt
for that vector.

To test basic DMA operations, allocate a DMA-able host buffer and put the
buffer address into TEST_DMA_ADDR and size into TEST_DMA_SIZE.  Then, write to
TEST_DMA_CTRL to manipulate the buffer contents.  TEST_DMA_CTRL operations are::

	operation		value	description
	-----------------------------------------------------------
	TEST_DMA_CTRL_CLEAR	1	clear buffer
	TEST_DMA_CTRL_FILL	2	fill buffer bytes with 0x96
	TEST_DMA_CTRL_INVERT	4	invert bytes in buffer

Various buffer addresses and sizes should be tested to verify no address
boundary issue exists.  In particular, buffers that start on odd-8-byte
boundary and/or span multiple PAGE sizes should be tested.
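
A driver test can compute the expected buffer contents for each operation and
compare them with what the device wrote.  The sketch below applies the three
operations to a ``bytearray``; the function name is hypothetical, "invert" is
interpreted as a bitwise complement of every byte, and each write is treated
as selecting exactly one operation, both of which are assumptions.

```python
TEST_DMA_CTRL_CLEAR = 1
TEST_DMA_CTRL_FILL = 2
TEST_DMA_CTRL_INVERT = 4


def apply_test_dma_ctrl(buf, op):
    """Apply one TEST_DMA_CTRL operation to `buf` (a bytearray) in place."""
    if op == TEST_DMA_CTRL_CLEAR:
        buf[:] = bytes(len(buf))
    elif op == TEST_DMA_CTRL_FILL:
        buf[:] = b"\x96" * len(buf)
    elif op == TEST_DMA_CTRL_INVERT:
        # Assumption: "invert" means bitwise complement of each byte.
        buf[:] = bytes(b ^ 0xFF for b in buf)
    else:
        raise ValueError("unknown TEST_DMA_CTRL operation")
```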


Ports
=====

Physical and Logical Ports
--------------------------

The switch supports up to 62 physical (front-panel) ports.  Register
PORT_PHYS_COUNT returns the actual number of physical ports available::

	PORT_PHYS_COUNT, offset 0x0304, 32-bit, (R)

In addition to front-panel ports, the switch supports logical ports for
tunnels.

Front-panel ports and logical tunnel ports are mapped into a single 32-bit port
space.  A special CPU port is assigned port 0.  The front-panel ports are
mapped to ports 1-62.  A special loopback port is assigned port 63.  Logical
tunnel ports are assigned ports 0x00010000-0x0001ffff.
To summarize the port assignments::

	port			mapping
	-------------------------------------------------------
	0			CPU port (for packets to/from host CPU)
	1-62			front-panel physical ports
	63			loopback port
	64-0x0000ffff		RSVD
	0x00010000-0x0001ffff	logical tunnel ports
	0x00020000-0xffffffff	RSVD
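
The port space above can be expressed as a simple range check.  The function
name and the returned labels below are illustrative; the ranges come directly
from the table.

```python
def classify_port(port):
    """Return the kind of port for a 32-bit port number."""
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError("port number must fit in 32 bits")
    if port == 0:
        return "cpu"
    if 1 <= port <= 62:
        return "front-panel"
    if port == 63:
        return "loopback"
    if 0x00010000 <= port <= 0x0001FFFF:
        return "tunnel"
    return "rsvd"
```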

Physical Port Mode
------------------

Switch front-panel ports operate in a mode.  Currently, the only mode is
OF-DPA.  OF-DPA[1] mode is based on OpenFlow Data Plane Abstraction (OF-DPA)
Abstract Switch Specification, Version 1.0, from Broadcom Corporation.  To
set/get the mode for front-panel ports, see port settings, below.

Port Settings
-------------

Link status for all front-panel ports is available via PORT_PHYS_LINK_STATUS::

	PORT_PHYS_LINK_STATUS, offset 0x0310, 64-bit, (R)

	Value is port bitmap.  Bits 0 and 63 always read 0.  Bits 1-62
	read 1 for link UP and 0 for link DOWN for respective front-panel ports.
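
Decoding the link status bitmap is a matter of testing bits 1 through 62.  The
helper below is a sketch with a hypothetical name; it ignores bits 0 and 63,
which the register description says always read 0.

```python
def ports_with_link_up(status):
    """Return the front-panel port numbers whose link bit is set."""
    return [port for port in range(1, 63) if status & (1 << port)]
```

For example, a register value of ``0b110`` reports link UP on ports 1 and 2.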

Other properties for front-panel ports are available via DMA CMD descriptors::

	Get PORT_SETTINGS descriptor:

		field		width	description
		----------------------------------------------
		PORT_SETTINGS	2	CMD_GET
		PPORT		4	Physical port #

	Get PORT_SETTINGS completion:

		field		width	description
		----------------------------------------------
		PPORT		4	Physical port #
		SPEED		4	Current port interface speed, in Mbps
		DUPLEX		1	1 = Full, 0 = Half
		AUTONEG		1	1 = enabled, 0 = disabled
		MACADDR		6	Port MAC address
		MODE		1	0 = OF-DPA
		LEARNING	1	MAC address learning on port
						1 = enabled
						0 = disabled
		PHYS_NAME	<var>	Physical port name (string)

	Set PORT_SETTINGS descriptor:

		field		width	description
		----------------------------------------------
		PORT_SETTINGS	2	CMD_SET
		PPORT		4	Physical port #
		SPEED		4	Port interface speed, in Mbps
		DUPLEX		1	1 = Full, 0 = Half
		AUTONEG		1	1 = enabled, 0 = disabled
		MACADDR		6	Port MAC address
		MODE		1	0 = OF-DPA

Port Enable
-----------

Front-panel ports are initially disabled, which means port ingress and egress
packets will be dropped.  To enable or disable a port, use PORT_PHYS_ENABLE::

	PORT_PHYS_ENABLE: offset 0x0318, 64-bit, (R/W)

	Value is bitmap of first 64 ports.  Bits 0 and 63 are ignored
	and always read as 0.  Write 1 to enable port; write 0 to disable it.
	Default is 0.


Switch Control
==============

This section covers switch-wide register settings.

Control
-------

This register is used for low level control of the switch::

	CONTROL: offset 0x0300, 32-bit, (W)

	bit	name		description
	------------------------------------------------------------------------
	[0]	CONTROL_RESET	If set, device will perform reset
	[31:1]	Reserved

Switch ID
---------

The switch has a SWITCH_ID to be used by software to uniquely identify the
switch::

	SWITCH_ID: offset 0x0320, 64-bit, (R)

	Value is opaque to switch software and no special encoding is implied.


Events
======

Non-I/O asynchronous events from the device are reported to the host using the
event ring.  The TLV structure for events is::

	field		width	description
	---------------------------------------------------
	TYPE		4	Event type, one of:
					1: LINK_CHANGED
					2: MAC_VLAN_SEEN
	INFO		<nest>	Event info (details below)

Link Changed Event
------------------

When link status changes on a physical port, this event is generated::

	field		width	description
	---------------------------------------------------
	INFO		<nest>
	  PPORT		4	Physical port
	  LINKUP	1	Link status:
					0: down
					1: up

MAC VLAN Seen Event
-------------------

When a packet ingresses on a port and the source MAC/VLAN isn't known to the
device, the device will generate this event.  In response to the event, the
driver should install the MAC/VLAN for that port into the device's bridge
table.  Once installed, the MAC/VLAN is known on the port and this event will
no longer be generated.

::

	field		width	description
	---------------------------------------------------
	INFO		<nest>
	  PPORT		4	Physical port
	  MAC		6	MAC address
	  VLAN		2	VLAN ID


CPU Packet Processing
=====================

Ingress packets directed to the host CPU for further processing are delivered
in the DMA RX ring.  Likewise, host CPU originating packets destined to egress
on switch ports are scheduled by software using the DMA TX ring.

Tx Packet Processing
--------------------

Software schedules packets for egress on switch ports using the DMA TX ring.  A
TX descriptor buffer describes the packet location and size in host DMA-able
memory, the destination port, and any hardware-offload functions (such as L3
payload checksum offload).  Software then bumps the descriptor head to signal
hardware of new Tx work.  In response, hardware will DMA read Tx descriptors up
to head, DMA read descriptor buffer and packet data, perform offloading
functions, and finally frame packet on wire (network).  Once packet processing
is complete, hardware will write back status to descriptor(s) to signal to
software that Tx is complete and software resources (e.g. skb) backing packet
can be released.

Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A
TLV is used for each packet fragment::

	                                           pkt frag 1
	                                           +–––––––+  +–+
	                                       +–––+       |    |
	                         desc buf      |   |       |    |
	                        +––––––––+     |   |       |    |
	        Tx ring     +–––+        +–––––+   |       |    |
	      +–––––––––+   |   |  TLVs  |         +–––––––+    |
	      |         +–––+   +––––––––+         pkt frag 2   |
	      | desc 0  |       |        +–––––+   +–––––––+    |
	      +–––––––––+       |  TLVs  |     +–––+       |    |
	head+–+         |       +––––––––+         |       |    |
	      | desc 1  |       |        +–––––+   +–––––––+    |pkt
	      +–––––––––+       |  TLVs  |     |                |
	      |         |       +––––––––+     |   pkt frag 3   |
	      |         |                      |   +–––––––+    |
	      +–––––––––+                      +–––+       |    |
	      |         |                          |       |    |
	      |         |                          |       |    |
	      +–––––––––+                          |       |    |
	      |         |                          |       |    |
	      |         |                          |       |    |
	      +–––––––––+                          |       |    |
	      |         |                          +–––––––+  +–+
	      |         |
	      +–––––––––+

				fig. 2

The TLVs for Tx descriptor buffer are::

	field			width	description
	---------------------------------------------------------------------
	PPORT			4	Destination physical port #
	TX_OFFLOAD		1	Hardware offload modes:
					  0: no offload
					  1: insert IP csum (ipv4 only)
					  2: insert TCP/UDP csum
					  3: L3 csum calc and insert
					     into csum offset (TX_L3_CSUM_OFF)
					     16-bit 1's complement csum value.
					     IPv4 pseudo-header and IP
					     already calculated by OS
					     and inserted.
					  4: TSO (TCP Segmentation Offload)
	TX_L3_CSUM_OFF		2	For L3 csum offload mode, the offset,
					from the beginning of the packet,
					of the csum field in the L3 header
	TX_TSO_MSS		2	For TSO offload mode, the
					Maximum Segment Size in bytes
	TX_TSO_HDR_LEN		2	For TSO offload mode, the
					length of ethernet, IP, and
					TCP/UDP headers, including IP
					and TCP options.
	TX_FRAGS		<array>	Packet fragments
	  TX_FRAG		<nest>	Packet fragment
	    TX_FRAG_ADDR	8	DMA address of packet fragment
	    TX_FRAG_LEN		2	Packet fragment length

Possible status return codes in descriptor on completion are::

	DESC_COMP_ERR	reason
	--------------------------------------------------------------------
	0		OK
	-ROCKER_ENXIO	address or data read err on desc buf or packet
			fragment
	-ROCKER_EINVAL	bad pport or TSO or csum offloading error
	-ROCKER_ENOMEM	no memory for internal staging tx fragment

Rx Packet Processing
--------------------

Packets ingressing on switch ports that are not forwarded by the switch but
rather directed to the host CPU for further processing are delivered in the DMA
RX ring.  Rx descriptor buffers are allocated by software and placed on the
ring.  Hardware will fill Rx descriptor buffers with packet data, write the
completion, and signal to software that a new packet is ready.  Since Rx packet
size is not known a priori, the Rx descriptor buffer must be allocated for
worst-case packet size.  A single Rx descriptor will contain the entire Rx
packet data in one RX_FRAG.  Other Rx TLVs describe any hardware offloads
performed on the packet, such as checksum validation.

The TLVs for Rx descriptor buffer are::

	field		width	description
	---------------------------------------------------
	PPORT		4	Source physical port #
	RX_FLAGS	2	Packet parsing flags:
				  (1 << 0): IPv4 packet
				  (1 << 1): IPv6 packet
				  (1 << 2): csum calculated
				  (1 << 3): IPv4 csum good
				  (1 << 4): IP fragment
				  (1 << 5): TCP packet
				  (1 << 6): UDP packet
				  (1 << 7): TCP/UDP csum good
				  (1 << 8): Offload forward
	RX_CSUM		2	IP calculated checksum:
				  IPv4: IP payload csum
				  IPv6: header and payload csum
				(Only valid if RX_FLAGS:csum calc is set)
	RX_FRAG_ADDR	8	DMA address of packet fragment
	RX_FRAG_MAX_LEN	2	Packet maximum fragment length
	RX_FRAG_LEN	2	Actual packet fragment length after receive

Offload forward RX_FLAG indicates the device has already forwarded the packet
so the host CPU should not also forward the packet.
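
The RX_FLAGS bit assignments can be decoded with a table lookup.  The sketch
below uses hypothetical flag labels and function names; the bit positions are
the ones listed in the RX_FLAGS description above.

```python
# Bit position -> label, following the RX_FLAGS table.
RX_FLAG_NAMES = {
    0: "ipv4",
    1: "ipv6",
    2: "csum_calc",
    3: "ipv4_csum_good",
    4: "ip_fragment",
    5: "tcp",
    6: "udp",
    7: "tcp_udp_csum_good",
    8: "fwd_offload",
}


def decode_rx_flags(flags):
    """Return the set of flag labels present in a 16-bit RX_FLAGS value."""
    return {name for bit, name in RX_FLAG_NAMES.items() if flags & (1 << bit)}


def host_should_forward(flags):
    """False when the device reports it has already forwarded the packet."""
    return "fwd_offload" not in decode_rx_flags(flags)
```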

Possible status return codes in descriptor on completion are::

	DESC_COMP_ERR	reason
	--------------------------------------------------------------------
	0		OK
	-ROCKER_ENXIO	address or data read err on desc buf
	-ROCKER_ENOMEM	no memory for internal staging desc buf
	-ROCKER_EMSGSIZE Rx descriptor buffer wasn't big enough to contain
			packet data TLV and other TLVs.


OF-DPA Mode
===========

OF-DPA mode allows the switch to offload flow packet processing functions to
hardware.  An OpenFlow controller would communicate with an OpenFlow agent
installed on the switch.  The OpenFlow agent would (directly or indirectly)
communicate with the Rocker switch driver, which in turn would program switch
hardware with flow functionality, as defined in OF-DPA.  The block diagram is::

		+–––––––––––––––----–––+
		|        OF            |
		|  Remote Controller   |
		+––––––––+––----–––––––+
		         |
		         |
		+––––––––+–––––––––+
		|       OF         |
		|   Local Agent    |
		+––––––––––––––––––+
		|                  |
		|   Rocker Driver  |
		+––––––––––––––––––+
		    <this spec>
		+––––––––––––––––––+
		|                  |
		|   Rocker Switch  |
		+––––––––––––––––––+

To participate in flow functions, ports must be configured for OF-DPA mode
during switch initialization.

OF-DPA Flow Table Interface
---------------------------

There are commands to add, modify, delete, and get stats of flow table entries.
The commands are issued using the DMA CMD descriptor ring.  The following
commands are defined::

	CMD_ADD:		add an entry to flow table
	CMD_MOD:		modify an entry in flow table
	CMD_DEL:		delete an entry from flow table
	CMD_GET_STATS:		get stats for flow entry

TLVs for add and modify commands are::

	field			width	description
	----------------------------------------------------
	OF_DPA_CMD		2	CMD_[ADD|MOD]
	OF_DPA_TBL		2	Flow table ID
					  0: ingress port
					  10: vlan
					  20: termination mac
					  30: unicast routing
					  40: multicast routing
					  50: bridging
					  60: ACL policy
	OF_DPA_PRIORITY		4	Flow priority
	OF_DPA_HARDTIME		4	Hard timeout for flow
	OF_DPA_IDLETIME		4	Idle timeout for flow
	OF_DPA_COOKIE		8	Cookie

Additional TLVs based on flow table ID:

Table ID 0: ingress port::

	field			width	description
	----------------------------------------------------
	OF_DPA_IN_PPORT		4	ingress physical port number
	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop

Table ID 10: vlan::

	field			width	description
	----------------------------------------------------
	OF_DPA_IN_PPORT		4	ingress physical port number
	OF_DPA_VLAN_ID		2 (N)	vlan ID
	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
	OF_DPA_NEW_VLAN_ID	2 (N)	new vlan ID

Table ID 20: termination mac::

	field			width	description
	----------------------------------------------------
	OF_DPA_IN_PPORT		4	ingress physical port number
	OF_DPA_IN_PPORT_MASK	4	ingress physical port number mask
	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
	OF_DPA_DST_MAC		6 (N)	destination MAC
	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
	OF_DPA_VLAN_ID		2 (N)	vlan ID
	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
	OF_DPA_GOTO_TBL		2	only acceptable values are
					unicast or multicast routing
					table IDs
	OF_DPA_OUT_PPORT	2	if specified, must be
					controller, set zero otherwise
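
The constraints in the termination mac table lend themselves to a pre-flight
check before a command is posted.  The sketch below is illustrative only: the
function name and error strings are hypothetical, and it checks just the two
constraints that are stated numerically above (ethertype and goto table ID).

```python
OF_DPA_TBL_UNICAST_ROUTING = 30
OF_DPA_TBL_MULTICAST_ROUTING = 40

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD


def check_term_mac_entry(ethertype, goto_tbl):
    """Return a list of problems with a termination mac flow entry."""
    problems = []
    if ethertype not in (ETHERTYPE_IPV4, ETHERTYPE_IPV6):
        problems.append("ethertype must be 0x0800 or 0x86dd")
    if goto_tbl not in (OF_DPA_TBL_UNICAST_ROUTING,
                        OF_DPA_TBL_MULTICAST_ROUTING):
        problems.append("goto table must be unicast or multicast routing")
    return problems
```

An empty list means the entry passes both checks.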

Table ID 30: unicast routing::

	field			width	description
	----------------------------------------------------
	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
					Must be unicast address
	OF_DPA_DST_IP_MASK	4 (N)	IP mask.  Must be prefix mask
	OF_DPA_DST_IPV6		16 (N)	destination IPv6 address.
					Must be unicast address
	OF_DPA_DST_IPV6_MASK	16 (N)	IPv6 mask. Must be prefix mask
	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
	OF_DPA_GROUP_ID		4	data for GROUP action must
					be an L3 Unicast group entry
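
A prefix mask is a run of one bits followed only by zero bits.  The sketch
below tests that property on the mask as an integer; the function names are
hypothetical, and ``bits`` selects 32 for IPv4 or 128 for IPv6.  Converting
the integer to network-byte-order for the TLV value is left out.

```python
def is_prefix_mask(mask, bits=32):
    """True if `mask` is contiguous ones from the top bit, then zeros."""
    full = (1 << bits) - 1
    if not 0 <= mask <= full:
        raise ValueError("mask does not fit in the given width")
    inverted = ~mask & full
    # The inverted mask must look like 0...01...1, i.e. 2**k - 1.
    return (inverted & (inverted + 1)) == 0


def prefix_mask(length, bits=32):
    """Build the mask for a prefix of `length` bits."""
    if not 0 <= length <= bits:
        raise ValueError("prefix length out of range")
    return ((1 << length) - 1) << (bits - length)
```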
775
776Table ID 40: multicast routing::
777
778	field			width	description
779	----------------------------------------------------
780	OF_DPA_ETHERTYPE	2 (N)	must be either 0x0800 or 0x86dd
781	OF_DPA_VLAN_ID		2 (N)	vlan ID
782	OF_DPA_SRC_IP		4 (N)	source IPv4. Optional,
783					can contain IPv4 address,
784					must be completely masked
785					if not used
786	OF_DPA_SRC_IP_MASK	4 (N)	IP Mask
787	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
788					Must be multicast address
789	OF_DPA_SRC_IPV6		16 (N)	source IPv6 Address. Optional.
790					Can contain IPv6 address,
791					must be completely masked
792					if not used
793	OF_DPA_SRC_IPV6_MASK	16 (N)	IPv6 mask.
794	OF_DPA_DST_IPV6		16 (N)	destination IPv6 Address. Must
795					be multicast address
796					Must be multicast address
797	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
798	OF_DPA_GROUP_ID		4	data for GROUP action must
799					be an L3 multicast group entry
800
801Table ID 50: bridging::
802
803	field			width	description
804	----------------------------------------------------
805	OF_DPA_VLAN_ID		2 (N)	vlan ID
806	OF_DPA_TUNNEL_ID	4	tunnel ID
807	OF_DPA_DST_MAC		6 (N)	destination MAC
808	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
809	OF_DPA_GOTO_TBL		2	goto table ID; zero to drop
810	OF_DPA_GROUP_ID		4	data for GROUP action must
811					be a L2 Interface, L2
812					Multicast, L2 Flood,
813					or L2 Overlay group entry
814					as appropriate
815	OF_DPA_TUNNEL_LPORT	4	unicast Tenant Bridging
816					flows specify a tunnel
817					logical port ID
818	OF_DPA_OUT_PPORT	2	data for OUTPUT action,
819					restricted to CONTROLLER,
820					set to 0 otherwise
821
822Table ID 60: acl policy::
823
824	field			width	description
825	----------------------------------------------------
826	OF_DPA_IN_PPORT		4	ingress physical port number
827	OF_DPA_IN_PPORT_MASK	4	ingress physical port number mask
828	OF_DPA_ETHERTYPE	2 (N)	ethertype
829	OF_DPA_VLAN_ID		2 (N)	vlan ID
830	OF_DPA_VLAN_ID_MASK	2 (N)	vlan ID mask
831	OF_DPA_VLAN_PCP		2 (N)	vlan Priority Code Point
832	OF_DPA_VLAN_PCP_MASK	2 (N)	vlan Priority Code Point mask
833	OF_DPA_SRC_MAC		6 (N)	source MAC
834	OF_DPA_SRC_MAC_MASK	6 (N)	source MAC mask
835	OF_DPA_DST_MAC		6 (N)	destination MAC
836	OF_DPA_DST_MAC_MASK	6 (N)	destination MAC mask
837	OF_DPA_TUNNEL_ID	4	tunnel ID
838	OF_DPA_SRC_IP		4 (N)	source IPv4. Optional,
839					can contain IPv4 address,
840					must be completely masked
841					if not used
842	OF_DPA_SRC_IP_MASK	4 (N)	IP Mask
843	OF_DPA_DST_IP		4 (N)	destination IPv4 address.
844					Must be multicast address
845	OF_DPA_DST_IP_MASK	4 (N)	IP Mask
846	OF_DPA_SRC_IPV6		16 (N)	source IPv6 Address. Optional.
847					Can contain IPv6 address,
848					must be completely masked
849					if not used
850	OF_DPA_SRC_IPV6_MASK	16 (N)	IPv6 mask
851	OF_DPA_DST_IPV6		16 (N)	destination IPv6 Address. Must
852					be multicast address.
853	OF_DPA_DST_IPV6_MASK	16 (N)	IPv6 mask
854	OF_DPA_SRC_ARP_IP	4 (N)	source IPv4 address in the ARP
855					payload.  Only used if ethertype
856					== 0x0806.
857	OF_DPA_SRC_ARP_IP_MASK	4 (N)	IP Mask
858	OF_DPA_IP_PROTO		1	IP protocol
859	OF_DPA_IP_PROTO_MASK	1	IP protocol mask
860	OF_DPA_IP_DSCP		1	DSCP
861	OF_DPA_IP_DSCP_MASK	1	DSCP mask
862	OF_DPA_IP_ECN		1	ECN
863	OF_DPA_IP_ECN_MASK	1	ECN mask
864	OF_DPA_L4_SRC_PORT	2 (N)	L4 source port, only for
865					TCP, UDP, or SCTP
866	OF_DPA_L4_SRC_PORT_MASK	2 (N)	L4 source port mask
867	OF_DPA_L4_DST_PORT	2 (N)	L4 destination port, only for
868					TCP, UDP, or SCTP
869	OF_DPA_L4_DST_PORT_MASK	2 (N)	L4 destination port mask
870	OF_DPA_ICMP_TYPE	1	ICMP type, only if IP
871					protocol is 1
872	OF_DPA_ICMP_TYPE_MASK	1	ICMP type mask
873	OF_DPA_ICMP_CODE	1	ICMP code
874	OF_DPA_ICMP_CODE_MASK	1	ICMP code mask
875	OF_DPA_IPV6_LABEL	4 (N)	IPv6 flow label
876	OF_DPA_IPV6_LABEL_MASK	4 (N)	IPv6 flow label mask
877	OF_DPA_GROUP_ID		4	data for GROUP action
878	OF_DPA_QUEUE_ID_ACTION	1	write the queue ID
879	OF_DPA_NEW_QUEUE_ID	1	queue ID
880	OF_DPA_VLAN_PCP_ACTION	1	write the VLAN priority
881	OF_DPA_NEW_VLAN_PCP	1	VLAN priority
882	OF_DPA_IP_DSCP_ACTION	1	write the DSCP
883	OF_DPA_NEW_IP_DSCP	1	new DSCP
884	OF_DPA_TUNNEL_LPORT	4	restrict to valid tunnel
885					logical port, set to 0
886					otherwise.
887	OF_DPA_OUT_PPORT	2	data for OUTPUT action,
888					restricted to CONTROLLER,
889					set to 0 otherwise
890	OF_DPA_CLEAR_ACTIONS	4	if 1, packets matching flow are
891					dropped (all other instructions
892					ignored)
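
Several ACL fields are only meaningful in combination with others: L4 ports
require TCP, UDP, or SCTP; the ICMP type requires IP protocol 1; the ARP
source IP requires ethertype 0x0806. A driver could check these dependencies
before issuing the command. The following Python sketch is one possible check;
``check_acl_match`` and the dict-of-integers input are hypothetical, and the
protocol numbers 6, 17, and 132 are the standard IANA values for TCP, UDP, and
SCTP rather than values taken from this guide.

```python
IPPROTO_ICMP = 1
IPPROTO_TCP = 6
IPPROTO_UDP = 17
IPPROTO_SCTP = 132
ETH_P_ARP = 0x0806

L4_PORT_FIELDS = ("OF_DPA_L4_SRC_PORT", "OF_DPA_L4_DST_PORT")


def check_acl_match(fields):
    """Return a list of problems found in a dict of ACL match fields.

    `fields` maps TLV field names to integer values; absent keys are unused.
    """
    problems = []
    proto = fields.get("OF_DPA_IP_PROTO")
    if any(name in fields for name in L4_PORT_FIELDS):
        if proto not in (IPPROTO_TCP, IPPROTO_UDP, IPPROTO_SCTP):
            problems.append("L4 ports require TCP, UDP, or SCTP")
    if "OF_DPA_ICMP_TYPE" in fields and proto != IPPROTO_ICMP:
        problems.append("ICMP type requires IP protocol 1")
    if ("OF_DPA_SRC_ARP_IP" in fields
            and fields.get("OF_DPA_ETHERTYPE") != ETH_P_ARP):
        problems.append("ARP source IP requires ethertype 0x0806")
    return problems
```

An empty list means none of these three dependencies is violated; it does not
imply that the device will accept the entry.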
893
894TLVs for flow delete and get stats commands are::
895
896	field			width	description
897	---------------------------------------------------
898	OF_DPA_CMD		2	CMD_[DEL|GET_STATS]
899	OF_DPA_COOKIE		8	Cookie
900
901On completion of get stats command, the descriptor buffer is written back with
902the following TLVs::
903
904	field			width	description
905	---------------------------------------------------
906	OF_DPA_STAT_DURATION	4	Flow duration
907	OF_DPA_STAT_RX_PKTS	8	Received packets
908	OF_DPA_STAT_TX_PKTS	8	Transmit packets
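
The write-back values above can be decoded once they have been extracted from
their TLVs. The Python sketch below assumes the TLV headers have already been
parsed into a hypothetical ``values`` dict of raw bytes keyed by field name.
Because these fields are not marked (N), the sketch assumes little-endian
byte order; adjust the format strings if that assumption does not hold.

```python
import struct

# (field name, struct format). "<" selects little-endian, which is an
# assumption; the widths (4, 8, 8 bytes) come from the table.
FLOW_STATS_LAYOUT = (
    ("OF_DPA_STAT_DURATION", "<I"),
    ("OF_DPA_STAT_RX_PKTS", "<Q"),
    ("OF_DPA_STAT_TX_PKTS", "<Q"),
)


def decode_flow_stats(values):
    """Decode raw get-stats values into a dict of integers."""
    stats = {}
    for name, fmt in FLOW_STATS_LAYOUT:
        raw = values[name]
        want = struct.calcsize(fmt)
        if len(raw) != want:
            raise ValueError(f"{name}: expected {want} bytes, got {len(raw)}")
        (stats[name],) = struct.unpack(fmt, raw)
    return stats
```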
909
910Possible status return codes in descriptor on completion are::
911
912	DESC_COMP_ERR	command			reason
913	--------------------------------------------------------------------
914	0		all			OK
915	-ROCKER_EFAULT	all			head or tail index outside
916						of ring
917	-ROCKER_ENXIO	all			address or data read err on
918						desc buf
919	-ROCKER_EMSGSIZE GET_STATS		cmd descriptor buffer wasn't
920						big enough to contain write-back
921						TLVs
922	-ROCKER_EINVAL	all			invalid parameters passed in
923	-ROCKER_EEXIST	ADD			entry already exists
924	-ROCKER_ENOSPC	ADD			no space left in flow table
925	-ROCKER_ENOENT	MOD|DEL|GET_STATS	cookie invalid
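
For driver error handling it can help to know which status codes a given
command can return. The following Python sketch transcribes the table above
into a lookup. The names ``FLOW_STATUS`` and ``possible_errors`` are
hypothetical, and status codes are kept as symbolic strings because their
numeric values are not given in this section.

```python
ALL = frozenset({"ADD", "MOD", "DEL", "GET_STATS"})

# (status, commands it applies to, reason), in table order.
FLOW_STATUS = (
    ("-ROCKER_EFAULT", ALL, "head or tail index outside of ring"),
    ("-ROCKER_ENXIO", ALL, "address or data read err on desc buf"),
    ("-ROCKER_EMSGSIZE", frozenset({"GET_STATS"}),
     "cmd descriptor buffer too small for write-back TLVs"),
    ("-ROCKER_EINVAL", ALL, "invalid parameters passed in"),
    ("-ROCKER_EEXIST", frozenset({"ADD"}), "entry already exists"),
    ("-ROCKER_ENOSPC", frozenset({"ADD"}), "no space left in flow table"),
    ("-ROCKER_ENOENT", frozenset({"MOD", "DEL", "GET_STATS"}),
     "cookie invalid"),
)


def possible_errors(cmd):
    """List the non-zero status codes that `cmd` can complete with."""
    if cmd not in ALL:
        raise ValueError(f"unknown command: {cmd}")
    return [status for status, cmds, _ in FLOW_STATUS if cmd in cmds]
```

For example, ``possible_errors("DEL")`` omits ``-ROCKER_EEXIST`` and
``-ROCKER_ENOSPC``, which only apply to ADD.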
926
927Group Table Interface
928---------------------
929
930There are commands to add, modify, delete, and get stats of group table
931entries.  The commands are issued using the DMA CMD descriptor ring.  The
932following commands are defined::
933
934	CMD_ADD:		add an entry to group table
935	CMD_MOD:		modify an entry in group table
936	CMD_DEL:		delete an entry from group table
937	CMD_GET_STATS:		get stats for group entry
938
939TLVs for add and modify commands are::
940
941	field			width	description
942	-----------------------------------------------------------
943	FLOW_GROUP_CMD		2	CMD_[ADD|MOD]
944	FLOW_GROUP_ID		2	Flow group ID
945	FLOW_GROUP_TYPE		1	Group type:
946					  0: L2 interface
947					  1: L2 rewrite
948					  2: L3 unicast
949					  3: L2 multicast
950					  4: L2 flood
951					  5: L3 interface
952					  6: L3 multicast
953					  7: L3 ECMP
954					  8: L2 overlay
955	FLOW_VLAN_ID		2	Vlan ID (types 0, 3, 4, 6)
956	FLOW_L2_PORT		2	Port (type 0)
957	FLOW_INDEX		4	Index (all types but 0)
958	FLOW_OVERLAY_TYPE	1	Overlay sub-type (type 8):
959					  0: Flood unicast tunnel
960					  1: Flood multicast tunnel
961					  2: Multicast unicast tunnel
962					  3: Multicast multicast tunnel
963	FLOW_GROUP_ACTION		nest
964	  FLOW_GROUP_ID		2	next group ID in chain (all
965					types except 0)
966	  FLOW_OUT_PORT		4	egress port (types 0, 8)
967	  FLOW_POP_VLAN_TAG	1	strip outer VLAN tag (type 1
968					only)
969	  FLOW_VLAN_ID		2	(types 1, 5)
970	  FLOW_SRC_MAC		6	(types 1, 2, 5)
971	  FLOW_DST_MAC		6	(types 1, 2)
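
Which TLVs accompany an add or modify command depends on the group type. The
Python sketch below encodes the "types" annotations from the table so that a
driver could look up the applicable fields. The ``GroupType`` member names and
the ``fields_for`` helper are hypothetical; the numeric values and the
type-to-field mapping are taken from the table.

```python
from enum import IntEnum


class GroupType(IntEnum):
    L2_INTERFACE = 0
    L2_REWRITE = 1
    L3_UNICAST = 2
    L2_MULTICAST = 3
    L2_FLOOD = 4
    L3_INTERFACE = 5
    L3_MULTICAST = 6
    L3_ECMP = 7
    L2_OVERLAY = 8


ALL_TYPES = frozenset(GroupType)

# Top-level TLVs and the group types that carry them.
TOP_LEVEL = {
    "FLOW_VLAN_ID": {0, 3, 4, 6},
    "FLOW_L2_PORT": {0},
    "FLOW_INDEX": ALL_TYPES - {0},
    "FLOW_OVERLAY_TYPE": {8},
}

# TLVs nested inside FLOW_GROUP_ACTION.
ACTION = {
    "FLOW_GROUP_ID": ALL_TYPES - {0},
    "FLOW_OUT_PORT": {0, 8},
    "FLOW_POP_VLAN_TAG": {1},
    "FLOW_VLAN_ID": {1, 5},
    "FLOW_SRC_MAC": {1, 2, 5},
    "FLOW_DST_MAC": {1, 2},
}


def fields_for(group_type):
    """Return (top-level fields, nested action fields) for a group type."""
    t = GroupType(group_type)  # raises ValueError for undefined types
    top = [name for name, types in TOP_LEVEL.items() if t in types]
    nested = [name for name, types in ACTION.items() if t in types]
    return top, nested
```

FLOW_GROUP_CMD, FLOW_GROUP_ID, and FLOW_GROUP_TYPE are not listed in the
mapping because they carry no type restriction in the table.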
972
973TLVs for group delete and get stats commands are::
974
975	field			width	description
976	-----------------------------------------------------------
977	FLOW_GROUP_CMD		2	CMD_[DEL|GET_STATS]
978	FLOW_GROUP_ID		2	Flow group ID
979
980On completion of get stats command, the descriptor buffer is written back with
981the following TLVs::
982
983	field			width	description
984	---------------------------------------------------
985	FLOW_GROUP_ID		2	Flow group ID
986	FLOW_STAT_DURATION	4	Flow duration
987	FLOW_STAT_REF_COUNT	4	Flow reference count
988	FLOW_STAT_BUCKET_COUNT	4	Flow bucket count
989
990Possible status return codes in descriptor on completion are::
991
992	DESC_COMP_ERR	command			reason
993	--------------------------------------------------------------------
994	0		all			OK
995	-ROCKER_EFAULT	all			head or tail index outside
996						of ring
997	-ROCKER_ENXIO	all			address or data read err on
998						desc buf
999	-ROCKER_ENOSPC	GET_STATS		cmd descriptor buffer wasn't
1000						big enough to contain write-back
1001						TLVs
1002	-ROCKER_EINVAL	ADD|MOD			invalid parameters passed in
1003	-ROCKER_EEXIST	ADD			entry already exists
1004	-ROCKER_ENOSPC	ADD			no space left in group table
1005	-ROCKER_ENOENT	MOD|DEL|GET_STATS	group ID invalid
1006	-ROCKER_EBUSY	DEL			group reference count non-zero
1007	-ROCKER_ENODEV	ADD			next group ID doesn't exist
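
The ADD and DEL status codes above imply an ordering rule: a group can only be
added after the group it chains to exists, and can only be deleted once nothing
references it. The Python sketch below is a toy model of those rules, not the
device implementation. The class, its ``capacity`` parameter, and the order in
which the checks run are assumptions, and only references from other groups
are counted.

```python
class GroupTable:
    """Toy model of the group table status rules."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # hypothetical table size
        self.next_of = {}         # group ID -> next group ID, or None
        self.refs = {}            # group ID -> reference count

    def add(self, gid, next_gid=None):
        if gid in self.next_of:
            return "-ROCKER_EEXIST"
        if len(self.next_of) >= self.capacity:
            return "-ROCKER_ENOSPC"
        if next_gid is not None and next_gid not in self.next_of:
            return "-ROCKER_ENODEV"
        self.next_of[gid] = next_gid
        self.refs[gid] = 0
        if next_gid is not None:
            self.refs[next_gid] += 1  # the new group references its successor
        return 0

    def delete(self, gid):
        if gid not in self.next_of:
            return "-ROCKER_ENOENT"
        if self.refs[gid] != 0:
            return "-ROCKER_EBUSY"
        next_gid = self.next_of.pop(gid)
        del self.refs[gid]
        if next_gid is not None:
            self.refs[next_gid] -= 1  # drop the reference held on the successor
        return 0
```

In this model a chain must be built from its tail toward its head and torn
down in the opposite order.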
1008
1009
1010
1011References
1012==========
1013
1014[1] OpenFlow Data Plane Abstraction (OF-DPA) Abstract Switch Specification,
1015Version 1.0, from Broadcom Corporation, February 21, 2014.
1016