.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices.  It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge).  Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions.  The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table.  If there
is a matching flow, it executes the associated actions.  If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
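
The lookup-then-upcall behavior described above can be pictured with a
small toy model.  This is only an illustrative sketch: the ``Datapath``
class, the ``extract_key`` helper, the choice of key fields, and the
``output(2)`` action string are all hypothetical and do not correspond to
kernel data structures.

```python
from dataclasses import dataclass, field


def extract_key(packet: dict) -> tuple:
    """Build a hashable flow key from metadata and header fields (illustrative)."""
    return (packet["in_port"], packet["eth_src"],
            packet["eth_dst"], packet["eth_type"])


@dataclass
class Datapath:
    flow_table: dict = field(default_factory=dict)  # flow key -> list of actions
    upcalls: list = field(default_factory=list)     # misses queued for userspace

    def receive(self, packet: dict):
        key = extract_key(packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No matching flow: queue the packet and its parsed key for userspace.
            self.upcalls.append((key, packet))
            return None
        return actions


dp = Datapath()
pkt = {"in_port": 1, "eth_src": "e0:91:f5:21:d0:b2",
       "eth_dst": "00:02:e3:0f:80:a4", "eth_type": 0x0800}

first = dp.receive(pkt)              # miss: queued to userspace
key, _ = dp.upcalls[0]
dp.flow_table[key] = ["output(2)"]   # userspace installs a flow for this key
second = dp.receive(pkt)             # hit: handled without another upcall
```

In this model only the first packet of a flow reaches userspace; later
packets with the same key are handled from the flow table.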


Flow key compatibility
----------------------

Network protocols evolve over time.  New protocols become important
and existing protocols lose their prominence.  For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key.  It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete.  Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet.  Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

    - If userspace's notion of the flow key for the packet matches the
      kernel's, then nothing special is necessary.

    - If the kernel's flow key includes more fields than the userspace
      version of the flow key, for example if the kernel decoded IPv6
      headers but userspace stopped at the Ethernet type (because it
      does not understand IPv6), then again nothing special is
      necessary.  Userspace can still set up a flow in the usual way,
      as long as it uses the kernel-provided flow key to do it.

    - If the userspace flow key includes more fields than the
      kernel's, for example if userspace decoded an IPv6 header but
      the kernel stopped at the Ethernet type, then userspace can
      forward the packet manually, without setting up a flow in the
      kernel.  This case is bad for performance because every packet
      that the kernel considers part of the flow must go to userspace,
      but the forwarding behavior is correct.  (If userspace can
      determine that the values of the extra fields would not affect
      forwarding behavior, then it could set up a flow anyway.)
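
The three cases above reduce to one question: does userspace see any
field that the kernel did not parse?  The sketch below models flow keys
as plain dictionaries keyed by attribute name, which is an assumption
made for illustration.  The function name and return strings are
hypothetical, and the parenthetical exception in the last case is not
modelled.

```python
def plan_flow_setup(kernel_key: dict, user_key: dict) -> str:
    """Decide how userspace handles a packet that missed the flow table.

    Keys are modelled as {attribute_name: value} dicts (an assumption).
    """
    extra_in_user = set(user_key) - set(kernel_key)
    if not extra_in_user:
        # Same fields, or the kernel parsed more than userspace understands:
        # install a flow, always using the kernel-provided key.
        return "install_with_kernel_key"
    # Userspace decoded fields the kernel did not: forward manually.
    return "forward_in_userspace"


eth_only = {"in_port": 1, "eth": "...", "eth_type": 0x86DD}
with_ipv6 = dict(eth_only, ipv6="...")

same = plan_flow_setup(with_ipv6, with_ipv6)
kernel_knows_more = plan_flow_setup(with_ipv6, eth_only)
user_knows_more = plan_flow_setup(eth_only, with_ipv6)
```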

How flow keys evolve over time is important to making this work, so
the following sections go into detail.


Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes.  Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received.  Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes.  For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting.  For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we ellipsize arguments not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)


Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above, and an
optional corresponding flow mask.

A wildcarded flow can represent a group of exact match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow key.
A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
of an incoming packet. Using wildcarded flows can improve the flow setup rate
by reducing the number of new flows that need to be processed by the user space
program.
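
As a minimal sketch of the mask semantics, consider a single 32-bit
field such as an IPv4 destination address.  The helper names ``ip`` and
``masked_match`` are hypothetical; real flow keys and masks span many
attributes, but the same bitwise rule applies to each.

```python
import ipaddress


def ip(addr: str) -> int:
    """Convert dotted-quad notation to a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))


def masked_match(field: int, key: int, mask: int) -> bool:
    """'1' bits in the mask must match the key; '0' bits are don't care."""
    return (field & mask) == (key & mask)


key, mask = ip("172.16.0.0"), ip("255.255.0.0")

in_group_a = masked_match(ip("172.16.0.20"), key, mask)
in_group_b = masked_match(ip("172.16.5.9"), key, mask)
outside = masked_match(ip("172.18.0.52"), key, mask)
```

Here one wildcarded flow covers every address in 172.16.0.0/16, so
packets to different hosts in that range do not each need a flow setup.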

Support for the mask Netlink attribute is optional for both the kernel and user
space program. The kernel can ignore the mask attribute, installing an exact
match flow, or reduce the number of don't care bits in the kernel to less than
what was specified by the user space program. In this case, variations in bits
that the kernel does not implement will simply result in additional flow setups.
The kernel module will also work with user space programs that neither support
nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the user space program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
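
A user space program can check for overlap itself before installing a
flow.  The sketch below is not the kernel's detection logic; it is one
possible check, assuming each flow is reduced to a single integer key
and mask.  Two flows overlap exactly when their keys agree on every bit
that both masks care about.  The helper names are hypothetical.

```python
import ipaddress


def ip(addr: str) -> int:
    """Convert dotted-quad notation to a 32-bit integer."""
    return int(ipaddress.IPv4Address(addr))


def flows_overlap(key_a: int, mask_a: int, key_b: int, mask_b: int) -> bool:
    """Return True if some packet could match both wildcarded flows."""
    both_care = mask_a & mask_b
    return ((key_a ^ key_b) & both_care) == 0


net16 = (ip("172.16.0.0"), ip("255.255.0.0"))
net24 = (ip("172.16.5.0"), ip("255.255.255.0"))
other16 = (ip("172.18.0.0"), ip("255.255.0.0"))

nested_ranges = flows_overlap(*net16, *net24)     # /24 lies inside the /16
disjoint_ranges = flows_overlap(*net16, *other16)
```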


Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and user space program.

User space programs that support UFID are expected to provide it during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
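
The indexing consequence can be illustrated with a toy table.  The
``FlowTable`` class, its methods, and the byte-string UFID value are all
hypothetical and chosen only for illustration; the sketch takes the
permitted option of not indexing by key once a UFID has been supplied.

```python
class FlowTable:
    """Toy flow table that indexes a flow by UFID when one is supplied."""

    def __init__(self):
        self._by_ufid = {}
        self._by_key = {}

    def add(self, key, actions, ufid=None):
        if ufid is not None:
            # With a UFID, the original key is stored but not indexed.
            self._by_ufid[ufid] = (key, actions)
        else:
            self._by_key[key] = actions

    def get(self, key=None, ufid=None):
        if ufid is not None:
            entry = self._by_ufid.get(ufid)
            return None if entry is None else entry[1]
        return self._by_key.get(key)


table = FlowTable()
flow_key = ("in_port(1)", "eth_type(0x0800)")
table.add(flow_key, ["output(2)"], ufid=b"\x01\x02\x03\x04")

by_ufid = table.get(ufid=b"\x01\x02\x03\x04")
by_key = table.get(key=flow_key)   # not found: this flow is indexed by UFID only
```

A program that sets up a flow with a UFID should therefore keep using
that UFID rather than falling back to the key.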


Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes.  It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences so it is worth working
through a few examples.  Suppose, for example, that the kernel module
did not already implement VLAN parsing.  Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet.  The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions.  With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets.  This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes.  This is, for
example, why 802.1Q support uses nested attributes.  A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute.  Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them.  (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
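
The difference between the naive and nested encodings can be seen from
the point of view of an older application that skips attributes it does
not recognize.  In this sketch a flow key is modelled as a list of
``(name, value)`` pairs, with a list value standing for nested
attributes.  This representation, the ``KNOWN`` set, and the
``understood`` helper are assumptions made for illustration.

```python
# Attributes an older application understands (no "vlan", no "encap").
KNOWN = {"eth", "eth_type", "ip", "tcp"}


def understood(key: list, known: set = KNOWN) -> dict:
    """Return the top-level attributes the application recognizes."""
    return {name: value for name, value in key if name in known}


naive = [("eth", "..."), ("vlan", {"vid": 10, "pcp": 0}),
         ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")]

nested = [("eth", "..."), ("eth_type", 0x8100), ("vlan", {"vid": 10, "pcp": 0}),
          ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", "...")])]

seen_naive = understood(naive)    # looks like an untagged TCP/IP flow
seen_nested = understood(nested)  # only eth and the outer eth_type remain
```

With the naive encoding the old application wrongly concludes that the
flow carries plain IP packets; with the nested encoding it sees only an
Ethernet frame of type 0x8100, which is what it would have seen before
VLAN support existed.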

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc.  This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always have all packets with those values sent to userspace and
handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing.  The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type.  The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
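
The role of the CFI bit can be sketched as follows.  The 16-bit TCI
holds the PCP in bits 15-13, the CFI bit in bit 12, and the VID in bits
11-0.  The function names are hypothetical, and using ``None`` to stand
for a truncated packet is an assumption of this sketch.

```python
CFI_BIT = 0x1000  # bit 12 of the 16-bit VLAN TCI


def encode_vlan_attr(tci):
    """Return the vlan attribute value; tci is None if the TCI is missing."""
    if tci is None:
        return 0              # all-zero bits: missing or malformed TCI
    return tci | CFI_BIT      # a TCI that was really present has CFI set


def tci_present(attr: int) -> bool:
    """Tell a real TCI (even an all-zero one) from a missing TCI."""
    return bool(attr & CFI_BIT)


zero_tci = encode_vlan_attr(0)             # valid tag with vid=0, pcp=0
vlan10 = encode_vlan_attr((0 << 13) | 10)  # vid=10, pcp=0
truncated = encode_vlan_attr(None)         # packet ended after the Ethertype
```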

Other rules
-----------

The other rules for flow keys are much less subtle:

    - Duplicate attributes are not allowed at a given nesting level.

    - Ordering of attributes is not significant.

    - When the kernel sends a given flow key to userspace, it always
      composes it the same way.  This allows userspace to hash and
      compare entire flow keys that it may not be able to fully
      interpret.
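
Two of these rules lend themselves to a short sketch.  As before, a flow
key is modelled as a list of ``(name, value)`` pairs with list values
for nested attributes.  That model, the helper name, and the placeholder
bytes are assumptions made for illustration, not a real Netlink
encoding.

```python
def has_duplicate_attrs(attrs: list) -> bool:
    """Check the no-duplicates rule separately at each nesting level."""
    names = [name for name, _ in attrs]
    if len(names) != len(set(names)):
        return True
    return any(isinstance(value, list) and has_duplicate_attrs(value)
               for _, value in attrs)


# "eth_type" appears twice, but at different nesting levels: allowed.
vlan_key = [("eth", "..."), ("eth_type", 0x8100), ("vlan", 10),
            ("encap", [("eth_type", 0x0800), ("ip", "..."), ("tcp", "...")])]

# "eth_type" appears twice at the same level: not allowed.
bad_key = [("eth", "..."), ("eth_type", 0x8100), ("eth_type", 0x0800)]

# Because a given key is always composed the same way, userspace can use
# the raw bytes as a dictionary key without interpreting them.
raw_key = bytes.fromhex("0a0b0c")   # placeholder bytes only
actions_by_raw_key = {raw_key: ["output(2)"]}
cached = actions_by_raw_key.get(bytes.fromhex("0a0b0c"))
```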