1============
2Architecture
3============
4
5This document describes the **Distributed Switch Architecture (DSA)** subsystem
6design principles, limitations, interactions with other subsystems, and how to
7develop drivers for this subsystem as well as a TODO for developers interested
8in joining the effort.
9
10Design principles
11=================
12
13The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
15using Linux, but has since evolved to support other vendors as well.
16
The original philosophy behind this design was to let unmodified Linux tools
such as bridge, iproute2 and ifconfig work transparently, whether they
configure/query a switch port network device or a regular network device.
21
An Ethernet switch typically comprises multiple front-panel ports and one or
more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-the-rack switches. This host Ethernet controller will
later be referred to as "master" and "cpu" in DSA terminology and code.
29
30The D in DSA stands for Distributed, because the subsystem has been designed
31with the ability to configure and manage cascaded switches on top of each other
32using upstream and downstream Ethernet links between switches. These specific
33ports are referred to as "dsa" ports in DSA terminology and code. A collection
34of multiple switches connected to each other is called a "switch tree".
35
36For each front-panel port, DSA will create specialized network devices which are
37used as controlling and data-flowing endpoints for use by the Linux networking
38stack. These specialized network interfaces are referred to as "slave" network
39interfaces in DSA terminology and code.
40
The ideal case for using DSA is when an Ethernet switch supports a "switch tag",
a hardware feature which makes the switch insert a specific tag into each
Ethernet frame it receives from or sends to specific ports, to help the
management interface figure out:
45
46- what port is this frame coming from
47- what was the reason why this frame got forwarded
48- how to send CPU originated traffic to specific ports
49
50The subsystem does support switches not capable of inserting/stripping tags, but
51the features might be slightly limited in that case (traffic separation relies
52on Port-based VLAN IDs).
53
54Note that DSA does not currently create network interfaces for the "cpu" and
55"dsa" ports because:
56
- the "cpu" port is the Ethernet switch facing side of the management
  controller, and as such, creating a network interface for it would duplicate
  features, since you would get two interfaces for the same conduit: the master
  netdev and the "cpu" netdev

- the "dsa" port(s) are just conduits between two or more switches, and as such
  cannot really be used as proper network interfaces either; only the
  downstream, or the top-most upstream interface makes sense with that model
64
65Switch tagging protocols
66------------------------
67
68DSA currently supports 5 different tagging protocols, and a tag-less mode as
69well. The different protocols are implemented in:
70
- ``net/dsa/tag_trailer.c``: Marvell's 4-byte trailer tag mode (legacy)
- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
- ``net/dsa/tag_brcm.c``: Broadcom's 4-byte tag
- ``net/dsa/tag_qca.c``: Qualcomm's 2-byte tag
76
77The exact format of the tag protocol is vendor specific, but in general, they
78all contain something which:
79
80- identifies which port the Ethernet frame came from/should be sent to
81- provides a reason why this frame was forwarded to the management interface
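
As a rough illustration of what such a tag carries, the sketch below encodes
and decodes a made-up 4-byte tag. The layout, the ``demo_`` names and the
reserved bytes are assumptions chosen for this example only; they do not match
any of the vendor formats listed above.

```c
#include <stdint.h>

/*
 * Hypothetical 4-byte switch tag, not any vendor's real layout:
 *   byte 0    reason code (why the frame reached the CPU)
 *   byte 1    port number (source on receive, destination on transmit)
 *   byte 2-3  reserved, must be zero
 */
#define DEMO_TAG_LEN 4

struct demo_tag_info {
	uint8_t reason;
	uint8_t port;
};

/* Write the tag fields into buf, which must hold DEMO_TAG_LEN bytes. */
void demo_tag_encode(uint8_t *buf, const struct demo_tag_info *info)
{
	buf[0] = info->reason;
	buf[1] = info->port;
	buf[2] = 0;
	buf[3] = 0;
}

/* Parse a tag; returns 0 on success, -1 if the reserved bytes are set. */
int demo_tag_decode(const uint8_t *buf, struct demo_tag_info *info)
{
	if (buf[2] != 0 || buf[3] != 0)
		return -1;
	info->reason = buf[0];
	info->port = buf[1];
	return 0;
}
```

A real tagger would follow the bit layout documented by the switch vendor; the
point here is only that both directions share the same small set of fields.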
82
83Master network devices
84----------------------
85
86Master network devices are regular, unmodified Linux network device drivers for
87the CPU/management Ethernet interface. Such a driver might occasionally need to
88know whether DSA is enabled (e.g.: to enable/disable specific offload features),
89but the DSA subsystem has been proven to work with industry standard drivers:
``e1000e``, ``mv643xx_eth`` etc. without having to introduce modifications to these
91drivers. Such network devices are also often referred to as conduit network
92devices since they act as a pipe between the host processor and the hardware
93Ethernet switch.
94
95Networking stack hooks
96----------------------
97
When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
networking stack; this is also known as a ``ptype`` or ``packet_type``. A typical
Ethernet frame receive sequence looks like this:
104
105Master network device (e.g.: e1000e):
106
1071. Receive interrupt fires:
108
109        - receive function is invoked
110        - basic packet processing is done: getting length, status etc.
111        - packet is prepared to be processed by the Ethernet layer by calling
112          ``eth_type_trans``
113
1142. net/ethernet/eth.c::
115
116          eth_type_trans(skb, dev)
117                  if (dev->dsa_ptr != NULL)
118                          -> skb->protocol = ETH_P_XDSA
119
1203. drivers/net/ethernet/\*::
121
122          netif_receive_skb(skb)
123                  -> iterate over registered packet_type
124                          -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()
125
1264. net/dsa/dsa.c::
127
128          -> dsa_switch_rcv()
129                  -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'
130
1315. net/dsa/tag_*.c:
132
133        - inspect and strip switch tag protocol to determine originating port
134        - locate per-port network device
135        - invoke ``eth_type_trans()`` with the DSA slave network device
        - invoke ``netif_receive_skb()``
137
138Past this point, the DSA slave network devices get delivered regular Ethernet
139frames that can be processed by the networking stack.
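
Step 5 of the sequence above can be sketched in plain user-space C. The
example assumes a hypothetical 2-byte tag placed right after the two MAC
addresses (one byte of port number, one reserved byte); the ``demo_`` names and
this layout are illustrative and are not taken from any ``net/dsa/tag_*.c`` file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN   6
#define DEMO_TAG_LEN    2   /* hypothetical tag: 1 byte port, 1 byte reserved */
#define DEMO_NUM_PORTS  4

/*
 * Strip the hypothetical tag that sits right after the two MAC addresses
 * and report the source port. Returns the new frame length, or -1 if the
 * frame is too short or names a port that does not exist.
 */
int demo_tag_rcv(uint8_t *frame, size_t len, int *port)
{
	const size_t macs = 2 * DEMO_ETH_ALEN;

	if (len < macs + DEMO_TAG_LEN + 2)
		return -1;
	if (frame[macs] >= DEMO_NUM_PORTS)
		return -1;

	*port = frame[macs];

	/* Close the gap so the EtherType directly follows the MAC addresses. */
	memmove(frame + macs, frame + macs + DEMO_TAG_LEN,
		len - macs - DEMO_TAG_LEN);

	return (int)(len - DEMO_TAG_LEN);
}
```

The returned port number is what a tagger would use to locate the per-port
network device before handing the now untagged frame back to the stack.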
140
141Slave network devices
142---------------------
143
Slave network devices created by DSA are stacked on top of their master network
device. Each of these network interfaces is responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:
148
149- insert/remove the switch tag protocol (if it exists) when sending traffic
150  to/from specific switch ports
151- query the switch for ethtool operations: statistics, link state,
152  Wake-on-LAN, register dumps...
153- external/internal PHY management: link, auto-negotiation etc.
154
155These slave network devices have custom net_device_ops and ethtool_ops function
156pointers which allow DSA to introduce a level of layering between the networking
157stack/ethtool, and the switch driver implementation.
158
159Upon frame transmission from these slave network devices, DSA will look up which
160switch tagging protocol is currently registered with these network devices, and
161invoke a specific transmit routine which takes care of adding the relevant
162switch tag in the Ethernet frames.
163
These frames are then queued for transmission using the master network device
``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.
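
The layering on the transmit side can be modeled as follows. This is a
user-space sketch under stated assumptions: ``demo_master``, ``demo_slave`` and
the 2-byte tag inserted after the MAC addresses are hypothetical stand-ins,
and ``demo_master_xmit()`` merely records the frame where a real master device
would hand it to hardware.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_TAG_LEN   2
#define DEMO_MAX_FRAME 64

struct demo_frame {
	uint8_t data[DEMO_MAX_FRAME];
	size_t len;
};

/* Stand-in for the master device: remembers the last frame it was given. */
struct demo_master {
	struct demo_frame last_sent;
	unsigned int tx_count;
};

struct demo_slave {
	struct demo_master *master;
	uint8_t port;
};

int demo_master_xmit(struct demo_master *m, const struct demo_frame *f)
{
	m->last_sent = *f;
	m->tx_count++;
	return 0;
}

/* Insert the hypothetical tag after the MAC addresses, then use the master. */
int demo_slave_xmit(struct demo_slave *s, const struct demo_frame *f)
{
	struct demo_frame tagged = { .len = 0 };
	const size_t macs = 12;

	if (f->len < macs || f->len + DEMO_TAG_LEN > DEMO_MAX_FRAME)
		return -1;

	memcpy(tagged.data, f->data, macs);
	tagged.data[macs] = s->port;       /* destination port */
	tagged.data[macs + 1] = 0;         /* reserved */
	memcpy(tagged.data + macs + DEMO_TAG_LEN, f->data + macs,
	       f->len - macs);
	tagged.len = f->len + DEMO_TAG_LEN;

	return demo_master_xmit(s->master, &tagged);
}
```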
168
169Graphical representation
170------------------------
171
Summarized, this is basically what DSA looks like from a network device
perspective::


                |--------------------------|
                | CPU network device (eth0)|
                |--------------------------|
                | <tag added by switch     |
                |                          |
                |                          |
                |        tag added by CPU> |
        |--------------------------------------------|
        |            Switch driver                   |
        |--------------------------------------------|
                  ||        ||         ||
              |-------|  |-------|  |-------|
              | sw0p0 |  | sw0p1 |  | sw0p2 |
              |-------|  |-------|  |-------|
190
191
192
193Slave MDIO bus
194--------------
195
In order to be able to read from/write to a switch's built-in PHYs, DSA creates
a slave MDIO bus which allows a specific switch driver to divert and intercept
MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
switches, these functions would utilize direct or indirect PHY addressing mode
to return standard MII registers from the switch builtin PHYs, allowing the PHY
library to return link status, link partner pages, auto-negotiation results
etc.
203
204For Ethernet switches which have both external and internal MDIO busses, the
205slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
206internal or external MDIO devices this switch might be connected to: internal
207PHYs, external PHYs, or even external switches.
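
The diversion of MDIO reads can be pictured with a small model. The sketch
assumes a per-switch bitmask of PHY addresses claimed by the switch
(``internal_phy_mask``, a hypothetical name) and an in-memory register array
standing in for the hardware; unclaimed addresses read back as ``0xffff``.

```c
#include <stdint.h>

#define DEMO_PHY_ADDRS 32
#define DEMO_PHY_REGS  32

struct demo_switch {
	/* bit N set: PHY address N is served by the switch driver */
	uint32_t internal_phy_mask;
	/* stand-in for the registers of the built-in PHYs */
	uint16_t regs[DEMO_PHY_ADDRS][DEMO_PHY_REGS];
};

/*
 * Read through the modeled slave bus. Addresses the switch does not claim,
 * and out-of-range arguments, read as 0xffff.
 */
int demo_slave_mii_read(const struct demo_switch *sw, int addr, int reg)
{
	if (addr < 0 || addr >= DEMO_PHY_ADDRS)
		return 0xffff;
	if (reg < 0 || reg >= DEMO_PHY_REGS)
		return 0xffff;
	if (!(sw->internal_phy_mask & (UINT32_C(1) << addr)))
		return 0xffff;

	return sw->regs[addr][reg];
}
```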
208
209Data structures
210---------------
211
212DSA data structures are defined in ``include/net/dsa.h`` as well as
213``net/dsa/dsa_priv.h``:
214
215- ``dsa_chip_data``: platform data configuration for a given switch device,
216  this structure describes a switch device's parent device, its address, as
217  well as various properties of its ports: names/labels, and finally a routing
218  table indication (when cascading switches)
219
- ``dsa_platform_data``: platform device configuration data which can reference
  a collection of ``dsa_chip_data`` structures if multiple switches are cascaded;
  the master network device this switch tree is attached to needs to be
  referenced

- ``dsa_switch_tree``: structure assigned to the master network device under
  ``dsa_ptr``. This structure references a ``dsa_platform_data`` structure as well
  as the tagging protocol supported by the switch tree, and which
  receive/transmit function hooks should be invoked. Information about the
  directly attached switch is also provided: CPU port. Finally, a collection of
  ``dsa_switch`` structures is referenced to address individual switches in the
  tree.
231
232- ``dsa_switch``: structure describing a switch device in the tree, referencing
233  a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
  device, and a reference to the backing ``dsa_switch_ops``
235
236- ``dsa_switch_ops``: structure referencing function pointers, see below for a
237  full description.
238
239Design limitations
240==================
241
242Limits on the number of devices and ports
243-----------------------------------------
244
DSA currently limits the maximum number of switches within a tree to 4
(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
These limits could be extended to support larger configurations should this need
arise.
249
250Lack of CPU/DSA network devices
251-------------------------------
252
253DSA does not currently create slave network devices for the CPU or DSA ports, as
254described before. This might be an issue in the following cases:
255
256- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces
258
259- inability to configure the CPU port link parameters based on the Ethernet
260  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/
261
262- inability to configure specific VLAN IDs / trunking VLANs between switches
263  when using a cascaded setup
264
265Common pitfalls using DSA setups
266--------------------------------
267
Once a master network device is configured to use DSA (``dev->dsa_ptr`` becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets directly
through this interface (e.g.: opening a socket using this interface) will not
go through the switch tagging protocol transmit function, so the Ethernet
switch on the other end, expecting a tag, will typically drop these frames.
275
276Slave network devices check that the master network device is UP before allowing
277you to administratively bring UP these slave network devices. A common
278configuration mistake is forgetting to bring UP the master network device first.
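
The ordering requirement can be expressed as a small check. This is a model
only: ``demo_netdev`` and ``demo_slave_open()`` are hypothetical names, and the
choice of ``-ENETDOWN`` as the error code is an assumption made for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_netdev {
	bool up;
	struct demo_netdev *master;   /* NULL for the master device itself */
};

/* Refuse to bring a slave UP while its master is still down. */
int demo_slave_open(struct demo_netdev *slave)
{
	if (slave->master == NULL || !slave->master->up)
		return -ENETDOWN;

	slave->up = true;
	return 0;
}
```

In practice this means bringing the master interface up before any of the
slave interfaces stacked on top of it.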
279
280Interactions with other subsystems
281==================================
282
283DSA currently leverages the following subsystems:
284
285- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
287- Device Tree for various of_* functions
288
289MDIO/PHY library
290----------------
291
292Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the DSA
294subsystem deals with all possible combinations:
295
296- internal PHY devices, built into the Ethernet switch hardware
297- external PHY devices, connected via an internal or external MDIO bus
298- internal PHY devices, connected via an internal MDIO bus
299- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
300  fixed PHYs
301
302The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
303logic basically looks like this:
304
- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property; if found, this PHY device is created and registered
  using ``of_phy_connect()``
308
309- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
310  the definition of a non-MDIO managed PHY as defined in
311  ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
312  and connected transparently using the special fixed MDIO bus driver
313
314- finally, if the PHY is built into the switch, as is very common with
315  standalone switch packages, the PHY is probed using the slave MII bus created
316  by DSA
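
The three cases above can be summarized as a decision function. The sketch
simply follows the order in which the cases are listed; the precedence used by
the actual code may differ, and every ``demo_`` / ``DEMO_`` name is hypothetical.

```c
#include <stdbool.h>

enum demo_phy_source {
	DEMO_PHY_FROM_HANDLE,     /* Device Tree "phy-handle" property */
	DEMO_PHY_FIXED_LINK,      /* Device Tree fixed link, no MDIO access */
	DEMO_PHY_SLAVE_MII_BUS,   /* built-in PHY probed on the slave MII bus */
	DEMO_PHY_NONE,
};

/* What the (modeled) port description says about its PHY. */
struct demo_port_desc {
	bool has_phy_handle;
	bool has_fixed_link;
	bool has_internal_phy;
};

/* Pick where the PHY for a port comes from, in the order listed above. */
enum demo_phy_source demo_pick_phy(const struct demo_port_desc *p)
{
	if (p->has_phy_handle)
		return DEMO_PHY_FROM_HANDLE;
	if (p->has_fixed_link)
		return DEMO_PHY_FIXED_LINK;
	if (p->has_internal_phy)
		return DEMO_PHY_SLAVE_MII_BUS;
	return DEMO_PHY_NONE;
}
```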
317
318
319SWITCHDEV
320---------
321
322DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
323more specifically with its VLAN filtering portion when configuring VLANs on top
324of per-port slave network devices. Since DSA primarily deals with
325MDIO-connected switches, although not exclusively, SWITCHDEV's
326prepare/abort/commit phases are often simplified into a prepare phase which
327checks whether the operation is supported by the DSA switch driver, and a commit
328phase which applies the changes.
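
A minimal sketch of this simplified two-phase model, assuming a hypothetical
switch that tracks VLAN membership as one port bitmap per VLAN ID. None of the
``demo_`` names are SWITCHDEV or DSA functions; they only illustrate that the
prepare phase validates without touching state and the commit phase applies.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_VID   4094
#define DEMO_NUM_PORTS 8

struct demo_switch {
	bool vlan_supported;
	uint8_t vlan_members[DEMO_MAX_VID + 1];   /* port bitmap per VID */
};

/* Prepare phase: only checks, never modifies the switch state. */
int demo_vlan_prepare(const struct demo_switch *sw, int port, int vid)
{
	if (!sw->vlan_supported)
		return -EOPNOTSUPP;
	if (port < 0 || port >= DEMO_NUM_PORTS)
		return -EINVAL;
	if (vid < 1 || vid > DEMO_MAX_VID)
		return -EINVAL;
	return 0;
}

/* Commit phase: applies a change that prepare has already accepted. */
void demo_vlan_commit(struct demo_switch *sw, int port, int vid)
{
	sw->vlan_members[vid] |= (uint8_t)(1u << port);
}

/* Run both phases back to back, as the caller of the two hooks would. */
int demo_vlan_add(struct demo_switch *sw, int port, int vid)
{
	int err = demo_vlan_prepare(sw, port, vid);

	if (err)
		return err;
	demo_vlan_commit(sw, port, vid);
	return 0;
}
```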
329
330As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
331objects.
332
333Device Tree
334-----------
335
336DSA features a standardized binding which is documented in
337``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
338functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc.
340
341Driver development
342==================
343
344DSA switch drivers need to implement a dsa_switch_ops structure which will
345contain the various members described below.
346
347``register_switch_driver()`` registers this dsa_switch_ops in its internal list
348of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.
349
350Unless requested differently by setting the priv_size member accordingly, DSA
351does not allocate any driver private context space.
352
353Switch configuration
354--------------------
355
356- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
357  should be a valid value from the ``dsa_tag_protocol`` enum
358
359- ``probe``: probe routine which will be invoked by the DSA platform device upon
360  registration to test for the presence/absence of a switch device. For MDIO
361  devices, it is recommended to issue a read towards internal registers using
362  the switch pseudo-PHY and return whether this is a supported device. For other
363  buses, return a non-NULL string
364
- ``setup``: setup function for the switch. This function is responsible for
  setting up the ``dsa_switch_ops`` private structure with all it needs: register
  maps, interrupts, mutexes, locks etc. This function is also expected to
  properly configure the switch to separate all network interfaces from each
  other, that is, they should be isolated by the switch hardware itself,
  typically by creating a Port-based VLAN ID for each port and allowing only the
  CPU port and the specific port to be in the forwarding vector. Ports that are
  unused by the platform should be disabled. Past this function, the switch is
  expected to be fully configured and ready to serve any kind of request. It is
  recommended to issue a software reset of the switch during this setup function
  in order to avoid relying on what a previous software agent such as a
  bootloader/firmware may have previously configured.
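
The port isolation expected from ``setup`` can be modeled with bitmasks. In
this sketch, ``pvlan_members[port]`` holds the members of each port's
port-based VLAN; the structure, the field names and the 6-port switch are
assumptions made for the example, not a real driver's register layout.

```c
#include <stdint.h>

#define DEMO_NUM_PORTS 6

struct demo_switch {
	int cpu_port;
	uint32_t used_ports;                      /* user ports wired up on the board */
	uint32_t pvlan_members[DEMO_NUM_PORTS];   /* members of each port's VLAN */
};

/* Give every used port a private VLAN shared only with the CPU port. */
void demo_setup_isolation(struct demo_switch *sw)
{
	const uint32_t cpu = UINT32_C(1) << sw->cpu_port;
	int port;

	for (port = 0; port < DEMO_NUM_PORTS; port++) {
		const uint32_t bit = UINT32_C(1) << port;

		if (port == sw->cpu_port)
			sw->pvlan_members[port] = cpu | sw->used_ports;
		else if (sw->used_ports & bit)
			sw->pvlan_members[port] = cpu | bit;
		else
			sw->pvlan_members[port] = 0;   /* unused port: disabled */
	}
}
```

With ports 0 to 2 in use and port 5 as the CPU port, port 1 ends up sharing a
VLAN with port 5 only, so it cannot reach port 0 through the switch fabric.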
377
378PHY devices and link management
379-------------------------------
380
381- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
382  if the PHY library PHY driver needs to know about information it cannot obtain
383  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags" that is private between the switch
  driver and the Ethernet PHY driver in ``drivers/net/phy/*``.
386
387- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
388  the switch port MDIO registers. If unavailable, return 0xffff for each read.
389  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc.
391
392- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
393  to the switch port MDIO registers. If unavailable return a negative error
394  code.
395
396- ``adjust_link``: Function invoked by the PHY library when a slave network device
397  is attached to a PHY device. This function is responsible for appropriately
398  configuring the switch port link parameters: speed, duplex, pause based on
399  what the ``phy_device`` is providing.
400
401- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
402  the fixed PHY driver asking the switch driver for link parameters that could
403  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
404  This is particularly useful for specific kinds of hardware such as QSGMII,
405  MoCA or other kinds of non-MDIO managed PHYs where out of band link
406  information is obtained
407
408Ethtool operations
409------------------
410
411- ``get_strings``: ethtool function used to query the driver's strings, will
412  typically return statistics strings, private flags strings etc.
413
414- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
415  return their values. DSA overlays slave network devices general statistics:
416  RX/TX counters from the network device, with switch driver specific statistics
417  per port
418
419- ``get_sset_count``: ethtool function used to query the number of statistics items
420
421- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
422  function may, for certain implementations also query the master network device
423  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
424
425- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to ``get_wol`` with similar restrictions
427
428- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
429  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
430  PHY level if relevant. This function should enable EEE at the switch port MAC
431  controller and data-processing logic
432
433- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
434  this function should return the EEE state of the switch port MAC controller
435  and data-processing logic as well as query the PHY for its currently configured
436  EEE settings
437
438- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
439  length/size in bytes
440
441- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents
442
443- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM
444
445- ``get_regs_len``: ethtool function returning the register length for a given
446  switch
447
448- ``get_regs``: ethtool function returning the Ethernet switch internal register
449  contents. This function might require user-land code in ethtool to
450  pretty-print register values and registers
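
The overlay described for ``get_ethtool_stats`` can be sketched as two groups of
counters written into one array. The number and order of the generic counters,
the callback type and the sample driver counters are all assumptions made for
this illustration.

```c
#include <stdint.h>

#define DEMO_GENERIC_STATS 4

struct demo_netdev_stats {
	uint64_t tx_packets, tx_bytes, rx_packets, rx_bytes;
};

/* Hypothetical driver callback: fills the switch specific counters. */
typedef void (*demo_drv_stats_fn)(int port, uint64_t *data);

/* Example driver callback reporting two made-up hardware counters. */
void demo_drv_stats(int port, uint64_t *data)
{
	data[0] = 1000 + (uint64_t)port;   /* e.g. frames dropped on ingress */
	data[1] = 7;                       /* e.g. CRC errors */
}

/* Generic counters first, driver specific counters right after them. */
void demo_get_ethtool_stats(const struct demo_netdev_stats *nd,
			    demo_drv_stats_fn drv, int port, uint64_t *data)
{
	data[0] = nd->tx_packets;
	data[1] = nd->tx_bytes;
	data[2] = nd->rx_packets;
	data[3] = nd->rx_bytes;

	if (drv)
		drv(port, data + DEMO_GENERIC_STATS);
}
```

In such a layout, the string table and the count returned for the statistics
set would have to cover both groups in the same order.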
451
452Power management
453----------------
454
455- ``suspend``: function invoked by the DSA platform device when the system goes to
456  suspend, should quiesce all Ethernet switch activities, but keep ports
457  participating in Wake-on-LAN active as well as additional wake-up logic if
458  supported
459
460- ``resume``: function invoked by the DSA platform device when the system resumes,
461  should resume all Ethernet switch activities and re-configure the switch to be
462  in a fully active state
463
- ``port_enable``: function invoked by the DSA slave network device ``ndo_open``
  function when a port is administratively brought up. This function should
  fully enable a given switch port. DSA takes care of marking the port with
  ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
  is not, and propagating these changes down to the hardware

- ``port_disable``: function invoked by the DSA slave network device ``ndo_stop``
  function when a port is administratively brought down. This function should
  fully disable a given switch port. DSA takes care of marking the port with
  ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
  disabled while being a bridge member
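
The STP state handling around ``port_enable`` and ``port_disable`` can be modeled
as below. The ``DEMO_STATE_*`` values are hypothetical stand-ins for the
``BR_STATE_*`` constants, and the sketch leaves out the propagation to hardware.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the bridge layer's BR_STATE_* values. */
enum demo_stp_state {
	DEMO_STATE_DISABLED,
	DEMO_STATE_BLOCKING,
	DEMO_STATE_FORWARDING,
};

struct demo_port {
	bool bridged;    /* port is currently a bridge member */
	bool enabled;
	enum demo_stp_state state;
};

/* Port brought up: bridge members start blocking, others forward. */
void demo_port_open(struct demo_port *p)
{
	p->enabled = true;
	p->state = p->bridged ? DEMO_STATE_BLOCKING : DEMO_STATE_FORWARDING;
}

/* Port brought down: only bridge members are marked disabled. */
void demo_port_close(struct demo_port *p)
{
	p->enabled = false;
	if (p->bridged)
		p->state = DEMO_STATE_DISABLED;
}
```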
475
476Bridge layer
477------------
478
- ``port_bridge_join``: bridge layer function invoked when a given switch port is
  added to a bridge. This function should do what is necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
  removed from a bridge. This function should do what is necessary at the
  switch level to prevent the leaving port from exchanging ingress/egress
  traffic with the remaining bridge members. When the port leaves the bridge,
  it should be aged out at the switch hardware for the switch to (re)learn MAC
  addresses behind this port.

- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing a STP state change based on current and requested parameters and
  performing the relevant ageing based on the intersection results
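
Joining and leaving a bridge can be pictured as updates to per-port forwarding
vectors. The sketch below is a model under simplifying assumptions: a single
bridge, eight user ports, a CPU port represented only by ``cpu_mask``, and no
address ageing. All names are hypothetical.

```c
#include <stdint.h>

#define DEMO_NUM_PORTS 8   /* user ports 0..7 */

struct demo_switch {
	uint32_t cpu_mask;                   /* bit of the CPU port */
	uint32_t bridge_members;             /* user ports in the one modeled bridge */
	uint32_t fwd_mask[DEMO_NUM_PORTS];   /* ports each user port may forward to */
};

/* Recompute every forwarding vector from the current bridge membership. */
static void demo_refresh_fwd(struct demo_switch *sw)
{
	int port;

	for (port = 0; port < DEMO_NUM_PORTS; port++) {
		const uint32_t bit = UINT32_C(1) << port;

		/* Every port can always reach the CPU port. */
		sw->fwd_mask[port] = sw->cpu_mask;
		/* Bridge members can also reach each other, but not themselves. */
		if (sw->bridge_members & bit)
			sw->fwd_mask[port] |= sw->bridge_members & ~bit;
	}
}

void demo_port_bridge_join(struct demo_switch *sw, int port)
{
	sw->bridge_members |= UINT32_C(1) << port;
	demo_refresh_fwd(sw);
}

void demo_port_bridge_leave(struct demo_switch *sw, int port)
{
	sw->bridge_members &= ~(UINT32_C(1) << port);
	demo_refresh_fwd(sw);
}
```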
496
497Bridge VLAN filtering
498---------------------
499
- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering. If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to reject
  802.1Q frames which have VLAN IDs outside of the programmed allowed VLAN ID
  map/rules. If there is no PVID programmed into the switch port, untagged
  frames must be rejected as well. When turned off, the switch must accept any
  802.1Q frames irrespective of their VLAN ID, and untagged frames are allowed.

- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
  code to fall back to a software implementation. No hardware setup must be done
  in this function. See ``port_vlan_add`` for the hardware setup and further
  details.
515
516- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
517  (tagged or untagged) for the given switch port
518
519- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
520  given switch port
521
522- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
523  function that the driver has to call for each VLAN the given port is a member
524  of. A switchdev object is used to carry the VID and bridge flags.
525
- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID. If the operation is not supported, this
  function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back to
  a software implementation.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.
535
536- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
537  Forwarding Database entry, the switch hardware should be programmed to delete
538  the specified MAC address from the specified VLAN ID if it was mapped into
539  this port forwarding database
540
541- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
542  function that the driver has to call for each MAC address known to be behind
543  the given port. A switchdev object is used to carry the VID and FDB info.
544
- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back
  to a software implementation. No hardware setup must be done in this function.
  See ``port_mdb_add`` for the hardware setup and further details.
550
551- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
552  a multicast database entry, the switch hardware should be programmed with the
553  specified address in the specified VLAN ID in the forwarding database
554  associated with this VLAN ID.
555
556.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.
558
559- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
560  multicast database entry, the switch hardware should be programmed to delete
561  the specified MAC address from the specified VLAN ID if it was mapped into
562  this port forwarding database.
563
564- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
565  function that the driver has to call for each MAC address known to be behind
566  the given port. A switchdev object is used to carry the VID and MDB info.
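
The acceptance rules given for ``port_vlan_filtering`` above can be written as a
small predicate. This is a user-space model, assuming a hypothetical per-port
table of allowed VLAN IDs; a negative ``vid`` stands for an untagged frame.

```c
#include <stdbool.h>

#define DEMO_VID_COUNT 4096

struct demo_port {
	bool vlan_filtering;            /* filtering turned on for this port */
	bool has_pvid;                  /* a PVID is programmed on this port */
	bool allowed[DEMO_VID_COUNT];   /* VLAN IDs accepted on ingress */
};

/* Decide whether an ingress frame is accepted; vid < 0 means untagged. */
bool demo_ingress_accept(const struct demo_port *p, int vid)
{
	/* Filtering off: any VLAN ID and untagged frames are accepted. */
	if (!p->vlan_filtering)
		return true;

	/* Filtering on, untagged frame: needs a PVID to be classified. */
	if (vid < 0)
		return p->has_pvid;

	/* Filtering on, tagged frame: the VLAN ID must be allowed. */
	if (vid >= DEMO_VID_COUNT)
		return false;
	return p->allowed[vid];
}
```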
567
568TODO
569====
570
Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------
573
SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals
with most of the switch specifics. At some point we should envision a merger
between these two subsystems and get the best of both worlds.
579
Other low-hanging fruit
-----------------------
582
583- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
584- allowing more than one CPU/management interface:
585  http://comments.gmane.org/gmane.linux.network/365657
586- porting more drivers from other vendors:
587  http://comments.gmane.org/gmane.linux.network/365510
588