============
Architecture
============

This document describes the **Distributed Switch Architecture (DSA)** subsystem:
its design principles, limitations, and interactions with other subsystems. It
also explains how to develop drivers for this subsystem and lists TODO items for
developers interested in joining the effort.

Design principles
=================

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch typically comprises multiple front-panel ports, and one
or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in small home and office products: routers,
gateways, or even top-of-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack. These specialized network interfaces are referred to as "slave" network
interfaces in DSA terminology and code.

The ideal case for using DSA is when an Ethernet switch supports a "switch tag",
which is a hardware feature making the switch insert a specific tag into each
Ethernet frame it receives from or sends to specific ports, to help the
management interface figure out:

- what port is this frame coming from
- what was the reason why this frame got forwarded
- how to send CPU originated traffic to specific ports

The subsystem does support switches not capable of inserting/stripping tags, but
the features might be slightly limited in that case (traffic separation relies
on Port-based VLAN IDs).

Note that DSA does not currently create network interfaces for the "cpu" and
"dsa" ports because:

- the "cpu" port is the Ethernet switch facing side of the management
  controller, and as such, would create a duplication of feature, since you
  would get two interfaces for the same conduit: master netdev, and "cpu" netdev

- the "dsa" port(s) are just conduits between two or more switches, and as such
  cannot really be used as proper network interfaces either, only the
  downstream, or the top-most upstream interface makes sense with that model

Switch tagging protocols
------------------------

DSA currently supports 5 different tagging protocols, and a tag-less mode as
well. The different protocols are implemented in:

- ``net/dsa/tag_trailer.c``: Marvell's 4-byte trailer tag mode (legacy)
- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
- ``net/dsa/tag_brcm.c``: Broadcom's 4-byte tag
- ``net/dsa/tag_qca.c``: Qualcomm's 2-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface
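
As a rough illustration of the two pieces of information listed above, the
following sketch encodes and decodes a made-up 4-byte tag. The layout (reason
code in byte 1, port number in the low nibble of byte 3) and every ``example_*``
name are assumptions chosen for this example; they do not describe any vendor's
actual tag format.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    #define EXAMPLE_TAG_LEN 4 /* hypothetical tag size in bytes */

    /* Information carried by the hypothetical tag. */
    struct example_tag {
        uint8_t port;   /* source port on receive, destination port on transmit */
        uint8_t reason; /* why the switch forwarded the frame to the CPU */
    };

    /* Write a tag addressing @port into @buf; returns 0, or -1 on bad input. */
    static int example_tag_encode(uint8_t *buf, size_t len, uint8_t port)
    {
        if (len < EXAMPLE_TAG_LEN || port > 0x0f)
            return -1;
        buf[0] = 0;
        buf[1] = 0; /* the reason code is only meaningful on receive */
        buf[2] = 0;
        buf[3] = port;
        return 0;
    }

    /* Read the tag found at @buf; returns 0, or -1 if the buffer is too short. */
    static int example_tag_decode(const uint8_t *buf, size_t len,
                                  struct example_tag *tag)
    {
        if (len < EXAMPLE_TAG_LEN)
            return -1;
        tag->reason = buf[1];
        tag->port = buf[3] & 0x0f;
        return 0;
    }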

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
``e1000e``, ``mv643xx_eth`` etc. without having to introduce modifications to
these drivers. Such network devices are also often referred to as conduit
network devices since they act as a pipe between the host processor and the
hardware Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
networking stack; this is also known as a ``ptype`` or ``packet_type``. A typical
Ethernet Frame receive sequence looks like this:

Master network device (e.g.: e1000e):

1. Receive interrupt fires:

        - receive function is invoked
        - basic packet processing is done: getting length, status etc.
        - packet is prepared to be processed by the Ethernet layer by calling
          ``eth_type_trans``

2. net/ethernet/eth.c::

          eth_type_trans(skb, dev)
                  if (dev->dsa_ptr != NULL)
                          -> skb->protocol = ETH_P_XDSA

3. drivers/net/ethernet/\*::

          netif_receive_skb(skb)
                  -> iterate over registered packet_type
                          -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()

4. net/dsa/dsa.c::

          -> dsa_switch_rcv()
                  -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'

5. net/dsa/tag_*.c:

        - inspect and strip switch tag protocol to determine originating port
        - locate per-port network device
        - invoke ``eth_type_trans()`` with the DSA slave network device
        - invoke ``netif_receive_skb()``

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.
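
Step 5 of the sequence above can be pictured with a small user-space sketch.
It assumes a hypothetical 4-byte tag placed right after the two MAC addresses,
with the source port in its last byte. Real tag handlers operate on
``struct sk_buff`` and use vendor-specific layouts, so the names and layout
below are illustrative only.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define EXAMPLE_MAC_HDR_LEN 12 /* destination + source MAC addresses */
    #define EXAMPLE_TAG_LEN     4  /* hypothetical tag size */
    #define EXAMPLE_NUM_PORTS   4  /* hypothetical number of front-panel ports */

    /*
     * Strip the tag from @frame in place, shrink *@len accordingly, and return
     * the originating port, or -1 if the frame is too short or the port is
     * unknown.
     */
    static int example_tag_rcv(uint8_t *frame, size_t *len)
    {
        size_t tail;
        int port;

        if (*len < EXAMPLE_MAC_HDR_LEN + EXAMPLE_TAG_LEN + 2)
            return -1;

        port = frame[EXAMPLE_MAC_HDR_LEN + 3];
        if (port >= EXAMPLE_NUM_PORTS)
            return -1;

        /* Close the gap left by the tag so the EtherType follows the MACs. */
        tail = *len - EXAMPLE_MAC_HDR_LEN - EXAMPLE_TAG_LEN;
        memmove(frame + EXAMPLE_MAC_HDR_LEN,
                frame + EXAMPLE_MAC_HDR_LEN + EXAMPLE_TAG_LEN, tail);
        *len -= EXAMPLE_TAG_LEN;
        return port;
    }

The returned port number is what a tag handler would use to locate the per-port
network device before handing the now untagged frame back to the stack.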

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device. Each of these network interfaces is responsible for being a
controlling and data-flowing end-point for one front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.
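
The transmit side can be sketched the same way. The example below inserts a
hypothetical 4-byte tag between the MAC addresses and the EtherType before the
frame would be handed to the master device. The tag layout, the buffer-based
interface and all ``example_*`` names are assumptions made for illustration;
the in-kernel routines work on ``struct sk_buff`` instead.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define EXAMPLE_MAC_HDR_LEN 12 /* destination + source MAC addresses */
    #define EXAMPLE_TAG_LEN     4  /* hypothetical tag size */

    /*
     * Insert a tag addressing @port into @frame, which holds *@len bytes in a
     * buffer of @cap bytes. Returns 0 on success, -1 if the frame is too short
     * or the buffer has no room for the tag.
     */
    static int example_tag_xmit(uint8_t *frame, size_t *len, size_t cap,
                                uint8_t port)
    {
        if (*len < EXAMPLE_MAC_HDR_LEN + 2 || *len + EXAMPLE_TAG_LEN > cap)
            return -1;

        /* Make room for the tag between the MAC addresses and the EtherType. */
        memmove(frame + EXAMPLE_MAC_HDR_LEN + EXAMPLE_TAG_LEN,
                frame + EXAMPLE_MAC_HDR_LEN, *len - EXAMPLE_MAC_HDR_LEN);
        memset(frame + EXAMPLE_MAC_HDR_LEN, 0, EXAMPLE_TAG_LEN);
        frame[EXAMPLE_MAC_HDR_LEN + 3] = port; /* destination port */
        *len += EXAMPLE_TAG_LEN;
        return 0;
    }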

Graphical representation
------------------------

Summarized, this is basically what DSA looks like from a network device
perspective::


                |---------------------------
                | CPU network device (eth0)|
                ----------------------------
                | <tag added by switch     |
                |                          |
                |                          |
                |        tag added by CPU> |
        |--------------------------------------------|
        |            Switch driver                   |
        |--------------------------------------------|
                  ||        ||         ||
              |-------|  |-------|  |-------|
              | sw0p0 |  | sw0p1 |  | sw0p2 |
              |-------|  |-------|  |-------|



Slave MDIO bus
--------------

In order to be able to read from and write to a PHY built into a switch, DSA
creates a slave MDIO bus which allows a specific switch driver to divert and
intercept MDIO reads/writes towards specific PHY addresses. In most
MDIO-connected switches, these functions would utilize direct or indirect PHY
addressing mode to return standard MII registers from the switch builtin PHYs,
allowing the PHY library to return link status, link partner pages,
auto-negotiation results etc.

For Ethernet switches which have both external and internal MDIO busses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

Data structures
---------------

DSA data structures are defined in ``include/net/dsa.h`` as well as
``net/dsa/dsa_priv.h``:

- ``dsa_chip_data``: platform data configuration for a given switch device,
  this structure describes a switch device's parent device, its address, as
  well as various properties of its ports: names/labels, and finally a routing
  table indication (when cascading switches)

- ``dsa_platform_data``: platform device configuration data which can reference
  a collection of ``dsa_chip_data`` structures if multiple switches are
  cascaded; the master network device this switch tree is attached to needs to
  be referenced

- ``dsa_switch_tree``: structure assigned to the master network device under
  ``dsa_ptr``, this structure references a ``dsa_platform_data`` structure as
  well as the tagging protocol supported by the switch tree, and which
  receive/transmit function hooks should be invoked, information about the
  directly attached switch is also provided: CPU port. Finally, a collection of
  ``dsa_switch`` structures is referenced to address individual switches in the
  tree.

- ``dsa_switch``: structure describing a switch device in the tree, referencing
  a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
  device, and a reference to the backing ``dsa_switch_ops``

- ``dsa_switch_ops``: structure referencing function pointers, see below for a
  full description.

Design limitations
==================

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
These limits could be extended to support larger configurations should the need
arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before. This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug an MDIO switch connected using xMII interfaces

- inability to configure the CPU port link parameters based on the Ethernet
  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (``dev->dsa_ptr`` becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- Device Tree for various of_* functions

MDIO/PHY library
----------------

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the DSA
subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a.
  fixed PHYs

The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property, if found, this PHY device is created and registered
  using ``of_phy_connect()``

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA


SWITCHDEV
---------

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices. Since DSA primarily deals with
MDIO-connected switches, although not exclusively, SWITCHDEV's
prepare/abort/commit phases are often simplified into a prepare phase which
checks whether the operation is supported by the DSA switch driver, and a commit
phase which applies the changes.

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.
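
The simplified prepare/commit split described in this section might look like
the sketch below. It is a stand-alone model rather than the switchdev API: the
``example_switch`` structure, the ``example_phase`` enum and
``example_vlan_add()`` are hypothetical names, and the "hardware" is just a
variable.

.. code-block:: c

    #include <errno.h>
    #include <stdint.h>

    enum example_phase { EXAMPLE_PREPARE, EXAMPLE_COMMIT };

    struct example_switch {
        int supports_vlan;       /* capability of the hypothetical driver */
        uint16_t programmed_vid; /* last VID written to the "hardware" */
    };

    /*
     * Prepare only checks whether the operation can be carried out and must
     * not touch the hardware; commit applies the change.
     */
    static int example_vlan_add(struct example_switch *sw,
                                enum example_phase phase, uint16_t vid)
    {
        if (phase == EXAMPLE_PREPARE) {
            if (!sw->supports_vlan)
                return -EOPNOTSUPP;
            if (vid == 0 || vid > 4094)
                return -EINVAL;
            return 0;
        }

        sw->programmed_vid = vid;
        return 0;
    }

A caller would run the prepare phase first and only issue the commit phase when
prepare returned 0.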

Device Tree
-----------

DSA features a standardized binding which is documented in
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc.

Driver development
==================

DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
contain the various members described below.

``register_switch_driver()`` registers this ``dsa_switch_ops`` in its internal list
of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.

Unless requested differently by setting the ``priv_size`` member accordingly, DSA
does not allocate any driver private context space.

Switch configuration
--------------------

- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the ``dsa_tag_protocol`` enum

- ``probe``: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device. For other
  buses, return a non-NULL string

- ``setup``: setup function for the switch, this function is responsible for setting
  up the ``dsa_switch_ops`` private structure with all it needs: register maps,
  interrupts, mutexes, locks etc. This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.
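
The port isolation expected from ``setup`` can be expressed as a per-port
forwarding vector. The helper below is a minimal sketch under the assumption
that the switch has at most 16 ports and that the vector is a plain bitmask;
``example_isolation_mask()`` is a hypothetical name and real hardware will need
register writes specific to the device.

.. code-block:: c

    #include <stdint.h>

    /*
     * Return the forwarding vector for @port: only the port itself and the
     * CPU port for ports in use, an empty vector for unused (disabled) ports.
     * @enabled_ports has one bit set per port used by the platform.
     */
    static uint16_t example_isolation_mask(unsigned int port,
                                           unsigned int cpu_port,
                                           uint16_t enabled_ports)
    {
        if (!(enabled_ports & (1u << port)))
            return 0;

        return (uint16_t)((1u << port) | (1u << cpu_port));
    }

For instance, with the CPU port at index 5, port 2 would be given the vector
``0x24`` and could therefore exchange traffic with the CPU port only.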

PHY devices and link management
-------------------------------

- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags" that is private between the switch
  driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.

- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc.

- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable, return a negative error
  code.

- ``adjust_link``: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the ``phy_device`` is providing.

- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained
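
The return conventions of ``phy_read`` and ``phy_write`` can be illustrated
with a small model in which the builtin PHY registers live in an array. The
structure, the sizes and the function names are assumptions for this sketch,
and the actual callback prototypes in ``include/net/dsa.h`` differ; the choice
of ``-ENODEV`` is likewise only one possible negative error code.

.. code-block:: c

    #include <errno.h>
    #include <stdint.h>

    #define EXAMPLE_NUM_PHYS 4  /* hypothetical number of builtin PHYs */
    #define EXAMPLE_NUM_REGS 32 /* standard MII register space */

    struct example_switch {
        uint16_t phy_regs[EXAMPLE_NUM_PHYS][EXAMPLE_NUM_REGS];
    };

    static int example_phy_valid(int addr, int regnum)
    {
        return addr >= 0 && addr < EXAMPLE_NUM_PHYS &&
               regnum >= 0 && regnum < EXAMPLE_NUM_REGS;
    }

    /* Reads of unavailable registers return 0xffff, as on an idle MDIO bus. */
    static int example_phy_read(const struct example_switch *sw, int addr,
                                int regnum)
    {
        if (!example_phy_valid(addr, regnum))
            return 0xffff;
        return sw->phy_regs[addr][regnum];
    }

    /* Writes to unavailable registers return a negative error code. */
    static int example_phy_write(struct example_switch *sw, int addr,
                                 int regnum, uint16_t val)
    {
        if (!example_phy_valid(addr, regnum))
            return -ENODEV;
        sw->phy_regs[addr][regnum] = val;
        return 0;
    }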

Ethtool operations
------------------

- ``get_strings``: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
  return their values. DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics
  per port

- ``get_sset_count``: ethtool function used to query the number of statistics items

- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations, also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to ``get_wol`` with similar restrictions

- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant. This function should enable EEE at the switch port MAC
  controller and data-processing logic

- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
  length/size in bytes

- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents

- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM

- ``get_regs_len``: ethtool function returning the register length for a given
  switch

- ``get_regs``: ethtool function returning the Ethernet switch internal register
  contents. This function might require user-land code in ethtool to
  pretty-print register values and registers

Power management
----------------

- ``suspend``: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- ``resume``: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- ``port_enable``: function invoked by the DSA slave network device ``ndo_open``
  function when a port is administratively brought up, this function should
  fully enable a given switch port. DSA takes care of marking the port with
  ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
  is not, and propagating these changes down to the hardware

- ``port_disable``: function invoked by the DSA slave network device ``ndo_close``
  function when a port is administratively brought down, this function should
  fully disable a given switch port. DSA takes care of marking the port with
  ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
  disabled while being a bridge member
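
The state that DSA marks a port with around ``port_enable`` and
``port_disable`` follows a simple rule, sketched below. The enum is a local
stand-in for the kernel's ``BR_STATE_*`` constants and the two helpers are
hypothetical; they only restate the rule given above and are not the code DSA
runs.

.. code-block:: c

    /* Local stand-ins for the bridge port states named in the text. */
    enum example_br_state {
        EXAMPLE_BR_STATE_DISABLED,
        EXAMPLE_BR_STATE_FORWARDING,
        EXAMPLE_BR_STATE_BLOCKING,
    };

    /* State applied when a port is administratively brought up. */
    static enum example_br_state example_state_on_enable(int is_bridge_member)
    {
        /*
         * A bridge member starts out blocking and waits for the bridge layer
         * to compute its STP state; a standalone port forwards right away.
         */
        return is_bridge_member ? EXAMPLE_BR_STATE_BLOCKING
                                : EXAMPLE_BR_STATE_FORWARDING;
    }

    /* State applied when a port is administratively brought down. */
    static enum example_br_state example_state_on_disable(void)
    {
        return EXAMPLE_BR_STATE_DISABLED;
    }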

Bridge layer
------------

- ``port_bridge_join``: bridge layer function invoked when a given switch port is
  added to a bridge, this function should do what is necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should do what is necessary at the
  switch level to deny the leaving port from ingress/egress traffic from the
  remaining bridge members. When the port leaves the bridge, it should be aged
  out at the switch hardware for the switch to (re) learn MAC addresses behind
  this port.

- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing a STP state change based on current and asked parameters and perform
  the relevant ageing based on the intersection results
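
One way to picture what ``port_bridge_join`` and ``port_bridge_leave`` have to
achieve is to track bridge membership as a bitmask and derive each port's
forwarding vector from it. The sketch assumes a single bridge, at most 16
ports, and hypothetical ``example_*`` names; ageing of learned addresses on
leave is not modelled.

.. code-block:: c

    #include <stdint.h>

    struct example_switch {
        uint16_t bridge_members; /* one bit per port that joined the bridge */
        unsigned int cpu_port;
    };

    static void example_bridge_join(struct example_switch *sw, unsigned int port)
    {
        sw->bridge_members |= (uint16_t)(1u << port);
    }

    static void example_bridge_leave(struct example_switch *sw, unsigned int port)
    {
        sw->bridge_members &= (uint16_t)~(1u << port);
    }

    /*
     * Ports a frame received on @port may be forwarded to: always the CPU
     * port, plus the other bridge members when @port is itself a member.
     */
    static uint16_t example_forward_mask(const struct example_switch *sw,
                                         unsigned int port)
    {
        uint16_t self = (uint16_t)(1u << port);
        uint16_t mask = (uint16_t)(1u << sw->cpu_port);

        if (sw->bridge_members & self)
            mask |= (uint16_t)(sw->bridge_members & ~self);
        return mask;
    }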

Bridge VLAN filtering
---------------------

- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering. If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to
  reject 802.1Q frames which have VLAN IDs outside of the programmed allowed
  VLAN ID map/rules.  If there is no PVID programmed into the switch port,
  untagged frames must be rejected as well. When turned off, the switch must
  accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
  allowed.

- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
  code to fall back to a software implementation. No hardware setup must be done
  in this function. See ``port_vlan_add`` for this and details.

- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
  given switch port

- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each VLAN the given port is a member
  of. A switchdev object is used to carry the VID and bridge flags.

- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID. If the operation is not supported, this
  function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back to
  a software implementation.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and FDB info.

- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back
  to a software implementation. No hardware setup must be done in this function.
  See ``port_mdb_add`` for this and details.

- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
  a multicast database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
  multicast database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database.

- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and MDB info.
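
The accept/reject rules given for ``port_vlan_filtering`` in this section can
be summarized as a small decision function. This is a software model of what
the hardware is expected to do, not driver code: the ``example_port`` structure,
its bitmap of allowed VLAN IDs and the convention that a PVID of 0 means "no
PVID programmed" are all assumptions of the sketch.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    struct example_port {
        bool vlan_filtering;  /* state set through port_vlan_filtering */
        uint16_t pvid;        /* 0 means no PVID programmed (assumption) */
        uint8_t allowed[512]; /* bitmap covering VLAN IDs 0..4095 */
    };

    static bool example_vid_allowed(const struct example_port *p, uint16_t vid)
    {
        return vid < 4096 && (p->allowed[vid / 8] & (1u << (vid % 8))) != 0;
    }

    /* Decide whether a frame received on @p should be accepted. */
    static bool example_port_accepts(const struct example_port *p, bool tagged,
                                     uint16_t vid)
    {
        if (!p->vlan_filtering)
            return true;          /* filtering off: accept everything */
        if (!tagged)
            return p->pvid != 0;  /* untagged frames need a PVID */
        return example_vid_allowed(p, vid);
    }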

TODO
====

Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch-specific details. At some point we should envision a merger
between these two subsystems and get the best of both worlds.

Other low-hanging fruit
-----------------------

- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510