============
Architecture
============

This document describes the **Distributed Switch Architecture (DSA)** subsystem
design principles, limitations, interactions with other subsystems, and how to
develop drivers for this subsystem as well as a TODO for developers interested
in joining the effort.

Design principles
=================

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch is typically comprised of multiple front-panel ports, and one
or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-the-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack.
These specialized network interfaces are referred to as "slave" network 39interfaces in DSA terminology and code. 40 41The ideal case for using DSA is when an Ethernet switch supports a "switch tag" 42which is a hardware feature making the switch insert a specific tag for each 43Ethernet frames it received to/from specific ports to help the management 44interface figure out: 45 46- what port is this frame coming from 47- what was the reason why this frame got forwarded 48- how to send CPU originated traffic to specific ports 49 50The subsystem does support switches not capable of inserting/stripping tags, but 51the features might be slightly limited in that case (traffic separation relies 52on Port-based VLAN IDs). 53 54Note that DSA does not currently create network interfaces for the "cpu" and 55"dsa" ports because: 56 57- the "cpu" port is the Ethernet switch facing side of the management 58 controller, and as such, would create a duplication of feature, since you 59 would get two interfaces for the same conduit: master netdev, and "cpu" netdev 60 61- the "dsa" port(s) are just conduits between two or more switches, and as such 62 cannot really be used as proper network interfaces either, only the 63 downstream, or the top-most upstream interface makes sense with that model 64 65Switch tagging protocols 66------------------------ 67 68DSA currently supports 5 different tagging protocols, and a tag-less mode as 69well. 
The different protocols are implemented in:

- ``net/dsa/tag_trailer.c``: Marvell's 4-byte trailer tag mode (legacy)
- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
- ``net/dsa/tag_brcm.c``: Broadcom's 4-byte tag
- ``net/dsa/tag_qca.c``: Qualcomm's 2-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
``e1000e``, ``mv643xx_eth`` etc. without having to introduce modifications to these
drivers. Such network devices are also often referred to as conduit network
devices since they act as a pipe between the host processor and the hardware
Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
networking stack, this is also known as a ``ptype`` or ``packet_type``. A typical
Ethernet Frame receive sequence looks like this:

Master network device (e.g.: e1000e):

1. Receive interrupt fires:

   - receive function is invoked
   - basic packet processing is done: getting length, status etc.
   - packet is prepared to be processed by the Ethernet layer by calling
     ``eth_type_trans``

2. net/ethernet/eth.c::

          eth_type_trans(skb, dev)
                  if (dev->dsa_ptr != NULL)
                          -> skb->protocol = ETH_P_XDSA

3. drivers/net/ethernet/\*::

          netif_receive_skb(skb)
                  -> iterate over registered packet_type
                          -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()

4. net/dsa/dsa.c::

          -> dsa_switch_rcv()
                  -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'

5. net/dsa/tag_*.c:

   - inspect and strip switch tag protocol to determine originating port
   - locate per-port network device
   - invoke ``eth_type_trans()`` with the DSA slave network device
   - invoke ``netif_receive_skb()``

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device, each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.
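
The receive-side tag handling described earlier (inspect the tag, strip it,
locate the per-port device) can be sketched in plain user-space C. This is only
an illustration under stated assumptions: the 2-byte tag layout (source port,
reason code) placed right after the two MAC addresses, and every ``toy_*`` name,
are hypothetical and do not correspond to any vendor format or to the code in
``net/dsa/tag_*.c``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN       6
#define TOY_TAG_LEN    2   /* hypothetical tag: [source port][reason code] */
#define TOY_NUM_PORTS  4

/* Stand-in for a per-port slave network device. */
struct toy_slave {
    const char *name;
    unsigned long rx_packets;
};

static struct toy_slave toy_ports[TOY_NUM_PORTS] = {
    { "sw0p0", 0 }, { "sw0p1", 0 }, { "sw0p2", 0 }, { "sw0p3", 0 },
};

/*
 * Strip the toy tag from a frame received on the master device.
 * On success, returns the slave device for the source port and updates
 * *frame and *len so that they describe a regular, untagged Ethernet frame.
 * Returns NULL if the frame is too short or the port is unknown.
 */
static struct toy_slave *toy_tag_rcv(uint8_t **frame, size_t *len)
{
    uint8_t *buf;
    uint8_t port;

    assert(frame && *frame && len);
    buf = *frame;

    /* Need both MAC addresses, the tag, and an EtherType. */
    if (*len < 2 * ETH_ALEN + TOY_TAG_LEN + 2)
        return NULL;

    port = buf[2 * ETH_ALEN];
    if (port >= TOY_NUM_PORTS)
        return NULL;

    /* Slide both MAC addresses over the tag, then skip the leading bytes. */
    memmove(buf + TOY_TAG_LEN, buf, 2 * ETH_ALEN);
    *frame = buf + TOY_TAG_LEN;
    *len -= TOY_TAG_LEN;

    toy_ports[port].rx_packets++;
    return &toy_ports[port];
}
```

Calling ``toy_tag_rcv()`` on a frame whose tag names port 2 yields the ``sw0p2``
entry and a frame that is two bytes shorter, which is what the networking stack
would then process as a regular Ethernet frame.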

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.

Graphical representation
------------------------

Summarized, this is basically what DSA looks like from a network device
perspective::

                |---------------------------
                | CPU network device (eth0)|
                ----------------------------
                | <tag added by switch     |
                |                          |
                |                          |
                |        tag added by CPU> |
        |--------------------------------------------|
        | Switch driver                              |
        |--------------------------------------------|
                    ||        ||         ||
                |-------|  |-------|  |-------|
                | sw0p0 |  | sw0p1 |  | sw0p2 |
                |-------|  |-------|  |-------|

Slave MDIO bus
--------------

In order to be able to read from/write to a PHY built into the switch, DSA
creates a slave MDIO bus which allows a specific switch driver to divert and
intercept MDIO reads/writes towards specific PHY addresses. In most
MDIO-connected switches, these functions would utilize direct or indirect PHY
addressing mode to return standard MII registers from the switch builtin PHYs,
allowing the PHY library to return link status, link partner pages,
auto-negotiation results etc.
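
As a rough illustration of this diversion, the sketch below models a slave MDIO
bus read in user-space C. All ``toy_*`` names and the structure layout are
assumptions made for the example, not the kernel's definitions; only the
register number ``MII_BMSR`` and the ``BMSR_LSTATUS`` bit follow the standard MII
register map. Reads aimed at an address that the switch does not claim return
``0xffff``, as an empty MDIO address would.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MII_BMSR      0x01    /* basic mode status register (standard MII) */
#define BMSR_LSTATUS  0x0004  /* link status bit in BMSR */

struct toy_switch;

/* Hypothetical subset of the switch driver callbacks. */
struct toy_switch_ops {
    int (*phy_read)(struct toy_switch *ds, int port, int regnum);
};

struct toy_switch {
    const struct toy_switch_ops *ops;
    uint32_t phys_mii_mask;   /* bit N set: PHY address N is built in */
    uint16_t bmsr[32];        /* fake per-port status registers */
};

/* Slave MDIO bus read: divert reads for built-in PHYs to the driver. */
static int toy_slave_mii_read(struct toy_switch *ds, int addr, int regnum)
{
    assert(ds && ds->ops);

    if (addr >= 0 && addr < 32 &&
        (ds->phys_mii_mask & (UINT32_C(1) << addr)) &&
        ds->ops->phy_read)
        return ds->ops->phy_read(ds, addr, regnum);

    return 0xffff;   /* nothing answers at this address */
}

/* Example driver callback using "direct" addressing: port N is PHY N. */
static int toy_phy_read(struct toy_switch *ds, int port, int regnum)
{
    if (regnum == MII_BMSR)
        return ds->bmsr[port];

    return 0xffff;   /* register not modelled in this sketch */
}
```

A driver using indirect addressing would perform its own register sequence
inside the callback instead of indexing a table; the dispatch in
``toy_slave_mii_read()`` would stay the same.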

For Ethernet switches which have both external and internal MDIO buses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

Data structures
---------------

DSA data structures are defined in ``include/net/dsa.h`` as well as
``net/dsa/dsa_priv.h``:

- ``dsa_chip_data``: platform data configuration for a given switch device,
  this structure describes a switch device's parent device, its address, as
  well as various properties of its ports: names/labels, and finally a routing
  table indication (when cascading switches)

- ``dsa_platform_data``: platform device configuration data which can reference
  a collection of dsa_chip_data structures if multiple switches are cascaded,
  the master network device this switch tree is attached to needs to be
  referenced

- ``dsa_switch_tree``: structure assigned to the master network device under
  ``dsa_ptr``, this structure references a dsa_platform_data structure as well as
  the tagging protocol supported by the switch tree, and which receive/transmit
  function hooks should be invoked, information about the directly attached
  switch is also provided: CPU port. Finally, a collection of dsa_switch are
  referenced to address individual switches in the tree.

- ``dsa_switch``: structure describing a switch device in the tree, referencing
  a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
  device, and a reference to the backing ``dsa_switch_ops``

- ``dsa_switch_ops``: structure referencing function pointers, see below for a
  full description.
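
To make the relationships between these structures easier to picture, here is a
heavily simplified user-space model. It is a sketch only: the ``toy_*`` types,
their fields and the array sizes are assumptions chosen for illustration, and
the real definitions in ``include/net/dsa.h`` contain many more members.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SWITCHES  4    /* illustrative sizes only */
#define TOY_MAX_PORTS     12

struct toy_switch_tree;

/* Hypothetical stand-in for the driver's table of function pointers. */
struct toy_switch_ops {
    const char *(*probe)(void);
};

/* One switch in the tree, with a back-pointer to the tree itself. */
struct toy_switch {
    struct toy_switch_tree *dst;
    int index;
    const struct toy_switch_ops *ops;
    const char *port_names[TOY_MAX_PORTS];   /* NULL: no slave device */
};

/* The tree, as it would hang off the master network device. */
struct toy_switch_tree {
    const char *master_name;
    int cpu_switch;
    int cpu_port;
    struct toy_switch *ds[TOY_MAX_SWITCHES];
};

/* Attach a switch to the first free slot; returns its index or -1. */
static int toy_tree_add_switch(struct toy_switch_tree *dst,
                               struct toy_switch *ds)
{
    assert(dst && ds);

    for (int i = 0; i < TOY_MAX_SWITCHES; i++) {
        if (!dst->ds[i]) {
            dst->ds[i] = ds;
            ds->dst = dst;      /* back-pointer to the owning tree */
            ds->index = i;
            return i;
        }
    }
    return -1;
}

/* Look up the slave device name for (switch, port), or NULL. */
static const char *toy_tree_port_name(const struct toy_switch_tree *dst,
                                      int sw, int port)
{
    if (sw < 0 || sw >= TOY_MAX_SWITCHES || !dst->ds[sw])
        return NULL;
    if (port < 0 || port >= TOY_MAX_PORTS)
        return NULL;

    return dst->ds[sw]->port_names[port];
}
```

The point of the model is the direction of the references: the tree owns an
array of switches, each switch points back at its tree, and each switch carries
a pointer to the operations supplied by its driver.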

Design limitations
==================

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
These limits could be extended to support larger configurations should this need
arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before. This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces

- inability to configure the CPU port link parameters based on the Ethernet
  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (dev->dsa_ptr becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.
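
The pitfall can be illustrated with a small user-space model. Everything here
is hypothetical: the 2-byte tag (a marker byte followed by the destination
port), the ``toy_*`` functions and the drop rule are assumptions made for the
sketch, and real switches differ in how they treat a frame without a valid tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN       6
#define TOY_TAG_LEN    2
#define TOY_TAG_MAGIC  0xD5   /* hypothetical marker byte */
#define TOY_NUM_PORTS  4

/*
 * Transmit path of a slave device: copy the frame into out[] with the toy
 * tag inserted after the MAC addresses. Returns the tagged length, or 0 if
 * the port is invalid or the buffers are too small.
 */
static size_t toy_slave_xmit(int port, const uint8_t *frame, size_t len,
                             uint8_t *out, size_t out_size)
{
    assert(frame && out);

    if (port < 0 || port >= TOY_NUM_PORTS)
        return 0;
    if (len < 2 * ETH_ALEN || out_size < len + TOY_TAG_LEN)
        return 0;

    memcpy(out, frame, 2 * ETH_ALEN);
    out[2 * ETH_ALEN] = TOY_TAG_MAGIC;
    out[2 * ETH_ALEN + 1] = (uint8_t)port;
    memcpy(out + 2 * ETH_ALEN + TOY_TAG_LEN, frame + 2 * ETH_ALEN,
           len - 2 * ETH_ALEN);

    return len + TOY_TAG_LEN;
}

/*
 * Toy model of the switch CPU port: returns the egress port named by the
 * tag, or -1 when the frame carries no valid tag and is dropped.
 */
static int toy_switch_cpu_ingress(const uint8_t *frame, size_t len)
{
    if (len < 2 * ETH_ALEN + TOY_TAG_LEN)
        return -1;
    if (frame[2 * ETH_ALEN] != TOY_TAG_MAGIC)
        return -1;
    if (frame[2 * ETH_ALEN + 1] >= TOY_NUM_PORTS)
        return -1;

    return frame[2 * ETH_ALEN + 1];
}
```

A frame passed through ``toy_slave_xmit()`` is accepted by the toy switch and
steered to the requested port, while the same frame handed to
``toy_switch_cpu_ingress()`` untouched, as if it had been sent directly on the
master interface, is rejected.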

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- Device Tree for various of_* functions

MDIO/PHY library
----------------

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the DSA
subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a.
  fixed PHYs

The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property, if found, this PHY device is created and registered
  using ``of_phy_connect()``

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA


SWITCHDEV
---------

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices.
Since DSA primarily deals with MDIO-connected switches, although not
exclusively, SWITCHDEV's prepare/abort/commit phases are often simplified into
a prepare phase which checks whether the operation is supported by the DSA
switch driver, and a commit phase which applies the changes.

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.

Device Tree
-----------

DSA features a standardized binding which is documented in
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc.

Driver development
==================

DSA switch drivers need to implement a dsa_switch_ops structure which will
contain the various members described below.

``register_switch_driver()`` registers this dsa_switch_ops in its internal list
of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.

Unless requested differently by setting the priv_size member accordingly, DSA
does not allocate any driver private context space.

Switch configuration
--------------------

- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the ``dsa_tag_protocol`` enum

- ``probe``: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device.
  For other buses, return a non-NULL string

- ``setup``: setup function for the switch, this function is responsible for setting
  up the ``dsa_switch_ops`` private structure with all it needs: register maps,
  interrupts, mutexes, locks etc. This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.

PHY devices and link management
-------------------------------

- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags" that is private between the switch
  driver and the Ethernet PHY driver in ``drivers/net/phy/*``.

- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc.

- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable, return a negative error
  code.

- ``adjust_link``: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the ``phy_device`` is providing.

- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained

Ethtool operations
------------------

- ``get_strings``: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
  return their values. DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics
  per port

- ``get_sset_count``: ethtool function used to query the number of statistics items

- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations, also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to get_wol with similar restrictions

- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant.
  This function should enable EEE at the switch port MAC
  controller and data-processing logic

- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
  length/size in bytes

- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents

- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM

- ``get_regs_len``: ethtool function returning the register length for a given
  switch

- ``get_regs``: ethtool function returning the Ethernet switch internal register
  contents. This function might require user-land code in ethtool to
  pretty-print register values and registers

Power management
----------------

- ``suspend``: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- ``resume``: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- ``port_enable``: function invoked by the DSA slave network device ndo_open
  function when a port is administratively brought up, this function should be
  fully enabling a given switch port.
  DSA takes care of marking the port with
  ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
  is not, and propagating these changes down to the hardware

- ``port_disable``: function invoked by the DSA slave network device ndo_close
  function when a port is administratively brought down, this function should be
  fully disabling a given switch port. DSA takes care of marking the port with
  ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
  disabled while being a bridge member

Bridge layer
------------

- ``port_bridge_join``: bridge layer function invoked when a given switch port is
  added to a bridge, this function should be doing the necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should be doing the necessary at the
  switch level to prevent the leaving port from exchanging ingress/egress
  traffic with the remaining bridge members. When the port leaves the bridge,
  it should be aged out at the switch hardware for the switch to (re) learn MAC
  addresses behind this port.

- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing a STP state change based on current and requested parameters and
  performing the relevant ageing based on the intersection results

Bridge VLAN filtering
---------------------

- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering.
  If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to
  reject 802.1Q frames which have VLAN IDs outside of the programmed allowed
  VLAN ID map/rules. If there is no PVID programmed into the switch port,
  untagged frames must be rejected as well. When turned off, the switch must
  accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
  allowed.

- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
  code to fall back to a software implementation. No hardware setup must be done
  in this function. See port_vlan_add for this and details.

- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
  given switch port

- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each VLAN the given port is a member
  of. A switchdev object is used to carry the VID and bridge flags.

- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID. If the operation is not supported, this
  function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back to
  a software implementation.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and FDB info.

- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back
  to a software implementation. No hardware setup must be done in this function.
  See ``port_mdb_add`` for this and details.

- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
  a multicast database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
  multicast database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database.

- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and MDB info.

TODO
====

Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch specifics. At some point we should envision a merger between
these two subsystems and get the best of both worlds.

Other hanging fruits
--------------------

- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510