============
Architecture
============

This document describes the **Distributed Switch Architecture (DSA)** subsystem
design principles, limitations, interactions with other subsystems, and how to
develop drivers for this subsystem as well as a TODO for developers interested
in joining the effort.

Design principles
=================

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch is typically comprised of multiple front-panel ports, and one
or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-the-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack.
These specialized network interfaces are referred to as "slave" network 39interfaces in DSA terminology and code. 40 41The ideal case for using DSA is when an Ethernet switch supports a "switch tag" 42which is a hardware feature making the switch insert a specific tag for each 43Ethernet frames it received to/from specific ports to help the management 44interface figure out: 45 46- what port is this frame coming from 47- what was the reason why this frame got forwarded 48- how to send CPU originated traffic to specific ports 49 50The subsystem does support switches not capable of inserting/stripping tags, but 51the features might be slightly limited in that case (traffic separation relies 52on Port-based VLAN IDs). 53 54Note that DSA does not currently create network interfaces for the "cpu" and 55"dsa" ports because: 56 57- the "cpu" port is the Ethernet switch facing side of the management 58 controller, and as such, would create a duplication of feature, since you 59 would get two interfaces for the same conduit: master netdev, and "cpu" netdev 60 61- the "dsa" port(s) are just conduits between two or more switches, and as such 62 cannot really be used as proper network interfaces either, only the 63 downstream, or the top-most upstream interface makes sense with that model 64 65Switch tagging protocols 66------------------------ 67 68DSA currently supports 5 different tagging protocols, and a tag-less mode as 69well. 
The different protocols are implemented in:

- ``net/dsa/tag_trailer.c``: Marvell's 4-byte trailer tag mode (legacy)
- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
- ``net/dsa/tag_brcm.c``: Broadcom's 4-byte tag
- ``net/dsa/tag_qca.c``: Qualcomm's 2-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
``e1000e``, ``mv643xx_eth`` etc. without having to introduce modifications to
these drivers. Such network devices are also often referred to as conduit
network devices since they act as a pipe between the host processor and the
hardware Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
networking stack, this is also known as a ``ptype`` or ``packet_type``. A typical
Ethernet Frame receive sequence looks like this:

Master network device (e.g.: e1000e):

1. Receive interrupt fires:

        - receive function is invoked
        - basic packet processing is done: getting length, status etc.
        - packet is prepared to be processed by the Ethernet layer by calling
          ``eth_type_trans``

2. net/ethernet/eth.c::

          eth_type_trans(skb, dev)
                  if (dev->dsa_ptr != NULL)
                          -> skb->protocol = ETH_P_XDSA

3. drivers/net/ethernet/\*::

          netif_receive_skb(skb)
                  -> iterate over registered packet_type
                          -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()

4. net/dsa/dsa.c::

          -> dsa_switch_rcv()
                  -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'

5. net/dsa/tag_*.c:

        - inspect and strip switch tag protocol to determine originating port
        - locate per-port network device
        - invoke ``eth_type_trans()`` with the DSA slave network device
        - invoke ``netif_receive_skb()``

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device, each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.
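
The tag insertion and removal performed by slave network devices can be
illustrated with a small user-space model. This is only a sketch: the 4-byte
trailer layout, the ``SKETCH_*`` constants and the ``sketch_tag_*`` function
names are invented for illustration and do not correspond to any vendor tag
format or to the code in ``net/dsa/tag_*.c``.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical trailer tag, invented for this sketch:
 * byte 0 = marker, byte 1 = port number, bytes 2-3 = zero.
 */
#define SKETCH_TAG_LEN    4
#define SKETCH_TAG_MARKER 0x80

/*
 * Transmit side: append the tag so the switch knows the egress port.
 * Updates *len and returns 0, or returns -1 if the buffer is too small.
 */
int sketch_tag_xmit(uint8_t *frame, size_t *len, size_t cap, uint8_t port)
{
    assert(frame != NULL && len != NULL);
    if (*len > cap || cap - *len < SKETCH_TAG_LEN)
        return -1;
    frame[*len + 0] = SKETCH_TAG_MARKER;
    frame[*len + 1] = port;
    frame[*len + 2] = 0;
    frame[*len + 3] = 0;
    *len += SKETCH_TAG_LEN;
    return 0;
}

/*
 * Receive side: validate the tag, report the originating port and strip
 * the tag by shortening the frame. Returns 0, or -1 on a malformed tag.
 */
int sketch_tag_rcv(const uint8_t *frame, size_t *len, uint8_t *port)
{
    const uint8_t *tag;

    assert(frame != NULL && len != NULL && port != NULL);
    if (*len < SKETCH_TAG_LEN)
        return -1;
    tag = frame + *len - SKETCH_TAG_LEN;
    if (tag[0] != SKETCH_TAG_MARKER || tag[2] != 0 || tag[3] != 0)
        return -1;
    *port = tag[1];
    *len -= SKETCH_TAG_LEN;
    return 0;
}
```

In this model the port number recovered on receive plays the role of the
"originating port" used to locate the per-port network device, and the port
number passed on transmit identifies the front-panel port the frame should
leave from.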

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
``ndo_start_xmit()`` function. Since they contain the appropriate switch tag,
the Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.

Graphical representation
------------------------

Summarized, this is basically what DSA looks like from a network device
perspective::


                |--------------------------|
                | CPU network device (eth0)|
                |--------------------------|
                | <tag added by switch     |
                |                          |
                |                          |
                |        tag added by CPU> |
        |--------------------------------------------|
        |            Switch driver                   |
        |--------------------------------------------|
            ||        ||         ||
        |-------|  |-------|  |-------|
        | sw0p0 |  | sw0p1 |  | sw0p2 |
        |-------|  |-------|  |-------|



Slave MDIO bus
--------------

In order to be able to read from and write to a PHY built into a switch, DSA
creates a slave MDIO bus which allows a specific switch driver to divert and
intercept MDIO reads/writes towards specific PHY addresses. In most
MDIO-connected switches, these functions would utilize direct or indirect PHY
addressing mode to return standard MII registers from the switch builtin PHYs,
allowing the PHY library to report link status, link partner pages,
auto-negotiation results etc.
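
The diversion of MDIO reads towards the switch driver can be modelled in a few
lines of user-space C. This is a sketch under stated assumptions: all
``sketch_*`` names, the per-address bitmask and the register array are
hypothetical stand-ins, and the ``0xffff`` value for unanswered reads follows
the convention this document gives for the ``phy_read`` callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_switch;

/* Hypothetical subset of driver callbacks, named after phy_read. */
struct sketch_switch_ops {
    int (*phy_read)(struct sketch_switch *sw, int addr, int regnum);
};

struct sketch_switch {
    const struct sketch_switch_ops *ops;
    uint32_t phys_mask;      /* bit N set: PHY address N is served by the switch */
    uint16_t regs[32][32];   /* stand-in for the built-in PHY registers */
};

/* Read hook of the modelled slave MDIO bus. */
int sketch_slave_mii_read(struct sketch_switch *sw, int addr, int regnum)
{
    assert(sw != NULL && sw->ops != NULL);
    if (addr < 0 || addr >= 32 || regnum < 0 || regnum >= 32)
        return 0xffff;
    if (!(sw->phys_mask & (UINT32_C(1) << addr)) || !sw->ops->phy_read)
        return 0xffff;       /* nothing answers at this address */
    return sw->ops->phy_read(sw, addr, regnum);
}

/* Example driver callback: serve the register from the switch's own PHY. */
int sketch_phy_read(struct sketch_switch *sw, int addr, int regnum)
{
    return sw->regs[addr][regnum];
}
```

A real driver would replace the register array with direct or indirect
accesses to the switch, but the routing decision stays the same: only
addresses claimed by the switch reach the driver callback.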

For Ethernet switches which have both external and internal MDIO busses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

Data structures
---------------

DSA data structures are defined in ``include/net/dsa.h`` as well as
``net/dsa/dsa_priv.h``:

- ``dsa_chip_data``: platform data configuration for a given switch device,
  this structure describes a switch device's parent device, its address, as
  well as various properties of its ports: names/labels, and finally a routing
  table indication (when cascading switches)

- ``dsa_platform_data``: platform device configuration data which can reference
  a collection of ``dsa_chip_data`` structures if multiple switches are
  cascaded, the master network device this switch tree is attached to needs to
  be referenced

- ``dsa_switch_tree``: structure assigned to the master network device under
  ``dsa_ptr``, this structure references a ``dsa_platform_data`` structure as
  well as the tagging protocol supported by the switch tree, and which
  receive/transmit function hooks should be invoked, information about the
  directly attached switch is also provided: CPU port. Finally, a collection of
  ``dsa_switch`` structures is referenced to address individual switches in the
  tree.

- ``dsa_switch``: structure describing a switch device in the tree, referencing
  a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
  device, and a reference to the backing ``dsa_switch_ops``

- ``dsa_switch_ops``: structure referencing function pointers, see below for a
  full description.
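
The relationship between the tree, its switches and the CPU port can be
pictured with a reduced user-space model. This is a sketch only: the
``sketch_*`` types and fields are hypothetical and far smaller than the real
definitions in ``include/net/dsa.h``, and the array sizes simply reuse the
limits of 4 switches and 12 ports that this document states for
``DSA_MAX_SWITCHES`` and ``DSA_MAX_PORTS``.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_SWITCHES 4
#define SKETCH_MAX_PORTS    12

struct sketch_tree;
struct sketch_ops;                     /* driver callbacks, left opaque here */

struct sketch_switch {
    struct sketch_tree *tree;          /* back-pointer to the owning tree */
    int index;                         /* position of this switch in the tree */
    const struct sketch_ops *ops;      /* backing driver operations */
    const char *port_names[SKETCH_MAX_PORTS];
};

struct sketch_tree {
    int cpu_switch;                    /* switch directly attached to the master */
    int cpu_port;                      /* port of that switch facing the master */
    int nr_switches;
    struct sketch_switch *sw[SKETCH_MAX_SWITCHES];
};

/* Attach a switch to the tree and fill in its back-pointer.
 * Returns the switch index, or -1 if the tree is full. */
int sketch_tree_add(struct sketch_tree *tree, struct sketch_switch *sw)
{
    assert(tree != NULL && sw != NULL);
    if (tree->nr_switches >= SKETCH_MAX_SWITCHES)
        return -1;
    sw->tree = tree;
    sw->index = tree->nr_switches;
    tree->sw[tree->nr_switches++] = sw;
    return sw->index;
}

/* Non-zero if (sw, port) is the port facing the master network device. */
int sketch_is_cpu_port(const struct sketch_switch *sw, int port)
{
    assert(sw != NULL && sw->tree != NULL);
    return sw->index == sw->tree->cpu_switch && port == sw->tree->cpu_port;
}
```

The back-pointer is what lets per-switch code answer tree-wide questions, such
as whether a given port is the CPU port, without a global lookup.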

Design limitations
==================

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12
(``DSA_MAX_PORTS``). These limits could be extended to support larger
configurations should this need arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before. This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces

- inability to configure the CPU port link parameters based on the Ethernet
  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (``dev->dsa_ptr`` becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets directly
through this interface (e.g.: opening a socket using this interface) will not
go through the switch tagging protocol transmit function, so the Ethernet
switch on the other end, expecting a tag, will typically drop this frame.

Slave network devices check that the master network device is UP before allowing
you to administratively bring UP these slave network devices. A common
configuration mistake is forgetting to bring UP the master network device first.
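
The ordering constraint between master and slave interfaces can be expressed
as a short user-space model. This is a sketch, not the DSA implementation: the
``sketch_netdev`` type and ``sketch_open()`` function are hypothetical, and the
choice of ``-ENETDOWN`` as the error code is an assumption made for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_netdev {
    const char *name;
    bool up;
    struct sketch_netdev *master;   /* NULL for the master interface itself */
};

/* Model of administratively bringing an interface up: a slave interface
 * refuses to come up while its master is still down. */
int sketch_open(struct sketch_netdev *dev)
{
    assert(dev != NULL);
    if (dev->master != NULL && !dev->master->up)
        return -ENETDOWN;           /* assumed error code, see lead-in */
    dev->up = true;
    return 0;
}
```

With this model, opening a slave such as ``lan1`` before its master ``eth0``
fails and leaves the slave down, while opening ``eth0`` first and ``lan1``
second succeeds, which mirrors the bring-up order recommended above.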

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- Device Tree for various of_* functions

MDIO/PHY library
----------------

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the
DSA subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a.
  fixed PHYs

The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property, if found, this PHY device is created and registered
  using ``of_phy_connect()``

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA


SWITCHDEV
---------

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices.
Since DSA primarily deals with
MDIO-connected switches, although not exclusively, SWITCHDEV's
prepare/abort/commit phases are often simplified into a prepare phase which
checks whether the operation is supported by the DSA switch driver, and a commit
phase which applies the changes.

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.

Device Tree
-----------

DSA features a standardized binding which is documented in
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to
query per-port PHY specific details: interface connection, MDIO bus location
etc.

Driver development
==================

DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
contain the various members described below.

``register_switch_driver()`` registers this ``dsa_switch_ops`` in its internal
list of drivers to probe for. ``unregister_switch_driver()`` does the exact
opposite.

Unless requested differently by setting the ``priv_size`` member accordingly,
DSA does not allocate any driver private context space.

Switch configuration
--------------------

- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the ``dsa_tag_protocol`` enum

- ``probe``: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device.
  For other buses, return a non-NULL string.

- ``setup``: setup function for the switch, this function is responsible for setting
  up the ``dsa_switch_ops`` private structure with all it needs: register maps,
  interrupts, mutexes, locks etc. This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.

PHY devices and link management
-------------------------------

- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags" that is private between the switch
  driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.

- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc.

- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable, return a negative error
  code.

- ``adjust_link``: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the ``phy_device`` is providing.

- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained

Ethtool operations
------------------

- ``get_strings``: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
  return their values. DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics
  per port

- ``get_sset_count``: ethtool function used to query the number of statistics items

- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations, also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to ``get_wol`` with similar restrictions

- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant.
  This function should enable EEE at the switch port MAC
  controller and data-processing logic

- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
  length/size in bytes

- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents

- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM

- ``get_regs_len``: ethtool function returning the register length for a given
  switch

- ``get_regs``: ethtool function returning the Ethernet switch internal register
  contents. This function might require user-land code in ethtool to
  pretty-print register values and registers

Power management
----------------

- ``suspend``: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- ``resume``: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- ``port_enable``: function invoked by the DSA slave network device ndo_open
  function when a port is administratively brought up, this function should be
  fully enabling a given switch port.
  DSA takes care of marking the port with
  ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING``
  if it is not, and propagating these changes down to the hardware

- ``port_disable``: function invoked by the DSA slave network device ndo_close
  function when a port is administratively brought down, this function should be
  fully disabling a given switch port. DSA takes care of marking the port with
  ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
  disabled while being a bridge member

Bridge layer
------------

- ``port_bridge_join``: bridge layer function invoked when a given switch port is
  added to a bridge, this function should do whatever is necessary at the switch
  level to allow the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should do whatever is necessary at the
  switch level to prevent the leaving port from exchanging ingress/egress
  traffic with the remaining bridge members. When the port leaves the bridge,
  it should be aged out at the switch hardware for the switch to (re)learn MAC
  addresses behind this port.

- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing an STP state change based on current and requested parameters and
  performing the relevant ageing based on the intersection results

Bridge VLAN filtering
---------------------

- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering.
  If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to reject
  802.1Q frames which have VLAN IDs outside of the programmed allowed
  VLAN ID map/rules. If there is no PVID programmed into the switch port,
  untagged frames must be rejected as well. When turned off, the switch must
  accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
  allowed.

- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
  code to fall back to a software implementation. No hardware setup must be done
  in this function. See ``port_vlan_add`` for this and details.

- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
  given switch port

- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each VLAN the given port is a member
  of. A switchdev object is used to carry the VID and bridge flags.

- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID. If the operation is not supported, this
  function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back to
  a software implementation.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and FDB info.

- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back
  to a software implementation. No hardware setup must be done in this function.
  See ``port_mdb_add`` for this and details.

- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
  a multicast database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
  multicast database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database.

- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and MDB info.

TODO
====

Making SWITCHDEV and DSA converge towards a unified codebase
------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch specifics. At some point we should envision a merger between
these two subsystems and get the best of both worlds.

Other low-hanging fruit
-----------------------

- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510