.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of
struct net_device in a context where ``rtnl_lock`` is not held
(e.g. driver probe and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */
    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration process
happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.
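
The ownership rules above are easiest to get wrong on the error path, so a
small userspace model of them may help. It is only a sketch: every ``toy_*``
name is hypothetical and none of this is kernel code. It assumes, as described
above, that a failed registration runs the destructor but leaves freeing the
device to the caller, while unregistration runs the destructor and frees the
device when ``needs_free_netdev`` is set.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdlib.h>

  /* Toy stand-ins for illustration only; none of these are kernel APIs. */
  struct toy_netdev {
    bool needs_free_netdev;
    void (*priv_destructor)(struct toy_netdev *dev);
  };

  static int toy_destructor_calls;  /* times a destructor ran */
  static int toy_free_calls;        /* times toy_free_netdev() ran */

  static void toy_free_netdev(struct toy_netdev *dev)
  {
    toy_free_calls++;
    free(dev);
  }

  /* On failure the "core" runs the destructor but does not free the device. */
  static int toy_register_netdevice(struct toy_netdev *dev, bool fail)
  {
    if (!fail)
      return 0;
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    return -1;
  }

  /* After unregistration the "core" runs the destructor and, if asked to,
   * frees the device as well.
   */
  static void toy_unregister_netdevice(struct toy_netdev *dev)
  {
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    if (dev->needs_free_netdev)
      toy_free_netdev(dev);
  }

  static void toy_destructor(struct toy_netdev *dev)
  {
    (void)dev;
    toy_destructor_calls++;
  }

  /* Driver side: mirrors the shape of create_link(). */
  static struct toy_netdev *toy_create_link(bool fail_register)
  {
    struct toy_netdev *dev = calloc(1, sizeof(*dev));

    if (!dev)
      return NULL;
    dev->needs_free_netdev = true;
    dev->priv_destructor = toy_destructor;

    if (toy_register_netdevice(dev, fail_register)) {
      /* Destructor already ran; only the device is left to free. */
      toy_free_netdev(dev);
      return NULL;
    }
    return dev;
  }

In this model ``toy_create_link(true)`` leaves both counters at 1: the
destructor ran inside the failed registration and the caller freed the device.
A later successful ``toy_create_link(false)`` followed by
``toy_unregister_netdevice()`` raises both to 2 without the caller freeing
anything itself.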

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system; ``.ndo_uninit``
runs during de-registration after the device is closed, but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transfer Unit. The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. Because link layer header overhead is excluded, on Ethernet
with the standard MTU of 1500 bytes the actual skb will contain up to
1514 bytes because of the Ethernet header. Devices should allow for the
4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping them is preferred.


struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() semaphore.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	linux-5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	These should not be added to new drivers, so don't use.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_get_stats:
	Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
	Context: atomic (can't sleep under rwlock or RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets NETIF_F_LLTX in dev->features this will be
	called without holding netif_tx_lock. In this case the driver
	has to lock by itself when needed.
	The locking there should also properly protect against
	set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
	Don't use it for new drivers.

	Context: Process with BHs disabled or BH (timer),
	will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK everything ok.
	* NETDEV_TX_BUSY Cannot transmit packet, try later
	  Usually a bug, means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled

struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization:
		NAPI_STATE_SCHED bit in napi->state. Device
		driver's ndo_stop method will invoke napi_disable() on
		all NAPI instances which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.

	Context:
		softirq
		will be called with interrupts disabled by netconsole.
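
To make the role of the ``NAPI_STATE_SCHED`` bit concrete, the following
userspace sketch models it as a C11 ``atomic_flag``. This is a toy under
stated assumptions, not the kernel implementation: the ``toy_*`` helpers are
hypothetical, and the model assumes that scheduling is a test-and-set on the
bit and that disabling sleeps until it can take the bit and then keeps it.

.. code-block:: c

  #define _POSIX_C_SOURCE 200809L  /* for nanosleep() */

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <time.h>

  /* Toy model only: the names echo the text, not the kernel's code. */
  struct toy_napi {
    atomic_flag sched;  /* stands in for the NAPI_STATE_SCHED bit */
  };

  static void toy_napi_init(struct toy_napi *n)
  {
    atomic_flag_clear(&n->sched);
  }

  /* Returns true if the caller won the right to run the poll routine. */
  static bool toy_napi_schedule(struct toy_napi *n)
  {
    return !atomic_flag_test_and_set(&n->sched);
  }

  /* Called when polling is finished; lets the next schedule succeed. */
  static void toy_napi_complete(struct toy_napi *n)
  {
    atomic_flag_clear(&n->sched);
  }

  /* Sleep until the bit can be taken, then keep it so nothing else polls. */
  static void toy_napi_disable(struct toy_napi *n)
  {
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };

    while (atomic_flag_test_and_set(&n->sched))
      nanosleep(&nap, NULL);
  }

Only one caller at a time can win ``toy_napi_schedule()``, which is what
serializes the poll routine in this model. Once ``toy_napi_disable()`` returns
it owns the bit, so every later schedule attempt fails until the bit is
cleared again.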