.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``).

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct
net_device in a context where ``rtnl_lock`` is not held (e.g. driver probe
and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */

    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(). It will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the
private netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration
process happen after ``rtnl_lock`` is released, therefore in those cases
free_netdev() will defer some of the processing until ``rtnl_lock`` is
released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system. ``.ndo_uninit``
runs during de-registration, after the device is closed but while other
subsystems may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transfer Unit (MTU). The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. For example, on Ethernet with the standard MTU of
1500 bytes, the actual skb will contain up to 1514 bytes because of the
Ethernet header. Devices should allow for the 4 byte VLAN header as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize
packets is preferred.

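The frame-size arithmetic in this section can be sketched in plain C. This is
a minimal user-space illustration, not kernel code: the ``SKETCH_`` constants
and both helper functions are hypothetical names introduced here, with values
taken from the header and tag sizes quoted above.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical local constants; the values are the 14 byte Ethernet
   * header and the 4 byte VLAN tag quoted in the text.
   */
  #define SKETCH_ETH_HLEN  14
  #define SKETCH_VLAN_HLEN  4

  /* Largest untagged frame the stack may pass to the driver for this MTU. */
  size_t sketch_max_tx_frame(size_t mtu)
  {
    assert(mtu <= SIZE_MAX - SKETCH_ETH_HLEN);
    return mtu + SKETCH_ETH_HLEN;
  }

  /* Smallest receive buffer that still accepts an MTU-sized packet
   * carrying one VLAN tag.
   */
  size_t sketch_min_rx_buf(size_t mtu)
  {
    assert(mtu <= SIZE_MAX - SKETCH_ETH_HLEN - SKETCH_VLAN_HLEN);
    return mtu + SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN;
  }

For an MTU of 1500 the two helpers return 1514 and 1518, matching the
figures given above.
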

struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() semaphore.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	Linux 5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	These should not be added to new drivers.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_get_stats:
	Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
	Context: atomic (can't sleep under rwlock or RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets NETIF_F_LLTX in dev->features this will be
	called without holding netif_tx_lock. In this case the driver
	has to lock by itself when needed.
	The locking there should also properly protect against
	set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
	Don't use it for new drivers.

	Context: Process with BHs disabled or BH (timer),
		 will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK everything ok.
	* NETDEV_TX_BUSY Cannot transmit packet, try later.
	  Usually a bug, means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled

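The NETDEV_TX_BUSY note for ndo_start_xmit above says that returning it
usually points at broken queue start/stop flow control. The following is a
rough user-space model of that idea, assuming a fixed-size ring and one
descriptor per packet. Every ``sketch_`` name is hypothetical and none of
this is kernel API; the ``stopped`` flag only stands in for the driver
stopping and waking its TX queue.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical model types; not kernel definitions. */
  enum sketch_tx_status { SKETCH_TX_OK, SKETCH_TX_BUSY };

  #define SKETCH_RING_SIZE 4

  struct sketch_txq {
    unsigned int in_flight;  /* descriptors currently owned by hardware */
    bool stopped;            /* stands in for the stopped state of a TX queue */
  };

  /* Models the driver's transmit routine. */
  enum sketch_tx_status sketch_xmit(struct sketch_txq *q)
  {
    /* Only reachable if the queue was not stopped in time. The packet
     * is NOT placed in the ring in this case.
     */
    if (q->in_flight == SKETCH_RING_SIZE)
      return SKETCH_TX_BUSY;

    q->in_flight++;  /* packet placed in the ring */

    /* Stop the queue as soon as the next packet would not fit. */
    if (q->in_flight == SKETCH_RING_SIZE)
      q->stopped = true;

    return SKETCH_TX_OK;
  }

  /* Models the core: a stopped queue is not handed packets.
   * Returns true if the packet reached the ring.
   */
  bool sketch_core_send(struct sketch_txq *q)
  {
    if (q->stopped)
      return false;  /* packet stays queued above the driver */
    return sketch_xmit(q) == SKETCH_TX_OK;
  }

  /* Models TX completion: reclaim one descriptor and wake the queue. */
  void sketch_tx_complete(struct sketch_txq *q)
  {
    assert(q->in_flight > 0);
    q->in_flight--;
    q->stopped = false;
  }

In this model the busy branch is never taken as long as the queue is stopped
when the ring fills and woken only from the completion path, which is the
property the note above asks drivers to maintain.
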
struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization:
		NAPI_STATE_SCHED bit in napi->state.  Device
		driver's ndo_stop method will invoke napi_disable() on
		all NAPI instances which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.

	Context:
		 softirq
		 will be called with interrupts disabled by netconsole.