.. SPDX-License-Identifier: BSD-3-Clause

=======================
Introduction to Netlink
=======================

Netlink is often described as an ioctl() replacement.
It aims to replace fixed-format C structures as supplied
to ioctl() with a format which allows an easy way to add
or extend the arguments.

To achieve this Netlink uses a minimal fixed-format metadata header
followed by multiple attributes in the TLV (type, length, value) format.

Unfortunately the protocol has evolved over the years, in an organic
and undocumented fashion, making it hard to coherently explain.
To make the most practical sense this document starts by describing
Netlink as it is used today and dives into more "historical" uses
in later sections.

Opening a socket
================

Netlink communication happens over sockets, so a socket needs to be
opened first:

.. code-block:: c

  fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

The use of sockets allows for a natural way of exchanging information
in both directions (to and from the kernel). The operations are still
performed synchronously when applications send() the request but
a separate recv() system call is needed to read the reply.

A very simplified flow of a Netlink "call" will therefore look
something like:

.. code-block:: c

  fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

  /* format the request */
  send(fd, &request, sizeof(request), 0);
  n = recv(fd, &response, RSP_BUFFER_SIZE, 0);
  /* interpret the response */

Netlink also provides natural support for "dumping", i.e. communicating
to user space all objects of a certain type (e.g. dumping all network
interfaces).

.. code-block:: c

  fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);

  /* format the dump request */
  send(fd, &request, sizeof(request), 0);
  while (1) {
    n = recv(fd, buffer, RSP_BUFFER_SIZE, 0);
    /* one recv() call can read multiple messages, hence the loop below */
    for (nlh = (struct nlmsghdr *)buffer; NLMSG_OK(nlh, n);
         nlh = NLMSG_NEXT(nlh, n)) {
      if (nlh->nlmsg_type == NLMSG_DONE)
        goto dump_finished;
      /* process the object */
    }
  }
  dump_finished:

The first two arguments of the socket() call require little explanation -
it is opening a Netlink socket, with all headers provided by the user
(hence NETLINK, RAW). The last argument is the protocol within Netlink.
This field is used to identify the subsystem with which the socket will
communicate.

Classic vs Generic Netlink
--------------------------

The initial implementation of Netlink depended on a static allocation
of IDs to subsystems and provided little supporting infrastructure.
Let us refer to those protocols collectively as **Classic Netlink**.
The list of them is defined at the top of the ``include/uapi/linux/netlink.h``
file; they include, among others, general networking (NETLINK_ROUTE),
iSCSI (NETLINK_ISCSI), and audit (NETLINK_AUDIT).

**Generic Netlink** (introduced in 2005) allows for dynamic registration of
subsystems (and subsystem ID allocation) and introspection, and simplifies
implementing the kernel side of the interface.

The following section describes how to use Generic Netlink, as the
number of subsystems using Generic Netlink outnumbers the older
protocols by an order of magnitude. There are also no plans for adding
more Classic Netlink protocols to the kernel.
Basic information on how communicating with core networking parts of
the Linux kernel (or another of the 20 subsystems using Classic
Netlink) differs from Generic Netlink is provided later in this document.

Generic Netlink
===============

In addition to the Netlink fixed metadata header each Netlink protocol
defines its own fixed metadata header. (Similarly to how network
headers stack - Ethernet > IP > TCP we have Netlink > Generic N. > Family.)

A Netlink message always starts with struct nlmsghdr, which is followed
by a protocol-specific header. In case of Generic Netlink the protocol
header is struct genlmsghdr.

The practical meaning of the fields in case of Generic Netlink is as follows:

.. code-block:: c

  struct nlmsghdr {
	__u32	nlmsg_len;	/* Length of message including headers */
	__u16	nlmsg_type;	/* Generic Netlink Family (subsystem) ID */
	__u16	nlmsg_flags;	/* Flags - request or dump */
	__u32	nlmsg_seq;	/* Sequence number */
	__u32	nlmsg_pid;	/* Port ID, set to 0 */
  };
  struct genlmsghdr {
	__u8	cmd;		/* Command, as defined by the Family */
	__u8	version;	/* Irrelevant, set to 1 */
	__u16	reserved;	/* Reserved, set to 0 */
  };
  /* TLV attributes follow... */

In Classic Netlink :c:member:`nlmsghdr.nlmsg_type` identifies
which operation within the subsystem the message refers to
(e.g. get information about a netdev). Generic Netlink needs to mux
multiple subsystems in a single protocol so it uses this field to
identify the subsystem, and :c:member:`genlmsghdr.cmd` identifies
the operation instead. (See :ref:`res_fam` for
information on how to find the Family ID of the subsystem of interest.)
Note that the first 16 values (0 - 15) of this field are reserved for
control messages both in Classic Netlink and Generic Netlink.
See :ref:`nl_msg_type` for more details.

There are 3 usual types of message exchanges on a Netlink socket:

 - performing a single action (``do``);
 - dumping information (``dump``);
 - getting asynchronous notifications (``multicast``).

Classic Netlink is very flexible and presumably allows other types
of exchanges to happen, but in practice those are the three that get
used.

Asynchronous notifications are sent by the kernel and received by
the user sockets which subscribed to them. ``do`` and ``dump`` requests
are initiated by the user. :c:member:`nlmsghdr.nlmsg_flags` should
be set as follows:

 - for ``do``: ``NLM_F_REQUEST | NLM_F_ACK``
 - for ``dump``: ``NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP``

:c:member:`nlmsghdr.nlmsg_seq` should be set to a monotonically
increasing value. The value gets echoed back in responses and doesn't
matter in practice, but setting it to an increasing value for each
message sent is considered good hygiene. The purpose of the field is
matching responses to requests. Asynchronous notifications will have
:c:member:`nlmsghdr.nlmsg_seq` of ``0``.

:c:member:`nlmsghdr.nlmsg_pid` is the Netlink equivalent of an address.
This field can be set to ``0`` when talking to the kernel.
See :ref:`nlmsg_pid` for the (uncommon) uses of the field.

The expected use for :c:member:`genlmsghdr.version` was to allow
versioning of the APIs provided by the subsystems. No subsystem to
date made significant use of this field, so setting it to ``1`` seems
like a safe bet.
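
As a rough sketch of the field descriptions above, the two fixed headers
of a Generic Netlink request could be filled in as follows. The helper
``genl_put_headers()`` and its parameters are hypothetical, made up for
this illustration; the buffer is assumed to be 4-byte aligned and at least
``NLMSG_LENGTH(GENL_HDRLEN)`` bytes long.

.. code-block:: c

  #include <assert.h>
  #include <string.h>
  #include <linux/netlink.h>
  #include <linux/genetlink.h>

  /* Hypothetical helper: write struct nlmsghdr and struct genlmsghdr at
   * the start of buf and return the message length so far.
   */
  static unsigned int genl_put_headers(void *buf, __u16 family_id, __u8 cmd,
                                       int is_dump, __u32 seq)
  {
    struct nlmsghdr *nlh = buf;
    struct genlmsghdr *gnlh;

    assert(buf != NULL);
    memset(buf, 0, NLMSG_LENGTH(GENL_HDRLEN));

    nlh->nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN); /* grows as attrs are added */
    nlh->nlmsg_type = family_id;
    nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    if (is_dump)
      nlh->nlmsg_flags |= NLM_F_DUMP;
    nlh->nlmsg_seq = seq;                       /* echoed back in responses */
    nlh->nlmsg_pid = 0;                         /* talking to the kernel */

    gnlh = NLMSG_DATA(nlh);
    gnlh->cmd = cmd;
    gnlh->version = 1;
    gnlh->reserved = 0;

    return nlh->nlmsg_len;
  }

A caller would typically pass a counter incremented for every request as
``seq`` and append attributes after the returned offset, updating
:c:member:`nlmsghdr.nlmsg_len` accordingly.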

.. _nl_msg_type:

Netlink message types
---------------------

As previously mentioned :c:member:`nlmsghdr.nlmsg_type` carries
protocol specific values but the first 16 identifiers are reserved
(first subsystem specific message type should be equal to
``NLMSG_MIN_TYPE`` which is ``0x10``).

There are only 4 Netlink control messages defined:

 - ``NLMSG_NOOP`` - ignore the message, not used in practice;
 - ``NLMSG_ERROR`` - carries the return code of an operation;
 - ``NLMSG_DONE`` - marks the end of a dump;
 - ``NLMSG_OVERRUN`` - socket buffer has overflowed, not used to date.

``NLMSG_ERROR`` and ``NLMSG_DONE`` are of practical importance.
They carry return codes for operations. Note that unless
the ``NLM_F_ACK`` flag is set on the request Netlink will not respond
with ``NLMSG_ERROR`` if there is no error. To avoid having to special-case
this quirk it is recommended to always set ``NLM_F_ACK``.

The format of ``NLMSG_ERROR`` is described by struct nlmsgerr::

  ----------------------------------------------
  | struct nlmsghdr - response header          |
  ----------------------------------------------
  |    int error                               |
  ----------------------------------------------
  | struct nlmsghdr - original request header  |
  ----------------------------------------------
  | ** optionally (1) payload of the request   |
  ----------------------------------------------
  | ** optionally (2) extended ACK             |
  ----------------------------------------------

There are two instances of struct nlmsghdr here, first of the response
and second of the request. ``NLMSG_ERROR`` carries the information about
the request which led to the error. This could be useful when trying
to match requests to responses or re-parse the request to dump it into
logs.

The payload of the request is not echoed in messages reporting success
(``error == 0``) or if the ``NETLINK_CAP_ACK`` socket option was set
with setsockopt(). The latter is common and perhaps recommended as
having to read a copy of every request back from the kernel is rather
wasteful. The absence of request payload is indicated by ``NLM_F_CAPPED``
in :c:member:`nlmsghdr.nlmsg_flags`.

The second optional element of ``NLMSG_ERROR`` is the set of extended ACK
attributes. See :ref:`ext_ack` for more details. The presence
of extended ACK is indicated by ``NLM_F_ACK_TLVS`` in
:c:member:`nlmsghdr.nlmsg_flags`.
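
The layout above could be read back along the following lines. This is
only a minimal sketch: ``nl_get_error()`` is a hypothetical helper name,
and the message is assumed to have already been bounds-checked against the
number of bytes returned by recv().

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <linux/netlink.h>

  /* Hypothetical helper: if nlh is a complete NLMSG_ERROR message, store
   * its error code in *err (0 for an ACK, a negative errno otherwise) and
   * return 1. Return 0 for any other message.
   */
  static int nl_get_error(const struct nlmsghdr *nlh, int *err)
  {
    const struct nlmsgerr *e;

    assert(nlh != NULL && err != NULL);
    if (nlh->nlmsg_type != NLMSG_ERROR ||
        nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*e)))
      return 0;

    /* struct nlmsgerr directly follows the response header */
    e = (const struct nlmsgerr *)((const char *)nlh + NLMSG_HDRLEN);
    *err = e->error;
    return 1;
  }

The copy of the request header is available as ``e->msg`` in the same
structure, which is what makes matching on the echoed
:c:member:`nlmsghdr.nlmsg_seq` possible.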

``NLMSG_DONE`` is simpler: the request is never echoed but the extended
ACK attributes may be present::

  ----------------------------------------------
  | struct nlmsghdr - response header          |
  ----------------------------------------------
  |    int error                               |
  ----------------------------------------------
  | ** optionally extended ACK                 |
  ----------------------------------------------

.. _res_fam:

Resolving the Family ID
-----------------------

This section explains how to find the Family ID of a subsystem.
It also serves as an example of Generic Netlink communication.

Generic Netlink is itself a subsystem exposed via the Generic Netlink API.
To avoid a circular dependency Generic Netlink has a statically allocated
Family ID (``GENL_ID_CTRL`` which is equal to ``NLMSG_MIN_TYPE``).
The Generic Netlink family implements a command used to find out information
about other families (``CTRL_CMD_GETFAMILY``).

To get information about the Generic Netlink family named for example
``"test1"`` we need to send a message on the previously opened Generic Netlink
socket. The message should target the Generic Netlink Family (1), be a
``do`` (2) call to ``CTRL_CMD_GETFAMILY`` (3). A ``dump`` version of this
call would make the kernel respond with information about *all* the families
it knows about. Last but not least the name of the family in question has
to be specified (4) as an attribute with the appropriate type::

  struct nlmsghdr:
    __u32 nlmsg_len:	32
    __u16 nlmsg_type:	GENL_ID_CTRL               // (1)
    __u16 nlmsg_flags:	NLM_F_REQUEST | NLM_F_ACK  // (2)
    __u32 nlmsg_seq:	1
    __u32 nlmsg_pid:	0

  struct genlmsghdr:
    __u8 cmd:		CTRL_CMD_GETFAMILY         // (3)
    __u8 version:	2 /* or 1, doesn't matter */
    __u16 reserved:	0

  struct nlattr:                                   // (4)
    __u16 nla_len:	10
    __u16 nla_type:	CTRL_ATTR_FAMILY_NAME
    char data:		test1\0

  (padding:)
    char data:		\0\0

The length fields in Netlink (:c:member:`nlmsghdr.nlmsg_len`
and :c:member:`nlattr.nla_len`) always *include* the header.
Attribute headers in Netlink must be aligned to 4 bytes from the start
of the message, hence the extra ``\0\0`` after ``CTRL_ATTR_FAMILY_NAME``.
The attribute lengths *exclude* the padding.
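
To make the length and padding rules concrete, the request above could be
assembled roughly as follows. The helper ``build_getfamily()`` is
hypothetical and exists only for this illustration; the caller is assumed
to provide a 4-byte aligned buffer.

.. code-block:: c

  #include <assert.h>
  #include <string.h>
  #include <linux/netlink.h>
  #include <linux/genetlink.h>

  /* Hypothetical helper: build a CTRL_CMD_GETFAMILY "do" request for the
   * family called name. Returns the total message length, or 0 if the
   * message does not fit in cap bytes.
   */
  static unsigned int build_getfamily(void *buf, size_t cap,
                                      const char *name, __u32 seq)
  {
    size_t name_len = strlen(name) + 1;       /* include the trailing \0 */
    size_t attr_len = NLA_HDRLEN + name_len;  /* header in, padding out */
    size_t total = NLMSG_LENGTH(GENL_HDRLEN) + NLA_ALIGN(attr_len);
    struct nlmsghdr *nlh = buf;
    struct genlmsghdr *gnlh;
    struct nlattr *attr;

    assert(buf != NULL);
    if (total > cap || attr_len > 0xffff)
      return 0;
    memset(buf, 0, total);                    /* also zeroes the padding */

    nlh->nlmsg_len = total;
    nlh->nlmsg_type = GENL_ID_CTRL;
    nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
    nlh->nlmsg_seq = seq;

    gnlh = NLMSG_DATA(nlh);
    gnlh->cmd = CTRL_CMD_GETFAMILY;
    gnlh->version = 1;

    attr = (struct nlattr *)((char *)gnlh + GENL_HDRLEN);
    attr->nla_len = attr_len;
    attr->nla_type = CTRL_ATTR_FAMILY_NAME;
    memcpy((char *)attr + NLA_HDRLEN, name, name_len);

    return total;
  }

For the name ``"test1"`` the attribute length is 10 bytes (4 bytes of
header plus 6 bytes of string), padded to 12, which together with the
16-byte and 4-byte fixed headers gives the 32 bytes shown above.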

If the family is found the kernel will reply with two messages. The first
is the response with all the information about the family::

  /* Message #1 - reply */
  struct nlmsghdr:
    __u32 nlmsg_len:	136
    __u16 nlmsg_type:	GENL_ID_CTRL
    __u16 nlmsg_flags:	0
    __u32 nlmsg_seq:	1    /* echoed from our request */
    __u32 nlmsg_pid:	5831 /* The PID of our user space process */

  struct genlmsghdr:
    __u8 cmd:		CTRL_CMD_GETFAMILY
    __u8 version:	2
    __u16 reserved:	0

  struct nlattr:
    __u16 nla_len:	10
    __u16 nla_type:	CTRL_ATTR_FAMILY_NAME
    char data:		test1\0

  (padding:)
    char data:		\0\0

  struct nlattr:
    __u16 nla_len:	6
    __u16 nla_type:	CTRL_ATTR_FAMILY_ID
    __u16:		123  /* The Family ID we are after */

  (padding:)
    char data:		\0\0

  struct nlattr:
    __u16 nla_len:	8
    __u16 nla_type:	CTRL_ATTR_VERSION
    __u32:		1

  /* ... etc, more attributes will follow. */

The second is the error code (success), sent because ``NLM_F_ACK`` had
been set on the request::

  /* Message #2 - the ACK */
  struct nlmsghdr:
    __u32 nlmsg_len:	36
    __u16 nlmsg_type:	NLMSG_ERROR
    __u16 nlmsg_flags:	NLM_F_CAPPED /* There won't be a payload */
    __u32 nlmsg_seq:	1    /* echoed from our request */
    __u32 nlmsg_pid:	5831 /* The PID of our user space process */

  int error:		0

  struct nlmsghdr: /* Copy of the request header as we sent it */
    __u32 nlmsg_len:	32
    __u16 nlmsg_type:	GENL_ID_CTRL
    __u16 nlmsg_flags:	NLM_F_REQUEST | NLM_F_ACK
    __u32 nlmsg_seq:	1
    __u32 nlmsg_pid:	0

The order of attributes (struct nlattr) is not guaranteed so the user
has to walk the attributes and parse them.
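
Walking the attributes of the reply could look roughly like the sketch
below. The helper ``genl_reply_family_id()`` is hypothetical; it assumes
that the message as a whole has already been validated against the number
of bytes received and that the Family ID is carried as a ``__u16``, as in
the example above.

.. code-block:: c

  #include <assert.h>
  #include <string.h>
  #include <linux/netlink.h>
  #include <linux/genetlink.h>

  /* Hypothetical helper: scan the attributes of a CTRL_CMD_GETFAMILY reply
   * and return the Family ID, or -1 if CTRL_ATTR_FAMILY_ID is absent or
   * an attribute is malformed.
   */
  static int genl_reply_family_id(const struct nlmsghdr *nlh)
  {
    const char *base = (const char *)nlh;
    size_t off = NLMSG_HDRLEN + GENL_HDRLEN;  /* first attribute */

    assert(nlh != NULL);
    while (off + NLA_HDRLEN <= nlh->nlmsg_len) {
      const struct nlattr *a = (const struct nlattr *)(base + off);

      if (a->nla_len < NLA_HDRLEN || off + a->nla_len > nlh->nlmsg_len)
        return -1;
      if ((a->nla_type & NLA_TYPE_MASK) == CTRL_ATTR_FAMILY_ID &&
          a->nla_len >= NLA_HDRLEN + sizeof(__u16)) {
        __u16 id;

        memcpy(&id, (const char *)a + NLA_HDRLEN, sizeof(id));
        return id;
      }
      off += NLA_ALIGN(a->nla_len);           /* skip payload and padding */
    }
    return -1;
  }

The loop makes no assumption about where ``CTRL_ATTR_FAMILY_ID`` appears
relative to the other attributes; unknown attribute types are simply
skipped.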

Note that Generic Netlink sockets are not associated or bound to a single
family. A socket can be used to exchange messages with many different
families, selecting the recipient family on a message-by-message basis using
the :c:member:`nlmsghdr.nlmsg_type` field.

.. _ext_ack:

Extended ACK
------------

Extended ACK controls reporting of additional error/warning TLVs
in ``NLMSG_ERROR`` and ``NLMSG_DONE`` messages. To maintain backward
compatibility this feature has to be explicitly enabled by setting
the ``NETLINK_EXT_ACK`` setsockopt() to ``1``.

Types of extended ack attributes are defined in enum nlmsgerr_attrs.
The most commonly used attributes are ``NLMSGERR_ATTR_MSG``,
``NLMSGERR_ATTR_OFFS`` and ``NLMSGERR_ATTR_MISS_*``.

``NLMSGERR_ATTR_MSG`` carries a message in English describing
the encountered problem. These messages are far more detailed
than what can be expressed through standard UNIX error codes.

``NLMSGERR_ATTR_OFFS`` points to the attribute which caused the problem.

``NLMSGERR_ATTR_MISS_TYPE`` and ``NLMSGERR_ATTR_MISS_NEST``
inform about a missing attribute.

Extended ACKs can be reported on errors as well as in case of success.
The latter should be treated as a warning.

Extended ACKs greatly improve the usability of Netlink and should
always be enabled, appropriately parsed and reported to the user.
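
Extracting the ``NLMSGERR_ATTR_MSG`` string from an ``NLMSG_ERROR`` message
could be sketched as below. The helper ``nl_ext_ack_msg()`` is hypothetical.
The sketch assumes that, when ``NLM_F_CAPPED`` is not set, the echoed
request occupies its 4-byte aligned length before the extended ACK
attributes, and that the message has already been validated against the
number of bytes received.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <linux/netlink.h>

  /* Hypothetical helper: return the NLMSGERR_ATTR_MSG string carried by
   * an NLMSG_ERROR message, or NULL if there is none.
   */
  static const char *nl_ext_ack_msg(const struct nlmsghdr *nlh)
  {
    const char *base = (const char *)nlh;
    size_t off = NLMSG_HDRLEN + sizeof(struct nlmsgerr);
    const struct nlmsgerr *e;

    assert(nlh != NULL);
    if (nlh->nlmsg_type != NLMSG_ERROR || nlh->nlmsg_len < off ||
        !(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
      return NULL;

    /* Without NLM_F_CAPPED the request payload precedes the TLVs */
    e = (const struct nlmsgerr *)(base + NLMSG_HDRLEN);
    if (!(nlh->nlmsg_flags & NLM_F_CAPPED) &&
        e->msg.nlmsg_len > NLMSG_HDRLEN)
      off += NLMSG_ALIGN(e->msg.nlmsg_len) - NLMSG_HDRLEN;

    while (off + NLA_HDRLEN <= nlh->nlmsg_len) {
      const struct nlattr *a = (const struct nlattr *)(base + off);

      if (a->nla_len < NLA_HDRLEN || off + a->nla_len > nlh->nlmsg_len)
        break;
      /* Only hand out strings which are properly terminated */
      if ((a->nla_type & NLA_TYPE_MASK) == NLMSGERR_ATTR_MSG &&
          a->nla_len > NLA_HDRLEN && base[off + a->nla_len - 1] == '\0')
        return (const char *)a + NLA_HDRLEN;
      off += NLA_ALIGN(a->nla_len);
    }
    return NULL;
  }

A non-NULL result is worth showing to the user even when the error code is
0, since in that case it is a warning.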

Advanced topics
===============

Dump consistency
----------------

Some of the data structures the kernel uses for storing objects make
it hard to provide an atomic snapshot of all the objects in a dump
(without impacting the fast-paths updating them).

The kernel may set the ``NLM_F_DUMP_INTR`` flag on any message in a dump
(including the ``NLMSG_DONE`` message) if the dump was interrupted and
may be inconsistent (e.g. missing objects). User space should retry
the dump if it sees the flag set.
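
Checking for the flag could be as simple as the sketch below, run on
every buffer returned by recv() during a dump. The helper name
``nl_dump_interrupted()`` is made up for this example.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <linux/netlink.h>

  /* Hypothetical helper: return 1 if any message among the len bytes
   * received into buf has NLM_F_DUMP_INTR set, 0 otherwise.
   */
  static int nl_dump_interrupted(const void *buf, int len)
  {
    const struct nlmsghdr *nlh = buf;

    assert(buf != NULL && len >= 0);
    for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
      if (nlh->nlmsg_flags & NLM_F_DUMP_INTR)
        return 1;
    }
    return 0;
  }

If the helper ever returns 1, the objects collected so far should be
discarded once ``NLMSG_DONE`` arrives and the dump request sent again.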

Introspection
-------------

The basic introspection abilities are enabled by access to the Family
object as reported in :ref:`res_fam`. The user can query information about
the Generic Netlink family, including which operations are supported
by the kernel and what attributes the kernel understands.
Family information includes the highest ID of an attribute the kernel
can parse. A separate command (``CTRL_CMD_GETPOLICY``) provides detailed
information about supported attributes, including ranges of values the
kernel accepts.

Querying family information is useful in cases when user space needs
to make sure that the kernel has support for a feature before issuing
a request.

.. _nlmsg_pid:

nlmsg_pid
---------

:c:member:`nlmsghdr.nlmsg_pid` is the Netlink equivalent of an address.
It is referred to as Port ID, sometimes Process ID because for historical
reasons if the application does not select (bind() to) an explicit Port ID
the kernel will automatically assign it the ID equal to its Process ID
(as reported by the getpid() system call).

Similarly to the bind() semantics of the TCP/IP network protocols the value
of zero means "assign automatically", hence it is common for applications
to leave the :c:member:`nlmsghdr.nlmsg_pid` field initialized to ``0``.

The field is still used today in rare cases when the kernel needs to send
a unicast notification. A user space application can use bind() to associate
its socket with a specific PID and then communicate its PID to the kernel.
This way the kernel can reach the specific user space process.

This sort of communication is utilized in UMH (User Mode Helper)-like
scenarios when the kernel needs to trigger user space processing or ask user
space for a policy decision.

Multicast notifications
-----------------------

One of the strengths of Netlink is the ability to send event notifications
to user space. This is a unidirectional form of communication (kernel ->
user) and does not involve any control messages like ``NLMSG_ERROR`` or
``NLMSG_DONE``.

For example the Generic Netlink family itself defines a set of multicast
notifications about registered families. When a new family is added the
sockets subscribed to the notifications will get the following message::

  struct nlmsghdr:
    __u32 nlmsg_len:	136
    __u16 nlmsg_type:	GENL_ID_CTRL
    __u16 nlmsg_flags:	0
    __u32 nlmsg_seq:	0
    __u32 nlmsg_pid:	0

  struct genlmsghdr:
    __u8 cmd:		CTRL_CMD_NEWFAMILY
    __u8 version:	2
    __u16 reserved:	0

  struct nlattr:
    __u16 nla_len:	10
    __u16 nla_type:	CTRL_ATTR_FAMILY_NAME
    char data:		test1\0

  (padding:)
    char data:		\0\0

  struct nlattr:
    __u16 nla_len:	6
    __u16 nla_type:	CTRL_ATTR_FAMILY_ID
    __u16:		123  /* The Family ID we are after */

  (padding:)
    char data:		\0\0

  struct nlattr:
    __u16 nla_len:	8
    __u16 nla_type:	CTRL_ATTR_VERSION
    __u32:		1

  /* ... etc, more attributes will follow. */

The notification contains the same information as the response
to the ``CTRL_CMD_GETFAMILY`` request.

The Netlink headers of the notification are mostly 0 and irrelevant.
The :c:member:`nlmsghdr.nlmsg_seq` may be either zero or a monotonically
increasing notification sequence number maintained by the family.

To receive notifications the user socket must subscribe to the relevant
notification group. Much like the Family ID, the Group ID for a given
multicast group is dynamic and can be found inside the Family information.
The ``CTRL_ATTR_MCAST_GROUPS`` attribute contains nests with names
(``CTRL_ATTR_MCAST_GRP_NAME``) and IDs (``CTRL_ATTR_MCAST_GRP_ID``) of
the family's groups.
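
Looking up a Group ID by name inside the payload of
``CTRL_ATTR_MCAST_GROUPS`` could be sketched as follows. Both helpers are
hypothetical. The sketch assumes that each group is reported as one nested
attribute holding a NUL-terminated name and a ``__u32`` ID, and that ``0``
is never a valid Group ID so it can double as "not found".

.. code-block:: c

  #include <assert.h>
  #include <string.h>
  #include <linux/netlink.h>
  #include <linux/genetlink.h>

  /* Return the attribute at p if it fits within len bytes, else NULL. */
  static const struct nlattr *attr_first(const char *p, size_t len)
  {
    const struct nlattr *a = (const struct nlattr *)p;

    if (len < NLA_HDRLEN || a->nla_len < NLA_HDRLEN || a->nla_len > len)
      return NULL;
    return a;
  }

  /* Hypothetical helper: find the group called name in the payload of
   * CTRL_ATTR_MCAST_GROUPS and return its ID, or 0 if it is not listed.
   */
  static __u32 mcast_group_id(const char *groups, size_t len,
                              const char *name)
  {
    size_t name_sz = strlen(name) + 1;
    const struct nlattr *grp, *a;

    assert(groups != NULL);
    while ((grp = attr_first(groups, len))) {
      const char *p = (const char *)grp + NLA_HDRLEN;
      size_t left = grp->nla_len - NLA_HDRLEN;
      size_t step = NLA_ALIGN(grp->nla_len);
      int match = 0;
      __u32 id = 0;

      /* Walk the attributes nested inside this group entry */
      while ((a = attr_first(p, left))) {
        const char *val = (const char *)a + NLA_HDRLEN;
        size_t val_len = a->nla_len - NLA_HDRLEN;
        size_t astep = NLA_ALIGN(a->nla_len);

        if (a->nla_type == CTRL_ATTR_MCAST_GRP_NAME)
          match = (val_len == name_sz && !memcmp(val, name, name_sz));
        else if (a->nla_type == CTRL_ATTR_MCAST_GRP_ID &&
                 val_len >= sizeof(id))
          memcpy(&id, val, sizeof(id));
        if (astep > left)
          break;
        p += astep;
        left -= astep;
      }
      if (match)
        return id;
      if (step > len)
        break;
      groups += step;
      len -= step;
    }
    return 0;
  }

The name and the ID are collected independently of each other because the
order of attributes inside the nest is not guaranteed either.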

Once the Group ID is known a setsockopt() call adds the socket to the group:

.. code-block:: c

  unsigned int group_id;

  /* .. find the group ID... */

  setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
             &group_id, sizeof(group_id));

The socket will now receive notifications.

It is recommended to use separate sockets for receiving notifications
and sending requests to the kernel. The asynchronous nature of notifications
means that they may get mixed in with the responses making the message
handling much harder.
512510156a7SJakub Kicinski
Buffer sizing
-------------

Netlink sockets are datagram sockets rather than stream sockets,
meaning that each message must be received in its entirety by a single
recv()/recvmsg() system call. If the buffer provided by the user is too
short, the message will be truncated and the ``MSG_TRUNC`` flag set
in struct msghdr (struct msghdr is the second argument
of the recvmsg() system call, *not* a Netlink header).

Upon truncation the remaining part of the message is discarded.

Netlink expects that the user buffer will be at least 8kB or a page
size of the CPU architecture, whichever is bigger. Particular Netlink
families may, however, require a larger buffer. A 32kB buffer is recommended
for the most efficient handling of dumps (a larger buffer fits more dumped
objects and therefore fewer recvmsg() calls are needed).

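A receive path that follows these rules might look like the sketch below.
Both helper names are made up for this example; the code only relies on
recvmsg() reporting ``MSG_TRUNC`` in ``msg_flags``, which holds for datagram
sockets in general and not just for Netlink.

.. code-block:: c

  #include <errno.h>
  #include <stddef.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <unistd.h>

  /* Smallest buffer Netlink expects: 8kB or a page, whichever is bigger. */
  static size_t nl_min_rcvbuf(void)
  {
      long page = sysconf(_SC_PAGESIZE);

      return page > 8192 ? (size_t)page : 8192;
  }

  /*
   * Receive one whole message into @buf. Returns its length, or -1 with
   * errno set. A truncated message is reported as EMSGSIZE because the
   * part that did not fit has already been discarded.
   */
  static ssize_t nl_recv_whole(int fd, void *buf, size_t len)
  {
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
      ssize_t n;

      n = recvmsg(fd, &msg, 0);
      if (n < 0)
          return -1;
      if (msg.msg_flags & MSG_TRUNC) {
          errno = EMSGSIZE;
          return -1;
      }
      return n;
  }

A caller would typically allocate a buffer of at least ``nl_min_rcvbuf()``
bytes (or 32kB when dumps are expected) once and reuse it for every call.
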
.. _classic_netlink:

Classic Netlink
===============

The main differences between Classic and Generic Netlink are the dynamic
allocation of subsystem identifiers and availability of introspection.
In theory the protocol does not differ significantly. In practice, however,
Classic Netlink experimented with concepts which were abandoned in Generic
Netlink (really, they usually only found use in a small corner of a single
subsystem). This section is meant as an explainer of a few of such concepts,
with the explicit goal of giving the Generic Netlink
users the confidence to ignore them when reading the uAPI headers.

Most of the concepts and examples here refer to the ``NETLINK_ROUTE`` family,
which covers much of the configuration of the Linux networking stack.
Real documentation of that family deserves a chapter (or a book) of its own.

Families
--------

Netlink refers to subsystems as families. This is a remnant of using
sockets and the concept of protocol families, which are part of message
demultiplexing in ``NETLINK_ROUTE``.

Sadly every layer of encapsulation likes to refer to whatever it's carrying
as "families" making the term very confusing:

 1. AF_NETLINK is a bona fide socket protocol family
 2. AF_NETLINK's documentation refers to what comes after its own
    header (struct nlmsghdr) in a message as a "Family Header"
 3. Generic Netlink is a family for AF_NETLINK (struct genlmsghdr follows
    struct nlmsghdr), yet it also calls its users "Families".

Note that the Generic Netlink Family IDs are in a different "ID space"
and overlap with Classic Netlink protocol numbers (e.g. ``NETLINK_CRYPTO``
has the Classic Netlink protocol ID of 21 which Generic Netlink will
happily allocate to one of its families as well).

Strict checking
---------------

The ``NETLINK_GET_STRICT_CHK`` socket option enables strict input checking
in ``NETLINK_ROUTE``. It was needed because historically the kernel did not
validate the fields of structures it didn't process. This made it impossible
to start using those fields later without risking regressions in applications
which initialized them incorrectly or not at all.

``NETLINK_GET_STRICT_CHK`` declares that the application is initializing
all fields correctly. It also opts into validating that the message does not
contain trailing data and requests that the kernel reject attributes with
a type higher than the largest attribute type known to the kernel.

``NETLINK_GET_STRICT_CHK`` is not used outside of ``NETLINK_ROUTE``.

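Opting in is a single setsockopt() call on the ``NETLINK_ROUTE`` socket.
The wrapper below is a minimal sketch; its name is invented for this example,
and the fallback definition of ``SOL_NETLINK`` is only there for C libraries
whose headers lack it.

.. code-block:: c

  #include <sys/socket.h>
  #include <linux/netlink.h>

  #ifndef SOL_NETLINK
  #define SOL_NETLINK 270
  #endif

  /*
   * Enable strict checking on @fd.
   * Returns 0 on success, -1 with errno set on failure (for example on
   * kernels which predate the option).
   */
  static int nl_enable_strict_chk(int fd)
  {
      int one = 1;

      return setsockopt(fd, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
                        &one, sizeof(one));
  }

An application that wants to keep working on older kernels can treat
a failure here as "strict checking unavailable" and carry on.
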
Unknown attributes
------------------

Historically Netlink ignored all unknown attributes. The thinking was that
it would free the application from having to probe what the kernel supports.
The application could make a request to change the state and check which
parts of the request "stuck".

This is no longer the case for new Generic Netlink families and those opting
in to strict checking. See enum netlink_validation for the validation types
performed.

Fixed metadata and structures
-----------------------------

Classic Netlink made liberal use of fixed-format structures within
the messages. Messages would commonly have a structure with
a considerable number of fields after struct nlmsghdr. It was also
common to put structures with multiple members inside attributes,
without breaking each member into an attribute of its own.

This has caused problems with validation and extensibility and
therefore using binary structures is actively discouraged for new
attributes.

Request types
-------------

``NETLINK_ROUTE`` categorized requests into 4 types: ``NEW``, ``DEL``, ``GET``,
and ``SET``. Each object can handle all or some of those requests
(objects being netdevs, routes, addresses, qdiscs etc.). The request type
is defined by the 2 lowest bits of the message type, so commands for
new objects would always be allocated with a stride of 4.

Each object would also have its own fixed metadata shared by all request
types (e.g. struct ifinfomsg for netdev requests, struct ifaddrmsg for address
requests, struct tcmsg for qdisc requests).

Even though other protocols and Generic Netlink commands often use
the same verbs in their message names (``GET``, ``SET``) the concept
of request types did not find wider adoption.

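Because of the stride of 4, both the request type and the object can be
recovered from ``nlmsg_type`` with simple arithmetic. The sketch below
illustrates this; the enum and the two helper names are invented for this
example, only the ``RTM_*`` constants come from ``<linux/rtnetlink.h>``.

.. code-block:: c

  #include <linux/rtnetlink.h>

  /* Hypothetical names for the 4 request types, in their bit order. */
  enum nl_req_type {
      NL_REQ_NEW = 0,
      NL_REQ_DEL = 1,
      NL_REQ_GET = 2,
      NL_REQ_SET = 3,
  };

  /* The 2 lowest bits of the message type select the request type. */
  static enum nl_req_type nl_req_type(unsigned int nlmsg_type)
  {
      return (enum nl_req_type)(nlmsg_type & 3);
  }

  /* The remaining bits, counted from RTM_BASE, select the object. */
  static unsigned int nl_req_object(unsigned int nlmsg_type)
  {
      return (nlmsg_type - RTM_BASE) >> 2;
  }

With these helpers ``RTM_NEWLINK``, ``RTM_DELLINK``, ``RTM_GETLINK`` and
``RTM_SETLINK`` all map to object 0 and differ only in the request type.
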
Notification echo
-----------------

``NLM_F_ECHO`` requests that notifications resulting from the request
be queued onto the requesting socket. This is useful to discover
the impact of the request.

Note that this feature is not universally implemented.

Other request-type-specific flags
---------------------------------

Classic Netlink defined various flags for its ``GET``, ``NEW``
and ``DEL`` requests in the upper byte of nlmsg_flags in struct nlmsghdr.
Since request types have not been generalized, the request-type-specific
flags are rarely used (and considered deprecated for new families).

For ``GET`` - ``NLM_F_ROOT`` and ``NLM_F_MATCH`` are combined into
``NLM_F_DUMP``, and not used separately. ``NLM_F_ATOMIC`` is never used.

For ``DEL`` - ``NLM_F_NONREC`` is only used by nftables and ``NLM_F_BULK``
only by some FDB operations.

The flags for ``NEW`` are used most commonly in Classic Netlink. Unfortunately,
the meaning is not crystal clear. The following description is based on the
best guess of the intention of the authors, and in practice all families
stray from it in one way or another. ``NLM_F_REPLACE`` asks to replace
an existing object; if no matching object exists the operation should fail.
``NLM_F_EXCL`` has the opposite semantics: the operation should fail
if a matching object already exists.
``NLM_F_CREATE`` asks for the object to be created if it does not
exist; it can be combined with ``NLM_F_REPLACE`` and ``NLM_F_EXCL``.

A comment in the main Netlink uAPI header states::

   4.4BSD ADD           NLM_F_CREATE|NLM_F_EXCL
   4.4BSD CHANGE        NLM_F_REPLACE

   True CHANGE          NLM_F_CREATE|NLM_F_REPLACE
   Append               NLM_F_CREATE
   Check                NLM_F_EXCL

which seems to indicate that those flags predate request types.
``NLM_F_REPLACE`` without ``NLM_F_CREATE`` was initially used instead
of ``SET`` commands.
``NLM_F_EXCL`` without ``NLM_F_CREATE`` was used to check if an object exists
without creating it, presumably predating ``GET`` commands.

``NLM_F_APPEND`` indicates that if one key can have multiple objects associated
with it (e.g. multiple next-hop objects for a route) the new object should be
added to the list rather than replacing the entire list.

uAPI reference
==============

.. kernel-doc:: include/uapi/linux/netlink.h