===========================================
Automatically bind swap device to numa node
===========================================

If the system has more than one swap device and each swap device has NUMA
node information, the kernel can use this information in get_swap_pages()
to decide which swap device to use, for better performance.


How to use this feature
=======================

Each swap device has a priority, which decides the order in which the devices
are used. To make use of automatic binding, there is no need to manipulate the
priority settings of swap devices. For example, on a 2 node machine, assume
that 2 swap devices, swapA attached to node 0 and swapB attached to node 1,
are going to be swapped on. Simply swap them on by doing::

	# swapon /dev/swapA
	# swapon /dev/swapB

Then node 0 will use the two swap devices in the order of swapA then swapB, and
node 1 will use them in the order of swapB then swapA. Note that the order in
which they are swapped on doesn't matter.

A more complex example on a 4 node machine: assume 6 swap devices are going to
be swapped on. swapA and swapB are attached to node 0, swapC is attached to
node 1, swapD and swapE are attached to node 2, and swapF is attached to node 3.
The way to swap them on is the same as above::

	# swapon /dev/swapA
	# swapon /dev/swapB
	# swapon /dev/swapC
	# swapon /dev/swapD
	# swapon /dev/swapE
	# swapon /dev/swapF

Then node 0 will use them in the order of::

	swapA/swapB -> swapC -> swapD -> swapE -> swapF

swapA and swapB will be used in round robin mode before any other swap device.

Node 1 will use them in the order of::

	swapC -> swapA -> swapB -> swapD -> swapE -> swapF

Node 2 will use them in the order of::

	swapD/swapE -> swapA -> swapB -> swapC -> swapF

Similarly, swapD and swapE will be used in round robin mode before any
other swap device.

Node 3 will use them in the order of::

	swapF -> swapA -> swapB -> swapC -> swapD -> swapE
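
The four orderings above can be reproduced with a small model. This is only an
illustrative sketch, not kernel code: the function name ``swap_order`` and the
``devices`` mapping are hypothetical. It assumes that all devices are swapped
on without an explicit priority and that devices on other nodes are tried in
swapon order, which is what the orderings listed above show.

```python
def swap_order(device_nodes, node):
    """Return the order in which `node` tries the swap devices.

    device_nodes maps device name -> NUMA node, listed in swapon order
    (Python dicts preserve insertion order). Each element of the result
    is a group of devices that are used round robin.
    """
    # Devices attached to this node are promoted and share one group.
    local = [dev for dev, n in device_nodes.items() if n == node]
    # Devices on other nodes keep their swapon order, one per group.
    remote = [[dev] for dev, n in device_nodes.items() if n != node]
    return ([local] if local else []) + remote


# Hypothetical topology taken from the 4 node example.
devices = {"swapA": 0, "swapB": 0, "swapC": 1,
           "swapD": 2, "swapE": 2, "swapF": 3}

for node in range(4):
    groups = swap_order(devices, node)
    print("node", node, ":", " -> ".join("/".join(g) for g in groups))
```

Devices joined with ``/`` in the printed output form one round robin group,
matching the notation used in the examples.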


Implementation details
======================

The current code uses a priority based list, swap_avail_list, to decide
which swap device to use; if multiple swap devices share the same
priority, they are used round robin. This change replaces the single
global swap_avail_list with a per-NUMA-node list, i.e. each NUMA node
sees its own priority based list of available swap devices. A swap
device's priority can be promoted on its matching node's swap_avail_list.
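
The round robin behaviour of one per-node list can be pictured with a toy
model. This is a simplified sketch and not the kernel's plist code: the class
``AvailList`` and the numeric keys are placeholders chosen for illustration,
and the model ignores devices becoming full, so lower priority devices are
never reached here.

```python
class AvailList:
    """Toy model of one node's list of available swap devices."""

    def __init__(self, entries):
        # entries: (key, device) pairs; a lower key is tried first.
        # sorted() is stable, so equal keys keep their given order.
        self.entries = sorted(entries, key=lambda entry: entry[0])

    def pick(self):
        """Return the device to use next and requeue it behind its peers."""
        key, dev = self.entries.pop(0)
        # Skip over the remaining entries that share the same key ...
        pos = 0
        while pos < len(self.entries) and self.entries[pos][0] == key:
            pos += 1
        # ... and put the picked device back right after them.
        self.entries.insert(pos, (key, dev))
        return dev


# Hypothetical list for node 0: swapA and swapB share the best key.
node0 = AvailList([(1, "swapA"), (1, "swapB"), (2, "swapC")])
picks = [node0.pick() for _ in range(4)]
print(picks)
```

With one such list per node, each node can rotate over its own promoted
devices without affecting the order seen by the other nodes.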

A swap device's priority is currently set as follows: the user can set a
value >= 0, or the system will pick one, starting from -1 and going downwards.
The priority value in the swap_avail_list is the negated value of the swap
device's priority, because plist is sorted from low to high. The new policy
doesn't change the semantics for priority >= 0 cases. The automatically picked
values, which previously started from -1 and went downwards, now start from -2,
and -1 is reserved as the promoted value. So if multiple swap devices are
attached to the same node, they will all be promoted to priority -1 on that
node's plist and will be used round robin before any other swap devices.
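
The numbering described above can be worked through in a short sketch. The
helper names ``assign_priorities`` and ``plist_key`` are hypothetical and the
code only models the rules stated in this section. It assumes that promotion
applies only to automatically picked (negative) priorities, which is how the
unchanged semantics for priority >= 0 are read here.

```python
def assign_priorities(requested):
    """Map swapon requests to device priorities.

    requested: list of user priorities (>= 0), or None when the system
    should pick one, in swapon order. Picked values start at -2 and go
    downwards, since -1 is reserved as the promoted value.
    """
    prios = []
    next_auto = -2
    for req in requested:
        if req is None:
            prios.append(next_auto)
            next_auto -= 1
        else:
            prios.append(req)
    return prios


def plist_key(prio, dev_node, list_node):
    """Key of a device on list_node's list (plist sorts low to high)."""
    if prio < 0 and dev_node == list_node:
        prio = -1  # promoted on the matching node's list
    return -prio  # negate so that a higher priority sorts first


# Hypothetical setup: swapA and swapB on node 0, swapC on node 1.
nodes = {"swapA": 0, "swapB": 0, "swapC": 1}
prios = dict(zip(nodes, assign_priorities([None, None, None])))
for list_node in (0, 1):
    keys = {dev: plist_key(prios[dev], nodes[dev], list_node) for dev in nodes}
    print("node", list_node, keys)
```

On node 0 both swapA and swapB end up with the same key, which is why they
are used round robin there, while on node 1 they keep their distinct keys.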