=============
CFS Scheduler
=============


1.  OVERVIEW
============

CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23.  It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.

80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed.  For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.

On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime."  The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above.  In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.
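
In the kernel this bookkeeping is done per task: each slice of real CPU time
is effectively scaled by the task's load weight (which encodes its nice level)
before being added to its virtual runtime.  A minimal user-space sketch of
that scaling (the weight values and the helper name below are illustrative
stand-ins, not the kernel's actual tables)::

   /* Sketch: turn real CPU time into "virtual runtime". */
   #include <stdint.h>
   #include <stdio.h>

   #define NICE_0_WEIGHT 1024ULL   /* illustrative weight of a nice-0 task */

   /* Heavier (higher-weight) tasks accumulate vruntime more slowly, so they
    * may run longer before another task becomes "leftmost". */
   static uint64_t advance_vruntime(uint64_t vruntime_ns,
                                    uint64_t delta_exec_ns,
                                    uint64_t weight)
   {
           return vruntime_ns + delta_exec_ns * NICE_0_WEIGHT / weight;
   }

   int main(void)
   {
           /* Both tasks run 10 ms of wall-clock CPU time... */
           uint64_t nice0 = advance_vruntime(0, 10000000, 1024); /* nice 0  */
           uint64_t nice5 = advance_vruntime(0, 10000000, 335);  /* ~nice +5 */

           /* ...but the lower-weight task's vruntime advances about 3x
            * faster, so it waits longer before being scheduled again. */
           printf("nice 0: %llu ns, nice +5: %llu ns\n",
                  (unsigned long long)nice0, (unsigned long long)nice5);
           return 0;
   }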



2.  FEW IMPLEMENTATION DETAILS
==============================

In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.

   Small detail: on "ideal" hardware, at any time all tasks would have the same
   p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
   would ever get "out of balance" from the "ideal" share of CPU time.

CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far).  CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.

Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.



3.  THE RBTREE
==============

CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).

CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
increasing value tracking the smallest vruntime among all tasks in the
runqueue.  The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.
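
A hedged sketch of this bookkeeping (the struct and function names below are
illustrative stand-ins, not the kernel's exact helpers)::

   #include <stdint.h>

   struct cfs_rq_sketch {
           uint64_t min_vruntime;          /* monotonically increasing */
   };

   struct sched_entity_sketch {
           uint64_t vruntime;
   };

   static uint64_t max_u64(uint64_t a, uint64_t b)
   {
           return a > b ? a : b;
   }

   /* Pull min_vruntime up toward the smallest queued vruntime, but never
    * let it move backwards. */
   static void update_min_vruntime(struct cfs_rq_sketch *cfs_rq,
                                   uint64_t leftmost_vruntime)
   {
           cfs_rq->min_vruntime = max_u64(cfs_rq->min_vruntime,
                                          leftmost_vruntime);
   }

   /* A newly activated entity starts around min_vruntime, so it lands near
    * the left side of the tree without being able to hog the CPU by
    * arriving with an artificially small vruntime. */
   static void place_entity(struct cfs_rq_sketch *cfs_rq,
                            struct sched_entity_sketch *se)
   {
           se->vruntime = max_u64(se->vruntime, cfs_rq->min_vruntime);
   }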

The total load of the running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.

CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it.
As the system progresses forwards, the executed tasks are put into the tree
more and more to the right --- slowly but surely giving a chance for every task
to become the "leftmost task" and thus get on the CPU within a deterministic
amount of time.
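
The "pick the leftmost task" rule is easy to sketch outside the kernel; in the
sketch below a plain array stands in for the rbtree purely to show the
ordering criterion (the kernel keeps the tree keyed by vruntime and caches the
leftmost node, so the real pick does not need a search)::

   #include <stddef.h>
   #include <stdint.h>

   struct task_sketch {
           const char *name;
           uint64_t vruntime;      /* key the rbtree would be sorted by */
   };

   /* Equivalent of "take the leftmost node of the time-ordered tree":
    * the runnable task that has received the least (weighted) CPU time. */
   static struct task_sketch *pick_next(struct task_sketch *rq, size_t nr)
   {
           struct task_sketch *leftmost = NULL;
           size_t i;

           for (i = 0; i < nr; i++)
                   if (!leftmost || rq[i].vruntime < leftmost->vruntime)
                           leftmost = &rq[i];

           return leftmost;        /* NULL when the runqueue is empty */
   }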

Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime.  Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and thrash the cache), then the new leftmost task is
picked and the current task is preempted.
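
The same accounting-plus-preemption step, as a hedged user-space sketch (the
granularity constant and the names are illustrative, and the nice-level
weighting is left out for brevity)::

   #include <stdbool.h>
   #include <stdint.h>

   #define GRANULARITY_NS 3000000ULL       /* illustrative, ~3 ms */

   struct entity_sketch {
           uint64_t vruntime;
   };

   /* Charge the time the current task just spent on the CPU... */
   static void account_exec(struct entity_sketch *curr, uint64_t delta_exec_ns)
   {
           curr->vruntime += delta_exec_ns;
   }

   /* ...and preempt it only once another task is far enough to its "left"
    * that switching is worth the cache disruption. */
   static bool should_preempt(const struct entity_sketch *curr,
                              const struct entity_sketch *leftmost)
   {
           return curr->vruntime > leftmost->vruntime + GRANULARITY_NS;
   }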



4.  SOME FEATURES OF CFS
========================

CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail.  Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever.  There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG)::

   /sys/kernel/debug/sched/base_slice_ns

which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads.  It defaults to a setting suitable
for desktop workloads.  SCHED_BATCH is handled by the CFS scheduler module too.

Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c, massive_intr.c all work fine, do not impact
interactivity, and produce the expected behavior.

The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
than the previous vanilla scheduler: both types of workloads are isolated much
more aggressively.

SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used.  The balancing code got quite a bit simpler as a
result.



5. Scheduling policies
======================

CFS implements three scheduling policies:

  - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
    policy that is used for regular tasks.

  - SCHED_BATCH: Does not preempt nearly as often as regular tasks
    would, thereby allowing tasks to run longer and make better use of
    caches, but at the cost of interactivity.  This is well suited for
    batch jobs.

  - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
    idle-timer scheduler, in order to avoid getting into priority
    inversion problems which would deadlock the machine.

SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
POSIX.

The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.
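
A process can also select one of these policies programmatically.  A minimal
sketch using the standard sched_setscheduler(2) interface (for SCHED_BATCH and
SCHED_IDLE the static priority must be 0)::

   #define _GNU_SOURCE             /* for SCHED_BATCH / SCHED_IDLE */
   #include <sched.h>
   #include <stdio.h>

   int main(void)
   {
           struct sched_param sp = { .sched_priority = 0 };

           /* Ask CFS to treat the calling process as a batch job. */
           if (sched_setscheduler(0 /* self */, SCHED_BATCH, &sp) == -1) {
                   perror("sched_setscheduler");
                   return 1;
           }
           printf("now running under SCHED_BATCH\n");
           return 0;
   }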



6.  SCHEDULING CLASSES
======================

The new CFS scheduler has been designed in such a way as to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules.  These modules
encapsulate scheduling policy details and are handled by the scheduler core
without the core code assuming too much about them.

sched/fair.c implements the CFS scheduler described above.

sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did.  It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.

Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.

This is the (partial) list of the hooks:

 - enqueue_task(...)

   Called when a task enters a runnable state.
   It puts the scheduling entity (task) into the red-black tree and
   increments the nr_running variable.

 - dequeue_task(...)

   When a task is no longer runnable, this function is called to keep the
   corresponding scheduling entity out of the red-black tree.  It decrements
   the nr_running variable.

 - yield_task(...)

   This function is basically just a dequeue followed by an enqueue, unless the
   compat_yield sysctl is turned on; in that case, it places the scheduling
   entity at the rightmost end of the red-black tree.

 - check_preempt_curr(...)

   This function checks if a task that entered the runnable state should
   preempt the currently running task.

 - pick_next_task(...)

   This function chooses the most appropriate task eligible to run next.

 - set_curr_task(...)

   This function is called when a task changes its scheduling class or changes
   its task group.

 - task_tick(...)

   This function is mostly called from time tick functions; it might lead to
   a process switch.  This drives the running preemption.
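
To make the shape of this interface concrete, here is a hedged sketch of such
a table of hooks; the exact field names, signatures and members of struct
sched_class vary between kernel versions, so treat this as an illustration of
the list above rather than a copy of the kernel sources::

   struct rq;                      /* per-CPU runqueue */
   struct task_struct;

   struct sched_class_sketch {
           void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
           void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
           void (*yield_task)(struct rq *rq);
           void (*check_preempt_curr)(struct rq *rq, struct task_struct *p,
                                      int flags);
           struct task_struct *(*pick_next_task)(struct rq *rq);
           void (*set_curr_task)(struct rq *rq);
           void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
   };

Each scheduling class (sched/fair.c, sched/rt.c, ...) fills in such a table
with its own implementations, and the scheduler core only ever calls through
these hooks.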




7.  GROUP SCHEDULER EXTENSIONS TO CFS
=====================================

Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task.  Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group.  For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.

CONFIG_CGROUP_SCHED strives to achieve exactly that.  It lets tasks be grouped
and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED permits grouping real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.

CONFIG_FAIR_GROUP_SCHED permits grouping CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.

   These options need CONFIG_CGROUPS to be defined, and let the administrator
   create arbitrary groups of tasks, using the "cgroup" pseudo filesystem.  See
   Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about this filesystem.

When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem.  See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem::

	# mount -t tmpfs cgroup_root /sys/fs/cgroup
	# mkdir /sys/fs/cgroup/cpu
	# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
	# cd /sys/fs/cgroup/cpu

	# mkdir multimedia	# create "multimedia" group of tasks
	# mkdir browser		# create "browser" group of tasks

	# #Configure the multimedia group to receive twice the CPU bandwidth
	# #of the browser group

	# echo 2048 > multimedia/cpu.shares
	# echo 1024 > browser/cpu.shares

	# firefox &	# Launch firefox and move it to "browser" group
	# echo <firefox_pid> > browser/tasks

	# #Launch gmplayer (or your favourite movie player)
	# echo <movie_player_pid> > multimedia/tasks