1d6a3b247SMauro Carvalho Chehab============= 2d6a3b247SMauro Carvalho ChehabCFS Scheduler 3d6a3b247SMauro Carvalho Chehab============= 4d6a3b247SMauro Carvalho Chehab 5d6a3b247SMauro Carvalho Chehab 6d6a3b247SMauro Carvalho Chehab1. OVERVIEW 7d6a3b247SMauro Carvalho Chehab============ 8d6a3b247SMauro Carvalho Chehab 9d6a3b247SMauro Carvalho ChehabCFS stands for "Completely Fair Scheduler," and is the new "desktop" process 10d6a3b247SMauro Carvalho Chehabscheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the 11d6a3b247SMauro Carvalho Chehabreplacement for the previous vanilla scheduler's SCHED_OTHER interactivity 12d6a3b247SMauro Carvalho Chehabcode. 13d6a3b247SMauro Carvalho Chehab 14d6a3b247SMauro Carvalho Chehab80% of CFS's design can be summed up in a single sentence: CFS basically models 15d6a3b247SMauro Carvalho Chehaban "ideal, precise multi-tasking CPU" on real hardware. 16d6a3b247SMauro Carvalho Chehab 17d6a3b247SMauro Carvalho Chehab"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical 18d6a3b247SMauro Carvalho Chehabpower and which can run each task at precise equal speed, in parallel, each at 19d6a3b247SMauro Carvalho Chehab1/nr_running speed. For example: if there are 2 tasks running, then it runs 20d6a3b247SMauro Carvalho Chehabeach at 50% physical power --- i.e., actually in parallel. 21d6a3b247SMauro Carvalho Chehab 22d6a3b247SMauro Carvalho ChehabOn real hardware, we can run only a single task at once, so we have to 23d6a3b247SMauro Carvalho Chehabintroduce the concept of "virtual runtime." The virtual runtime of a task 24d6a3b247SMauro Carvalho Chehabspecifies when its next timeslice would start execution on the ideal 25d6a3b247SMauro Carvalho Chehabmulti-tasking CPU described above. In practice, the virtual runtime of a task 26d6a3b247SMauro Carvalho Chehabis its actual runtime normalized to the total number of running tasks. 27d6a3b247SMauro Carvalho Chehab 28d6a3b247SMauro Carvalho Chehab 29d6a3b247SMauro Carvalho Chehab 30d6a3b247SMauro Carvalho Chehab2. FEW IMPLEMENTATION DETAILS 31d6a3b247SMauro Carvalho Chehab============================== 32d6a3b247SMauro Carvalho Chehab 33d6a3b247SMauro Carvalho ChehabIn CFS the virtual runtime is expressed and tracked via the per-task 34d6a3b247SMauro Carvalho Chehabp->se.vruntime (nanosec-unit) value. This way, it's possible to accurately 35d6a3b247SMauro Carvalho Chehabtimestamp and measure the "expected CPU time" a task should have gotten. 36d6a3b247SMauro Carvalho Chehab 37f1779d13SKir Kolyshkin Small detail: on "ideal" hardware, at any time all tasks would have the same 38d6a3b247SMauro Carvalho Chehab p->se.vruntime value --- i.e., tasks would execute simultaneously and no task 39f1779d13SKir Kolyshkin would ever get "out of balance" from the "ideal" share of CPU time. 40d6a3b247SMauro Carvalho Chehab 41d6a3b247SMauro Carvalho ChehabCFS's task picking logic is based on this p->se.vruntime value and it is thus 42d6a3b247SMauro Carvalho Chehabvery simple: it always tries to run the task with the smallest p->se.vruntime 43d6a3b247SMauro Carvalho Chehabvalue (i.e., the task which executed least so far). CFS always tries to split 44d6a3b247SMauro Carvalho Chehabup CPU time between runnable tasks as close to "ideal multitasking hardware" as 45d6a3b247SMauro Carvalho Chehabpossible. 46d6a3b247SMauro Carvalho Chehab 47d6a3b247SMauro Carvalho ChehabMost of the rest of CFS's design just falls out of this really simple concept, 48d6a3b247SMauro Carvalho Chehabwith a few add-on embellishments like nice levels, multiprocessing and various 49d6a3b247SMauro Carvalho Chehabalgorithm variants to recognize sleepers. 50d6a3b247SMauro Carvalho Chehab 51d6a3b247SMauro Carvalho Chehab 52d6a3b247SMauro Carvalho Chehab 53d6a3b247SMauro Carvalho Chehab3. THE RBTREE 54d6a3b247SMauro Carvalho Chehab============== 55d6a3b247SMauro Carvalho Chehab 56d6a3b247SMauro Carvalho ChehabCFS's design is quite radical: it does not use the old data structures for the 57d6a3b247SMauro Carvalho Chehabrunqueues, but it uses a time-ordered rbtree to build a "timeline" of future 58d6a3b247SMauro Carvalho Chehabtask execution, and thus has no "array switch" artifacts (by which both the 59d6a3b247SMauro Carvalho Chehabprevious vanilla scheduler and RSDL/SD are affected). 60d6a3b247SMauro Carvalho Chehab 61d6a3b247SMauro Carvalho ChehabCFS also maintains the rq->cfs.min_vruntime value, which is a monotonic 62d6a3b247SMauro Carvalho Chehabincreasing value tracking the smallest vruntime among all tasks in the 63d6a3b247SMauro Carvalho Chehabrunqueue. The total amount of work done by the system is tracked using 64d6a3b247SMauro Carvalho Chehabmin_vruntime; that value is used to place newly activated entities on the left 65d6a3b247SMauro Carvalho Chehabside of the tree as much as possible. 66d6a3b247SMauro Carvalho Chehab 67d6a3b247SMauro Carvalho ChehabThe total number of running tasks in the runqueue is accounted through the 68d6a3b247SMauro Carvalho Chehabrq->cfs.load value, which is the sum of the weights of the tasks queued on the 69d6a3b247SMauro Carvalho Chehabrunqueue. 70d6a3b247SMauro Carvalho Chehab 71d6a3b247SMauro Carvalho ChehabCFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the 72d6a3b247SMauro Carvalho Chehabp->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it. 73d6a3b247SMauro Carvalho ChehabAs the system progresses forwards, the executed tasks are put into the tree 74d6a3b247SMauro Carvalho Chehabmore and more to the right --- slowly but surely giving a chance for every task 75d6a3b247SMauro Carvalho Chehabto become the "leftmost task" and thus get on the CPU within a deterministic 76d6a3b247SMauro Carvalho Chehabamount of time. 77d6a3b247SMauro Carvalho Chehab 78d6a3b247SMauro Carvalho ChehabSumming up, CFS works like this: it runs a task a bit, and when the task 79d6a3b247SMauro Carvalho Chehabschedules (or a scheduler tick happens) the task's CPU usage is "accounted 80d6a3b247SMauro Carvalho Chehabfor": the (small) time it just spent using the physical CPU is added to 81d6a3b247SMauro Carvalho Chehabp->se.vruntime. Once p->se.vruntime gets high enough so that another task 82d6a3b247SMauro Carvalho Chehabbecomes the "leftmost task" of the time-ordered rbtree it maintains (plus a 83d6a3b247SMauro Carvalho Chehabsmall amount of "granularity" distance relative to the leftmost task so that we 84d6a3b247SMauro Carvalho Chehabdo not over-schedule tasks and trash the cache), then the new leftmost task is 85d6a3b247SMauro Carvalho Chehabpicked and the current task is preempted. 86d6a3b247SMauro Carvalho Chehab 87d6a3b247SMauro Carvalho Chehab 88d6a3b247SMauro Carvalho Chehab 89d6a3b247SMauro Carvalho Chehab4. SOME FEATURES OF CFS 90d6a3b247SMauro Carvalho Chehab======================== 91d6a3b247SMauro Carvalho Chehab 92d6a3b247SMauro Carvalho ChehabCFS uses nanosecond granularity accounting and does not rely on any jiffies or 93d6a3b247SMauro Carvalho Chehabother HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the 94d6a3b247SMauro Carvalho Chehabway the previous scheduler had, and has no heuristics whatsoever. There is 95d6a3b247SMauro Carvalho Chehabonly one central tunable (you have to switch on CONFIG_SCHED_DEBUG): 96d6a3b247SMauro Carvalho Chehab 97*2f88c8e8SShrikanth Hegde /sys/kernel/debug/sched/base_slice_ns 98d6a3b247SMauro Carvalho Chehab 99d6a3b247SMauro Carvalho Chehabwhich can be used to tune the scheduler from "desktop" (i.e., low latencies) to 100d6a3b247SMauro Carvalho Chehab"server" (i.e., good batching) workloads. It defaults to a setting suitable 101d6a3b247SMauro Carvalho Chehabfor desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too. 102d6a3b247SMauro Carvalho Chehab 103d6a3b247SMauro Carvalho ChehabDue to its design, the CFS scheduler is not prone to any of the "attacks" that 104d6a3b247SMauro Carvalho Chehabexist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, 105d6a3b247SMauro Carvalho Chehabchew.c, ring-test.c, massive_intr.c all work fine and do not impact 106d6a3b247SMauro Carvalho Chehabinteractivity and produce the expected behavior. 107d6a3b247SMauro Carvalho Chehab 108d6a3b247SMauro Carvalho ChehabThe CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH 109d6a3b247SMauro Carvalho Chehabthan the previous vanilla scheduler: both types of workloads are isolated much 110d6a3b247SMauro Carvalho Chehabmore aggressively. 111d6a3b247SMauro Carvalho Chehab 112d6a3b247SMauro Carvalho ChehabSMP load-balancing has been reworked/sanitized: the runqueue-walking 113d6a3b247SMauro Carvalho Chehabassumptions are gone from the load-balancing code now, and iterators of the 114d6a3b247SMauro Carvalho Chehabscheduling modules are used. The balancing code got quite a bit simpler as a 115d6a3b247SMauro Carvalho Chehabresult. 116d6a3b247SMauro Carvalho Chehab 117d6a3b247SMauro Carvalho Chehab 118d6a3b247SMauro Carvalho Chehab 119d6a3b247SMauro Carvalho Chehab5. Scheduling policies 120d6a3b247SMauro Carvalho Chehab====================== 121d6a3b247SMauro Carvalho Chehab 122d6a3b247SMauro Carvalho ChehabCFS implements three scheduling policies: 123d6a3b247SMauro Carvalho Chehab 124d6a3b247SMauro Carvalho Chehab - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling 125d6a3b247SMauro Carvalho Chehab policy that is used for regular tasks. 126d6a3b247SMauro Carvalho Chehab 127d6a3b247SMauro Carvalho Chehab - SCHED_BATCH: Does not preempt nearly as often as regular tasks 128d6a3b247SMauro Carvalho Chehab would, thereby allowing tasks to run longer and make better use of 129d6a3b247SMauro Carvalho Chehab caches but at the cost of interactivity. This is well suited for 130d6a3b247SMauro Carvalho Chehab batch jobs. 131d6a3b247SMauro Carvalho Chehab 132d6a3b247SMauro Carvalho Chehab - SCHED_IDLE: This is even weaker than nice 19, but its not a true 133d6a3b247SMauro Carvalho Chehab idle timer scheduler in order to avoid to get into priority 134d6a3b247SMauro Carvalho Chehab inversion problems which would deadlock the machine. 135d6a3b247SMauro Carvalho Chehab 136d6a3b247SMauro Carvalho ChehabSCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by 137d6a3b247SMauro Carvalho ChehabPOSIX. 138d6a3b247SMauro Carvalho Chehab 139d6a3b247SMauro Carvalho ChehabThe command chrt from util-linux-ng 2.13.1.1 can set all of these except 140d6a3b247SMauro Carvalho ChehabSCHED_IDLE. 141d6a3b247SMauro Carvalho Chehab 142d6a3b247SMauro Carvalho Chehab 143d6a3b247SMauro Carvalho Chehab 144d6a3b247SMauro Carvalho Chehab6. SCHEDULING CLASSES 145d6a3b247SMauro Carvalho Chehab====================== 146d6a3b247SMauro Carvalho Chehab 147d6a3b247SMauro Carvalho ChehabThe new CFS scheduler has been designed in such a way to introduce "Scheduling 148d6a3b247SMauro Carvalho ChehabClasses," an extensible hierarchy of scheduler modules. These modules 149d6a3b247SMauro Carvalho Chehabencapsulate scheduling policy details and are handled by the scheduler core 150d6a3b247SMauro Carvalho Chehabwithout the core code assuming too much about them. 151d6a3b247SMauro Carvalho Chehab 152d6a3b247SMauro Carvalho Chehabsched/fair.c implements the CFS scheduler described above. 153d6a3b247SMauro Carvalho Chehab 154d6a3b247SMauro Carvalho Chehabsched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than 155d6a3b247SMauro Carvalho Chehabthe previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT 156d6a3b247SMauro Carvalho Chehabpriority levels, instead of 140 in the previous scheduler) and it needs no 157d6a3b247SMauro Carvalho Chehabexpired array. 158d6a3b247SMauro Carvalho Chehab 159d6a3b247SMauro Carvalho ChehabScheduling classes are implemented through the sched_class structure, which 160d6a3b247SMauro Carvalho Chehabcontains hooks to functions that must be called whenever an interesting event 161d6a3b247SMauro Carvalho Chehaboccurs. 162d6a3b247SMauro Carvalho Chehab 163d6a3b247SMauro Carvalho ChehabThis is the (partial) list of the hooks: 164d6a3b247SMauro Carvalho Chehab 165d6a3b247SMauro Carvalho Chehab - enqueue_task(...) 166d6a3b247SMauro Carvalho Chehab 167d6a3b247SMauro Carvalho Chehab Called when a task enters a runnable state. 168d6a3b247SMauro Carvalho Chehab It puts the scheduling entity (task) into the red-black tree and 169d6a3b247SMauro Carvalho Chehab increments the nr_running variable. 170d6a3b247SMauro Carvalho Chehab 171d6a3b247SMauro Carvalho Chehab - dequeue_task(...) 172d6a3b247SMauro Carvalho Chehab 173d6a3b247SMauro Carvalho Chehab When a task is no longer runnable, this function is called to keep the 174d6a3b247SMauro Carvalho Chehab corresponding scheduling entity out of the red-black tree. It decrements 175d6a3b247SMauro Carvalho Chehab the nr_running variable. 176d6a3b247SMauro Carvalho Chehab 177d6a3b247SMauro Carvalho Chehab - yield_task(...) 178d6a3b247SMauro Carvalho Chehab 179d6a3b247SMauro Carvalho Chehab This function is basically just a dequeue followed by an enqueue, unless the 180d6a3b247SMauro Carvalho Chehab compat_yield sysctl is turned on; in that case, it places the scheduling 181d6a3b247SMauro Carvalho Chehab entity at the right-most end of the red-black tree. 182d6a3b247SMauro Carvalho Chehab 183d6a3b247SMauro Carvalho Chehab - check_preempt_curr(...) 184d6a3b247SMauro Carvalho Chehab 185d6a3b247SMauro Carvalho Chehab This function checks if a task that entered the runnable state should 186d6a3b247SMauro Carvalho Chehab preempt the currently running task. 187d6a3b247SMauro Carvalho Chehab 188d6a3b247SMauro Carvalho Chehab - pick_next_task(...) 189d6a3b247SMauro Carvalho Chehab 190d6a3b247SMauro Carvalho Chehab This function chooses the most appropriate task eligible to run next. 191d6a3b247SMauro Carvalho Chehab 192d6a3b247SMauro Carvalho Chehab - set_curr_task(...) 193d6a3b247SMauro Carvalho Chehab 194d6a3b247SMauro Carvalho Chehab This function is called when a task changes its scheduling class or changes 195d6a3b247SMauro Carvalho Chehab its task group. 196d6a3b247SMauro Carvalho Chehab 197d6a3b247SMauro Carvalho Chehab - task_tick(...) 198d6a3b247SMauro Carvalho Chehab 199d6a3b247SMauro Carvalho Chehab This function is mostly called from time tick functions; it might lead to 200d6a3b247SMauro Carvalho Chehab process switch. This drives the running preemption. 201d6a3b247SMauro Carvalho Chehab 202d6a3b247SMauro Carvalho Chehab 203d6a3b247SMauro Carvalho Chehab 204d6a3b247SMauro Carvalho Chehab 205d6a3b247SMauro Carvalho Chehab7. GROUP SCHEDULER EXTENSIONS TO CFS 206d6a3b247SMauro Carvalho Chehab===================================== 207d6a3b247SMauro Carvalho Chehab 208d6a3b247SMauro Carvalho ChehabNormally, the scheduler operates on individual tasks and strives to provide 209d6a3b247SMauro Carvalho Chehabfair CPU time to each task. Sometimes, it may be desirable to group tasks and 210d6a3b247SMauro Carvalho Chehabprovide fair CPU time to each such task group. For example, it may be 211d6a3b247SMauro Carvalho Chehabdesirable to first provide fair CPU time to each user on the system and then to 212d6a3b247SMauro Carvalho Chehabeach task belonging to a user. 213d6a3b247SMauro Carvalho Chehab 214d6a3b247SMauro Carvalho ChehabCONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks to be 215d6a3b247SMauro Carvalho Chehabgrouped and divides CPU time fairly among such groups. 216d6a3b247SMauro Carvalho Chehab 217d6a3b247SMauro Carvalho ChehabCONFIG_RT_GROUP_SCHED permits to group real-time (i.e., SCHED_FIFO and 218d6a3b247SMauro Carvalho ChehabSCHED_RR) tasks. 219d6a3b247SMauro Carvalho Chehab 220d6a3b247SMauro Carvalho ChehabCONFIG_FAIR_GROUP_SCHED permits to group CFS (i.e., SCHED_NORMAL and 221d6a3b247SMauro Carvalho ChehabSCHED_BATCH) tasks. 222d6a3b247SMauro Carvalho Chehab 223d6a3b247SMauro Carvalho Chehab These options need CONFIG_CGROUPS to be defined, and let the administrator 224d6a3b247SMauro Carvalho Chehab create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See 225da82c92fSMauro Carvalho Chehab Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about this filesystem. 226d6a3b247SMauro Carvalho Chehab 227d6a3b247SMauro Carvalho ChehabWhen CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each 228d6a3b247SMauro Carvalho Chehabgroup created using the pseudo filesystem. See example steps below to create 229d6a3b247SMauro Carvalho Chehabtask groups and modify their CPU share using the "cgroups" pseudo filesystem:: 230d6a3b247SMauro Carvalho Chehab 231d6a3b247SMauro Carvalho Chehab # mount -t tmpfs cgroup_root /sys/fs/cgroup 232d6a3b247SMauro Carvalho Chehab # mkdir /sys/fs/cgroup/cpu 233d6a3b247SMauro Carvalho Chehab # mount -t cgroup -ocpu none /sys/fs/cgroup/cpu 234d6a3b247SMauro Carvalho Chehab # cd /sys/fs/cgroup/cpu 235d6a3b247SMauro Carvalho Chehab 236d6a3b247SMauro Carvalho Chehab # mkdir multimedia # create "multimedia" group of tasks 237d6a3b247SMauro Carvalho Chehab # mkdir browser # create "browser" group of tasks 238d6a3b247SMauro Carvalho Chehab 239d6a3b247SMauro Carvalho Chehab # #Configure the multimedia group to receive twice the CPU bandwidth 240d6a3b247SMauro Carvalho Chehab # #that of browser group 241d6a3b247SMauro Carvalho Chehab 242d6a3b247SMauro Carvalho Chehab # echo 2048 > multimedia/cpu.shares 243d6a3b247SMauro Carvalho Chehab # echo 1024 > browser/cpu.shares 244d6a3b247SMauro Carvalho Chehab 245d6a3b247SMauro Carvalho Chehab # firefox & # Launch firefox and move it to "browser" group 246d6a3b247SMauro Carvalho Chehab # echo <firefox_pid> > browser/tasks 247d6a3b247SMauro Carvalho Chehab 248d6a3b247SMauro Carvalho Chehab # #Launch gmplayer (or your favourite movie player) 249d6a3b247SMauro Carvalho Chehab # echo <movie_player_pid> > multimedia/tasks 250