.. SPDX-License-Identifier: GPL-2.0

===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security use cases (one
group of tasks doesn't trust another) or for performance use cases (some
workloads may benefit from running on the same core because they don't need the
same hardware resources of the shared core, or may prefer different cores if
they do share hardware resource needs). This document only describes the
security use case.

Security use case
-----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks. The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be turned on safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance. An improvement is not guaranteed, though it has
been seen with a number of real world workloads. In theory, core scheduling
aims to perform at least as well as when Hyper Threading is disabled. In
practice, this is mostly but not always the case, because synchronizing
scheduling decisions across 2 or more CPUs in a core involves additional
overhead, especially when the system is lightly loaded. When
``total_threads <= N_CPUS/2``, where ``N_CPUS`` is the total number of CPUs, the
extra overhead may cause core scheduling to perform worse than with SMT
disabled. Always measure the performance of your workloads.

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

    #include <sys/prctl.h>

    int prctl(int option, unsigned long arg2, unsigned long arg3,
            unsigned long arg4, unsigned long arg5);

option:
    ``PR_SCHED_CORE``

arg2:
    Command for operation, must be one of:

    - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
    - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
    - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
    - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
    ``pid`` of the task to which the operation applies.

arg4:
    ``pid_type`` to which the operation applies. It is one of the
    ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
    is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
    will be performed for all tasks in the thread group of ``pid``.

arg5:
    userspace pointer to an unsigned long long for storing the cookie returned
    by the ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from, another
process, it is required to have the ptrace access mode
``PTRACE_MODE_READ_REALCREDS`` to that process.
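
As a minimal sketch of the interface above, the C helpers below create a cookie
for a whole process and read a task's cookie back. The helper names
``core_sched_create`` and ``core_sched_get`` are hypothetical. The fallback
constant values are assumed to match ``include/uapi/linux/prctl.h``. Passing a
``pid`` of 0 to mean the calling task, and using thread scope for
``PR_SCHED_CORE_GET``, are assumptions based on the current kernel
implementation rather than on this document:

```c
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for userspace headers that predate core scheduling.
 * Assumption: the values match include/uapi/linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE            62
#define PR_SCHED_CORE_GET        0
#define PR_SCHED_CORE_CREATE     1
#define PR_SCHED_CORE_SHARE_TO   2
#define PR_SCHED_CORE_SHARE_FROM 3
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD        0
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP  1
#define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2
#endif

/* Hypothetical helper: give every thread of process `pid` a new, unique
 * cookie. Assumption: pid 0 refers to the calling task.
 * Returns 0 on success, -1 with errno set on failure. */
int core_sched_create(pid_t pid)
{
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_CREATE,
                 (unsigned long)pid,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0UL);
}

/* Hypothetical helper: store the cookie of the single task `pid` in *cookie.
 * arg5 is the userspace pointer described above; it is 0 for all other
 * commands. Returns 0 on success, -1 with errno set on failure. */
int core_sched_get(pid_t pid, unsigned long long *cookie)
{
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_GET,
                 (unsigned long)pid,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD,
                 (unsigned long)cookie);
}
```

Both helpers return 0 on success and -1 with ``errno`` set otherwise, so a
caller can fall back gracefully when the running kernel does not support the
``PR_SCHED_CORE`` prctl.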

Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie and thus a core is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs. Setting a cookie for the 'initial'
script/executable/daemon therefore places every spawned child in the same
core-sched group.
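
The inheritance described above can be checked with a short C sketch: the
parent creates a cookie for itself, forks, and compares its own cookie with the
one the child reports over a pipe. The function name
``child_inherits_cookie`` is hypothetical, and the fallback constant values and
the use of ``pid`` 0 for the calling task are assumptions, not something this
document specifies:

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers; values assumed to match linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE        62
#define PR_SCHED_CORE_GET    0
#define PR_SCHED_CORE_CREATE 1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD 0
#endif

/* Read the calling task's cookie (assumption: pid 0 means "myself"). */
static int get_own_cookie(unsigned long long *cookie)
{
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_GET, 0UL,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD,
                 (unsigned long)cookie);
}

/* Returns 1 if a forked child reports the parent's cookie, 0 if it does not,
 * and -1 if the check could not be performed (for example, no kernel
 * support). */
int child_inherits_cookie(void)
{
    unsigned long long parent = 0, child = 0;
    int fds[2];
    ssize_t n;
    pid_t pid;

    /* Tag the calling task, then remember the cookie it was given. */
    if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_CREATE, 0UL,
              (unsigned long)PR_SCHED_CORE_SCOPE_THREAD, 0UL) != 0 ||
        get_own_cookie(&parent) != 0 || pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: report the cookie it sees for itself, then exit. */
        if (get_own_cookie(&child) != 0 ||
            write(fds[1], &child, sizeof(child)) != (ssize_t)sizeof(child))
            _exit(1);
        _exit(0);
    }

    close(fds[1]);
    n = read(fds[0], &child, sizeof(child));
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == (ssize_t)sizeof(child) && child == parent;
}
```

A launcher for a whole service would follow the same pattern but call an
``exec`` function instead of reporting the cookie, relying on the cookie
surviving the exec.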

Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current and other tasks is possible using
``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO`` to inherit a cookie
from a specified task or to share a cookie with a task. In combination, this
allows a simple helper program to pull a cookie from a task in an existing core
scheduling group and share it with already running tasks.
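
A helper of the kind described above might look like the following C sketch.
The function name ``core_sched_copy_cookie`` is hypothetical, the fallback
constant values are assumed to match ``include/uapi/linux/prctl.h``, and the
choice of thread scope for the pull and thread-group scope for the push is an
assumption about what a typical helper would want:

```c
#include <errno.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older headers; values assumed to match linux/prctl.h. */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE            62
#define PR_SCHED_CORE_SHARE_TO   2
#define PR_SCHED_CORE_SHARE_FROM 3
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD       0
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

/* Hypothetical helper: place process `dst` in the core scheduling group of
 * task `src`. The cookie travels through the calling task, so the caller's
 * own cookie is replaced as a side effect.
 * Returns 0 on success, -1 with errno set on failure. */
int core_sched_copy_cookie(pid_t src, pid_t dst)
{
    /* Step 1: pull the cookie of `src` into the calling task. */
    if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_SHARE_FROM,
              (unsigned long)src,
              (unsigned long)PR_SCHED_CORE_SCOPE_THREAD, 0UL) != 0)
        return -1;

    /* Step 2: push the caller's (new) cookie to every thread of `dst`. */
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_SHARE_TO,
                 (unsigned long)dst,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0UL);
}
```

Both steps are subject to the ptrace access check described in the Usage
section, so the helper is expected to fail with ``errno`` set when the caller
lacks access to either task.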

Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For the rest of the siblings in the core, the
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with the same cookie is not available,
the idle task is selected. The idle task is globally trusted.

Once a task has been selected for all the siblings in the core, an IPI is sent to
siblings for whom a new task was selected. Siblings on receiving the IPI will
switch to the new task immediately. If an idle task is selected for a sibling,
then the sibling is considered to be in a `forced idle` state. I.e., it may
have tasks on its own runqueue to run, but it will still have to run idle.
More on this in the next section.
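
The selection steps above can be illustrated with a simplified userspace model
in C. This is a sketch under stated assumptions, not the kernel's
implementation: the ``struct task`` and ``struct runqueue`` types and the
``core_pick`` function are hypothetical, priority is reduced to a single
integer where a larger value wins, and IPIs are not modelled at all:

```c
#include <stddef.h>

struct task {
    int prio;                  /* assumption: larger value = higher priority */
    unsigned long cookie;      /* tasks with equal cookies trust each other */
};

struct runqueue {
    const struct task *tasks;  /* runnable tasks of one sibling */
    size_t nr;                 /* number of entries in tasks[] */
};

/* Highest priority task on rq. If match is nonzero, only tasks carrying
 * `cookie` are considered. Returns NULL if no task qualifies. */
static const struct task *best_task(const struct runqueue *rq, int match,
                                    unsigned long cookie)
{
    const struct task *best = NULL;

    for (size_t i = 0; i < rq->nr; i++) {
        const struct task *t = &rq->tasks[i];

        if (match && t->cookie != cookie)
            continue;
        if (!best || t->prio > best->prio)
            best = t;
    }
    return best;
}

/* Core-wide pick: picks[i] receives the task chosen for sibling i, or NULL
 * when the sibling must run idle. Returns the number of forced idle
 * siblings, i.e. siblings that idle despite having runnable tasks. */
int core_pick(const struct runqueue *rqs, size_t nr_siblings,
              const struct task **picks)
{
    const struct task *max = NULL;
    int forced_idle = 0;

    /* Step 1: find the highest priority task across the whole core. */
    for (size_t i = 0; i < nr_siblings; i++) {
        const struct task *t = best_task(&rqs[i], 0, 0);

        if (t && (!max || t->prio > max->prio))
            max = t;
    }

    /* Step 2: every sibling runs its best task with that cookie, else idle. */
    for (size_t i = 0; i < nr_siblings; i++) {
        picks[i] = max ? best_task(&rqs[i], 1, max->cookie) : NULL;
        if (!picks[i] && rqs[i].nr > 0)
            forced_idle++;
    }
    return forced_idle;
}
```

For example, with three siblings whose runqueues hold (priority, cookie) pairs
``{(10, 1), (5, 2)}``, ``{(7, 2), (3, 1)}`` and ``{(9, 2)}``, the core-wide
winner carries cookie 1. The second sibling therefore runs its lower priority
cookie-1 task, and the third sibling is forced idle even though it has a
runnable task.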

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core. However,
it is possible that some runqueues have tasks that are incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if their own highest
priority task is not trusted with respect to the core-wide highest priority
task. If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (the idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

          HT1 (attack)            HT2 (victim)
   A      idle -> user space      user space -> idle
   B      idle -> user space      guest -> idle
   C      idle -> guest           user space -> idle
   D      idle -> guest           guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or cause a
VMEXIT in the case of guests. At worst, this would only leak some scheduler
metadata which may not be worth protecting. It is also possible that the IPI is
received too late on some architectures, but this has not been observed in the
case of x86.

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above mentioned interfaces to
communicate them. In other words, all tasks have a default cookie value of 0
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.

Once userspace uses the above mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on a
core. But there could be a small window of time during which untrusted tasks run
concurrently, or the kernel could be running concurrently with a task not
trusted by the kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. An IPI is used to
notify the siblings to switch to the new task. But there could be hardware
delays in receiving the IPI on some architectures (on x86, this has not been
observed). This may cause an attacker task to start running on a CPU before its
siblings receive the IPI. Even though the cache is flushed on entry to user
mode, victim tasks on siblings may populate data in the cache and
microarchitectural buffers after the attacker starts to run, and this could
leak data.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between the siblings
running in user mode and the others running in kernel mode. Even though all
siblings run tasks which trust each other, when the kernel is executing
code on behalf of a task, it cannot trust the code running in the
sibling. Such attacks are possible for any combination of sibling CPU modes
(host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest vCPUs are configured to not trust each
other (by tagging them separately), then the guest to guest attacks would go
away. Alternatively, a system admin policy could consider guest to guest
attacks a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system not trust every other untrusted task. While this could reduce
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been posted
to LKML to solve this, but it is debatable whether such windows are practically
exploitable, and whether the performance overhead of the prototypes is worth
it (not to mention the added code complexity).

Other Use cases
---------------
The main use case for core scheduling is mitigating the cross-HT vulnerabilities
with SMT enabled. There are other use cases where this feature could be used:

- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
  that use SIMD instructions, etc.
- Gang scheduling: Requirements for a group of tasks that need to be scheduled
  together could also be realized using core scheduling. One example is vCPUs of
  a VM.