.. SPDX-License-Identifier: GPL-2.0

===============
Core Scheduling
===============
Core scheduling support allows userspace to define groups of tasks that can
share a core. These groups can be specified either for security use cases (one
group of tasks doesn't trust another), or for performance use cases (some
workloads may benefit from running on the same core as they don't need the same
hardware resources of the shared core, or may prefer different cores if they
do share hardware resource needs). This document only describes the security
use case.

Security use case
-----------------
A cross-HT attack involves the attacker and victim running on different Hyper
Threads of the same core. MDS and L1TF are examples of such attacks. The only
full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core
scheduling is a scheduler feature that can mitigate some (not all) cross-HT
attacks. It allows HT to be turned on safely by ensuring that only tasks in a
user-designated trusted group can share a core. This increase in core sharing
can also improve performance. An improvement is not guaranteed, though it has
been observed with a number of real-world workloads. In theory, core scheduling
aims to perform at least as well as when Hyper Threading is disabled. In
practice, this is mostly the case, though not always, because synchronizing
scheduling decisions across 2 or more CPUs in a core involves additional
overhead, especially when the system is lightly loaded. When
``total_threads <= N_CPUS/2``, where ``N_CPUS`` is the total number of CPUs,
the extra overhead may cause core scheduling to perform worse than with SMT
disabled. Always measure the performance of your workloads.
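
The lightly loaded condition above can be written down as a small check. This
is only a sketch: the helper names are hypothetical, and it assumes the caller
already knows how many threads the workload keeps runnable.

```c
#include <stdbool.h>
#include <unistd.h>

/* True in the regime where the text above says core scheduling may perform
 * worse than SMT-disabled: total_threads <= N_CPUS / 2. */
bool lightly_loaded(long total_threads, long n_cpus)
{
    return total_threads <= n_cpus / 2;
}

/* Number of CPUs currently online, or -1 on error. */
long online_cpus(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

For example, ``lightly_loaded(4, online_cpus())`` on an 8-CPU machine flags a
workload that deserves a comparison run with SMT disabled.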

Usage
-----
Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option.
Using this feature, userspace defines groups of tasks that can be co-scheduled
on the same core. The core scheduler uses this information to make sure that
tasks that are not in the same group never run simultaneously on a core, while
doing its best to satisfy the system's scheduling requirements.

Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface.
This interface provides support for the creation of core scheduling groups, as
well as admission and removal of tasks from created groups::

    #include <sys/prctl.h>

    int prctl(int option, unsigned long arg2, unsigned long arg3,
            unsigned long arg4, unsigned long arg5);

option:
    ``PR_SCHED_CORE``

arg2:
    Command for operation, must be one of:

    - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``.
    - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``.
    - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``.
    - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``.

arg3:
    ``pid`` of the task to which the operation applies.

arg4:
    ``pid_type`` to which the operation applies. It is one of the
    ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants.  For example, if arg4
    is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
    will be performed for all tasks in the task group of ``pid``.

arg5:
    userspace pointer to an unsigned long long for storing the cookie returned
    by the ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.

In order for a process to push a cookie to, or pull a cookie from, another
process, it is required to have the ptrace access mode
``PTRACE_MODE_READ_REALCREDS`` to that process.
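
As a minimal sketch of the interface above, the following C helpers create a
new cookie for the calling thread and read it back. The function names are
hypothetical. The fallback ``#define`` values mirror
``include/uapi/linux/prctl.h`` for toolchains whose headers predate the
feature, and passing a ``pid`` of 0 to mean the calling task is an assumption
taken from the kernel implementation rather than from this document.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (values from the kernel UAPI). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE        62
#define PR_SCHED_CORE_GET    0
#define PR_SCHED_CORE_CREATE 1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD 0
#endif

/* Assign a new unique cookie to the calling thread.
 * Returns 0 on success, -1 with errno set otherwise. */
int core_sched_create_self(void)
{
    /* arg3 = 0: assumed to select the caller; arg5 must be 0 here. */
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_CREATE, 0UL,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD, 0UL);
}

/* Store the calling thread's cookie in *cookie.
 * Returns 0 on success, -1 with errno set otherwise. */
int core_sched_get_self(unsigned long long *cookie)
{
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_GET, 0UL,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD,
                 (unsigned long)cookie);
}
```

Both helpers fail with -1 on kernels built without ``CONFIG_SCHED_CORE``, so
callers should check the return value before relying on the grouping.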

Building hierarchies of tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The simplest way to build hierarchies of threads/processes which share a
cookie and thus a core is to rely on the fact that the core-sched cookie is
inherited across forks/clones and execs. Setting a cookie for the
'initial' script/executable/daemon therefore places every spawned child in the
same core-sched group.
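
One way to use this inheritance is a small launcher that tags a child process
and then executes the real program in it. The sketch below is illustrative
only: ``run_in_new_group()`` and its exit code convention (126 when tagging
fails, 127 when exec fails) are hypothetical, and a ``pid`` of 0 is assumed to
select the calling task.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (values from the kernel UAPI). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE        62
#define PR_SCHED_CORE_CREATE 1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD_GROUP
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

/* Run argv[0] in a fresh core scheduling group and wait for it.
 * Returns the child's exit status, or -1 on fork/wait failure or if the
 * child was killed by a signal. */
int run_in_new_group(char *const argv[])
{
    pid_t child = fork();
    int status;

    if (child < 0)
        return -1;
    if (child == 0) {
        /* Tag before exec: the cookie is kept across the exec and is
         * inherited by whatever the new program forks or clones. */
        if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_CREATE, 0UL,
                  (unsigned long)PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0UL) != 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);
    }
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Tagging in the child rather than in the launcher keeps the launcher itself
outside the new group.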

Cookie Transferral
~~~~~~~~~~~~~~~~~~
Transferring a cookie between the current and other tasks is possible using
``PR_SCHED_CORE_SHARE_FROM`` and ``PR_SCHED_CORE_SHARE_TO``, to inherit a
cookie from a specified task or to share a cookie with a task, respectively.
In combination, this allows a simple helper program to pull a cookie from a
task in an existing core scheduling group and share it with already running
tasks.
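
The helper program described above could look roughly like the following
sketch. ``core_sched_copy()`` is a hypothetical name, and the choice of scopes
(pull onto the calling thread only, push to the whole thread group of the
target) is an assumption made for the example, not a requirement stated here.

```c
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers (values from the kernel UAPI). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE            62
#define PR_SCHED_CORE_SHARE_TO   2
#define PR_SCHED_CORE_SHARE_FROM 3
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD       0
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

/* Copy the cookie of task `from` to all threads of task `to`, using the
 * calling thread as the intermediary.
 * Returns 0 on success, -1 with errno set otherwise. */
int core_sched_copy(pid_t from, pid_t to)
{
    /* Pull: the calling thread adopts the cookie of `from`. */
    if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_SHARE_FROM,
              (unsigned long)from,
              (unsigned long)PR_SCHED_CORE_SCOPE_THREAD, 0UL) != 0)
        return -1;

    /* Push: hand the cookie the caller now holds to `to`. */
    return prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_SHARE_TO,
                 (unsigned long)to,
                 (unsigned long)PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0UL);
}
```

Note that the helper joins the group as a side effect of the pull, and that
both steps are subject to the ptrace access check described in `Usage`_.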

Design/Implementation
---------------------
Each task that is tagged is assigned a cookie internally in the kernel. As
mentioned in `Usage`_, tasks with the same cookie value are assumed to trust
each other and share a core.

The basic idea is that every schedule event tries to select tasks for all the
siblings of a core such that all the selected tasks running on a core are
trusted (same cookie) at any point in time. Kernel threads are assumed trusted.
The idle task is considered special, as it trusts everything and everything
trusts it.

During a schedule() event on any sibling of a core, the highest priority task on
the sibling's core is picked and assigned to the sibling calling schedule(), if
the sibling has the task enqueued. For the rest of the siblings in the core, the
highest priority task with the same cookie is selected if there is one runnable
in their individual run queues. If a task with the same cookie is not available,
the idle task is selected. The idle task is globally trusted.

Once a task has been selected for all the siblings in the core, an IPI is sent to
siblings for whom a new task was selected. Siblings on receiving the IPI will
switch to the new task immediately. If the idle task is selected for a sibling
that still has runnable tasks, the sibling is considered to be in a
`forced idle` state: it has tasks on its own runqueue, but it still has to run
idle. More on this in the next section.
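
The selection rule can be illustrated with a toy model in C. This is not the
kernel's implementation: all names and types are made up for the example,
priorities are plain integers where a larger value wins, locking and IPIs are
left out, and cookie-0 tasks get no special treatment.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TASKS 8

struct toy_task {
    int prio;             /* toy convention: larger value = higher priority */
    unsigned long cookie; /* 0 = untagged */
};

struct toy_rq {
    struct toy_task tasks[TOY_MAX_TASKS];
    size_t nr;            /* runnable tasks on this sibling */
};

struct toy_pick {
    const struct toy_task *task; /* NULL stands for the idle task */
    bool forced_idle;            /* idle although runnable tasks exist */
};

/* Highest priority task on rq; if match is set, only tasks with `cookie`. */
static const struct toy_task *toy_best(const struct toy_rq *rq, bool match,
                                       unsigned long cookie)
{
    const struct toy_task *best = NULL;

    for (size_t i = 0; i < rq->nr; i++) {
        const struct toy_task *t = &rq->tasks[i];

        if (match && t->cookie != cookie)
            continue;
        if (!best || t->prio > best->prio)
            best = t;
    }
    return best;
}

/* Pick one task per sibling so that every pick carries the cookie of the
 * core-wide highest priority task; siblings without a match run idle. */
void toy_pick_core(const struct toy_rq *rqs, size_t nr_siblings,
                   struct toy_pick *out)
{
    const struct toy_task *max = NULL;

    for (size_t i = 0; i < nr_siblings; i++) {
        const struct toy_task *t = toy_best(&rqs[i], false, 0);

        if (t && (!max || t->prio > max->prio))
            max = t;
    }
    for (size_t i = 0; i < nr_siblings; i++) {
        const struct toy_task *t =
            max ? toy_best(&rqs[i], true, max->cookie) : NULL;

        out[i].task = t;
        out[i].forced_idle = !t && rqs[i].nr > 0;
    }
}
```

In this model a sibling whose only runnable task carries a different cookie
ends up with ``task == NULL`` and ``forced_idle == true``, which corresponds
to the forced idle state described above.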

Forced-idling of hyperthreads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The scheduler tries its best to find tasks that trust each other such that all
tasks selected to be scheduled are of the highest priority in a core. However,
it is possible that some runqueues have tasks that are incompatible with the
highest priority ones in the core. Favoring security over fairness, one or more
siblings could be forced to select a lower priority task if the highest
priority task is not trusted with respect to the core wide highest priority
task. If a sibling does not have a trusted task to run, it will be forced idle
by the scheduler (the idle thread is scheduled to run).

When the highest priority task is selected to run, a reschedule-IPI is sent to
the sibling to force it into idle. This results in 4 cases which need to be
considered depending on whether a VM or a regular usermode process was running
on either HT::

          HT1 (attack)            HT2 (victim)
   A      idle -> user space      user space -> idle
   B      idle -> user space      guest -> idle
   C      idle -> guest           user space -> idle
   D      idle -> guest           guest -> idle

Note that for better performance, we do not wait for the destination CPU
(victim) to enter idle mode. This is because the sending of the IPI would bring
the destination CPU immediately into kernel mode from user space, or VMEXIT
in the case of guests. At most, this would leak only some scheduler metadata
which may not be worth protecting. It is also possible that the IPI is received
too late on some architectures, but this has not been observed in the case of
x86.

Trust model
~~~~~~~~~~~
Core scheduling maintains trust relationships amongst groups of tasks by
assigning them a tag that is the same cookie value.
When a system with core scheduling boots, all tasks are considered to trust
each other. This is because the core scheduler does not have information about
trust relationships until userspace uses the above-mentioned interfaces to
communicate them. In other words, all tasks have a default cookie value of 0
and are considered system-wide trusted. The forced-idling of siblings running
cookie-0 tasks is also avoided.

Once userspace uses the above-mentioned interfaces to group sets of tasks, tasks
within such groups are considered to trust each other, but do not trust those
outside. Tasks outside the group also don't trust tasks within.

Limitations of core-scheduling
------------------------------
Core scheduling tries to guarantee that only trusted tasks run concurrently on a
core. But there could be a small window of time during which untrusted tasks run
concurrently, or the kernel could be running concurrently with a task not
trusted by the kernel.

IPI processing delays
~~~~~~~~~~~~~~~~~~~~~
Core scheduling selects only trusted tasks to run together. An IPI is used to
notify the siblings to switch to the new task. But there could be hardware
delays in receiving the IPI on some architectures (on x86, this has not been
observed). This may cause an attacker task to start running on a CPU before its
siblings receive the IPI. Even though the cache is flushed on entry to user
mode, victim tasks on siblings may populate data in the cache and
micro-architectural buffers after the attacker starts to run, and this is a
possibility for a data leak.

Open cross-HT issues that core scheduling does not solve
--------------------------------------------------------
1. For MDS
~~~~~~~~~~
Core scheduling cannot protect against MDS attacks between the siblings
running in user mode and the others running in kernel mode. Even though all
siblings run tasks which trust each other, when the kernel is executing
code on behalf of a task, it cannot trust the code running in the
sibling. Such attacks are possible for any combination of sibling CPU modes
(host or guest mode).

2. For L1TF
~~~~~~~~~~~
Core scheduling cannot protect against an L1TF guest attacker exploiting a
guest or host victim. This is because the guest attacker can craft invalid
PTEs which are not inverted due to a vulnerable guest kernel. The only
solution is to disable EPT (Extended Page Tables).

For both MDS and L1TF, if the guest vCPUs are configured to not trust each
other (by tagging them separately), then the guest to guest attacks would go
away. Or it could be a system admin policy which considers guest to guest
attacks as a guest problem.

Another approach to resolve these would be to make every untrusted task on the
system not trust any other untrusted task. While this could reduce
parallelism of the untrusted tasks, it would still solve the above issues while
allowing system processes (trusted tasks) to share a core.
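
Tagging tasks separately, whether they are vCPU threads or individual
untrusted tasks, amounts to giving each one its own cookie. The following
sketch does that for a list of thread IDs. ``isolate_each()`` is a
hypothetical helper, and how the caller obtains the thread IDs is outside the
scope of the example.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/types.h>

/* Fallbacks for older userspace headers (values from the kernel UAPI). */
#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE        62
#define PR_SCHED_CORE_CREATE 1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
#define PR_SCHED_CORE_SCOPE_THREAD 0
#endif

/* Give each listed thread its own unique cookie so that no two of them
 * can share a core. Returns how many threads were tagged successfully. */
size_t isolate_each(const pid_t *tids, size_t n)
{
    size_t tagged = 0;

    for (size_t i = 0; i < n; i++) {
        /* One CREATE per thread yields one distinct cookie per thread. */
        if (prctl(PR_SCHED_CORE, (unsigned long)PR_SCHED_CORE_CREATE,
                  (unsigned long)tids[i],
                  (unsigned long)PR_SCHED_CORE_SCOPE_THREAD, 0UL) == 0)
            tagged++;
    }
    return tagged;
}
```

A return value smaller than ``n`` means some threads kept their previous
cookie, so the caller should treat that as a failure of the isolation policy.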

3. Protecting the kernel (IRQ, syscall, VMEXIT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, core scheduling does not protect kernel contexts running on
sibling hyperthreads from one another. Prototypes of mitigations have been posted
to LKML to solve this, but it is debatable whether such windows are practically
exploitable, and whether the performance overhead of the prototypes is worth
it (not to mention the added code complexity).

Other Use cases
---------------
The main use case for core scheduling is mitigating the cross-HT vulnerabilities
with SMT enabled. There are other use cases where this feature could be used:

- Isolating tasks that need a whole core: Examples include realtime tasks, tasks
  that use SIMD instructions, etc.
- Gang scheduling: Requirements for a group of tasks that need to be scheduled
  together could also be realized using core scheduling. One example is vCPUs of
  a VM.