.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters.  Therefore, a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable or disable the syscall redirection and either
execute syscalls directly (disabled) or send them to be emulated in
userspace through a SIGSYS (enabled).

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes.
Instead, a userspace memory region exposed to the kernel indicates the
current personality, and the application simply modifies that variable
to configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any part of the syscall ABI to do the filtering.  It
uses only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or
disable the mechanism globally for that thread.  When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

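The call can be wrapped as in the following minimal sketch.  The helper
names ``sud_enable`` and ``sud_disable`` are hypothetical and used only
for illustration.  The fallback constants are assumed to match
``linux/prctl.h`` and are only needed with older userspace headers.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed to match linux/prctl.h. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/*
 * Hypothetical helper: enable dispatch for the calling thread.
 * Returns 0 on success, or -1 with errno set (for example when the
 * running kernel does not support the feature).
 */
static int sud_enable(unsigned long offset, unsigned long length,
                      volatile char *selector)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH,
                 (unsigned long)PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)selector);
}

/* Hypothetical helper: disable dispatch; all other fields must be zero. */
static int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH,
                 (unsigned long)PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

A caller would typically check the return value of ``sud_enable`` and
fall back to another strategy when it fails.
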
[<offset>, <offset>+<length>) delimits a memory region from which
syscalls are always executed directly, regardless of the userspace
selector.  This provides a fast path for the C library, which includes
the most common syscall dispatchers in native code applications, and
also provides a way for the signal handler to return without triggering
a nested SIGSYS on (rt\_)sigreturn.  Users of this interface should make
sure that at least the signal trampoline code is included in this
region.  In addition, on architectures that implement the trampoline
code on the vDSO, that trampoline is never intercepted.

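One possible way to obtain such a region is to look up the executable
mapping that contains a known code address.  The sketch below does this
by parsing ``/proc/self/maps``.  The helper name
``sud_find_exec_mapping`` is hypothetical, and the approach assumes that
procfs is mounted and that the code of interest lives in a single
executable mapping.

```c
#include <stdio.h>

/*
 * Hypothetical helper: find the executable mapping that contains addr
 * and report it as an <offset>/<length> pair.
 * Returns 0 on success, -1 if no such mapping was found.
 */
static int sud_find_exec_mapping(const void *addr,
                                 unsigned long *offset,
                                 unsigned long *length)
{
    unsigned long target = (unsigned long)addr;
    unsigned long start, end;
    char perms[5];
    char line[512];
    int found = -1;
    FILE *maps = fopen("/proc/self/maps", "r");

    if (!maps)
        return -1;

    /* Each line starts with "start-end perms", e.g. "7f00-7f10 r-xp". */
    while (fgets(line, sizeof(line), maps)) {
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (perms[2] == 'x' && target >= start && target < end) {
            *offset = start;
            *length = end - start;
            found = 0;
            break;
        }
    }

    fclose(maps);
    return found;
}
```

Passing the address of a function that is known to live in the C
library would yield a candidate region for the prctl.  Whether that
mapping also contains the signal trampoline depends on the C library
and the architecture, so this should be verified for the target
platform.
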
[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  The
selector can be set to SYSCALL_DISPATCH_FILTER_ALLOW or
SYSCALL_DISPATCH_FILTER_BLOCK.  Any other value terminates the program
with a SIGSYS.

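The selector mechanism can be illustrated with the following sketch,
which intercepts a single syscall.  The names ``sud_selector``,
``sud_on_sigsys`` and ``sud_intercept_once`` are hypothetical.  The
sketch assumes a single-threaded caller, passes an empty always-allowed
region, and relies on the handler resetting the selector before it
returns so that the signal return path is not redirected.  It only
observes the syscall number and does not emulate the call.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values assumed to match the kernel's. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* The flip switch read by the kernel on every syscall of this thread. */
static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
static volatile sig_atomic_t sud_last_nr = -1;

static void sud_on_sigsys(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    /* Allow syscalls first, so that rt_sigreturn is not redirected. */
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
    sud_last_nr = info->si_syscall;
}

/* Returns 1 if the syscall was intercepted, 0 if not, -1 on setup failure. */
static int sud_intercept_once(void)
{
    struct sigaction sa;
    sigset_t set;
    long nr = SYS_getpid;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = sud_on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return -1;

    /* A blocked SIGSYS would be fatal once a syscall is redirected. */
    sigemptyset(&set);
    sigaddset(&set, SIGSYS);
    sigprocmask(SIG_UNBLOCK, &set, NULL);

    syscall(nr); /* native call; also resolves the libc symbol early */

    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, (unsigned long)PR_SYS_DISPATCH_ON,
              0UL, 0UL, (unsigned long)&sud_selector) != 0)
        return -1; /* feature not available on this kernel */

    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK; /* no kernel entry */
    syscall(nr);                                  /* redirected to SIGSYS */
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW; /* handler did this too */

    prctl(PR_SET_SYSCALL_USER_DISPATCH, (unsigned long)PR_SYS_DISPATCH_OFF,
          0UL, 0UL, 0UL);
    return sud_last_nr == nr;
}
```

A real compatibility layer would emulate the call inside the handler,
typically by inspecting and updating the register state in the signal
context, which is architecture specific and left out here.
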
Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests.  This is useful for checkpoint/restart software.

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.