.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters.  Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes.
Instead, a userspace memory region exposed to the kernel indicates the
current personality, and the application simply modifies that variable
to configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.
Syscall User Dispatch therefore doesn't rely on any of the syscall ABI
to do the filtering.  It uses only the syscall dispatcher address and
the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or
disable the mechanism globally for that thread.  When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

[<offset>, <offset>+<length>) delimits a memory region from which
syscalls are always executed directly, regardless of the userspace
selector.  This provides a fast path for the C library, which includes
the most common syscall dispatchers in native code applications, and
also provides a way for the signal handler to return without triggering
a nested SIGSYS on (rt\_)sigreturn.  Users of this interface should make
sure that at least the signal trampoline code is included in this
region.  In addition, on architectures that implement the signal
trampoline code in the vDSO, that trampoline is never intercepted.

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  The
selector can be set to SYSCALL_DISPATCH_FILTER_ALLOW or
SYSCALL_DISPATCH_FILTER_BLOCK.  Any other value terminates the program
with a SIGSYS.

Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests.  This is useful for checkpoint/restart software.
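
The interface above can be sketched in C as follows.  This is a minimal
illustration, not a reference implementation: the names sud_selector,
sud_intercepted, sud_on_sigsys and sud_demo are hypothetical, the
fallback constant values are assumed to match the kernel UAPI header
<linux/prctl.h>, and the sketch passes offset = 0 and length = 0 (no
always-allowed region), relying on the handler flipping the selector
back to SYSCALL_DISPATCH_FILTER_ALLOW before it returns.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values assumed from <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* The userspace "flip switch" read by the kernel on every syscall entry. */
static volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
static volatile sig_atomic_t sud_intercepted;

static void sud_on_sigsys(int sig, siginfo_t *info, void *ucontext)
{
	/* Allow syscalls first, so the return through (rt_)sigreturn
	 * does not raise a nested SIGSYS. */
	sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
	(void)sig; (void)info; (void)ucontext;
	sud_intercepted++;	/* a real layer would emulate the call here */
}

/* Returns the number of intercepted syscalls, or -1 if the mechanism
 * could not be enabled (e.g. unsupported kernel or architecture). */
static int sud_demo(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sud_on_sigsys;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSYS, &sa, NULL) != 0)
		return -1;

	sud_intercepted = 0;
	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		  0UL, 0UL, &sud_selector) != 0)
		return -1;

	/* Cross into the "non-native" personality: a plain memory write. */
	sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
	syscall(SYS_getpid);	/* redirected to sud_on_sigsys via SIGSYS */

	/* The handler left the selector at ALLOW, so this prctl runs directly. */
	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0UL, 0UL, 0UL);
	return (int)sud_intercepted;
}
```

Calling sud_demo() once from a single-threaded program is enough to
exercise the path.  The value returned by the intercepted syscall() is
not meaningful in this sketch, since the handler does not emulate the
call.  A compatibility layer would more likely pass an <offset> and
<length> covering its native dispatcher and signal trampoline instead of
resetting the selector inside the handler.
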

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.
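
The reset on fork can be observed with a sketch like the one below.  It
is an illustration under assumptions, not a test from the kernel tree:
sud_fork_selector and sud_fork_demo are hypothetical names, and the
fallback constant values are assumed to match <linux/prctl.h>.  The
child sets its copy of the selector to SYSCALL_DISPATCH_FILTER_BLOCK and
issues a syscall; since no SIGSYS handler is installed, the child would
be killed if dispatch were still enabled for it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values assumed from <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

static volatile char sud_fork_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Returns 1 if the child ran its syscalls directly and exited normally,
 * 0 if it did not, and -1 if the mechanism could not be enabled. */
static int sud_fork_demo(void)
{
	pid_t pid;
	int status;

	if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
		  0UL, 0UL, &sud_fork_selector) != 0)
		return -1;

	/* The parent's selector is ALLOW, so fork itself runs directly. */
	pid = fork();
	if (pid == 0) {
		/* Only affects the child's copy of the selector. */
		sud_fork_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
		syscall(SYS_getpid);	/* would raise SIGSYS if still enabled */
		_exit(0);
	}

	/* Disable again in the parent before doing anything else. */
	prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0UL, 0UL, 0UL);
	if (pid < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

A child that needs the mechanism has to issue the prctl again itself
after the fork.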