.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters.  Therefore, a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes.
Instead, a userspace memory region exposed to the kernel indicates the
current personality, and the application simply modifies that variable
to configure the mechanism.

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any of the syscall ABI to do the filtering.  It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread.  When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

[<offset>, <offset>+<length>) delimits a memory region from which
syscalls are always executed directly, regardless of the userspace
selector.  This provides a fast path for the C library, which includes
the most common syscall dispatchers in native code applications, and
also provides a way for the signal handler to return without triggering
a nested SIGSYS on (rt\_)sigreturn.  Users of this interface should make
sure that at least the signal trampoline code is included in this
region.  In addition, for architectures that implement the trampoline
code on the vDSO, that trampoline is never intercepted.
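
The prctl call above can be wrapped in two small helpers.  This is only
a minimal sketch: the names ``sud_enable`` and ``sud_disable`` are
hypothetical, and the fallback constants are assumed to match
``include/uapi/linux/prctl.h`` for toolchains whose headers predate the
feature.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers (assumed to match
 * the values in include/uapi/linux/prctl.h). */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON 1
#endif

/*
 * Hypothetical helper: enable Syscall User Dispatch for the calling
 * thread.  Syscalls issued from [offset, offset + length) always run
 * directly; selector may be NULL if no userspace switch is wanted.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sud_enable(unsigned long offset, unsigned long length,
               volatile char *selector)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)selector);
}

/*
 * Hypothetical helper: disable the mechanism for the calling thread.
 * All remaining fields must be zero with PR_SYS_DISPATCH_OFF.
 */
int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

A caller would typically pass the address range of the code that must
keep issuing native syscalls (for example the C library mapping) as
<offset> and <length>, and check the return value, since kernels
without support for this prctl reject the request.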

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable/disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  The
selector can be set to SYSCALL_DISPATCH_FILTER_ALLOW or
SYSCALL_DISPATCH_FILTER_BLOCK.  Any other value terminates the program
with a SIGSYS.

Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests.  This is useful for checkpoint/restart software.

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.
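
To see why this is not a sandbox, it may help to model the dispatch
decision as a pure function of two inputs that userspace fully
controls: the address of the syscall instruction and the selector byte.
The sketch below is a simplified model, not kernel code; the names
``sud_outcome`` and ``sud_model`` are hypothetical, the vDSO trampoline
exception is left out, and treating a missing selector as "always
redirect outside the region" is an assumption about the behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Selector values (assumed to match include/uapi/linux/prctl.h). */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical outcome labels used only by this model. */
enum sud_outcome {
    SUD_EXECUTE,    /* syscall is executed directly */
    SUD_SIGSYS,     /* syscall is redirected to the SIGSYS handler */
    SUD_TERMINATE   /* invalid selector value: program is terminated */
};

/*
 * Simplified model of the decision for a thread that has the mechanism
 * enabled.  ip is the address the syscall was issued from.
 */
enum sud_outcome sud_model(uintptr_t ip, uintptr_t offset,
                           uintptr_t length,
                           const unsigned char *selector)
{
    /* Unsigned subtraction wraps around, so this single comparison
     * tests offset <= ip < offset + length. */
    if (ip - offset < length)
        return SUD_EXECUTE;

    /* Assumption: without a selector, everything outside the allowed
     * region is redirected. */
    if (selector == NULL)
        return SUD_SIGSYS;

    if (*selector == SYSCALL_DISPATCH_FILTER_ALLOW)
        return SUD_EXECUTE;
    if (*selector == SYSCALL_DISPATCH_FILTER_BLOCK)
        return SUD_SIGSYS;

    return SUD_TERMINATE;
}
```

Both ways of reaching ``SUD_EXECUTE`` are available to any code running
in the thread: it can jump into the allowed region, or it can write
SYSCALL_DISPATCH_FILTER_ALLOW to the selector.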

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.
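
As a closing illustration of the flip switch described in the Interface
section, a compatibility layer might toggle the selector around each
call into non-native code.  This is a minimal sketch under assumptions:
``sud_selector``, ``sud_call_foreign``, ``sud_current_selector`` and
``record_selector`` are hypothetical names, and the selector is assumed
to have been registered with the prctl beforehand (and registered again
in a child after fork).  The flip itself is a plain memory write and
involves no syscall.

```c
/* Selector values (assumed to match include/uapi/linux/prctl.h). */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Per-thread selector byte; its address is what would be passed as
 * [selector] when enabling the mechanism. */
static _Thread_local volatile char sud_selector =
    SYSCALL_DISPATCH_FILTER_ALLOW;

/* Hypothetical accessor for the calling thread's selector value. */
char sud_current_selector(void)
{
    return sud_selector;
}

/*
 * Hypothetical wrapper: run non-native code with redirection on, then
 * return to the native personality.
 */
void sud_call_foreign(void (*entry)(void *), void *arg)
{
    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;  /* redirect syscalls */
    entry(arg);
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;  /* native again */
}

/* Demo callback standing in for non-native code: it records the
 * selector value it observes into the char pointed to by out. */
void record_selector(void *out)
{
    *(char *)out = sud_current_selector();
}
```

The SIGSYS handler of such a layer would usually do the reverse flip
while it emulates the intercepted syscall, so that any native syscalls
it issues outside the allowed region are not redirected again.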