1458f69efSMauro Carvalho Chehab====================================================== 2458f69efSMauro Carvalho Chehabhrtimers - subsystem for high-resolution kernel timers 3458f69efSMauro Carvalho Chehab====================================================== 4458f69efSMauro Carvalho Chehab 5458f69efSMauro Carvalho ChehabThis patch introduces a new subsystem for high-resolution kernel timers. 6458f69efSMauro Carvalho Chehab 7458f69efSMauro Carvalho ChehabOne might ask the question: we already have a timer subsystem 8458f69efSMauro Carvalho Chehab(kernel/timers.c), why do we need two timer subsystems? After a lot of 9458f69efSMauro Carvalho Chehabback and forth trying to integrate high-resolution and high-precision 10458f69efSMauro Carvalho Chehabfeatures into the existing timer framework, and after testing various 11458f69efSMauro Carvalho Chehabsuch high-resolution timer implementations in practice, we came to the 12458f69efSMauro Carvalho Chehabconclusion that the timer wheel code is fundamentally not suitable for 13458f69efSMauro Carvalho Chehabsuch an approach. We initially didn't believe this ('there must be a way 14458f69efSMauro Carvalho Chehabto solve this'), and spent a considerable effort trying to integrate 15458f69efSMauro Carvalho Chehabthings into the timer wheel, but we failed. In hindsight, there are 16458f69efSMauro Carvalho Chehabseveral reasons why such integration is hard/impossible: 17458f69efSMauro Carvalho Chehab 18458f69efSMauro Carvalho Chehab- the forced handling of low-resolution and high-resolution timers in 19458f69efSMauro Carvalho Chehab the same way leads to a lot of compromises, macro magic and #ifdef 20458f69efSMauro Carvalho Chehab mess. The timers.c code is very "tightly coded" around jiffies and 21458f69efSMauro Carvalho Chehab 32-bitness assumptions, and has been honed and micro-optimized for a 22458f69efSMauro Carvalho Chehab relatively narrow use case (jiffies in a relatively narrow HZ range) 23458f69efSMauro Carvalho Chehab for many years - and thus even small extensions to it easily break 24458f69efSMauro Carvalho Chehab the wheel concept, leading to even worse compromises. The timer wheel 25458f69efSMauro Carvalho Chehab code is very good and tight code, there's zero problems with it in its 26458f69efSMauro Carvalho Chehab current usage - but it is simply not suitable to be extended for 27458f69efSMauro Carvalho Chehab high-res timers. 28458f69efSMauro Carvalho Chehab 29458f69efSMauro Carvalho Chehab- the unpredictable [O(N)] overhead of cascading leads to delays which 30458f69efSMauro Carvalho Chehab necessitate a more complex handling of high resolution timers, which 31458f69efSMauro Carvalho Chehab in turn decreases robustness. Such a design still leads to rather large 32458f69efSMauro Carvalho Chehab timing inaccuracies. Cascading is a fundamental property of the timer 33458f69efSMauro Carvalho Chehab wheel concept, it cannot be 'designed out' without inevitably 34458f69efSMauro Carvalho Chehab degrading other portions of the timers.c code in an unacceptable way. 35458f69efSMauro Carvalho Chehab 36458f69efSMauro Carvalho Chehab- the implementation of the current posix-timer subsystem on top of 37458f69efSMauro Carvalho Chehab the timer wheel has already introduced a quite complex handling of 38458f69efSMauro Carvalho Chehab the required readjusting of absolute CLOCK_REALTIME timers at 39458f69efSMauro Carvalho Chehab settimeofday or NTP time - further underlying our experience by 40458f69efSMauro Carvalho Chehab example: that the timer wheel data structure is too rigid for high-res 41458f69efSMauro Carvalho Chehab timers. 42458f69efSMauro Carvalho Chehab 43458f69efSMauro Carvalho Chehab- the timer wheel code is most optimal for use cases which can be 44458f69efSMauro Carvalho Chehab identified as "timeouts". Such timeouts are usually set up to cover 45458f69efSMauro Carvalho Chehab error conditions in various I/O paths, such as networking and block 46458f69efSMauro Carvalho Chehab I/O. The vast majority of those timers never expire and are rarely 47458f69efSMauro Carvalho Chehab recascaded because the expected correct event arrives in time so they 48458f69efSMauro Carvalho Chehab can be removed from the timer wheel before any further processing of 49458f69efSMauro Carvalho Chehab them becomes necessary. Thus the users of these timeouts can accept 50458f69efSMauro Carvalho Chehab the granularity and precision tradeoffs of the timer wheel, and 51458f69efSMauro Carvalho Chehab largely expect the timer subsystem to have near-zero overhead. 52458f69efSMauro Carvalho Chehab Accurate timing for them is not a core purpose - in fact most of the 53458f69efSMauro Carvalho Chehab timeout values used are ad-hoc. For them it is at most a necessary 54458f69efSMauro Carvalho Chehab evil to guarantee the processing of actual timeout completions 55458f69efSMauro Carvalho Chehab (because most of the timeouts are deleted before completion), which 56458f69efSMauro Carvalho Chehab should thus be as cheap and unintrusive as possible. 57458f69efSMauro Carvalho Chehab 58458f69efSMauro Carvalho ChehabThe primary users of precision timers are user-space applications that 59458f69efSMauro Carvalho Chehabutilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel 60458f69efSMauro Carvalho Chehabusers like drivers and subsystems which require precise timed events 61458f69efSMauro Carvalho Chehab(e.g. multimedia) can benefit from the availability of a separate 62458f69efSMauro Carvalho Chehabhigh-resolution timer subsystem as well. 63458f69efSMauro Carvalho Chehab 64458f69efSMauro Carvalho ChehabWhile this subsystem does not offer high-resolution clock sources just 65458f69efSMauro Carvalho Chehabyet, the hrtimer subsystem can be easily extended with high-resolution 66458f69efSMauro Carvalho Chehabclock capabilities, and patches for that exist and are maturing quickly. 67458f69efSMauro Carvalho ChehabThe increasing demand for realtime and multimedia applications along 68458f69efSMauro Carvalho Chehabwith other potential users for precise timers gives another reason to 69458f69efSMauro Carvalho Chehabseparate the "timeout" and "precise timer" subsystems. 70458f69efSMauro Carvalho Chehab 71458f69efSMauro Carvalho ChehabAnother potential benefit is that such a separation allows even more 72458f69efSMauro Carvalho Chehabspecial-purpose optimization of the existing timer wheel for the low 73458f69efSMauro Carvalho Chehabresolution and low precision use cases - once the precision-sensitive 74458f69efSMauro Carvalho ChehabAPIs are separated from the timer wheel and are migrated over to 75458f69efSMauro Carvalho Chehabhrtimers. E.g. we could decrease the frequency of the timeout subsystem 76458f69efSMauro Carvalho Chehabfrom 250 Hz to 100 HZ (or even smaller). 77458f69efSMauro Carvalho Chehab 78458f69efSMauro Carvalho Chehabhrtimer subsystem implementation details 79458f69efSMauro Carvalho Chehab---------------------------------------- 80458f69efSMauro Carvalho Chehab 81458f69efSMauro Carvalho Chehabthe basic design considerations were: 82458f69efSMauro Carvalho Chehab 83458f69efSMauro Carvalho Chehab- simplicity 84458f69efSMauro Carvalho Chehab 85458f69efSMauro Carvalho Chehab- data structure not bound to jiffies or any other granularity. All the 86458f69efSMauro Carvalho Chehab kernel logic works at 64-bit nanoseconds resolution - no compromises. 87458f69efSMauro Carvalho Chehab 88458f69efSMauro Carvalho Chehab- simplification of existing, timing related kernel code 89458f69efSMauro Carvalho Chehab 90458f69efSMauro Carvalho Chehabanother basic requirement was the immediate enqueueing and ordering of 91458f69efSMauro Carvalho Chehabtimers at activation time. After looking at several possible solutions 92458f69efSMauro Carvalho Chehabsuch as radix trees and hashes, we chose the red black tree as the basic 93458f69efSMauro Carvalho Chehabdata structure. Rbtrees are available as a library in the kernel and are 94458f69efSMauro Carvalho Chehabused in various performance-critical areas of e.g. memory management and 95458f69efSMauro Carvalho Chehabfile systems. The rbtree is solely used for time sorted ordering, while 96458f69efSMauro Carvalho Chehaba separate list is used to give the expiry code fast access to the 97458f69efSMauro Carvalho Chehabqueued timers, without having to walk the rbtree. 98458f69efSMauro Carvalho Chehab 99458f69efSMauro Carvalho Chehab(This separate list is also useful for later when we'll introduce 100458f69efSMauro Carvalho Chehabhigh-resolution clocks, where we need separate pending and expired 101458f69efSMauro Carvalho Chehabqueues while keeping the time-order intact.) 102458f69efSMauro Carvalho Chehab 103458f69efSMauro Carvalho ChehabTime-ordered enqueueing is not purely for the purposes of 104458f69efSMauro Carvalho Chehabhigh-resolution clocks though, it also simplifies the handling of 105458f69efSMauro Carvalho Chehababsolute timers based on a low-resolution CLOCK_REALTIME. The existing 106458f69efSMauro Carvalho Chehabimplementation needed to keep an extra list of all armed absolute 107458f69efSMauro Carvalho ChehabCLOCK_REALTIME timers along with complex locking. In case of 108458f69efSMauro Carvalho Chehabsettimeofday and NTP, all the timers (!) had to be dequeued, the 109458f69efSMauro Carvalho Chehabtime-changing code had to fix them up one by one, and all of them had to 110458f69efSMauro Carvalho Chehabbe enqueued again. The time-ordered enqueueing and the storage of the 111458f69efSMauro Carvalho Chehabexpiry time in absolute time units removes all this complex and poorly 112458f69efSMauro Carvalho Chehabscaling code from the posix-timer implementation - the clock can simply 113458f69efSMauro Carvalho Chehabbe set without having to touch the rbtree. This also makes the handling 114458f69efSMauro Carvalho Chehabof posix-timers simpler in general. 115458f69efSMauro Carvalho Chehab 116458f69efSMauro Carvalho ChehabThe locking and per-CPU behavior of hrtimers was mostly taken from the 117458f69efSMauro Carvalho Chehabexisting timer wheel code, as it is mature and well suited. Sharing code 118458f69efSMauro Carvalho Chehabwas not really a win, due to the different data structures. Also, the 119458f69efSMauro Carvalho Chehabhrtimer functions now have clearer behavior and clearer names - such as 120458f69efSMauro Carvalho Chehabhrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly 12187bdd932SThomas Gleixnerequivalent to timer_delete() and timer_delete_sync()] - so there's no direct 122458f69efSMauro Carvalho Chehab1:1 mapping between them on the algorithmic level, and thus no real 123458f69efSMauro Carvalho Chehabpotential for code sharing either. 124458f69efSMauro Carvalho Chehab 125458f69efSMauro Carvalho ChehabBasic data types: every time value, absolute or relative, is in a 126*4c093cbbSGeert Uytterhoevenspecial nanosecond-resolution 64bit type: ktime_t. 127*4c093cbbSGeert Uytterhoeven(Originally, the kernel-internal representation of ktime_t values and 128*4c093cbbSGeert Uytterhoevenoperations was implemented via macros and inline functions, and could be 129*4c093cbbSGeert Uytterhoevenswitched between a "hybrid union" type and a plain "scalar" 64bit 130*4c093cbbSGeert Uytterhoevennanoseconds representation (at compile time). This was abandoned in the 131*4c093cbbSGeert Uytterhoevencontext of the Y2038 work.) 132458f69efSMauro Carvalho Chehab 133458f69efSMauro Carvalho Chehabhrtimers - rounding of timer values 134458f69efSMauro Carvalho Chehab----------------------------------- 135458f69efSMauro Carvalho Chehab 136458f69efSMauro Carvalho Chehabthe hrtimer code will round timer events to lower-resolution clocks 137458f69efSMauro Carvalho Chehabbecause it has to. Otherwise it will do no artificial rounding at all. 138458f69efSMauro Carvalho Chehab 139458f69efSMauro Carvalho Chehabone question is, what resolution value should be returned to the user by 140458f69efSMauro Carvalho Chehabthe clock_getres() interface. This will return whatever real resolution 141458f69efSMauro Carvalho Chehaba given clock has - be it low-res, high-res, or artificially-low-res. 142458f69efSMauro Carvalho Chehab 143458f69efSMauro Carvalho Chehabhrtimers - testing and verification 144458f69efSMauro Carvalho Chehab----------------------------------- 145458f69efSMauro Carvalho Chehab 146458f69efSMauro Carvalho ChehabWe used the high-resolution clock subsystem on top of hrtimers to verify 147458f69efSMauro Carvalho Chehabthe hrtimer implementation details in praxis, and we also ran the posix 148458f69efSMauro Carvalho Chehabtimer tests in order to ensure specification compliance. We also ran 149458f69efSMauro Carvalho Chehabtests on low-resolution clocks. 150458f69efSMauro Carvalho Chehab 151458f69efSMauro Carvalho ChehabThe hrtimer patch converts the following kernel functionality to use 152458f69efSMauro Carvalho Chehabhrtimers: 153458f69efSMauro Carvalho Chehab 154458f69efSMauro Carvalho Chehab - nanosleep 155458f69efSMauro Carvalho Chehab - itimers 156458f69efSMauro Carvalho Chehab - posix-timers 157458f69efSMauro Carvalho Chehab 158458f69efSMauro Carvalho ChehabThe conversion of nanosleep and posix-timers enabled the unification of 159458f69efSMauro Carvalho Chehabnanosleep and clock_nanosleep. 160458f69efSMauro Carvalho Chehab 161458f69efSMauro Carvalho ChehabThe code was successfully compiled for the following platforms: 162458f69efSMauro Carvalho Chehab 163458f69efSMauro Carvalho Chehab i386, x86_64, ARM, PPC, PPC64, IA64 164458f69efSMauro Carvalho Chehab 165458f69efSMauro Carvalho ChehabThe code was run-tested on the following platforms: 166458f69efSMauro Carvalho Chehab 167458f69efSMauro Carvalho Chehab i386(UP/SMP), x86_64(UP/SMP), ARM, PPC 168458f69efSMauro Carvalho Chehab 169458f69efSMauro Carvalho Chehabhrtimers were also integrated into the -rt tree, along with a 170458f69efSMauro Carvalho Chehabhrtimers-based high-resolution clock implementation, so the hrtimers 171458f69efSMauro Carvalho Chehabcode got a healthy amount of testing and use in practice. 172458f69efSMauro Carvalho Chehab 173458f69efSMauro Carvalho Chehab Thomas Gleixner, Ingo Molnar 174