1*458f69efSMauro Carvalho Chehab====================================================== 2*458f69efSMauro Carvalho Chehabhrtimers - subsystem for high-resolution kernel timers 3*458f69efSMauro Carvalho Chehab====================================================== 4*458f69efSMauro Carvalho Chehab 5*458f69efSMauro Carvalho ChehabThis patch introduces a new subsystem for high-resolution kernel timers. 6*458f69efSMauro Carvalho Chehab 7*458f69efSMauro Carvalho ChehabOne might ask the question: we already have a timer subsystem 8*458f69efSMauro Carvalho Chehab(kernel/timers.c), why do we need two timer subsystems? After a lot of 9*458f69efSMauro Carvalho Chehabback and forth trying to integrate high-resolution and high-precision 10*458f69efSMauro Carvalho Chehabfeatures into the existing timer framework, and after testing various 11*458f69efSMauro Carvalho Chehabsuch high-resolution timer implementations in practice, we came to the 12*458f69efSMauro Carvalho Chehabconclusion that the timer wheel code is fundamentally not suitable for 13*458f69efSMauro Carvalho Chehabsuch an approach. We initially didn't believe this ('there must be a way 14*458f69efSMauro Carvalho Chehabto solve this'), and spent a considerable effort trying to integrate 15*458f69efSMauro Carvalho Chehabthings into the timer wheel, but we failed. In hindsight, there are 16*458f69efSMauro Carvalho Chehabseveral reasons why such integration is hard/impossible: 17*458f69efSMauro Carvalho Chehab 18*458f69efSMauro Carvalho Chehab- the forced handling of low-resolution and high-resolution timers in 19*458f69efSMauro Carvalho Chehab the same way leads to a lot of compromises, macro magic and #ifdef 20*458f69efSMauro Carvalho Chehab mess. The timers.c code is very "tightly coded" around jiffies and 21*458f69efSMauro Carvalho Chehab 32-bitness assumptions, and has been honed and micro-optimized for a 22*458f69efSMauro Carvalho Chehab relatively narrow use case (jiffies in a relatively narrow HZ range) 23*458f69efSMauro Carvalho Chehab for many years - and thus even small extensions to it easily break 24*458f69efSMauro Carvalho Chehab the wheel concept, leading to even worse compromises. The timer wheel 25*458f69efSMauro Carvalho Chehab code is very good and tight code, there's zero problems with it in its 26*458f69efSMauro Carvalho Chehab current usage - but it is simply not suitable to be extended for 27*458f69efSMauro Carvalho Chehab high-res timers. 28*458f69efSMauro Carvalho Chehab 29*458f69efSMauro Carvalho Chehab- the unpredictable [O(N)] overhead of cascading leads to delays which 30*458f69efSMauro Carvalho Chehab necessitate a more complex handling of high resolution timers, which 31*458f69efSMauro Carvalho Chehab in turn decreases robustness. Such a design still leads to rather large 32*458f69efSMauro Carvalho Chehab timing inaccuracies. Cascading is a fundamental property of the timer 33*458f69efSMauro Carvalho Chehab wheel concept, it cannot be 'designed out' without inevitably 34*458f69efSMauro Carvalho Chehab degrading other portions of the timers.c code in an unacceptable way. 35*458f69efSMauro Carvalho Chehab 36*458f69efSMauro Carvalho Chehab- the implementation of the current posix-timer subsystem on top of 37*458f69efSMauro Carvalho Chehab the timer wheel has already introduced a quite complex handling of 38*458f69efSMauro Carvalho Chehab the required readjusting of absolute CLOCK_REALTIME timers at 39*458f69efSMauro Carvalho Chehab settimeofday or NTP time - further underlying our experience by 40*458f69efSMauro Carvalho Chehab example: that the timer wheel data structure is too rigid for high-res 41*458f69efSMauro Carvalho Chehab timers. 42*458f69efSMauro Carvalho Chehab 43*458f69efSMauro Carvalho Chehab- the timer wheel code is most optimal for use cases which can be 44*458f69efSMauro Carvalho Chehab identified as "timeouts". Such timeouts are usually set up to cover 45*458f69efSMauro Carvalho Chehab error conditions in various I/O paths, such as networking and block 46*458f69efSMauro Carvalho Chehab I/O. The vast majority of those timers never expire and are rarely 47*458f69efSMauro Carvalho Chehab recascaded because the expected correct event arrives in time so they 48*458f69efSMauro Carvalho Chehab can be removed from the timer wheel before any further processing of 49*458f69efSMauro Carvalho Chehab them becomes necessary. Thus the users of these timeouts can accept 50*458f69efSMauro Carvalho Chehab the granularity and precision tradeoffs of the timer wheel, and 51*458f69efSMauro Carvalho Chehab largely expect the timer subsystem to have near-zero overhead. 52*458f69efSMauro Carvalho Chehab Accurate timing for them is not a core purpose - in fact most of the 53*458f69efSMauro Carvalho Chehab timeout values used are ad-hoc. For them it is at most a necessary 54*458f69efSMauro Carvalho Chehab evil to guarantee the processing of actual timeout completions 55*458f69efSMauro Carvalho Chehab (because most of the timeouts are deleted before completion), which 56*458f69efSMauro Carvalho Chehab should thus be as cheap and unintrusive as possible. 57*458f69efSMauro Carvalho Chehab 58*458f69efSMauro Carvalho ChehabThe primary users of precision timers are user-space applications that 59*458f69efSMauro Carvalho Chehabutilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel 60*458f69efSMauro Carvalho Chehabusers like drivers and subsystems which require precise timed events 61*458f69efSMauro Carvalho Chehab(e.g. multimedia) can benefit from the availability of a separate 62*458f69efSMauro Carvalho Chehabhigh-resolution timer subsystem as well. 63*458f69efSMauro Carvalho Chehab 64*458f69efSMauro Carvalho ChehabWhile this subsystem does not offer high-resolution clock sources just 65*458f69efSMauro Carvalho Chehabyet, the hrtimer subsystem can be easily extended with high-resolution 66*458f69efSMauro Carvalho Chehabclock capabilities, and patches for that exist and are maturing quickly. 67*458f69efSMauro Carvalho ChehabThe increasing demand for realtime and multimedia applications along 68*458f69efSMauro Carvalho Chehabwith other potential users for precise timers gives another reason to 69*458f69efSMauro Carvalho Chehabseparate the "timeout" and "precise timer" subsystems. 70*458f69efSMauro Carvalho Chehab 71*458f69efSMauro Carvalho ChehabAnother potential benefit is that such a separation allows even more 72*458f69efSMauro Carvalho Chehabspecial-purpose optimization of the existing timer wheel for the low 73*458f69efSMauro Carvalho Chehabresolution and low precision use cases - once the precision-sensitive 74*458f69efSMauro Carvalho ChehabAPIs are separated from the timer wheel and are migrated over to 75*458f69efSMauro Carvalho Chehabhrtimers. E.g. we could decrease the frequency of the timeout subsystem 76*458f69efSMauro Carvalho Chehabfrom 250 Hz to 100 HZ (or even smaller). 77*458f69efSMauro Carvalho Chehab 78*458f69efSMauro Carvalho Chehabhrtimer subsystem implementation details 79*458f69efSMauro Carvalho Chehab---------------------------------------- 80*458f69efSMauro Carvalho Chehab 81*458f69efSMauro Carvalho Chehabthe basic design considerations were: 82*458f69efSMauro Carvalho Chehab 83*458f69efSMauro Carvalho Chehab- simplicity 84*458f69efSMauro Carvalho Chehab 85*458f69efSMauro Carvalho Chehab- data structure not bound to jiffies or any other granularity. All the 86*458f69efSMauro Carvalho Chehab kernel logic works at 64-bit nanoseconds resolution - no compromises. 87*458f69efSMauro Carvalho Chehab 88*458f69efSMauro Carvalho Chehab- simplification of existing, timing related kernel code 89*458f69efSMauro Carvalho Chehab 90*458f69efSMauro Carvalho Chehabanother basic requirement was the immediate enqueueing and ordering of 91*458f69efSMauro Carvalho Chehabtimers at activation time. After looking at several possible solutions 92*458f69efSMauro Carvalho Chehabsuch as radix trees and hashes, we chose the red black tree as the basic 93*458f69efSMauro Carvalho Chehabdata structure. Rbtrees are available as a library in the kernel and are 94*458f69efSMauro Carvalho Chehabused in various performance-critical areas of e.g. memory management and 95*458f69efSMauro Carvalho Chehabfile systems. The rbtree is solely used for time sorted ordering, while 96*458f69efSMauro Carvalho Chehaba separate list is used to give the expiry code fast access to the 97*458f69efSMauro Carvalho Chehabqueued timers, without having to walk the rbtree. 98*458f69efSMauro Carvalho Chehab 99*458f69efSMauro Carvalho Chehab(This separate list is also useful for later when we'll introduce 100*458f69efSMauro Carvalho Chehabhigh-resolution clocks, where we need separate pending and expired 101*458f69efSMauro Carvalho Chehabqueues while keeping the time-order intact.) 102*458f69efSMauro Carvalho Chehab 103*458f69efSMauro Carvalho ChehabTime-ordered enqueueing is not purely for the purposes of 104*458f69efSMauro Carvalho Chehabhigh-resolution clocks though, it also simplifies the handling of 105*458f69efSMauro Carvalho Chehababsolute timers based on a low-resolution CLOCK_REALTIME. The existing 106*458f69efSMauro Carvalho Chehabimplementation needed to keep an extra list of all armed absolute 107*458f69efSMauro Carvalho ChehabCLOCK_REALTIME timers along with complex locking. In case of 108*458f69efSMauro Carvalho Chehabsettimeofday and NTP, all the timers (!) had to be dequeued, the 109*458f69efSMauro Carvalho Chehabtime-changing code had to fix them up one by one, and all of them had to 110*458f69efSMauro Carvalho Chehabbe enqueued again. The time-ordered enqueueing and the storage of the 111*458f69efSMauro Carvalho Chehabexpiry time in absolute time units removes all this complex and poorly 112*458f69efSMauro Carvalho Chehabscaling code from the posix-timer implementation - the clock can simply 113*458f69efSMauro Carvalho Chehabbe set without having to touch the rbtree. This also makes the handling 114*458f69efSMauro Carvalho Chehabof posix-timers simpler in general. 115*458f69efSMauro Carvalho Chehab 116*458f69efSMauro Carvalho ChehabThe locking and per-CPU behavior of hrtimers was mostly taken from the 117*458f69efSMauro Carvalho Chehabexisting timer wheel code, as it is mature and well suited. Sharing code 118*458f69efSMauro Carvalho Chehabwas not really a win, due to the different data structures. Also, the 119*458f69efSMauro Carvalho Chehabhrtimer functions now have clearer behavior and clearer names - such as 120*458f69efSMauro Carvalho Chehabhrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly 121*458f69efSMauro Carvalho Chehabequivalent to del_timer() and del_timer_sync()] - so there's no direct 122*458f69efSMauro Carvalho Chehab1:1 mapping between them on the algorithmic level, and thus no real 123*458f69efSMauro Carvalho Chehabpotential for code sharing either. 124*458f69efSMauro Carvalho Chehab 125*458f69efSMauro Carvalho ChehabBasic data types: every time value, absolute or relative, is in a 126*458f69efSMauro Carvalho Chehabspecial nanosecond-resolution type: ktime_t. The kernel-internal 127*458f69efSMauro Carvalho Chehabrepresentation of ktime_t values and operations is implemented via 128*458f69efSMauro Carvalho Chehabmacros and inline functions, and can be switched between a "hybrid 129*458f69efSMauro Carvalho Chehabunion" type and a plain "scalar" 64bit nanoseconds representation (at 130*458f69efSMauro Carvalho Chehabcompile time). The hybrid union type optimizes time conversions on 32bit 131*458f69efSMauro Carvalho ChehabCPUs. This build-time-selectable ktime_t storage format was implemented 132*458f69efSMauro Carvalho Chehabto avoid the performance impact of 64-bit multiplications and divisions 133*458f69efSMauro Carvalho Chehabon 32bit CPUs. Such operations are frequently necessary to convert 134*458f69efSMauro Carvalho Chehabbetween the storage formats provided by kernel and userspace interfaces 135*458f69efSMauro Carvalho Chehaband the internal time format. (See include/linux/ktime.h for further 136*458f69efSMauro Carvalho Chehabdetails.) 137*458f69efSMauro Carvalho Chehab 138*458f69efSMauro Carvalho Chehabhrtimers - rounding of timer values 139*458f69efSMauro Carvalho Chehab----------------------------------- 140*458f69efSMauro Carvalho Chehab 141*458f69efSMauro Carvalho Chehabthe hrtimer code will round timer events to lower-resolution clocks 142*458f69efSMauro Carvalho Chehabbecause it has to. Otherwise it will do no artificial rounding at all. 143*458f69efSMauro Carvalho Chehab 144*458f69efSMauro Carvalho Chehabone question is, what resolution value should be returned to the user by 145*458f69efSMauro Carvalho Chehabthe clock_getres() interface. This will return whatever real resolution 146*458f69efSMauro Carvalho Chehaba given clock has - be it low-res, high-res, or artificially-low-res. 147*458f69efSMauro Carvalho Chehab 148*458f69efSMauro Carvalho Chehabhrtimers - testing and verification 149*458f69efSMauro Carvalho Chehab----------------------------------- 150*458f69efSMauro Carvalho Chehab 151*458f69efSMauro Carvalho ChehabWe used the high-resolution clock subsystem ontop of hrtimers to verify 152*458f69efSMauro Carvalho Chehabthe hrtimer implementation details in praxis, and we also ran the posix 153*458f69efSMauro Carvalho Chehabtimer tests in order to ensure specification compliance. We also ran 154*458f69efSMauro Carvalho Chehabtests on low-resolution clocks. 155*458f69efSMauro Carvalho Chehab 156*458f69efSMauro Carvalho ChehabThe hrtimer patch converts the following kernel functionality to use 157*458f69efSMauro Carvalho Chehabhrtimers: 158*458f69efSMauro Carvalho Chehab 159*458f69efSMauro Carvalho Chehab - nanosleep 160*458f69efSMauro Carvalho Chehab - itimers 161*458f69efSMauro Carvalho Chehab - posix-timers 162*458f69efSMauro Carvalho Chehab 163*458f69efSMauro Carvalho ChehabThe conversion of nanosleep and posix-timers enabled the unification of 164*458f69efSMauro Carvalho Chehabnanosleep and clock_nanosleep. 165*458f69efSMauro Carvalho Chehab 166*458f69efSMauro Carvalho ChehabThe code was successfully compiled for the following platforms: 167*458f69efSMauro Carvalho Chehab 168*458f69efSMauro Carvalho Chehab i386, x86_64, ARM, PPC, PPC64, IA64 169*458f69efSMauro Carvalho Chehab 170*458f69efSMauro Carvalho ChehabThe code was run-tested on the following platforms: 171*458f69efSMauro Carvalho Chehab 172*458f69efSMauro Carvalho Chehab i386(UP/SMP), x86_64(UP/SMP), ARM, PPC 173*458f69efSMauro Carvalho Chehab 174*458f69efSMauro Carvalho Chehabhrtimers were also integrated into the -rt tree, along with a 175*458f69efSMauro Carvalho Chehabhrtimers-based high-resolution clock implementation, so the hrtimers 176*458f69efSMauro Carvalho Chehabcode got a healthy amount of testing and use in practice. 177*458f69efSMauro Carvalho Chehab 178*458f69efSMauro Carvalho Chehab Thomas Gleixner, Ingo Molnar 179