xref: /openbmc/linux/Documentation/timers/hrtimers.rst (revision 458f69ef36656dc74679667380422dd8063eabfb)
1*458f69efSMauro Carvalho Chehab======================================================
2*458f69efSMauro Carvalho Chehabhrtimers - subsystem for high-resolution kernel timers
3*458f69efSMauro Carvalho Chehab======================================================
4*458f69efSMauro Carvalho Chehab
5*458f69efSMauro Carvalho ChehabThis patch introduces a new subsystem for high-resolution kernel timers.
6*458f69efSMauro Carvalho Chehab
7*458f69efSMauro Carvalho ChehabOne might ask the question: we already have a timer subsystem
8*458f69efSMauro Carvalho Chehab(kernel/timers.c), why do we need two timer subsystems? After a lot of
9*458f69efSMauro Carvalho Chehabback and forth trying to integrate high-resolution and high-precision
10*458f69efSMauro Carvalho Chehabfeatures into the existing timer framework, and after testing various
11*458f69efSMauro Carvalho Chehabsuch high-resolution timer implementations in practice, we came to the
12*458f69efSMauro Carvalho Chehabconclusion that the timer wheel code is fundamentally not suitable for
13*458f69efSMauro Carvalho Chehabsuch an approach. We initially didn't believe this ('there must be a way
14*458f69efSMauro Carvalho Chehabto solve this'), and spent a considerable effort trying to integrate
15*458f69efSMauro Carvalho Chehabthings into the timer wheel, but we failed. In hindsight, there are
16*458f69efSMauro Carvalho Chehabseveral reasons why such integration is hard/impossible:
17*458f69efSMauro Carvalho Chehab
18*458f69efSMauro Carvalho Chehab- the forced handling of low-resolution and high-resolution timers in
19*458f69efSMauro Carvalho Chehab  the same way leads to a lot of compromises, macro magic and #ifdef
20*458f69efSMauro Carvalho Chehab  mess. The timers.c code is very "tightly coded" around jiffies and
21*458f69efSMauro Carvalho Chehab  32-bitness assumptions, and has been honed and micro-optimized for a
22*458f69efSMauro Carvalho Chehab  relatively narrow use case (jiffies in a relatively narrow HZ range)
23*458f69efSMauro Carvalho Chehab  for many years - and thus even small extensions to it easily break
24*458f69efSMauro Carvalho Chehab  the wheel concept, leading to even worse compromises. The timer wheel
25*458f69efSMauro Carvalho Chehab  code is very good and tight code, there's zero problems with it in its
26*458f69efSMauro Carvalho Chehab  current usage - but it is simply not suitable to be extended for
27*458f69efSMauro Carvalho Chehab  high-res timers.
28*458f69efSMauro Carvalho Chehab
29*458f69efSMauro Carvalho Chehab- the unpredictable [O(N)] overhead of cascading leads to delays which
30*458f69efSMauro Carvalho Chehab  necessitate a more complex handling of high resolution timers, which
31*458f69efSMauro Carvalho Chehab  in turn decreases robustness. Such a design still leads to rather large
32*458f69efSMauro Carvalho Chehab  timing inaccuracies. Cascading is a fundamental property of the timer
33*458f69efSMauro Carvalho Chehab  wheel concept, it cannot be 'designed out' without inevitably
34*458f69efSMauro Carvalho Chehab  degrading other portions of the timers.c code in an unacceptable way.
35*458f69efSMauro Carvalho Chehab
36*458f69efSMauro Carvalho Chehab- the implementation of the current posix-timer subsystem on top of
37*458f69efSMauro Carvalho Chehab  the timer wheel has already introduced a quite complex handling of
38*458f69efSMauro Carvalho Chehab  the required readjusting of absolute CLOCK_REALTIME timers at
39*458f69efSMauro Carvalho Chehab  settimeofday or NTP time - further underlying our experience by
40*458f69efSMauro Carvalho Chehab  example: that the timer wheel data structure is too rigid for high-res
41*458f69efSMauro Carvalho Chehab  timers.
42*458f69efSMauro Carvalho Chehab
43*458f69efSMauro Carvalho Chehab- the timer wheel code is most optimal for use cases which can be
44*458f69efSMauro Carvalho Chehab  identified as "timeouts". Such timeouts are usually set up to cover
45*458f69efSMauro Carvalho Chehab  error conditions in various I/O paths, such as networking and block
46*458f69efSMauro Carvalho Chehab  I/O. The vast majority of those timers never expire and are rarely
47*458f69efSMauro Carvalho Chehab  recascaded because the expected correct event arrives in time so they
48*458f69efSMauro Carvalho Chehab  can be removed from the timer wheel before any further processing of
49*458f69efSMauro Carvalho Chehab  them becomes necessary. Thus the users of these timeouts can accept
50*458f69efSMauro Carvalho Chehab  the granularity and precision tradeoffs of the timer wheel, and
51*458f69efSMauro Carvalho Chehab  largely expect the timer subsystem to have near-zero overhead.
52*458f69efSMauro Carvalho Chehab  Accurate timing for them is not a core purpose - in fact most of the
53*458f69efSMauro Carvalho Chehab  timeout values used are ad-hoc. For them it is at most a necessary
54*458f69efSMauro Carvalho Chehab  evil to guarantee the processing of actual timeout completions
55*458f69efSMauro Carvalho Chehab  (because most of the timeouts are deleted before completion), which
56*458f69efSMauro Carvalho Chehab  should thus be as cheap and unintrusive as possible.
57*458f69efSMauro Carvalho Chehab
58*458f69efSMauro Carvalho ChehabThe primary users of precision timers are user-space applications that
59*458f69efSMauro Carvalho Chehabutilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
60*458f69efSMauro Carvalho Chehabusers like drivers and subsystems which require precise timed events
61*458f69efSMauro Carvalho Chehab(e.g. multimedia) can benefit from the availability of a separate
62*458f69efSMauro Carvalho Chehabhigh-resolution timer subsystem as well.
63*458f69efSMauro Carvalho Chehab
64*458f69efSMauro Carvalho ChehabWhile this subsystem does not offer high-resolution clock sources just
65*458f69efSMauro Carvalho Chehabyet, the hrtimer subsystem can be easily extended with high-resolution
66*458f69efSMauro Carvalho Chehabclock capabilities, and patches for that exist and are maturing quickly.
67*458f69efSMauro Carvalho ChehabThe increasing demand for realtime and multimedia applications along
68*458f69efSMauro Carvalho Chehabwith other potential users for precise timers gives another reason to
69*458f69efSMauro Carvalho Chehabseparate the "timeout" and "precise timer" subsystems.
70*458f69efSMauro Carvalho Chehab
71*458f69efSMauro Carvalho ChehabAnother potential benefit is that such a separation allows even more
72*458f69efSMauro Carvalho Chehabspecial-purpose optimization of the existing timer wheel for the low
73*458f69efSMauro Carvalho Chehabresolution and low precision use cases - once the precision-sensitive
74*458f69efSMauro Carvalho ChehabAPIs are separated from the timer wheel and are migrated over to
75*458f69efSMauro Carvalho Chehabhrtimers. E.g. we could decrease the frequency of the timeout subsystem
76*458f69efSMauro Carvalho Chehabfrom 250 Hz to 100 HZ (or even smaller).
77*458f69efSMauro Carvalho Chehab
78*458f69efSMauro Carvalho Chehabhrtimer subsystem implementation details
79*458f69efSMauro Carvalho Chehab----------------------------------------
80*458f69efSMauro Carvalho Chehab
81*458f69efSMauro Carvalho Chehabthe basic design considerations were:
82*458f69efSMauro Carvalho Chehab
83*458f69efSMauro Carvalho Chehab- simplicity
84*458f69efSMauro Carvalho Chehab
85*458f69efSMauro Carvalho Chehab- data structure not bound to jiffies or any other granularity. All the
86*458f69efSMauro Carvalho Chehab  kernel logic works at 64-bit nanoseconds resolution - no compromises.
87*458f69efSMauro Carvalho Chehab
88*458f69efSMauro Carvalho Chehab- simplification of existing, timing related kernel code
89*458f69efSMauro Carvalho Chehab
90*458f69efSMauro Carvalho Chehabanother basic requirement was the immediate enqueueing and ordering of
91*458f69efSMauro Carvalho Chehabtimers at activation time. After looking at several possible solutions
92*458f69efSMauro Carvalho Chehabsuch as radix trees and hashes, we chose the red black tree as the basic
93*458f69efSMauro Carvalho Chehabdata structure. Rbtrees are available as a library in the kernel and are
94*458f69efSMauro Carvalho Chehabused in various performance-critical areas of e.g. memory management and
95*458f69efSMauro Carvalho Chehabfile systems. The rbtree is solely used for time sorted ordering, while
96*458f69efSMauro Carvalho Chehaba separate list is used to give the expiry code fast access to the
97*458f69efSMauro Carvalho Chehabqueued timers, without having to walk the rbtree.
98*458f69efSMauro Carvalho Chehab
99*458f69efSMauro Carvalho Chehab(This separate list is also useful for later when we'll introduce
100*458f69efSMauro Carvalho Chehabhigh-resolution clocks, where we need separate pending and expired
101*458f69efSMauro Carvalho Chehabqueues while keeping the time-order intact.)
102*458f69efSMauro Carvalho Chehab
103*458f69efSMauro Carvalho ChehabTime-ordered enqueueing is not purely for the purposes of
104*458f69efSMauro Carvalho Chehabhigh-resolution clocks though, it also simplifies the handling of
105*458f69efSMauro Carvalho Chehababsolute timers based on a low-resolution CLOCK_REALTIME. The existing
106*458f69efSMauro Carvalho Chehabimplementation needed to keep an extra list of all armed absolute
107*458f69efSMauro Carvalho ChehabCLOCK_REALTIME timers along with complex locking. In case of
108*458f69efSMauro Carvalho Chehabsettimeofday and NTP, all the timers (!) had to be dequeued, the
109*458f69efSMauro Carvalho Chehabtime-changing code had to fix them up one by one, and all of them had to
110*458f69efSMauro Carvalho Chehabbe enqueued again. The time-ordered enqueueing and the storage of the
111*458f69efSMauro Carvalho Chehabexpiry time in absolute time units removes all this complex and poorly
112*458f69efSMauro Carvalho Chehabscaling code from the posix-timer implementation - the clock can simply
113*458f69efSMauro Carvalho Chehabbe set without having to touch the rbtree. This also makes the handling
114*458f69efSMauro Carvalho Chehabof posix-timers simpler in general.
115*458f69efSMauro Carvalho Chehab
116*458f69efSMauro Carvalho ChehabThe locking and per-CPU behavior of hrtimers was mostly taken from the
117*458f69efSMauro Carvalho Chehabexisting timer wheel code, as it is mature and well suited. Sharing code
118*458f69efSMauro Carvalho Chehabwas not really a win, due to the different data structures. Also, the
119*458f69efSMauro Carvalho Chehabhrtimer functions now have clearer behavior and clearer names - such as
120*458f69efSMauro Carvalho Chehabhrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
121*458f69efSMauro Carvalho Chehabequivalent to del_timer() and del_timer_sync()] - so there's no direct
122*458f69efSMauro Carvalho Chehab1:1 mapping between them on the algorithmic level, and thus no real
123*458f69efSMauro Carvalho Chehabpotential for code sharing either.
124*458f69efSMauro Carvalho Chehab
125*458f69efSMauro Carvalho ChehabBasic data types: every time value, absolute or relative, is in a
126*458f69efSMauro Carvalho Chehabspecial nanosecond-resolution type: ktime_t. The kernel-internal
127*458f69efSMauro Carvalho Chehabrepresentation of ktime_t values and operations is implemented via
128*458f69efSMauro Carvalho Chehabmacros and inline functions, and can be switched between a "hybrid
129*458f69efSMauro Carvalho Chehabunion" type and a plain "scalar" 64bit nanoseconds representation (at
130*458f69efSMauro Carvalho Chehabcompile time). The hybrid union type optimizes time conversions on 32bit
131*458f69efSMauro Carvalho ChehabCPUs. This build-time-selectable ktime_t storage format was implemented
132*458f69efSMauro Carvalho Chehabto avoid the performance impact of 64-bit multiplications and divisions
133*458f69efSMauro Carvalho Chehabon 32bit CPUs. Such operations are frequently necessary to convert
134*458f69efSMauro Carvalho Chehabbetween the storage formats provided by kernel and userspace interfaces
135*458f69efSMauro Carvalho Chehaband the internal time format. (See include/linux/ktime.h for further
136*458f69efSMauro Carvalho Chehabdetails.)
137*458f69efSMauro Carvalho Chehab
138*458f69efSMauro Carvalho Chehabhrtimers - rounding of timer values
139*458f69efSMauro Carvalho Chehab-----------------------------------
140*458f69efSMauro Carvalho Chehab
141*458f69efSMauro Carvalho Chehabthe hrtimer code will round timer events to lower-resolution clocks
142*458f69efSMauro Carvalho Chehabbecause it has to. Otherwise it will do no artificial rounding at all.
143*458f69efSMauro Carvalho Chehab
144*458f69efSMauro Carvalho Chehabone question is, what resolution value should be returned to the user by
145*458f69efSMauro Carvalho Chehabthe clock_getres() interface. This will return whatever real resolution
146*458f69efSMauro Carvalho Chehaba given clock has - be it low-res, high-res, or artificially-low-res.
147*458f69efSMauro Carvalho Chehab
148*458f69efSMauro Carvalho Chehabhrtimers - testing and verification
149*458f69efSMauro Carvalho Chehab-----------------------------------
150*458f69efSMauro Carvalho Chehab
151*458f69efSMauro Carvalho ChehabWe used the high-resolution clock subsystem ontop of hrtimers to verify
152*458f69efSMauro Carvalho Chehabthe hrtimer implementation details in praxis, and we also ran the posix
153*458f69efSMauro Carvalho Chehabtimer tests in order to ensure specification compliance. We also ran
154*458f69efSMauro Carvalho Chehabtests on low-resolution clocks.
155*458f69efSMauro Carvalho Chehab
156*458f69efSMauro Carvalho ChehabThe hrtimer patch converts the following kernel functionality to use
157*458f69efSMauro Carvalho Chehabhrtimers:
158*458f69efSMauro Carvalho Chehab
159*458f69efSMauro Carvalho Chehab - nanosleep
160*458f69efSMauro Carvalho Chehab - itimers
161*458f69efSMauro Carvalho Chehab - posix-timers
162*458f69efSMauro Carvalho Chehab
163*458f69efSMauro Carvalho ChehabThe conversion of nanosleep and posix-timers enabled the unification of
164*458f69efSMauro Carvalho Chehabnanosleep and clock_nanosleep.
165*458f69efSMauro Carvalho Chehab
166*458f69efSMauro Carvalho ChehabThe code was successfully compiled for the following platforms:
167*458f69efSMauro Carvalho Chehab
168*458f69efSMauro Carvalho Chehab i386, x86_64, ARM, PPC, PPC64, IA64
169*458f69efSMauro Carvalho Chehab
170*458f69efSMauro Carvalho ChehabThe code was run-tested on the following platforms:
171*458f69efSMauro Carvalho Chehab
172*458f69efSMauro Carvalho Chehab i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
173*458f69efSMauro Carvalho Chehab
174*458f69efSMauro Carvalho Chehabhrtimers were also integrated into the -rt tree, along with a
175*458f69efSMauro Carvalho Chehabhrtimers-based high-resolution clock implementation, so the hrtimers
176*458f69efSMauro Carvalho Chehabcode got a healthy amount of testing and use in practice.
177*458f69efSMauro Carvalho Chehab
178*458f69efSMauro Carvalho Chehab	Thomas Gleixner, Ingo Molnar
179