xref: /openbmc/linux/Documentation/timers/hrtimers.rst (revision 1ac731c529cd4d6adbce134754b51ff7d822b145)
1458f69efSMauro Carvalho Chehab======================================================
2458f69efSMauro Carvalho Chehabhrtimers - subsystem for high-resolution kernel timers
3458f69efSMauro Carvalho Chehab======================================================
4458f69efSMauro Carvalho Chehab
5458f69efSMauro Carvalho ChehabThis patch introduces a new subsystem for high-resolution kernel timers.
6458f69efSMauro Carvalho Chehab
7458f69efSMauro Carvalho ChehabOne might ask the question: we already have a timer subsystem
8458f69efSMauro Carvalho Chehab(kernel/timers.c), why do we need two timer subsystems? After a lot of
9458f69efSMauro Carvalho Chehabback and forth trying to integrate high-resolution and high-precision
10458f69efSMauro Carvalho Chehabfeatures into the existing timer framework, and after testing various
11458f69efSMauro Carvalho Chehabsuch high-resolution timer implementations in practice, we came to the
12458f69efSMauro Carvalho Chehabconclusion that the timer wheel code is fundamentally not suitable for
13458f69efSMauro Carvalho Chehabsuch an approach. We initially didn't believe this ('there must be a way
14458f69efSMauro Carvalho Chehabto solve this'), and spent a considerable effort trying to integrate
15458f69efSMauro Carvalho Chehabthings into the timer wheel, but we failed. In hindsight, there are
16458f69efSMauro Carvalho Chehabseveral reasons why such integration is hard/impossible:
17458f69efSMauro Carvalho Chehab
18458f69efSMauro Carvalho Chehab- the forced handling of low-resolution and high-resolution timers in
19458f69efSMauro Carvalho Chehab  the same way leads to a lot of compromises, macro magic and #ifdef
20458f69efSMauro Carvalho Chehab  mess. The timers.c code is very "tightly coded" around jiffies and
21458f69efSMauro Carvalho Chehab  32-bitness assumptions, and has been honed and micro-optimized for a
22458f69efSMauro Carvalho Chehab  relatively narrow use case (jiffies in a relatively narrow HZ range)
23458f69efSMauro Carvalho Chehab  for many years - and thus even small extensions to it easily break
24458f69efSMauro Carvalho Chehab  the wheel concept, leading to even worse compromises. The timer wheel
25458f69efSMauro Carvalho Chehab  code is very good and tight code, there's zero problems with it in its
26458f69efSMauro Carvalho Chehab  current usage - but it is simply not suitable to be extended for
27458f69efSMauro Carvalho Chehab  high-res timers.
28458f69efSMauro Carvalho Chehab
29458f69efSMauro Carvalho Chehab- the unpredictable [O(N)] overhead of cascading leads to delays which
30458f69efSMauro Carvalho Chehab  necessitate a more complex handling of high resolution timers, which
31458f69efSMauro Carvalho Chehab  in turn decreases robustness. Such a design still leads to rather large
32458f69efSMauro Carvalho Chehab  timing inaccuracies. Cascading is a fundamental property of the timer
33458f69efSMauro Carvalho Chehab  wheel concept, it cannot be 'designed out' without inevitably
34458f69efSMauro Carvalho Chehab  degrading other portions of the timers.c code in an unacceptable way.
35458f69efSMauro Carvalho Chehab
36458f69efSMauro Carvalho Chehab- the implementation of the current posix-timer subsystem on top of
37458f69efSMauro Carvalho Chehab  the timer wheel has already introduced a quite complex handling of
38458f69efSMauro Carvalho Chehab  the required readjusting of absolute CLOCK_REALTIME timers at
39458f69efSMauro Carvalho Chehab  settimeofday or NTP time - further underlying our experience by
40458f69efSMauro Carvalho Chehab  example: that the timer wheel data structure is too rigid for high-res
41458f69efSMauro Carvalho Chehab  timers.
42458f69efSMauro Carvalho Chehab
43458f69efSMauro Carvalho Chehab- the timer wheel code is most optimal for use cases which can be
44458f69efSMauro Carvalho Chehab  identified as "timeouts". Such timeouts are usually set up to cover
45458f69efSMauro Carvalho Chehab  error conditions in various I/O paths, such as networking and block
46458f69efSMauro Carvalho Chehab  I/O. The vast majority of those timers never expire and are rarely
47458f69efSMauro Carvalho Chehab  recascaded because the expected correct event arrives in time so they
48458f69efSMauro Carvalho Chehab  can be removed from the timer wheel before any further processing of
49458f69efSMauro Carvalho Chehab  them becomes necessary. Thus the users of these timeouts can accept
50458f69efSMauro Carvalho Chehab  the granularity and precision tradeoffs of the timer wheel, and
51458f69efSMauro Carvalho Chehab  largely expect the timer subsystem to have near-zero overhead.
52458f69efSMauro Carvalho Chehab  Accurate timing for them is not a core purpose - in fact most of the
53458f69efSMauro Carvalho Chehab  timeout values used are ad-hoc. For them it is at most a necessary
54458f69efSMauro Carvalho Chehab  evil to guarantee the processing of actual timeout completions
55458f69efSMauro Carvalho Chehab  (because most of the timeouts are deleted before completion), which
56458f69efSMauro Carvalho Chehab  should thus be as cheap and unintrusive as possible.
57458f69efSMauro Carvalho Chehab
58458f69efSMauro Carvalho ChehabThe primary users of precision timers are user-space applications that
59458f69efSMauro Carvalho Chehabutilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
60458f69efSMauro Carvalho Chehabusers like drivers and subsystems which require precise timed events
61458f69efSMauro Carvalho Chehab(e.g. multimedia) can benefit from the availability of a separate
62458f69efSMauro Carvalho Chehabhigh-resolution timer subsystem as well.
63458f69efSMauro Carvalho Chehab
64458f69efSMauro Carvalho ChehabWhile this subsystem does not offer high-resolution clock sources just
65458f69efSMauro Carvalho Chehabyet, the hrtimer subsystem can be easily extended with high-resolution
66458f69efSMauro Carvalho Chehabclock capabilities, and patches for that exist and are maturing quickly.
67458f69efSMauro Carvalho ChehabThe increasing demand for realtime and multimedia applications along
68458f69efSMauro Carvalho Chehabwith other potential users for precise timers gives another reason to
69458f69efSMauro Carvalho Chehabseparate the "timeout" and "precise timer" subsystems.
70458f69efSMauro Carvalho Chehab
71458f69efSMauro Carvalho ChehabAnother potential benefit is that such a separation allows even more
72458f69efSMauro Carvalho Chehabspecial-purpose optimization of the existing timer wheel for the low
73458f69efSMauro Carvalho Chehabresolution and low precision use cases - once the precision-sensitive
74458f69efSMauro Carvalho ChehabAPIs are separated from the timer wheel and are migrated over to
75458f69efSMauro Carvalho Chehabhrtimers. E.g. we could decrease the frequency of the timeout subsystem
76458f69efSMauro Carvalho Chehabfrom 250 Hz to 100 HZ (or even smaller).
77458f69efSMauro Carvalho Chehab
78458f69efSMauro Carvalho Chehabhrtimer subsystem implementation details
79458f69efSMauro Carvalho Chehab----------------------------------------
80458f69efSMauro Carvalho Chehab
81458f69efSMauro Carvalho Chehabthe basic design considerations were:
82458f69efSMauro Carvalho Chehab
83458f69efSMauro Carvalho Chehab- simplicity
84458f69efSMauro Carvalho Chehab
85458f69efSMauro Carvalho Chehab- data structure not bound to jiffies or any other granularity. All the
86458f69efSMauro Carvalho Chehab  kernel logic works at 64-bit nanoseconds resolution - no compromises.
87458f69efSMauro Carvalho Chehab
88458f69efSMauro Carvalho Chehab- simplification of existing, timing related kernel code
89458f69efSMauro Carvalho Chehab
90458f69efSMauro Carvalho Chehabanother basic requirement was the immediate enqueueing and ordering of
91458f69efSMauro Carvalho Chehabtimers at activation time. After looking at several possible solutions
92458f69efSMauro Carvalho Chehabsuch as radix trees and hashes, we chose the red black tree as the basic
93458f69efSMauro Carvalho Chehabdata structure. Rbtrees are available as a library in the kernel and are
94458f69efSMauro Carvalho Chehabused in various performance-critical areas of e.g. memory management and
95458f69efSMauro Carvalho Chehabfile systems. The rbtree is solely used for time sorted ordering, while
96458f69efSMauro Carvalho Chehaba separate list is used to give the expiry code fast access to the
97458f69efSMauro Carvalho Chehabqueued timers, without having to walk the rbtree.
98458f69efSMauro Carvalho Chehab
99458f69efSMauro Carvalho Chehab(This separate list is also useful for later when we'll introduce
100458f69efSMauro Carvalho Chehabhigh-resolution clocks, where we need separate pending and expired
101458f69efSMauro Carvalho Chehabqueues while keeping the time-order intact.)
102458f69efSMauro Carvalho Chehab
103458f69efSMauro Carvalho ChehabTime-ordered enqueueing is not purely for the purposes of
104458f69efSMauro Carvalho Chehabhigh-resolution clocks though, it also simplifies the handling of
105458f69efSMauro Carvalho Chehababsolute timers based on a low-resolution CLOCK_REALTIME. The existing
106458f69efSMauro Carvalho Chehabimplementation needed to keep an extra list of all armed absolute
107458f69efSMauro Carvalho ChehabCLOCK_REALTIME timers along with complex locking. In case of
108458f69efSMauro Carvalho Chehabsettimeofday and NTP, all the timers (!) had to be dequeued, the
109458f69efSMauro Carvalho Chehabtime-changing code had to fix them up one by one, and all of them had to
110458f69efSMauro Carvalho Chehabbe enqueued again. The time-ordered enqueueing and the storage of the
111458f69efSMauro Carvalho Chehabexpiry time in absolute time units removes all this complex and poorly
112458f69efSMauro Carvalho Chehabscaling code from the posix-timer implementation - the clock can simply
113458f69efSMauro Carvalho Chehabbe set without having to touch the rbtree. This also makes the handling
114458f69efSMauro Carvalho Chehabof posix-timers simpler in general.
115458f69efSMauro Carvalho Chehab
116458f69efSMauro Carvalho ChehabThe locking and per-CPU behavior of hrtimers was mostly taken from the
117458f69efSMauro Carvalho Chehabexisting timer wheel code, as it is mature and well suited. Sharing code
118458f69efSMauro Carvalho Chehabwas not really a win, due to the different data structures. Also, the
119458f69efSMauro Carvalho Chehabhrtimer functions now have clearer behavior and clearer names - such as
120458f69efSMauro Carvalho Chehabhrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
12187bdd932SThomas Gleixnerequivalent to timer_delete() and timer_delete_sync()] - so there's no direct
122458f69efSMauro Carvalho Chehab1:1 mapping between them on the algorithmic level, and thus no real
123458f69efSMauro Carvalho Chehabpotential for code sharing either.
124458f69efSMauro Carvalho Chehab
125458f69efSMauro Carvalho ChehabBasic data types: every time value, absolute or relative, is in a
126*4c093cbbSGeert Uytterhoevenspecial nanosecond-resolution 64bit type: ktime_t.
127*4c093cbbSGeert Uytterhoeven(Originally, the kernel-internal representation of ktime_t values and
128*4c093cbbSGeert Uytterhoevenoperations was implemented via macros and inline functions, and could be
129*4c093cbbSGeert Uytterhoevenswitched between a "hybrid union" type and a plain "scalar" 64bit
130*4c093cbbSGeert Uytterhoevennanoseconds representation (at compile time). This was abandoned in the
131*4c093cbbSGeert Uytterhoevencontext of the Y2038 work.)
132458f69efSMauro Carvalho Chehab
133458f69efSMauro Carvalho Chehabhrtimers - rounding of timer values
134458f69efSMauro Carvalho Chehab-----------------------------------
135458f69efSMauro Carvalho Chehab
136458f69efSMauro Carvalho Chehabthe hrtimer code will round timer events to lower-resolution clocks
137458f69efSMauro Carvalho Chehabbecause it has to. Otherwise it will do no artificial rounding at all.
138458f69efSMauro Carvalho Chehab
139458f69efSMauro Carvalho Chehabone question is, what resolution value should be returned to the user by
140458f69efSMauro Carvalho Chehabthe clock_getres() interface. This will return whatever real resolution
141458f69efSMauro Carvalho Chehaba given clock has - be it low-res, high-res, or artificially-low-res.
142458f69efSMauro Carvalho Chehab
143458f69efSMauro Carvalho Chehabhrtimers - testing and verification
144458f69efSMauro Carvalho Chehab-----------------------------------
145458f69efSMauro Carvalho Chehab
146458f69efSMauro Carvalho ChehabWe used the high-resolution clock subsystem on top of hrtimers to verify
147458f69efSMauro Carvalho Chehabthe hrtimer implementation details in praxis, and we also ran the posix
148458f69efSMauro Carvalho Chehabtimer tests in order to ensure specification compliance. We also ran
149458f69efSMauro Carvalho Chehabtests on low-resolution clocks.
150458f69efSMauro Carvalho Chehab
151458f69efSMauro Carvalho ChehabThe hrtimer patch converts the following kernel functionality to use
152458f69efSMauro Carvalho Chehabhrtimers:
153458f69efSMauro Carvalho Chehab
154458f69efSMauro Carvalho Chehab - nanosleep
155458f69efSMauro Carvalho Chehab - itimers
156458f69efSMauro Carvalho Chehab - posix-timers
157458f69efSMauro Carvalho Chehab
158458f69efSMauro Carvalho ChehabThe conversion of nanosleep and posix-timers enabled the unification of
159458f69efSMauro Carvalho Chehabnanosleep and clock_nanosleep.
160458f69efSMauro Carvalho Chehab
161458f69efSMauro Carvalho ChehabThe code was successfully compiled for the following platforms:
162458f69efSMauro Carvalho Chehab
163458f69efSMauro Carvalho Chehab i386, x86_64, ARM, PPC, PPC64, IA64
164458f69efSMauro Carvalho Chehab
165458f69efSMauro Carvalho ChehabThe code was run-tested on the following platforms:
166458f69efSMauro Carvalho Chehab
167458f69efSMauro Carvalho Chehab i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
168458f69efSMauro Carvalho Chehab
169458f69efSMauro Carvalho Chehabhrtimers were also integrated into the -rt tree, along with a
170458f69efSMauro Carvalho Chehabhrtimers-based high-resolution clock implementation, so the hrtimers
171458f69efSMauro Carvalho Chehabcode got a healthy amount of testing and use in practice.
172458f69efSMauro Carvalho Chehab
173458f69efSMauro Carvalho Chehab	Thomas Gleixner, Ingo Molnar
174