.. SPDX-License-Identifier: GPL-2.0

======================================================
Timekeeping Virtualization for X86-Based Architectures
======================================================

:Author: Zachary Amsden <zamsden@redhat.com>
:Copyright: (c) 2010, Red Hat.  All rights reserved.

.. Contents

   1) Overview
   2) Timing Devices
   3) TSC Hardware
   4) Virtualization Problems

1. Overview
===========

One of the most complicated parts of the X86 platform, and specifically of
the virtualization of this platform, is the plethora of timing devices
available and the complexity of emulating those devices.  In addition,
virtualization of time introduces a new set of challenges because it
introduces a multiplexed division of time beyond the control of the guest CPU.

First, we will describe the various timekeeping hardware available, then
present some of the problems which arise and solutions available, giving
specific recommendations for certain classes of KVM guests.

The purpose of this document is to collect data and information relevant to
timekeeping which may be difficult to find elsewhere, specifically,
information relevant to KVM and hardware-based virtualization.

2. Timing Devices
=================

First we discuss the basic hardware devices available.  TSC and the related
KVM clock are special enough to warrant a full exposition and are described in
the following section.

2.1. i8254 - PIT
----------------

One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.
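
As a worked example of programming a periodic rate, the reload value for a
channel follows from dividing the base clock by the desired interrupt rate.
The sketch below is illustrative only: the helper name ``pit_reload_for_hz``
is hypothetical, the lower clamp is an assumption of this sketch, and it
relies on the i8254 convention that a programmed count of 0 is interpreted as
65536::

  #include <stdint.h>

  #define PIT_BASE_HZ 1193182u    /* base clock quoted above */

  /*
   * Hypothetical helper: 16-bit reload value for a periodic channel that
   * should fire at roughly hz.  Returns 0 (meaning 65536 counts, about
   * 18.2 Hz) when the requested rate is too slow to represent.
   */
  static uint16_t pit_reload_for_hz(uint32_t hz)
  {
      uint32_t count;

      if (hz == 0)
          return 0;
      count = (PIT_BASE_HZ + hz / 2) / hz;    /* round to nearest */
      if (count > 65535)
          return 0;
      if (count < 2)
          count = 2;      /* assumption: avoid degenerate counts */
      return (uint16_t)count;
  }

For instance, a requested 1000 Hz tick gives a reload value of 1193, which
really produces about 1000.15 Hz.  Rounding of this kind is one reason why
time kept purely by counting ticks drifts against wall-clock time.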
52
53The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done
54using single or multiple byte access to the I/O ports.  There are 6 modes
55available, but not all modes are available to all timers, as only timer 2
56has a connected gate input, required for modes 1 and 5.  The gate line is
57controlled by port 61h, bit 0, as illustrated in the following diagram::
58
59  --------------             ----------------
60  |            |           |                |
61  |  1.1932 MHz|---------->| CLOCK      OUT | ---------> IRQ 0
62  |    Clock   |   |       |                |
63  --------------   |    +->| GATE  TIMER 0  |
64                   |        ----------------
65                   |
66                   |        ----------------
67                   |       |                |
68                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
69                   |       |                |            (aka /dev/null)
70                   |    +->| GATE  TIMER 1  |
71                   |        ----------------
72                   |
73                   |        ----------------
74                   |       |                |
75                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
76                           |                |      |
77  Port 61h, bit 0 -------->| GATE  TIMER 2  |       \_.----   ____
78                            ----------------         _|    )--|LPF|---Speaker
79                                                    / *----   \___/
80  Port 61h, bit 1 ---------------------------------/
81
The timer modes are now described.

Mode 0: Single Timeout.
 This is a one-shot software timeout that counts down
 when the gate is high (always true for timers 0 and 1).  When the count
 reaches zero, the output goes high.

Mode 1: Triggered One-shot.
 The output is initially set high.  When the gate
 line is set high, a countdown is initiated (which does not stop if the gate is
 lowered), during which the output is set low.  When the count reaches zero,
 the output goes high.

Mode 2: Rate Generator.
 The output is initially set high.  When the countdown
 reaches 1, the output goes low for one count and then returns high.  The value
 is reloaded and the countdown automatically resumes.  If the gate line goes
 low, the count is halted.  If the output is low when the gate is lowered, the
 output automatically goes high (this only affects timer 2).

Mode 3: Square Wave.
 This generates a high / low square wave.  The count
 determines the length of the pulse, which alternates between high and low
 when zero is reached.  The count only proceeds when gate is high and is
 automatically reloaded on reaching zero.  The count is decremented twice at
 each clock to generate a full high / low cycle at the full periodic rate.
 If the count is even, the clock remains high for N/2 counts and low for N/2
 counts; if the count is odd, the clock is high for (N+1)/2 counts and low
 for (N-1)/2 counts.  Only even values are latched by the counter, so odd
 values are not observed when reading.  This is the intended mode for timer 2,
 which generates sine-like tones by low-pass filtering the square wave output.

Mode 4: Software Strobe.
 After programming this mode and loading the counter,
 the output remains high until the counter reaches zero.  Then the output
 goes low for 1 clock cycle and returns high.  The counter is not reloaded.
 Counting only occurs when gate is high.

Mode 5: Hardware Strobe.
 After programming and loading the counter, the
 output remains high.  When the gate is raised, a countdown is initiated
 (which does not stop if the gate is lowered).  When the counter reaches zero,
 the output goes low for 1 clock cycle and then returns high.  The counter is
 not reloaded.
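
The Mode 3 duty cycle described above can be written down directly.  This is
a minimal sketch with hypothetical names (``pit_mode3_phases`` and
``pit_tone_hz``), not code taken from any driver::

  #include <stdint.h>

  struct pit_square_wave {
      uint32_t high_counts;   /* input clocks spent with OUT high */
      uint32_t low_counts;    /* input clocks spent with OUT low */
  };

  /* Split a Mode 3 count n into its high and low phases. */
  static struct pit_square_wave pit_mode3_phases(uint32_t n)
  {
      struct pit_square_wave w;

      w.high_counts = (n + 1) / 2;    /* N/2 if even, (N+1)/2 if odd */
      w.low_counts = n / 2;           /* N/2 if even, (N-1)/2 if odd */
      return w;
  }

  /* Approximate output frequency in Hz for count n (n must be non-zero). */
  static uint32_t pit_tone_hz(uint32_t n)
  {
      return 1193182u / n;
  }

A count of 2712 on timer 2, for example, gives a speaker tone of roughly
440 Hz.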

In addition to normal binary counting, the PIT supports BCD counting.  The
command port, 0x43, is used to set the counter and mode for each of the three
timers.

PIT commands are issued to port 0x43 using the following bit encoding::

  Bit 7-4: Command (See table below)
  Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined)
  Bit 0  : Binary (0) / BCD (1)

Command table::

  0000 - Latch Timer 0 count for port 0x40
        sample and hold the count to be read in port 0x40;
        additional commands ignored until counter is read;
        mode bits ignored.

  0001 - Set Timer 0 LSB mode for port 0x40
        set timer to read LSB only and force MSB to zero;
        mode bits set timer mode

  0010 - Set Timer 0 MSB mode for port 0x40
        set timer to read MSB only and force LSB to zero;
        mode bits set timer mode

  0011 - Set Timer 0 16-bit mode for port 0x40
        set timer to read / write LSB first, then MSB;
        mode bits set timer mode

  0100 - Latch Timer 1 count for port 0x41 - as described above
  0101 - Set Timer 1 LSB mode for port 0x41 - as described above
  0110 - Set Timer 1 MSB mode for port 0x41 - as described above
  0111 - Set Timer 1 16-bit mode for port 0x41 - as described above

  1000 - Latch Timer 2 count for port 0x42 - as described above
  1001 - Set Timer 2 LSB mode for port 0x42 - as described above
  1010 - Set Timer 2 MSB mode for port 0x42 - as described above
  1011 - Set Timer 2 16-bit mode for port 0x42 - as described above

  1101 - General counter latch
        Latch combination of counters into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0
        Bit 0 = Unused

  1110 - Latch timer status
        Latch combination of counter mode into corresponding ports
        Bit 3 = Counter 2
        Bit 2 = Counter 1
        Bit 1 = Counter 0

        The output of ports 0x40-0x42 following this command will be:

        Bit 7 = Output pin
        Bit 6 = Count loaded (0 if timer has expired)
        Bit 5-4 = Read / Write mode
            01 = LSB only
            10 = MSB only
            11 = LSB / MSB (16-bit)
        Bit 3-1 = Mode
        Bit 0 = Binary (0) / BCD mode (1)
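
The bit layout above can be captured in a pair of small helpers.  This is a
sketch only; the names ``pit_command`` and ``pit_decode_status`` are
hypothetical, and the access field follows the encoding of the command table
(01 = LSB, 10 = MSB, 11 = LSB then MSB)::

  #include <stdint.h>

  /* Access field (bits 5-4), following the command table above. */
  enum pit_access {
      PIT_ACCESS_LATCH = 0,
      PIT_ACCESS_LSB = 1,
      PIT_ACCESS_MSB = 2,
      PIT_ACCESS_LSB_MSB = 3
  };

  /* Build the byte written to port 0x43 for timer 0-2 and mode 0-5. */
  static uint8_t pit_command(unsigned timer, enum pit_access access,
                             unsigned mode, int bcd)
  {
      return (uint8_t)(((timer & 3u) << 6) |
                       (((unsigned)access & 3u) << 4) |
                       ((mode & 7u) << 1) |
                       (bcd ? 1u : 0u));
  }

  struct pit_status {
      int output_high;        /* bit 7 */
      unsigned access;        /* bits 5-4 */
      unsigned mode;          /* bits 3-1 */
      int bcd;                /* bit 0 */
  };

  /* Decode a status byte read back after the 1110 command. */
  static struct pit_status pit_decode_status(uint8_t s)
  {
      struct pit_status st;

      st.output_high = (s >> 7) & 1;
      st.access = (s >> 4) & 3u;
      st.mode = (s >> 1) & 7u;
      st.bcd = s & 1;
      return st;
  }

With this encoding, ``pit_command(0, PIT_ACCESS_LSB_MSB, 2, 0)`` evaluates to
0x34 (timer 0 as a rate generator) and
``pit_command(2, PIT_ACCESS_LSB_MSB, 3, 0)`` to 0xB6 (timer 2 as a square
wave).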

2.2. RTC
--------

The second device which was available in the original PC was the MC146818 real
time clock.  The original device is now obsolete, and usually emulated by the
system chipset, sometimes by an HPET and some Frankenstein IRQ routing.

The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).

The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt
can function as a periodic timer or as a once-a-day alarm, and can also be
issued after an update of the CMOS registers by the MC146818 is complete.
The type of interrupt is signalled in the RTC status registers.

The RTC will update the current time fields by battery power even while the
system is off.  The current time fields should not be read while an update is
in progress, as indicated in the status register.

The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be
programmed to a 32kHz divider if the RTC is to count seconds.

This is the RAM map originally used for the RTC/CMOS::

  Location    Size    Description
  ------------------------------------------
  00h         byte    Current second (BCD)
  01h         byte    Seconds alarm (BCD)
  02h         byte    Current minute (BCD)
  03h         byte    Minutes alarm (BCD)
  04h         byte    Current hour (BCD)
  05h         byte    Hours alarm (BCD)
  06h         byte    Current day of week (BCD)
  07h         byte    Current day of month (BCD)
  08h         byte    Current month (BCD)
  09h         byte    Current year (BCD)
  0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                 0000 = periodic timer disabled
                                 0001 = 3.90625 ms
                                 0010 = 7.8125 ms
                                 0011 = .122070 ms
                                 0100 = .244141 ms
                                    ...
                                 1101 = 125 ms
                                 1110 = 250 ms
                                 1111 = 500 ms
  0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave output enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
  0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
  0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
  32h         byte    Current century (BCD) (*)
  (*) location vendor specific and now determined from ACPI global tables
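
To illustrate how the fields in this map are typically interpreted, the
sketch below decodes a BCD time field and the periodic rate held in register
A.  The function names are hypothetical, and the rate calculation assumes the
32.768 kHz time base, for which rate codes 0001 and 0010 give 256 Hz and
128 Hz while the remaining codes halve the frequency at each step down to
2 Hz::

  #include <stdint.h>

  /* Convert one packed BCD byte (e.g. 0x59) to binary (59). */
  static unsigned rtc_bcd_to_bin(uint8_t bcd)
  {
      return (unsigned)(bcd >> 4) * 10u + (bcd & 0x0Fu);
  }

  /* Non-zero while the time fields are being updated (register A, bit 7). */
  static int rtc_update_in_progress(uint8_t reg_a)
  {
      return (reg_a >> 7) & 1;
  }

  /* Periodic interrupt rate in Hz for register A; 0 means disabled. */
  static unsigned rtc_periodic_hz(uint8_t reg_a)
  {
      unsigned rate = reg_a & 0x0Fu;

      if (rate == 0)
          return 0;
      if (rate <= 2)
          rate += 7;      /* codes 1 and 2 alias to 256 Hz and 128 Hz */
      return 32768u >> (rate - 1);
  }

A register A value of 0x26 (32 kHz divider, rate code 0110) therefore
corresponds to a 1024 Hz periodic interrupt.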

2.3. APIC
---------

On Pentium and later processors, an on-board timer is available to each CPU
as part of the Advanced Programmable Interrupt Controller.  The APIC is
accessed through memory-mapped registers and provides interrupt service to each
CPU, used for IPIs and local timer interrupts.

Although in theory the APIC is a safe and stable source for local interrupts,
in practice, many bugs and glitches have occurred due to the special nature of
the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect
the use of the APIC and that workarounds may be required.  In addition, some of
these workarounds pose unique constraints for virtualization - requiring either
extra overhead incurred from extra reads of memory-mapped I/O or additional
functionality that may be more computationally expensive to implement.

Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the timer entry of the LVT (local vector table),
is capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.
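
As a rough sketch of the arithmetic involved, the timer period follows from
the initial count, the divider and the bus clock.  The function names below
are hypothetical, ``bus_hz`` is an assumption that real software has to
obtain by calibration, and the divider decoding follows the layout given in
the vendor manuals (value held in bits 3, 1 and 0, with 111 meaning divide
by 1)::

  #include <stdint.h>

  /* Decode the divide configuration value into a divisor (1 to 128). */
  static unsigned apic_timer_divisor(uint32_t dcr)
  {
      /* Bit 2 is reserved; gather bits 3, 1 and 0 into a 3-bit value. */
      unsigned v = ((dcr >> 1) & 4u) | (dcr & 3u);

      return (v == 7) ? 1u : (2u << v);
  }

  /* Time in ns for initial_count to expire, given a calibrated bus_hz. */
  static uint64_t apic_timer_period_ns(uint32_t initial_count, uint32_t dcr,
                                       uint64_t bus_hz)
  {
      uint64_t ticks = (uint64_t)initial_count * apic_timer_divisor(dcr);

      if (bus_hz == 0)
          return 0;
      /* Split the division so the intermediate product cannot overflow. */
      return (ticks / bus_hz) * 1000000000ull +
             (ticks % bus_hz) * 1000000000ull / bus_hz;
  }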

2.4. HPET
---------

HPET is quite complex, and was originally intended to replace the PIT / RTC
support of the X86 PC.  It remains to be seen whether that will be the case, as
the de facto standard of PC hardware is to emulate these older devices.  Some
systems designated as legacy free may support only the HPET as a hardware timer
device.

The HPET spec is rather loose and vague, requiring at least 3 hardware timers,
but allowing implementation freedom to support many more.  It also imposes no
fixed rate on the timer frequency, but does impose some extremal values on
frequency, error and slew.

In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.
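
Because the rate is not fixed, software has to derive it from the counter
period that the HPET reports in femtoseconds.  The following sketch shows
that conversion; the function names are hypothetical, and the upper bound
reflects the limit of 100 ns per tick (a minimum of 10 MHz) given in the HPET
specification::

  #include <stdint.h>

  #define HPET_MAX_PERIOD_FS 100000000u   /* 100 ns, i.e. at least 10 MHz */

  /* Counter frequency in Hz, or 0 if the reported period is implausible. */
  static uint64_t hpet_freq_hz(uint32_t period_fs)
  {
      if (period_fs == 0 || period_fs > HPET_MAX_PERIOD_FS)
          return 0;
      return 1000000000000000ull / period_fs;
  }

  /*
   * Convert a tick count to nanoseconds (1 ns = 1000000 fs).  The division
   * is split so that the intermediate product stays within 64 bits for any
   * realistic uptime.
   */
  static uint64_t hpet_ticks_to_ns(uint64_t ticks, uint32_t period_fs)
  {
      return (ticks / 1000000u) * period_fs +
             (ticks % 1000000u) * period_fs / 1000000u;
  }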

Detailed specification of the HPET is beyond the current scope of this
document, as it is also very well documented elsewhere.

2.5. Offboard Timers
--------------------

Several cards, both proprietary (watchdog boards) and commonplace (e1000), have
timing chips built into the cards which may have registers which are accessible
to kernel or user drivers.  To the author's knowledge, using these to generate
a clocksource for a Linux or other kernel has not yet been attempted and is in
general frowned upon as not playing by the agreed rules of the game.  Such a
timer device would require additional support to be virtualized properly and is
not considered important at this time as no known operating system does this.

3. TSC Hardware
===============

The TSC or time stamp counter is relatively simple in theory; it counts
instruction cycles issued by the processor, which can be used as a measure of
time.  In practice, due to a number of problems, it is the most complicated
timekeeping device to use.

The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, it was
possible to write the TSC, but hardware limitations on older processors
generally meant that only the low 32 bits of the 64-bit counter could be
written, and the upper 32 bits of the counter were cleared.  Now, however, on
Intel processors family 0Fh, for models 3, 4 and 6, and family 06h, models e
and f, this restriction has been lifted and all 64 bits are writable.  On AMD
systems, the ability to write the TSC MSR is not an architectural guarantee.

The TSC is always accessible from CPL 0.  Access from CPL > 0 software is
conditional on the CR4.TSD bit, which, when set, disables CPL > 0 TSC access.

Some vendors have implemented an additional instruction, RDTSCP, which returns
atomically not just the TSC, but an indicator which corresponds to the
processor number.  This can be used to index into an array of TSC variables to
determine offset information in SMP systems where TSCs are not synchronized.
The presence of this instruction must be determined by consulting CPUID feature
bits.
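
The indexing scheme described for RDTSCP might look like the following
sketch.  All names here are hypothetical.  It assumes the processor indicator
has already been reduced to a plain CPU index (how the indicator is encoded
is chosen by the operating system) and that a table of per-CPU offsets has
been measured elsewhere::

  #include <stddef.h>
  #include <stdint.h>

  /* One RDTSCP result: the counter and the CPU it was read on. */
  struct tsc_sample {
      uint64_t tsc;
      uint32_t cpu;
  };

  /*
   * Apply the per-CPU correction to a sample.  Returns 0 on success or -1
   * if the CPU index is outside the table.
   */
  static int tsc_normalize(const struct tsc_sample *s, const int64_t *offsets,
                           size_t ncpus, uint64_t *out)
  {
      if (s->cpu >= ncpus)
          return -1;
      /* Unsigned wrap-around makes negative offsets subtract correctly. */
      *out = s->tsc + (uint64_t)offsets[s->cpu];
      return 0;
  }

Because the counter and the indicator are returned by one instruction, the
offset applied is always the one belonging to the CPU that produced the
counter value, even if the thread migrates immediately afterwards.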

Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.
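
The effect of the offset and of scaling can be summarised arithmetically.
This is a sketch under stated assumptions rather than a description of any
particular hardware: the function names are hypothetical, the scaling ratio
is modelled as a fixed-point multiplier whose number of fractional bits
(``frac_bits``) is left as a parameter, and the 128-bit intermediate relies
on the ``__int128`` extension of GCC and Clang::

  #include <stdint.h>

  /* Offset that makes the guest observe guest_tsc when the host reads
   * host_tsc; unsigned wrap-around encodes negative offsets. */
  static uint64_t tsc_offset_for(uint64_t host_tsc, uint64_t guest_tsc)
  {
      return guest_tsc - host_tsc;
  }

  /* Guest-visible TSC with offsetting only. */
  static uint64_t guest_tsc_read(uint64_t host_tsc, uint64_t offset)
  {
      return host_tsc + offset;
  }

  /* Guest-visible TSC with fixed-point scaling followed by offsetting. */
  static uint64_t guest_tsc_read_scaled(uint64_t host_tsc, uint64_t ratio,
                                        unsigned frac_bits, uint64_t offset)
  {
      unsigned __int128 scaled = (unsigned __int128)host_tsc * ratio;

      return (uint64_t)(scaled >> frac_bits) + offset;
  }

For example, a guest whose TSC should read zero when the host counter is at
5000 is given an offset of -5000 (modulo 2^64); a later host reading of 6000
is then seen by the guest as 1000.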

3.1. TSC synchronization
------------------------

The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.

The BIOS may attempt to resynchronize the TSCs during the power-on process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64 bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system
software, but in practice, getting a perfectly synchronized TSC will not be
possible unless all values are read from the same clock, which generally is
only possible on single socket systems or those with special hardware support.

3.2. TSC and CPU hotplug
------------------------

As touched on already, CPUs which arrive later than the boot time of the system
may not have a TSC value that is synchronized with the rest of the system.
Either system software, BIOS, or SMM code may actually try to establish the TSC
to a value matching the rest of the system, but a perfect match is usually not
a guarantee.  This can have the effect of bringing a system from a state where
TSC is synchronized back to a state where TSC synchronization flaws, however
small, may be exposed to the OS and any virtualization environment.

3.3. TSC and multi-socket / NUMA
--------------------------------

Multi-socket systems, especially large multi-socket systems, are likely to have
individual clocksources rather than a single, universally distributed clock.
Since these clocks are driven by different crystals, they will not have
perfectly matched frequency, and temperature and electrical variations will
cause the CPU clocks, and thus the TSCs, to drift over time.  Depending on the
exact clock and bus design, the drift may or may not be fixed in absolute
error, and may accumulate over time.

In addition, very large systems may deliberately slew the clocks of individual
cores.  This technique, known as spread-spectrum clocking, reduces EMI at the
clock frequency and harmonics of it, which may be required to pass FCC
standards for telecommunications and computer equipment.

It is recommended not to trust the TSCs to remain synchronized on NUMA or
multiple socket systems for these reasons.

3.4. TSC and C-states
---------------------

C-states, or idling states of the processor, especially C1E and deeper sleep
states may be problematic for TSC as well.  The TSC may stop advancing in such
a state, resulting in a TSC which is behind that of other CPUs when execution
is resumed.  Such CPUs must be detected and flagged by the operating system
based on CPU and chipset identifications.

The TSC in such a case may be corrected by catching it up to a known external
clocksource.

3.5. TSC frequency change / P-states
------------------------------------

To make things slightly more interesting, some CPUs may change frequency.  They
may or may not run the TSC at the same rate, and because the frequency change
may be staggered or slewed, at some points in time, the TSC rate may not be
known other than falling within a range of values.  In this case, the TSC will
not be a stable time source, and must be calibrated against a known, stable,
external clock to be a usable source of time.
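
Calibration against an external clock amounts to a ratio of two deltas taken
over the same interval.  The sketch below uses a hypothetical function name
and relies on the ``__int128`` extension of GCC and Clang to keep the
intermediate product from overflowing; how the two deltas are sampled is left
out::

  #include <stdint.h>

  /*
   * Estimate the TSC frequency in Hz from tsc_delta cycles observed while a
   * reference clock running at ref_hz advanced by ref_delta ticks.
   * Returns 0 if no reference ticks elapsed.
   */
  static uint64_t tsc_calibrate_hz(uint64_t tsc_delta, uint64_t ref_delta,
                                   uint64_t ref_hz)
  {
      if (ref_delta == 0)
          return 0;
      return (uint64_t)(((unsigned __int128)tsc_delta * ref_hz) / ref_delta);
  }

With the PIT as the reference (1193182 Hz), observing 30000000 TSC cycles
across 11932 PIT ticks, which is about 10 ms, gives an estimate close to
3 GHz.  The estimate is only valid for as long as the TSC rate does not
change again.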

Whether the TSC runs at a constant rate or scales with the P-state is model
dependent and must be determined by inspecting CPUID, chipset or vendor
specific MSR fields.

In addition, some vendors have known bugs where the P-state is actually
compensated for properly during normal operation, but when the processor is
inactive, the P-state may be raised temporarily to service cache misses from
other processors.  In such cases, the TSC on halted CPUs could advance faster
than that of non-halted processors.  AMD Turion processors are known to have
this problem.

3.6. TSC and STPCLK / T-states
------------------------------

External signals given to the processor may also have the effect of stopping
the TSC.  This is typically done for thermal emergency power control to prevent
an overheating condition, and typically, there is no way to detect that this
condition has happened.

3.7. TSC virtualization - VMX
-----------------------------

VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
field specified in the VMCS.  Special instructions must be used to read and
write the VMCS field.

3.8. TSC virtualization - SVM
-----------------------------

SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
instructions, which is enough for full virtualization of TSC in any manner.  In
addition, SVM allows passing through the host TSC plus an additional offset
field specified in the SVM control block.

3.9. TSC feature bits in Linux
------------------------------

In summary, there is no way to guarantee the TSC remains in perfect
synchronization unless it is explicitly guaranteed by the architecture.  Even
if so, the TSCs in multi-socket or NUMA systems may still run independently
despite being locally consistent.

The following feature bits are used by Linux to signal various TSC attributes,
but they can only be taken to be meaningful for UP or single node systems.

=========================       =======================================
X86_FEATURE_TSC                 The TSC is available in hardware
X86_FEATURE_RDTSCP              The RDTSCP instruction is available
X86_FEATURE_CONSTANT_TSC        The TSC rate is unchanged with P-states
X86_FEATURE_NONSTOP_TSC         The TSC does not stop in C-states
X86_FEATURE_TSC_RELIABLE        TSC sync checks are skipped (VMware)
=========================       =======================================
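
From userspace these bits are visible as lower-case names such as ``tsc``,
``rdtscp``, ``constant_tsc``, ``nonstop_tsc`` and ``tsc_reliable`` on the
``flags`` line of /proc/cpuinfo.  The sketch below (hypothetical function
name) tests such a line for a flag, taking care to match whole words only so
that ``tsc`` is not found inside ``rdtscp`` or ``constant_tsc``::

  #include <stdbool.h>
  #include <stddef.h>
  #include <string.h>

  /* Return true if flag appears as a whole word in flags_line. */
  static bool cpu_flag_present(const char *flags_line, const char *flag)
  {
      size_t len = strlen(flag);
      const char *p = flags_line;

      if (len == 0)
          return false;
      while ((p = strstr(p, flag)) != NULL) {
          bool starts = (p == flags_line) || p[-1] == ' ' ||
                        p[-1] == '\t' || p[-1] == ':';
          bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

          if (starts && ends)
              return true;
          p += len;       /* keep searching after this partial match */
      }
      return false;
  }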

4. Virtualization Problems
==========================

Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.  This causes problems as the passage of real time, the injection
of machine interrupts and the associated clock sources are no longer completely
synchronized with real time.

This same problem can occur on native hardware to a degree, as SMM may
naturally steal cycles on X86 systems when it is used by the BIOS, but not in
such an extreme fashion.  However, the fact that SMM may cause similar problems
to virtualization makes it a good justification for solving many of these
problems on bare metal.

4.1. Interrupt clocking
-----------------------

One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.

There are three approaches to solving this problem; first, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.  If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate.  This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally.  Although promising in theory, the
implementation of this policy in Linux has been extremely error prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.
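
The bookkeeping behind lost tick compensation is simple to state, even if it
has proven hard to get right in practice.  The sketch below uses a
hypothetical function name and assumes the elapsed time comes from a
trustworthy independent clock; it computes how many ticks are owed to the
guest::

  #include <stdint.h>

  /*
   * Number of periodic ticks that should have been delivered in elapsed_ns
   * at a rate of hz, minus those actually delivered.  Never negative.
   */
  static uint64_t lost_ticks(uint64_t elapsed_ns, uint32_t hz,
                             uint64_t delivered)
  {
      /* Split the conversion so the product cannot overflow 64 bits. */
      uint64_t expected = (elapsed_ns / 1000000000ull) * hz +
                          (elapsed_ns % 1000000000ull) * hz / 1000000000ull;

      return expected > delivered ? expected - delivered : 0;
  }

A guest configured for 1000 Hz that has received only 1500 interrupts over
two seconds of real time is 500 ticks behind.  Whether those ticks are
re-injected by the host or accounted for inside the guest is the policy
question discussed above.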

Windows uses periodic RTC clocking as a means of keeping time internally, and
thus requires interrupt slewing to keep proper time.  However, it uses a low
enough rate (ed: is it 18.2 Hz?) that this has not yet been a problem in
practice.

4.2. TSC sampling and serialization
-----------------------------------

As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers.  As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.  One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay.  By
definition, the counter, once read, is already old.  However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order.  Such execution is called
non-serialized.  Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR read.

Since CPUID may actually be virtualized by a trap and emulate mechanism, this
serialization can pose a performance issue for hardware virtualization.  An
accurate time stamp counter reading may therefore not always be available, and
it may be necessary for an implementation to guard against "backwards" reads of
the TSC as seen from other CPUs, even in an otherwise perfectly synchronized
system.
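
One common way to guard against backwards reads is to remember the largest
value handed out so far and never return anything smaller.  The following is
a minimal sketch of that idea using C11 atomics; the names are hypothetical
and a single global is used for brevity::

  #include <stdatomic.h>
  #include <stdint.h>

  /* Largest counter value returned to any caller so far. */
  static _Atomic uint64_t last_tsc;

  /*
   * Filter a raw sample so that callers on any CPU never observe the
   * clock moving backwards.
   */
  static uint64_t tsc_monotonic(uint64_t sample)
  {
      uint64_t last = atomic_load(&last_tsc);

      do {
          if (sample <= last)
              return last;    /* someone already published a newer value */
      } while (!atomic_compare_exchange_weak(&last_tsc, &last, sample));

      return sample;
  }

The cost of this approach is a shared cache line that every reader touches,
which is one reason it is usually applied only when the TSC is not known to
be stable.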

4.3. Timespec aliasing
----------------------

Additionally, this lack of serialization from the TSC poses another challenge
when using results of the TSC when measured against another time source.  As
the TSC is much higher precision, many possible values of the TSC may be read
while another clock is still expressing the same value.

That is, you may read (T, T+10) while external clock C maintains the same
value.  Due to non-serialized reads, you may actually end up with a range which
fluctuates - from (T-1, T+10).  Thus, any time calculated from a TSC, but
calibrated against an external value, may have a range of valid values.
Re-calibrating this computation may actually cause time, as computed after the
calibration, to go backwards, compared with time computed before the
calibration.

This problem is particularly pronounced with an internal time source in Linux,
the kernel time, which is expressed in the theoretically high resolution
timespec - but which advances in much larger granularity intervals, sometimes
at the rate of jiffies, and possibly in catchup modes, at a much larger step.

This aliasing requires care in the computation and recalibration of kvmclock
and any other values derived from TSC computation (such as TSC virtualization
itself).

4.4. Migration
--------------

Migration of a virtual machine raises problems for timekeeping in two ways.
First, the migration itself may take time, during which interrupts cannot be
delivered, and after which, the guest time may need to be caught up.  NTP may
be able to help to some degree here, as the clock correction required is
typically small enough to fall in the NTP-correctable window.

An additional concern is that timers based off the TSC (or HPET, if the raw bus
clock is exposed) may now be running at different rates, requiring compensation
in some way in the hypervisor by virtualizing these timers.  In addition,
migrating to a faster machine may preclude the use of a passthrough TSC, as a
faster clock cannot be made visible to a guest without the potential of time
advancing faster than usual.  A slower clock is less of a problem, as it can
always be caught up to the original rate.  KVM clock avoids these problems by
simply storing multipliers and offsets against the TSC for the guest to convert
back into nanosecond resolution values.
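
The conversion the guest performs can be sketched as follows.  The structure
below is a simplified, hypothetical stand-in for the shared time information
rather than its exact layout, and the 128-bit intermediate relies on the
``__int128`` extension of GCC and Clang.  The multiplier is treated as a
32-bit binary fraction applied after a power-of-two shift::

  #include <stdint.h>

  /* Simplified parameters published by the host (illustrative layout). */
  struct kvmclock_params {
      uint64_t tsc_timestamp;         /* TSC value when system_time_ns was set */
      uint64_t system_time_ns;        /* guest time in ns at tsc_timestamp */
      uint32_t tsc_to_system_mul;     /* ns per shifted cycle, times 2^32 */
      int8_t   tsc_shift;             /* power-of-two pre-scaling of the delta */
  };

  /* Convert a raw TSC reading to guest nanoseconds. */
  static uint64_t kvmclock_ns(const struct kvmclock_params *p, uint64_t tsc)
  {
      uint64_t delta = tsc - p->tsc_timestamp;

      if (p->tsc_shift >= 0)
          delta <<= p->tsc_shift;
      else
          delta >>= -p->tsc_shift;

      return p->system_time_ns +
             (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
  }

After a migration the host only needs to publish new parameters that match
the destination's TSC rate; the guest's view of elapsed nanoseconds stays
continuous even though the underlying counter frequency has changed.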

4.5. Scheduling
---------------

Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization.  In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case.  The effect of this has not been well studied.

In an attempt to work around this, several implementations have provided a
paravirtualized scheduler clock, which reveals the true amount of CPU time for
which a virtual machine has been running.

4.6. Watchdogs
--------------

Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.  Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.

4.7. Delays and precision timing
--------------------------------

Precise timing and delays may not be possible in a virtualized system.  This
matters if the system is controlling physical hardware, or issues delays to
compensate for slower I/O to and from devices.  The first issue is not solvable
in general for a virtualized system; hardware control software can't be
adequately virtualized without a full real-time operating system, which would
require an RT aware virtualization platform.

The second issue may cause performance problems, but this is unlikely to be a
significant issue.  In many cases these delays may be eliminated through
configuration or paravirtualization.

4.8. Covert channels and leaks
------------------------------

In addition to the above problems, time information will inevitably leak to the
guest about the host in anything but a perfect implementation of virtualized
time.  This may allow the guest to infer the presence of a hypervisor (as in a
red-pill type detection), and it may allow information to leak between guests
by using CPU utilization itself as a signalling channel.  Preventing such
problems would require completely isolated virtual time which may not track
real time any longer.  This may be useful in certain security or QA contexts,
but in general isn't recommended for real-world deployment scenarios.