#
14bd62d5 |
| 25-Mar-2024 |
Adrian Hunter <adrian.hunter@intel.com> |
clocksource: Make watchdog and suspend-timing multiplication overflow safe
[ Upstream commit d0304569fb019d1bcfbbbce1ce6df6b96f04079b ]
Kernel timekeeping is designed to keep the change in cycles (
clocksource: Make watchdog and suspend-timing multiplication overflow safe
[ Upstream commit d0304569fb019d1bcfbbbce1ce6df6b96f04079b ]
Kernel timekeeping is designed to keep the change in cycles (since the last timer interrupt) below max_cycles, which prevents multiplication overflow when converting cycles to nanoseconds. However, if timer interrupts stop, the clocksource_cyc2ns() calculation will eventually overflow.
Add protection against that. Simplify by folding together clocksource_delta() and clocksource_cyc2ns() into cycles_to_nsec_safe(). Check against max_cycles, falling back to a slower higher precision calculation.
Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240325064023.2997-20-adrian.hunter@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
9d6193fd |
| 02-Aug-2024 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
[ Upstream commit f2655ac2c06a15558e51ed6529de280e1553c86e ]
The current "nretries > 1 || nretries >= max_retries" check in cs_watchd
clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
[ Upstream commit f2655ac2c06a15558e51ed6529de280e1553c86e ]
The current "nretries > 1 || nretries >= max_retries" check in cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if nretries is greater than 1. The intent is instead to never warn on the first try, but otherwise warn if the successful retry was the last retry.
Therefore, change that "||" to "&&".
Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
03c38555 |
| 21-Feb-2024 |
Feng Tang <feng.tang@intel.com> |
clocksource: Scale the watchdog read retries automatically
[ Upstream commit 2ed08e4bc53298db3f87b528cd804cb0cce066a9 ]
On a 8-socket server the TSC is wrongly marked as 'unstable' and disabled dur
clocksource: Scale the watchdog read retries automatically
[ Upstream commit 2ed08e4bc53298db3f87b528cd804cb0cce066a9 ]
On a 8-socket server the TSC is wrongly marked as 'unstable' and disabled during boot time on about one out of 120 boot attempts:
clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns, wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable tsc: Marking TSC unstable due to clocksource watchdog TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152) clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896. clocksource: Switched to clocksource hpet
The reason is that for platform with a large number of CPUs, there are sporadic big or huge read latencies while reading the watchog/clocksource during boot or when system is under stress work load, and the frequency and maximum value of the latency goes up with the number of online CPUs.
The cCurrent code already has logic to detect and filter such high latency case by reading the watchdog twice and checking the two deltas. Due to the randomness of the latency, there is a low probabilty that the first delta (latency) is big, but the second delta is small and looks valid. The watchdog code retries the readouts by default twice, which is not necessarily sufficient for systems with a large number of CPUs.
There is a command line parameter 'max_cswd_read_retries' which allows to increase the number of retries, but that's not user friendly as it needs to be tweaked per system. As the number of required retries is proportional to the number of online CPUs, this parameter can be calculated at runtime.
Scale and enlarge the number of retries according to the number of online CPUs and remove the command line parameter completely.
[ tglx: Massaged change log and comments ]
Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jin Wang <jin1.wang@intel.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com Stable-dep-of: f2655ac2c06a ("clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
af7ab5da |
| 22-Jan-2024 |
Jiri Wiesner <jwiesner@suse.de> |
clocksource: Skip watchdog check for large watchdog intervals
commit 644649553508b9bacf0fc7a5bdc4f9e0165576a5 upstream.
There have been reports of the watchdog marking clocksources unstable on mach
clocksource: Skip watchdog check for large watchdog intervals
commit 644649553508b9bacf0fc7a5bdc4f9e0165576a5 upstream.
There have been reports of the watchdog marking clocksources unstable on machines with 8 NUMA nodes:
clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large: clocksource: 'hpet' wd_nsec: 14523447520 clocksource: 'tsc' cs_nsec: 14524115132
The measured clocksource skew - the absolute difference between cs_nsec and wd_nsec - was 668 microseconds:
cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
The kernel used 200 microseconds for the uncertainty_margin of both the clocksource and watchdog, resulting in a threshold of 400 microseconds (the md variable). Both the cs_nsec and the wd_nsec value indicate that the readout interval was circa 14.5 seconds. The observed behaviour is that watchdog checks failed for large readout intervals on 8 NUMA node machines. This indicates that the size of the skew was directly proportinal to the length of the readout interval on those machines. The measured clocksource skew, 668 microseconds, was evaluated against a threshold (the md variable) that is suited for readout intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") was to tighten the threshold for evaluating skew and set the lower bound for the uncertainty_margin of clocksources to twice WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to 125 microseconds to fit the limit of NTP, which is able to use a clocksource that suffers from up to 500 microseconds of skew per second. Both the TSC and the HPET use default uncertainty_margin. When the readout interval gets stretched the default uncertainty_margin is no longer a suitable lower bound for evaluating skew - it imposes a limit that is far stricter than the skew with which NTP can deal.
The root causes of the skew being directly proportinal to the length of the readout interval are:
* the inaccuracy of the shift/mult pairs of clocksources and the watchdog * the conversion to nanoseconds is imprecise for large readout intervals
Prevent this by skipping the current watchdog check if the readout interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin (of the TSC and HPET) corresponds to a limit on clocksource skew of 250 ppm (microseconds of skew per second). To keep the limit imposed by NTP (500 microseconds of skew per second) for all possible readout intervals, the margins would have to be scaled so that the threshold value is proportional to the length of the actual readout interval.
As for why the readout interval may get stretched: Since the watchdog is executed in softirq context the expiration of the watchdog timer can get severely delayed on account of a ksoftirqd thread not getting to run in a timely manner. Surely, a system with such belated softirq execution is not working well and the scheduling issue should be looked into but the clocksource watchdog should be able to deal with it accordingly.
Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") Suggested-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Feng Tang <feng.tang@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240122172350.GA740@incl Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
e40806e9 |
| 07-Jun-2023 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Handle negative skews in "skew is too large" messages
The nanosecond-to-millisecond skew computation uses unsigned arithmetic, which produces user-unfriendly large positive numbers for
clocksource: Handle negative skews in "skew is too large" messages
The nanosecond-to-millisecond skew computation uses unsigned arithmetic, which produces user-unfriendly large positive numbers for negative skews. Therefore, use signed arithmetic for this computation in order to preserve the negativity.
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Reported-by: Feng Tang <feng.tang@intel.com> Fixes: dd029269947a ("clocksource: Improve "skew is too large" messages") Reviewed-by: Feng Tang <feng.tang@intel.com> Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
76edc27e |
| 30-May-2023 |
Azeem Shaikh <azeemshaikh38@gmail.com> |
clocksource: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to
clocksource: Replace all non-returning strlcpy with strscpy
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe.
[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89
Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Acked-by: John Stultz <jstultz@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230530163546.986188-1-azeemshaikh38@gmail.com
show more ...
|
#
b7082cdf |
| 20-Dec-2022 |
Feng Tang <feng.tang@intel.com> |
clocksource: Suspend the watchdog temporarily when high read latency detected
Bugs have been reported on 8 sockets x86 machines in which the TSC was wrongly disabled when the system is under heavy w
clocksource: Suspend the watchdog temporarily when high read latency detected
Bugs have been reported on 8 sockets x86 machines in which the TSC was wrongly disabled when the system is under heavy workload.
[ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped! [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped! [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998) [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564. [ 821.067990] clocksource: Switched to clocksource hpet
This can be reproduced by running memory intensive 'stream' tests, or some of the stress-ng subcases such as 'ioport'.
The reason for these issues is the when system is under heavy load, the read latency of the clocksources can be very high. Even lightweight TSC reads can show high latencies, and latencies are much worse for external clocksources such as HPET or the APIC PM timer. These latencies can result in false-positive clocksource-unstable determinations.
These issues were initially reported by a customer running on a production system, and this problem was reproduced on several generations of Xeon servers, especially when running the stress-ng test. These Xeon servers were not production systems, but they did have the latest steppings and firmware.
Given that the clocksource watchdog is a continual diagnostic check with frequency of twice a second, there is no need to rush it when the system is under heavy load. Therefore, when high clocksource read latencies are detected, suspend the watchdog timer for 5 minutes.
Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Waiman Long <longman@redhat.com> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
dd029269 |
| 13-Dec-2022 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Improve "skew is too large" messages
When clocksource_watchdog() detects excessive clocksource skew compared to the watchdog clocksource, it marks the clocksource under test as unstable
clocksource: Improve "skew is too large" messages
When clocksource_watchdog() detects excessive clocksource skew compared to the watchdog clocksource, it marks the clocksource under test as unstable and prints several lines worth of message. But that message is unclear to anyone unfamiliar with the code:
clocksource: timekeeping watchdog on CPU2: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large: clocksource: 'kvm-clock' wd_nsec: 400744390 wd_now: 612625c2c wd_last: 5fa7f7c66 mask: ffffffffffffffff clocksource: 'wdtest-ktime' cs_nsec: 600744034 cs_now: 173081397a292d4f cs_last: 17308139565a8ced mask: ffffffffffffffff clocksource: 'kvm-clock' (not 'wdtest-ktime') is current clocksource.
Therefore, add the following line near the end of that message:
Clocksource 'wdtest-ktime' skewed 199999644 ns (199 ms) over watchdog 'kvm-clock' interval of 400744390 ns (400 ms)
This new line clearly indicates the amount of skew between the two clocksources, along with the duration of the time interval over which the skew occurred, both in nanoseconds and milliseconds.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Feng Tang <feng.tang@intel.com>
show more ...
|
#
f092eb34 |
| 13-Dec-2022 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Improve read-back-delay message
When cs_watchdog_read() is unable to get a qualifying clocksource read within the limit set by max_cswd_read_retries, it prints a message and marks the c
clocksource: Improve read-back-delay message
When cs_watchdog_read() is unable to get a qualifying clocksource read within the limit set by max_cswd_read_retries, it prints a message and marks the clocksource under test as unstable. But that message is unclear to anyone unfamiliar with the code:
clocksource: timekeeping watchdog on CPU13: wd-tsc-wd read-back delay 1000614ns, attempt 3, marking unstable
Therefore, add some context so that the message appears as follows:
clocksource: timekeeping watchdog on CPU13: wd-tsc-wd excessive read-back delay of 1000614ns vs. limit of 125000ns, wd-wd read-back delay only 27ns, attempt 3, marking tsc unstable
Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Feng Tang <feng.tang@intel.com>
show more ...
|
#
c37e85c1 |
| 06-Dec-2022 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Loosen clocksource watchdog constraints
Currently, MAX_SKEW_USEC is set to 100 microseconds, which has worked reasonably well. However, NTP is willing to tolerate 500 microseconds of s
clocksource: Loosen clocksource watchdog constraints
Currently, MAX_SKEW_USEC is set to 100 microseconds, which has worked reasonably well. However, NTP is willing to tolerate 500 microseconds of skew per second, and a clocksource that is good enough for NTP should be good enough for the clocksource watchdog. The watchdog's skew is controlled by MAX_SKEW_USEC and the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option. However, these values are doubled before being associated with a clocksource's ->uncertainty_margin, and the ->uncertainty_margin values of the pair of clocksource's being compared are summed before checking against the skew.
Therefore, set both MAX_SKEW_USEC and the default for the CLOCKSOURCE_WATCHDOG_MAX_SKEW_US Kconfig option to 125 microseconds of skew per second, resulting in 500 microseconds of skew per second in the clocksource watchdog's skew comparison.
Suggested-by Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
beaa1ffe |
| 16-Nov-2022 |
Yunying Sun <yunying.sun@intel.com> |
clocksource: Print clocksource name when clocksource is tested unstable
Some "TSC fall back to HPET" messages appear on systems having more than 2 NUMA nodes:
clocksource: timekeeping watchdog on C
clocksource: Print clocksource name when clocksource is tested unstable
Some "TSC fall back to HPET" messages appear on systems having more than 2 NUMA nodes:
clocksource: timekeeping watchdog on CPU168: hpet read-back delay of 4296200ns, attempt 4, marking unstable
The "hpet" here is misleading the clocksource watchdog is really doing repeated reads of "hpet" in order to check for unrelated delays. Therefore, print the name of the clocksource under test, prefixed by "wd-" and suffixed by "-wd", for example, "wd-tsc-wd".
Signed-off-by: Yunying Sun <yunying.sun@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
8032bf12 |
| 09-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:
@@ expression E; @@ - prandom_u32_max + get_random_u32_below (E)
Reviewed-
treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:
@@ expression E; @@ - prandom_u32_max + get_random_u32_below (E)
Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
show more ...
|
#
81895a65 |
| 05-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes t
treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script:
@basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) )
@multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@
- RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E);
// Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@
value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@
- (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT)
@collapse_ret@ type T; identifier VAR; expression E; @@
{ - T VAR; - VAR = (E); - return VAR; + return E; }
@drop_var@ type T; identifier VAR; @@
{ - T VAR; ... when != VAR }
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
show more ...
|
#
8afbcaf8 |
| 10-Feb-2022 |
Yury Norov <yury.norov@gmail.com> |
clocksource: Replace cpumask_weight() with cpumask_empty()
clocksource_verify_percpu() calls cpumask_weight() to check if any bit of a given cpumask is set.
This can be done more efficiently with c
clocksource: Replace cpumask_weight() with cpumask_empty()
clocksource_verify_percpu() calls cpumask_weight() to check if any bit of a given cpumask is set.
This can be done more efficiently with cpumask_empty() because cpumask_empty() stops traversing the cpumask as soon as it finds first set bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220210224933.379149-24-yury.norov@gmail.com
show more ...
|
#
fc153c1c |
| 05-Dec-2021 |
Waiman Long <longman@redhat.com> |
clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
A watchdog maximum skew of 100us may still be too small for some systems or archs. It may also be too small when some kernel debug config opti
clocksource: Add a Kconfig option for WATCHDOG_MAX_SKEW
A watchdog maximum skew of 100us may still be too small for some systems or archs. It may also be too small when some kernel debug config options are enabled. So add a new Kconfig option CLOCKSOURCE_WATCHDOG_MAX_SKEW_US to allow kernel builders to have more control on the threshold for marking clocksource as unstable.
Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
9b51d9d8 |
| 14-Aug-2021 |
Yury Norov <yury.norov@gmail.com> |
cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
cpumask_first() is a more effective analogue of 'next' version if n == -1 (which means start == 0). This patch replaces 'next'
cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
cpumask_first() is a more effective analogue of 'next' version if n == -1 (which means start == 0). This patch replaces 'next' with 'first' where things look trivial.
There's no cpumask_first_zero() function, so create it.
Signed-off-by: Yury Norov <yury.norov@gmail.com> Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
show more ...
|
#
1a562067 |
| 18-Nov-2021 |
Waiman Long <longman@redhat.com> |
clocksource: Reduce the default clocksource_watchdog() retries to 2
With the previous patch, there is an extra watchdog read in each retry. Now the total number of clocksource reads is increased to
clocksource: Reduce the default clocksource_watchdog() retries to 2
With the previous patch, there is an extra watchdog read in each retry. Now the total number of clocksource reads is increased to 4 per iteration. In order to avoid increasing the clock skew check overhead, the default maximum number of retries is reduced from 3 to 2 to maintain the same 12 clocksource reads in the worst case.
Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
c86ff8c5 |
| 18-Nov-2021 |
Waiman Long <longman@redhat.com> |
clocksource: Avoid accidental unstable marking of clocksources
Since commit db3a34e17433 ("clocksource: Retry clock read if long delays detected") and commit 2e27e793e280 ("clocksource: Reduce clock
clocksource: Avoid accidental unstable marking of clocksources
Since commit db3a34e17433 ("clocksource: Retry clock read if long delays detected") and commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), it is found that tsc clocksource fallback to hpet can sometimes happen on both Intel and AMD systems especially when they are running stressful benchmarking workloads. Of the 23 systems tested with a v5.14 kernel, 10 of them have switched to hpet clock source during the test run.
The result of falling back to hpet is a drastic reduction of performance when running benchmarks. For example, the fio performance tests can drop up to 70% whereas the iperf3 performance can drop up to 80%.
4 hpet fallbacks happened during bootup. They were:
[ 8.749399] clocksource: timekeeping watchdog on CPU13: hpet read-back delay of 263750ns, attempt 4, marking unstable [ 12.044610] clocksource: timekeeping watchdog on CPU19: hpet read-back delay of 186166ns, attempt 4, marking unstable [ 17.336941] clocksource: timekeeping watchdog on CPU28: hpet read-back delay of 182291ns, attempt 4, marking unstable [ 17.518565] clocksource: timekeeping watchdog on CPU34: hpet read-back delay of 252196ns, attempt 4, marking unstable
Other fallbacks happen when the systems were running stressful benchmarks. For example:
[ 2685.867873] clocksource: timekeeping watchdog on CPU117: hpet read-back delay of 57269ns, attempt 4, marking unstable [46215.471228] clocksource: timekeeping watchdog on CPU8: hpet read-back delay of 61460ns, attempt 4, marking unstable
Commit 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold"), changed the skew margin from 100us to 50us. I think this is too small and can easily be exceeded when running some stressful workloads on a thermally stressed system. So it is switched back to 100us.
Even a maximum skew margin of 100us may be too small in for some systems when booting up especially if those systems are under thermal stress. To eliminate the case that the large skew is due to the system being too busy slowing down the reading of both the watchdog and the clocksource, an extra consecutive read of watchdog clock is being done to check this.
The consecutive watchdog read delay is compared against WATCHDOG_MAX_SKEW/2. If the delay exceeds the limit, we assume that the system is just too busy. A warning will be printed to the console and the clock skew check is skipped for this round.
Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected") Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
show more ...
|
#
698429f9 |
| 03-Aug-2021 |
Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
clocksource: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock()
clocksource: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210803141621.780504-35-bigeasy@linutronix.de
show more ...
|
#
22a22383 |
| 27-May-2021 |
Feng Tang <feng.tang@intel.com> |
clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
Currently when an unstable clocksource is detected, the raw counters of that clocksource and watchdog will be printed,
clocksource: Print deviation in nanoseconds when a clocksource becomes unstable
Currently when an unstable clocksource is detected, the raw counters of that clocksource and watchdog will be printed, which can only be understood after some math calculation.
So print the delta in nanoseconds as well to make it easier for humans to check the results.
[ paulmck: Fix typo. ]
Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210527190124.440372-6-paulmck@kernel.org
show more ...
|
#
1253b9b8 |
| 27-May-2021 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Provide kernel module to test clocksource watchdog
When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays t
clocksource: Provide kernel module to test clocksource watchdog
When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. It would be good to have a way of testing the clocksource watchdog's ability to distinguish between these two causes of clock skew and instability.
Therefore, provide a new clocksource-wdtest module selected by a new TEST_CLOCKSOURCE_WATCHDOG Kconfig option. This module has a single module parameter named "holdoff" that provides the number of seconds of delay before testing should start, which defaults to zero when built as a module and to 10 seconds when built directly into the kernel. Very large systems that boot slowly may need to increase the value of this module parameter.
This module uses hand-crafted clocksource structures to do its testing, thus avoiding messing up timing for the rest of the kernel and for user applications. This module first verifies that the ->uncertainty_margin field of the clocksource structures are set sanely. It then tests the delay-detection capability of the clocksource watchdog, increasing the number of consecutive delays injected, first provoking console messages complaining about the delays and finally forcing a clock-skew event. Unexpected test results cause at least one WARN_ON_ONCE() console splat. If there are no splats, the test has passed. Finally, it fuzzes the value returned from a clocksource to test the clocksource watchdog's ability to detect time skew.
This module checks the state of its clocksource after each test, and uses WARN_ON_ONCE() to emit a console splat if there are any failures. This should enable all types of test frameworks to detect any such failures.
This facility is intended for diagnostic use only, and should be avoided on production systems.
Reported-by: Chris Mason <clm@fb.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-5-paulmck@kernel.org
show more ...
|
#
2e27e793 |
| 27-May-2021 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Reduce clocksource-skew threshold
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed by
clocksource: Reduce clocksource-skew threshold
Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in a 500-millisecond WATCHDOG_INTERVAL. This requires that clocks be skewed by more than 12.5% in order to be marked unstable. Except that a clock that is skewed by that much is probably destroying unsuspecting software right and left. And given that there are now checks for false-positive skews due to delays between reading the two clocks, it should be possible to greatly decrease WATCHDOG_THRESHOLD, at least for fine-grained clocks such as TSC.
Therefore, add a new uncertainty_margin field to the clocksource structure that contains the maximum uncertainty in nanoseconds for the corresponding clock. This field may be initialized manually, as it is for clocksource_tsc_early and clocksource_jiffies, which is copied to refined_jiffies. If the field is not initialized manually, it will be computed at clock-registry time as the period of the clock in question based on the scale and freq parameters to __clocksource_update_freq_scale() function. If either of those two parameters are zero, the tens-of-milliseconds WATCHDOG_THRESHOLD is used as a cowardly alternative to dividing by zero. No matter how the uncertainty_margin field is calculated, it is bounded below by twice WATCHDOG_MAX_SKEW, that is, by 100 microseconds.
Note that manually initialized uncertainty_margin fields are not adjusted, but there is a WARN_ON_ONCE() that triggers if any such field is less than twice WATCHDOG_MAX_SKEW. This WARN_ON_ONCE() is intended to discourage production use of the one-nanosecond uncertainty_margin values that are used to test the clock-skew code itself.
The actual clock-skew check uses the sum of the uncertainty_margin fields of the two clocksource structures being compared. Integer overflow is avoided because the largest computed value of the uncertainty_margin fields is one billion (10^9), and double that value fits into an unsigned int. However, if someone manually specifies (say) UINT_MAX, they will get what they deserve.
Note that the refined_jiffies uncertainty_margin field is initialized to TICK_NSEC, which means that skew checks involving this clocksource will be sufficently forgiving. In a similar vein, the clocksource_tsc_early uncertainty_margin field is initialized to 32*NSEC_PER_MSEC, which replicates the current behavior and allows custom setting if needed in order to address the rare skews detected for this clocksource in current mainline.
Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-4-paulmck@kernel.org
show more ...
|
#
fa218f1c |
| 27-May-2021 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Limit number of CPUs checked for clock synchronization
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU, that clock is checked on all CPUs. This is thorough,
clocksource: Limit number of CPUs checked for clock synchronization
Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU, that clock is checked on all CPUs. This is thorough, but might not be what you want on a system with a few tens of CPUs, let alone a few hundred of them.
Therefore, by default check only up to eight randomly chosen CPUs. Also provide a new clocksource.verify_n_cpus kernel boot parameter. A value of -1 says to check all of the CPUs, and a non-negative value says to randomly select that number of CPUs, without concern about selecting the same CPU multiple times. However, make use of a cpumask so that a given CPU will be checked at most once.
Suggested-by: Thomas Gleixner <tglx@linutronix.de> # For verify_n_cpus=1. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-3-paulmck@kernel.org
show more ...
|
#
7560c02b |
| 27-May-2021 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Check per-CPU clock synchronization when marked unstable
Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has p
clocksource: Check per-CPU clock synchronization when marked unstable
Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know?
Therefore apply CPU-to-CPU synchronization checking to newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life.
Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-2-paulmck@kernel.org
show more ...
|
#
db3a34e1 |
| 27-May-2021 |
Paul E. McKenney <paulmck@kernel.org> |
clocksource: Retry clock read if long delays detected
When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen
clocksource: Retry clock read if long delays detected
When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable.
Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried.
The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent.
Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org
show more ...
|