Revision tags: v6.6.67, v6.6.66, v6.6.65, v6.6.64, v6.6.63, v6.6.62, v6.6.61, v6.6.60, v6.6.59, v6.6.58, v6.6.57, v6.6.56, v6.6.55, v6.6.54, v6.6.53, v6.6.52, v6.6.51, v6.6.50, v6.6.49, v6.6.48, v6.6.47, v6.6.46, v6.6.45, v6.6.44, v6.6.43, v6.6.42, v6.6.41, v6.6.40, v6.6.39, v6.6.38, v6.6.37, v6.6.36, v6.6.35, v6.6.34, v6.6.33, v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3 |
|
#
c900529f |
| 12-Sep-2023 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-fixes into drm-misc-fixes
Forwarding to v6.6-rc1.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
Revision tags: v6.5.2, v6.1.51, v6.5.1 |
|
#
4ad0a4c2 |
| 31-Aug-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Add HOTPLUG_SMT support (/sys/devices/system/cpu/smt) and honour the configured SMT state when hotplugging CPUs into the system
- Combine final TLB flush and lazy TLB mm shootdown IPIs when using the Radix MMU to avoid a broadcast TLBIE flush on exit
- Drop the exclusion between ptrace/perf watchpoints, and drop the now unused associated arch hooks
- Add support for the "nohlt" command line option to disable CPU idle
- Add support for -fpatchable-function-entry for ftrace, with GCC >= 13.1
- Rework memory block size determination, and support 256MB size on systems with GPUs that have hotpluggable memory
- Various other small features and fixes
Thanks to Andrew Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Athira Rajeev, Benjamin Gray, Christophe Leroy, Frederic Barrat, Gautam Menghani, Geoff Levand, Hari Bathini, Immad Mir, Jialin Zhang, Joel Stanley, Jordan Niethe, Justin Stitt, Kajol Jain, Kees Cook, Krzysztof Kozlowski, Laurent Dufour, Liang He, Linus Walleij, Mahesh Salgaonkar, Masahiro Yamada, Michal Suchanek, Nageswara R Sastry, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nick Desaulniers, Omar Sandoval, Randy Dunlap, Reza Arbab, Rob Herring, Russell Currey, Sourabh Jain, Thomas Gleixner, Trevor Woerner, Uwe Kleine-König, Vaibhav Jain, Xiongfeng Wang, Yuan Tan, Zhang Rui, and Zheng Zengkai.
* tag 'powerpc-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (135 commits)
  macintosh/ams: linux/platform_device.h is needed
  powerpc/xmon: Reapply "Relax frame size for clang"
  powerpc/mm/book3s64: Use 256M as the upper limit with coherent device memory attached
  powerpc/mm/book3s64: Fix build error with SPARSEMEM disabled
  powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
  powerpc/mpc5xxx: Add missing fwnode_handle_put()
  powerpc/config: Disable SLAB_DEBUG_ON in skiroot
  powerpc/pseries: Remove unused hcall tracing instruction
  powerpc/pseries: Fix hcall tracepoints with JUMP_LABEL=n
  powerpc: dts: add missing space before {
  powerpc/eeh: Use pci_dev_id() to simplify the code
  powerpc/64s: Move CPU -mtune options into Kconfig
  powerpc/powermac: Fix unused function warning
  powerpc/pseries: Rework lppaca_shared_proc() to avoid DEBUG_PREEMPT
  powerpc: Don't include lppaca.h in paca.h
  powerpc/pseries: Move hcall_vphn() prototype into vphn.h
  powerpc/pseries: Move VPHN constants into vphn.h
  cxl: Drop unused detach_spa()
  powerpc: Drop zalloc_maybe_bootmem()
  powerpc/powernv: Use struct opal_prd_msg in more places
  ...
|
Revision tags: v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43 |
|
#
53834a0c |
| 31-Jul-2023 |
Benjamin Gray <bgray@linux.ibm.com> |
perf/hw_breakpoint: Remove arch breakpoint hooks
PowerPC was the only user of these hooks, and has been refactored to no longer require them. There is no need to keep them around, so remove them to reduce complexity.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230801011744.153973-8-bgray@linux.ibm.com
|
Revision tags: v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13 |
|
#
4f2c0a4a |
| 13-Dec-2022 |
Nick Terrell <terrelln@fb.com> |
Merge branch 'main' into zstd-linus
|
#
cfd1f6c1 |
| 13-Dec-2022 |
Jiri Kosina <jkosina@suse.cz> |
Merge branch 'for-6.2/apple' into for-linus
- new quirks for select Apple keyboards (Kerem Karabay, Aditya Garg)
|
#
e291c116 |
| 12-Dec-2022 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare input updates for 6.2 merge window.
|
Revision tags: v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80 |
|
#
29583dfc |
| 21-Nov-2022 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-next into drm-misc-next-fixes
Backmerging to update drm-misc-next-fixes for the final phase of the release cycle.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
Revision tags: v6.0.9, v5.15.79 |
|
#
002c6ca7 |
| 14-Nov-2022 |
Rodrigo Vivi <rodrigo.vivi@intel.com> |
Merge drm/drm-next into drm-intel-next
Catch up on 6.1-rc cycle in order to solve the intel_backlight conflict on linux-next.
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
|
Revision tags: v6.0.8, v5.15.78 |
|
#
d93618da |
| 04-Nov-2022 |
Joonas Lahtinen <joonas.lahtinen@linux.intel.com> |
Merge drm/drm-next into drm-intel-gt-next
Needed to bring in v6.1-rc1 which contains commit f683b9d61319 ("i915: use the VMA iterator") which is needed for series https://patchwork.freedesktop.org/series/110083/ .
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
|
Revision tags: v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4 |
|
#
14e77332 |
| 21-Oct-2022 |
Nick Terrell <terrelln@fb.com> |
Merge branch 'main' into zstd-next
|
Revision tags: v6.0.3 |
|
#
1aca5ce0 |
| 20-Oct-2022 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-fixes into drm-misc-fixes
Backmerging to get v6.1-rc1.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
#
008f05a7 |
| 19-Oct-2022 |
Mark Brown <broonie@kernel.org> |
ASoC: jz4752b: Capture fixes
Merge series from Siarhei Volkau <lis8215@gmail.com>:
The patchset fixes:
- Line In path stays powered off during capturing or bypass to mixer.
- incorrectly represented dB values in alsamixer, et al.
- incorrectly represented Capture input selector in alsamixer's Playback tab.
- wrong control selected as Capture Master
|
#
a140a6a2 |
| 18-Oct-2022 |
Maxime Ripard <maxime@cerno.tech> |
Merge drm/drm-next into drm-misc-next
Let's kick-off this release cycle.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
|
#
c29a017f |
| 17-Oct-2022 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge tag 'v6.1-rc1' into next
Merge with mainline to bring in the latest changes to twl4030 driver.
|
#
8048b835 |
| 16-Oct-2022 |
Andrew Morton <akpm@linux-foundation.org> |
Merge branch 'master' into mm-hotfixes-stable
|
Revision tags: v6.0.2, v5.15.74, v5.15.73, v6.0.1 |
|
#
3871d93b |
| 10-Oct-2022 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar: "PMU driver updates:
- Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature support for Zen 4 processors.
- Extend the perf ABI to provide branch speculation information, if available, and use this on CPUs that have it (eg. LbrExtV2).
- Improve Intel PEBS TSC timestamp handling & integration.
- Add Intel Raptor Lake S CPU support.
- Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs by utilizing IBS tagged load/store samples.
- Clean up & optimize various x86 PMU details.
HW breakpoints:
- Big rework to optimize the code for systems with hundreds of CPUs and thousands of breakpoints:
- Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem per-CPU rwsem that is read-locked during most of the key operations.
- Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and fetch_bp_busy_slots().
- Apply micro-optimizations & cleanups.
- Misc cleanups & enhancements"
* tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
  perf: Fix pmu_filter_match()
  perf: Fix lockdep_assert_event_ctx()
  perf/x86/amd/lbr: Adjust LBR regardless of filtering
  perf/x86/utils: Fix uninitialized var in get_branch_type()
  perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file
  perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR
  perf/x86/amd: Support PERF_SAMPLE_ADDR
  perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT}
  perf/x86/amd: Support PERF_SAMPLE_DATA_SRC
  perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions
  perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO}
  perf/x86/uncore: Add new Raptor Lake S support
  perf/x86/cstate: Add new Raptor Lake S support
  perf/x86/msr: Add new Raptor Lake S support
  perf/x86: Add new Raptor Lake S support
  bpf: Check flags for branch stack in bpf_read_branch_records helper
  perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails
  perf: Use sample_flags for raw_data
  perf: Use sample_flags for addr
  ...
|
Revision tags: v5.15.72 |
|
#
82aad7ff |
| 04-Oct-2022 |
Peter Zijlstra <peterz@infradead.org> |
perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
Perf fuzzer gifted a lockdep splat:
perf_event_init_context()
  mutex_lock(parent_ctx->mutex);                                  (B)
  inherit_task_group()
    inherit_group()
      inherit_event()
        perf_event_alloc()
          perf_try_init_event() := hw_breakpoint_event_init()
            register_perf_hw_breakpoint()
              mutex_lock(child->perf_event_mutex);                (A)
Which is against the normal (documented) order. Now, this is a false positive in that child is not published yet, but also inherited events never end up on ->perf_event_list.
Annotate this one away.
Fixes: 0912037fec11 ("perf/hw_breakpoint: Reduce contention with large number of tasks") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
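One common way to annotate such benign nesting away is to give the inner acquisition its own lockdep subclass. A minimal sketch of that idiom, using a hypothetical task_ctx type and helper (this illustrates the general annotation technique, not necessarily the exact mechanism used by this commit):

  #include <linux/lockdep.h>
  #include <linux/mutex.h>

  struct task_ctx {
          struct mutex perf_event_mutex;
  };

  static void inherit_into_child(struct task_ctx *parent, struct task_ctx *child)
  {
          mutex_lock(&parent->perf_event_mutex);
          /*
           * The child is not yet published, so nesting its mutex inside the
           * parent's cannot deadlock; SINGLE_DEPTH_NESTING places the inner
           * acquisition in its own lockdep subclass so the apparent order
           * inversion is not reported.
           */
          mutex_lock_nested(&child->perf_event_mutex, SINGLE_DEPTH_NESTING);
          /* ... copy inherited state ... */
          mutex_unlock(&child->perf_event_mutex);
          mutex_unlock(&parent->perf_event_mutex);
  }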
|
Revision tags: v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64 |
|
#
ecdfb889 |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Optimize toggle_bp_slot() for CPU-independent task targets
We can still see that a majority of the time is spent hashing task pointers:
... 16.98% [kernel] [k] rhashtable_jhash2 ...
Doing the bookkeeping in toggle_bp_slots() is currently O(#cpus), calling task_bp_pinned() for each CPU, even if task_bp_pinned() is CPU-independent. The reason for this is to update the per-CPU 'tsk_pinned' histogram.
To optimize the CPU-independent case to O(1), keep a separate CPU-independent 'tsk_pinned_all' histogram.
The major source of complexity are transitions between "all CPU-independent task breakpoints" and "mixed CPU-independent and CPU-dependent task breakpoints". The code comments list all cases that require handling.
After this optimization:
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
|     Total time: 1.758 [sec]
|
|       34.336621 usecs/op
|     4395.087500 usecs/op/cpu
38.08% [kernel] [k] queued_spin_lock_slowpath
10.81% [kernel] [k] smp_cfm_core_cond
 3.01% [kernel] [k] update_sg_lb_stats
 2.58% [kernel] [k] osq_lock
 2.57% [kernel] [k] llist_reverse_order
 1.45% [kernel] [k] find_next_bit
 1.21% [kernel] [k] flush_tlb_func_common
 1.01% [kernel] [k] arch_install_hw_breakpoint
Showing that the time spent hashing keys has become insignificant.
With the given benchmark parameters, that's an improvement of 12% compared with the old O(#cpus) version.
And finally, using the less aggressive parameters from the preceding changes, we now observe:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
|     Total time: 0.067 [sec]
|
|       35.292187 usecs/op
|     2258.700000 usecs/op/cpu
Which is an improvement of 12% compared to without the histogram optimizations (baseline is 40 usecs/op). This is now on par with the theoretical ideal (constraints disabled), and only 12% slower than no breakpoints at all.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-15-elver@google.com
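As a rough sketch of the bookkeeping described above (field names follow the commit text, but the slot count, helper name, and layout are illustrative assumptions rather than the actual kernel code), a CPU-independent task target now touches a single global histogram instead of one histogram per CPU:

  #include <linux/atomic.h>
  #include <linux/percpu.h>

  #define BP_SLOTS 4      /* placeholder for the architecture's slot count */

  struct bp_slots_histogram {
          atomic_t count[BP_SLOTS];       /* count[i]: #targets pinning i+1 slots */
  };

  static struct bp_slots_histogram tsk_pinned_all;              /* CPU-independent task bps */
  static DEFINE_PER_CPU(struct bp_slots_histogram, tsk_pinned); /* per-CPU task bps */

  static void toggle_task_slot(int cpu, int old, int new)
  {
          /* CPU-independent target (cpu < 0): one O(1) update, not O(#cpus). */
          struct bp_slots_histogram *hist =
                  (cpu < 0) ? &tsk_pinned_all : per_cpu_ptr(&tsk_pinned, cpu);

          if (old > 0)
                  atomic_dec(&hist->count[old - 1]);
          if (new > 0)
                  atomic_inc(&hist->count[new - 1]);
  }

The transitions between "all CPU-independent" and "mixed" task breakpoints mentioned above are where the real implementation has to reconcile tsk_pinned_all with the per-CPU histograms; the sketch omits that.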
|
#
9b1933b8 |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Optimize max_bp_pinned_slots() for CPU-independent task targets
Running the perf benchmark with (note: more aggressive parameters vs. preceding changes, but same 256 CPUs host):
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
|     Total time: 1.989 [sec]
|
|       38.854160 usecs/op
|     4973.332500 usecs/op/cpu
20.43% [kernel] [k] queued_spin_lock_slowpath
18.75% [kernel] [k] osq_lock
16.98% [kernel] [k] rhashtable_jhash2
 8.34% [kernel] [k] task_bp_pinned
 4.23% [kernel] [k] smp_cfm_core_cond
 3.65% [kernel] [k] bcmp
 2.83% [kernel] [k] toggle_bp_slot
 1.87% [kernel] [k] find_next_bit
 1.49% [kernel] [k] __reserve_bp_slot
We can see that a majority of the time is now spent hashing task pointers to index into task_bps_ht in task_bp_pinned().
Obtaining the max_bp_pinned_slots() for CPU-independent task targets currently is O(#cpus), and calls task_bp_pinned() for each CPU, even if the result of task_bp_pinned() is CPU-independent.
The loop in max_bp_pinned_slots() wants to compute the maximum slots across all CPUs. If task_bp_pinned() is CPU-independent, we can do so by obtaining the max slots across all CPUs and adding task_bp_pinned().
To do so in O(1), use a bp_slots_histogram for CPU-pinned slots.
After this optimization:
| $> perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 100 threads with 4 breakpoints and 128 parallelism
|     Total time: 1.930 [sec]
|
|       37.697832 usecs/op
|     4825.322500 usecs/op/cpu
19.13% [kernel] [k] queued_spin_lock_slowpath
18.21% [kernel] [k] rhashtable_jhash2
15.46% [kernel] [k] osq_lock
 6.27% [kernel] [k] toggle_bp_slot
 5.91% [kernel] [k] task_bp_pinned
 5.05% [kernel] [k] smp_cfm_core_cond
 1.78% [kernel] [k] update_sg_lb_stats
 1.36% [kernel] [k] llist_reverse_order
 1.34% [kernel] [k] find_next_bit
 1.19% [kernel] [k] bcmp
Suggesting that time spent in task_bp_pinned() has been reduced. However, we're still hashing too much, which will be addressed in the subsequent change.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-14-elver@google.com
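The O(1) computation this describes can be sketched as follows (the histogram layout and helper names are illustrative assumptions; cpu_pinned stands for the new CPU-pinned histogram mentioned above):

  #include <linux/atomic.h>

  #define BP_SLOTS 4      /* placeholder for the architecture's slot count */

  struct bp_slots_histogram {
          atomic_t count[BP_SLOTS];
  };

  /* Highest non-empty bucket == max slots pinned by any single target. */
  static int bp_slots_histogram_max(struct bp_slots_histogram *hist)
  {
          int i;

          for (i = BP_SLOTS - 1; i >= 0; i--) {
                  if (atomic_read(&hist->count[i]))
                          return i + 1;
          }
          return 0;
  }

  /*
   * CPU-independent task target: instead of calling task_bp_pinned() once
   * per CPU, take the maximum across CPUs from the histogram and add the
   * task's own (CPU-independent) pinned count.
   */
  static int max_bp_pinned_slots_fast(struct bp_slots_histogram *cpu_pinned,
                                      int task_pinned)
  {
          return bp_slots_histogram_max(cpu_pinned) + task_pinned;
  }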
|
#
16db2839 |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Introduce bp_slots_histogram
Factor out the existing `atomic_t count[N]` into its own struct called 'bp_slots_histogram', to generalize and make its intent clearer in preparation of reusing elsewhere. The basic idea of bucketing "total uses of N slots" resembles a histogram, so calling it such seems most intuitive.
No functional change.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-13-elver@google.com
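In shape, the factoring amounts to wrapping the existing counters in a small named struct with its own helpers (a sketch; the allocation strategy and helper signatures here are assumptions, not the exact kernel code):

  #include <linux/atomic.h>
  #include <linux/errno.h>
  #include <linux/slab.h>

  /* Bucket i counts how many targets currently pin i+1 breakpoint slots. */
  struct bp_slots_histogram {
          atomic_t *count;
  };

  static int bp_slots_histogram_alloc(struct bp_slots_histogram *hist, int nr_slots)
  {
          hist->count = kcalloc(nr_slots, sizeof(*hist->count), GFP_KERNEL);
          return hist->count ? 0 : -ENOMEM;
  }

  static void bp_slots_histogram_add(struct bp_slots_histogram *hist, int old, int val)
  {
          /* Move a target from bucket 'old' to bucket 'old + val'. */
          if (old > 0)
                  atomic_dec(&hist->count[old - 1]);
          if (old + val > 0)
                  atomic_inc(&hist->count[old + val - 1]);
  }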
|
#
0912037f |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Reduce contention with large number of tasks
While optimizing task_bp_pinned()'s runtime complexity to O(1) on average helps reduce time spent in the critical section, we still suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows that now contention is the biggest issue:
95.93% [kernel] [k] osq_lock
 0.70% [kernel] [k] mutex_spin_on_owner
 0.22% [kernel] [k] smp_cfm_core_cond
 0.18% [kernel] [k] task_bp_pinned
 0.18% [kernel] [k] rhashtable_jhash2
 0.15% [kernel] [k] queued_spin_lock_slowpath
when running the breakpoint benchmark with (system with 256 CPUs):
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
|     Total time: 0.207 [sec]
|
|      108.267188 usecs/op
|     6929.100000 usecs/op/cpu
The main concern for synchronizing the breakpoint constraints data is that a consistent snapshot of the per-CPU and per-task data is observed.
The access pattern is as follows:
1. If the target is a task: the task's pinned breakpoints are counted, checked for space, and then appended to; only bp_cpuinfo::cpu_pinned is used to check for conflicts with CPU-only breakpoints; bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise unused.
2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is incremented. No per-task breakpoints are checked.
Since rhltable safely synchronizes insertions/deletions, we can allow concurrency as follows:
1. If the target is a task: independent tasks may update and check the constraints concurrently, but same-task target calls need to be serialized; since bp_cpuinfo::tsk_pinned is only updated, but not checked, these modifications can happen concurrently by switching tsk_pinned to atomic_t.
2. If the target is a CPU: access to the per-CPU constraints needs to be serialized with other CPU-target and task-target callers (to stabilize the bp_cpuinfo::tsk_pinned snapshot).
We can allow the above concurrency by introducing a per-CPU constraints data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses task_struct::perf_event_mutex):
1. If the target is a task: acquires perf_event_mutex, and acquires bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes contention in the presence of many read-lock but few write-lock acquisitions: we assume many orders of magnitude more task target breakpoints creations/destructions than CPU target breakpoints.
2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer.
With these changes, contention with thousands of tasks is reduced to the point where waiting on locking no longer dominates the profile:
| $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
|     Total time: 0.077 [sec]
|
|       40.201563 usecs/op
|     2572.900000 usecs/op/cpu
21.54% [kernel] [k] task_bp_pinned
20.18% [kernel] [k] rhashtable_jhash2
 6.81% [kernel] [k] toggle_bp_slot
 5.47% [kernel] [k] queued_spin_lock_slowpath
 3.75% [kernel] [k] smp_cfm_core_cond
 3.48% [kernel] [k] bcmp
On this particular setup that's a speedup of 2.7x.
We're also getting closer to the theoretical ideal performance through optimizations in hw_breakpoint.c -- constraints accounting disabled:
| perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 4 breakpoints and 64 parallelism
|     Total time: 0.067 [sec]
|
|       35.286458 usecs/op
|     2258.333333 usecs/op/cpu
Which means the current implementation is ~12% slower than the theoretical ideal.
For reference, performance without any breakpoints:
| $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64
| # Running 'breakpoint/thread' benchmark:
| # Created/joined 30 threads with 0 breakpoints and 64 parallelism
|     Total time: 0.060 [sec]
|
|       31.365625 usecs/op
|     2007.400000 usecs/op/cpu
On a system with 256 CPUs, the theoretical ideal is only ~12% slower than no breakpoints at all; the current implementation is ~28% slower.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com
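For illustration, the resulting locking rules can be sketched like this (constraint checks elided, function names are placeholders; bp_cpuinfo_sem and perf_event_mutex are the locks named above, everything else is an assumption):

  #include <linux/mutex.h>
  #include <linux/percpu-rwsem.h>
  #include <linux/sched.h>

  static DEFINE_STATIC_PERCPU_RWSEM(bp_cpuinfo_sem);

  /* Task target: serialize same-task callers, let different tasks proceed
   * concurrently; tsk_pinned is only updated (atomically), never checked here. */
  static int reserve_task_bp_slot(struct task_struct *tsk)
  {
          int ret = 0;

          mutex_lock(&tsk->perf_event_mutex);
          percpu_down_read(&bp_cpuinfo_sem);      /* cheap, rarely write-locked */
          /* ... check constraints, update bp_cpuinfo::tsk_pinned ... */
          percpu_up_read(&bp_cpuinfo_sem);
          mutex_unlock(&tsk->perf_event_mutex);
          return ret;
  }

  /* CPU target: take the rwsem as a writer to get a stable snapshot of the
   * per-CPU constraints, including tsk_pinned, before bumping cpu_pinned. */
  static int reserve_cpu_bp_slot(void)
  {
          int ret = 0;

          percpu_down_write(&bp_cpuinfo_sem);
          /* ... check cpu_pinned + tsk_pinned, then update cpu_pinned ... */
          percpu_up_write(&bp_cpuinfo_sem);
          return ret;
  }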
|
#
24198ad3 |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Remove useless code related to flexible breakpoints
Flexible breakpoints have never been implemented, with bp_cpuinfo::flexible always being 0. Unfortunately, they still occupy 4 bytes in each bp_cpuinfo and bp_busy_slots, as well as computing the max flexible count in fetch_bp_busy_slots().
This again causes suboptimal code generation, when we always know that `!!slots.flexible` will be 0.
Just get rid of the flexible "placeholder" and remove all real code related to it. Make a note in the comment related to the constraints algorithm but don't remove them from the algorithm, so that if in future flexible breakpoints need supporting, it should be trivial to revive them (along with reverting this change).
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-9-elver@google.com
|
#
9caf87be |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Make hw_breakpoint_weight() inlinable
Due to being a __weak function, hw_breakpoint_weight() will cause the compiler to always emit a call to it. This generates unnecessarily bad code (register spills etc.) for no good reason; in fact it appears in profiles of `perf bench -r 100 breakpoint thread -b 4 -p 128 -t 512`:
... 0.70% [kernel] [k] hw_breakpoint_weight ...
While a small percentage, no architecture defines its own hw_breakpoint_weight() nor are there users outside hw_breakpoint.c, which makes the fact it is currently __weak a poor choice.
Change hw_breakpoint_weight()'s definition to follow a similar protocol to hw_breakpoint_slots(), such that if <asm/hw_breakpoint.h> defines hw_breakpoint_weight(), we'll use it instead.
The result is that it is inlined and no longer shows up in profiles.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-8-elver@google.com
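The protocol referred to is the usual preprocessor-override idiom; a sketch of the generic side (an architecture's <asm/hw_breakpoint.h> would provide its own inline and define the matching macro to suppress this fallback; the body below is illustrative):

  #include <linux/perf_event.h>
  #include <asm/hw_breakpoint.h>

  #ifndef hw_breakpoint_weight
  /*
   * No architecture override: provide an inlinable default instead of a
   * __weak out-of-line function, so the constant folds into every caller.
   */
  static inline int hw_breakpoint_weight(struct perf_event *bp)
  {
          return 1;
  }
  #endif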
|
#
be3f1525 |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Optimize constant number of breakpoint slots
Optimize internal hw_breakpoint state if the architecture's number of breakpoint slots is constant. This avoids several kmalloc() calls and potentially unnecessary failures if the allocations fail, as well as subtly improves code generation and cache locality.
The protocol is that if an architecture defines hw_breakpoint_slots via the preprocessor, it must be constant and the same for all types.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-7-elver@google.com
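Roughly, the protocol might look like this (a sketch; the struct follows the bp_slots_histogram commits above, and the non-constant fallback branch is an assumption):

  #include <linux/atomic.h>
  #include <linux/errno.h>
  #include <linux/slab.h>
  #include <asm/hw_breakpoint.h>

  struct bp_slots_histogram {
  #ifdef hw_breakpoint_slots
          /* Constant and identical for all types: size statically, no kmalloc(). */
          atomic_t count[hw_breakpoint_slots(0)];
  #else
          atomic_t *count;
  #endif
  };

  static int bp_slots_histogram_alloc(struct bp_slots_histogram *hist, int type)
  {
  #ifndef hw_breakpoint_slots
          hist->count = kcalloc(hw_breakpoint_slots(type), sizeof(*hist->count),
                                GFP_KERNEL);
          if (!hist->count)
                  return -ENOMEM;
  #endif
          return 0;
  }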
|
#
db5f6f85 |
| 29-Aug-2022 |
Marco Elver <elver@google.com> |
perf/hw_breakpoint: Mark data __ro_after_init
Mark read-only data after initialization as __ro_after_init.
While we are here, turn 'constraints_initialized' into a bool.
Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220829124719.675715-6-elver@google.com
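A tiny sketch of the pattern (the variable name comes from the commit text; the init function here is a placeholder):

  #include <linux/cache.h>
  #include <linux/init.h>
  #include <linux/types.h>

  /* Written only during initialization; the section is write-protected afterwards. */
  static bool constraints_initialized __ro_after_init;

  static int __init init_hw_breakpoint_sketch(void)
  {
          /* ... allocate and set up constraint data ... */
          constraints_initialized = true;         /* last write before read-only */
          return 0;
  }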
|