Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32 |
|
#
e4ec3318 |
| 31-May-2023 |
Peter Zijlstra <peterz@infradead.org> |
sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
EEVDF uses this tunable as the base request/slice -- make sure the name reflects this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
|
#
5e963f2b |
| 31-May-2023 |
Peter Zijlstra <peterz@infradead.org> |
sched/fair: Commit to EEVDF
EEVDF is a better defined scheduling policy; as a result it has fewer heuristics/tunables. There is no compelling reason to keep CFS around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
|
#
147f3efa |
| 31-May-2023 |
Peter Zijlstra <peterz@infradead.org> |
sched/fair: Implement an EEVDF-like scheduling policy
CFS is currently a WFQ based scheduler with only a single knob, the weight. The addition of a second, latency-oriented parameter makes something like WF2Q or EEVDF a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the tree -- those entities that are owed service. Except because this is a virtual time scheduler, the deadlines are in virtual time as well, which is what allows over-subscription.
EEVDF has two parameters:
- weight, or time-slope: which is mapped to nice just as before
- request size, or slice length: which is used to compute the virtual deadline as: vd_i = ve_i + r_i/w_i
Basically, by setting a smaller slice, the deadline will be earlier and the task will be more eligible and run earlier.
Tick driven preemption is driven by request/slice completion; while wakeup preemption is driven by the deadline.
Because the tree is now effectively an interval tree, and the selection is no longer 'leftmost', over-scheduling is less of a problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
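To make the deadline rule concrete, here is a minimal userspace sketch of the computation described above; the struct and helper names are illustrative, not the kernel's actual code:

    #include <stdint.h>

    struct entity {
        uint64_t vruntime;       /* ve_i: virtual time already received */
        uint64_t deadline;       /* vd_i: virtual deadline              */
        uint64_t slice;          /* r_i:  requested slice length        */
        unsigned long weight;    /* w_i:  mapped from the nice value    */
    };

    /* vd_i = ve_i + r_i / w_i: a smaller slice gives an earlier virtual
     * deadline, so the entity is picked sooner once it is eligible. */
    static void update_deadline(struct entity *se)
    {
        se->deadline = se->vruntime + se->slice / se->weight;
    }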
|
#
af4cf404 |
| 31-May-2023 |
Peter Zijlstra <peterz@infradead.org> |
sched/fair: Add cfs_rq::avg_vruntime
In order to move to an eligibility based scheduling policy, we need to have a better approximation of the ideal scheduler.
Specifically, for a virtual time weighted fair queueing based scheduler the ideal scheduler will be the weighted average of the individual virtual runtimes (math in the comment).
As such, compute the weighted average to approximate the ideal scheduler -- note that the approximation is in the individual task behaviour, which isn't strictly conformant.
Specifically consider adding a task with a vruntime left of center, in this case the average will move backwards in time -- something the ideal scheduler would of course never do.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
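As a rough illustration of the math referenced above (the kernel maintains this incrementally, relative to min_vruntime, to avoid overflow; this sketch only shows the weighted average itself, with assumed names):

    #include <stddef.h>
    #include <stdint.h>

    struct entity {
        uint64_t vruntime;
        unsigned long weight;
    };

    /* V = sum(w_i * v_i) / sum(w_i): the weighted average of the queued
     * entities' virtual runtimes, approximating the ideal scheduler. */
    static uint64_t avg_vruntime(const struct entity *se, size_t nr)
    {
        uint64_t num = 0, den = 0;

        for (size_t i = 0; i < nr; i++) {
            num += (uint64_t)se[i].weight * se[i].vruntime;
            den += se[i].weight;
        }
        return den ? num / den : 0;
    }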
|
#
ed74cc49 |
| 07-Jul-2023 |
Peter Zijlstra <peterz@infradead.org> |
sched/debug: Dump domains' sched group flags
There has been a case where the SD_SHARE_CPUCAPACITY sched group flag in a parent domain was not set and propagated properly when a degenerate domain was removed.
Add a dump of a CPU's domain sched group flags to make debugging easier in the future.
Usage: cat /debug/sched/domains/cpu0/domain1/groups_flags to dump cpu0 domain1's sched group flags.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
|
Revision tags: v6.1.31, v6.1.30, v6.1.29, v6.1.28 |
|
#
a6fcdd8d |
| 06-May-2023 |
Yan Yan <yanyan.yan@antgroup.com> |
sched/debug: Correct printing for rq->nr_uninterruptible
Commit e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit") changed the type of rq->nr_uninterruptible from "unsigned long" to "unsigned int", but left a wrong cast in the print to /sys/kernel/debug/sched/debug and to the console.
For example, when nr_uninterruptible's value is fffffff7 with type "unsigned int", (long)nr_uninterruptible shows 4294967287 while (int)nr_uninterruptible prints -9. So using an int cast fixes the wrong printing.
Signed-off-by: Yan Yan <yanyan.yan@antgroup.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230506074253.44526-1-yanyan.yan@antgroup.com
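The difference between the two casts can be reproduced with a trivial userspace program (illustration only, assuming a 64-bit long):

    #include <stdio.h>

    int main(void)
    {
        unsigned int nr_uninterruptible = 0xfffffff7;

        /* the long cast zero-extends the 32-bit value ... */
        printf("%ld\n", (long)nr_uninterruptible);   /* 4294967287 */
        /* ... while the int cast preserves the intended sign */
        printf("%d\n", (int)nr_uninterruptible);     /* -9 */
        return 0;
    }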
|
Revision tags: v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16 |
|
#
34320745 |
| 03-Mar-2023 |
Phil Auld <pauld@redhat.com> |
sched/debug: Put sched/domains files under the verbose flag
The debug files under sched/domains can take a long time to regenerate, especially when updates are done one at a time. Move these files under the sched verbose debug flag. Allow changes to verbose to trigger generation of the files. This lets a user batch the updates but still have the information available. The detailed topology printk messages are also under verbose.
Discussion that led to this approach can be found in the link below.
Simplified code to maintain use of debugfs bool routines suggested by Michael Ellerman <mpe@ellerman.id.au>.
Signed-off-by: Phil Auld <pauld@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Vishal Chourasia <vishalc@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vishal Chourasia <vishalc@linux.vnet.ibm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/all/Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/ Link: https://lore.kernel.org/r/20230303183754.3076321-1-pauld@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
Revision tags: v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55 |
|
#
33024536 |
| 13-Jul-2022 |
Huang Ying <ying.huang@intel.com> |
memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.
To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory nodes need to be identified. Essentially, the original NUMA balancing implementation selects the most recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages, because pages with quite low access frequency may still be accessed eventually, given that the NUMA balancing page table scanning period can be quite long (e.g. 60 seconds). So in this patchset, we implement a new hot page identification algorithm based on the latency between NUMA balancing page table scanning and the hint page fault, which is a kind of most frequently accessed (MFU) algorithm.
In NUMA balancing memory tiering mode, if there are hot pages in the slow memory node and cold pages in the fast memory node, we need to promote/demote the hot/cold pages between the fast and slow memory nodes.
A choice is to promote/demote as fast as possible. But the CPU cycles and memory bandwidth consumed by the high promoting/demoting throughput will hurt the latency of some workloads because of access latency inflation and slow memory bandwidth contention.
A way to resolve this issue is to restrict the max promoting/demoting throughput. It will take longer to finish the promoting/demoting. But the workload latency will be better. This is implemented in this patchset as the page promotion rate limit mechanism.
The promotion hot threshold is workload and system configuration dependent. So in this patchset, a method to adjust the hot threshold automatically is implemented. The basic idea is to control the number of the candidate promotion pages to match the promotion rate limit.
We used the pmbench memory accessing benchmark to test the patchset on a 2-socket server system with DRAM and PMEM installed. The test results are as follows:
                      pmbench score     promote rate
                       (accesses/s)             MB/s
                      -------------     ------------
    base                146887704.1            725.6
    hot selection       165695601.2            544.0
    rate limit          162814569.8            165.2
    auto adjustment     170495294.0            136.9
From the results above,
With hot page selection patch [1/3], the pmbench score increases about 12.8%, and promote rate (overhead) decreases about 25.0%, compared with base kernel.
With rate limit patch [2/3], pmbench score decreases about 1.7%, and promote rate decreases about 69.6%, compared with hot page selection patch.
With threshold auto adjustment patch [3/3], pmbench score increases about 4.7%, and promote rate decrease about 17.1%, compared with rate limit patch.
Baolin helped to test the patchset with MySQL on a machine which contains 1 DRAM node (30G) and 1 PMEM node (126G).
    sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
The tps can be improved about 5%.
This patch (of 3):
To optimize page placement in a memory tiering system with NUMA balancing, the hot pages in the slow memory node need to be identified. Essentially, the original NUMA balancing implementation selects the most recently accessed (MRU) pages to promote. But this isn't a perfect algorithm to identify the hot pages, because pages with quite low access frequency may still be accessed eventually, given that the NUMA balancing page table scanning period can be quite long (e.g. 60 seconds). The most frequently accessed (MFU) algorithm is better.
So, in this patch we implemented a better hot page selection algorithm, based on NUMA balancing page table scanning and hint page faults, as follows:
- When the page tables of the processes are scanned to change PTE/PMD to be PROT_NONE, the current time is recorded in struct page as scan time.
- When the page is accessed, a hint page fault will occur. The scan time is read from the struct page, and the hint page fault latency is defined as:
hint page fault time - scan time
The shorter the hint page fault latency of a page, the more likely its access frequency is high. So the hint page fault latency is a better estimation of whether a page is hot or cold.
It's hard to find some extra space in struct page to hold the scan time. Fortunately, we can reuse some bits used by the original NUMA balancing.
NUMA balancing uses some bits in struct page to store the page accessing CPU and PID (referring to page_cpupid_xchg_last()). These are used by the multi-stage node selection algorithm to avoid migrating pages that are shared between NUMA nodes back and forth. But for pages in the slow memory node, even if they are shared accessed by multiple NUMA nodes, as long as the pages are hot, they need to be promoted to the fast memory node. So the accessing CPU and PID information is unnecessary for the slow memory pages. We can reuse these bits in struct page to record the scan time. For the fast memory pages, these bits are used as before.
For the hot threshold, the default value is 1 second, which works well in our performance test. All pages with hint page fault latency < hot threshold will be considered hot.
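A minimal sketch of the selection rule described above (function and variable names are illustrative, not the kernel's):

    #include <stdbool.h>
    #include <stdint.h>

    /* A page is considered hot if it was accessed (hint page fault) soon
     * after its page table entry was scanned; both times use the same
     * unit, e.g. jiffies. */
    static bool page_is_hot(uint64_t scan_time, uint64_t fault_time,
                            uint64_t hot_threshold)
    {
        uint64_t latency = fault_time - scan_time;  /* hint page fault latency */

        return latency < hot_threshold;
    }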
It's hard for users to determine the hot threshold. So we don't provide a kernel ABI to set it, just provide a debugfs interface for advanced users to experiment. We will continue to work on a hot threshold automatic adjustment mechanism.
The downside of the above method is that the response time to the workload hot spot changing may be much longer. For example,
- A previous cold memory area becomes hot
- The hint page fault will be triggered. But the hint page fault latency isn't shorter than the hot threshold. So the pages will not be promoted.
- When the memory area is scanned again, maybe after a scan period, the hint page fault latency measured will be shorter than the hot threshold and the pages will be promoted.
To mitigate this, if there is enough free space in the fast memory node, the hot threshold is not used and all pages are promoted upon the hint page fault for fast response.
Thanks to Zhong Jiang, who reported and tested the fix for a bug when disabling memory tiering mode dynamically.
Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: osalvador <osalvador@suse.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
#
c2e40659 |
| 02-Sep-2022 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
sched/debug: fix dentry leak in update_sched_domain_debugfs
Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup()) leaks a dentry and with a hotplug stress test, the machine eventually runs out of memory.
Fix this up by using the newly created debugfs_lookup_and_remove() call instead which properly handles the dentry reference counting logic.
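The before/after pattern looks roughly like this (the directory name and parent dentry are assumptions for illustration, not copied from the patch):

    /* before: debugfs_lookup() takes a dentry reference that is never put */
    debugfs_remove(debugfs_lookup("domains", debugfs_sched));

    /* after: the combined helper drops the reference it takes internally */
    debugfs_lookup_and_remove("domains", debugfs_sched);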
Cc: Major Chen <major.chen@samsung.com> Cc: stable <stable@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Reported-by: Kuyo Chang <kuyo.chang@mediatek.com> Tested-by: Kuyo Chang <kuyo.chang@mediatek.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
Revision tags: v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25 |
|
#
801c1419 |
| 22-Feb-2022 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there
Collect all utility functionality source code files into a single kernel/sched/build_utility.c file, via #include-ing the .c files:
kernel/sched/clock.c kernel/sched/completion.c kernel/sched/loadavg.c kernel/sched/swait.c kernel/sched/wait_bit.c kernel/sched/wait.c
CONFIG_CPU_FREQ: kernel/sched/cpufreq.c
CONFIG_CPU_FREQ_GOV_SCHEDUTIL: kernel/sched/cpufreq_schedutil.c
CONFIG_CGROUP_CPUACCT: kernel/sched/cpuacct.c
CONFIG_SCHED_DEBUG: kernel/sched/debug.c
CONFIG_SCHEDSTATS: kernel/sched/stats.c
CONFIG_SMP: kernel/sched/cpupri.c kernel/sched/stop_task.c kernel/sched/topology.c
CONFIG_SCHED_CORE: kernel/sched/core_sched.c
CONFIG_PSI: kernel/sched/psi.c
CONFIG_MEMBARRIER: kernel/sched/membarrier.c
CONFIG_CPU_ISOLATION: kernel/sched/isolation.c
CONFIG_SCHED_AUTOGROUP: kernel/sched/autogroup.c
The goal is to amortize the 60+ KLOC header bloat from over a dozen build units into a single build unit.
The build time of build_utility.c also roughly matches the build time of core.c and fair.c - allowing better load-balancing of scheduler-only rebuilds.
Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
|
Revision tags: v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16 |
|
#
28c988c3 |
| 17-Jan-2022 |
Bharata B Rao <bharata@amd.com> |
sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa
The older format of /proc/pid/sched printed home node info, which required the mempolicy and task lock around mpol_get(). However, the format has changed since then and there is no need any more for sched_show_numa() to have the mempolicy argument, the associated mpol_get/put and the task_lock/unlock. Remove them.
Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com
|
Revision tags: v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14 |
|
#
4feee7d1 |
| 18-Oct-2021 |
Josh Don <joshdon@google.com> |
sched/core: Forced idle accounting
Adds accounting for "forced idle" time, which is time where a cookie'd task forces its SMT sibling to idle, despite the presence of runnable tasks.
Forced idle time is one means to measure the cost of enabling core scheduling (ie. the capacity lost due to the need to force idle).
Forced idle time is attributed to the thread responsible for causing the forced idle.
A few details:
- Forced idle time is displayed via /proc/PID/sched. It also requires that schedstats is enabled.
- Forced idle is only accounted when a sibling hyperthread is held idle despite the presence of runnable tasks. No time is charged if a sibling is idle but has no runnable tasks.
- Tasks with 0 cookie are never charged forced idle.
- For SMT > 2, we scale the amount of forced idle charged based on the number of forced idle siblings. Additionally, we split the time up and evenly charge it to all running tasks, as each is equally responsible for the forced idle (a sketch of this scaling follows below).
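A rough sketch of that SMT > 2 charging rule (names and surrounding context are assumed, not the actual accounting code):

    #include <stdint.h>

    /* Scale the forced-idle delta by the number of siblings that were held
     * idle and split it evenly over the tasks that were running. */
    static uint64_t forceidle_charge(uint64_t delta,
                                     unsigned int nr_forced_idle,
                                     unsigned int nr_running)
    {
        if (!nr_running)
            return 0;
        return delta * nr_forced_idle / nr_running;
    }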
Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211018203428.2025792-1-joshdon@google.com
|
Revision tags: v5.14.13, v5.14.12, v5.14.11, v5.14.10 |
|
#
769fdf83 |
| 06-Oct-2021 |
Peter Zijlstra <peterz@infradead.org> |
sched: Fix DEBUG && !SCHEDSTATS warn
When !SCHEDSTATS, schedstat_enabled() is an unconditional 0 and the whole block doesn't exist; however, GCC figures the scoped variable 'stats' is unused and complains about it.
Upgrade the warning from -Wunused-variable to -Wunused-but-set-variable by writing it in two statements. This fixes the build because the new warning is only emitted at W=1.
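The two-statement form referred to above looks roughly like this (variable and helper names are assumed for illustration):

    /* before: declaration with initializer -- with the if (0) block compiled
     * out, GCC emits -Wunused-variable */
    struct sched_statistics *stats = __schedstats_from_se(se);

    /* after: split into two statements -- the diagnostic becomes
     * -Wunused-but-set-variable, which only fires at W=1 */
    struct sched_statistics *stats;

    stats = __schedstats_from_se(se);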
Given that whole if(0) {} thing, I don't feel motivated to change things overly much and quite strongly feel this is the compiler being daft.
Fixes: cb3e971c435d ("sched: Make struct sched_statistics independent of fair sched class") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
Revision tags: v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63 |
|
#
847fc0cd |
| 05-Sep-2021 |
Yafang Shao <laoar.shao@gmail.com> |
sched: Introduce task block time in schedstats
Currently in schedstats we have sum_sleep_runtime and iowait_sum, but there's no metric to show how long a task is in D state. Once a task is in D state, it means the task is blocked in the kernel, for example waiting for a mutex. The D state is more frequent than iowait, and it is more critical than S state. So it is worth adding a metric to measure it.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210905143547.4668-5-laoar.shao@gmail.com
|
#
ceeadb83 |
| 05-Sep-2021 |
Yafang Shao <laoar.shao@gmail.com> |
sched: Make struct sched_statistics independent of fair sched class
If we want to use the schedstats facility to trace other sched classes, we should make it independent of the fair sched class. The struct sched_statistics is the scheduler statistics of a task_struct or a task_group, so we can move it into struct task_struct and struct task_group to achieve the goal.
After the patch, schedstats are organized as follows:
    struct task_struct {
        ...
        struct sched_entity       se;
        struct sched_rt_entity    rt;
        struct sched_dl_entity    dl;
        ...
        struct sched_statistics   stats;
        ...
    };
Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, suggested by Peter -
    struct sched_entity_stats {
        struct sched_entity       se;
        struct sched_statistics   stats;
    } __no_randomize_layout;
Then with the se in a task_group, we can easily get the stats.
The sched_statistics members may be frequently modified when schedstats is enabled; in order to avoid impacting random data which may be in the same cacheline with them, struct sched_statistics is defined as cacheline aligned.
As this patch changes a core struct of the scheduler, I verified the performance impact on the scheduler with 'perf bench sched pipe', as suggested by Mel. Below is the result, in which all the values are in usecs/op.

                                  Before      After
    kernel.sched_schedstats=0     5.2~5.4     5.2~5.4
    kernel.sched_schedstats=1     5.3~5.5     5.3~5.5

[These data differ a little from the earlier version because my old test machine was destroyed, so I had to use a different test machine.]
Almost no impact on the sched performance.
No functional change.
[lkp@intel.com: reported build failure in earlier version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
|
Revision tags: v5.14.1, v5.10.62, v5.14, v5.10.61 |
|
#
51ce83ed |
| 19-Aug-2021 |
Josh Don <joshdon@google.com> |
sched: reduce sched slice for SCHED_IDLE entities
Use a small, non-scaled min granularity for SCHED_IDLE entities, when competing with normal entities. This reduces the latency of getting a normal entity back on cpu, at the expense of increased context switch frequency of SCHED_IDLE entities.
The benefit of this change is to reduce the round-robin latency for normal entities when competing with a SCHED_IDLE entity.
Example: on a machine with HZ=1000, we spawned two threads, one of which is SCHED_IDLE, and affined them to one cpu. Without this patch, the SCHED_IDLE thread runs for 4ms then waits for 1.4s. With this patch, it runs for 1ms and waits 340ms (as it round-robins with the other thread).
Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
|
#
a480adde |
| 19-Aug-2021 |
Josh Don <joshdon@google.com> |
sched: Account number of SCHED_IDLE entities on each cfs_rq
Adds cfs_rq->idle_nr_running, which accounts the number of idle entities directly enqueued on the cfs_rq.
Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210820010403.946838-3-joshdon@google.com
|
#
26e9a1de |
| 02-Sep-2022 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
sched/debug: fix dentry leak in update_sched_domain_debugfs
commit c2e406596571659451f4b95e37ddfd5a8ef1d0dc upstream.
Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup()) leaks a dentry and with a hotplug stress test, the machine eventually runs out of memory.
Fix this up by using the newly created debugfs_lookup_and_remove() call instead which properly handles the dentry reference counting logic.
Cc: Major Chen <major.chen@samsung.com> Cc: stable <stable@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Reported-by: Kuyo Chang <kuyo.chang@mediatek.com> Tested-by: Kuyo Chang <kuyo.chang@mediatek.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
8bc68c44 |
| 17-Jan-2022 |
Bharata B Rao <bharata@amd.com> |
sched/debug: Remove mpol_get/put and task_lock/unlock from sched_show_numa
[ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ]
The older format of /proc/pid/sched printed home node info, which required the mempolicy and task lock around mpol_get(). However, the format has changed since then and there is no need any more for sched_show_numa() to have the mempolicy argument, the associated mpol_get/put and the task_lock/unlock. Remove them.
Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched") Signed-off-by: Bharata B Rao <bharata@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
|
#
70306618 |
| 27-Sep-2021 |
Mel Gorman <mgorman@techsingularity.net> |
sched/fair: Null terminate buffer when updating tunable_scaling
This patch null-terminates the temporary buffer in sched_scaling_write() so kstrtouint() does not return a failure, and checks that the value is valid.
Before:
    $ cat /sys/kernel/debug/sched/tunable_scaling
    1
    $ echo 0 > /sys/kernel/debug/sched/tunable_scaling
    -bash: echo: write error: Invalid argument
    $ cat /sys/kernel/debug/sched/tunable_scaling
    1
After:
    $ cat /sys/kernel/debug/sched/tunable_scaling
    1
    $ echo 0 > /sys/kernel/debug/sched/tunable_scaling
    $ cat /sys/kernel/debug/sched/tunable_scaling
    0
    $ echo 3 > /sys/kernel/debug/sched/tunable_scaling
    -bash: echo: write error: Invalid argument
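Inside the debugfs write handler, the fix amounts to something like the following sketch (simplified and incomplete; not the full handler):

    char buf[16];
    unsigned int scaling;

    if (cnt > sizeof(buf) - 1)
        cnt = sizeof(buf) - 1;
    if (copy_from_user(buf, ubuf, cnt))
        return -EFAULT;
    buf[cnt] = '\0';                          /* kstrtouint() needs a NUL */

    if (kstrtouint(buf, 10, &scaling))
        return -EINVAL;
    if (scaling >= SCHED_TUNABLESCALING_END)  /* reject out-of-range values */
        return -EINVAL;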
Fixes: 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210927114635.GH3959@techsingularity.net
|
Revision tags: v5.10.60 |
|
#
30400039 |
| 29-Jul-2021 |
Josh Don <joshdon@google.com> |
sched: Cgroup SCHED_IDLE support
This extends SCHED_IDLE to cgroups.
Interface: cgroup/cpu.idle
    0: default behavior
    1: SCHED_IDLE
Extending SCHED_IDLE to cgroups means that we incorporate the existing aspects of SCHED_IDLE; a SCHED_IDLE cgroup will count all of its descendant threads towards the idle_h_nr_running count of all of its ancestor cgroups. Thus, sched_idle_rq() will work properly. Additionally, SCHED_IDLE cgroups are configured with minimum weight.
There are two key differences between the per-task and per-cgroup SCHED_IDLE interface:
- The cgroup interface allows tasks within a SCHED_IDLE hierarchy to maintain their relative weights. The entity that is "idle" is the cgroup, not the tasks themselves.
- Since the idle entity is the cgroup, our SCHED_IDLE wakeup preemption decision is not made by comparing the current task with the woken task, but rather by comparing their matching sched_entity.
A typical use-case for this is a user that creates an idle and a non-idle subtree. The non-idle subtree will dominate competition vs the idle subtree, but the idle subtree will still be high priority vs other users on the system. The latter is accomplished via comparing the matching sched_entity in the wakeup preemption path (this could also be improved by making the sched_idle_rq() decision dependent on the perspective of a specific task).
For now, we maintain the existing SCHED_IDLE semantics. Future patches may make improvements that extend how we treat SCHED_IDLE entities.
The per-task_group idle field is an integer that currently only holds either a 0 or a 1. This is explicitly typed as an integer to allow for further extensions to this API. For example, a negative value may indicate a highly latency-sensitive cgroup that should be preferred for preemption/placement/etc.
Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210730020019.1487127-2-joshdon@google.com
|
Revision tags: v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39 |
|
#
459b09b5 |
| 18-May-2021 |
Valentin Schneider <valentin.schneider@arm.com> |
sched/debug: Don't update sched_domain debug directories before sched_debug_init()
Since CPU capacity asymmetry can stem purely from maximum frequency differences (e.g. Pixel 1), a rebuild of the scheduler topology can be issued upon loading cpufreq, see:
arch_topology.c::init_cpu_capacity_callback()
Turns out that if this rebuild happens *before* sched_debug_init() is run (which is a late initcall), we end up messing up the sched_domain debug directory: passing a NULL parent to debugfs_create_dir() ends up creating the directory at the debugfs root, which in this case creates /sys/kernel/debug/domains (instead of /sys/kernel/debug/sched/domains).
This currently doesn't happen on asymmetric systems which use cpufreq-scpi or cpufreq-dt drivers, as those are loaded via deferred_probe_initcall() (it is also a late initcall, but appears to be ordered *after* sched_debug_init()).
Ionela has been working on detecting maximum frequency asymmetry via ACPI, and that actually happens via a *device* initcall, thus before sched_debug_init(), and causes the aforementioned debugfs mayhem.
One option would be to punt sched_debug_init() down to fs_initcall_sync(). Preventing update_sched_domain_debugfs() from running before sched_debug_init() appears to be the safer option.
Fixes: 3b87f136f8fc ("sched,debug: Convert sysctl sched_domains to debugfs") Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lore.kernel.org/r/20210514095339.12979-1-ionela.voinescu@arm.com
|
#
68d7a190 |
| 02-Jun-2021 |
Dietmar Eggemann <dietmar.eggemann@arm.com> |
sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()).
Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early.
_task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true).
To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est().
The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value.
As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do.
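A sketch of the flag handling described above (the flag value and field widths are assumptions for illustration, not the kernel's exact definitions):

    #include <stdint.h>

    #define UTIL_AVG_UNCHANGED 0x80000000u   /* MSB of a 32-bit field */

    struct util_est {
        uint32_t enqueued;   /* MSB carries the internal "unchanged" flag */
        uint32_t ewma;
    };

    /* Callers see the real enqueued value; the flag stays util_est internal. */
    static uint32_t task_util_est(const struct util_est *ue)
    {
        uint32_t enqueued = ue->enqueued & ~UTIL_AVG_UNCHANGED;

        return enqueued > ue->ewma ? enqueued : ue->ewma;
    }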
This also fixes an issue reported by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation:
last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED)
Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
|
Revision tags: v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10 |
|
#
5cb9eaa3 |
| 17-Nov-2020 |
Peter Zijlstra <peterz@infradead.org> |
sched: Wrap rq::lock access
In preparation of playing games with rq->lock, abstract the thing using an accessor.
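The accessor pattern looks roughly like this (field and helper names are assumed here rather than quoted from the patch):

    /* one place knows where the lock actually lives ... */
    static inline raw_spinlock_t *rq_lockp(struct rq *rq)
    {
        return &rq->__lock;
    }

    /* ... and callers go through it instead of touching rq->lock directly,
     * so the returned lock can later be redirected (e.g. core-wide) */
    raw_spin_lock(rq_lockp(rq));
    lockdep_assert_held(rq_lockp(rq));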
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Don Hiatt <dhiatt@digitalocean.com> Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210422123308.136465446@infradead.org
|
#
ad789f84 |
| 15-Apr-2021 |
Waiman Long <longman@redhat.com> |
sched/debug: Fix cgroup_path[] serialization
The handling of a sysrq key can be activated by echoing the key to /proc/sysrq-trigger or via the magic key sequence typed into a terminal that is connected to the system in some way (serial, USB or other means). In the former case, the handling is done in a user context. In the latter case, it is likely to be in an interrupt context.
Currently in print_cpu() of kernel/sched/debug.c, sched_debug_lock is taken with interrupts disabled for the whole duration of the calls to print_*_stats() and print_rq(), which could last for quite some time if the information dump happens on the serial console.
If the system has many cpus and the sched_debug_lock is somehow busy (e.g. parallel sysrq-t), the system may hit a hard lockup panic depending on the actual serial console implementation of the system.
The purpose of sched_debug_lock is to serialize the use of the global cgroup_path[] buffer in print_cpu(). The rest of the printk() calls don't need serialization from sched_debug_lock.
Calling printk() with interrupt disabled can still be problematic if multiple instances are running. Allocating a stack buffer of PATH_MAX bytes is not feasible because of the limited size of the kernel stack.
The solution implemented in this patch is to allow only one caller at a time to use the full size group_path[], while other simultaneous callers will have to use shorter stack buffers with the possibility of path name truncation. A "..." suffix will be printed if truncation may have happened. The cgroup path name is provided for informational purpose only, so occasional path name truncation should not be a big problem.
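A condensed sketch of that scheme (lock, buffer and helper names are illustrative; error handling and the real call sites are omitted):

    static DEFINE_SPINLOCK(sched_debug_lock);
    static char group_path[PATH_MAX];

    static void print_cfs_group_path(struct task_group *tg)
    {
        unsigned long flags;

        if (spin_trylock_irqsave(&sched_debug_lock, flags)) {
            /* sole owner: safe to use the full-size static buffer */
            cgroup_path(tg->css.cgroup, group_path, PATH_MAX);
            pr_cont(" %s", group_path);
            spin_unlock_irqrestore(&sched_debug_lock, flags);
        } else {
            /* contended: fall back to a short stack buffer, mark truncation */
            char buf[128];
            int len = cgroup_path(tg->css.cgroup, buf, sizeof(buf));

            pr_cont(" %s%s", buf, len >= (int)sizeof(buf) - 1 ? "..." : "");
        }
    }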
Fixes: efe25c2c7b3a ("sched: Reinstate group names in /proc/sched_debug") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415195426.6677-1-longman@redhat.com
|