Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44 |
|
#
95f5c19c |
| 03-Aug-2023 |
Han Dapeng <han-dapeng@foxmail.com> |
Documentation: cgroup-v2.rst: Correct number of stats entries
Signed-off-by: Han Dapeng <han-dapeng@foxmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
|
Revision tags: v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34 |
|
#
e973dfe9 |
| 13-Jun-2023 |
LeiZhou-97 <lei.zhou@intel.com> |
cgroup/misc: Expose misc.current on cgroup v2 root
Hello,
This patch is to expose misc.current on cgroup v2 root for tracking how much of the resource has been consumed in total on the system.
Mos
cgroup/misc: Expose misc.current on cgroup v2 root
Hello,
This patch is to expose misc.current on cgroup v2 root for tracking how much of the resource has been consumed in total on the system.
Most of the cloud infrastucture use cgroup to fetch the host information for scheduling purpose.
Currently, the misc controller can be used by Intel TDX HKIDs and AMD SEV ASIDs, which are both used for creating encrypted VMs. Intel TDX and AMD SEV are mostly be used by the cloud providers for providing confidential VMs.
In actual use of a server, these confidential VMs may be launched in different ways. For the cloud solution, there are kubvirt and coco (tracked by kubepods.slice); on host, they can be booted directly through qemu by end user (tracked by user.slice), etc.
In this complex environment, when wanting to know how many resource is used in total it has to iterate through all existing slices to get the value of each misc.current and add them up to calculate the total number of consumed keys.
So exposing misc.current to root cgroup tends to give much easier when calculates how much resource has been used in total, which helps to schedule and count resources for the cloud infrastucture.
Signed-off-by: LeiZhou-97 <lei.zhou@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
Revision tags: v6.1.33, v6.1.32, v6.1.31 |
|
#
e0e0b412 |
| 24-May-2023 |
Lars R. Damerow <lars@pixar.com> |
mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg
This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export memcg->watermark via sysfs for v2 memcg"), but exports the swap
mm/memcontrol: export memcg.swap watermark via sysfs for v2 memcg
This patch is similar to commit 8e20d4b33266 ("mm/memcontrol: export memcg->watermark via sysfs for v2 memcg"), but exports the swap counter's watermark.
We allocate jobs to our compute farm using heuristics determined by memory and swap usage from previous jobs. Tracking the peak swap usage for new jobs is important for determining when jobs are exceeding their expected bounds, or when our baseline metrics are getting outdated.
Our toolset was written to use the "memory.memsw.max_usage_in_bytes" file in cgroups v1, and altering it to poll cgroups v2's "memory.swap.current" would give less accurate results as well as add complication to the code. Having this watermark exposed in sysfs is much preferred.
Link: https://lkml.kernel.org/r/20230524181734.125696-1-lars@pixar.com Signed-off-by: Lars R. Damerow <lars@pixar.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v6.1.30, v6.1.29, v6.1.28, v6.1.27 |
|
#
ddf63516 |
| 28-Apr-2023 |
Hou Tao <houtao1@huawei.com> |
blk-ioprio: Introduce promote-to-rt policy
Since commit a78418e6a04c ("block: Always initialize bio IO priority on submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling blkcg_set_iop
blk-ioprio: Introduce promote-to-rt policy
Since commit a78418e6a04c ("block: Always initialize bio IO priority on submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling blkcg_set_ioprio(), so there will be no way to promote the io-priority of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be greater than or equals to IOPRIO_CLASS_RT.
It seems possible to call blkcg_set_ioprio() first then try to initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work for bio in which bi_ioprio is already initialized (e.g., direct-io), so introduce a new promote-to-rt policy to promote the iopriority of bio to IOPRIO_CLASS_RT if the ioprio is not already RT.
For none-to-rt policy, although it doesn't work now, but considering that its purpose was also to override the io-priority to RT and allowing for a smoother transition, just keep it and treat it as an alias of the promote-to-rt policy.
Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
#
5647e53f |
| 01-Jun-2023 |
Dan Schatzberg <schatzberg.dan@gmail.com> |
cgroup: Documentation: Clarify usage of memory limits
The existing documentation refers to memory.high as the "main mechanism to control memory usage." This seems incorrect to me - memory.high can r
cgroup: Documentation: Clarify usage of memory limits
The existing documentation refers to memory.high as the "main mechanism to control memory usage." This seems incorrect to me - memory.high can result in reclaim pressure which simply leads to stalls unless some external component observes and actions on it (e.g. systemd-oomd can be used for this purpose). While this is feasible, users are unaware of this interaction and are led to believe that memory.high alone is an effective mechanism for limiting memory.
The documentation should recommend the use of memory.max as the effective way to enforce memory limits - it triggers reclaim and results in OOM kills by itself.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
Revision tags: v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9 |
|
#
dbeb56fe |
| 29-Jan-2023 |
Randy Dunlap <rdunlap@infradead.org> |
Documentation: admin-guide: correct spelling
Correct spelling problems for Documentation/admin-guide/ as reported by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Muke
Documentation: admin-guide: correct spelling
Correct spelling problems for Documentation/admin-guide/ as reported by codespell.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@redhat.com Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-media@vger.kernel.org Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230129231053.20863-2-rdunlap@infradead.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
show more ...
|
Revision tags: v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14 |
|
#
55ab834a |
| 16-Dec-2022 |
Michal Hocko <mhocko@suse.com> |
Revert "mm: add nodes= arg to memory.reclaim"
This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
Although it is recognized that a finer grained pro-active reclaim is something we need an
Revert "mm: add nodes= arg to memory.reclaim"
This reverts commit 12a5d3955227b0d7e04fb793ccceeb2a1dd275c5.
Although it is recognized that a finer grained pro-active reclaim is something we need and want the semantic of this implementation is really ambiguous.
In a follow up discussion it became clear that there are two essential usecases here. One is to use memory.reclaim to pro-actively reclaim memory and expectation is that the requested and reported amount of memory is uncharged from the memcg. Another usecase focuses on pro-active demotion when the memory is merely shuffled around to demotion targets while the overall charged memory stays unchanged.
The current implementation considers demoted pages as reclaimed and that break both usecases. [1] has tried to address the reporting part but there are more issues with that summarized in [2] and follow up emails.
Let's revert the nodemask based extension of the memcg pro-active reclaim for now until we settle with a more robust semantic.
[1] http://lkml.kernel.org/r/http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com [2] http://lkml.kernel.org/r/Y5bsmpCyeryu3Zz1@dhcp22.suse.cz
Link: https://lkml.kernel.org/r/Y5xASNe1x8cusiTx@dhcp22.suse.cz Fixes: 12a5d3955227b0d ("mm: add nodes= arg to memory.reclaim") Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
acbee592 |
| 16-Dec-2022 |
Qais Yousef <qyousef@layalina.io> |
sched/documentation: Document the util clamp feature
Add a document explaining the util clamp feature: what it is and how to use it. The new document hopefully covers everything one needs to know ab
sched/documentation: Document the util clamp feature
Add a document explaining the util clamp feature: what it is and how to use it. The new document hopefully covers everything one needs to know about uclamp.
Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20221216235716.201923-1-qyousef@layalina.io Cc: Jonathan Corbet <corbet@lwn.net>
show more ...
|
Revision tags: v6.0.13, v6.1, v6.0.12 |
|
#
12a5d395 |
| 02-Dec-2022 |
Mina Almasry <almasrymina@google.com> |
mm: add nodes= arg to memory.reclaim
The nodes= arg instructs the kernel to only scan the given nodes for proactive reclaim. For example use cases, consider a 2 tier memory system:
nodes 0,1 -> to
mm: add nodes= arg to memory.reclaim
The nodes= arg instructs the kernel to only scan the given nodes for proactive reclaim. For example use cases, consider a 2 tier memory system:
nodes 0,1 -> top tier nodes 2,3 -> second tier
$ echo "1m nodes=0" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory from node 0. Since node 0 is a top tier node, demotion will be attempted first. This is useful to direct proactive reclaim to specific nodes that are under pressure.
$ echo "1m nodes=2,3" > memory.reclaim
This instructs the kernel to attempt to reclaim 1m memory in the second tier, since this tier of memory has no demotion targets the memory will be reclaimed.
$ echo "1m nodes=0,1" > memory.reclaim
Instructs the kernel to reclaim memory from the top tier nodes, which can be desirable according to the userspace policy if there is pressure on the top tiers. Since these nodes have demotion targets, the kernel will attempt demotion first.
Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), the proactive reclaim interface memory.reclaim does both reclaim and demotion. Reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault.
The 'nodes' arg is useful to allow the userspace to control demotion and reclaim independently according to its policy: if the memory.reclaim is called on a node with demotion targets, it will attempt demotion first; if it is called on a node without demotion targets, it will only attempt reclaim.
Link: https://lkml.kernel.org/r/20221202223533.1785418-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: zefan li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6 |
|
#
57e9cc50 |
| 26-Oct-2022 |
Johannes Weiner <hannes@cmpxchg.org> |
mm: vmscan: split khugepaged stats from direct reclaim stats
Direct reclaim stats are useful for identifying a potential source for application latency, as well as spotting issues with kswapd. Howe
mm: vmscan: split khugepaged stats from direct reclaim stats
Direct reclaim stats are useful for identifying a potential source for application latency, as well as spotting issues with kswapd. However, khugepaged currently distorts the picture: as a kernel thread it doesn't impose allocation latencies on userspace, and it explicitly opts out of kswapd reclaim. Its activity showing up in the direct reclaim stats is misleading. Counting it as kswapd reclaim could also cause confusion when trying to understand actual kswapd behavior.
Break out khugepaged from the direct reclaim counters into new pgsteal_khugepaged, pgdemote_khugepaged, pgscan_khugepaged counters.
Test with a huge executable (CONFIG_READ_ONLY_THP_FOR_FS):
pgsteal_kswapd 1342185 pgsteal_direct 0 pgsteal_khugepaged 3623 pgscan_kswapd 1345025 pgscan_direct 0 pgscan_khugepaged 3623
Link: https://lkml.kernel.org/r/20221026180133.377671-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Eric Bergen <ebergen@meta.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66 |
|
#
34f26a15 |
| 07-Sep-2022 |
Chengming Zhou <zhouchengming@bytedance.com> |
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may cause non-negligible overhe
sched/psi: Per-cgroup PSI accounting disable/re-enable interface
PSI accounts stalls for each cgroup separately and aggregates it at each level of the hierarchy. This may cause non-negligible overhead for some workloads when under deep level of the hierarchy.
commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") make PSI to skip per-cgroup stall accounting, only account system-wide to avoid this each level overhead.
But for our use case, we also want leaf cgroup PSI stats accounted for userspace adjustment on that cgroup, apart from only system-wide adjustment.
So this patch introduce a per-cgroup PSI accounting disable/re-enable interface "cgroup.pressure", which is a read-write single value file that allowed values are "0" and "1", the defaults is "1" so per-cgroup PSI stats is enabled by default.
Implementation details:
It should be relatively straight-forward to disable and re-enable state aggregation, time tracking, averaging on a per-cgroup level, if we can live with losing history from while it was disabled. I.e. the avgs will restart from 0, total= will have gaps.
But it's hard or complex to stop/restart groupc->tasks[] updates, which is not implemented in this patch. So we always update groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when the cgroup PSI stats is disabled.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
show more ...
|
Revision tags: v5.15.65, v5.15.64 |
|
#
52b1364b |
| 25-Aug-2022 |
Chengming Zhou <zhouchengming@bytedance.com> |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on so
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
Now PSI already tracked workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have obvious impact on some workload productivity, such as web service workload.
When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time from update_rq_clock_task(), in which we can record that delta to CPU curr task's cgroups as PSI_IRQ_FULL status.
Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in the current task on the CPU, make nothing productive could run even if it were runnable, so we only use PSI_IRQ_FULL.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
show more ...
|
#
8cbfdc24 |
| 01-Sep-2022 |
Waiman Long <longman@redhat.com> |
cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced "isolated" cpuset partition type as well as other c
cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced "isolated" cpuset partition type as well as other changes made in other cpuset patches.
Signed-off-by: Waiman Long <longman@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
Revision tags: v5.15.63 |
|
#
ebc97a52 |
| 22-Aug-2022 |
Yosry Ahmed <yosryahmed@google.com> |
mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
We keep track of several kernel memory stats (total kernel memory, page tables, stack, vmalloc, etc) on multiple levels (global, pe
mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.
We keep track of several kernel memory stats (total kernel memory, page tables, stack, vmalloc, etc) on multiple levels (global, per-node, per-memcg, etc). These stats give insights to users to how much memory is used by the kernel and for what purposes.
Currently, memory used by KVM mmu is not accounted in any of those kernel memory stats. This patch series accounts the memory pages used by KVM for page tables in those stats in a new NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account for other types of secondary pages tables (e.g. iommu page tables).
KVM has a decent number of large allocations that aren't for page tables, but for most of them, the number/size of those allocations scales linearly with either the number of vCPUs or the amount of memory assigned to the VM. KVM's secondary page table allocations do not scale linearly, especially when nested virtualization is in use.
From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's per-VM pages_{4k,2m,1g} stats unless the guest is doing something bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is forced to allocate a large number of page tables even though the guest isn't accessing that much memory). However, someone would need to either understand how KVM works to make that connection, or know (or be told) to go look at KVM's stats if they're running VMs to better decipher the stats.
Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE is informative. For example, when backing a VM with THP vs. HugeTLB, NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order of magnitude higher with THP. So having this stat will at the very least prove to be useful for understanding tradeoffs between VM backing types, and likely even steer folks towards potential optimizations.
The original discussion with more details about the rationale: https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org
This stat will be used by subsequent patches to count KVM mmu memory usage.
Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
show more ...
|
Revision tags: v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55 |
|
#
73b73bac |
| 14-Jul-2022 |
Yosry Ahmed <yosryahmed@google.com> |
mm: vmpressure: don't count proactive reclaim in vmpressure
memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Recla
mm: vmpressure: don't count proactive reclaim in vmpressure
memory.reclaim is a cgroup v2 interface that allows users to proactively reclaim memory from a memcg, without real memory pressure. Reclaim operations invoke vmpressure, which is used: (a) To notify userspace of reclaim efficiency in cgroup v1, and (b) As a signal for a memcg being under memory pressure for networking (see mem_cgroup_under_socket_pressure()).
For (a), vmpressure notifications in v1 are not affected by this change since memory.reclaim is a v2 feature.
For (b), the effects of the vmpressure signal (according to Shakeel [1]) are as follows: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle current thread on the tx path.
Since proactive reclaim is invoked directly by userspace, not by memory pressure, it makes sense not to throttle networking. Hence, this change makes sure that proactive reclaim caused by memory.reclaim does not trigger vmpressure.
[1] https://lore.kernel.org/lkml/CALvZod68WdrXEmBpOkadhB5GPYmCXaDZzXH=yyGOCAjFRn4NDQ@mail.gmail.com/
[yosryahmed@google.com: update documentation] Link: https://lkml.kernel.org/r/20220721173015.2643248-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220714064918.2576464-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: NeilBrown <neilb@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
c808f463 |
| 27-Jul-2022 |
Tejun Heo <tj@kernel.org> |
cgroup: remove "no" prefixed mount options
30312730bd02 ("cgroup: Add "no" prefixed mount options") added "no" prefixed mount options to allow turning them off and 6a010a49b63a ("cgroup: Make !percp
cgroup: remove "no" prefixed mount options
30312730bd02 ("cgroup: Add "no" prefixed mount options") added "no" prefixed mount options to allow turning them off and 6a010a49b63a ("cgroup: Make !percpu threadgroup_rwsem operations optional") added one more "no" prefixed mount option. However, Michal pointed out that the "no" prefixed options aren't necessary in allowing mount options to be turned off:
# grep group /proc/mounts cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0 # mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup # grep cgroup /proc/mounts cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0
Note that this is different from the remount behavior when the mount(1) is invoked without the device argument - "none":
# grep cgroup /proc/mounts cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0 # mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup # grep cgroup /proc/mounts cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
While a bit confusing, given that there is a way to turn off the options, there's no reason to have the explicit "no" prefixed options. Let's remove them.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
show more ...
|
#
6a010a49 |
| 23-Jul-2022 |
Tejun Heo <tj@kernel.org> |
cgroup: Make !percpu threadgroup_rwsem operations optional
3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem
cgroup: Make !percpu threadgroup_rwsem operations optional
3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem because the impiled synchronize_rcu() on write locking was pushing up the latencies too much for android which constantly moves processes between cgroups.
This makes the hotter paths - fork and exit - slower as they're always forced into the slow path. There is no reason to force this on everyone especially given that more common static usage pattern can now completely avoid write-locking the rwsem. Write-locking is elided when turning on and off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a cgroup without grabbing the rwsem.
Restore the default percpu operations and introduce the mount option "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need lower latencies for the dynamic operations.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutn� <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Oleg Nesterov <oleg@redhat.com>
show more ...
|
#
30312730 |
| 14-Jul-2022 |
Tejun Heo <tj@kernel.org> |
cgroup: Add "no" prefixed mount options
We allow modifying these mount options via remount. Let's add "no" prefixed variants so that they can be turned off too.
Signed-off-by: Tejun Heo <tj@kernel.
cgroup: Add "no" prefixed mount options
We allow modifying these mount options via remount. Let's add "no" prefixed variants so that they can be turned off too.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutný <mkoutny@suse.com>
show more ...
|
Revision tags: v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45 |
|
#
673520f8 |
| 04-Jun-2022 |
Qi Zheng <zhengqi.arch@bytedance.com> |
mm: memcontrol: add {pgscan,pgsteal}_{kswapd,direct} items in memory.stat of cgroup v2
There are already statistics of {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct of memcg event here, but no
mm: memcontrol: add {pgscan,pgsteal}_{kswapd,direct} items in memory.stat of cgroup v2
There are already statistics of {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct of memcg event here, but now only the sum of the two is displayed in memory.stat of cgroup v2.
In order to obtain more accurate information during monitoring and debugging, and to align with the display in /proc/vmstat, it better to display {pgscan,pgsteal}_kswapd and {pgscan,pgsteal}_direct separately.
Also, for forward compatibility, we still display pgscan and pgsteal items so that it won't break existing applications.
[zhengqi.arch@bytedance.com: add comment for memcg_vm_event_stat (suggested by Michal)] Link: https://lkml.kernel.org/r/20220606154028.55030-1-zhengqi.arch@bytedance.com [zhengqi.arch@bytedance.com: fix the doc, thanks to Johannes] Link: https://lkml.kernel.org/r/20220607064803.79363-1-zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/20220604082209.55174-1-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v5.15.44, v5.15.43, v5.15.42, v5.18 |
|
#
f4840ccf |
| 19-May-2022 |
Johannes Weiner <hannes@cmpxchg.org> |
zswap: memcg accounting
Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify
zswap: memcg accounting
Applications can currently escape their cgroup memory containment when zswap is enabled. This patch adds per-cgroup tracking and limiting of zswap backend memory to rectify this.
The existing cgroup2 memory.stat file is extended to show zswap statistics analogous to what's in meminfo and vmstat. Furthermore, two new control files, memory.zswap.current and memory.zswap.max, are added to allow tuning zswap usage on a per-workload basis. This is important since not all workloads benefit from zswap equally; some even suffer compared to disk swap when memory contents don't compress well. The optimal size of the zswap pool, and the threshold for writeback, also depends on the size of the workload's warm set.
The implementation doesn't use a traditional page_counter transaction. zswap is unconventional as a memory consumer in that we only know the amount of memory to charge once expensive compression has occurred. If zwap is disabled or the limit is already exceeded we obviously don't want to compress page upon page only to reject them all. Instead, the limit is checked against current usage, then we compress and charge. This allows some limit overrun, but not enough to matter in practice.
[hannes@cmpxchg.org: fix for CONFIG_SLOB builds] Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org [hannes@cmpxchg.org: opt out of cgroups v1] Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v5.15.41, v5.15.40 |
|
#
8e20d4b3 |
| 13-May-2022 |
Ganesan Rajagopal <rganesan@arista.com> |
mm/memcontrol: export memcg->watermark via sysfs for v2 memcg
We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports me
mm/memcontrol: export memcg->watermark via sysfs for v2 memcg
We run a lot of automated tests when building our software and run into OOM scenarios when the tests run unbounded. v1 memcg exports memcg->watermark as "memory.max_usage_in_bytes" in sysfs. We use this metric to heuristically limit the number of tests that can run in parallel based on per test historical data.
This metric is currently not exported for v2 memcg and there is no other easy way of getting this information. getrusage() syscall returns "ru_maxrss" which can be used as an approximation but that's the max RSS of a single child process across all children instead of the aggregated max for all child processes. The only work around is to periodically poll "memory.current" but that's not practical for short-lived one-off cgroups.
Hence, expose memcg->watermark as "memory.peak" for v2 memcg.
Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.com Signed-off-by: Ganesan Rajagopal <rganesan@arista.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v5.15.39, v5.15.38, v5.15.37 |
|
#
94968384 |
| 29-Apr-2022 |
Shakeel Butt <shakeelb@google.com> |
memcg: introduce per-memcg reclaim interface
This patch series adds a memory.reclaim proactive reclaim interface. The rationale behind the interface and how it works are in the first patch.
This p
memcg: introduce per-memcg reclaim interface
This patch series adds a memory.reclaim proactive reclaim interface. The rationale behind the interface and how it works are in the first patch.
This patch (of 4):
Introduce a memcg interface to trigger memory reclaim on a memory cgroup.
Use case: Proactive Reclaim ---------------------------
A userspace proactive reclaimer can continuously probe the memcg to reclaim a small amount of memory. This gives more accurate and up-to-date workingset estimation as the LRUs are continuously sorted and can potentially provide more deterministic memory overcommit behavior. The memory overcommit controller can provide more proactive response to the changing behavior of the running applications instead of being reactive.
A userspace reclaimer's purpose in this case is not a complete replacement for kswapd or direct reclaim, it is to proactively identify memory savings opportunities and reclaim some amount of cold pages set by the policy to free up the memory for more demanding jobs or scheduling new jobs.
A user space proactive reclaimer is used in Google data centers. Additionally, Meta's TMO paper recently referenced a very similar interface used for user space proactive reclaim: https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
Benefits of a user space reclaimer: -----------------------------------
1) More flexible on who should be charged for the cpu of the memory reclaim. For proactive reclaim, it makes more sense to be centralized.
2) More flexible on dedicating the resources (like cpu). The memory overcommit controller can balance the cost between the cpu usage and the memory reclaimed.
3) Provides a way to the applications to keep their LRUs sorted, so, under memory pressure better reclaim candidates are selected. This also gives more accurate and uptodate notion of working set for an application.
Why memory.high is not enough? ------------------------------
- memory.high can be used to trigger reclaim in a memcg and can potentially be used for proactive reclaim. However there is a big downside in using memory.high. It can potentially introduce high reclaim stalls in the target application as the allocations from the processes or the threads of the application can hit the temporary memory.high limit.
- Userspace proactive reclaimers usually use feedback loops to decide how much memory to proactively reclaim from a workload. The metrics used for this are usually either refaults or PSI, and these metrics will become messy if the application gets throttled by hitting the high limit.
- memory.high is a stateful interface, if the userspace proactive reclaimer crashes for any reason while triggering reclaim it can leave the application in a bad state.
- If a workload is rapidly expanding, setting memory.high to proactively reclaim memory can result in actually reclaiming more memory than intended.
The benefits of such interface and shortcomings of existing interface were further discussed in this RFC thread: https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
Interface: ----------
Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to trigger reclaim in the target memory cgroup.
The interface is introduced as a nested-keyed file to allow for future optional arguments to be easily added to configure the behavior of reclaim.
Possible Extensions: --------------------
- This interface can be extended with an additional parameter or flags to allow specifying one or more types of memory to reclaim from (e.g. file, anon, ..).
- The interface can also be extended with a node mask to reclaim from specific nodes. This has use cases for reclaim-based demotion in memory tiering systens.
- A similar per-node interface can also be added to support proactive reclaim and reclaim-based demotion in systems without memcg.
- Add a timeout parameter to make it easier for user space to call the interface without worrying about being blocked for an undefined amount of time.
For now, let's keep things simple by adding the basic functionality.
[yosryahmed@google.com: worked on versions v2 onwards, refreshed to current master, updated commit message based on recent discussions and use cases] Link: https://lkml.kernel.org/r/20220425190040.2475377-1-yosryahmed@google.com Link: https://lkml.kernel.org/r/20220425190040.2475377-2-yosryahmed@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Co-developed-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Wei Xu <weixugc@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: "Michal Koutn" <mkoutny@suse.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v5.15.36 |
|
#
a477b94d |
| 22-Apr-2022 |
Joel Savitz <jsavitz@redhat.com> |
Documentation: add missing angle bracket in cgroup-v2 doc
Trivial addition of missing closing angle backet.
Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/r/202204221
Documentation: add missing angle bracket in cgroup-v2 doc
Trivial addition of missing closing angle backet.
Signed-off-by: Joel Savitz <jsavitz@redhat.com> Link: https://lore.kernel.org/r/20220422164526.3464306-1-jsavitz@redhat.com Signed-off-by: Jonathan Corbet <corbet@lwn.net>
show more ...
|
Revision tags: v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31 |
|
#
a8c49af3 |
| 22-Mar-2022 |
Yosry Ahmed <yosryahmed@google.com> |
memcg: add per-memcg total kernel memory stat
Currently memcg stats show several types of kernel memory: kernel stack, page tables, sock, vmalloc, and slab. However, there are other allocations wit
memcg: add per-memcg total kernel memory stat
Currently memcg stats show several types of kernel memory: kernel stack, page tables, sock, vmalloc, and slab. However, there are other allocations with __GFP_ACCOUNT (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted in any of those stats, a few examples are:
- various kvm allocations (e.g. allocated pages to create vcpus) - io_uring - tmp_page in pipes during pipe_write() - bpf ringbuffers - unix sockets
Keeping track of the total kernel memory is essential for the ease of migration from cgroup v1 to v2 as there are large discrepancies between v1's kmem.usage_in_bytes and the sum of the available kernel memory stats in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel allocations is an impractical maintenance burden as there a lot of those all over the kernel code, with more use cases likely to show up in the future.
Therefore, add a "kernel" memcg stat that is analogous to kmem page counter, with added benefits such as using rstat infrastructure which aggregates stats more efficiently. Additionally, this provides a lighter alternative in case the legacy kmem is deprecated in the future
[yosryahmed@google.com: v2] Link: https://lkml.kernel.org/r/20220203193856.972500-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20220201200823.3283171-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15 |
|
#
f4776199 |
| 14-Jan-2022 |
Mina Almasry <almasrymina@google.com> |
hugetlb: add hugetlb.*.numa_stat file
For hugetlb backed jobs/VMs it's critical to understand the numa information for the memory backing these jobs to deliver optimal performance.
Currently this t
hugetlb: add hugetlb.*.numa_stat file
For hugetlb backed jobs/VMs it's critical to understand the numa information for the memory backing these jobs to deliver optimal performance.
Currently this technically can be queried from /proc/self/numa_maps, but there are significant issues with that. Namely:
1. Memory can be mapped or unmapped.
2. numa_maps are per process and need to be aggregated across all processes in the cgroup. For shared memory this is more involved as the userspace needs to make sure it doesn't double count shared mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds to the contention on this lock.
For these reasons I propose simply adding hugetlb.*.numa_stat file, which shows the numa information of the cgroup similarly to memory.numa_stat.
On cgroup-v2: cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0
On cgroup-v1: cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat total=2097152 N0=2097152 N1=0 hierarichal_total=2097152 N0=2097152 N1=0
This patch was tested manually by allocating hugetlb memory and querying the hugetlb.*.numa_stat file of the cgroup and its parents.
[colin.i.king@googlemail.com: fix spelling mistake "hierarichal" -> "hierarchical"] Link: https://lkml.kernel.org/r/20211125090635.23508-1-colin.i.king@gmail.com [keescook@chromium.org: fix copy/paste array assignment] Link: https://lkml.kernel.org/r/20211203065647.2819707-1-keescook@chromium.org
Link: https://lkml.kernel.org/r/20211123001020.4083653-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Jue Wang <juew@google.com> Cc: Yang Yao <ygyao@google.com> Cc: Joanna Li <joannali@google.com> Cc: Cannon Matthews <cannonmatthews@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|