Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13 |
|
#
2e205eb5 |
| 15-Jan-2024 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Read supported bandwidth sources from CPUID
[ Upstream commit 54e35eb8611cce5550d3d7689679b1a91c864f28 ]
If the BMEC (Bandwidth Monitoring Event Configuration) feature is supported, the bandwidth events can be configured. The maximum supported bandwidth bitmask can be read from CPUID:
CPUID_Fn80000020_ECX_x03 [Platform QoS Monitoring Bandwidth Event Configuration]
    Bits    Description
    31:7    Reserved
     6:0    Identifies the bandwidth sources that can be tracked.
While at it, move the mask checking to mon_config_write() before iterating over all the domains. Also, print the valid bitmask when the user tries to configure invalid event configuration value.
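The mask check described above can be sketched in user space as follows. This is a minimal illustration, not the kernel code: the function names and the error text are hypothetical, and only the bit layout (bits 6:0 valid, bits 31:7 reserved) comes from the commit message.

```python
# Bits 6:0 of CPUID_Fn80000020_ECX_x03 identify the trackable bandwidth
# sources; bits 31:7 are reserved (layout taken from the commit message).
BMEC_FIELD_MASK = 0x7F


def supported_sources(cpuid_ecx: int) -> int:
    """Return the supported-source bitmask, dropping the reserved bits."""
    return cpuid_ecx & BMEC_FIELD_MASK


def check_event_config(value: int, valid_mask: int) -> None:
    """Reject a configuration value that sets any unsupported bit.

    Hypothetical helper: the check runs once, before any per-domain work,
    and the error reports the valid bitmask to the caller.
    """
    if value & ~valid_mask:
        raise ValueError(
            f"invalid event configuration 0x{value:x}: "
            f"valid mask is 0x{valid_mask:02x}"
        )
```

Validating once up front means no domain is touched when the value is invalid, which is the reason the commit moves the check ahead of the domain loop.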
The CPUID details are documented in the Processor Programming Reference (PPR) Vol 1.1 for AMD Family 19h Model 11h B1 - 55901 Rev 0.25 in the Link tag.
Fixes: dc2a3e857981 ("x86/resctrl: Add interface to read mbm_total_bytes_config") Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: https://lore.kernel.org/r/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
|
Revision tags: v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25 |
|
#
8da2b938 |
| 19-Apr-2023 |
Peter Newman <peternewman@google.com> |
x86/resctrl: Implement rename op for mon groups
To change the resources allocated to a large group of tasks, such as an application container, a container manager must write all of the tasks' IDs into the tasks file interface of the new control group. This is challenging when the container's task list is always changing.
In addition, if the container manager is using monitoring groups to separately track the bandwidth of containers assigned to the same control group, when moving a container, it must first move the container's tasks to the default monitoring group of the new control group before it can move these tasks into the container's replacement monitoring group under the destination control group. This is undesirable because it makes bandwidth usage during the move unattributable to the correct tasks and resets monitoring event counters and cache usage information for the group.
Implement the rename operation only for resctrlfs monitor groups to enable users to move a monitoring group from one control group to another. This effects a change in resources allocated to all the tasks in the monitoring group while otherwise leaving the monitoring data intact.
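From user space, such a move is an ordinary rename of the monitoring group's directory. The sketch below assumes the usual resctrl layout, where each control group has a mon_groups subdirectory. The helper name and its parameters are hypothetical; on a real system the root would be the resctrl mount point and the call needs suitable privileges.

```python
import os


def move_mon_group(root: str, src_ctrl: str, dst_ctrl: str, name: str) -> str:
    """Move monitoring group `name` from one control group to another.

    `root` is the resctrl mount point (for example /sys/fs/resctrl).
    The layout <root>/<ctrl group>/mon_groups/<name> is an assumption
    made for this illustration.
    """
    src = os.path.join(root, src_ctrl, "mon_groups", name)
    dst = os.path.join(root, dst_ctrl, "mon_groups", name)
    # A single rename moves every task in the group at once, so the
    # caller does not have to chase a changing task list.
    os.rename(src, dst)
    return dst
```

Because the group directory itself is moved, the container manager never has to place tasks in the destination's default monitoring group first.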
Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20230419125015.693566-3-peternewman@google.com
|
#
c45c06d4 |
| 19-Apr-2023 |
Peter Newman <peternewman@google.com> |
x86/resctrl: Factor rdtgroup lock for multi-file ops
rdtgroup_kn_lock_live() can only release a kernfs reference for a single file before waiting on the rdtgroup_mutex, limiting its usefulness for operations on multiple files, such as rename.
Factor the work needed to respectively break and unbreak active protection on an individual file into rdtgroup_kn_{get,put}().
No functional change.
Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/20230419125015.693566-2-peternewman@google.com
|
#
2997d94b |
| 15-May-2023 |
Shawn Wang <shawnwang@linux.alibaba.com> |
x86/resctrl: Only show tasks' pid in current pid namespace
When writing a task id to the "tasks" file in an rdtgroup, rdtgroup_tasks_write() treats the pid as a number in the current pid namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows the list of global pids from the init namespace, which is confusing and incorrect.
To be more robust, let the "tasks" file only show pids in the current pid namespace.
Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files") Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Acked-by: Fenghua Yu <fenghua.yu@intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/all/20230116071246.97717-1-shawnwang@linux.alibaba.com/
|
Revision tags: v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7 |
|
#
0424a7df |
| 17-Jan-2023 |
Shawn Wang <shawnwang@linux.alibaba.com> |
x86/resctrl: Clear staged_config[] before and after it is used
As a temporary storage, staged_config[] in rdt_domain should be cleared before and after it is used. The stale value in staged_config[] could cause an MSR access error.
Here is a reproducer on a system with 16 usable CLOSIDs for a 15-way L3 Cache (MBA should be disabled if the number of CLOSIDs for MB is less than 16.):
    mount -t resctrl resctrl -o cdp /sys/fs/resctrl
    mkdir /sys/fs/resctrl/p{1..7}
    umount /sys/fs/resctrl/
    mount -t resctrl resctrl /sys/fs/resctrl
    mkdir /sys/fs/resctrl/p{1..8}
An error occurs when creating resource group named p8:
    unchecked MSR access error: WRMSR to 0xca0 (tried to write 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
    Call Trace:
     <IRQ>
     __flush_smp_call_function_queue+0x11d/0x170
     __sysvec_call_function+0x24/0xd0
     sysvec_call_function+0x89/0xc0
     </IRQ>
     <TASK>
     asm_sysvec_call_function+0x16/0x20
When creating a new resource control group, hardware will be configured by the following process:
    rdtgroup_mkdir()
      rdtgroup_mkdir_ctrl_mon()
        rdtgroup_init_alloc()
          resctrl_arch_update_domains()
resctrl_arch_update_domains() iterates and updates all resctrl_conf_type whose have_new_ctrl is true. Since staged_config[] holds the same values as when CDP was enabled, it will continue to update the CDP_CODE and CDP_DATA configurations. When group p8 is created, get_config_index() called in resctrl_arch_update_domains() will return 16 and 17 as the CLOSIDs for CDP_CODE and CDP_DATA, which will be translated to an invalid register - 0xca0 in this scenario.
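The arithmetic behind the invalid register can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: the L3 mask MSR base of 0xc90 and the exact CODE/DATA index mapping are assumptions, chosen to be consistent with the indices 16 and 17 and the address 0xca0 quoted in the commit message.

```python
L3_CBM_BASE = 0xC90  # assumed base address of the L3 capacity mask MSRs
NUM_CLOSIDS = 16     # usable CLOSIDs in the reproducer


def config_index(closid: int, conf_type: str) -> int:
    """Map a CLOSID and configuration type to a hardware config index.

    Assumed mapping: with CDP, each CLOSID uses two consecutive indices.
    """
    if conf_type == "CDP_CODE":
        return closid * 2 + 1
    if conf_type == "CDP_DATA":
        return closid * 2
    return closid  # CDP_NONE


def cbm_msr(index: int) -> int:
    """Return the MSR address that would be written for `index`."""
    return L3_CBM_BASE + index


def index_is_valid(index: int) -> bool:
    """An index is valid only if it is below the number of CLOSIDs."""
    return 0 <= index < NUM_CLOSIDS
```

With CDP disabled, group p8 should use index 8. The stale CDP entries in staged_config[] instead produce indices 16 and 17, and index 16 lands on 0xca0, one register past the last valid one under these assumptions.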
Fix it by clearing staged_config[] before and after it is used.
[reinette: re-order commit tags]
Fixes: 75408e43509e ("x86/resctrl: Allow different CODE/DATA configurations to be staged") Suggested-by: Xin Hao <xhao@linux.alibaba.com> Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Reinette Chatre <reinette.chatre@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/2fad13f49fbe89687fc40e9a5a61f23a28d1507a.1673988935.git.reinette.chatre%40intel.com
|
#
7fef0997 |
| 07-Mar-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
x86/resctl: fix scheduler confusion with 'current'
The implementation of 'current' on x86 is very intentionally special: it is a very common thing to look up, and it uses 'this_cpu_read_stable()' to get the current thread pointer efficiently from per-cpu storage.
And the keyword in there is 'stable': the current thread pointer never changes as far as a single thread is concerned. Even when a thread is preempted, or moved to another CPU, or even across an explicit call to 'schedule()', that thread will still have the same value for 'current'.
It is, after all, the kernel base pointer to thread-local storage. That's why it's stable to begin with, but it's also why it's important enough that we have that special 'this_cpu_read_stable()' access for it.
So this is all done very intentionally to allow the compiler to treat 'current' as a value that never visibly changes, so that the compiler can do CSE and combine multiple different 'current' accesses into one.
However, there is obviously one very special situation when the currently running thread does actually change: inside the scheduler itself.
So the scheduler code paths are special, and do not have a 'current' thread at all. Instead there are _two_ threads: the previous and the next thread - typically called 'prev' and 'next' (or prev_p/next_p) internally.
So this is all actually quite straightforward and simple, and not all that complicated.
Except for when you then have special code that is run in scheduler context, that code then has to be aware that 'current' isn't really a valid thing. Did you mean 'prev'? Did you mean 'next'?
In fact, even if you then look at the code, and you use 'current' after the new value has been assigned to the percpu variable, we have explicitly told the compiler that 'current' is magical and always stable. So the compiler is quite free to use an older (or newer) value of 'current', and the actual assignment to the percpu storage is not relevant even if it might look that way.
Which is exactly what happened in the resctl code, that blithely used 'current' in '__resctrl_sched_in()' when it really wanted the new process state (as implied by the name: we're scheduling 'into' that new resctl state). And clang would end up just using the old thread pointer value at least in some configurations.
This could have happened with gcc too, and purely depends on random compiler details. Clang just seems to have been more aggressive about moving the read of the per-cpu current_task pointer around.
The fix is trivial: just make the resctl code adhere to the scheduler rules of using the prev/next thread pointer explicitly, instead of using 'current' in a situation where it just wasn't valid.
That same code is then also used outside of the scheduler context (when a thread resctl state is explicitly changed), and then we will just pass in 'current' as that pointer, of course. There is no ambiguity in that case.
The fix may be trivial, but noticing and figuring out what went wrong was not. The credit for that goes to Stephane Eranian.
Reported-by: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/lkml/20230303231133.1486085-1-eranian@google.com/ Link: https://lore.kernel.org/lkml/alpine.LFD.2.01.0908011214330.3304@localhost.localdomain/ Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Stephane Eranian <eranian@google.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
793207ba |
| 24-Jan-2023 |
Borislav Petkov (AMD) <bp@alien8.de> |
x86/resctrl: Fix a silly -Wunused-but-set-variable warning
clang correctly complains
    arch/x86/kernel/cpu/resctrl/rdtgroup.c:1456:6: warning: variable \
       'h' set but not used [-Wunused-but-set-variable]
            u32 h;
                ^
but it can't know whether this use is innocuous or really a problem. There's a reason why those warning switches are behind a W=1 and not enabled by default - yes, one needs to do:
make W=1 CC=clang HOSTCC=clang arch/x86/kernel/cpu/resctrl/
with clang 14 in order to trigger it.
I would normally not take a silly fix like that but this one is simple and doesn't make the code uglier so...
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/202301242015.kbzkVteJ-lkp@intel.com
|
Revision tags: v6.1.6 |
|
#
4fe61bff |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Add interface to write mbm_local_bytes_config
The event configuration for mbm_local_bytes can be changed by the user by writing to the configuration file /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config.
The event configuration settings are domain specific and will affect all the CPUs in the domain.
Following are the types of events supported:
====    ===========================================================
Bits    Description
====    ===========================================================
6       Dirty Victims from the QOS domain to all types of memory
5       Reads to slow memory in the non-local NUMA domain
4       Reads to slow memory in the local NUMA domain
3       Non-temporal writes to non-local NUMA domain
2       Non-temporal writes to local NUMA domain
1       Reads to memory in the non-local NUMA domain
0       Reads to memory in the local NUMA domain
====    ===========================================================
For example, to change the mbm_local_bytes_config to count all the non-temporal writes on domain 0, bits 2 and 3 need to be set, which is 1100b (0xc in hex). Run the command:
$echo 0=0xc > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
To change the mbm_local_bytes_config to count only reads to memory in the local NUMA domain on domain 1, bit 0 needs to be set, which is 1b (0x1 in hex). Run the command:
$echo 1=0x1 > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
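The two values above can be derived mechanically from the bit table. The sketch below uses hypothetical short names for the seven event types, and the helper functions are illustrative only. The bit positions and the "domain=value" syntax come from the commit message.

```python
# Bit positions from the table above; the short names are hypothetical.
EVENT_BITS = {
    "local_reads": 0,
    "remote_reads": 1,
    "local_nt_writes": 2,
    "remote_nt_writes": 3,
    "local_slow_reads": 4,
    "remote_slow_reads": 5,
    "dirty_victims": 6,
}


def event_mask(*events: str) -> int:
    """OR together the bits of the selected event types."""
    mask = 0
    for event in events:
        mask |= 1 << EVENT_BITS[event]
    return mask


def config_line(domain: int, mask: int) -> str:
    """Format one entry in the "<domain>=<hex mask>" syntax."""
    return f"{domain}={mask:#x}"
```

For instance, `config_line(0, event_mask("local_nt_writes", "remote_nt_writes"))` yields the string written in the first command above.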
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-13-babu.moger@amd.com
|
#
92bd5a13 |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Add interface to write mbm_total_bytes_config
The event configuration for mbm_total_bytes can be changed by the user by writing to the file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config.
The event configuration settings are domain specific and affect all the CPUs in the domain.
Following are the types of events supported:
====    ===========================================================
Bits    Description
====    ===========================================================
6       Dirty Victims from the QOS domain to all types of memory
5       Reads to slow memory in the non-local NUMA domain
4       Reads to slow memory in the local NUMA domain
3       Non-temporal writes to non-local NUMA domain
2       Non-temporal writes to local NUMA domain
1       Reads to memory in the non-local NUMA domain
0       Reads to memory in the local NUMA domain
====    ===========================================================
For example:
To change the mbm_total_bytes to count only reads on domain 0, bits 0, 1, 4 and 5 need to be set, which is 110011b (0x33 in hex). Run the command:
$echo 0=0x33 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
To change the mbm_total_bytes to count all the slow memory reads on domain 1, bits 4 and 5 need to be set, which is 110000b (0x30 in hex). Run the command:
$echo 1=0x30 > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-12-babu.moger@amd.com
|
#
73afb2d3 |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Add interface to read mbm_local_bytes_config
The event configuration can be viewed by the user by reading the configuration file /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config. The event configuration settings are domain specific and will affect all the CPUs in the domain.
Following are the types of events supported:
====    ===========================================================
Bits    Description
====    ===========================================================
6       Dirty Victims from the QOS domain to all types of memory
5       Reads to slow memory in the non-local NUMA domain
4       Reads to slow memory in the local NUMA domain
3       Non-temporal writes to non-local NUMA domain
2       Non-temporal writes to local NUMA domain
1       Reads to memory in the non-local NUMA domain
0       Reads to memory in the local NUMA domain
====    ===========================================================
By default, the mbm_local_bytes_config is set to 0x15 to count all the local event types.
For example:
    $cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
    0=0x15;1=0x15;2=0x15;3=0x15
In this case, the event mbm_local_bytes is configured with 0x15 on domains 0 to 3.
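A reader of this file might parse the output along the following lines. This is a minimal sketch with hypothetical helper names, assuming the format is exactly the semicolon-separated "domain=hex" list shown in the example.

```python
from typing import Dict, List


def parse_event_config(text: str) -> Dict[int, int]:
    """Parse "0=0x15;1=0x15;..." into a {domain: mask} mapping."""
    config = {}
    for entry in text.strip().split(";"):
        if not entry:
            continue  # tolerate a trailing separator
        domain, value = entry.split("=")
        config[int(domain)] = int(value, 16)
    return config


def set_bits(mask: int) -> List[int]:
    """List the bit positions (0 to 6) that are set in a mask."""
    return [bit for bit in range(7) if mask & (1 << bit)]
```

Decoding the default value with `set_bits(0x15)` gives bits 0, 2 and 4, which are the three local event types in the table.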
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-11-babu.moger@amd.com
|
#
dc2a3e85 |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Add interface to read mbm_total_bytes_config
The event configuration can be viewed by the user by reading the configuration file /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config. The event configuration settings are domain specific and will affect all the CPUs in the domain.
Following are the types of events supported:
====    ===========================================================
Bits    Description
====    ===========================================================
6       Dirty Victims from the QOS domain to all types of memory
5       Reads to slow memory in the non-local NUMA domain
4       Reads to slow memory in the local NUMA domain
3       Non-temporal writes to non-local NUMA domain
2       Non-temporal writes to local NUMA domain
1       Reads to memory in the non-local NUMA domain
0       Reads to memory in the local NUMA domain
====    ===========================================================
By default, the mbm_total_bytes_config is set to 0x7f to count all the event types.
For example:
    $cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
    0=0x7f;1=0x7f;2=0x7f;3=0x7f
In this case, the event mbm_total_bytes is configured with 0x7f on domains 0 to 3.
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-10-babu.moger@amd.com
|
#
d507f83c |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Support monitor configuration
Add a new field in struct mon_evt to support Bandwidth Monitoring Event Configuration (BMEC) and also update the "mon_features" display.
The resctrl file "mon_features" will display the supported events and files that can be used to configure those events if monitor configuration is supported.
Before the change:
    $ cat /sys/fs/resctrl/info/L3_MON/mon_features
    llc_occupancy
    mbm_total_bytes
    mbm_local_bytes
After the change when BMEC is supported:
    $ cat /sys/fs/resctrl/info/L3_MON/mon_features
    llc_occupancy
    mbm_total_bytes
    mbm_total_bytes_config
    mbm_local_bytes
    mbm_local_bytes_config
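A tool could use this listing to work out which events are configurable. The sketch below assumes, as the example output suggests, that a configurable event is one whose name also appears with a "_config" suffix. The function name is hypothetical.

```python
from typing import Dict


def configurable_events(mon_features: str) -> Dict[str, bool]:
    """Map each event name to whether a matching *_config file is listed."""
    names = mon_features.split()
    listed = set(names)
    return {
        name: f"{name}_config" in listed
        for name in names
        if not name.endswith("_config")
    }
```

On a system without BMEC every value is False, so the same code path handles both outputs shown above.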
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-9-babu.moger@amd.com
|
#
5b6fac3f |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Detect and configure Slow Memory Bandwidth Allocation
The QoS slow memory configuration details are available via CPUID_Fn80000020_EDX_x02. Detect the available details and initialize the rest to defaults.
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-7-babu.moger@amd.com
|
#
fc3b618c |
| 13-Jan-2023 |
Babu Moger <babu.moger@amd.com> |
x86/resctrl: Replace smp_call_function_many() with on_each_cpu_mask()
on_each_cpu_mask() runs the function on each CPU specified by cpumask, which may include the local processor.
Replace smp_call_function_many() with on_each_cpu_mask() to simplify the code.
Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20230113152039.770054-2-babu.moger@amd.com
|
Revision tags: v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15 |
|
#
fe1f0714 |
| 20-Dec-2022 |
Peter Newman <peternewman@google.com> |
x86/resctrl: Fix task CLOSID/RMID update race
When the user moves a running task to a new rdtgroup using the tasks file interface or by deleting its rdtgroup, the resulting change in CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the task's CPU(s).
x86 allows loads to be reordered with prior stores. If the CPU hoists the task_curr() check before the stores in the CLOSID/RMID update, and the task starts running between that check and those stores, the task can run with the old CLOSID/RMID until it is switched again, because __rdtgroup_move_task() failed to determine that the task needs to be interrupted to obtain the new CLOSID/RMID.
Refer to the diagram below:
    CPU 0                                   CPU 1
    -----                                   -----
    __rdtgroup_move_task():
      curr <- t1->cpu->rq->curr
                                            __schedule():
                                              rq->curr <- t1
                                            resctrl_sched_in():
                                              t1->{closid,rmid} -> {1,1}
      t1->{closid,rmid} <- {2,2}
      if (curr == t1) // false
        IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid} stores before the loads in task_curr(). In particular, in the rdt_move_group_tasks() case, simply execute an smp_mb() on every iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but this would require two passes and a means of remembering which task_structs were updated in the first loop. However, benchmarking results below showed too little performance impact in the simple approach to justify implementing the two-pass approach.
Times below were collected using `perf stat` to measure the time to remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
    # mkdir /sys/fs/resctrl/test
    # echo $$ > /sys/fs/resctrl/test/tasks
    # perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
    Baseline:                     1.54 - 1.60 ms
    smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR") Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
|
Revision tags: v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65 |
|
#
d80975e2 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's boot_cpu_data
resctrl_rmid_realloc_threshold can be set by user-space. The maximum value is specified by the architecture.
Currently max_threshold_occ_write() reads the maximum value from boot_cpu_data.x86_cache_size, which is not portable to another architecture.
Add resctrl_rmid_realloc_limit to describe the maximum size in bytes that user-space can set the threshold to.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-21-james.morse@arm.com
|
#
ae2328b5 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Rename and change the units of resctrl_cqm_threshold
resctrl_cqm_threshold is stored in a hardware specific chunk size, but exposed to user-space as bytes.
This means the filesystem parts of resctrl need to know how the hardware counts, to convert the user provided byte value to chunks. The interface between the architecture's resctrl code and the filesystem ought to treat everything as bytes.
Change the unit of resctrl_cqm_threshold to bytes. resctrl_arch_rmid_read() still returns its value in chunks, so this needs converting to bytes. As all the users have been touched, rename the variable to resctrl_rmid_realloc_threshold, which describes what the value is for.
Neither r->num_rmid nor hw_res->mon_scale are guaranteed to be a power of 2, so the existing code introduces a rounding error from resctrl's theoretical fraction of the cache usage. This behaviour is kept as it ensures the user visible value matches the value read from hardware when the rmid will be reallocated.
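The rounding described above can be illustrated with a small calculation. This is a sketch, not the kernel code: rounding down to a whole number of chunks is an assumption, the function names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def round_to_chunks(value_bytes: int, mon_scale: int) -> int:
    """Round a byte value down to a whole multiple of the chunk size.

    Assumption: the hardware reports occupancy in units of mon_scale
    bytes, so only multiples of mon_scale can ever be read back.
    """
    return value_bytes // mon_scale * mon_scale


def realloc_threshold(limit_bytes: int, num_rmid: int, mon_scale: int) -> int:
    """Per-RMID share of the cache, expressed in bytes the hardware can report."""
    return round_to_chunks(limit_bytes // num_rmid, mon_scale)
```

With a hypothetical limit of 1000000 bytes, 3 RMIDs and a 64-byte chunk, `realloc_threshold(1_000_000, 3, 64)` gives 333312 rather than the theoretical 333333: the user-visible value is a multiple of the chunk size, so it matches what a hardware read can return.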
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-20-james.morse@arm.com
|
#
b58d4eb1 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Remove architecture copy of mbps_val
The resctrl arch code provides a second configuration array mbps_val[] for the MBA software controller.
Since resctrl switched over to allocating and freeing its own array when needed, nothing uses the arch code version.
Remove it.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-11-james.morse@arm.com
|
#
6ce1560d |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Switch over to the resctrl mbps_val list
Updates to resctrl's software controller follow the same path as other configuration updates, but they don't modify the hardware state. rdtgroup_schemata_write() uses parse_line() and the resource's parse_ctrlval() function to stage the configuration. resctrl_arch_update_domains() then updates the mbps_val[] array instead, and resctrl_arch_update_domains() skips the rdt_ctrl_update() call that would update hardware.
This complicates the interface between resctrl's filesystem parts and architecture specific code. It should be possible for mba_sc to be completely implemented by the filesystem parts of resctrl. This would allow it to work on a second architecture with no additional code. resctrl_arch_update_domains() using the mbps_val[] array prevents this.
Change parse_bw() to write the configuration value directly to the mbps_val[] array in the domain structure. Change rdtgroup_schemata_write() to skip the call to resctrl_arch_update_domains(), meaning all the mba_sc specific code in resctrl_arch_update_domains() can be removed. On the read-side, show_doms() and update_mba_bw() are changed to read the mbps_val[] array from the domain structure. With this, resctrl_arch_get_config() no longer needs to consider mba_sc resources.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-10-james.morse@arm.com
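As an illustration of the change described in this commit message, the following is a minimal sketch and not the kernel's code. The struct layout, the `_sketch` names, the `is_mba_sc` flag, and the fixed array size are all assumptions made for the example. It shows a parse step that writes a software controller value straight into mbps_val[], so the architecture update step only ever sees staged hardware values.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_CLOSID 4	/* hypothetical number of control groups */

/* Hypothetical, simplified stand-in for a resctrl domain. */
struct domain_sketch {
	uint32_t ctrl_val[SKETCH_NUM_CLOSID];	/* values programmed into hardware */
	uint32_t staged_val[SKETCH_NUM_CLOSID];	/* values waiting for the arch update */
	bool have_staged[SKETCH_NUM_CLOSID];	/* true when staged_val[] is pending */
	uint32_t mbps_val[SKETCH_NUM_CLOSID];	/* software controller targets */
};

/*
 * Parse step: with the software controller active, store the value
 * directly in mbps_val[] and stage nothing for the hardware.
 */
static bool parse_bw_sketch(struct domain_sketch *d, unsigned int closid,
			    uint32_t bw, bool is_mba_sc)
{
	if (closid >= SKETCH_NUM_CLOSID)
		return false;

	if (is_mba_sc) {
		d->mbps_val[closid] = bw;
		return true;
	}

	d->staged_val[closid] = bw;
	d->have_staged[closid] = true;
	return true;
}

/*
 * Architecture update step: applies staged hardware values only and
 * returns how many entries it wrote. It never touches mbps_val[].
 */
static unsigned int arch_update_domain_sketch(struct domain_sketch *d)
{
	unsigned int written = 0;

	for (unsigned int i = 0; i < SKETCH_NUM_CLOSID; i++) {
		if (!d->have_staged[i])
			continue;
		d->ctrl_val[i] = d->staged_val[i];
		d->have_staged[i] = false;
		written++;
	}
	return written;
}
```

In this sketch a software controller write leaves nothing for the architecture step to do, which mirrors the message's point that rdtgroup_schemata_write() can skip resctrl_arch_update_domains() for mba_sc.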
|
#
781096d9 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Create mba_sc configuration in the rdt_domain
To support resctrl's MBA software controller, the architecture must provide a second configuration array to hold the mbps_val[] from user-space.
This complicates the interface between the architecture specific code and the filesystem portions of resctrl that will move to /fs/, to allow multiple architectures to support resctrl.
Make the filesystem parts of resctrl create an array for the mba_sc values. The software controller can be changed to use this, allowing the architecture code to only consider the values configured in hardware.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-9-james.morse@arm.com
|
#
b045c215 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Abstract and use supports_mba_mbps()
To determine whether the mba_MBps option to resctrl should be supported, resctrl tests the boot CPU's x86_vendor.
This isn't portable, and needs abstracting behind a helper so this check can be part of the filesystem code that moves to /fs/.
Re-use the tests set_mba_sc() does to determine if the mba_sc is supported on this system. An 'alloc_capable' test is added so that support for the controls isn't implied by the 'delay_linear' property, which is always true for MPAM. Because mbm_update() only updates mba_sc if the mbm_local counters are enabled, supports_mba_mbps() checks is_mbm_local_enabled() (instead of using is_mbm_enabled(), which checks both).
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-8-james.morse@arm.com
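The combined check described in this commit message can be sketched as below. This is an illustrative sketch under stated assumptions, not the kernel's implementation: the struct and the `_sketch` suffixed names are hypothetical, and the three booleans stand in for the 'alloc_capable' property, the 'delay_linear' property, and the result of is_mbm_local_enabled().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the properties the message names. */
struct mba_props_sketch {
	bool alloc_capable;	/* the bandwidth controls can be allocated */
	bool delay_linear;	/* linear delay scale; always true for MPAM per the message */
	bool mbm_local_enabled;	/* the mbm_local counters are enabled */
};

/*
 * The software controller is only offered when all three conditions
 * hold, so 'delay_linear' on its own never implies support.
 */
static bool supports_mba_mbps_sketch(const struct mba_props_sketch *p)
{
	return p->mbm_local_enabled && p->alloc_capable && p->delay_linear;
}
```

With a helper of this shape, the mount-time option check no longer needs to look at the CPU vendor; it only asks whether the properties the software controller relies on are present.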
|
#
1644dfe7 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Remove set_mba_sc()s control array re-initialisation
set_mba_sc() enables the 'software controller' to regulate the bandwidth based on the byte counters. This can be managed entirely in the parts of resctrl that move to /fs/, without any extra support from the architecture specific code. set_mba_sc() is called by rdt_enable_ctx() during mount and unmount. It currently resets the arch code's ctrl_val[] and mbps_val[] arrays.
The ctrl_val[] was already reset when the domain was created, and by reset_all_ctrls() when the filesystem was last unmounted. Doing the work in set_mba_sc() is not necessary as the values are already at their defaults due to the creation of the domain, or were previously reset during umount(), or are about to be reset during umount().
Add a reset of the mbps_val[] in reset_all_ctrls(), allowing the code in set_mba_sc() that reaches into the architecture specific structures to be removed.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-7-james.morse@arm.com
|
#
798fd4b9 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Add domain offline callback for resctrl work
Because domains are exposed to user-space via resctrl, the filesystem must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support resctrl, but the work is tied up with the architecture code to free the memory.
Move the monitor subdir removal and the cancelling of the mbm/limbo works into a new resctrl_offline_domain() call. These bits are not specific to the architecture. Grouping them in one function allows that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-6-james.morse@arm.com
|
#
3a7232cd |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Add domain online callback for resctrl work
Because domains are exposed to user-space via resctrl, the filesystem must update its state when CPU hotplug callbacks are triggered.
Some of this work is common to any architecture that would support resctrl, but the work is tied up with the architecture code to allocate the memory.
Move domain_setup_mon_state(), the monitor subdir creation call and the mbm/limbo workers into a new resctrl_online_domain() call. These bits are not specific to the architecture. Grouping them in one function allows that code to be moved to /fs/ and re-used by another architecture.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-4-james.morse@arm.com
|
#
bab6ee73 |
| 02-Sep-2022 |
James Morse <james.morse@arm.com> |
x86/resctrl: Merge mon_capable and mon_enabled
mon_enabled and mon_capable are always set as a pair by rdt_get_mon_l3_config().
There is no point having two values.
Merge them together.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-3-james.morse@arm.com
|