1.. SPDX-License-Identifier: GPL-2.0 2.. include:: <isonum.txt> 3 4=========================================== 5User Interface for Resource Control feature 6=========================================== 7 8:Copyright: |copy| 2016 Intel Corporation 9:Authors: - Fenghua Yu <fenghua.yu@intel.com> 10 - Tony Luck <tony.luck@intel.com> 11 - Vikas Shivappa <vikas.shivappa@intel.com> 12 13 14Intel refers to this feature as Intel Resource Director Technology(Intel(R) RDT). 15AMD refers to this feature as AMD Platform Quality of Service(AMD QoS). 16 17This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo 18flag bits: 19 20=============================================== ================================ 21RDT (Resource Director Technology) Allocation "rdt_a" 22CAT (Cache Allocation Technology) "cat_l3", "cat_l2" 23CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2" 24CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc" 25MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local" 26MBA (Memory Bandwidth Allocation) "mba" 27SMBA (Slow Memory Bandwidth Allocation) "" 28BMEC (Bandwidth Monitoring Event Configuration) "" 29=============================================== ================================ 30 31Historically, new features were made visible by default in /proc/cpuinfo. This 32resulted in the feature flags becoming hard to parse by humans. Adding a new 33flag to /proc/cpuinfo should be avoided if user space can obtain information 34about the feature from resctrl's info directory. 35 36To use the feature mount the file system:: 37 38 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl 39 40mount options are: 41 42"cdp": 43 Enable code/data prioritization in L3 cache allocations. 44"cdpl2": 45 Enable code/data prioritization in L2 cache allocations. 46"mba_MBps": 47 Enable the MBA Software Controller(mba_sc) to specify MBA 48 bandwidth in MBps 49 50L2 and L3 CDP are controlled separately. 51 52RDT features are orthogonal. A particular system may support only 53monitoring, only control, or both monitoring and control. Cache 54pseudo-locking is a unique way of using cache control to "pin" or 55"lock" data in the cache. Details can be found in 56"Cache Pseudo-Locking". 57 58 59The mount succeeds if either of allocation or monitoring is present, but 60only those files and directories supported by the system will be created. 61For more details on the behavior of the interface during monitoring 62and allocation, see the "Resource alloc and monitor groups" section. 63 64Info directory 65============== 66 67The 'info' directory contains information about the enabled 68resources. Each resource has its own subdirectory. The subdirectory 69names reflect the resource names. 70 71Each subdirectory contains the following files with respect to 72allocation: 73 74Cache resource(L3/L2) subdirectory contains the following files 75related to allocation: 76 77"num_closids": 78 The number of CLOSIDs which are valid for this 79 resource. The kernel uses the smallest number of 80 CLOSIDs of all enabled resources as limit. 81"cbm_mask": 82 The bitmask which is valid for this resource. 83 This mask is equivalent to 100%. 84"min_cbm_bits": 85 The minimum number of consecutive bits which 86 must be set when writing a mask. 87 88"shareable_bits": 89 Bitmask of shareable resource with other executing 90 entities (e.g. I/O). User can use this when 91 setting up exclusive cache partitions. 
Note that 92 some platforms support devices that have their 93 own settings for cache use which can over-ride 94 these bits. 95"bit_usage": 96 Annotated capacity bitmasks showing how all 97 instances of the resource are used. The legend is: 98 99 "0": 100 Corresponding region is unused. When the system's 101 resources have been allocated and a "0" is found 102 in "bit_usage" it is a sign that resources are 103 wasted. 104 105 "H": 106 Corresponding region is used by hardware only 107 but available for software use. If a resource 108 has bits set in "shareable_bits" but not all 109 of these bits appear in the resource groups' 110 schematas then the bits appearing in 111 "shareable_bits" but no resource group will 112 be marked as "H". 113 "X": 114 Corresponding region is available for sharing and 115 used by hardware and software. These are the 116 bits that appear in "shareable_bits" as 117 well as a resource group's allocation. 118 "S": 119 Corresponding region is used by software 120 and available for sharing. 121 "E": 122 Corresponding region is used exclusively by 123 one resource group. No sharing allowed. 124 "P": 125 Corresponding region is pseudo-locked. No 126 sharing allowed. 127 128Memory bandwidth(MB) subdirectory contains the following files 129with respect to allocation: 130 131"min_bandwidth": 132 The minimum memory bandwidth percentage which 133 user can request. 134 135"bandwidth_gran": 136 The granularity in which the memory bandwidth 137 percentage is allocated. The allocated 138 b/w percentage is rounded off to the next 139 control step available on the hardware. The 140 available bandwidth control steps are: 141 min_bandwidth + N * bandwidth_gran. 142 143"delay_linear": 144 Indicates if the delay scale is linear or 145 non-linear. This field is purely informational 146 only. 147 148"thread_throttle_mode": 149 Indicator on Intel systems of how tasks running on threads 150 of a physical core are throttled in cases where they 151 request different memory bandwidth percentages: 152 153 "max": 154 the smallest percentage is applied 155 to all threads 156 "per-thread": 157 bandwidth percentages are directly applied to 158 the threads running on the core 159 160If RDT monitoring is available there will be an "L3_MON" directory 161with the following files: 162 163"num_rmids": 164 The number of RMIDs available. This is the 165 upper bound for how many "CTRL_MON" + "MON" 166 groups can be created. 167 168"mon_features": 169 Lists the monitoring events if 170 monitoring is enabled for the resource. 171 Example:: 172 173 # cat /sys/fs/resctrl/info/L3_MON/mon_features 174 llc_occupancy 175 mbm_total_bytes 176 mbm_local_bytes 177 178 If the system supports Bandwidth Monitoring Event 179 Configuration (BMEC), then the bandwidth events will 180 be configurable. The output will be:: 181 182 # cat /sys/fs/resctrl/info/L3_MON/mon_features 183 llc_occupancy 184 mbm_total_bytes 185 mbm_total_bytes_config 186 mbm_local_bytes 187 mbm_local_bytes_config 188 189"mbm_total_bytes_config", "mbm_local_bytes_config": 190 Read/write files containing the configuration for the mbm_total_bytes 191 and mbm_local_bytes events, respectively, when the Bandwidth 192 Monitoring Event Configuration (BMEC) feature is supported. 193 The event configuration settings are domain specific and affect 194 all the CPUs in the domain. When either event configuration is 195 changed, the bandwidth counters for all RMIDs of both events 196 (mbm_total_bytes as well as mbm_local_bytes) are cleared for that 197 domain. 
The next read for every RMID will report "Unavailable" 198 and subsequent reads will report the valid value. 199 200 Following are the types of events supported: 201 202 ==== ======================================================== 203 Bits Description 204 ==== ======================================================== 205 6 Dirty Victims from the QOS domain to all types of memory 206 5 Reads to slow memory in the non-local NUMA domain 207 4 Reads to slow memory in the local NUMA domain 208 3 Non-temporal writes to non-local NUMA domain 209 2 Non-temporal writes to local NUMA domain 210 1 Reads to memory in the non-local NUMA domain 211 0 Reads to memory in the local NUMA domain 212 ==== ======================================================== 213 214 By default, the mbm_total_bytes configuration is set to 0x7f to count 215 all the event types and the mbm_local_bytes configuration is set to 216 0x15 to count all the local memory events. 217 218 Examples: 219 220 * To view the current configuration:: 221 :: 222 223 # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 224 0=0x7f;1=0x7f;2=0x7f;3=0x7f 225 226 # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 227 0=0x15;1=0x15;3=0x15;4=0x15 228 229 * To change the mbm_total_bytes to count only reads on domain 0, 230 the bits 0, 1, 4 and 5 needs to be set, which is 110011b in binary 231 (in hexadecimal 0x33): 232 :: 233 234 # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 235 236 # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config 237 0=0x33;1=0x7f;2=0x7f;3=0x7f 238 239 * To change the mbm_local_bytes to count all the slow memory reads on 240 domain 0 and 1, the bits 4 and 5 needs to be set, which is 110000b 241 in binary (in hexadecimal 0x30): 242 :: 243 244 # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 245 246 # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config 247 0=0x30;1=0x30;3=0x15;4=0x15 248 249"max_threshold_occupancy": 250 Read/write file provides the largest value (in 251 bytes) at which a previously used LLC_occupancy 252 counter can be considered for re-use. 253 254Finally, in the top level of the "info" directory there is a file 255named "last_cmd_status". This is reset with every "command" issued 256via the file system (making new directories or writing to any of the 257control files). If the command was successful, it will read as "ok". 258If the command failed, it will provide more information that can be 259conveyed in the error returns from file operations. E.g. 260:: 261 262 # echo L3:0=f7 > schemata 263 bash: echo: write error: Invalid argument 264 # cat info/last_cmd_status 265 mask f7 has non-consecutive 1-bits 266 267Resource alloc and monitor groups 268================================= 269 270Resource groups are represented as directories in the resctrl file 271system. The default group is the root directory which, immediately 272after mounting, owns all the tasks and cpus in the system and can make 273full use of all resources. 274 275On a system with RDT control features additional directories can be 276created in the root directory that specify different amounts of each 277resource (see "schemata" below). The root and these additional top level 278directories are referred to as "CTRL_MON" groups below. 
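For instance, creating a new CTRL_MON group is just a mkdir in the root of
the mounted file system. A minimal sketch (the group name "grp1" is
arbitrary and the exact set of files created depends on which allocation
and monitoring features the system supports)::

  # mkdir /sys/fs/resctrl/grp1
  # ls /sys/fs/resctrl/grp1
  cpus  cpus_list  mode  mon_data  mon_groups  schemata  size  tasks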
279 280On a system with RDT monitoring the root directory and other top level 281directories contain a directory named "mon_groups" in which additional 282directories can be created to monitor subsets of tasks in the CTRL_MON 283group that is their ancestor. These are called "MON" groups in the rest 284of this document. 285 286Removing a directory will move all tasks and cpus owned by the group it 287represents to the parent. Removing one of the created CTRL_MON groups 288will automatically remove all MON groups below it. 289 290Moving MON group directories to a new parent CTRL_MON group is supported 291for the purpose of changing the resource allocations of a MON group 292without impacting its monitoring data or assigned tasks. This operation 293is not allowed for MON groups which monitor CPUs. No other move 294operation is currently allowed other than simply renaming a CTRL_MON or 295MON group. 296 297All groups contain the following files: 298 299"tasks": 300 Reading this file shows the list of all tasks that belong to 301 this group. Writing a task id to the file will add a task to the 302 group. If the group is a CTRL_MON group the task is removed from 303 whichever previous CTRL_MON group owned the task and also from 304 any MON group that owned the task. If the group is a MON group, 305 then the task must already belong to the CTRL_MON parent of this 306 group. The task is removed from any previous MON group. 307 308 309"cpus": 310 Reading this file shows a bitmask of the logical CPUs owned by 311 this group. Writing a mask to this file will add and remove 312 CPUs to/from this group. As with the tasks file a hierarchy is 313 maintained where MON groups may only include CPUs owned by the 314 parent CTRL_MON group. 315 When the resource group is in pseudo-locked mode this file will 316 only be readable, reflecting the CPUs associated with the 317 pseudo-locked region. 318 319 320"cpus_list": 321 Just like "cpus", only using ranges of CPUs instead of bitmasks. 322 323 324When control is enabled all CTRL_MON groups will also contain: 325 326"schemata": 327 A list of all the resources available to this group. 328 Each resource has its own line and format - see below for details. 329 330"size": 331 Mirrors the display of the "schemata" file to display the size in 332 bytes of each allocation instead of the bits representing the 333 allocation. 334 335"mode": 336 The "mode" of the resource group dictates the sharing of its 337 allocations. A "shareable" resource group allows sharing of its 338 allocations while an "exclusive" resource group does not. A 339 cache pseudo-locked region is created by first writing 340 "pseudo-locksetup" to the "mode" file before writing the cache 341 pseudo-locked region's schemata to the resource group's "schemata" 342 file. On successful pseudo-locked region creation the mode will 343 automatically change to "pseudo-locked". 344 345When monitoring is enabled all MON groups will also contain: 346 347"mon_data": 348 This contains a set of files organized by L3 domain and by 349 RDT event. E.g. on a system with two L3 domains there will 350 be subdirectories "mon_L3_00" and "mon_L3_01". Each of these 351 directories have one file per event (e.g. "llc_occupancy", 352 "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these 353 files provide a read out of the current value of the event for 354 all tasks in the group. In CTRL_MON groups these files provide 355 the sum for all tasks in the CTRL_MON group and all tasks in 356 MON groups. 
Please see example section for more details on usage. 357 358Resource allocation rules 359------------------------- 360 361When a task is running the following rules define which resources are 362available to it: 363 3641) If the task is a member of a non-default group, then the schemata 365 for that group is used. 366 3672) Else if the task belongs to the default group, but is running on a 368 CPU that is assigned to some specific group, then the schemata for the 369 CPU's group is used. 370 3713) Otherwise the schemata for the default group is used. 372 373Resource monitoring rules 374------------------------- 3751) If a task is a member of a MON group, or non-default CTRL_MON group 376 then RDT events for the task will be reported in that group. 377 3782) If a task is a member of the default CTRL_MON group, but is running 379 on a CPU that is assigned to some specific group, then the RDT events 380 for the task will be reported in that group. 381 3823) Otherwise RDT events for the task will be reported in the root level 383 "mon_data" group. 384 385 386Notes on cache occupancy monitoring and control 387=============================================== 388When moving a task from one group to another you should remember that 389this only affects *new* cache allocations by the task. E.g. you may have 390a task in a monitor group showing 3 MB of cache occupancy. If you move 391to a new group and immediately check the occupancy of the old and new 392groups you will likely see that the old group is still showing 3 MB and 393the new group zero. When the task accesses locations still in cache from 394before the move, the h/w does not update any counters. On a busy system 395you will likely see the occupancy in the old group go down as cache lines 396are evicted and re-used while the occupancy in the new group rises as 397the task accesses memory and loads into the cache are counted based on 398membership in the new group. 399 400The same applies to cache allocation control. Moving a task to a group 401with a smaller cache partition will not evict any cache lines. The 402process may continue to use them from the old partition. 403 404Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID) 405to identify a control group and a monitoring group respectively. Each of 406the resource groups are mapped to these IDs based on the kind of group. The 407number of CLOSid and RMID are limited by the hardware and hence the creation of 408a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID 409and creation of "MON" group may fail if we run out of RMIDs. 410 411max_threshold_occupancy - generic concepts 412------------------------------------------ 413 414Note that an RMID once freed may not be immediately available for use as 415the RMID is still tagged the cache lines of the previous user of RMID. 416Hence such RMIDs are placed on limbo list and checked back if the cache 417occupancy has gone down. If there is a time when system has a lot of 418limbo RMIDs but which are not ready to be used, user may see an -EBUSY 419during mkdir. 420 421max_threshold_occupancy is a user configurable value to determine the 422occupancy at which an RMID can be freed. 423 424Schemata files - general concepts 425--------------------------------- 426Each line in the file describes one resource. The line starts with 427the name of the resource, followed by specific values to be applied 428in each of the instances of that resource on the system. 
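As a hypothetical illustration, on a system with a memory bandwidth
resource and an L3 cache resource that each have two instances, a resource
group's schemata file might read as follows (group "p0", the values and
the instance IDs are illustrative; the per-resource formats are described
in the following sections)::

  # cat /sys/fs/resctrl/p0/schemata
  MB:0=100;1=100
  L3:0=fffff;1=fffff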
429 430Cache IDs 431--------- 432On current generation systems there is one L3 cache per socket and L2 433caches are generally just shared by the hyperthreads on a core, but this 434isn't an architectural requirement. We could have multiple separate L3 435caches on a socket, multiple cores could share an L2 cache. So instead 436of using "socket" or "core" to define the set of logical cpus sharing 437a resource we use a "Cache ID". At a given cache level this will be a 438unique number across the whole system (but it isn't guaranteed to be a 439contiguous sequence, there may be gaps). To find the ID for each logical 440CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id 441 442Cache Bit Masks (CBM) 443--------------------- 444For cache resources we describe the portion of the cache that is available 445for allocation using a bitmask. The maximum value of the mask is defined 446by each cpu model (and may be different for different cache levels). It 447is found using CPUID, but is also provided in the "info" directory of 448the resctrl file system in "info/{resource}/cbm_mask". Intel hardware 449requires that these masks have all the '1' bits in a contiguous block. So 4500x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 451and 0xA are not. On a system with a 20-bit mask each bit represents 5% 452of the capacity of the cache. You could partition the cache into four 453equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. 454 455Memory bandwidth Allocation and monitoring 456========================================== 457 458For Memory bandwidth resource, by default the user controls the resource 459by indicating the percentage of total memory bandwidth. 460 461The minimum bandwidth percentage value for each cpu model is predefined 462and can be looked up through "info/MB/min_bandwidth". The bandwidth 463granularity that is allocated is also dependent on the cpu model and can 464be looked up at "info/MB/bandwidth_gran". The available bandwidth 465control steps are: min_bw + N * bw_gran. Intermediate values are rounded 466to the next control step available on the hardware. 467 468The bandwidth throttling is a core specific mechanism on some of Intel 469SKUs. Using a high bandwidth and a low bandwidth setting on two threads 470sharing a core may result in both threads being throttled to use the 471low bandwidth (see "thread_throttle_mode"). 472 473The fact that Memory bandwidth allocation(MBA) may be a core 474specific mechanism where as memory bandwidth monitoring(MBM) is done at 475the package level may lead to confusion when users try to apply control 476via the MBA and then monitor the bandwidth to see if the controls are 477effective. Below are such scenarios: 478 4791. User may *not* see increase in actual bandwidth when percentage 480 values are increased: 481 482This can occur when aggregate L2 external bandwidth is more than L3 483external bandwidth. Consider an SKL SKU with 24 cores on a package and 484where L2 external is 10GBps (hence aggregate L2 external bandwidth is 485240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 486threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 487bandwidth of 100GBps although the percentage value specified is only 50% 488<< 100%. Hence increasing the bandwidth percentage will not yield any 489more bandwidth. This is because although the L2 external bandwidth still 490has capacity, the L3 external bandwidth is fully used. 
Also note that 491this would be dependent on number of cores the benchmark is run on. 492 4932. Same bandwidth percentage may mean different actual bandwidth 494 depending on # of threads: 495 496For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 497thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although 498they have same percentage bandwidth of 10%. This is simply because as 499threads start using more cores in an rdtgroup, the actual bandwidth may 500increase or vary although user specified bandwidth percentage is same. 501 502In order to mitigate this and make the interface more user friendly, 503resctrl added support for specifying the bandwidth in MBps as well. The 504kernel underneath would use a software feedback mechanism or a "Software 505Controller(mba_sc)" which reads the actual bandwidth using MBM counters 506and adjust the memory bandwidth percentages to ensure:: 507 508 "actual bandwidth < user specified bandwidth". 509 510By default, the schemata would take the bandwidth percentage values 511where as user can switch to the "MBA software controller" mode using 512a mount option 'mba_MBps'. The schemata format is specified in the below 513sections. 514 515L3 schemata file details (code and data prioritization disabled) 516---------------------------------------------------------------- 517With CDP disabled the L3 schemata format is:: 518 519 L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... 520 521L3 schemata file details (CDP enabled via mount option to resctrl) 522------------------------------------------------------------------ 523When CDP is enabled L3 control is split into two separate resources 524so you can specify independent masks for code and data like this:: 525 526 L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... 527 L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... 528 529L2 schemata file details 530------------------------ 531CDP is supported at L2 using the 'cdpl2' mount option. The schemata 532format is either:: 533 534 L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... 535 536or 537 538 L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... 539 L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... 540 541 542Memory bandwidth Allocation (default mode) 543------------------------------------------ 544 545Memory b/w domain is L3 cache. 546:: 547 548 MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... 549 550Memory bandwidth Allocation specified in MBps 551--------------------------------------------- 552 553Memory bandwidth domain is L3 cache. 554:: 555 556 MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;... 557 558Slow Memory Bandwidth Allocation (SMBA) 559--------------------------------------- 560AMD hardware supports Slow Memory Bandwidth Allocation (SMBA). 561CXL.memory is the only supported "slow" memory device. With the 562support of SMBA, the hardware enables bandwidth allocation on 563the slow memory devices. If there are multiple such devices in 564the system, the throttling logic groups all the slow sources 565together and applies the limit on them as a whole. 566 567The presence of SMBA (with CXL.memory) is independent of slow memory 568devices presence. If there are no such devices on the system, then 569configuring SMBA will have no impact on the performance of the system. 570 571The bandwidth domain for slow memory is L3 cache. Its schemata file 572is formatted as: 573:: 574 575 SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... 
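When writing the schemata file only the domains being changed need to be
named (see "Reading/writing the schemata file" below). As a short,
hypothetical sketch, restricting slow memory bandwidth of a resource group
"p0" on domain 0 only could look like::

  # echo "SMBA:0=64" > /sys/fs/resctrl/p0/schemata

How SMBA values are interpreted on AMD systems is described in
"Reading/writing the schemata file (on AMD systems) with SMBA feature"
below.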
576 577Reading/writing the schemata file 578--------------------------------- 579Reading the schemata file will show the state of all resources 580on all domains. When writing you only need to specify those values 581which you wish to change. E.g. 582:: 583 584 # cat schemata 585 L3DATA:0=fffff;1=fffff;2=fffff;3=fffff 586 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff 587 # echo "L3DATA:2=3c0;" > schemata 588 # cat schemata 589 L3DATA:0=fffff;1=fffff;2=3c0;3=fffff 590 L3CODE:0=fffff;1=fffff;2=fffff;3=fffff 591 592Reading/writing the schemata file (on AMD systems) 593-------------------------------------------------- 594Reading the schemata file will show the current bandwidth limit on all 595domains. The allocated resources are in multiples of one eighth GB/s. 596When writing to the file, you need to specify what cache id you wish to 597configure the bandwidth limit. 598 599For example, to allocate 2GB/s limit on the first cache id: 600 601:: 602 603 # cat schemata 604 MB:0=2048;1=2048;2=2048;3=2048 605 L3:0=ffff;1=ffff;2=ffff;3=ffff 606 607 # echo "MB:1=16" > schemata 608 # cat schemata 609 MB:0=2048;1= 16;2=2048;3=2048 610 L3:0=ffff;1=ffff;2=ffff;3=ffff 611 612Reading/writing the schemata file (on AMD systems) with SMBA feature 613-------------------------------------------------------------------- 614Reading and writing the schemata file is the same as without SMBA in 615above section. 616 617For example, to allocate 8GB/s limit on the first cache id: 618 619:: 620 621 # cat schemata 622 SMBA:0=2048;1=2048;2=2048;3=2048 623 MB:0=2048;1=2048;2=2048;3=2048 624 L3:0=ffff;1=ffff;2=ffff;3=ffff 625 626 # echo "SMBA:1=64" > schemata 627 # cat schemata 628 SMBA:0=2048;1= 64;2=2048;3=2048 629 MB:0=2048;1=2048;2=2048;3=2048 630 L3:0=ffff;1=ffff;2=ffff;3=ffff 631 632Cache Pseudo-Locking 633==================== 634CAT enables a user to specify the amount of cache space that an 635application can fill. Cache pseudo-locking builds on the fact that a 636CPU can still read and write data pre-allocated outside its current 637allocated area on a cache hit. With cache pseudo-locking, data can be 638preloaded into a reserved portion of cache that no application can 639fill, and from that point on will only serve cache hits. The cache 640pseudo-locked memory is made accessible to user space where an 641application can map it into its virtual address space and thus have 642a region of memory with reduced average read latency. 643 644The creation of a cache pseudo-locked region is triggered by a request 645from the user to do so that is accompanied by a schemata of the region 646to be pseudo-locked. The cache pseudo-locked region is created as follows: 647 648- Create a CAT allocation CLOSNEW with a CBM matching the schemata 649 from the user of the cache region that will contain the pseudo-locked 650 memory. This region must not overlap with any current CAT allocation/CLOS 651 on the system and no future overlap with this cache region is allowed 652 while the pseudo-locked region exists. 653- Create a contiguous region of memory of the same size as the cache 654 region. 655- Flush the cache, disable hardware prefetchers, disable preemption. 656- Make CLOSNEW the active CLOS and touch the allocated memory to load 657 it into the cache. 658- Set the previous CLOS as active. 659- At this point the closid CLOSNEW can be released - the cache 660 pseudo-locked region is protected as long as its CBM does not appear in 661 any CAT allocation. 
Even though the cache pseudo-locked region will from 662 this point on not appear in any CBM of any CLOS an application running with 663 any CLOS will be able to access the memory in the pseudo-locked region since 664 the region continues to serve cache hits. 665- The contiguous region of memory loaded into the cache is exposed to 666 user-space as a character device. 667 668Cache pseudo-locking increases the probability that data will remain 669in the cache via carefully configuring the CAT feature and controlling 670application behavior. There is no guarantee that data is placed in 671cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict 672“locked” data from cache. Power management C-states may shrink or 673power off cache. Deeper C-states will automatically be restricted on 674pseudo-locked region creation. 675 676It is required that an application using a pseudo-locked region runs 677with affinity to the cores (or a subset of the cores) associated 678with the cache on which the pseudo-locked region resides. A sanity check 679within the code will not allow an application to map pseudo-locked memory 680unless it runs with affinity to cores associated with the cache on which the 681pseudo-locked region resides. The sanity check is only done during the 682initial mmap() handling, there is no enforcement afterwards and the 683application self needs to ensure it remains affine to the correct cores. 684 685Pseudo-locking is accomplished in two stages: 686 6871) During the first stage the system administrator allocates a portion 688 of cache that should be dedicated to pseudo-locking. At this time an 689 equivalent portion of memory is allocated, loaded into allocated 690 cache portion, and exposed as a character device. 6912) During the second stage a user-space application maps (mmap()) the 692 pseudo-locked memory into its address space. 693 694Cache Pseudo-Locking Interface 695------------------------------ 696A pseudo-locked region is created using the resctrl interface as follows: 697 6981) Create a new resource group by creating a new directory in /sys/fs/resctrl. 6992) Change the new resource group's mode to "pseudo-locksetup" by writing 700 "pseudo-locksetup" to the "mode" file. 7013) Write the schemata of the pseudo-locked region to the "schemata" file. All 702 bits within the schemata should be "unused" according to the "bit_usage" 703 file. 704 705On successful pseudo-locked region creation the "mode" file will contain 706"pseudo-locked" and a new character device with the same name as the resource 707group will exist in /dev/pseudo_lock. This character device can be mmap()'ed 708by user space in order to obtain access to the pseudo-locked memory region. 709 710An example of cache pseudo-locked region creation and usage can be found below. 711 712Cache Pseudo-Locking Debugging Interface 713---------------------------------------- 714The pseudo-locking debugging interface is enabled by default (if 715CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl. 716 717There is no explicit way for the kernel to test if a provided memory 718location is present in the cache. The pseudo-locking debugging interface uses 719the tracing infrastructure to provide two ways to measure cache residency of 720the pseudo-locked region: 721 7221) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data 723 from these measurements are best visualized using a hist trigger (see 724 example below). 
In this test the pseudo-locked region is traversed at 725 a stride of 32 bytes while hardware prefetchers and preemption 726 are disabled. This also provides a substitute visualization of cache 727 hits and misses. 7282) Cache hit and miss measurements using model specific precision counters if 729 available. Depending on the levels of cache on the system the pseudo_lock_l2 730 and pseudo_lock_l3 tracepoints are available. 731 732When a pseudo-locked region is created a new debugfs directory is created for 733it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single 734write-only file, pseudo_lock_measure, is present in this directory. The 735measurement of the pseudo-locked region depends on the number written to this 736debugfs file: 737 7381: 739 writing "1" to the pseudo_lock_measure file will trigger the latency 740 measurement captured in the pseudo_lock_mem_latency tracepoint. See 741 example below. 7422: 743 writing "2" to the pseudo_lock_measure file will trigger the L2 cache 744 residency (cache hits and misses) measurement captured in the 745 pseudo_lock_l2 tracepoint. See example below. 7463: 747 writing "3" to the pseudo_lock_measure file will trigger the L3 cache 748 residency (cache hits and misses) measurement captured in the 749 pseudo_lock_l3 tracepoint. 750 751All measurements are recorded with the tracing infrastructure. This requires 752the relevant tracepoints to be enabled before the measurement is triggered. 753 754Example of latency debugging interface 755~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 756In this example a pseudo-locked region named "newlock" was created. Here is 757how we can measure the latency in cycles of reading from this region and 758visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS 759is set:: 760 761 # :> /sys/kernel/tracing/trace 762 # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger 763 # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable 764 # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure 765 # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable 766 # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist 767 768 # event histogram 769 # 770 # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active] 771 # 772 773 { latency: 456 } hitcount: 1 774 { latency: 50 } hitcount: 83 775 { latency: 36 } hitcount: 96 776 { latency: 44 } hitcount: 174 777 { latency: 48 } hitcount: 195 778 { latency: 46 } hitcount: 262 779 { latency: 42 } hitcount: 693 780 { latency: 40 } hitcount: 3204 781 { latency: 38 } hitcount: 3484 782 783 Totals: 784 Hits: 8192 785 Entries: 9 786 Dropped: 0 787 788Example of cache hits/misses debugging 789~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 790In this example a pseudo-locked region named "newlock" was created on the L2 791cache of a platform. Here is how we can obtain details of the cache hits 792and misses using the platform's precision counters. 
793:: 794 795 # :> /sys/kernel/tracing/trace 796 # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable 797 # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure 798 # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable 799 # cat /sys/kernel/tracing/trace 800 801 # tracer: nop 802 # 803 # _-----=> irqs-off 804 # / _----=> need-resched 805 # | / _---=> hardirq/softirq 806 # || / _--=> preempt-depth 807 # ||| / delay 808 # TASK-PID CPU# |||| TIMESTAMP FUNCTION 809 # | | | |||| | | 810 pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0 811 812 813Examples for RDT allocation usage 814~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 815 8161) Example 1 817 818On a two socket machine (one L3 cache per socket) with just four bits 819for cache bit masks, minimum b/w of 10% with a memory bandwidth 820granularity of 10%. 821:: 822 823 # mount -t resctrl resctrl /sys/fs/resctrl 824 # cd /sys/fs/resctrl 825 # mkdir p0 p1 826 # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata 827 # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata 828 829The default resource group is unmodified, so we have access to all parts 830of all caches (its schemata file reads "L3:0=f;1=f"). 831 832Tasks that are under the control of group "p0" may only allocate from the 833"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. 834Tasks in group "p1" use the "lower" 50% of cache on both sockets. 835 836Similarly, tasks that are under the control of group "p0" may use a 837maximum memory b/w of 50% on socket0 and 50% on socket 1. 838Tasks in group "p1" may also use 50% memory b/w on both sockets. 839Note that unlike cache masks, memory b/w cannot specify whether these 840allocations can overlap or not. The allocations specifies the maximum 841b/w that the group may be able to use and the system admin can configure 842the b/w accordingly. 843 844If resctrl is using the software controller (mba_sc) then user can enter the 845max b/w in MB rather than the percentage values. 846:: 847 848 # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata 849 # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata 850 851In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w 852of 1024MB where as on socket 1 they would use 500MB. 853 8542) Example 2 855 856Again two sockets, but this time with a more realistic 20-bit mask. 857 858Two real time tasks pid=1234 running on processor 0 and pid=5678 running on 859processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy 860neighbors, each of the two real-time tasks exclusively occupies one quarter 861of L3 cache on socket 0. 862:: 863 864 # mount -t resctrl resctrl /sys/fs/resctrl 865 # cd /sys/fs/resctrl 866 867First we reset the schemata for the default group so that the "upper" 86850% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by 869ordinary tasks:: 870 871 # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata 872 873Next we make a resource group for our first real time task and give 874it access to the "top" 25% of the cache on socket 0. 875:: 876 877 # mkdir p0 878 # echo "L3:0=f8000;1=fffff" > p0/schemata 879 880Finally we move our first real time task into this resource group. We 881also use taskset(1) to ensure the task always runs on a dedicated CPU 882on socket 0. Most uses of resource groups will also constrain which 883processors tasks run on. 
884:: 885 886 # echo 1234 > p0/tasks 887 # taskset -cp 1 1234 888 889Ditto for the second real time task (with the remaining 25% of cache):: 890 891 # mkdir p1 892 # echo "L3:0=7c00;1=fffff" > p1/schemata 893 # echo 5678 > p1/tasks 894 # taskset -cp 2 5678 895 896For the same 2 socket system with memory b/w resource and CAT L3 the 897schemata would look like(Assume min_bandwidth 10 and bandwidth_gran is 89810): 899 900For our first real time task this would request 20% memory b/w on socket 0. 901:: 902 903 # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata 904 905For our second real time task this would request an other 20% memory b/w 906on socket 0. 907:: 908 909 # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata 910 9113) Example 3 912 913A single socket system which has real-time tasks running on core 4-7 and 914non real-time workload assigned to core 0-3. The real-time tasks share text 915and data, so a per task association is not required and due to interaction 916with the kernel it's desired that the kernel on these cores shares L3 with 917the tasks. 918:: 919 920 # mount -t resctrl resctrl /sys/fs/resctrl 921 # cd /sys/fs/resctrl 922 923First we reset the schemata for the default group so that the "upper" 92450% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0 925cannot be used by ordinary tasks:: 926 927 # echo "L3:0=3ff\nMB:0=50" > schemata 928 929Next we make a resource group for our real time cores and give it access 930to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on 931socket 0. 932:: 933 934 # mkdir p0 935 # echo "L3:0=ffc00\nMB:0=50" > p0/schemata 936 937Finally we move core 4-7 over to the new group and make sure that the 938kernel and the tasks running there get 50% of the cache. They should 939also get 50% of memory bandwidth assuming that the cores 4-7 are SMT 940siblings and only the real time threads are scheduled on the cores 4-7. 941:: 942 943 # echo F0 > p0/cpus 944 9454) Example 4 946 947The resource groups in previous examples were all in the default "shareable" 948mode allowing sharing of their cache allocations. If one resource group 949configures a cache allocation then nothing prevents another resource group 950to overlap with that allocation. 951 952In this example a new exclusive resource group will be created on a L2 CAT 953system with two L2 cache instances that can be configured with an 8-bit 954capacity bitmask. The new exclusive resource group will be configured to use 95525% of each cache instance. 956:: 957 958 # mount -t resctrl resctrl /sys/fs/resctrl/ 959 # cd /sys/fs/resctrl 960 961First, we observe that the default group is configured to allocate to all L2 962cache:: 963 964 # cat schemata 965 L2:0=ff;1=ff 966 967We could attempt to create the new resource group at this point, but it will 968fail because of the overlap with the schemata of the default group:: 969 970 # mkdir p0 971 # echo 'L2:0=0x3;1=0x3' > p0/schemata 972 # cat p0/mode 973 shareable 974 # echo exclusive > p0/mode 975 -sh: echo: write error: Invalid argument 976 # cat info/last_cmd_status 977 schemata overlaps 978 979To ensure that there is no overlap with another resource group the default 980resource group's schemata has to change, making it possible for the new 981resource group to become exclusive. 982:: 983 984 # echo 'L2:0=0xfc;1=0xfc' > schemata 985 # echo exclusive > p0/mode 986 # grep . 
p0/* 987 p0/cpus:0 988 p0/mode:exclusive 989 p0/schemata:L2:0=03;1=03 990 p0/size:L2:0=262144;1=262144 991 992A new resource group will on creation not overlap with an exclusive resource 993group:: 994 995 # mkdir p1 996 # grep . p1/* 997 p1/cpus:0 998 p1/mode:shareable 999 p1/schemata:L2:0=fc;1=fc 1000 p1/size:L2:0=786432;1=786432 1001 1002The bit_usage will reflect how the cache is used:: 1003 1004 # cat info/L2/bit_usage 1005 0=SSSSSSEE;1=SSSSSSEE 1006 1007A resource group cannot be forced to overlap with an exclusive resource group:: 1008 1009 # echo 'L2:0=0x1;1=0x1' > p1/schemata 1010 -sh: echo: write error: Invalid argument 1011 # cat info/last_cmd_status 1012 overlaps with exclusive group 1013 1014Example of Cache Pseudo-Locking 1015~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1016Lock portion of L2 cache from cache id 1 using CBM 0x3. Pseudo-locked 1017region is exposed at /dev/pseudo_lock/newlock that can be provided to 1018application for argument to mmap(). 1019:: 1020 1021 # mount -t resctrl resctrl /sys/fs/resctrl/ 1022 # cd /sys/fs/resctrl 1023 1024Ensure that there are bits available that can be pseudo-locked, since only 1025unused bits can be pseudo-locked the bits to be pseudo-locked needs to be 1026removed from the default resource group's schemata:: 1027 1028 # cat info/L2/bit_usage 1029 0=SSSSSSSS;1=SSSSSSSS 1030 # echo 'L2:1=0xfc' > schemata 1031 # cat info/L2/bit_usage 1032 0=SSSSSSSS;1=SSSSSS00 1033 1034Create a new resource group that will be associated with the pseudo-locked 1035region, indicate that it will be used for a pseudo-locked region, and 1036configure the requested pseudo-locked region capacity bitmask:: 1037 1038 # mkdir newlock 1039 # echo pseudo-locksetup > newlock/mode 1040 # echo 'L2:1=0x3' > newlock/schemata 1041 1042On success the resource group's mode will change to pseudo-locked, the 1043bit_usage will reflect the pseudo-locked region, and the character device 1044exposing the pseudo-locked region will exist:: 1045 1046 # cat newlock/mode 1047 pseudo-locked 1048 # cat info/L2/bit_usage 1049 0=SSSSSSSS;1=SSSSSSPP 1050 # ls -l /dev/pseudo_lock/newlock 1051 crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock 1052 1053:: 1054 1055 /* 1056 * Example code to access one page of pseudo-locked cache region 1057 * from user space. 1058 */ 1059 #define _GNU_SOURCE 1060 #include <fcntl.h> 1061 #include <sched.h> 1062 #include <stdio.h> 1063 #include <stdlib.h> 1064 #include <unistd.h> 1065 #include <sys/mman.h> 1066 1067 /* 1068 * It is required that the application runs with affinity to only 1069 * cores associated with the pseudo-locked region. Here the cpu 1070 * is hardcoded for convenience of example. 
1071 */ 1072 static int cpuid = 2; 1073 1074 int main(int argc, char *argv[]) 1075 { 1076 cpu_set_t cpuset; 1077 long page_size; 1078 void *mapping; 1079 int dev_fd; 1080 int ret; 1081 1082 page_size = sysconf(_SC_PAGESIZE); 1083 1084 CPU_ZERO(&cpuset); 1085 CPU_SET(cpuid, &cpuset); 1086 ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); 1087 if (ret < 0) { 1088 perror("sched_setaffinity"); 1089 exit(EXIT_FAILURE); 1090 } 1091 1092 dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); 1093 if (dev_fd < 0) { 1094 perror("open"); 1095 exit(EXIT_FAILURE); 1096 } 1097 1098 mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, 1099 dev_fd, 0); 1100 if (mapping == MAP_FAILED) { 1101 perror("mmap"); 1102 close(dev_fd); 1103 exit(EXIT_FAILURE); 1104 } 1105 1106 /* Application interacts with pseudo-locked memory @mapping */ 1107 1108 ret = munmap(mapping, page_size); 1109 if (ret < 0) { 1110 perror("munmap"); 1111 close(dev_fd); 1112 exit(EXIT_FAILURE); 1113 } 1114 1115 close(dev_fd); 1116 exit(EXIT_SUCCESS); 1117 } 1118 1119Locking between applications 1120---------------------------- 1121 1122Certain operations on the resctrl filesystem, composed of read/writes 1123to/from multiple files, must be atomic. 1124 1125As an example, the allocation of an exclusive reservation of L3 cache 1126involves: 1127 1128 1. Read the cbmmasks from each directory or the per-resource "bit_usage" 1129 2. Find a contiguous set of bits in the global CBM bitmask that is clear 1130 in any of the directory cbmmasks 1131 3. Create a new directory 1132 4. Set the bits found in step 2 to the new directory "schemata" file 1133 1134If two applications attempt to allocate space concurrently then they can 1135end up allocating the same bits so the reservations are shared instead of 1136exclusive. 1137 1138To coordinate atomic operations on the resctrlfs and to avoid the problem 1139above, the following locking procedure is recommended: 1140 1141Locking is based on flock, which is available in libc and also as a shell 1142script command 1143 1144Write lock: 1145 1146 A) Take flock(LOCK_EX) on /sys/fs/resctrl 1147 B) Read/write the directory structure. 1148 C) funlock 1149 1150Read lock: 1151 1152 A) Take flock(LOCK_SH) on /sys/fs/resctrl 1153 B) If success read the directory structure. 
 C) funlock

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask=$(function-of output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo "$mask" > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>

  void resctrl_take_shared_lock(int fd)
  {
          int ret;

          /* take shared lock on resctrl filesystem */
          ret = flock(fd, LOCK_SH);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
          int ret;

          /* take exclusive lock on resctrl filesystem */
          ret = flock(fd, LOCK_EX);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_release_lock(int fd)
  {
          int ret;

          /* release lock on resctrl filesystem */
          ret = flock(fd, LOCK_UN);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  int main(void)
  {
          int fd;

          fd = open("/sys/fs/resctrl", O_DIRECTORY);
          if (fd == -1) {
                  perror("open");
                  exit(-1);
          }
          resctrl_take_shared_lock(fd);
          /* code to read directory contents */
          resctrl_release_lock(fd);

          resctrl_take_exclusive_lock(fd);
          /* code to read and write directory contents */
          resctrl_release_lock(fd);

          return 0;
  }

Examples for RDT Monitoring along with allocation usage
========================================================
Reading monitored data
----------------------
Reading an event file (for example: mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

Fetch the data (data shown in bytes)
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent ctrl_mon group shows the aggregated data.
1282:: 1283 1284 # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy 1285 31234000 1286 1287Example 2 (Monitor a task from its creation) 1288-------------------------------------------- 1289On a two socket machine (one L3 cache per socket):: 1290 1291 # mount -t resctrl resctrl /sys/fs/resctrl 1292 # cd /sys/fs/resctrl 1293 # mkdir p0 p1 1294 1295An RMID is allocated to the group once its created and hence the <cmd> 1296below is monitored from its creation. 1297:: 1298 1299 # echo $$ > /sys/fs/resctrl/p1/tasks 1300 # <cmd> 1301 1302Fetch the data:: 1303 1304 # cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy 1305 31789000 1306 1307Example 3 (Monitor without CAT support or before creating CAT groups) 1308--------------------------------------------------------------------- 1309 1310Assume a system like HSW has only CQM and no CAT support. In this case 1311the resctrl will still mount but cannot create CTRL_MON directories. 1312But user can create different MON groups within the root group thereby 1313able to monitor all tasks including kernel threads. 1314 1315This can also be used to profile jobs cache size footprint before being 1316able to allocate them to different allocation groups. 1317:: 1318 1319 # mount -t resctrl resctrl /sys/fs/resctrl 1320 # cd /sys/fs/resctrl 1321 # mkdir mon_groups/m01 1322 # mkdir mon_groups/m02 1323 1324 # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks 1325 # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks 1326 1327Monitor the groups separately and also get per domain data. From the 1328below its apparent that the tasks are mostly doing work on 1329domain(socket) 0. 1330:: 1331 1332 # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy 1333 31234000 1334 # cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy 1335 34555 1336 # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy 1337 31234000 1338 # cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy 1339 32789 1340 1341 1342Example 4 (Monitor real time tasks) 1343----------------------------------- 1344 1345A single socket system which has real time tasks running on cores 4-7 1346and non real time tasks on other cpus. We want to monitor the cache 1347occupancy of the real time threads on these cores. 1348:: 1349 1350 # mount -t resctrl resctrl /sys/fs/resctrl 1351 # cd /sys/fs/resctrl 1352 # mkdir p1 1353 1354Move the cpus 4-7 over to p1:: 1355 1356 # echo f0 > p1/cpus 1357 1358View the llc occupancy snapshot:: 1359 1360 # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy 1361 11234000 1362 1363Intel RDT Errata 1364================ 1365 1366Intel MBM Counters May Report System Memory Bandwidth Incorrectly 1367----------------------------------------------------------------- 1368 1369Errata SKX99 for Skylake server and BDF102 for Broadwell server. 1370 1371Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics 1372according to the assigned Resource Monitor ID (RMID) for that logical 1373core. The IA32_QM_CTR register (MSR 0xC8E), used to report these 1374metrics, may report incorrect system bandwidth for certain RMID values. 1375 1376Implication: Due to the errata, system memory bandwidth may not match 1377what is reported. 
1378 1379Workaround: MBM total and local readings are corrected according to the 1380following correction factor table: 1381 1382+---------------+---------------+---------------+-----------------+ 1383|core count |rmid count |rmid threshold |correction factor| 1384+---------------+---------------+---------------+-----------------+ 1385|1 |8 |0 |1.000000 | 1386+---------------+---------------+---------------+-----------------+ 1387|2 |16 |0 |1.000000 | 1388+---------------+---------------+---------------+-----------------+ 1389|3 |24 |15 |0.969650 | 1390+---------------+---------------+---------------+-----------------+ 1391|4 |32 |0 |1.000000 | 1392+---------------+---------------+---------------+-----------------+ 1393|6 |48 |31 |0.969650 | 1394+---------------+---------------+---------------+-----------------+ 1395|7 |56 |47 |1.142857 | 1396+---------------+---------------+---------------+-----------------+ 1397|8 |64 |0 |1.000000 | 1398+---------------+---------------+---------------+-----------------+ 1399|9 |72 |63 |1.185115 | 1400+---------------+---------------+---------------+-----------------+ 1401|10 |80 |63 |1.066553 | 1402+---------------+---------------+---------------+-----------------+ 1403|11 |88 |79 |1.454545 | 1404+---------------+---------------+---------------+-----------------+ 1405|12 |96 |0 |1.000000 | 1406+---------------+---------------+---------------+-----------------+ 1407|13 |104 |95 |1.230769 | 1408+---------------+---------------+---------------+-----------------+ 1409|14 |112 |95 |1.142857 | 1410+---------------+---------------+---------------+-----------------+ 1411|15 |120 |95 |1.066667 | 1412+---------------+---------------+---------------+-----------------+ 1413|16 |128 |0 |1.000000 | 1414+---------------+---------------+---------------+-----------------+ 1415|17 |136 |127 |1.254863 | 1416+---------------+---------------+---------------+-----------------+ 1417|18 |144 |127 |1.185255 | 1418+---------------+---------------+---------------+-----------------+ 1419|19 |152 |0 |1.000000 | 1420+---------------+---------------+---------------+-----------------+ 1421|20 |160 |127 |1.066667 | 1422+---------------+---------------+---------------+-----------------+ 1423|21 |168 |0 |1.000000 | 1424+---------------+---------------+---------------+-----------------+ 1425|22 |176 |159 |1.454334 | 1426+---------------+---------------+---------------+-----------------+ 1427|23 |184 |0 |1.000000 | 1428+---------------+---------------+---------------+-----------------+ 1429|24 |192 |127 |0.969744 | 1430+---------------+---------------+---------------+-----------------+ 1431|25 |200 |191 |1.280246 | 1432+---------------+---------------+---------------+-----------------+ 1433|26 |208 |191 |1.230921 | 1434+---------------+---------------+---------------+-----------------+ 1435|27 |216 |0 |1.000000 | 1436+---------------+---------------+---------------+-----------------+ 1437|28 |224 |191 |1.143118 | 1438+---------------+---------------+---------------+-----------------+ 1439 1440If rmid > rmid threshold, MBM total and local values should be multiplied 1441by the correction factor. 1442 1443See: 1444 14451. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update: 1446http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html 1447 14482. 
Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update: 1449http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf 1450 14513. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual: 1452https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html 1453 1454for further information. 1455