.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===========================================
User Interface for Resource Control feature
===========================================

:Copyright: |copy| 2016 Intel Corporation
:Authors: - Fenghua Yu <fenghua.yu@intel.com>
          - Tony Luck <tony.luck@intel.com>
          - Vikas Shivappa <vikas.shivappa@intel.com>


Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).

This feature is enabled by the CONFIG_X86_CPU_RESCTRL kernel configuration
option and is advertised by the following x86 /proc/cpuinfo flag bits:

=============================================== ================================
RDT (Resource Director Technology) Allocation   "rdt_a"
CAT (Cache Allocation Technology)               "cat_l3", "cat_l2"
CDP (Code and Data Prioritization)              "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring)                      "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring)               "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation)               "mba"
SMBA (Slow Memory Bandwidth Allocation)         ""
BMEC (Bandwidth Monitoring Event Configuration) ""
=============================================== ================================

Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
flag to /proc/cpuinfo should be avoided if user space can obtain information
about the feature from resctrl's info directory.

To use the feature mount the file system::

  # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp":
        Enable code/data prioritization in L3 cache allocations.
"cdpl2":
        Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
        Enable the MBA Software Controller (mba_sc) to specify MBA
        bandwidth in MBps.

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.
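
As a quick sanity check before mounting, the feature flags from the table
above can be looked up in /proc/cpuinfo. The flags shown below are only an
example; the set that is present depends on the CPU::

  # grep -o 'rdt_a\|cat_l3\|cdp_l3\|mba\|cqm_llc' /proc/cpuinfo | sort -u
  cat_l3
  cdp_l3
  cqm_llc
  mba
  rdt_a
  # mount -t resctrl resctrl -o cdp /sys/fs/resctrl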

Info directory
==============

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

The cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":
        The number of CLOSIDs which are valid for this
        resource. The kernel uses the smallest number of
        CLOSIDs of all enabled resources as limit.
"cbm_mask":
        The bitmask which is valid for this resource.
        This mask is equivalent to 100%.
"min_cbm_bits":
        The minimum number of consecutive bits which
        must be set when writing a mask.

"shareable_bits":
        Bitmask of shareable resource with other executing
        entities (e.g. I/O). Users can use this when
        setting up exclusive cache partitions. Note that
        some platforms support devices that have their
        own settings for cache use which can over-ride
        these bits.
"bit_usage":
        Annotated capacity bitmasks showing how all
        instances of the resource are used. The legend is:

        "0":
                Corresponding region is unused. When the system's
                resources have been allocated and a "0" is found
                in "bit_usage" it is a sign that resources are
                wasted.

        "H":
                Corresponding region is used by hardware only
                but available for software use. If a resource
                has bits set in "shareable_bits" but not all
                of these bits appear in the resource groups'
                schematas then the bits appearing in
                "shareable_bits" but in no resource group will
                be marked as "H".
        "X":
                Corresponding region is available for sharing and
                used by hardware and software. These are the
                bits that appear in "shareable_bits" as
                well as a resource group's allocation.
        "S":
                Corresponding region is used by software
                and available for sharing.
        "E":
                Corresponding region is used exclusively by
                one resource group. No sharing allowed.
        "P":
                Corresponding region is pseudo-locked. No
                sharing allowed.

The memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth":
        The minimum memory bandwidth percentage which
        user can request.

"bandwidth_gran":
        The granularity in which the memory bandwidth
        percentage is allocated. The allocated
        b/w percentage is rounded off to the next
        control step available on the hardware. The
        available bandwidth control steps are:
        min_bandwidth + N * bandwidth_gran.

"delay_linear":
        Indicates if the delay scale is linear or
        non-linear. This field is purely informational.

"thread_throttle_mode":
        Indicator on Intel systems of how tasks running on threads
        of a physical core are throttled in cases where they
        request different memory bandwidth percentages:

        "max":
                the smallest percentage is applied
                to all threads
        "per-thread":
                bandwidth percentages are directly applied to
                the threads running on the core
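
As an illustration, the set of valid bandwidth settings for a domain follows
directly from these two files. The values below are only an example; both
are hardware dependent::

  # cat /sys/fs/resctrl/info/MB/min_bandwidth
  10
  # cat /sys/fs/resctrl/info/MB/bandwidth_gran
  10

With these values the valid control steps are 10, 20, 30, ... 100 percent,
and any other percentage written to "schemata" is rounded to one of these
steps.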

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":
        The number of RMIDs available. This is the
        upper bound for how many "CTRL_MON" + "MON"
        groups can be created.

"mon_features":
        Lists the monitoring events if
        monitoring is enabled for the resource.
        Example::

          # cat /sys/fs/resctrl/info/L3_MON/mon_features
          llc_occupancy
          mbm_total_bytes
          mbm_local_bytes

        If the system supports Bandwidth Monitoring Event
        Configuration (BMEC), then the bandwidth events will
        be configurable. The output will be::

          # cat /sys/fs/resctrl/info/L3_MON/mon_features
          llc_occupancy
          mbm_total_bytes
          mbm_total_bytes_config
          mbm_local_bytes
          mbm_local_bytes_config

"mbm_total_bytes_config", "mbm_local_bytes_config":
        Read/write files containing the configuration for the mbm_total_bytes
        and mbm_local_bytes events, respectively, when the Bandwidth
        Monitoring Event Configuration (BMEC) feature is supported.
        The event configuration settings are domain specific and affect
        all the CPUs in the domain. When either event configuration is
        changed, the bandwidth counters for all RMIDs of both events
        (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
        domain. The next read for every RMID will report "Unavailable"
        and subsequent reads will report the valid value.

        The following event types are supported:

        ==== ========================================================
        Bits Description
        ==== ========================================================
        6    Dirty Victims from the QOS domain to all types of memory
        5    Reads to slow memory in the non-local NUMA domain
        4    Reads to slow memory in the local NUMA domain
        3    Non-temporal writes to non-local NUMA domain
        2    Non-temporal writes to local NUMA domain
        1    Reads to memory in the non-local NUMA domain
        0    Reads to memory in the local NUMA domain
        ==== ========================================================

        By default, the mbm_total_bytes configuration is set to 0x7f to count
        all the event types and the mbm_local_bytes configuration is set to
        0x15 to count all the local memory events.

        Examples:

        * To view the current configuration::

            # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
            0=0x7f;1=0x7f;2=0x7f;3=0x7f

            # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
            0=0x15;1=0x15;3=0x15;4=0x15

        * To change mbm_total_bytes to count only reads on domain 0,
          bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
          (in hexadecimal 0x33)::

            # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config

            # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
            0=0x33;1=0x7f;2=0x7f;3=0x7f

        * To change mbm_local_bytes to count all the slow memory reads on
          domains 0 and 1, bits 4 and 5 need to be set, which is 110000b
          in binary (in hexadecimal 0x30)::

            # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config

            # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
            0=0x30;1=0x30;3=0x15;4=0x15

"max_threshold_occupancy":
        Read/write file provides the largest value (in
        bytes) at which a previously used LLC_occupancy
        counter can be considered for re-use.
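
For example, the threshold can be read and, if desired, raised. The values
below are illustrative, and the kernel may round a written value to the
granularity supported by the hardware::

  # cat /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy
  540672
  # echo 1048576 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy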

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.
::

  # echo L3:0=f7 > schemata
  bash: echo: write error: Invalid argument
  # cat info/last_cmd_status
  mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
=================================

Resource groups are represented as directories in the resctrl file
system. The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
        Reading this file shows the list of all tasks that belong to
        this group. Writing a task id to the file will add a task to the
        group. If the group is a CTRL_MON group the task is removed from
        whichever previous CTRL_MON group owned the task and also from
        any MON group that owned the task. If the group is a MON group,
        then the task must already belong to the CTRL_MON parent of this
        group. The task is removed from any previous MON group.


"cpus":
        Reading this file shows a bitmask of the logical CPUs owned by
        this group. Writing a mask to this file will add and remove
        CPUs to/from this group. As with the tasks file a hierarchy is
        maintained where MON groups may only include CPUs owned by the
        parent CTRL_MON group.
        When the resource group is in pseudo-locked mode this file will
        only be readable, reflecting the CPUs associated with the
        pseudo-locked region.


"cpus_list":
        Just like "cpus", only using ranges of CPUs instead of bitmasks.


When control is enabled all CTRL_MON groups will also contain:

"schemata":
        A list of all the resources available to this group.
        Each resource has its own line and format - see below for details.

"size":
        Mirrors the display of the "schemata" file to display the size in
        bytes of each allocation instead of the bits representing the
        allocation.

"mode":
        The "mode" of the resource group dictates the sharing of its
        allocations. A "shareable" resource group allows sharing of its
        allocations while an "exclusive" resource group does not. A
        cache pseudo-locked region is created by first writing
        "pseudo-locksetup" to the "mode" file before writing the cache
        pseudo-locked region's schemata to the resource group's "schemata"
        file. On successful pseudo-locked region creation the mode will
        automatically change to "pseudo-locked".

When monitoring is enabled all MON groups will also contain:

"mon_data":
        This contains a set of files organized by L3 domain and by
        RDT event. E.g. on a system with two L3 domains there will
        be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
        directories has one file per event (e.g. "llc_occupancy",
        "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
        files provide a read out of the current value of the event for
        all tasks in the group. In CTRL_MON groups these files provide
        the sum for all tasks in the CTRL_MON group and all tasks in
        MON groups. Please see the example section for more details on usage.
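
Putting these files together, a control group and a monitor group inside it
can be created and populated as follows. The group names and the task id are
arbitrary examples, and the system must support both allocation and
monitoring::

  # mkdir /sys/fs/resctrl/grp0
  # echo 1234 > /sys/fs/resctrl/grp0/tasks
  # mkdir /sys/fs/resctrl/grp0/mon_groups/m0
  # echo 1234 > /sys/fs/resctrl/grp0/mon_groups/m0/tasks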

Resource allocation rules
-------------------------

When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
===============================================
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
it to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSID (Class of Service ID) and an RMID (Resource
Monitoring ID) to identify a control group and a monitoring group
respectively. Each of the resource groups is mapped to these IDs based on
the kind of group. The number of CLOSIDs and RMIDs is limited by the
hardware, hence the creation of a "CTRL_MON" directory may fail if we run
out of either CLOSIDs or RMIDs, and creation of a "MON" group may fail if
we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID, once freed, may not be available for use immediately
because it is still tagged to the cache lines of its previous user. Hence
such RMIDs are placed on a limbo list and checked periodically until the
cache occupancy has gone down. If at some point the system has many limbo
RMIDs which are not yet ready to be used, the user may see an -EBUSY
during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket and multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
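
For example, on a hypothetical two-socket system the L3 cache IDs of two
CPUs located on different sockets might read as follows (index3 happens to
be the L3 cache here; the cache level of an index directory can be checked
via its "level" file)::

  # cat /sys/devices/system/cpu/cpu0/cache/index3/id
  0
  # cat /sys/devices/system/cpu/cpu36/cache/index3/id
  1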
So instead 429of using "socket" or "core" to define the set of logical cpus sharing 430a resource we use a "Cache ID". At a given cache level this will be a 431unique number across the whole system (but it isn't guaranteed to be a 432contiguous sequence, there may be gaps). To find the ID for each logical 433CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id 434 435Cache Bit Masks (CBM) 436--------------------- 437For cache resources we describe the portion of the cache that is available 438for allocation using a bitmask. The maximum value of the mask is defined 439by each cpu model (and may be different for different cache levels). It 440is found using CPUID, but is also provided in the "info" directory of 441the resctrl file system in "info/{resource}/cbm_mask". Intel hardware 442requires that these masks have all the '1' bits in a contiguous block. So 4430x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9 444and 0xA are not. On a system with a 20-bit mask each bit represents 5% 445of the capacity of the cache. You could partition the cache into four 446equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. 447 448Memory bandwidth Allocation and monitoring 449========================================== 450 451For Memory bandwidth resource, by default the user controls the resource 452by indicating the percentage of total memory bandwidth. 453 454The minimum bandwidth percentage value for each cpu model is predefined 455and can be looked up through "info/MB/min_bandwidth". The bandwidth 456granularity that is allocated is also dependent on the cpu model and can 457be looked up at "info/MB/bandwidth_gran". The available bandwidth 458control steps are: min_bw + N * bw_gran. Intermediate values are rounded 459to the next control step available on the hardware. 460 461The bandwidth throttling is a core specific mechanism on some of Intel 462SKUs. Using a high bandwidth and a low bandwidth setting on two threads 463sharing a core may result in both threads being throttled to use the 464low bandwidth (see "thread_throttle_mode"). 465 466The fact that Memory bandwidth allocation(MBA) may be a core 467specific mechanism where as memory bandwidth monitoring(MBM) is done at 468the package level may lead to confusion when users try to apply control 469via the MBA and then monitor the bandwidth to see if the controls are 470effective. Below are such scenarios: 471 4721. User may *not* see increase in actual bandwidth when percentage 473 values are increased: 474 475This can occur when aggregate L2 external bandwidth is more than L3 476external bandwidth. Consider an SKL SKU with 24 cores on a package and 477where L2 external is 10GBps (hence aggregate L2 external bandwidth is 478240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20 479threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3 480bandwidth of 100GBps although the percentage value specified is only 50% 481<< 100%. Hence increasing the bandwidth percentage will not yield any 482more bandwidth. This is because although the L2 external bandwidth still 483has capacity, the L3 external bandwidth is fully used. Also note that 484this would be dependent on number of cores the benchmark is run on. 485 4862. 

2. The same bandwidth percentage may mean different actual bandwidth
   depending on the number of threads:

For the same SKU as in #1, a 'single thread, with 10% bandwidth' and a '4
thread, with 10% bandwidth' workload can consume up to 10GBps and 40GBps
respectively, although they have the same percentage bandwidth of 10%.
This is simply because as threads start using more cores in an rdtgroup,
the actual bandwidth may increase or vary although the user specified
bandwidth percentage is the same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller (mba_sc)" which reads the actual bandwidth using MBM counters
and adjusts the memory bandwidth percentages to ensure::

  "actual bandwidth < user specified bandwidth"

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.
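
For example, the software controller is selected with the 'mba_MBps' mount
option, after which the MB values written to and read from "schemata" are
MBps rather than percentages (the value below is only an example)::

  # umount /sys/fs/resctrl
  # mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
  # echo "MB:0=2048" > /sys/fs/resctrl/schemata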

L3 schemata file details (code and data prioritization disabled)
-----------------------------------------------------------------
With CDP disabled the L3 schemata format is::

  L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
-------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this::

  L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
  L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
CDP is supported at L2 using the 'cdpl2' mount option. The schemata
format is either::

  L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

or::

  L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
  L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...


Memory bandwidth Allocation (default mode)
------------------------------------------

The memory bandwidth domain is the L3 cache.
::

  MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

The memory bandwidth domain is the L3 cache.
::

  MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
CXL.memory is the only supported "slow" memory device. With the
support of SMBA, the hardware enables bandwidth allocation on
the slow memory devices. If there are multiple such devices in
the system, the throttling logic groups all the slow sources
together and applies the limit on them as a whole.

The presence of the SMBA feature (with CXL.memory) is independent of the
presence of slow memory devices. If there are no such devices on the
system, then configuring SMBA will have no impact on the performance of
the system.

The bandwidth domain for slow memory is the L3 cache. Its schemata file
is formatted as:
::

  SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.
::

  # cat schemata
  L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
  # echo "L3DATA:2=3c0;" > schemata
  # cat schemata
  L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
  L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Reading/writing the schemata file (on AMD systems)
--------------------------------------------------
Reading the schemata file will show the current bandwidth limit on all
domains. The allocated resources are in multiples of one eighth GB/s.
When writing to the file, you need to specify the cache id for which you
wish to configure the bandwidth limit.

For example, to allocate a 2GB/s limit on cache id 1:

::

  # cat schemata
  MB:0=2048;1=2048;2=2048;3=2048
  L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "MB:1=16" > schemata
  # cat schemata
  MB:0=2048;1=  16;2=2048;3=2048
  L3:0=ffff;1=ffff;2=ffff;3=ffff

Reading/writing the schemata file (on AMD systems) with SMBA feature
--------------------------------------------------------------------
Reading and writing the schemata file is the same as without SMBA in
the section above.

For example, to allocate an 8GB/s limit on cache id 1:

::

  # cat schemata
  SMBA:0=2048;1=2048;2=2048;3=2048
  MB:0=2048;1=2048;2=2048;3=2048
  L3:0=ffff;1=ffff;2=ffff;3=ffff

  # echo "SMBA:1=64" > schemata
  # cat schemata
  SMBA:0=2048;1=  64;2=2048;3=2048
  MB:0=2048;1=2048;2=2048;3=2048
  L3:0=ffff;1=ffff;2=ffff;3=ffff

Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:

- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation. Even though the cache pseudo-locked region will from
  this point on not appear in any CBM of any CLOS an application running with
  any CLOS will be able to access the memory in the pseudo-locked region since
  the region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
"locked" data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling, there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:

1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:

1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available. Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement of the pseudo-locked region depends on the number written to this
debugfs file:

1:
        writing "1" to the pseudo_lock_measure file will trigger the latency
        measurement captured in the pseudo_lock_mem_latency tracepoint. See
        example below.
2:
        writing "2" to the pseudo_lock_measure file will trigger the L2 cache
        residency (cache hits and misses) measurement captured in the
        pseudo_lock_l2 tracepoint. See example below.
3:
        writing "3" to the pseudo_lock_measure file will trigger the L3 cache
        residency (cache hits and misses) measurement captured in the
        pseudo_lock_l3 tracepoint.

All measurements are recorded with the tracing infrastructure. This requires
the relevant tracepoints to be enabled before the measurement is triggered.

Example of latency debugging interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::

  # :> /sys/kernel/tracing/trace
  # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
  # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist

  # event histogram
  #
  # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
  #

  { latency:        456 } hitcount:          1
  { latency:         50 } hitcount:         83
  { latency:         36 } hitcount:         96
  { latency:         44 } hitcount:        174
  { latency:         48 } hitcount:        195
  { latency:         46 } hitcount:        262
  { latency:         42 } hitcount:        693
  { latency:         40 } hitcount:       3204
  { latency:         38 } hitcount:       3484

  Totals:
      Hits: 8192
      Entries: 9
      Dropped: 0

Example of cache hits/misses debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::

  # :> /sys/kernel/tracing/trace
  # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
  # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
  # cat /sys/kernel/tracing/trace

  # tracer: nop
  #
  #                              _-----=> irqs-off
  #                             / _----=> need-resched
  #                            | / _---=> hardirq/softirq
  #                            || / _--=> preempt-depth
  #                            ||| /     delay
  #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
  #              | |       |   ||||       |         |
   pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1) Example 1

On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, a minimum b/w of 10% and a memory bandwidth
granularity of 10%.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If resctrl is using the software controller (mba_sc) then the user can
enter the maximum b/w in MBps rather than the percentage values.
::

  # echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
  # echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a
maximum b/w of 1024MBps whereas on socket 1 they would use 500MBps.

2) Example 2

Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks::

  # echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.
::

  # mkdir p0
  # echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.
::

  # echo 1234 > p0/tasks
  # taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache)::

  # mkdir p1
  # echo "L3:0=7c00;1=fffff" > p1/schemata
  # echo 5678 > p1/tasks
  # taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like this (assuming min_bandwidth is 10 and
bandwidth_gran is 10):

For our first real time task this would request 20% memory b/w on socket 0.
::

  # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.
::

  # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

3) Example 3

A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks::

  # echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.
::

  # mkdir p0
  # echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
::

  # echo F0 > p0/cpus

4) Example 4

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache::

  # cat schemata
  L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group::

  # mkdir p0
  # echo 'L2:0=0x3;1=0x3' > p0/schemata
  # cat p0/mode
  shareable
  # echo exclusive > p0/mode
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.
::

  # echo 'L2:0=0xfc;1=0xfc' > schemata
  # echo exclusive > p0/mode
  # grep . p0/*
  p0/cpus:0
  p0/mode:exclusive
  p0/schemata:L2:0=03;1=03
  p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group::

  # mkdir p1
  # grep . p1/*
  p1/cpus:0
  p1/mode:shareable
  p1/schemata:L2:0=fc;1=fc
  p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used::

  # cat info/L2/bit_usage
  0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group::

  # echo 'L2:0=0x1;1=0x1' > p1/schemata
  -sh: echo: write error: Invalid argument
  # cat info/last_cmd_status
  overlaps with exclusive group

Example of Cache Pseudo-Locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lock a portion of the L2 cache with cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock that can be
provided to an application as an argument to mmap().
::

  # mount -t resctrl resctrl /sys/fs/resctrl/
  # cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata::

  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSSS
  # echo 'L2:1=0xfc' > schemata
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask::

  # mkdir newlock
  # echo pseudo-locksetup > newlock/mode
  # echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist::

  # cat newlock/mode
  pseudo-locked
  # cat info/L2/bit_usage
  0=SSSSSSSS;1=SSSSSSPP
  # ls -l /dev/pseudo_lock/newlock
  crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
1064 */ 1065 static int cpuid = 2; 1066 1067 int main(int argc, char *argv[]) 1068 { 1069 cpu_set_t cpuset; 1070 long page_size; 1071 void *mapping; 1072 int dev_fd; 1073 int ret; 1074 1075 page_size = sysconf(_SC_PAGESIZE); 1076 1077 CPU_ZERO(&cpuset); 1078 CPU_SET(cpuid, &cpuset); 1079 ret = sched_setaffinity(0, sizeof(cpuset), &cpuset); 1080 if (ret < 0) { 1081 perror("sched_setaffinity"); 1082 exit(EXIT_FAILURE); 1083 } 1084 1085 dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR); 1086 if (dev_fd < 0) { 1087 perror("open"); 1088 exit(EXIT_FAILURE); 1089 } 1090 1091 mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, 1092 dev_fd, 0); 1093 if (mapping == MAP_FAILED) { 1094 perror("mmap"); 1095 close(dev_fd); 1096 exit(EXIT_FAILURE); 1097 } 1098 1099 /* Application interacts with pseudo-locked memory @mapping */ 1100 1101 ret = munmap(mapping, page_size); 1102 if (ret < 0) { 1103 perror("munmap"); 1104 close(dev_fd); 1105 exit(EXIT_FAILURE); 1106 } 1107 1108 close(dev_fd); 1109 exit(EXIT_SUCCESS); 1110 } 1111 1112Locking between applications 1113---------------------------- 1114 1115Certain operations on the resctrl filesystem, composed of read/writes 1116to/from multiple files, must be atomic. 1117 1118As an example, the allocation of an exclusive reservation of L3 cache 1119involves: 1120 1121 1. Read the cbmmasks from each directory or the per-resource "bit_usage" 1122 2. Find a contiguous set of bits in the global CBM bitmask that is clear 1123 in any of the directory cbmmasks 1124 3. Create a new directory 1125 4. Set the bits found in step 2 to the new directory "schemata" file 1126 1127If two applications attempt to allocate space concurrently then they can 1128end up allocating the same bits so the reservations are shared instead of 1129exclusive. 1130 1131To coordinate atomic operations on the resctrlfs and to avoid the problem 1132above, the following locking procedure is recommended: 1133 1134Locking is based on flock, which is available in libc and also as a shell 1135script command 1136 1137Write lock: 1138 1139 A) Take flock(LOCK_EX) on /sys/fs/resctrl 1140 B) Read/write the directory structure. 1141 C) funlock 1142 1143Read lock: 1144 1145 A) Take flock(LOCK_SH) on /sys/fs/resctrl 1146 B) If success read the directory structure. 

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrlfs and to avoid the problem
above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) funlock

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) funlock

Example with bash::

  # Atomically read directory structure
  $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

  # Read directory contents and create new subdirectory

  $ cat create-dir.sh
  find /sys/fs/resctrl/ > output.txt
  mask = function-of(output.txt)
  mkdir /sys/fs/resctrl/newres/
  echo mask > /sys/fs/resctrl/newres/schemata

  $ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C::

  /*
   * Example code to take advisory locks
   * before accessing resctrl filesystem
   */
  #include <sys/file.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>

  void resctrl_take_shared_lock(int fd)
  {
          int ret;

          /* take shared lock on resctrl filesystem */
          ret = flock(fd, LOCK_SH);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_take_exclusive_lock(int fd)
  {
          int ret;

          /* take exclusive lock on resctrl filesystem */
          ret = flock(fd, LOCK_EX);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  void resctrl_release_lock(int fd)
  {
          int ret;

          /* release lock on resctrl filesystem */
          ret = flock(fd, LOCK_UN);
          if (ret) {
                  perror("flock");
                  exit(-1);
          }
  }

  int main(void)
  {
          int fd;

          fd = open("/sys/fs/resctrl", O_DIRECTORY);
          if (fd == -1) {
                  perror("open");
                  exit(-1);
          }
          resctrl_take_shared_lock(fd);
          /* code to read directory contents */
          resctrl_release_lock(fd);

          resctrl_take_exclusive_lock(fd);
          /* code to read and write directory contents */
          resctrl_release_lock(fd);

          return 0;
  }

Examples for RDT Monitoring along with allocation usage
========================================================
Reading monitored data
----------------------
Reading an event file (for example: mon_data/mon_L3_00/llc_occupancy) would
show the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
-------------------------------------------------------------------------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1
  # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
  # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
  # echo 5678 > p1/tasks
  # echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.
::

  # cd /sys/fs/resctrl/p1/mon_groups
  # mkdir m11 m12
  # echo 5678 > m11/tasks
  # echo 5679 > m12/tasks

Fetch data (data shown in bytes)
::

  # cat m11/mon_data/mon_L3_00/llc_occupancy
  16234000
  # cat m11/mon_data/mon_L3_01/llc_occupancy
  14789000
  # cat m12/mon_data/mon_L3_00/llc_occupancy
  16789000

The parent CTRL_MON group shows the aggregated data.
::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31234000

Example 2 (Monitor a task from its creation)
--------------------------------------------
On a two socket machine (one L3 cache per socket)::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.
::

  # echo $$ > /sys/fs/resctrl/p1/tasks
  # <cmd>

Fetch the data::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------------------------------------------------------------------

Assume a system like HSW that has only CQM and no CAT support. In this
case resctrl will still mount but cannot create CTRL_MON directories.
But the user can create different MON groups within the root group and
thereby monitor all tasks including kernel threads.

This can also be used to profile jobs' cache size footprint before being
able to allocate them to different allocation groups.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir mon_groups/m01
  # mkdir mon_groups/m02

  # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
  # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
output below it is apparent that the tasks are mostly doing work on
domain (socket) 0.
::

  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
  34555
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
  31234000
  # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
  32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.
::

  # mount -t resctrl resctrl /sys/fs/resctrl
  # cd /sys/fs/resctrl
  # mkdir p1

Move the cpus 4-7 over to p1::

  # echo f0 > p1/cpus

View the llc occupancy snapshot::

  # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
  11234000

Intel RDT Errata
================

Intel MBM Counters May Report System Memory Bandwidth Incorrectly
-----------------------------------------------------------------

Errata SKX99 for Skylake server and BDF102 for Broadwell server.

Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
according to the assigned Resource Monitor ID (RMID) for that logical
core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
metrics, may report incorrect system bandwidth for certain RMID values.

Implication: Due to the errata, system memory bandwidth may not match
what is reported.

Workaround: MBM total and local readings are corrected according to the
following correction factor table:

========== ========== ============== =================
core count rmid count rmid threshold correction factor
========== ========== ============== =================
1          8          0              1.000000
2          16         0              1.000000
3          24         15             0.969650
4          32         0              1.000000
6          48         31             0.969650
7          56         47             1.142857
8          64         0              1.000000
9          72         63             1.185115
10         80         63             1.066553
11         88         79             1.454545
12         96         0              1.000000
13         104        95             1.230769
14         112        95             1.142857
15         120        95             1.066667
16         128        0              1.000000
17         136        127            1.254863
18         144        127            1.185255
19         152        0              1.000000
20         160        127            1.066667
21         168        0              1.000000
22         176        159            1.454334
23         184        0              1.000000
24         192        127            0.969744
25         200        191            1.280246
26         208        191            1.230921
27         216        0              1.000000
28         224        191            1.143118
========== ========== ============== =================

If rmid > rmid threshold, MBM total and local values should be multiplied
by the correction factor.
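
As a worked example using the table above: on a part with a core count of
28 (224 RMIDs, RMID threshold 191, correction factor 1.143118), a reading
taken with RMID 200 is scaled because 200 > 191, while a reading taken with
RMID 100 is used unchanged::

  corrected_bytes = raw_bytes * 1.143118   (RMID 200, above the threshold)
  corrected_bytes = raw_bytes              (RMID 100, at or below the threshold)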

See:

1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
   http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
   http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf

3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
   https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html

for further information.