perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On Intel, the tool is based on load latency and precise store facility events
provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
with the thresholding feature. On AMD, the tool uses the IBS op PMU (due to
hardware limitations, perf c2c is not supported on Zen3 CPUs). On Arm64 it uses
SPE to sample load and store operations, therefore hardware and kernel support
is required. See linkperf:perf-arm-spe[1] for a setup guide. Due to the
statistical nature of Arm SPE sampling, not every memory operation will be
sampled.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access details
for cachelines with highest contention - highest number of HITM accesses.

The basic workflow with this tool follows the standard record/report phase.
The user runs the record command to record event data and the report command
to display it.


RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf c2c record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency. Supported on Intel and Arm64 processors
	only. Ignored on other architectures.

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in report (see NODE INFO section)

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE)

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT)

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display Source:Line column.

--show-all::
	Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
	and sort on. Total HITMs (tot) is the default, except on Arm64, which
	uses peer mode as default.

--stitch-lbr::
	Show the callgraph with stitched LBRs, which may give a more complete
	callgraph. The perf.data file must have been obtained using
	perf c2c record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling such as
	setjmp/longjmp, where calls and returns do not match.

--double-cl::
	Group the detection of shared cacheline events into double cacheline
	granularity. Some architectures have an Adjacent Cacheline Prefetch
	feature, which causes cacheline sharing to behave as if the cacheline
	size were doubled.

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default:
(check the perf record man page for details)

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, for example
to enable callchains and system-wide monitoring:

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).
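
As a rough illustration of the kind of analysis the report performs, the
Python sketch below groups sampled accesses by cacheline base address, counts
HITM loads per cacheline, and ranks cachelines by that count. It is not how
perf implements the report: the `Sample` record, the 64-byte line size and the
helper names are assumptions made only for this example. Passing
`double_cl=True` mimics the doubled grouping granularity described for the
--double-cl option.

```python
from collections import defaultdict
from dataclasses import dataclass

CACHELINE_SIZE = 64  # assumed line size; real hardware may differ


@dataclass
class Sample:
    """Hypothetical stand-in for one recorded memory access."""
    addr: int      # data address of the access
    is_hitm: bool  # True if the load hit a modified line in another cache


def cacheline_base(addr, double_cl=False):
    """Round addr down to the start of its (optionally doubled) cacheline."""
    size = CACHELINE_SIZE * 2 if double_cl else CACHELINE_SIZE
    return addr & ~(size - 1)


def rank_cachelines(samples, double_cl=False):
    """Return (cacheline, hitm_count, total_count) tuples, most HITMs first."""
    hitms = defaultdict(int)
    totals = defaultdict(int)
    for s in samples:
        line = cacheline_base(s.addr, double_cl)
        totals[line] += 1
        if s.is_hitm:
            hitms[line] += 1
    return sorted(
        ((line, hitms[line], totals[line]) for line in totals),
        key=lambda row: row[1],
        reverse=True,
    )


# Made-up sample data, for illustration only.
samples = [
    Sample(0x1008, True),
    Sample(0x1030, True),
    Sample(0x1048, True),
    Sample(0x2000, False),
]
for line, hitm, total in rank_cachelines(samples):
    print(f"{line:#x} hitm={hitm} total={total}")
```

With `double_cl=True`, the accesses at 0x1008, 0x1030 and 0x1048 fall into a
single 128-byte group, so two adjacent 64-byte lines are ranked as one entry.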

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general, the report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote loads from peer cache or DRAM

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A - store accesses whose memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs;
    on Arm Neoverse cores, RmtHit accounts for remote accesses, including
    remote DRAM or any upward cache level in the remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for given offset within cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1 and N/A (not available) memory
    level for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the process responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for given accesses - Remote/Local peer load and generic load

  cpu cnt
  - number of CPUs that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores} (Display with HITM types)
      Node{cpus %peers %stores} (Display with peer type)
  - node IDs with list of affected CPUs in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, the following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on the c2c tool for a detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]