perf-c2c(1)
===========

NAME
----
perf-c2c - Shared Data C2C/HITM Analyzer.

SYNOPSIS
--------
[verse]
'perf c2c record' [<options>] <command>
'perf c2c record' [<options>] \-- [<record command options>] <command>
'perf c2c report' [<options>]

DESCRIPTION
-----------
C2C stands for Cache To Cache.

The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
you to track down cacheline contention.

On Intel, the tool is based on the load latency and precise store facility
events provided by Intel CPUs. On PowerPC, the tool uses random instruction
sampling with the thresholding feature. On AMD, the tool uses the IBS Op PMU
(due to hardware limitations, perf c2c is not supported on Zen3 CPUs). On
Arm64, it uses SPE to sample load and store operations, therefore hardware
and kernel support is required. See linkperf:perf-arm-spe[1] for a setup
guide. Due to the statistical nature of Arm SPE sampling, not every memory
operation will be sampled.

These events provide:
  - memory address of the access
  - type of the access (load and store details)
  - latency (in cycles) of the load access

The c2c tool provides means to record this data and report back access
details for the cachelines with the highest contention, that is, the highest
number of HITM accesses.

The basic workflow with this tool follows the standard record/report phases.
The user runs the record command to record event data and the report command
to display it.
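As a rough illustration of the two phases, the sketch below assembles one
record and one report invocation. It only prints the commands, so it can be
tried without perf installed or elevated privileges. The latency threshold of
50 cycles and the 'sleep 5' workload are arbitrary placeholders, not
recommended values.

```shell
# Hypothetical values chosen for illustration only.
ldlat=50            # load latency threshold for --ldlat (Intel and Arm64 only)
workload="sleep 5"  # placeholder command to profile (assumption)

# Phase 1: record. Options after '--' are handed to perf record
# (-a asks for system-wide monitoring).
record_cmd="perf c2c record --ldlat $ldlat -- -a $workload"

# Phase 2: report the recorded data in stdio mode.
report_cmd="perf c2c report --stdio"

# Print the commands instead of running them; run them by hand once reviewed.
echo "$record_cmd"
echo "$report_cmd"
```

The report command reads the file written by the record phase; use -i to
point it at a different input file.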


RECORD OPTIONS
--------------
-e::
--event=::
	Select the PMU event. Use 'perf c2c record -e list'
	to list available events.

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-l::
--ldlat::
	Configure mem-loads latency. Supported on Intel and Arm64 processors
	only. Ignored on other archs.

-k::
--all-kernel::
	Configure all used events to run in kernel space.

-u::
--all-user::
	Configure all used events to run in user space.

REPORT OPTIONS
--------------
-k::
--vmlinux=<file>::
	vmlinux pathname

-v::
--verbose::
	Be more verbose (show counter open errors, etc).

-i::
--input::
	Specify the input file to process.

-N::
--node-info::
	Show extra node info in report (see NODE INFO section)

-c::
--coalesce::
	Specify sorting fields for single cacheline display.
	The following fields are available: tid,pid,iaddr,dso
	(see COALESCE)

-g::
--call-graph::
	Set up callchain parameters.
	Please refer to the perf-report man page for details.

--stdio::
	Force the stdio output (see STDIO OUTPUT)

--stats::
	Display only statistic tables and force stdio mode.

--full-symbols::
	Display full length of symbols.

--no-source::
	Do not display Source:Line column.

--show-all::
	Show all captured HITM lines, regardless of the HITM % 0.0005 limit.

-f::
--force::
	Don't do ownership validation.

-d::
--display::
	Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
	and sort on. Total HITMs (tot) is the default, except on Arm64, which
	uses peer mode as the default.

--stitch-lbr::
	Show callgraph with stitched LBRs, which may have more complete
	callgraph. The perf.data file must have been obtained using
	perf c2c record --call-graph lbr.
	Disabled by default. In common cases with call stack overflows,
	it can recreate better call stacks than the default lbr call stack
	output. But this approach is not foolproof. There can be cases
	where it creates incorrect call stacks from incorrect matches.
	The known limitations include exception handling such as
	setjmp/longjmp, where calls and returns do not match.

--double-cl::
	Group the detection of shared cacheline events into double cacheline
	granularity. Some architectures have an Adjacent Cacheline Prefetch
	feature, which causes cacheline sharing to behave as if the cacheline
	size were doubled.
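To see what the coarser granularity means for addresses, the sketch below maps
two data addresses to their group under single and double cacheline
granularity. It assumes a 64-byte cacheline and made-up addresses; the mask
arithmetic is meant to illustrate the idea rather than to document perf
internals.

```shell
cl_size=64        # assumed cacheline size in bytes
a=$((0x1000))     # made-up address in one 64-byte line
b=$((0x1040))     # made-up address in the adjacent 64-byte line

# Round an address ($1) down to a multiple of a power-of-two size ($2).
align_down() { echo $(( $1 & ~($2 - 1) )); }

single_a=$(align_down "$a" "$cl_size")
single_b=$(align_down "$b" "$cl_size")
double_a=$(align_down "$a" $((cl_size * 2)))
double_b=$(align_down "$b" $((cl_size * 2)))

printf 'single: 0x%x 0x%x\n' "$single_a" "$single_b"
printf 'double: 0x%x 0x%x\n' "$double_a" "$double_b"
```

With the doubled granularity both addresses fall into the same group, so
accesses to them are accounted together.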

C2C RECORD
----------
The perf c2c record command sets up options related to HITM cacheline analysis
and calls the standard perf record command.

The following perf record options are configured by default
(check the perf record man page for details):

  -W,-d,--phys-data,--sample-cpu

Unless specified otherwise with the '-e' option, the following events are
monitored by default on Intel:

  cpu/mem-loads,ldlat=30/P
  cpu/mem-stores/P

the following on AMD:

  ibs_op//

and the following on PowerPC:

  cpu/mem-loads/
  cpu/mem-stores/

The user can pass any 'perf record' option after the '--' mark, for example
(to enable callchains and system-wide monitoring):

  $ perf c2c record -- -g -a

Please check the RECORD OPTIONS section for specific c2c record options.

C2C REPORT
----------
The perf c2c report command displays shared data analysis. It comes in two
display modes: stdio and tui (default).

The report command workflow is as follows:
  - sort all the data based on the cacheline address
  - store access details for each cacheline
  - sort all cachelines based on user settings
  - display data

In general, the perf c2c report output consists of 2 basic views:
  1) most expensive cachelines list
  2) offsets details for each cacheline
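The split between the two views can be pictured with a little address
arithmetic: each sampled data address belongs to one cacheline and sits at
some offset inside it. The sketch below assumes a 64-byte cacheline and a
made-up address.

```shell
cl_size=64                   # assumed cacheline size in bytes
addr=$((0x7f3a1c2d5e58))     # made-up sampled data address

# Cacheline address: the data address rounded down to the cacheline size.
cacheline=$(( addr & ~(cl_size - 1) ))
# Offset: position of the access within that cacheline.
offset=$(( addr & (cl_size - 1) ))

printf 'cacheline=0x%x offset=0x%x\n' "$cacheline" "$offset"
```

Samples that share the same cacheline value end up in one row of the first
view, while their distinct offsets become rows of the second view.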

For each cacheline in the 1) list we display the following data:
(Both stdio and TUI modes follow the same fields output)

  Index
  - zero based index to identify the cacheline

  Cacheline
  - cacheline address (hex number)

  Rmt/Lcl Hitm (Display with HITM types)
  - cacheline percentage of all Remote/Local HITM accesses

  Peer Snoop (Display with peer type)
  - cacheline percentage of all peer accesses

  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
  - count of Total/Local/Remote load HITMs

  Load Peer - Total, Local, Remote (For display with peer type)
  - count of Total/Local/Remote loads from peer cache or DRAM

  Total records
  - sum of all cacheline accesses

  Total loads
  - sum of all load accesses

  Total stores
  - sum of all store accesses

  Store Reference - L1Hit, L1Miss, N/A
    L1Hit - store accesses that hit L1
    L1Miss - store accesses that missed L1
    N/A - store accesses whose memory level is not available

  Core Load Hit - FB, L1, L2
  - count of load hits in FB (Fill Buffer), L1 and L2 cache

  LLC Load Hit - LlcHit, LclHitm
  - count of LLC load accesses, includes LLC hits and LLC HITMs

  RMT Load Hit - RmtHit, RmtHitm
  - count of remote load accesses, includes remote hits and remote HITMs;
    on Arm Neoverse cores, RmtHit accounts for remote accesses, including
    remote DRAM or any upward cache level in the remote node

  Load Dram - Lcl, Rmt
  - count of local and remote DRAM accesses
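The percentage columns relate one cacheline's count to the total over all
cachelines. A minimal sketch of that arithmetic follows, with invented counts
and awk used for the floating-point division; the exact rounding shown by the
tool may differ.

```shell
line_hitm=30     # invented: remote HITMs seen on one cacheline
total_hitm=120   # invented: remote HITMs seen on all cachelines

# Share of this cacheline in all remote HITM accesses, in percent.
pct=$(awk -v l="$line_hitm" -v t="$total_hitm" \
      'BEGIN { printf "%.2f", 100 * l / t }')

echo "Rmt Hitm: ${pct}%"
```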

For each offset in the 2) list we display the following data:

  HITM - Rmt, Lcl (Display with HITM types)
  - % of Remote/Local HITM accesses for given offset within cacheline

  Peer Snoop - Rmt, Lcl (Display with peer type)
  - % of Remote/Local peer accesses for given offset within cacheline

  Store Refs - L1 Hit, L1 Miss, N/A
  - % of store accesses that hit L1, missed L1 and N/A (not available) memory
    level for given offset within cacheline

  Data address - Offset
  - offset address

  Pid
  - pid of the process responsible for the accesses

  Tid
  - tid of the thread responsible for the accesses

  Code address
  - code address responsible for the accesses

  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
  - sum of cycles for given accesses - Remote/Local HITM and generic load

  cycles - rmt peer, lcl peer, load (Display with peer type)
  - sum of cycles for given accesses - Remote/Local peer load and generic load

  cpu cnt
  - number of cpus that participated in the access

  Symbol
  - code symbol related to the 'Code address' value

  Shared Object
  - shared object name related to the 'Code address' value

  Source:Line
  - source information related to the 'Code address' value

  Node
  - nodes participating in the access (see NODE INFO section)

NODE INFO
---------
The 'Node' field displays the nodes that access the given cacheline
offset. Its output comes in 3 flavors:
  - node IDs separated by ','
  - node IDs with stats for each ID, in the following format:
      Node{cpus %hitms %stores} (Display with HITM types)
      Node{cpus %peers %stores} (Display with peer type)
  - node IDs with list of affected CPUs in the following format:
      Node{cpu list}

The user can switch between the above flavors with the -N option or
use the 'n' key to switch interactively in TUI mode.

COALESCE
--------
The user can specify how to sort offsets for a cacheline.

The following fields are available and govern the final
set of output fields for the cacheline offsets output:

  tid   - coalesced by process TIDs
  pid   - coalesced by process PIDs
  iaddr - coalesced by code address, following fields are displayed:
             Code address, Code symbol, Shared Object, Source line
  dso   - coalesced by shared object

By default the coalescing is set up with 'pid,iaddr'.

STDIO OUTPUT
------------
The stdio output displays data on standard output.

The following tables are displayed:
  Trace Event Information
  - overall statistics of memory accesses

  Global Shared Cache Line Event Information
  - overall statistics on shared cachelines

  Shared Data Cache Line Table
  - list of most expensive cachelines

  Shared Cache Line Distribution Pareto
  - list of all accessed offsets for each cacheline

TUI OUTPUT
----------
The TUI output provides an interactive interface to navigate
through the cachelines list and to display offset details.

For details please refer to the help window by pressing the '?' key.

CREDITS
-------
Although Don Zickus, Dick Fowles and Joe Mario worked together
to get this implemented, we got lots of early help from Arnaldo
Carvalho de Melo, Stephane Eranian, Jiri Olsa and Andi Kleen.

C2C BLOG
--------
Check Joe's blog on c2c tool for detailed use case explanation:
  https://joemario.github.io/blog/2016/09/01/c2c-blog/

SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]