=============================================================
Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU)
=============================================================

The Yitian 710, custom-built by Alibaba Group's chip development business,
T-Head, implements an uncore PMU for performance and functional debugging to
facilitate system maintenance.

DDR Sub-System Driveway (DRW) PMU Driver
========================================

Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel
services system memory requests independently of the others and is split into
two independent sub-channels. The DDR Sub-System Driveway implements a
separate PMU for each sub-channel to monitor various performance metrics.

The Driveway PMU devices are exposed to perf as ali_drw_<sys_base_addr>.
For example, ali_drw_21000 and ali_drw_21080 are the PMU devices for the two
sub-channels of the same channel on die 0. PMU devices on die 1 are named
ali_drw_400XXXXX, e.g. ali_drw_40021000.
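The naming scheme can be checked on a running system with a short script.
The sketch below is illustrative only: drw_die is a hypothetical helper that
guesses the die purely from the name pattern described above, and the script
assumes the PMUs are registered under the usual perf sysfs directory,
/sys/bus/event_source/devices.

```shell
#!/bin/sh
# Hypothetical helper: guess the die from a Driveway PMU device name, using
# only the naming pattern described above. Names of the form
# ali_drw_400XXXXX are treated as die 1; anything else is treated as die 0.
drw_die() {
    case "${1#ali_drw_}" in
        400?????) echo 1 ;;
        *)        echo 0 ;;
    esac
}

# List the Driveway PMUs registered on this system, if any.
for dev in /sys/bus/event_source/devices/ali_drw_*; do
    if [ -e "$dev" ]; then
        name=${dev##*/}
        echo "$name die=$(drw_die "$name")"
    fi
done
```

On a machine without the driver loaded, the loop prints nothing.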

Each sub-channel has 36 PMU counters in total, which are classified into
four groups:

- Group 0: PMU Cycle Counter. This group has one pair of counters,
  pmu_cycle_cnt_low and pmu_cycle_cnt_high, that are used as the cycle count
  based on the DDRC core clock.

- Group 1: PMU Bandwidth Counters. This group has 8 counters that count
  the total number of accesses to either the eight bank groups in a
  selected rank, or to the four ranks separately in the first 4 counters.
  The base transfer unit is 64B.

- Group 2: PMU Retry Counters. This group has 10 counters that count the
  total number of retries for each type of uncorrectable error.

- Group 3: PMU Common Counters. This group has 16 counters that count
  common events.

For now, the Driveway PMU driver only uses counters in group 0 and group 3.

The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution
for connecting an SoC application bus to DDR memory devices. The DDRCTL
receives transactions through the Host Interface (HIF), which is custom-defined
by Synopsys. These transactions are queued internally and scheduled for access
while satisfying the SDRAM protocol timing requirements, transaction
priorities, and dependencies between the transactions. The DDRCTL in turn
issues commands on the DDR PHY Interface (DFI) to the PHY module, which
launches and captures data to and from the SDRAM. The Driveway PMUs have
hardware logic to gather statistics and performance logging signals on HIF,
DFI, etc.

The memory bandwidth can be calculated by counting the READ, WRITE and RMW
commands sent to the DDRC through the HIF interface. Example usage of counting
memory data bandwidth::

  perf stat \
    -e ali_drw_21000/hif_wr/ \
    -e ali_drw_21000/hif_rd/ \
    -e ali_drw_21000/hif_rmw/ \
    -e ali_drw_21000/cycle/ \
    -e ali_drw_21080/hif_wr/ \
    -e ali_drw_21080/hif_rd/ \
    -e ali_drw_21080/hif_rmw/ \
    -e ali_drw_21080/cycle/ \
    -e ali_drw_23000/hif_wr/ \
    -e ali_drw_23000/hif_rd/ \
    -e ali_drw_23000/hif_rmw/ \
    -e ali_drw_23000/cycle/ \
    -e ali_drw_23080/hif_wr/ \
    -e ali_drw_23080/hif_rd/ \
    -e ali_drw_23080/hif_rmw/ \
    -e ali_drw_23080/cycle/ \
    -e ali_drw_25000/hif_wr/ \
    -e ali_drw_25000/hif_rd/ \
    -e ali_drw_25000/hif_rmw/ \
    -e ali_drw_25000/cycle/ \
    -e ali_drw_25080/hif_wr/ \
    -e ali_drw_25080/hif_rd/ \
    -e ali_drw_25080/hif_rmw/ \
    -e ali_drw_25080/cycle/ \
    -e ali_drw_27000/hif_wr/ \
    -e ali_drw_27000/hif_rd/ \
    -e ali_drw_27000/hif_rmw/ \
    -e ali_drw_27000/cycle/ \
    -e ali_drw_27080/hif_wr/ \
    -e ali_drw_27080/hif_rd/ \
    -e ali_drw_27080/hif_rmw/ \
    -e ali_drw_27080/cycle/ -- sleep 10
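The event list above can also be generated instead of typed by hand. The
following is a minimal sketch, not part of the driver: drw_events is a
hypothetical helper, and the script assumes the PMU devices appear under
/sys/bus/event_source/devices with the same event names as in the command
above.

```shell
#!/bin/sh
# Hypothetical helper: print the "-e <pmu>/<event>/" arguments for each
# Driveway PMU name given on the command line. The four event names are the
# ones used in the perf stat example above.
drw_events() {
    for name in "$@"; do
        for ev in hif_wr hif_rd hif_rmw cycle; do
            printf '%s %s/%s/ ' -e "$name" "$ev"
        done
    done
}

# Collect the names of all Driveway PMUs registered in sysfs.
devs=""
for dev in /sys/bus/event_source/devices/ali_drw_*; do
    if [ -e "$dev" ]; then
        devs="$devs ${dev##*/}"
    fi
done

if [ -n "$devs" ]; then
    # Word splitting of $devs and of the generated arguments is intentional.
    # shellcheck disable=SC2046,SC2086
    perf stat $(drw_events $devs) -- sleep 10
else
    echo "no ali_drw_* PMU devices found" >&2
fi
```

Generating the arguments from sysfs keeps the command in step with whichever
devices the driver actually registered, including those on die 1.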
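
The average DRAM bandwidth can be calculated as follows:

- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle

Here, DDRC_WIDTH = 64 bytes, perf_hif_rd, perf_hif_wr and perf_hif_rmw are
the event counts reported by perf, DDRC_Cycle is the count of the cycle event,
and DDRC_Freq is the DDRC core clock frequency.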
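As a worked example of the formulas above, the sketch below computes the
bandwidth of one sub-channel in bytes per second. drw_bw is a hypothetical
helper, and the counts and the 1 GHz clock are made-up illustrative values,
not measurements; substitute the counts reported by perf and the real DDRC
core clock frequency of the system.

```shell
#!/bin/sh
# Hypothetical helper: drw_bw COUNT CYCLES FREQ_HZ
# Prints COUNT * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle in bytes per second,
# with DDRC_WIDTH fixed at 64 bytes. CYCLES must be non-zero.
drw_bw() {
    awk -v n="$1" -v cyc="$2" -v f="$3" \
        'BEGIN { printf "%.0f\n", n * 64 * f / cyc }'
}

# Illustrative values only: 1,000,000 hif_rd events counted over
# 2,000,000,000 cycles, with an assumed DDRC clock of 1 GHz.
drw_bw 1000000 2000000000 1000000000
```

For the write bandwidth, pass the sum of the hif_wr and hif_rmw counts as the
first argument. Summing the per-device results gives the total across
sub-channels.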

The current driver does not support sampling, so "perf record" is
unsupported. Attaching to a task is also unsupported because all of the
events are uncore.