=============================================================
Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU)
=============================================================

The Yitian 710, custom-built by Alibaba Group's chip development business,
T-Head, implements an uncore PMU for performance and functional debugging to
facilitate system maintenance.

DDR Sub-System Driveway (DRW) PMU Driver
=========================================

Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel
services system memory requests independently of the others, and each DDR5
channel is split into two independent sub-channels. The DDR Sub-System Driveway
implements a separate PMU for each sub-channel to monitor various performance
metrics.

In perf, the Driveway PMU devices are named ali_drw_<sys_base_addr>.
For example, ali_drw_21000 and ali_drw_21080 are the two PMU devices for the
two sub-channels of the same channel in die 0. The PMU devices of die 1 are
named ali_drw_400XXXXX, e.g. ali_drw_40021000.

Each sub-channel has 36 PMU counters in total, which are classified into
four groups:

- Group 0: PMU Cycle Counter. This group has one pair of counters,
  pmu_cycle_cnt_low and pmu_cycle_cnt_high, which is used as the cycle count
  based on the DDRC core clock.

- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used
  to count the total number of accesses to either the eight bank groups in a
  selected rank, or four ranks separately in the first 4 counters. The base
  transfer unit is 64B.

- Group 2: PMU Retry Counters. This group has 10 counters that are intended
  to count the total number of retries for each type of uncorrectable error.

- Group 3: PMU Common Counters. This group has 16 counters that are used
  to count the common events.

For now, the Driveway PMU driver only uses counters in group 0 and group 3.

The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution
for connecting an SoC application bus to DDR memory devices. The DDRCTL
receives transactions through the Host Interface (HIF), which is custom-defined
by Synopsys. These transactions are queued internally and scheduled for access
while satisfying the SDRAM protocol timing requirements, transaction
priorities, and dependencies between the transactions. The DDRCTL in turn
issues commands on the DDR PHY Interface (DFI) to the PHY module, which
launches and captures data to and from the SDRAM. The Driveway PMUs have
hardware logic to gather statistics and performance logging signals on HIF,
DFI, etc.

By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF
interface, we can calculate the bandwidth. Example usage of counting memory
data bandwidth::

  perf stat \
    -e ali_drw_21000/hif_wr/ \
    -e ali_drw_21000/hif_rd/ \
    -e ali_drw_21000/hif_rmw/ \
    -e ali_drw_21000/cycle/ \
    -e ali_drw_21080/hif_wr/ \
    -e ali_drw_21080/hif_rd/ \
    -e ali_drw_21080/hif_rmw/ \
    -e ali_drw_21080/cycle/ \
    -e ali_drw_23000/hif_wr/ \
    -e ali_drw_23000/hif_rd/ \
    -e ali_drw_23000/hif_rmw/ \
    -e ali_drw_23000/cycle/ \
    -e ali_drw_23080/hif_wr/ \
    -e ali_drw_23080/hif_rd/ \
    -e ali_drw_23080/hif_rmw/ \
    -e ali_drw_23080/cycle/ \
    -e ali_drw_25000/hif_wr/ \
    -e ali_drw_25000/hif_rd/ \
    -e ali_drw_25000/hif_rmw/ \
    -e ali_drw_25000/cycle/ \
    -e ali_drw_25080/hif_wr/ \
    -e ali_drw_25080/hif_rd/ \
    -e ali_drw_25080/hif_rmw/ \
    -e ali_drw_25080/cycle/ \
    -e ali_drw_27000/hif_wr/ \
    -e ali_drw_27000/hif_rd/ \
    -e ali_drw_27000/hif_rmw/ \
    -e ali_drw_27000/cycle/ \
    -e ali_drw_27080/hif_wr/ \
    -e ali_drw_27080/hif_rd/ \
    -e ali_drw_27080/hif_rmw/ \
    -e ali_drw_27080/cycle/ -- sleep 10

The average DRAM bandwidth can be calculated as follows:

- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle

Here, DDRC_WIDTH = 64 bytes.

The current driver does not support sampling, so "perf record" is
unsupported. Attaching to a task is also unsupported because all the events
are uncore.
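
As an illustration of the bandwidth formulas above, the following is a minimal
Python sketch, not part of the driver or of perf. It assumes that DDRC_Cycle is
the value reported by the ``cycle`` event of the same PMU device and that
DDRC_Freq is the DDRC core clock frequency in Hz, so the result is in bytes per
second. The function name ``drw_bandwidth`` and the counter values are
hypothetical.

```python
DDRC_WIDTH = 64  # bytes per HIF command, as stated in the text


def drw_bandwidth(hif_rd, hif_wr, hif_rmw, cycle, ddrc_freq_hz):
    """Return (read, write) bandwidth in bytes/second for one sub-channel.

    Assumptions (not confirmed by the source): `cycle` is the count from
    the PMU's cycle event, and `ddrc_freq_hz` is the DDRC core clock in Hz.
    """
    if cycle <= 0:
        raise ValueError("cycle count must be positive")
    # cycle / ddrc_freq_hz is the elapsed time, so multiplying the byte
    # count by ddrc_freq_hz / cycle yields bytes per second.
    scale = DDRC_WIDTH * ddrc_freq_hz / cycle
    read_bw = hif_rd * scale
    write_bw = (hif_wr + hif_rmw) * scale
    return read_bw, write_bw


# Hypothetical counter values, not taken from real hardware.
read_bw, write_bw = drw_bandwidth(
    hif_rd=1_000_000,
    hif_wr=400_000,
    hif_rmw=100_000,
    cycle=2_000_000_000,
    ddrc_freq_hz=1_000_000_000,
)
print(f"read:  {read_bw / 1e6:.1f} MB/s")
print(f"write: {write_bw / 1e6:.1f} MB/s")
```

The sketch handles a single PMU device. To estimate the total bandwidth,
apply it to the counts of each ali_drw_* device and sum the results.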