=============================================================
Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU)
=============================================================

The Yitian 710, custom-built by Alibaba Group's chip development business,
T-Head, implements an uncore PMU for performance and functional debugging to
facilitate system maintenance.

DDR Sub-System Driveway (DRW) PMU Driver
=========================================

Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel
services system memory requests independently of the others, and each DDR5
channel is split into two independent sub-channels. The DDR Sub-System Driveway
implements a separate PMU for each sub-channel to monitor various performance
metrics.

The Driveway PMU devices are exposed to perf as ali_drw_<sys_base_addr>.
For example, ali_drw_21000 and ali_drw_21080 are the two PMU devices for the
two sub-channels of the same channel in die 0. PMU devices of die 1 are
prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000.

Each sub-channel has 36 PMU counters in total, which are classified into
four groups:

- Group 0: PMU Cycle Counter. This group has one pair of counters,
  pmu_cycle_cnt_low and pmu_cycle_cnt_high, that is used as the cycle count
  based on the DDRC core clock.

- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used
  to count the total access number of either the eight bank groups in a
  selected rank, or four ranks separately in the first 4 counters. The base
  transfer unit is 64B.

- Group 2: PMU Retry Counters. This group has 10 counters that are intended
  to count the total retry number of each type of uncorrectable error.

- Group 3: PMU Common Counters. This group has 16 counters that are used
  to count the common events.

For now, the Driveway PMU driver only uses counters in group 0 and group 3.

The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution
for connecting an SoC application bus to DDR memory devices. The DDRCTL
receives transactions through the Host Interface (HIF), which is custom-defined
by Synopsys. These transactions are queued internally and scheduled for access
while satisfying the SDRAM protocol timing requirements, transaction
priorities, and dependencies between the transactions. The DDRCTL in turn
issues commands on the DDR PHY Interface (DFI) to the PHY module, which
launches and captures data to and from the SDRAM. The Driveway PMUs have
hardware logic to gather statistics and performance logging signals on HIF,
DFI, etc.

The bandwidth can be calculated by counting the READ, WRITE and RMW commands
sent to the DDRC through the HIF interface. Example usage of counting memory
data bandwidth::

  perf stat \
    -e ali_drw_21000/hif_wr/ \
    -e ali_drw_21000/hif_rd/ \
    -e ali_drw_21000/hif_rmw/ \
    -e ali_drw_21000/cycle/ \
    -e ali_drw_21080/hif_wr/ \
    -e ali_drw_21080/hif_rd/ \
    -e ali_drw_21080/hif_rmw/ \
    -e ali_drw_21080/cycle/ \
    -e ali_drw_23000/hif_wr/ \
    -e ali_drw_23000/hif_rd/ \
    -e ali_drw_23000/hif_rmw/ \
    -e ali_drw_23000/cycle/ \
    -e ali_drw_23080/hif_wr/ \
    -e ali_drw_23080/hif_rd/ \
    -e ali_drw_23080/hif_rmw/ \
    -e ali_drw_23080/cycle/ \
    -e ali_drw_25000/hif_wr/ \
    -e ali_drw_25000/hif_rd/ \
    -e ali_drw_25000/hif_rmw/ \
    -e ali_drw_25000/cycle/ \
    -e ali_drw_25080/hif_wr/ \
    -e ali_drw_25080/hif_rd/ \
    -e ali_drw_25080/hif_rmw/ \
    -e ali_drw_25080/cycle/ \
    -e ali_drw_27000/hif_wr/ \
    -e ali_drw_27000/hif_rd/ \
    -e ali_drw_27000/hif_rmw/ \
    -e ali_drw_27000/cycle/ \
    -e ali_drw_27080/hif_wr/ \
    -e ali_drw_27080/hif_rd/ \
    -e ali_drw_27080/hif_rmw/ \
    -e ali_drw_27080/cycle/ -- sleep 10

Example usage of counting all memory read/write bandwidth by metric::

  perf stat -M ddr_read_bandwidth.all -- sleep 10
  perf stat -M ddr_write_bandwidth.all -- sleep 10

The average DRAM bandwidth can be calculated as follows:

- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle

Here, DDRC_WIDTH = 64 bytes.

The current driver does not support sampling, so "perf record" is
unsupported. Attaching to a task is also unsupported because all the events
are uncore.
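
As a rough illustration of the event list and the bandwidth formulas above,
the following shell sketch may help. The function names drw_events and
drw_bandwidth are hypothetical and are not part of perf or the driver. The
sketch assumes that DDRC_Cycle is the count reported by the cycle event and
that DDRC_Freq is the DDRC core clock frequency in Hz, which the user has to
supply because it is not stated in this document.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of perf or the driver.

# Print a comma-separated event list (hif_wr, hif_rd, hif_rmw, cycle)
# for each Driveway PMU base address passed as an argument.
drw_events() {
    list=""
    for base in "$@"; do
        for ev in hif_wr hif_rd hif_rmw cycle; do
            # Prepend a comma only when the list is already non-empty.
            list="${list:+$list,}ali_drw_${base}/${ev}/"
        done
    done
    printf '%s\n' "$list"
}

# Print the average read and write bandwidth in bytes per second.
# Arguments: hif_rd hif_wr hif_rmw cycle ddrc_freq_hz
# ddrc_freq_hz is an assumption supplied by the caller.
drw_bandwidth() {
    awk -v rd="$1" -v wr="$2" -v rmw="$3" -v cyc="$4" -v freq="$5" 'BEGIN {
        width = 64    # DDRC_WIDTH in bytes
        if (cyc <= 0) {
            print "cycle count must be positive" > "/dev/stderr"
            exit 1
        }
        printf "read_Bps=%.0f write_Bps=%.0f\n",
               rd * width * freq / cyc, (wr + rmw) * width * freq / cyc
    }'
}
```

With these helpers, the long command above could be shortened to something
like perf stat -e "$(drw_events 21000 21080)" -- sleep 10, and the counts
printed for one sub-channel could then be passed to drw_bandwidth together
with the assumed clock frequency. Since DDRC_Cycle / DDRC_Freq is the elapsed
time in seconds, the result is an average over the whole measurement interval.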