### BMC Health Monitor

Author:
  Vijay Khemka <vijaykhemka@fb.com>; <vijay!>
  Sui Chen <suichen@google.com>

Primary assignee:
  Vijay Khemka <vijaykhemka@fb.com>; <vijay!>

Created:
  2020-05-04

## Problem Description
The problem is to monitor the health of a system with a BMC so that we have a
means of making sure the BMC is working correctly. A user can read the metrics
selected in the configuration at any time, without delay. The set of monitored
metrics may include CPU and memory utilization, uptime, free disk space, I2C
bus stats, and so on. Actions can be taken based on the monitoring data to
correct the BMC’s state.

For this purpose, there may exist a metric producer (the subject of this
document) and a metric consumer (a program that makes use of health monitoring
data, which may run on the BMC or on the host). They perform the following
tasks:

1) Configuration, where the user specifies what to collect and how to collect
   it, thresholds, etc.
2) Metric collection, similar to what the read routine in phosphor-hwmon-readd
   does.
3) Metric staging. Once metrics are collected, they are ready to be read at
   any time in accessible forms such as D-Bus objects or raw files for use
   with consumer programs. Because of this staging step, the consumer does not
   need to poll and wait.
4) Data transfer, where the consumer program obtains the metrics from the BMC
   by in-band or out-of-band methods.
5) The consumer program may take certain actions based on the metrics
   collected.

Among those tasks, 1), 2), and 3) are the producer’s responsibility. 4) is
accomplished by both the producer and the consumer. 5) is up to the consumer.

We realize there is some overlap between sensors and health monitoring in
terms of design rationale and existing infrastructure, so we largely follow
the sensor design rationale.
There are also a few differences between sensors
and metrics:

1) Sensor data originate from hardware, while most metrics may be obtained
   through software. For this reason, metrics may have more in common across
   different kinds of BMCs than sensors do, and we might not need the hardware
   discovery process or build-time, hardware-specific configuration for most
   health metrics.
2) Most sensors are instantaneous readings, while metrics might accumulate
   over time, such as “uptime”. For those metrics, we might want to do
   calculations that do not apply to sensor readings.

As such, the BMC Health Monitoring infrastructure will be an independent
package that presents health monitoring data in the sensor structure as
defined in phosphor-dbus-interfaces, supporting all sensor packages and
allowing metrics to be accessed and processed like sensors.

## Background and References
References:
dbus-monitor

## Requirements

The metric producer should provide:
- A daemon to periodically collect various health metrics and expose them on
  D-Bus
- A D-Bus interface to allow other services, like Redfish and IPMI, to access
  its data
- Capability to configure health monitoring
- Capability to take action as configured when a value crosses its threshold
- Optionally, maintain a certain amount of historical data
- Optionally, log critical / warning messages

The metric consumer may be written in various ways. No matter how the consumer
is implemented, it should be able to obtain the health metrics from the
producer through a set of interfaces.

The metric consumer is not in the scope of this document.

## Proposed Design

The metric producer is a daemon running on the BMC that performs the required
tasks and meets the requirements above.
As described above, it is responsible
for the following tasks:

1) Configuration
2) Metric collection
3) Metric staging

For 1) Configuration, there is a JSON configuration file that specifies the
threshold, the frequency of monitoring in seconds, the window size, and the
actions. For example,

```json
{
  "CPU" : {
    "Frequency" : 1,
    "Window_size": 120,
    "Threshold":
    {
      "Critical":
      {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      },
      "Warning":
      {
        "Value": 80.0,
        "Log": false,
        "Target": "systemd unit file"
      }
    }
  },
  "Memory" : {
    "Frequency" : 1,
    "Window_size": 120,
    "Threshold":
    {
      "Critical":
      {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      }
    }
  }
}
```

- `Frequency`: the interval, in seconds, at which the metric is collected.
- `Window_size`: the number of samples that are averaged, so that a threshold
  is evaluated against sustained usage rather than a momentary spike.
- `Log`: an optional boolean that enables logging an alert. It defaults to
  `true` for `Critical` and `false` for `Warning`.
- `Target`: an optional systemd target unit that is started once the value
  crosses its threshold.

For 2) Metric collection, this will be done by running certain functions
within the daemon, as opposed to launching external programs and shell
scripts. This is due to performance and security considerations.

For 3) Metric staging, the daemon creates a D-Bus service named
"xyz.openbmc_project.HealthMon" with object paths for each component:
"/xyz/openbmc_project/sensors/utilization/cpu",
"/xyz/openbmc_project/sensors/utilization/memory", etc.
This results in the following D-Bus tree structure:

"xyz.openbmc_project.HealthMon":
```
    /xyz/openbmc_project
    └─/xyz/openbmc_project/sensors
      └─/xyz/openbmc_project/sensors/utilization/CPU
      └─/xyz/openbmc_project/sensors/utilization/Memory
```

## Alternatives Considered
We have tried doing health monitoring completely within the IPMI Blob
framework. In comparison, implementing metric collection as a separate daemon
is better for supporting more interfaces.

We have also tried doing the metric collection task by running an external
binary as well as a shell script. It turns out that running a shell script is
too slow, while running an external program might raise security concerns (in
that the third-party program would need to be verified to be safe).

## Impacts
Most of what the Health Monitoring Daemon does is collect metrics and update
D-Bus objects. The impact of the daemon itself should be small.

## Testing
To verify that the daemon is functionally correct, we can monitor the D-Bus
traffic generated by the daemon and the readings on the daemon’s D-Bus
objects.

This can also be tested over IPMI or Redfish using sensor commands, because
some metrics are presented as sensors; for example, CPU and Memory are
presented as utilization sensors.

To verify the performance aspect, we can stress-test the daemon’s D-Bus
interfaces to make sure the interfaces do not cause a high overhead.
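
When writing functional tests, it may help to have a small reference model of the windowed-threshold behavior described under Proposed Design. The following Python sketch is only an illustration and is not part of the daemon. The `MetricWindow` class and its method names are hypothetical. The config uses a `Window_size` of 4 only to keep the example short. It also assumes that a threshold counts as crossed when the window average is strictly greater than `Value`, and that a partially filled window is averaged over the samples seen so far; this document does not specify either detail.

```python
import json
from collections import deque

# Config shaped like the example under Proposed Design.
# Window_size is shrunk to 4 here purely for illustration.
CONFIG = json.loads("""
{
  "CPU": {
    "Frequency": 1,
    "Window_size": 4,
    "Threshold": {
      "Critical": {"Value": 90.0, "Target": "reboot.target"},
      "Warning": {"Value": 80.0}
    }
  }
}
""")

# Defaults for the optional "Log" field, as stated in the design.
LOG_DEFAULTS = {"Critical": True, "Warning": False}


class MetricWindow:
    """Hypothetical model: rolling average over the last Window_size samples."""

    def __init__(self, config):
        self.samples = deque(maxlen=config["Window_size"])
        self.thresholds = config.get("Threshold", {})

    def add(self, value):
        # The deque silently drops the oldest sample once it is full.
        self.samples.append(value)

    def average(self):
        # Assumption: an empty window averages to 0.0.
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def crossed(self):
        """Return (level, log, target) for each threshold the average exceeds."""
        avg = self.average()
        hits = []
        for level, spec in self.thresholds.items():
            if avg > spec["Value"]:  # assumption: strict comparison
                log = spec.get("Log", LOG_DEFAULTS.get(level, False))
                hits.append((level, log, spec.get("Target")))
        return hits


cpu = MetricWindow(CONFIG["CPU"])
for sample in (50.0, 60.0, 100.0, 70.0):
    cpu.add(sample)
print(cpu.average(), cpu.crossed())  # 70.0 []

for sample in (95.0, 95.0, 95.0):
    cpu.add(sample)
print(cpu.average(), cpu.crossed())  # 88.75 [('Warning', False, None)]
```

In the first batch the single 100.0 sample does not trip any threshold because the window average stays at 70.0, which is the spike-smoothing effect that `Window_size` is meant to provide. A test could compare values computed this way against the readings observed on the daemon’s D-Bus objects.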