### BMC Health Monitor

Author:
  Vijay Khemka <vijaykhemka@fb.com>; <vijay!>
  Sui Chen <suichen@google.com>

Created:
  2020-05-04

## Problem Description
The problem is to monitor the health of a system with a BMC so that we have a
means of making sure the BMC is working correctly. Users can get the required
metrics, as configured, instantly. The set of monitored metrics may include CPU
and memory utilization, uptime, free disk space, I2C bus statistics, and so on.
Actions can be taken based on the monitoring data to correct the BMC’s state.

For this purpose, there may exist a metric producer (the subject of this
document) and a metric consumer (a program that makes use of health
monitoring data, which may run on the BMC or on the host). They perform the
following tasks:

1) Configuration, where the user specifies what to collect and how,
   thresholds, etc.
2) Metric collection, similar to what the read routine in phosphor-hwmon-readd
   does.
3) Metric staging. Once metrics are collected, they are ready to be read at
   any time in accessible forms, such as D-Bus objects or raw files, for use
   by consumer programs. Because of this staging step, the consumer does not
   need to poll and wait.
4) Data transfer, where the consumer program obtains the metrics from the BMC
   by in-band or out-of-band methods.
5) The consumer program may take certain actions based on the metrics
   collected.

Among those tasks, 1), 2), and 3) are the producer’s responsibility. 4) is
accomplished by both the producer and consumer. 5) is up to the consumer.

We realize there is some overlap between sensors and health monitoring in
terms of design rationale and existing infrastructure, so we largely follow
the sensor design rationale. There are also a few differences between sensors
and metrics:

1) Sensor data originate from hardware, while most metrics may be obtained
   through software.
   For this reason, metrics may have more in common across all kinds of BMCs
   than sensors do, and we might not need the hardware discovery process or
   build-time, hardware-specific configuration for most health metrics.
2) Most sensors are instantaneous readings, while metrics might accumulate
   over time, such as “uptime”. For those metrics, we might want to do
   calculations that do not apply to sensor readings.

As such, the BMC Health Monitoring infrastructure will be an independent
package that presents health monitoring data in the sensor structure defined
in phosphor-dbus-interfaces, supporting all sensor packages and allowing
metrics to be accessed and processed like sensors.

## Background and References
References:
dbus-monitor

## Requirements

The metric producer should provide:
- A daemon to periodically collect various health metrics and expose them on
  D-Bus
- A D-Bus interface to allow other services, like Redfish and IPMI, to access
  its data
- Capability to configure health monitoring
- Capability to take action as configured when a value crosses its threshold
- Optionally, maintain a certain amount of historical data
- Optionally, log critical / warning messages

The metric consumer may be written in various ways. No matter how the
consumer is implemented, it should be able to obtain the health metrics from
the producer through a set of interfaces.

The metric consumer is not in the scope of this document.

## Proposed Design

The metric producer is a daemon running on the BMC that performs the required
tasks and meets the requirements above. As described above, it is responsible
for the following tasks:

1) Configuration
2) Metric collection
3) Metric staging

For 1) Configuration, there is a JSON configuration file that specifies the
thresholds, the monitoring frequency in seconds, the window size, and the
actions.
For example,

```json
{
  "CPU" : {
    "Frequency" : 1,
    "Window_size": 120,
    "Threshold":
    {
      "Critical":
      {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      },
      "Warning":
      {
        "Value": 80.0,
        "Log": false,
        "Target": "systemd unit file"
      }
    }
  },
  "Memory" : {
    "Frequency" : 1,
    "Window_size": 120,
    "Threshold":
    {
      "Critical":
      {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      }
    }
  }
}
```

- `Frequency`: The interval, in seconds, at which the data is collected.
- `Window_size`: The number of samples that are averaged to compute system
  usage, rather than reacting to a momentary spike in the usage data.
- `Log`: An optional boolean that enables logging an alert. It defaults to
  `true` for critical thresholds and `false` for warning thresholds.
- `Target`: An optional systemd target unit that is started once the value
  crosses its threshold.

For 2) Metric collection, this will be done by running certain functions
within the daemon, as opposed to launching external programs and shell
scripts. This is due to performance and security considerations.

For 3) Metric staging, the daemon creates a D-Bus service named
"xyz.openbmc_project.HealthMon" with an object path for each component:
"/xyz/openbmc_project/sensors/utilization/cpu",
"/xyz/openbmc_project/sensors/utilization/memory", etc.,
which results in the following D-Bus tree structure:

"xyz.openbmc_project.HealthMon":
```
    /xyz/openbmc_project
    └─/xyz/openbmc_project/sensors
      ├─/xyz/openbmc_project/sensors/utilization/CPU
      └─/xyz/openbmc_project/sensors/utilization/Memory
```

## Alternatives Considered
We have tried doing health monitoring completely within the IPMI Blob
framework.
In comparison, having the metric collection part in a separate daemon
is better for supporting more interfaces.

We have also tried doing the metric collection task by running an external
binary as well as a shell script. It turns out that running a shell script is
too slow, while running an external program might raise security concerns (in
that the third-party program would need to be verified to be safe).

## Impacts
Most of what the Health Monitoring Daemon does is collect metrics and
update D-Bus objects. The impact of the daemon itself should be small.

## Testing
To verify that the daemon works correctly, we can monitor the D-Bus traffic
generated by the daemon and the readings on the daemon’s D-Bus objects.

This can also be tested over IPMI/Redfish using sensor commands, because some
metrics are presented as sensors; for example, CPU and memory are presented as
utilization sensors.

To verify the performance aspect, we can stress-test the daemon’s D-Bus
interfaces to make sure they do not cause high overhead.
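
The windowed averaging and threshold check described under Proposed Design can also be exercised in isolation, without D-Bus. The following is a minimal sketch of that logic, not the daemon's implementation. The class `WindowedMetric` and its methods are hypothetical names introduced for illustration. It assumes that "crossing" a threshold means the window average is greater than or equal to the configured `Value`, which this document does not specify.

```python
from collections import deque


class WindowedMetric:
    """Hypothetical helper: averages the most recent `window_size` samples."""

    def __init__(self, window_size, thresholds):
        # thresholds maps a level name ("Critical", "Warning") to its "Value".
        self.samples = deque(maxlen=window_size)
        self.thresholds = thresholds

    def add_sample(self, value):
        # The deque silently drops the oldest sample once the window is full.
        self.samples.append(value)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def crossed(self):
        """Return the sorted level names whose value the average has reached.

        Assumption: "crossing" means average >= threshold value.
        """
        avg = self.average()
        return sorted(
            level for level, value in self.thresholds.items() if avg >= value
        )


cpu = WindowedMetric(window_size=4,
                     thresholds={"Critical": 90.0, "Warning": 80.0})

# A single spike to 100% is averaged away (window average is 40.0).
for reading in (10.0, 20.0, 100.0, 30.0):
    cpu.add_sample(reading)
spike_result = cpu.crossed()

# Sustained high usage replaces the old samples (window average is 87.5).
for reading in (95.0, 85.0, 90.0, 80.0):
    cpu.add_sample(reading)
sustained_result = cpu.crossed()
```

In this sketch the spike produces no threshold event, while the sustained load reaches only the warning level. That is the behavior `Window_size` is meant to provide.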