# BMC Health Monitor

Author: Vijay Khemka <vijaykhemka@fb.com>; <vijay!> Sui Chen
<suichen@google.com>

Created: 2020-05-04

## Problem Description

The goal is to monitor the health of a system with a BMC, so that we have a
means of making sure the BMC is working correctly. Users can obtain the required
metrics on demand, according to the configuration. The set of monitored metrics
may include CPU and memory utilization, uptime, free disk space, I2C bus
statistics, and so on. Actions can be taken based on the monitoring data to
correct the BMC’s state.

For this purpose, there may exist a metric producer (the subject of this
document) and a metric consumer (a program that makes use of health monitoring
data, which may run on the BMC or on the host). They perform the following
tasks:

1. Configuration, where the user specifies what to collect and how, along with
   thresholds, etc.
2. Metric collection, similar to what the read routine in phosphor-hwmon-readd
   does.
3. Metric staging. Once metrics are collected, they are ready to be read at any
   time in accessible forms such as DBus objects or raw files for use by
   consumer programs. Because of this staging step, the consumer does not need
   to poll and wait.
4. Data transfer, where the consumer program obtains the metrics from the BMC by
   in-band or out-of-band methods.
5. The consumer program may take certain actions based on the metrics collected.

Among those tasks, 1), 2), and 3) are the producer’s responsibility. 4) is
accomplished by both the producer and consumer. 5) is up to the consumer.

We realize there is some overlap between sensors and health monitoring in terms
of design rationale and existing infrastructure, so we largely follow the sensor
design rationale. There are also a few differences between sensors and metrics:

1. Sensor data originate from hardware, while most metrics may be obtained
   through software. For this reason, there may be more commonalities between
   metrics on all kinds of BMCs than sensors on BMCs, and we might not need the
   hardware discovery process or build-time, hardware-specific configuration for
   most health metrics.
2. Most sensors are instantaneous readings, while metrics might accumulate over
   time, such as “uptime”. For those metrics, we might want to do calculations
   that do not apply to sensor readings.
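
As an illustration of the second point, a cumulative metric usually has to be
turned into a rate by differencing two samples, a step that has no counterpart
for an instantaneous sensor reading. The sketch below assumes Linux-style
cumulative CPU tick counters (such as those on the `cpu` line of `/proc/stat`).
The function name and the sample values are hypothetical and are not part of
this design.

```python
def cpu_utilization(prev, curr):
    """Return CPU utilization in percent between two cumulative samples.

    Each sample is a (busy_ticks, total_ticks) pair of counters that only
    ever increase, so utilization is the ratio of their differences.
    """
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        # No time elapsed (or a counter reset): no meaningful reading.
        return 0.0
    return 100.0 * busy_delta / total_delta


# Hypothetical samples taken one collection interval apart.
earlier = (1200, 10000)
later = (1290, 10100)
print(cpu_utilization(earlier, later))
```

A single sample of such a counter says little on its own; only the difference
between two samples describes the load during the interval.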

As such, the BMC Health Monitoring infrastructure will be an independent package
that presents health monitoring data in the sensor structure defined in
phosphor-dbus-interfaces, supporting all sensor packages and allowing metrics to
be accessed and processed like sensors.

## Background and References

References: dbus-monitor

## Requirements

The metric producer should provide:

- A daemon to periodically collect various health metrics and expose them on
  DBus
- A DBus interface to allow other services, such as Redfish and IPMI, to access
  its data
- Capability to configure health monitoring
- Capability to take action as configured when a value crosses its threshold
- Optionally, maintain a certain amount of historical data
- Optionally, log critical / warning messages

The metric consumer may be written in various ways. No matter how the consumer
is written, it should be able to obtain the health metrics from the producer
through a set of interfaces.

The metric consumer is not in the scope of this document.

## Proposed Design

The metric producer is a daemon running on the BMC that performs the required
tasks and meets the requirements above. As described above, it is responsible
for:

1. Configuration
2. Metric collection
3. Metric staging

For 1) Configuration, there is a JSON configuration file that specifies the
thresholds, the monitoring frequency in seconds, the window size, and the
actions. For example:

```json
{
  "CPU": {
    "Frequency": 1,
    "Window_size": 120,
    "Threshold": {
      "Critical": {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      },
      "Warning": {
        "Value": 80.0,
        "Log": false,
        "Target": "systemd unit file"
      }
    }
  },
  "Memory": {
    "Frequency": 1,
    "Window_size": 120,
    "Threshold": {
      "Critical": {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      }
    }
  }
}
```

- Frequency: The interval, in seconds, at which the data are collected.
- Window_size: The number of samples that are averaged, so that the reported
  usage reflects sustained load rather than a momentary spike.
- Log: A boolean value that enables logging of an alert. This field is optional;
  its default value is 'true' for critical and 'false' for warning.
- Target: A systemd target unit file that will be called once the value crosses
  its threshold. This field is optional.
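
The interaction of these fields can be sketched as follows. This is a minimal
illustration, not the daemon's implementation: the names `parse_thresholds` and
`WindowedMetric` are hypothetical, and "crossing" a threshold is assumed here to
mean that the windowed average is strictly greater than `Value`.

```python
import json
from collections import deque

# Defaults for the optional "Log" field, as described above.
DEFAULT_LOG = {"Critical": True, "Warning": False}


def parse_thresholds(metric_config):
    """Return {level: (value, log, target)} with optional fields filled in."""
    thresholds = {}
    for level, entry in metric_config.get("Threshold", {}).items():
        log = entry.get("Log", DEFAULT_LOG.get(level, False))
        target = entry.get("Target")  # optional; None means no action
        thresholds[level] = (float(entry["Value"]), log, target)
    return thresholds


class WindowedMetric:
    """Averages the last Window_size samples and checks thresholds."""

    def __init__(self, metric_config):
        # A bounded deque drops the oldest sample once the window is full.
        self.samples = deque(maxlen=metric_config["Window_size"])
        self.thresholds = parse_thresholds(metric_config)

    def add_sample(self, value):
        self.samples.append(value)

    def average(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def crossed_levels(self):
        """Return the levels whose Value the windowed average exceeds."""
        avg = self.average()
        return [
            level
            for level, (value, _log, _target) in self.thresholds.items()
            if avg > value  # assumption: strictly greater counts as crossing
        ]


# Hypothetical configuration with a small window for demonstration.
config = json.loads("""
{"CPU": {"Frequency": 1, "Window_size": 3,
         "Threshold": {"Critical": {"Value": 90.0, "Target": "reboot.target"},
                       "Warning": {"Value": 80.0}}}}
""")
cpu = WindowedMetric(config["CPU"])
for reading in (70.0, 99.0, 85.0):
    cpu.add_sample(reading)
print(cpu.crossed_levels())
```

With these readings the single spike to 99.0 does not trigger the critical
threshold, because the average over the window stays below 90.0.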

For 2) Metric collection, this will be done by running certain functions within
the daemon, as opposed to launching external programs and shell scripts. This is
due to performance and security considerations.
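
A minimal sketch of what in-process collection might look like, assuming a Linux
BMC where memory statistics are exposed through `/proc/meminfo` and assuming
that memory utilization is derived from the `MemTotal` and `MemAvailable`
fields. The function names are hypothetical; the point is only that the data are
read and parsed directly, without spawning an external program or shell script.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_utilization(text):
    """Return used memory as a percentage of MemTotal."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def read_memory_utilization(path="/proc/meminfo"):
    """Read the file directly; no external process is launched."""
    with open(path) as f:
        return memory_utilization(f.read())


# Hypothetical sample input in /proc/meminfo format.
sample = """MemTotal:         500000 kB
MemFree:          100000 kB
MemAvailable:     125000 kB
"""
print(memory_utilization(sample))
```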

For 3) Metric staging, the daemon creates a DBus service named
"xyz.openbmc_project.HealthMon" with an object path for each component:
"/xyz/openbmc_project/sensors/utilization/cpu",
"/xyz/openbmc_project/sensors/utilization/memory", etc. This results in the
following DBus tree structure:

"xyz.openbmc_project.HealthMon":

```
    /xyz/openbmc_project
    └─/xyz/openbmc_project/sensors
      └─/xyz/openbmc_project/sensors/utilization/CPU
      └─/xyz/openbmc_project/sensors/utilization/Memory
```

## Alternatives Considered

We have tried doing health monitoring completely within the IPMI Blob framework.
In comparison, having the metric collection part as a separate daemon is better
for supporting more interfaces.

We have also tried doing the metric collection task by running an external
binary as well as a shell script. It turns out that running a shell script is
too slow, while running an external program might raise security concerns (in
that the third-party program would need to be verified to be safe).

## Impacts

Most of what the Health Monitoring Daemon does is collect metrics and update
DBus objects. The impact of the daemon itself should be small.

## Testing

To verify that the daemon is functionally correct, we can monitor the DBus
traffic generated by the daemon and the readings on the daemon’s DBus objects.

This can also be tested over IPMI/Redfish using sensor commands, because some of
the metrics, such as CPU and memory, are presented as utilization sensors.

To verify the performance aspect, we can stress-test the daemon’s DBus
interfaces to make sure the interfaces do not cause a high overhead.