# BMC Health Monitor

Author:
  Vijay Khemka <vijaykhemka@fb.com>; <vijay!>
  Sui Chen <suichen@google.com>

Primary assignee:
  Vijay Khemka <vijaykhemka@fb.com>; <vijay!>

Created:
  2020-05-04

## Problem Description
The problem is to monitor the health of a system with a BMC, so that we have
some means of making sure the BMC is working correctly. Users can read the
configured metrics at any time, without waiting for them to be collected. The
set of monitored metrics may include CPU and memory utilization, uptime, free
disk space, I2C bus statistics, and so on. Actions can be taken based on the
monitoring data to correct the BMC’s state.

For this purpose, there may exist a metric producer (the subject of this
document) and a metric consumer (a program that makes use of health
monitoring data, which may run on the BMC or on the host). Together they
perform the following tasks:

1) Configuration, where the user specifies what to collect and how, along
   with thresholds, etc.
2) Metric collection, similar to what the read routine in phosphor-hwmon-readd
   does.
3) Metric staging. Once metrics are collected, they are available to be read
   at any time in accessible forms, such as D-Bus objects or raw files, for
   use by consumer programs. Because of this staging step, the consumer does
   not need to poll and wait.
4) Data transfer, where the consumer program obtains the metrics from the BMC
   by in-band or out-of-band methods.
5) The consumer program may take certain actions based on the metrics
   collected.

Among those tasks, 1), 2), and 3) are the producer’s responsibility. 4) is
accomplished by both the producer and the consumer. 5) is up to the consumer.

We realize there is some overlap between sensors and health monitoring in
terms of design rationale and existing infrastructure, so we largely follow
the sensor design rationale. There are also a few differences between sensors
and metrics:

1) Sensor data originate from hardware, while most metrics may be obtained
   through software. For this reason, there may be more commonalities between
   metrics on all kinds of BMCs than between sensors on BMCs, and we might not
   need the hardware discovery process or build-time, hardware-specific
   configuration for most health metrics.
2) Most sensors are instantaneous readings, while metrics might accumulate
   over time, such as “uptime”. For those metrics, we might want to do
   calculations that do not apply to sensor readings.

As such, the BMC Health Monitoring infrastructure will be an independent
package that presents health monitoring data in the sensor structure defined
in phosphor-dbus-interfaces, supporting all sensor packages and allowing
metrics to be accessed and processed like sensors.

## Background and References
References:
- dbus-monitor

## Requirements

The metric producer should provide:
- A daemon to periodically collect various health metrics and expose them on
  D-Bus
- A D-Bus interface to allow other services, like Redfish and IPMI, to access
  its data
- Capability to configure health monitoring
- Capability to take action as configured when a value crosses its threshold
- Optionally, maintain a certain amount of historical data
- Optionally, log critical / warning messages

The metric consumer may be written in various ways. No matter how the
consumer is implemented, it should be able to obtain the health metrics from
the producer through a set of interfaces.

The metric consumer is not in the scope of this document.

## Proposed Design

The metric producer is a daemon running on the BMC that performs the required
tasks and meets the requirements above. As described above, it is responsible
for
1) Configuration
2) Metric collection and
3) Metric staging tasks

For 1) Configuration, there is a JSON configuration file that specifies the
thresholds, the monitoring frequency in seconds, the window size, and the
actions.
For example:

```json
{
  "CPU": {
    "Frequency": 1,
    "Window_size": 120,
    "Threshold": {
      "Critical": {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      },
      "Warning": {
        "Value": 80.0,
        "Log": false,
        "Target": "systemd unit file"
      }
    }
  },
  "Memory": {
    "Frequency": 1,
    "Window_size": 120,
    "Threshold": {
      "Critical": {
        "Value": 90.0,
        "Log": true,
        "Target": "reboot.target"
      }
    }
  }
}
```
- Frequency: the interval, in seconds, at which the metric is collected.
- Window_size: the number of samples that are averaged, so that the reported
  usage reflects sustained load rather than a momentary spike.
- Log: a boolean that enables logging an alert when the threshold is crossed.
  This field is optional; it defaults to 'true' for critical and 'false' for
  warning.
- Target: a systemd target unit that is started once the value crosses the
  threshold. This field is optional.
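
To make the interaction between Window_size and the thresholds concrete, here
is a minimal Python sketch. It is an illustration only and not the daemon's
code: the `MetricWindow` class, its method names, and the shortened config
fragment are hypothetical, and it assumes that a threshold counts as crossed
when the windowed average is strictly greater than "Value" and that averaging
starts before the window is full. The document does not specify either point.

```python
import json
from collections import deque

# Hypothetical config fragment that reuses the field names described above.
# Window_size is shortened to 3 so the behaviour is easy to follow.
CONFIG = json.loads("""
{
  "CPU": {
    "Frequency": 1,
    "Window_size": 3,
    "Threshold": {
      "Critical": {"Value": 90.0, "Log": true, "Target": "reboot.target"},
      "Warning": {"Value": 80.0}
    }
  }
}
""")

# Documented defaults for the optional "Log" field.
DEFAULT_LOG = {"Critical": True, "Warning": False}


class MetricWindow:
    """Keeps the last Window_size samples and reports crossed thresholds."""

    def __init__(self, cfg):
        self.samples = deque(maxlen=cfg["Window_size"])
        self.thresholds = cfg.get("Threshold", {})

    def add(self, value):
        """Record one sample and return the average over the current window."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

    def crossed(self, average):
        """Return (level, log, target) for every threshold the average exceeds.

        Assumption: "crossed" means strictly greater than "Value".
        """
        hits = []
        for level, threshold in self.thresholds.items():
            if average > threshold["Value"]:
                log = threshold.get("Log", DEFAULT_LOG.get(level, False))
                hits.append((level, log, threshold.get("Target")))
        return hits
```

With this sketch, a single sample of 100.0 after two samples of 50.0 averages
to about 66.7 and crosses nothing, while three consecutive samples of 100.0
cross both the warning and the critical threshold.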

For 2) Metric collection, this will be done by running functions within the
daemon, as opposed to launching external programs and shell scripts. This is
due to performance and security considerations.
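
As a rough illustration of in-process collection, the Python sketch below
derives CPU and memory utilization by parsing the text of `/proc/stat` and
`/proc/meminfo` directly instead of spawning a tool. The choice of these two
files, the function names, and the decision to count iowait as idle time are
assumptions made for the example; the document does not state which data
sources the daemon uses.

```python
def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal.
            # Guest time is already included in user time, so it is skipped.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_utilization(prev, curr):
    """Percent of time spent busy between two (busy, total) snapshots."""
    busy = curr[0] - prev[0]
    total = curr[1] - prev[1]
    return 100.0 * busy / total if total > 0 else 0.0


def memory_utilization(meminfo_text):
    """Percent of memory in use, from MemTotal and MemAvailable."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "MemAvailable"):
            info[key] = int(rest.split()[0])  # value is reported in kB
    return 100.0 * (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]


def read_text(path):
    """Read a small pseudo-file such as /proc/stat in one call."""
    with open(path) as f:
        return f.read()
```

Because the counters in `/proc/stat` accumulate since boot, CPU utilization is
computed from the difference between two snapshots, for example
`cpu_utilization(parse_cpu_times(earlier), parse_cpu_times(later))`, where
both arguments come from `read_text("/proc/stat")` taken one collection
interval apart.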

For 3) Metric staging, the daemon creates a D-Bus service named
"xyz.openbmc_project.HealthMon" with an object path for each component:
"/xyz/openbmc_project/sensors/utilization/CPU",
"/xyz/openbmc_project/sensors/utilization/Memory", etc.
This results in the following D-Bus tree structure:

"xyz.openbmc_project.HealthMon":
```
    /xyz/openbmc_project
    └─/xyz/openbmc_project/sensors
      ├─/xyz/openbmc_project/sensors/utilization/CPU
      └─/xyz/openbmc_project/sensors/utilization/Memory
```
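
The staging idea can be sketched in a few lines of Python. This is a
simplified model under stated assumptions, not the daemon's implementation: a
plain dictionary stands in for the D-Bus objects, no D-Bus library is used,
the `StagedMetrics` class and its method names are hypothetical, and the
object path is assumed to be the common prefix followed by the metric's
configuration key.

```python
SERVICE_NAME = "xyz.openbmc_project.HealthMon"
PATH_PREFIX = "/xyz/openbmc_project/sensors/utilization/"


def object_path(metric_name):
    """Map a configuration key such as "CPU" to its object path (assumed rule)."""
    return PATH_PREFIX + metric_name


class StagedMetrics:
    """In-memory stand-in for the D-Bus objects: latest value per object path."""

    def __init__(self, metric_names):
        self._values = {object_path(name): None for name in metric_names}

    def publish(self, metric_name, value):
        """Called by the collector after each sample; overwrites the old value."""
        self._values[object_path(metric_name)] = value

    def read(self, path):
        """Called for a consumer; returns the staged value without collecting."""
        return self._values[path]

    def paths(self):
        """List the object paths exposed by the service."""
        return sorted(self._values)
```

The point of the split between `publish` and `read` is that a consumer's read
only looks up the most recent value, so it never has to wait for a collection
cycle.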

## Alternatives Considered
We have tried doing health monitoring completely within the IPMI Blob
framework. In comparison, having metric collection in a separate daemon is
better for supporting more interfaces.

We have also tried doing the metric collection task by running an external
binary as well as a shell script. Running a shell script turned out to be too
slow, while running an external program raises security concerns (the
third-party program would need to be verified to be safe).

## Impacts
Most of what the Health Monitoring Daemon does is collect metrics and update
D-Bus objects. The impact of the daemon itself should be small.

## Testing
To verify that the daemon is functionally correct, we can monitor the D-Bus
traffic generated by the daemon and the readings on the daemon’s D-Bus
objects.

This can also be tested over IPMI/Redfish using sensor commands, because some
of the metrics, such as CPU and memory utilization, are presented as
utilization sensors.

To verify the performance aspect, we can stress-test the daemon’s D-Bus
interfaces to make sure the interfaces do not cause high overhead.
