# Monitoring and Logging OpenBMC Systemd Target and Service Failures

Author: Andrew Geissler
  < geissonator >

Created: June 18, 2019
Last Updated: Feb 21, 2022

## Problem Description

OpenBMC uses systemd to coordinate the starting and stopping of its
applications. For example, systemd is used to start all of OpenBMC's initial
services, to power on and off the servers it manages, and to perform operations
like firmware updates.

[openbmc-systemd.md][1] has a good summary of systemd and the basics of how
it is used within OpenBMC.

At a high level, systemd is composed of units. For OpenBMC, the two key
unit types are target and service.

There are situations where OpenBMC systemd targets fail. They fail because a
target or service within them hits some sort of error condition. In some
cases, the unit which caused the failure will log an error to
phosphor-logging, but in other situations (segfault, unhandled exception,
unhandled error path, ...) there will be no indication to the end user that
something failed. It is critical that if a systemd target fails within OpenBMC,
the user be notified.

There are also scenarios where a system has successfully started all targets
but a running service within one of those targets fails. Some services are not
all that critical, but something like fan-control or a power monitoring service
could be very critical. At a minimum, the user of the system must be informed
that a critical service has failed. The solution proposed in this document does
not preclude service owners from doing other recovery; it just ensures that a
bare minimum of reporting is done when a failure occurs.

## Background and References

See the [phosphor-state-manager][2] repository for background information on
state management and systemd within OpenBMC.

systemd emits signals when unit jobs complete, and each signal carries the
result of the completed job. See the JobNew()/JobRemoved() section in the
[systemd dbus API][3]. The six different results are:
```
done, canceled, timeout, failed, dependency, skipped
```

phosphor-state-manager code already monitors for these signals but only looks
for a result of `done` to know when to mark the corresponding D-Bus object
as Ready (BMC), On (chassis), or Running (host).

The proposal within this document is to monitor for the other results and
log an appropriate error to phosphor-logging.

A systemd unit that is a service will only enter a `failed` state after
all systemd-defined retries have been executed. For OpenBMC systems, that
involves 2 restarts within a 30 second window.

## Requirements

### Systemd Target Units

- Must be able to monitor any arbitrary systemd target and log a defined error
  based on the type of failure the target encountered
- Must be configurable
  - Target: Choose any systemd target
  - Status: Choose one or more statuses to monitor for
  - Error to log: Specify the error to log when a monitored status is detected
- Must be able to take multiple configuration files as input on startup
- Support a `default` for errors to monitor for
  - This will be `timeout`, `failed`, and `dependency`
- Error will always be the configured one with additional data indicating the
  status that failed (e.g. canceled, timeout, ...)
  - Example:

    ```
    xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure
    STATUS=timeout
    ```
- By default, enable this monitor and logging service for all critical OpenBMC
  targets
  - Critical systemd targets are ones used by phosphor-state-manager
    - BMC: multi-user.target
    - Chassis: obmc-chassis-poweron@0.target, obmc-chassis-poweroff@0.target
    - Host: obmc-host-start@0.target, obmc-host-startmin@0.target,
      obmc-host-shutdown@0.target, obmc-host-stop@0.target,
      obmc-host-reboot@0.target
  - The errors for these must be defined in phosphor-dbus-interfaces
- Limitations:
  - Fully qualified target name must be input (i.e. no templated / wild card
    target support)
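
The `default` keyword and the extra status data described above can be
sketched roughly as follows. This is only an illustration in Python, not the
phosphor-state-manager implementation; the function names `expand_results` and
`build_error` and the dictionary layout are hypothetical.

```python
# Illustrative sketch only; all names here are hypothetical.
DEFAULT_RESULTS = frozenset({"timeout", "failed", "dependency"})
VALID_RESULTS = frozenset(
    {"done", "canceled", "timeout", "failed", "dependency", "skipped"}
)


def expand_results(configured):
    """Return the set of results to monitor, expanding the "default" keyword."""
    expanded = set()
    for name in configured:
        if name == "default":
            expanded |= DEFAULT_RESULTS
        elif name in VALID_RESULTS:
            expanded.add(name)
        else:
            raise ValueError(f"unknown systemd job result: {name}")
    return expanded


def build_error(error_to_log, result):
    """Pair the configured error with the failing status as additional data."""
    return {"error": error_to_log, "additional_data": {"STATUS": result}}


monitored = expand_results(["default"])
entry = build_error(
    "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure", "timeout"
)
print(sorted(monitored))
print(entry)
```

Rejecting unknown result names at load time is a design choice made for this
sketch; it surfaces configuration typos early instead of silently never
matching.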

### Systemd Service Units

- Must be able to monitor any arbitrary systemd service within OpenBMC
  - Service(s) to monitor must be configurable
- Log an error indicating a service has failed (with service name in log)
  - `xyz.openbmc_project.State.Error.CriticalServiceFailure`
- Collect a BMC dump
- Change the BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Report the changed state externally via the Redfish managers/bmc state

## Proposed Design

Create a new standalone application in phosphor-state-manager which will load
JSON file(s) on startup.

The JSON file(s) would have the following format for targets:
```
{
    "targets" : [
        {
            "name": "multi-user.target",
            "errorsToMonitor": ["default"],
            "errorToLog": "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure"
        },
        {
            "name": "obmc-chassis-poweron@0.target",
            "errorsToMonitor": ["timeout", "failed"],
            "errorToLog": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure"
        }
    ]
}
```
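
As a rough illustration of loading several such files on startup, the Python
sketch below merges the `targets` arrays of multiple JSON files into one lookup
table keyed by target name. The function name `load_target_configs`, the file
names, and the rule that a later file overrides an earlier entry for the same
target are assumptions made for this example, not part of the design.

```python
import json
import tempfile
from pathlib import Path


def load_target_configs(paths):
    """Merge the "targets" arrays of several JSON files into one dict."""
    targets = {}
    for path in paths:
        data = json.loads(Path(path).read_text())
        for entry in data.get("targets", []):
            # Assumption: a later file overrides an earlier one for the same target.
            targets[entry["name"]] = {
                "errorsToMonitor": list(entry["errorsToMonitor"]),
                "errorToLog": entry["errorToLog"],
            }
    return targets


# Usage: write two small config files, then load them together.
with tempfile.TemporaryDirectory() as tmp:
    bmc = Path(tmp) / "bmc.json"
    chassis = Path(tmp) / "chassis.json"
    bmc.write_text(json.dumps({"targets": [{
        "name": "multi-user.target",
        "errorsToMonitor": ["default"],
        "errorToLog": "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure",
    }]}))
    chassis.write_text(json.dumps({"targets": [{
        "name": "obmc-chassis-poweron@0.target",
        "errorsToMonitor": ["timeout", "failed"],
        "errorToLog": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure",
    }]}))
    config = load_target_configs([bmc, chassis])

print(sorted(config))
```

Note that a strict JSON parser such as Python's `json` module rejects trailing
commas, so the configuration files need to be strictly valid JSON.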

The JSON file(s) would have the following format for services:
```
{
    "services" : [
        "xyz.openbmc_project.biosconfig_manager.service",
        "xyz.openbmc_project.Dump.Manager.service"
    ]
}
```

On startup, all input JSON files will be loaded and monitoring will be set up.

This application will not register any interfaces on D-Bus but will subscribe
to systemd D-Bus signals for JobRemoved. sdeventplus will be used for all
D-Bus communication.
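
The core of the JobRemoved handling can be sketched without any D-Bus
plumbing. The systemd JobRemoved signal carries a job id, a job object path, the
unit name, and the result string; the Python sketch below takes those four
values plus a logging callback. The handler name `on_job_removed`, the
`MONITORED_TARGETS` table, and the callback signature are hypothetical and only
illustrate the matching logic.

```python
# Hypothetical in-memory configuration, already parsed from the JSON files.
MONITORED_TARGETS = {
    "obmc-chassis-poweron@0.target": {
        "results": {"timeout", "failed"},
        "error": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure",
    },
}


def on_job_removed(job_id, job_path, unit, result, log_error):
    """Handle one JobRemoved(id, job, unit, result) signal.

    Returns True when an error was logged for this signal.
    """
    target = MONITORED_TARGETS.get(unit)
    if target is None or result not in target["results"]:
        return False  # unmonitored unit, or a result we do not report
    log_error(target["error"], {"STATUS": result})
    return True


# Usage: record what would have been sent to the logging service.
logged = []


def record(error, additional_data):
    logged.append((error, additional_data))


on_job_removed(101, "/org/freedesktop/systemd1/job/101",
               "obmc-chassis-poweron@0.target", "failed", record)
on_job_removed(102, "/org/freedesktop/systemd1/job/102",
               "obmc-chassis-poweron@0.target", "done", record)
on_job_removed(103, "/org/freedesktop/systemd1/job/103",
               "some-other.target", "failed", record)
print(logged)
```

Only the first call produces a log entry: the second has a result of `done`,
and the third names a unit that is not in the table.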

For additional debug, the errors may be registered with the BMC Dump function to
ensure the cause of the failure can be determined. This requires that the errors
logged by this service be put into `phosphor-debug-errors/errors_watch.yaml`.

For service failures, a dump will be collected by default because the BMC
will be moved into a Quiesced state.
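
A minimal Python sketch of the service failure path might look like the
following. It assumes the failure is observed as a job result of `failed` for a
monitored service, models the BMC as a plain dictionary, and uses a hypothetical
`SERVICE` additional-data key to carry the service name; none of these details
are specified by the design.

```python
MONITORED_SERVICES = {
    "xyz.openbmc_project.biosconfig_manager.service",
    "xyz.openbmc_project.Dump.Manager.service",
}
CRITICAL_SERVICE_ERROR = "xyz.openbmc_project.State.Error.CriticalServiceFailure"


def handle_service_result(unit, result, bmc):
    """Log an error and quiesce the (modeled) BMC when a monitored service fails."""
    if unit not in MONITORED_SERVICES or result != "failed":
        return False
    # "SERVICE" is a hypothetical key name used to record which service failed.
    bmc["errors"].append((CRITICAL_SERVICE_ERROR, {"SERVICE": unit}))
    # The dump collection follows from the state change, so it is not modeled here.
    bmc["state"] = "Quiesced"
    return True


bmc = {"state": "Ready", "errors": []}
handle_service_result("xyz.openbmc_project.Dump.Manager.service", "failed", bmc)
print(bmc["state"], bmc["errors"])
```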

Note that services which are short running applications responsible for
transitioning the system from one target to another (e.g. chassis power on/off,
starting/stopping the host, ...) are critical services, but they are not the
type of services to be monitored. Their failures will cause systemd targets to
fail, which is covered by the target monitoring piece of this design. The
service monitoring is meant for long running services which stay running
after a target has completed.

## Alternatives Considered

Units have an `OnFailure=` directive. The error logging logic could be put
within that path, but it introduces more complexity to OpenBMC systemd usage,
which is already quite complicated.

This could be implemented within the existing state manager applications since
they are already monitoring some of these units. The standalone application and
generic capability to monitor any unit was chosen as a better option.

## Impacts

A phosphor-logging event will be logged when one of the units listed above
fails. This will be viewable by owners of the system. There may be situations
where two logs are generated for the same issue. For example, if the power
application detects an issue during power on, logs it to phosphor-logging, and
then returns non-zero to systemd, the target that service is a part of will also
fail and the logic defined in this design will also log an error.

The thought is that two errors are better than none. Also, there is a plugin
architecture being defined for phosphor-logging. The user could monitor for
that first error within the phosphor-logging plugin architecture and, as a
defined action, cancel the systemd target. A target status of `canceled` will
not result in phosphor-state-manager generating an error.

## Testing

All units mentioned within this design need to be made to fail. They should
fail for each of the reasons defined within this design, and the error generated
for each scenario should be verified.
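
One way to enumerate the target scenarios is to build a matrix of every
critical target against every default failure result. The Python sketch below
does only that bookkeeping; the helper name `build_test_matrix` and the case
layout are hypothetical, and actually forcing each failure on a running system
is outside what it shows.

```python
from itertools import product

# Critical targets listed in the requirements section.
TARGETS = [
    "multi-user.target",
    "obmc-chassis-poweron@0.target",
    "obmc-chassis-poweroff@0.target",
    "obmc-host-start@0.target",
    "obmc-host-startmin@0.target",
    "obmc-host-shutdown@0.target",
    "obmc-host-stop@0.target",
    "obmc-host-reboot@0.target",
]
FAILURE_RESULTS = ["timeout", "failed", "dependency"]


def build_test_matrix(targets, results):
    """Return one test case per (target, result) pair."""
    return [
        {"target": target, "result": result, "expected_data": f"STATUS={result}"}
        for target, result in product(targets, results)
    ]


matrix = build_test_matrix(TARGETS, FAILURE_RESULTS)
print(len(matrix))  # 8 targets x 3 results = 24 cases
```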

[1]: https://github.com/openbmc/docs/blob/master/architecture/openbmc-systemd.md
[2]: https://github.com/openbmc/phosphor-state-manager
[3]: https://www.freedesktop.org/wiki/Software/systemd/dbus/