# Monitoring and Logging OpenBMC Systemd Target Failures

Author: Andrew Geissler
  < geissonator >

Primary assignee: Andrew Geissler

Created: June 18, 2019

## Problem Description

OpenBMC uses systemd to coordinate the starting and stopping of its
applications. For example, systemd is used to start all of OpenBMC's initial
services, to power on and off the servers it manages, and to perform operations
like firmware updates.

[openbmc-systemd.md][1] has a good summary of systemd and the basics of how
it is used within OpenBMC.

There are situations where OpenBMC systemd targets fail. They fail because a
target or service within them hits some sort of error condition. In some
cases, the service which caused the failure will log an error to
phosphor-logging, but in other situations (segfault, unhandled exception,
unhandled error path, ...) there will be no indication to the end user that
something failed. It is critical that the user be notified if a systemd target
fails within OpenBMC.

## Background and References

See the [phosphor-state-manager][2] repository for background information on
state management and systemd within OpenBMC.

systemd provides signals when targets complete and provides status on that
completed target. See the JobNew()/JobRemoved() section in the [systemd dbus
API][3]. The six different results are:

```
done, canceled, timeout, failed, dependency, skipped
```

phosphor-state-manager code already monitors for these signals but only looks
for a status of `done` to know when to mark the corresponding dbus object
as ready(bmc)/on(chassis)/running(host).

The proposal within this document is to monitor for these other results and
log an appropriate error to phosphor-logging.

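As a rough illustration of the split described above, the sketch below sorts
the six job results into the one that is already acted on (`done`) and the
ones that become candidates for error logging. The function name and the
returned labels are hypothetical and are not part of phosphor-state-manager.

```python
# The six job results listed in the systemd D-Bus API documentation.
JOB_RESULTS = ("done", "canceled", "timeout", "failed", "dependency", "skipped")


def classify_job_result(result):
    """Return a hypothetical label describing how a job result is treated."""
    if result not in JOB_RESULTS:
        raise ValueError("unknown systemd job result: " + result)
    if result == "done":
        # Existing behavior: the target was reached, so the state object
        # can be marked ready/on/running.
        return "state-reached"
    # Proposed behavior: every other result may be logged as an error,
    # depending on what the configuration asks to monitor.
    return "failure-candidate"
```

Whether a failure candidate is actually logged depends on the per-target
configuration described in the following sections.
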
## Requirements

- Must be able to monitor any arbitrary systemd target and log a defined error
  based on the type of failure the target encountered
- Must be configurable
  - Target: Choose any systemd target
  - Status: Choose one or more statuses to monitor for
  - Error to log: Specify the error to log when a monitored status is detected
- Must be able to take multiple configuration files as input on startup
- Support a `default` for errors to monitor for
  - This will be `timeout`, `failed`, and `dependency`
- The logged error will always be the configured one, with additional data
  indicating the status that triggered it (e.g. canceled, timeout, ...)
  - Example:
    ```
    xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure
    STATUS=timeout
    ```
- By default, enable this monitor and logging service for all critical OpenBMC
  targets
  - Critical systemd targets are ones used by phosphor-state-manager
    - BMC: multi-user.target
    - Chassis: obmc-chassis-poweron@0.target, obmc-chassis-poweroff@0.target
    - Host: obmc-host-start@0.target, obmc-host-startmin@0.target,
      obmc-host-shutdown@0.target, obmc-host-stop@0.target,
      obmc-host-reboot@0.target
  - The errors for these must be defined in phosphor-dbus-interfaces
- Limitations:
  - Fully qualified target name must be input (i.e. no templated / wild card
    target support)

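The `default` requirement can be pictured as a small expansion step applied to
each configured status list. The following is a minimal sketch, not the
implementation: the function name is hypothetical, and rejecting unknown
status names and dropping duplicates are assumptions that the requirements do
not spell out.

```python
# Statuses that "default" stands for, per the requirements above.
DEFAULT_STATUSES = ("timeout", "failed", "dependency")

# The six job results reported by systemd.
VALID_STATUSES = {"done", "canceled", "timeout", "failed", "dependency", "skipped"}


def expand_statuses(configured):
    """Expand "default" and return the statuses in first-seen order."""
    expanded = []
    for status in configured:
        names = DEFAULT_STATUSES if status == "default" else (status,)
        for name in names:
            if name not in VALID_STATUSES:
                # Assumption: an unknown status is treated as a config error.
                raise ValueError("unknown status to monitor: " + name)
            if name not in expanded:
                expanded.append(name)
    return expanded
```

With this sketch, `["default", "canceled"]` expands to the three default
statuses followed by `canceled`.
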
## Proposed Design

Create a new standalone application in phosphor-state-manager which will load
json file(s) on startup. The json file(s) would have the following format:

```
{
    "targets" : [
        {
            "name": "multi-user.target",
            "errorsToMonitor": ["default"],
            "errorToLog": "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure"
        },
        {
            "name": "obmc-chassis-poweron@0.target",
            "errorsToMonitor": ["timeout", "failed"],
            "errorToLog": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure"
        }
    ]
}
```

On startup, all input json files will be loaded and monitoring will be set up.

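Loading several files of this format could look roughly like the sketch below,
which merges every `targets` entry into one dictionary keyed by target name.
The function name is hypothetical, and letting a later file override an
earlier one for the same target is an assumption; the design does not say how
duplicates are handled.

```python
import json
from pathlib import Path


def load_target_configs(paths):
    """Merge the "targets" entries of several json files into one dict."""
    targets = {}
    for path in paths:
        data = json.loads(Path(path).read_text())
        for entry in data["targets"]:
            # Assumption: a later file replaces an earlier entry with the
            # same target name.
            targets[entry["name"]] = {
                "errorsToMonitor": list(entry["errorsToMonitor"]),
                "errorToLog": entry["errorToLog"],
            }
    return targets
```

The resulting dictionary gives a direct lookup from a unit name reported by
systemd to the statuses to monitor and the error to log.
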
This application will not register any interfaces on D-Bus but will subscribe
to systemd dbus signals for JobRemoved. sdeventplus will be used for all
D-Bus communication.

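The decision made for each JobRemoved signal can be sketched without any D-Bus
plumbing. The signal carries the job id, the job object path, the unit name,
and the result string; the handler below only decides what, if anything,
should be logged. The function name and return shape are hypothetical, and the
sketch assumes any `default` entry has already been expanded into concrete
statuses.

```python
def on_job_removed(job_id, job_path, unit, result, targets):
    """Return (error name, additional data) to log, or None to log nothing.

    `targets` maps a unit name to a dict with "errorsToMonitor" (a list of
    concrete statuses) and "errorToLog" (the error name to raise).
    """
    config = targets.get(unit)
    if config is None:
        # Not a monitored target.
        return None
    if result not in config["errorsToMonitor"]:
        # Includes "done" and any status the config chose not to monitor.
        return None
    # The configured error is logged with the triggering status attached.
    return config["errorToLog"], {"STATUS": result}
```

A real service would pass the returned pair to phosphor-logging; that call is
left out here because its exact form is outside the scope of this sketch.
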
For additional debugging, the errors may be registered with the BMC Dump
function to ensure the cause of the failure can be determined. This requires
that the errors logged by this service be added to
`phosphor-debug-errors/errors_watch.yaml`.

## Alternatives Considered

Targets have an `OnFailure=` directive. The error logging logic could be placed
within that path, but it introduces more complexity to OpenBMC systemd usage,
which is already quite complicated.

This could also be implemented within the existing state manager applications,
since they are already monitoring some of these targets. The standalone
application and generic capability to monitor any target was chosen as a better
option.

## Impacts

A phosphor-logging event will be logged when one of the targets listed above
fails. This will be viewable by owners of the system. There may be situations
where two logs are generated for the same issue. For example, if the power
application detects an issue during power on, logs it to phosphor-logging, and
then returns non-zero to systemd, the target that service is a part of will also
fail and the logic defined in this design will also log an error.

The thought is that two errors are better than none. Also, there is a plugin
architecture being defined for phosphor-logging. The user could monitor for
that first error within the phosphor-logging plugin architecture and, as a
defined action, cancel the systemd target. A target status of `canceled` will
not result in phosphor-state-manager generating an error.

## Testing

All targets mentioned within this design need to be made to fail. They should
fail for each of the reasons defined within this design, and the error generated
for each scenario should be verified.

[1]: https://github.com/openbmc/docs/blob/master/architecture/openbmc-systemd.md
[2]: https://github.com/openbmc/phosphor-state-manager
[3]: https://www.freedesktop.org/wiki/Software/systemd/dbus/