# Monitoring and Logging OpenBMC Systemd Target and Service Failures

Author: Andrew Geissler < geissonator >

Created: June 18, 2019 Last Updated: Feb 21, 2022

## Problem Description

OpenBMC uses systemd to coordinate the starting and stopping of its
applications. For example, systemd is used to start all of OpenBMC's initial
services, to power on and off the servers it manages, and to perform operations
like firmware updates.

[openbmc-systemd.md][1] has a good summary of systemd and the basics of how it
is used within OpenBMC.

At a high level, systemd is composed of units. For OpenBMC, the two key unit
types are target and service.

There are situations where OpenBMC systemd targets fail. They fail because a
target or service within them hits some sort of error condition. In some cases,
the unit which caused the failure will log an error to phosphor-logging, but in
others (segfault, unhandled exception, unhandled error path, ...) there will be
no indication to the end user that something failed. It is critical that if a
systemd target fails within OpenBMC, the user be notified.

There are also scenarios where a system has successfully started all targets but
a running service within one of those targets fails. Some services are not all
that critical, but something like fan-control or a power monitoring service
could be very critical. At a minimum, the user of the system must be informed
that a critical service has failed. The solution proposed in this document does
not preclude service owners from doing other recovery; it just ensures a bare
minimum of reporting is done when a failure occurs.

## Background and References

See the [phosphor-state-manager][2] repository for background information on
state management and systemd within OpenBMC.

systemd emits signals when units complete and reports a result for each
completed unit.
See the JobNew()/JobRemoved() section in the [systemd dbus
API][3]. The six different results are:

```
done, canceled, timeout, failed, dependency, skipped
```

phosphor-state-manager code already monitors for these signals but only looks
for a status of `done` to know when to mark the corresponding dbus object as
ready(bmc)/on(chassis)/running(host).

The proposal within this document is to monitor for the other results as well
and log an appropriate error to phosphor-logging.

A systemd unit that is a service will only enter the `failed` state after all
systemd-defined retries have been executed. For OpenBMC systems, that involves 2
restarts within a 30-second window.

## Requirements

### Systemd Target Units

- Must be able to monitor any arbitrary systemd target and log a defined error
  based on the type of failure the target encountered
- Must be configurable
  - Target: Choose any systemd target
  - Status: Choose one or more statuses to monitor for
  - Error to log: Specify error to log when error is detected
- Must be able to take multiple configuration files as input on startup
- Support a `default` for errors to monitor for
  - This will be `timeout`, `failed`, and `dependency`
- Error will always be the configured one with additional data indicating the
  status that failed (e.g. canceled, timeout, ...)
  - Example:

```
    xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure
    STATUS=timeout
```

- By default, enable this monitoring and logging service for all critical
  OpenBMC targets
  - Critical systemd targets are ones used by phosphor-state-manager
    - BMC: multi-user.target
    - Chassis: obmc-chassis-poweron@0.target, obmc-chassis-poweroff@0.target
    - Host: obmc-host-start@0.target, obmc-host-startmin@0.target,
      obmc-host-shutdown@0.target, obmc-host-stop@0.target,
      obmc-host-reboot@0.target
  - The errors for these must be defined in phosphor-dbus-interfaces
- Limitations:
  - Fully qualified target name must be input (i.e. no templated / wildcard
    target support)

### Systemd Service Units

- Must be able to monitor any arbitrary systemd service within OpenBMC
  - Service(s) to monitor must be configurable
- Log an error indicating a service has failed (with service name in log)
  - `xyz.openbmc_project.State.Error.CriticalServiceFailure`
- Collect a BMC dump
- Change BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Report changed state externally via Redfish managers/bmc state

## Proposed Design

Create a new standalone application in phosphor-state-manager which will load
JSON file(s) on startup.
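
As a rough illustration of the loading step and of how the requirements above
fit together, the following Python sketch merges several configuration files,
expands the `default` keyword, and maps a JobRemoved result to the error that
should be logged. It is not the phosphor-state-manager implementation: the
function names `load_configs` and `error_for` are hypothetical, and the key
names (`targets`, `services`, `name`, `errorsToMonitor`, `errorToLog`) are the
ones used by the configuration format in this section.

```python
import json

# The `default` set named in the requirements.
DEFAULT_ERRORS = {"timeout", "failed", "dependency"}


def load_configs(paths):
    """Merge the "targets" and "services" entries of several JSON files."""
    targets = {}
    services = set()
    for path in paths:
        with open(path) as handle:
            data = json.load(handle)
        for entry in data.get("targets", []):
            monitored = set()
            for status in entry["errorsToMonitor"]:
                # Expand the "default" keyword into its three statuses.
                monitored |= DEFAULT_ERRORS if status == "default" else {status}
            targets[entry["name"]] = (monitored, entry["errorToLog"])
        services.update(data.get("services", []))
    return targets, services


def error_for(targets, unit, result):
    """Return the error to log for a JobRemoved result, or None."""
    monitored, error = targets.get(unit, (set(), None))
    return error if result in monitored else None
```

In this sketch a later file overrides an earlier entry for the same target
name; whether the real application should behave that way is a design choice
the sketch does not settle.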

The JSON file(s) would have the following format for targets:

```
{
  "targets": [
    {
      "name": "multi-user.target",
      "errorsToMonitor": ["default"],
      "errorToLog": "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure"
    },
    {
      "name": "obmc-chassis-poweron@0.target",
      "errorsToMonitor": ["timeout", "failed"],
      "errorToLog": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure"
    }
  ]
}
```

The JSON file(s) would have the following format for services:

```
{
  "services": [
    "xyz.openbmc_project.biosconfig_manager.service",
    "xyz.openbmc_project.Dump.Manager.service"
  ]
}
```

On startup, all input JSON files will be loaded and monitoring will be set up.

This application will not register any interfaces on D-Bus but will subscribe to
systemd dbus signals for JobRemoved. sdeventplus will be used for all D-Bus
communication.

For additional debug, the errors may be registered with the BMC Dump function to
ensure the cause of the failure can be determined. This requires that the errors
logged by this service be put into `phosphor-debug-errors/errors_watch.yaml`.

For service failures, a dump will be collected by default because the BMC will
be moved into a Quiesced state.

Note that services which are short running applications responsible for
transitioning the system from one target to another (e.g. chassis power on/off,
starting/stopping the host, ...) are critical services, but they are not the
type of services to be monitored. Their failures will cause systemd targets to
fail, which falls under the target monitoring piece of this design. The service
monitoring is meant for long running services which stay running after a target
has completed.

## Alternatives Considered

Units have an `OnFailure=` directive.
The error logging logic could be put within that path, but it would introduce
more complexity to OpenBMC systemd usage, which is already quite complicated.

This could also be implemented within the existing state manager applications
since they are already monitoring some of these units. The standalone
application and generic capability to monitor any unit was chosen as a better
option.

## Impacts

A phosphor-logging event will be logged when one of the units listed above
fails. This will be viewable by owners of the system. There may be situations
where two logs are generated for the same issue. For example, if the power
application detects an issue during power on, logs it to phosphor-logging, and
then returns non-zero to systemd, the target that service is a part of will also
fail and the logic defined in this design will also log an error.

The thought is that two errors are better than none. Also, there is a plugin
architecture being defined for phosphor-logging. The user could monitor for that
first error within the phosphor-logging plugin architecture and, as a defined
action, cancel the systemd target. A target status of `canceled` will not result
in phosphor-state-manager generating an error.

## Testing

All units mentioned within this design need to be made to fail. They should fail
for each of the reasons defined within this design, and the error generated for
each scenario should be verified.

[1]: https://github.com/openbmc/docs/blob/master/architecture/openbmc-systemd.md
[2]: https://github.com/openbmc/phosphor-state-manager
[3]: https://www.freedesktop.org/wiki/Software/systemd/dbus/
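
The test plan amounts to a matrix of monitored targets and job results. A
minimal Python sketch of enumerating that matrix follows, assuming the two
example targets from the Proposed Design section; the names `TARGETS` and
`test_matrix` are hypothetical and only illustrate which combinations are
expected to produce a log.

```python
from itertools import product

# The six JobRemoved results listed in Background and References.
RESULTS = ["done", "canceled", "timeout", "failed", "dependency", "skipped"]
DEFAULT_ERRORS = {"timeout", "failed", "dependency"}

# Monitored statuses for the two example targets in Proposed Design.
TARGETS = {
    "multi-user.target": DEFAULT_ERRORS,
    "obmc-chassis-poweron@0.target": {"timeout", "failed"},
}


def test_matrix():
    """Yield (target, result, error_expected) for every combination."""
    for target, result in product(TARGETS, RESULTS):
        yield target, result, result in TARGETS[target]


# Print one row per scenario that a test run would need to cover.
for target, result, expected in test_matrix():
    print(f"{target:32} {result:10} {'log' if expected else 'no log'}")
```

Each row marked `log` is a scenario where the configured error should appear in
phosphor-logging with the matching `STATUS` value; each `no log` row checks that
nothing is logged.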