# Monitoring and Logging OpenBMC Systemd Target and Service Failures

Author: Andrew Geissler
  < geissonator >

Created: June 18, 2019
Last Updated: Feb 21, 2022

## Problem Description

OpenBMC uses systemd to coordinate the starting and stopping of its
applications. For example, systemd is used to start all of OpenBMC's initial
services, to power on and off the servers it manages, and to perform operations
like firmware updates.

[openbmc-systemd.md][1] has a good summary of systemd and the basics of how
it is used within OpenBMC.

At a high level, systemd is composed of units. For OpenBMC, the two key
unit types are target and service.

There are situations where OpenBMC systemd targets fail. They fail because a
target or service within them hits some sort of error condition. In some
cases, the unit which caused the failure will log an error to
phosphor-logging, but in other situations (segfault, unhandled exception,
unhandled error path, ...) there will be no indication to the end user that
something failed. It is critical that the user be notified if a systemd target
fails within OpenBMC.

There are also scenarios where a system has successfully started all targets
but a running service within one of those targets fails. Some services are not
all that critical, but something like fan-control or a power monitoring service
could be very critical. At a minimum, the user of the system must be informed
that a critical service has failed. The solution proposed in this document does
not preclude service owners from doing other recovery; it just ensures that a
bare minimum of reporting is done when a failure occurs.

## Background and References

See the [phosphor-state-manager][2] repository for background information on
state management and systemd within OpenBMC.

systemd provides signals when units complete and provides status on that
completed unit.
See the JobNew()/JobRemoved() section in the [systemd dbus
API][3]. The six different results are:

```
done, canceled, timeout, failed, dependency, skipped
```

phosphor-state-manager code already monitors for these signals but only looks
for a status of `done` to know when to mark the corresponding dbus object
as ready(bmc)/on(chassis)/running(host).

The proposal within this document is to monitor for these other results and
log an appropriate error to phosphor-logging.

A systemd unit that is a service will only enter a `failed` state after
all systemd defined retries have been executed. For OpenBMC systems, that
involves 2 restarts within a 30 second window.

## Requirements

### Systemd Target Units

- Must be able to monitor any arbitrary systemd target and log a defined error
  based on the type of failure the target encountered
- Must be configurable
  - Target: Choose any systemd target
  - Status: Choose one or more statuses to monitor for
  - Error to log: Specify the error to log when a monitored status is detected
- Must be able to take multiple configuration files as input on startup
- Support a `default` for errors to monitor for
  - This will be `timeout`, `failed`, and `dependency`
- Error will always be the configured one with additional data indicating the
  status that failed (e.g. canceled, timeout, ...)
  - Example:
    ```
    xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure
    STATUS=timeout
    ```
- By default, enable this monitor and logging service for all critical OpenBMC
  targets
  - Critical systemd targets are ones used by phosphor-state-manager
    - BMC: multi-user.target
    - Chassis: obmc-chassis-poweron@0.target, obmc-chassis-poweroff@0.target
    - Host: obmc-host-start@0.target, obmc-host-startmin@0.target,
      obmc-host-shutdown@0.target, obmc-host-stop@0.target,
      obmc-host-reboot@0.target
  - The errors for these must be defined in phosphor-dbus-interfaces
- Limitations:
  - Fully qualified target name must be input (i.e. no templated / wildcard
    target support)

### Systemd Service Units

- Must be able to monitor any arbitrary systemd service within OpenBMC
  - Service(s) to monitor must be configurable
- Log an error indicating a service has failed (with service name in log)
  - `xyz.openbmc_project.State.Error.CriticalServiceFailure`
- Collect a BMC dump
- Change BMC state (CurrentBMCState) to indicate a degraded mode of the BMC
- Report changed state externally via Redfish managers/bmc state

## Proposed Design

Create a new standalone application in phosphor-state-manager which will load
json file(s) on startup.
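
As a rough illustration of what such an application would decide for each
finished job, the sketch below applies the target requirements: expand
`default` into `timeout`, `failed`, and `dependency`, and when a monitored
result is seen, produce the configured error with the result as `STATUS`
additional data. It is written in Python for brevity and is not the
phosphor-state-manager implementation; `TargetConfig`, `error_for_result`, and
the dictionary used to represent a log entry are hypothetical names introduced
only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Results that the `default` keyword stands for, per the requirements.
DEFAULT_RESULTS = frozenset({"timeout", "failed", "dependency"})


@dataclass(frozen=True)
class TargetConfig:
    """Hypothetical in-memory form of one monitored target."""

    name: str           # fully qualified target name
    results: frozenset  # results to monitor; may contain "default"
    error: str          # error to log when a monitored result is seen

    def monitored(self) -> frozenset:
        """Expand `default` into the concrete results it represents."""
        expanded = set(self.results - {"default"})
        if "default" in self.results:
            expanded |= DEFAULT_RESULTS
        return frozenset(expanded)


def error_for_result(cfg: TargetConfig, unit: str, result: str) -> Optional[dict]:
    """Return the log entry for a finished job, or None if nothing is logged.

    `unit` and `result` stand in for the unit name and result string
    carried by a systemd JobRemoved signal.
    """
    if unit != cfg.name or result not in cfg.monitored():
        return None
    return {"error": cfg.error, "additional_data": {"STATUS": result}}


# Usage: a target monitored with the default set of results.
cfg = TargetConfig(
    name="multi-user.target",
    results=frozenset({"default"}),
    error="xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure",
)
entry = error_for_result(cfg, "multi-user.target", "timeout")
```

With this configuration a `timeout` result yields an entry carrying
`STATUS=timeout`, while `done` or `canceled` yields `None` because neither is
part of the default set.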

The json file(s) would have the following format for targets:

```
{
    "targets" : [
        {
            "name": "multi-user.target",
            "errorsToMonitor": ["default"],
            "errorToLog": "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure"
        },
        {
            "name": "obmc-chassis-poweron@0.target",
            "errorsToMonitor": ["timeout", "failed"],
            "errorToLog": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure"
        }
    ]
}
```

The json file(s) would have the following format for services:

```
{
    "services" : [
        "xyz.openbmc_project.biosconfig_manager.service",
        "xyz.openbmc_project.Dump.Manager.service"
    ]
}
```

On startup, all input json files will be loaded and monitoring will be set up.

This application will not register any interfaces on D-Bus but will subscribe
to systemd dbus signals for JobRemoved. sdeventplus will be used for all
D-Bus communication.

For additional debug, the errors may be registered with the BMC Dump function to
ensure the cause of the failure can be determined. This requires the errors
logged by this service be put into `phosphor-debug-errors/errors_watch.yaml`.

For service failures, a dump will be collected by default because the BMC
will be moved into a Quiesced state.

Note that services which are short running applications responsible for
transitioning the system from one target to another (e.g. chassis power on/off,
starting/stopping the host, ...) are critical services, but they are not the
type of services to be monitored. Their failures will cause systemd targets to
fail, which will fall into the target monitoring piece of this design. The
service monitoring is meant for long running services which stay running
after a target has completed.

## Alternatives Considered

Units have an `OnFailure` directive.
The error logging logic could be put within that
path, but that would introduce more complexity to OpenBMC's systemd usage, which
is already quite complicated.

This could also be implemented within the existing state manager applications,
since they already monitor some of these units. A standalone application with a
generic capability to monitor any unit was chosen as the better option.

## Impacts

A phosphor-logging event will be logged when one of the units listed above
fails. This will be viewable by owners of the system. There may be situations
where two logs are generated for the same issue. For example, if the power
application detects an issue during power on, logs it to phosphor-logging, and
then returns non-zero to systemd, the target that service is a part of will also
fail and the logic defined in this design will also log an error.

The thought is that two errors are better than none. Also, there is a plugin
architecture being defined for phosphor-logging. The user could monitor for
that first error within the phosphor-logging plugin architecture and, as a
defined action, cancel the systemd target. A target status of `canceled` will
not result in phosphor-state-manager generating an error.

## Testing

All units mentioned within this design need to be made to fail. They should fail
for each of the reasons defined within this design, and the error generated for
each scenario should be verified.

[1]: https://github.com/openbmc/docs/blob/master/architecture/openbmc-systemd.md
[2]: https://github.com/openbmc/phosphor-state-manager
[3]: https://www.freedesktop.org/wiki/Software/systemd/dbus/
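
As a possible starting point for the testing described above, the following
Python sketch merges several configuration documents in the format given in
Proposed Design and lists every (target, result, expected error) combination a
test would need to trigger. The helper names `merge_configs` and
`target_test_cases` are hypothetical, and the sketch assumes the configuration
is plain JSON; it does not cause any failures itself.

```python
import json

# Results that the `default` keyword stands for, per the requirements.
DEFAULT_RESULTS = ("timeout", "failed", "dependency")


def merge_configs(texts):
    """Combine several JSON documents into one dict of targets and services."""
    merged = {"targets": [], "services": []}
    for text in texts:
        doc = json.loads(text)
        merged["targets"].extend(doc.get("targets", []))
        merged["services"].extend(doc.get("services", []))
    return merged


def target_test_cases(merged):
    """List (target, result, expected error) triples a test must trigger."""
    cases = []
    for target in merged["targets"]:
        results = []
        for item in target["errorsToMonitor"]:
            results.extend(DEFAULT_RESULTS if item == "default" else [item])
        # dict.fromkeys removes duplicates while keeping the original order.
        for result in dict.fromkeys(results):
            cases.append((target["name"], result, target["errorToLog"]))
    return cases


TARGETS_JSON = """
{
    "targets": [
        {
            "name": "multi-user.target",
            "errorsToMonitor": ["default"],
            "errorToLog": "xyz.openbmc_project.State.BMC.Error.MultiUserTargetFailure"
        },
        {
            "name": "obmc-chassis-poweron@0.target",
            "errorsToMonitor": ["timeout", "failed"],
            "errorToLog": "xyz.openbmc_project.State.Chassis.Error.PowerOnTargetFailure"
        }
    ]
}
"""

SERVICES_JSON = """
{
    "services": [
        "xyz.openbmc_project.biosconfig_manager.service",
        "xyz.openbmc_project.Dump.Manager.service"
    ]
}
"""

merged = merge_configs([TARGETS_JSON, SERVICES_JSON])
cases = target_test_cases(merged)
for name, result, error in cases:
    print(f"{name}: result={result} -> {error}")
```

Each listed service would additionally need a test that makes it fail and
checks for `xyz.openbmc_project.State.Error.CriticalServiceFailure`.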