#
4f472a86 |
| 26-Aug-2022 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Use USR1 signal to dump debug data
Similar to what fan control is already doing, this commit adds a handler for the USR1 signal to write debug data to /tmp/fan_monitor_dump.json. The data being written is the same data saved in an event log - the current sensor status plus any of the Logger class's logs.
Example output, which shows fan0 recovering from previous faults:

    {
        "logs": [
            ...
            [
                "Aug 26 17:04:47",
                "Setting tach sensor /xyz/openbmc_project/sensors/fan_tach/fan0_0 functional state to false. [target = 18000, input = 3446, allowed range = (10600 - NoMax) owned = true]"
            ],
            [
                "Aug 26 17:04:47",
                "Starting shutdown action 'EPOW Power Off: 60s/60s' due to cause '2 Nonfunctional Fan Rotors'"
            ],
            [
                "Aug 26 17:04:47",
                "Action EPOW Power Off: 60s/60s: Starting service mode timer"
            ],
            [
                "Aug 26 17:04:47",
                "Creating event log for faulted fan /xyz/openbmc_project/inventory/system/chassis/motherboard/fan0 sensor /xyz/openbmc_project/sensors/fan_tach/fan0_0"
            ]
        ],
        "sensors": {
            "sensors": {
                "/xyz/openbmc_project/sensors/fan_tach/fan0_0": {
                    "functional": false,
                    "in_range": true,
                    "present": true,
                    "prev_tachs": "[11829,11867,11829,11867,11829,11867,11718,11467]",
                    "prev_targets": "[18000,9000,9040,10320,0,0,0,0]",
                    "tach": 11829.0,
                    "target": 18000,
                    "ticks": 18
                },
                "/xyz/openbmc_project/sensors/fan_tach/fan0_1": {
                    "functional": false,
                    "in_range": true,
                    "present": true,
                    "prev_tachs": "[17857,17772,17857,17772,17201,17045,16741,16375]",
                    "tach": 17857.0,
                    "ticks": 20
                },
                "/xyz/openbmc_project/sensors/fan_tach/fan1_0": {
                    "functional": true,
                    "in_range": true,
                    "present": true,
                    "prev_tachs": "[11755,11792,11755,11792,11755,11792,11755,11792]",
                    "prev_targets": "[18000,9000,9040,10320,0,0,0,0]",
                    "tach": 11755.0,
                    "target": 18000,
                    "ticks": 0
                },
                ...
            }
        }
    }
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I84179f78ec83ca6bab788052d0bebe677c1fd29f
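A dump in the format of the example output above can be inspected with a short script. The following is a minimal sketch, not part of the commit: it assumes the per-sensor data is nested under "sensors" twice, exactly as in the example, and that "prev_tachs" is a JSON array stored as a string. The function name `nonfunctional_sensors` is hypothetical.

```python
import json


def nonfunctional_sensors(dump: dict) -> dict:
    """Map each nonfunctional tach sensor path to its previous tach readings."""
    # Assumption: the example output nests the per-sensor data under "sensors" twice.
    sensors = dump.get("sensors", {}).get("sensors", {})
    result = {}
    for path, data in sensors.items():
        if not data.get("functional", True):
            # Assumption: prev_tachs is a string holding a JSON array.
            result[path] = json.loads(data.get("prev_tachs", "[]"))
    return result


# Trimmed copy of two sensors from the example output in the commit message.
sample = {
    "logs": [],
    "sensors": {"sensors": {
        "/xyz/openbmc_project/sensors/fan_tach/fan0_0": {
            "functional": False, "tach": 11829.0, "target": 18000,
            "prev_tachs": "[11829,11867,11829,11867,11829,11867,11718,11467]",
        },
        "/xyz/openbmc_project/sensors/fan_tach/fan1_0": {
            "functional": True, "tach": 11755.0, "target": 18000,
            "prev_tachs": "[11755,11792,11755,11792,11755,11792,11755,11792]",
        },
    }},
}

for path, tachs in nonfunctional_sensors(sample).items():
    print(path, "last readings:", tachs)
```

On a real system the input would presumably come from `json.load()` on /tmp/fan_monitor_dump.json after sending USR1 to the fan monitor process.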
|
#
cb356d48 |
| 22-Jul-2022 |
Patrick Williams <patrick@stwcx.xyz> |
sdbusplus: use shorter type aliases
The sdbusplus headers provide shortened aliases for many types. Switch to using them to provide better code clarity and shorter lines. Possible replacements are for:
* bus_t
* exception_t
* manager_t
* match_t
* message_t
* object_t
* slot_t
Signed-off-by: Patrick Williams <patrick@stwcx.xyz> Change-Id: I9029cc722e7712633c15436bd3868d8c3209f567
|
#
683a96c6 |
| 27-Apr-2022 |
Mike Capps <mikepcapps@gmail.com> |
monitor: Capture BMC dumps on fan/ambient shutdowns
When fan-monitor or sensor-monitor generates an EPOW, this change creates a BMC dump after the system is powered off and all error logs are created.
Change-Id: Iacdd2d2b388e79988e2536d52497f0e697e1d444 Signed-off-by: Mike Capps <mikepcapps@gmail.com>
|
#
b4379a1e |
| 11-Oct-2021 |
Mike Capps <mikepcapps@gmail.com> |
Monitor : handle inventory service offline
Using nameHasOwner and nameOwnerChanged D-Bus signals, a callback is activated when inventory is started.
There are two primary modes of operation. With Compatible interfaces, the inventory-detection callback will fail; however, start() will be called a second time after EntityManager starts, which forces a reload of the proper config for the machine type. Separately, if no EntityManager exists, the inventory-detection callback will succeed and use the default configuration file.
To test: stop the fan monitor and inventory services. Start the monitor, wait 10s, then start inventory; after about 15s you should see the online detection.
Signed-off-by: Mike Capps <mikepcapps@gmail.com> Change-Id: I289493a0aabb849abee8ce8de047513e94ee2219
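The owner-tracking idea can be sketched without a real bus connection. This is an illustration only, not the commit's code: the class `InventoryWatcher` and its callback are hypothetical, the `has_owner` argument stands in for a NameHasOwner query at startup, and the only D-Bus fact relied on is that a NameOwnerChanged signal with an empty new owner means the name was released.

```python
from typing import Callable


class InventoryWatcher:
    """Calls on_online once each time the watched bus name gains an owner."""

    def __init__(self, has_owner: bool, on_online: Callable[[], None]):
        self._online = False
        self._on_online = on_online
        # Startup check, standing in for a NameHasOwner query.
        if has_owner:
            self._set_online(True)

    def name_owner_changed(self, old_owner: str, new_owner: str) -> None:
        # In NameOwnerChanged, an empty new_owner means the name was released.
        self._set_online(new_owner != "")

    def _set_online(self, online: bool) -> None:
        # Fire the callback only on an offline-to-online transition.
        if online and not self._online:
            self._on_online()
        self._online = online


starts = []
watcher = InventoryWatcher(has_owner=False, on_online=lambda: starts.append("start"))
watcher.name_owner_changed("", ":1.42")   # inventory service appears
watcher.name_owner_changed(":1.42", "")   # inventory service goes away
watcher.name_owner_changed("", ":1.57")   # inventory service restarts
print(starts)
```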
|
#
25f0327e |
| 13-Sep-2021 |
Mike Capps <mikepcapps@gmail.com> |
Monitor: Support hwmon service offline during startup
It is possible for fan-monitor to start up before the Hwmonitor service, causing unhandled exceptions that block system initialization. This fix catches the exception until a proper hwmon presence detector is deployed.
If the exception is caught, this code change forces a re-subscription during the poweron event to ensure tach sensors will receive published updates upon resumption of the hwmon service.
Signed-off-by: Mike Capps <mikepcapps@gmail.com> Change-Id: I8e696e747c432d7a6f696c5ccd9dab73abf7708f
|
#
fdcd5db3 |
| 20-May-2021 |
Mike Capps <mikepcapps@gmail.com> |
monitor: Subscribe to tach target and feedback services
Subscribes to nameOwnerChanged signals for the services of the sensor and target interfaces for each configured fan. If those services go offline, the fan tach sensors should get marked nonfunctional due to no longer receiving updated target or feedback values. In this design, we use the existing method of determining when a fan tach sensor should be marked nonfunctional to allow a recovery window, wherein a brief offline/online transition (such as during a restart) will not trigger a nonfunctional state change.
Change-Id: I0a935ccad5a864dc952d023185356a1ef1226830 Signed-off-by: Mike Capps <mikepcapps@gmail.com>
|
#
bb449c1c |
| 14-Jun-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Shut down if no readings at power on
If there are no tach sensors on D-Bus when the power state changes to on, then create an event log and shut down the system. This is done because in this case the code cannot know the fan state, that is, whether any fans are present or spinning.
The most likely reason there are no sensors (aside from a glaring error in the config file) is that the fan controller device driver failed its probe and was unable to detect the device, maybe because the device didn't have power or there was an I2C problem. To aid in root cause analysis if this were to occur in the field, the code adds the following FFDC (First Failure Data Capture) to the event log:
* All of the loaded hwmon drivers, taken from /sys/class/hwmon/*/name
* Failure related lines in dmesg, which is where driver errors would show up.
Tested: Unbound the fan device driver and then powered on the system. Also disabled I2C to the fan controller device in simulation and tried a power on.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: Ic0b80d67ec79c9401f59324fe1134ff12084112a
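The two FFDC items listed above could be gathered roughly as follows. This is a sketch under assumptions, not the commit's implementation: both function names are hypothetical, and the pattern used to pick "failure related" dmesg lines is a guess, since the commit message does not say which lines are kept.

```python
import glob
import os
import re


def hwmon_driver_names(base: str = "/sys/class/hwmon") -> list:
    """Return the contents of every <base>/*/name file, ordered by path."""
    names = []
    for name_file in sorted(glob.glob(os.path.join(base, "*", "name"))):
        try:
            with open(name_file) as f:
                names.append(f.read().strip())
        except OSError:
            # A device can disappear between the glob and the read.
            continue
    return names


# Assumed pattern; the commit does not state its actual filter.
FAILURE_RE = re.compile(r"fail|error", re.IGNORECASE)


def failure_lines(dmesg_text: str) -> list:
    """Keep only the lines that look failure related."""
    return [line for line in dmesg_text.splitlines() if FAILURE_RE.search(line)]


print(hwmon_driver_names())
print(failure_lines("probe ok\nprobe failed with error -5"))
```

Passing `base` as a parameter keeps the directory scan testable against a temporary directory instead of the live sysfs tree.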
|
#
823bc49e |
| 21-Jun-2021 |
Matthew Barth <msbarth@us.ibm.com> |
monitor: Use new JsonConfig object
To simplify handling the loading of config files, use the updated JsonConfig object that populates the available compatibility values used when retrieving the JSON file and loading it. The given load function is called if compatibility values are found when the object is constructed or after an interfacesAdded signal is received; it can then call `getConfFile` to find the JSON config file to be loaded.
Change-Id: Ifc164d36c036cf0ff810018d40e8de52efc6ca58 Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
|
#
4283c5d5 |
| 01-Mar-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Allow missing D-Bus sensors on startup
Now that phosphor-fan-monitor is starting at the multi-user target, it may be starting before the fan sensor hwmon daemon is able to put the tach reading sensors on D-Bus. This was causing the TachSensor class objects to not get created, so even if the hwmon tach sensor values did show up later on D-Bus, fan monitor wouldn't notice them.
To fix this, still create the TachSensor objects if the corresponding hwmon D-Bus objects aren't there, and still set them to functional in the inventory. That way any other monitoring code, such as phosphor-dbus-monitor, won't shut down the system before the hwmon tach sensors get a chance to show up on D-Bus, which was happening on witherspoon when a reboot was done with the power on.
When the monitor delay timer expires to kick off monitoring, a D-Bus read is forced, and if the hwmon sensors still aren't on D-Bus then the corresponding TachSensor objects will be set to nonfunctional to start down the error paths.
Also, when the power state changes to on, instead of blindly setting all TachSensor objects to functional, again check if their hwmon sensor values are on D-Bus before doing so.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I3e62727296630bf68602b0472328f4613e1a78e3
|
#
7d135641 |
| 04-Feb-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Support for running with power off
Put in the remaining changes necessary so that fan monitor doesn't need to be killed when power turns off.
This includes things like:
* Support for starting before the Present property is on D-Bus.
* Support for starting before the config file name is available.
* Stopping any running timers when power is turned off.
* Checking the power off rules when power turns on.
Most, but not all, of the changes are common between the JSON and YAML modes, but this is only truly supported when compiled for JSON.
This also removes the init vs monitor modes of operation, if compiled for JSON.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: Ic2c6848f24511c9dc763227e05bbebb4c8c80cd1
|
#
c8d3c51f |
| 06-Jan-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Add thermal fault alert D-Bus property
Add a new property to alert of a thermal fault. In this context, it means an imminent power off due to fan faults. On certain IBM systems it will be used as a mechanism to alert the host of the power off when the 'epow_power_off' power off rule is used.
Service: xyz.openbmc_project.Thermal.Alert
Path: /xyz/openbmc_project/alerts/thermal_fault_alert
Interface: xyz.openbmc_project.Object.Enable
Property: Enabled
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I0531de9ce40b6148244fda18a20e144bad85d830
|
#
ac1efc11 |
| 27-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Re-log fan error on a power off
In the case where a power off rule runs to completion and powers off the system due to either missing or faulted fans, at the point of power off re-post the event log for the previous fan error.
This way, an error is associated with the power off itself; depending on the power off rule delays, the original error could have happened several minutes or more in the past.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I1a38062cf75ffd4a11baa417ef3983b6c1a47ada
|
#
27f6b686 |
| 27-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Event logs for missing fans
This commit adds the code to create event logs calling out the fan when it has been missing for a certain amount of time.
This is basically identical to the functionality that the fan presence application in this repo provides, but with it in this application all fan errors are created from the same place. This will become important when there is a power off due to a fan missing and the error for that needs to be re-committed at power off time so it can be shown as the cause of the power off.
The functionality is configured in the JSON:
fan_missing_error_delay: Defines the number of seconds a fan must be missing with power on before an error will be created. If this isn't present in the JSON, then errors will not be created at all.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I76de9d8d1bf6e283560b1ce46e70f84522e2d708
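The rule for fan_missing_error_delay can be expressed as a small predicate. This is a sketch under assumptions: the function name and its arguments are hypothetical, the value 20 is illustrative, `missing_since` is taken to be the time the fan started being missing with power on, and the key is read from the top level of the dict passed in because the commit message does not say where it sits in the JSON.

```python
import json
from typing import Optional


def missing_error_due(config: dict, missing_since: Optional[float],
                      power_on: bool, now: float) -> bool:
    """True when a missing-fan error should be created under the stated rule."""
    delay = config.get("fan_missing_error_delay")
    if delay is None:
        # Key absent from the JSON: errors are not created at all.
        return False
    if missing_since is None or not power_on:
        # The fan is present, or power is off, so nothing is due.
        return False
    return now - missing_since >= delay


config = json.loads('{"fan_missing_error_delay": 20}')
print(missing_error_due(config, missing_since=100.0, power_on=True, now=110.0))
print(missing_error_due(config, missing_since=100.0, power_on=True, now=120.0))
print(missing_error_due({}, missing_since=100.0, power_on=True, now=500.0))
```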
|
#
f13b42e2 |
| 26-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Event logs for nonfunc fan sensors
This commit adds the code to create event logs calling out the fan when fan sensors have been nonfunctional for a certain amount of time.
This functionality is configured in the JSON, and will only be enabled if the 'fault_handling' JSON section is present. It uses the following new JSON parameters:
nonfunc_rotor_error_delay (per fan): This says how many seconds a fan sensor must be nonfunctional before the event log will be created.
num_nonfunc_rotors_before_error (under fault_handling): This specifies how many nonfunctional fan rotors there must be at the same time before an event log with an error severity is created for the rotor. When there are fewer than this many nonfunctional rotors, then event logs with an informational severity will be created.
A new FanError class is used to create the event logs. It adds the Logger output as FFDC, plus any JSON data that is passed in with the commit() API. It uses CALLOUT_INVENTORY_PATH in the AdditionalData property to specify the faulted fan FRU.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I365114357580b4f38ec943a769c1ce7f695b51ab
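The severity rule for num_nonfunc_rotors_before_error can be illustrated as below. This is a sketch, not the FanError class: the function name, the returned dict layout, and the severity strings are hypothetical, and only the CALLOUT_INVENTORY_PATH key and the threshold rule come from the commit message.

```python
def build_fan_error(fan_path: str, nonfunc_rotor_count: int,
                    num_nonfunc_rotors_before_error: int) -> dict:
    """Pick the event log severity and attach the fan callout."""
    # At or above the threshold the log is an error; below it, informational.
    if nonfunc_rotor_count >= num_nonfunc_rotors_before_error:
        severity = "Error"
    else:
        severity = "Informational"
    return {
        "severity": severity,
        "additional_data": {"CALLOUT_INVENTORY_PATH": fan_path},
    }


fan0 = "/xyz/openbmc_project/inventory/system/chassis/motherboard/fan0"
print(build_fan_error(fan0, nonfunc_rotor_count=1, num_nonfunc_rotors_before_error=2))
print(build_fan_error(fan0, nonfunc_rotor_count=2, num_nonfunc_rotors_before_error=2))
```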
|
#
e892e39a |
| 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Start checking power off rules
In the system object, load the power off rules and start checking them. It will check them in the following cases (if power is on):
* When the object is constructed
* When the JSON config is reloaded
* When fan presence or sensor functional state changes
* When the power state changes to on
When the power is turned off, it will cancel any running rules.
Previously, fan monitor was only designed to run with power on, and there still may be more changes than just the ones added here to support it always running.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I8be81612ae4997d7568678471ac0f6f854a0e758
|
#
b63aa09e |
| 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Track fan health in the System object
To prepare for being able to power off the system based on missing fans or nonfunctional fan sensors, put a global view of this health for all fans in the System object. This requires now keeping track of fan presence.
This information is stored in a map based on the fan name. It is done this way, as opposed to just always calling present/functional APIs on the Fan objects, so that the code that will be using this information can be tested in isolation without the System or Fan objects.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: Ieb1d4003bd13cebc806fd06f0064c63ea8ac6180
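The benefit of a plain map keyed by fan name is that consumers can be exercised with hand-built data. The sketch below assumes one possible shape for the map (presence plus a list of per-sensor functional states); the type alias and both function names are hypothetical and not taken from the commit.

```python
from typing import Dict, List, Tuple

# Assumed shape: fan name -> (present, functional state of each tach sensor).
FanHealth = Dict[str, Tuple[bool, List[bool]]]


def count_missing_fans(health: FanHealth) -> int:
    """Number of fans whose presence flag is False."""
    return sum(1 for present, _ in health.values() if not present)


def count_nonfunc_rotors(health: FanHealth) -> int:
    """Number of tach sensors, across all fans, marked nonfunctional."""
    return sum(sensors.count(False) for _, sensors in health.values())


# Hand-built map: no System or Fan objects are needed to test the counters.
health = {
    "fan0": (True, [False, False]),
    "fan1": (True, [True, True]),
    "fan2": (False, [True, True]),
}
print(count_missing_fans(health), count_nonfunc_rotors(health))
```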
|
#
d06905c9 |
| 12-Jun-2020 |
Matthew Barth <msbarth@us.ibm.com> |
monitor:SIGHUP: Handle reloading JSON config thru SIGHUP
Enable capturing the HUP signal to reload the JSON configuration. This will reload the appropriate JSON configuration file found and update the trust groups and fan definitions configured.
Tested:
* JSON configuration is reloaded and updated after SIGHUP
* Single instance of trust groups exist that match the JSON config
* Single instance of fan definitions exist that match the JSON config
Change-Id: If55ca583a67fd76f0733009707bd5c4b5eda3e63 Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
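The reload-on-HUP pattern can be sketched as follows. This is an illustration only: the `System` class here is a stand-in and not the project's class, the JSON keys "sensor_trust_groups" and "fans" are assumed names, and `install_sighup_handler` is a hypothetical helper. The point it demonstrates is that a reload replaces the stored definitions instead of appending to them, which is what keeps a single instance of each after repeated signals.

```python
import json
import signal


class System:
    """Holds the configuration that a reload replaces wholesale."""

    def __init__(self, config_path: str):
        self.config_path = config_path
        self.trust_groups = []
        self.fans = []
        self.reload()

    def reload(self) -> None:
        with open(self.config_path) as f:
            config = json.load(f)
        # Replace, never append, so each reload leaves a single instance.
        # Assumption: these key names are illustrative.
        self.trust_groups = list(config.get("sensor_trust_groups", []))
        self.fans = list(config.get("fans", []))


def install_sighup_handler(system: System) -> None:
    """Reload the configuration whenever the process receives SIGHUP."""
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGHUP, lambda signum, frame: system.reload())
```

With the handler installed, `kill -HUP <pid>` would trigger `reload()` at the next opportunity in the main thread.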
|
#
c95c527a |
| 15-Jun-2020 |
Matthew Barth <msbarth@us.ibm.com> |
monitor:SIGHUP: Create and use system object
Use a system object to handle retrieving the trust groups and fan definitions configured. This is necessary for handling HUP signals in the future where a reload of the JSON configuration is done.
Tested:
* No change in the loading of the trust groups configuration
* No change in the loading of the fan definitions configured
Change-Id: I5df2d54641f80778bbf09d7b1f4588a458e11c71 Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
|