ba3ee9ae | 06-Jan-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Fill in EpowPowerOff action
This action does the following:
1) Starts a service mode timer, which would allow the system to be serviced before anything happens.
2) On the expiration of that timer, it will:
   a) Set the thermal fault alert D-Bus property. This will be used to send an EPOW alert to the host on IBM systems.
   b) Start the meltdown timer.
3) On the expiration of the meltdown timer, a hard power off will occur. This timer cannot be canceled even if fans start behaving.
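The sequence described in this commit message can be sketched as a small state machine. This is only an illustration: `EpowPowerOffSketch`, its `State` enum, and the `advance()` helper that simulates time are hypothetical names, not the real class, and the sketch assumes that canceling is possible while the service mode timer is running.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical sketch, not the real EpowPowerOff class. Timers are simulated
// by calling advance() so the ordering of the steps is easy to see.
class EpowPowerOffSketch
{
  public:
    enum class State
    {
        idle,
        serviceMode,
        meltdown,
        poweredOff
    };

    EpowPowerOffSketch(seconds serviceModeDelay, seconds meltdownDelay) :
        _serviceModeDelay(serviceModeDelay), _meltdownDelay(meltdownDelay)
    {}

    // Step 1: start the service mode timer.
    void start()
    {
        if (_state == State::idle)
        {
            _state = State::serviceMode;
            _remaining = _serviceModeDelay;
        }
    }

    // Assumption: canceling is only possible before the meltdown timer
    // starts. Returns true if the action was canceled.
    bool cancel()
    {
        if (_state != State::serviceMode)
        {
            return false;
        }
        _state = State::idle;
        return true;
    }

    // Simulate the passage of time, firing any timers that expire.
    void advance(seconds elapsed)
    {
        while (_state == State::serviceMode || _state == State::meltdown)
        {
            if (elapsed < _remaining)
            {
                _remaining -= elapsed;
                return;
            }
            elapsed -= _remaining;
            if (_state == State::serviceMode)
            {
                // Step 2: set the alert, then start the meltdown timer.
                _thermalAlert = true;
                _state = State::meltdown;
                _remaining = _meltdownDelay;
            }
            else
            {
                // Step 3: the meltdown timer expired, hard power off.
                _state = State::poweredOff;
            }
        }
    }

    State state() const
    {
        return _state;
    }

    bool thermalAlert() const
    {
        return _thermalAlert;
    }

  private:
    seconds _serviceModeDelay;
    seconds _meltdownDelay;
    seconds _remaining{0};
    State _state = State::idle;
    bool _thermalAlert = false;
};
```

Once the sketch reaches the `meltdown` state, `cancel()` returns false, which mirrors the statement that the meltdown timer cannot be canceled.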
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I9434699b816b23b68c6d9d1e97283b4ab9befe4f
|
c8d3c51f | 06-Jan-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Add thermal fault alert D-Bus property
Add a new property to alert of a thermal fault. In this context, it means an imminent power off due to fan faults. On certain IBM systems it will be used as a mechanism to alert the host of the power off when the 'epow_power_off' power off rule is used.
Service: xyz.openbmc_project.Thermal.Alert
Path: /xyz/openbmc_project/alerts/thermal_fault_alert
Interface: xyz.openbmc_project.Object.Enable
Property: Enabled
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I0531de9ce40b6148244fda18a20e144bad85d830
|
c4bed6b8 | 06-Jan-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Remove _active from PowerOffAction
It isn't used anywhere.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I4697a3ff775206501b1e000b8ce14de7637453b4 |
b92aa3bf | 06-Jan-2021 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Change power off rule trace order
When starting a power off action, trace it before starting it, so that if the action also traces something, this trace comes first.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: Iae7a196422e9c629098e587e31cb01f2f15eabb3
|
69f2f48e | 20-Oct-2020 |
Jolie Ku <jolie_ku@wistron.com> |
monitor: Add up/down count fault detection
Create an up/down count fault determination algorithm that could be used in place of the current timer-based outOfRange() function. The up/down count is a different method for determining when a fan is faulted: it counts up on each iteration that a rotor is out of spec and removes those counts when the rotor returns within spec.
Tested:
1. Removed a fan and ran Mihawk; the counter added 1 when the sensor was out of spec. Replaced the fan before the counter hit the threshold, and the counter decremented back to 0.
2. Removed a fan; the counter added 1 and the removed fan was marked as nonfunctional when the counter reached the threshold. Replaced the fan; the counter decremented back to 0 and the fan went back to functional.
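A minimal sketch of the up/down counting idea follows. `UpDownCounterSketch` is a hypothetical name, and two details are assumptions rather than facts from the commit: the count is capped at the threshold, and the rotor is treated as functional again only once the count has returned to 0 (which matches the second test case above).

```cpp
#include <cstddef>

// Hypothetical sketch of an up/down fault counter for one rotor.
class UpDownCounterSketch
{
  public:
    explicit UpDownCounterSketch(std::size_t threshold) :
        _threshold(threshold)
    {}

    // Call once per monitoring iteration. Returns true while the rotor is
    // still considered functional.
    bool update(bool inSpec)
    {
        if (!inSpec)
        {
            // Count up, capped at the threshold (assumption).
            if (_count < _threshold)
            {
                ++_count;
            }
            if (_count >= _threshold)
            {
                _functional = false;
            }
        }
        else
        {
            // Remove counts while the rotor is back within spec.
            if (_count > 0)
            {
                --_count;
            }
            // Assumption: functional again only when the count is 0.
            if (_count == 0)
            {
                _functional = true;
            }
        }
        return _functional;
    }

    std::size_t count() const
    {
        return _count;
    }

  private:
    std::size_t _threshold;
    std::size_t _count = 0;
    bool _functional = true;
};
```

Compared with a timer, a brief out-of-spec reading here only adds a count that is later removed, so the rotor is marked faulted only when out-of-spec iterations outnumber in-spec ones by the threshold.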
Change-Id: I632dd2c7553b007beb7ae6bb694a590d2cfc2a1c Signed-off-by: Jolie Ku <jolie_ku@wistron.com> Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
|
12b32010 | 27-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Add a README
Add a README.md for fan monitor that provides a high-level overview of what it does.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: Id13ee104005d7328e3ba3102cf6d6f32ee3a1f78
|
ac1efc11 | 27-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Re-log fan error on a power off
In the case where a power off rule runs to completion and powers off the system due to either missing or faulted fans, at the point of power off re-post the event log for the previous fan error.
This way, there can be an error associated with the power off. Depending on the power off rule delays, the original error could have happened several minutes or more in the past.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I1a38062cf75ffd4a11baa417ef3983b6c1a47ada
|
27f6b686 | 27-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Event logs for missing fans
This commit adds the code to create event logs calling out the fan when it has been missing for a certain amount of time.
This is basically identical to the functionality that the fan presence application in this repo provides, but with it in this application all fan errors are created from the same place. This will become important when there is a power off due to a fan missing and the error for that needs to be re-committed at power off time so it can be shown as the cause of the power off.
The functionality is configured in the JSON:
fan_missing_error_delay: Defines the number of seconds a fan must be missing with power on before an error will be created. If this isn't present in the JSON, then errors will not be created at all.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I76de9d8d1bf6e283560b1ce46e70f84522e2d708
|
f13b42e2 | 26-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Event logs for nonfunc fan sensors
This commit adds the code to create event logs calling out the fan when fan sensors have been nonfunctional for a certain amount of time.
This functionality is configured in the JSON, and will only be enabled if the 'fault_handling' JSON section is present. It uses the following new JSON parameters:
nonfunc_rotor_error_delay (per fan): This says how many seconds a fan sensor must be nonfunctional before the event log will be created.
num_nonfunc_rotors_before_error (under fault_handling): This specifies how many nonfunctional fan rotors there must be at the same time before an event log with an error severity is created for the rotor. When there are fewer than this many nonfunctional rotors, then event logs with an informational severity will be created.
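The severity rule for num_nonfunc_rotors_before_error can be sketched as a single comparison. The `Severity` enum and `fanErrorSeverity()` function are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Hypothetical severity levels for a fan event log.
enum class Severity
{
    informational,
    error
};

// Picks the event log severity from the current number of nonfunctional
// rotors and the configured num_nonfunc_rotors_before_error value.
Severity fanErrorSeverity(std::size_t numNonfuncRotors,
                          std::size_t numNonfuncRotorsBeforeError)
{
    // Fewer nonfunctional rotors than configured: informational only.
    return numNonfuncRotors < numNonfuncRotorsBeforeError
               ? Severity::informational
               : Severity::error;
}
```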
A new FanError class is used to create the event logs. It adds the Logger output as FFDC, plus any JSON data that is passed in with the commit() API. It uses CALLOUT_INVENTORY_PATH in the AdditionalData property to specify the faulted fan FRU.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I365114357580b4f38ec943a769c1ce7f695b51ab
|
ae1f8efe | 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Allowing ignoring fan FRU func status
Make the 'num_sensors_nonfunc_for_fan_nonfunc' JSON entry be optional, and if it isn't present then don't set the parent fan FRU inventory object functional state when the tach sensor functional states change.
This is necessary because on some systems some other entity will be managing the FRU level functional state.
This also adds a trace when the tach sensor functional state changes, since if the FRU functional state updating is turned off then the existing traces won't appear.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I1be9cc335c15a78d342e2e7ea4e5108a66d29de3
|
e892e39a | 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Start checking power off rules
In the system object, load the power off rules and start checking them. It will check them in the following cases (if power is on):
* When the object is constructed
* When the JSON config is reloaded
* When fan presence or sensor functional state changes
* When the power state changes to on
When the power is turned off, it will cancel any running rules.
Previously, fan monitor was only designed to run with power on, and there still may be more changes than just the ones added here to support it always running.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I8be81612ae4997d7568678471ac0f6f854a0e758
|
f06ab07c | 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Create PowerOffRules class
This class contains a PowerOffCause and a PowerOffAction. It provides a check() method that takes the FanHealth map which it then checks against the cause. If the cause is satisfied, it then starts the power off action. It provides a cancel method that will force cancel a running action in the case that the object owner detects a system power off and so doesn't need to run this power off anymore.
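The check()/cancel() flow described above might look roughly like the sketch below. All names here (`PowerOffRuleSketch`, `FanHealthSketch`, the callback types) are hypothetical. The cause and action are reduced to callables to keep the example short, the layout of the health map (presence flag plus per-sensor functional flags, keyed by fan name) is an assumption, and so is the `_running` guard that avoids starting the action twice.

```cpp
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Assumed layout: fan name -> (present, functional state of each sensor).
using FanHealthSketch =
    std::map<std::string, std::tuple<bool, std::vector<bool>>>;

// Hypothetical sketch of a rule that pairs a cause with an action.
class PowerOffRuleSketch
{
  public:
    using Cause = std::function<bool(const FanHealthSketch&)>;
    using Callback = std::function<void()>;

    PowerOffRuleSketch(Cause cause, Callback start, Callback cancel) :
        _cause(std::move(cause)), _start(std::move(start)),
        _cancel(std::move(cancel))
    {}

    // Starts the power off action if the cause is satisfied.
    void check(const FanHealthSketch& health)
    {
        if (!_running && _cause(health))
        {
            _running = true;
            _start();
        }
    }

    // Force cancels a running action, for example when the owner sees that
    // the system has already powered off.
    void cancel()
    {
        if (_running)
        {
            _cancel();
            _running = false;
        }
    }

    bool running() const
    {
        return _running;
    }

  private:
    Cause _cause;
    Callback _start;
    Callback _cancel;
    bool _running = false;
};
```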
The class's configuration data is read from the JSON config file.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I5c0c168591d6d62c894c4d036ec762797fd759af
|
69b0cf08 | 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Create PowerOffAction class hierarchy
The PowerOffAction base class and its derived classes will be used to power off a system due to fan failures.
There are 3 types of power offs:
1. HardPowerOff - Do a hard power off after a delay
2. SoftPowerOff - Do a soft power off after a delay
3. EpowPowerOff - This isn't fully defined yet, but it will involve powering off after setting an early power off warning somehow and then waiting through 2 delays.
The code that makes the D-Bus calls to do the power offs is in a standalone class so that it can be mocked in testcases.
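To illustrate why the D-Bus calls live in a standalone class, here is a rough sketch of the pattern. Every name in it is hypothetical, the delay timer is replaced by a `delayExpired()` method that a test can call directly, and the recording class stands in for whatever mocking approach the real testcases use.

```cpp
#include <string>
#include <vector>

// Hypothetical interface wrapping the calls that power off the system.
class PowerInterfaceSketch
{
  public:
    virtual ~PowerInterfaceSketch() = default;
    virtual void hardPowerOff() = 0;
    virtual void softPowerOff() = 0;
};

// Test double that records calls instead of touching D-Bus.
class RecordingPowerInterface : public PowerInterfaceSketch
{
  public:
    void hardPowerOff() override
    {
        calls.push_back("hard");
    }

    void softPowerOff() override
    {
        calls.push_back("soft");
    }

    std::vector<std::string> calls;
};

// Hypothetical base class for the power off actions.
class PowerOffActionSketch
{
  public:
    explicit PowerOffActionSketch(PowerInterfaceSketch& power) :
        _power(power)
    {}
    virtual ~PowerOffActionSketch() = default;

    // Stands in for the action's delay timer expiring.
    virtual void delayExpired() = 0;

  protected:
    PowerInterfaceSketch& _power;
};

class HardPowerOffSketch : public PowerOffActionSketch
{
  public:
    using PowerOffActionSketch::PowerOffActionSketch;

    void delayExpired() override
    {
        _power.hardPowerOff();
    }
};

class SoftPowerOffSketch : public PowerOffActionSketch
{
  public:
    using PowerOffActionSketch::PowerOffActionSketch;

    void delayExpired() override
    {
        _power.softPowerOff();
    }
};
```

Because the actions only see the interface, a testcase can pass in the recording double and check which kind of power off was requested without a running D-Bus.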
This code also makes use of the Logger class for logging, so this commit brings that in as a singleton.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I83118963df4ec0b4f89619572f6935329eec3adb
|
00237439 | 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Create PowerOffCause class hierarchy
The PowerOffCause base class and its derived classes will be used to determine when a power off needs to be done based on fan failures.
The 'satisfied()' method, which takes the fan health map, is used to say if the cause is satisfied and a shut down will need to occur.
It provides two types of causes:
* MissingFanFRUCause - Looks at missing fan FRUs
* NonfuncFanRotorCause - Looks at nonfunctional rotors (sensors)
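The two causes can be sketched as below. The class names carry a `Sketch` suffix because they are illustrations, not the real classes. The health map layout (presence flag plus per-sensor functional flags, keyed by fan name) and the idea that each cause compares against a configured count are both assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Assumed layout: fan name -> (present, functional state of each sensor).
using FanHealthSketch =
    std::map<std::string, std::tuple<bool, std::vector<bool>>>;

// Hypothetical base class: satisfied() says if a power off is needed.
class PowerOffCauseSketch
{
  public:
    explicit PowerOffCauseSketch(std::size_t count) : _count(count)
    {}
    virtual ~PowerOffCauseSketch() = default;

    virtual bool satisfied(const FanHealthSketch& health) const = 0;

  protected:
    // Assumed threshold: how many problems trigger the cause.
    std::size_t _count;
};

// Satisfied when at least _count fan FRUs are missing.
class MissingFanFRUCauseSketch : public PowerOffCauseSketch
{
  public:
    using PowerOffCauseSketch::PowerOffCauseSketch;

    bool satisfied(const FanHealthSketch& health) const override
    {
        auto missing = std::count_if(
            health.begin(), health.end(),
            [](const auto& fan) { return !std::get<0>(fan.second); });
        return static_cast<std::size_t>(missing) >= _count;
    }
};

// Satisfied when at least _count rotors (sensors) are nonfunctional.
class NonfuncFanRotorCauseSketch : public PowerOffCauseSketch
{
  public:
    using PowerOffCauseSketch::PowerOffCauseSketch;

    bool satisfied(const FanHealthSketch& health) const override
    {
        std::size_t nonfunc = 0;
        for (const auto& fan : health)
        {
            const auto& sensors = std::get<1>(fan.second);
            nonfunc += static_cast<std::size_t>(
                std::count(sensors.begin(), sensors.end(), false));
        }
        return nonfunc >= _count;
    }
};
```

Since the causes only take the map, they can be exercised in a testcase with hand-built health data.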
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I3c43347782dc559eb7c7441bf9c03d3407b248e2
|
b63aa09e | 14-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Track fan health in the System object
To prepare for being able to power off the system based on missing fans or nonfunctional fan sensors, put a global view of this health for all fans in the System object. This requires now keeping track of fan presence.
This information is stored in a map based on the fan name. It is done this way, as opposed to just always calling present/functional APIs on the Fan objects, so that the code that will be using this information can be tested in isolation without the System or Fan objects.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: Ieb1d4003bd13cebc806fd06f0064c63ea8ac6180
|
b0412d07 | 12-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Use only init mode when using JSON
Fan monitor is currently split into 2 modes - 'init' which is used right after a power on, and 'monitor', which is used later after the fans-ready target is started. Normally, the 'init' mode just sets the fans to functional and then exits, and the real monitoring work is done in the 'monitor' mode.
In the future this application will need to be able to check for fan problems as soon as it starts up after power on so that it can handle shutting down due to missing fans. To prepare for this, move all functionality into the init mode, and just exit immediately when called to run in the monitor mode. Only do this when compiled to use the JSON configuration, as this is new and I don't want to change how the existing YAML setups work.
This also creates a new 'monitor_start_delay' entry in the JSON to say how long to wait after startup before actually doing any sensor monitoring, which then gives the same behavior as how the monitor mode would delay by waiting for the fan control ready target, which itself is started by fan control --init after a hardcoded delay. This field is optional to preserve backwards compatibility and defaults to 0s.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I623a233f50233e734f50cd9e80139c60467518d8
|
3220350f | 05-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Add fault config JSON documentation
Document the entries in fan monitor's JSON config file that relate to fault handling. This deals with when to create errors against faulted fans, and when and how to power off the system based on faulted and/or missing fans.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I814e3d16df5fc4ed268fa92a8cca47747b7d57e9
|
5d083229 | 05-Oct-2020 |
Matt Spinler <spinler@us.ibm.com> |
monitor: Add JSON documentation
Add a markdown file to document the fields in the fan monitor JSON configuration file. A few fields have TODO placeholders with the intent they will be filled in later.
A placeholder for a new fault handling configuration JSON section was also included. This section will eventually describe the configuration of how the fan monitor application will handle creating event logs and shutting down the system due to missing or faulted fans.
Signed-off-by: Matt Spinler <spinler@us.ibm.com> Change-Id: I35c242372225310d25c063d36e948433dd9c6c4c
|
5d564a9f | 22-Oct-2020 |
Jolie Ku <jolie_ku@wistron.com> |
monitor: Use the number of failed tach sensors at startup
When marking a fan nonfunctional due to its tach sensors failing to be read from dbus, the number of failed sensors should be checked against the configured number of sensors that should result in the fan being marked as nonfunctional.
Tested: Ran phosphor-fan-monitor --monitor in witherspoon qemu; when the number of failed tach sensors is larger than the configured _numSensorFailsForNonFunc, the associated fan is marked as nonfunctional.
Change-Id: I6ff97b9aae4279d6ce402d3aecda087d45dfa318 Signed-off-by: Jolie Ku <jolie_ku@wistron.com>
|
a7aed017 | 06-Oct-2020 |
Jay Meyer <jaymeyer@us.ibm.com> |
monitor: journal message for fan Actual Speed wrong
Problem: Actual speed is formatted with the wrong format type. Solution: By using the format library, it is not necessary to specify a format for the result, and speed is correctly displayed.
Tested: Ran with simulation. In a terminal connected to the simulator:
systemctl disable phosphor-dbus-monitor.service
obmcutil poweron
Changed the fan speed in /sys/class/hwmon/hwmon9/fan1_target using an echo command:
echo 8000 > fan1_target
After the journal showed the fan had been disabled, setting fan to nonfunctional showed expected speed: "Setting fan /system/chassis/motherboard/fan5 to nonfunctional Sensor: /xyz/openbmc_project/sensors/fan_tach/fan5_0 Actual speed: 8000.0 Target speed: 11200"
Changed the fan speed back:
echo 11200 > fan1_target
The journal entry for setting the fan back to functional was seen: "Setting fan /system/chassis/motherboard/fan2 back to functional"
Signed-off-by: Jay Meyer <jaymeyer@us.ibm.com> Change-Id: I26bf717694ff8a60851dde1a5052945e4336dfa0
|
4c3c24f8 | 08-Sep-2020 |
Jolie Ku <jolie_ku@wistron.com> |
monitor: Mark a fan with a missing dbus sensor as nonfunctional
When fan monitor starts up and retrieves the tach feedback sensor values from dbus, the associated fan should be marked nonfunctional when the sensor value is not found on dbus.
Tested: Running phosphor-fan-monitor --monitor marks a missing tach sensor and its associated fan as nonfunctional upon power on in witherspoon qemu.
Change-Id: I3be24504223d3bd9efe8c4306548d6cca93d8224 Signed-off-by: Jolie Ku <jolie_ku@wistron.com>
|
1826c730 | 28-Aug-2020 |
Matthew Barth <msbarth@us.ibm.com> |
format: Include format lib and use on errors opening JSON files
Included the format library, which is used to add more details to the journal message without needing the verbose output, and updated the journal logging when loading a JSON file. When loading a JSON file, any errors will now produce a journal message at least containing the JSON file that failed to be loaded.
Tested:
Removed JSON configuration file and attempted to load it
Journal msg shows which JSON configuration file is loaded now
Failure to parse JSON shows file and exception in journal msg
Change-Id: I6bec9bb01d8e95c3dced467ea96163129c59619b Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
|
0891e3b3 | 13-Aug-2020 |
Matthew Barth <msbarth@us.ibm.com> |
monitor: Tach input to double
It was found that after the Value property on the Sensor.Value interface was changed to be of double type, the tach input failed to get converted correctly from double to int64 type when the property changed. Need to set the tach input to be of double type explicitly.
Tested: Tach input values correctly read from dbus signal
Change-Id: I718375d5de50a88bcfaf8ff419e71f732d0b8a65 Signed-off-by: Matthew Barth <msbarth@us.ibm.com>
|
8a0c2327 | 17-Jun-2020 |
Matthew Barth <msbarth@us.ibm.com> |
monitor: `optional` no longer experimental
Change-Id: I29e4fa5cfdf5cefe1af548fd5af2a54d08682a11 Signed-off-by: Matthew Barth <msbarth@us.ibm.com> |
fbe86eee | 17-Jun-2020 |
Matthew Barth <msbarth@us.ibm.com> |
monitor: Remove never used logging include from main
Change-Id: I2afd45c822033e81eb9a6cd79aeab33136b51179 Signed-off-by: Matthew Barth <msbarth@us.ibm.com> |