#
5e7deccd
|
| 07-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add thresholds support to TLimit
This patch introduces support for retrieving GPU TLimit thresholds directly from the GPU device. TLimit Temperature represents the difference in degrees Celsius between the current GPU temperature and the initial throttle threshold. The patch also enables the extraction of three critical throttle thresholds — Warning Low, Critical Low, and Hard Shutdown Low — from the GPU hardware.
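The relationship described above can be illustrated with a minimal Python sketch. It is not the code from this patch: the sign convention (threshold minus current temperature, so a positive value means headroom), the mapping of Warning Low, Critical Low, and Hard Shutdown Low onto the Redfish names LowerCaution, LowerCritical, and LowerFatal, the "at or below" comparison, and the example threshold values are all assumptions made for illustration. The names `TLimitThresholds`, `tlimit`, and `classify` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TLimitThresholds:
    """Hypothetical container for the three low thresholds, in degrees Celsius."""
    warning_low: float
    critical_low: float
    hard_shutdown_low: float


def tlimit(throttle_threshold_c: float, gpu_temp_c: float) -> float:
    """Margin left before throttling (assumed sign: positive means headroom)."""
    return throttle_threshold_c - gpu_temp_c


def classify(margin_c: float, thresholds: TLimitThresholds) -> str:
    """Return the most severe low threshold the margin has reached.

    Assumption: a threshold counts as crossed when the margin is at or below it.
    """
    if margin_c <= thresholds.hard_shutdown_low:
        return "LowerFatal"
    if margin_c <= thresholds.critical_low:
        return "LowerCritical"
    if margin_c <= thresholds.warning_low:
        return "LowerCaution"
    return "OK"


# Illustrative values only; real thresholds are read from the GPU device.
example = TLimitThresholds(warning_low=0.0, critical_low=-2.0, hard_shutdown_low=-5.0)
print(classify(tlimit(90.0, 80.0), example))
```

Because TLimit is a margin rather than an absolute temperature, all three thresholds are lower bounds: the sensor becomes less healthy as the reading falls.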
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "Name": "NVIDIA GB200 GPU 0 TEMP 1",
  "Reading": 57.3984375,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  },
  "Thresholds": {
    "LowerCaution": {
      "Reading": 0.0
    },
    "LowerCritical": {
      "Reading": 0.0
    },
    "LowerFatal": {
      "Reading": 0.0
    }
  }
}
```
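A response like the one above can also be checked programmatically. The following Python sketch parses a trimmed copy of that payload and pulls out the lower thresholds; the helper name `lower_thresholds` is hypothetical, and the sample string is a shortened excerpt of the output shown in this commit message, not a new measurement.

```python
import json

# Trimmed excerpt of the Redfish Sensor response shown in the test output.
SAMPLE = """
{
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "Reading": 57.3984375,
  "ReadingUnits": "Cel",
  "Thresholds": {
    "LowerCaution": {"Reading": 0.0},
    "LowerCritical": {"Reading": 0.0},
    "LowerFatal": {"Reading": 0.0}
  }
}
"""


def lower_thresholds(sensor: dict) -> dict:
    """Collect the 'Lower*' threshold readings from a Sensor resource."""
    thresholds = sensor.get("Thresholds", {})
    return {
        name: entry["Reading"]
        for name, entry in thresholds.items()
        if name.startswith("Lower") and "Reading" in entry
    }


sensor = json.loads(SAMPLE)
print(sensor["Reading"], lower_thresholds(sensor))
```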
Change-Id: I6f2ff2652ce9246287f9bd63c4297d9ad3229963
Signed-off-by: Harshit Aghera <haghera@nvidia.com>