#
ada6baa9
|
| 01-Jul-2025 |
Rohit PAI <ropai@nvidia.com> |
Nvidia-Gpu: Support for Nvidia GPU Serial Number, Part Number
Support for fetching the serial number and part number is added to the inventory class, which uses the Get Inventory Command. Currently there is a retry policy of 3 retries to account for any failure to get a response from the GPU device.
Tested - Able to get Serial Number, Part Number updated from the GPU device
```
busctl introspect xyz.openbmc_project.GpuSensor /xyz/openbmc_project/inventory/NVIDIA_GB200_GPU_0
NAME                                           TYPE      SIGNATURE RESULT/VALUE         FLAGS
org.freedesktop.DBus.Introspectable            interface -         -                    -
.Introspect                                    method    -         s                    -
org.freedesktop.DBus.Peer                      interface -         -                    -
.GetMachineId                                  method    -         s                    -
.Ping                                          method    -         -                    -
org.freedesktop.DBus.Properties                interface -         -                    -
.Get                                           method    ss        v                    -
.GetAll                                        method    s         a{sv}                -
.Set                                           method    ssv       -                    -
.PropertiesChanged                             signal    sa{sv}as  -                    -
xyz.openbmc_project.Inventory.Decorator.Asset  interface -         -                    -
.PartNumber                                    property  s         "699-2G153-0210-TS1" emits-change
.SerialNumber                                  property  s         "1330325220002"      emits-change
xyz.openbmc_project.Inventory.Item.Accelerator interface -         -                    -
.Type                                          property  s         "GPU"                emits-change
```
Change-Id: Id2b33a66ff6d5480f8e229fa233528afc0bdcfc0
Signed-off-by: Rohit PAI <ropai@nvidia.com>
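The retry policy described in this commit can be illustrated with a minimal Python sketch. It is not the code from the change: `fetch_with_retries`, `make_flaky_device`, and the convention that a failed request returns `None` are all hypothetical. The sketch also assumes that "3 retries" means one initial attempt followed by up to three more, which the commit message does not state.

```python
from typing import Callable, Optional

MAX_RETRIES = 3  # value taken from the commit message; the rest is hypothetical


def fetch_with_retries(request: Callable[[], Optional[str]],
                       max_retries: int = MAX_RETRIES) -> Optional[str]:
    """Call request() once, then retry up to max_retries more times.

    request() stands in for sending a Get Inventory command; it returns
    None when the device does not respond.
    """
    for _attempt in range(1 + max_retries):
        value = request()
        if value is not None:
            return value
    return None  # every attempt failed


def make_flaky_device(failures: int, value: str) -> Callable[[], Optional[str]]:
    """Build a fake device that fails `failures` times before answering."""
    state = {"calls": 0}

    def request() -> Optional[str]:
        state["calls"] += 1
        return value if state["calls"] > failures else None

    return request
```

Under these assumptions, a device that drops the first three requests still yields its serial number on the fourth attempt, while a device that drops four requests is reported as unavailable.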
|
#
0a88826f
|
| 10-Jun-2025 |
Rohit PAI <ropai@nvidia.com> |
Nvidia-gpu: Create GPU Inventory device
The GPU device class implements the Item.Accelerator interface so that it is identified as a GPU device. This will be used in Redfish to populate the GPU processor schema.
Tested -
```
root@gb200nvl-obmc:~# busctl introspect xyz.openbmc_project.GpuSensor /xyz/openbmc_project/inventory/NVIDIA_GB200_GPU_0
NAME                                           TYPE      SIGNATURE RESULT/VALUE FLAGS
org.freedesktop.DBus.Introspectable            interface -         -            -
.Introspect                                    method    -         s            -
org.freedesktop.DBus.Peer                      interface -         -            -
.GetMachineId                                  method    -         s            -
.Ping                                          method    -         -            -
org.freedesktop.DBus.Properties                interface -         -            -
.Get                                           method    ss        v            -
.GetAll                                        method    s         a{sv}        -
.Set                                           method    ssv       -            -
.PropertiesChanged                             signal    sa{sv}as  -            -
xyz.openbmc_project.Inventory.Item.Accelerator interface -         -            -
.Type                                          property  s         "GPU"        emits-change
```
Change-Id: I20434529860cb37889e63651bbcd97cadfa9d54e
Signed-off-by: Rohit PAI <ropai@nvidia.com>
|
#
b10a67b2
|
| 27-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add dram temperature sensor
This commit introduces a DRAM temperature sensor for the GPU, enhancing the existing temperature monitoring capabilities.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_DRAM_0_TEMP_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_DRAM_0_TEMP_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_DRAM_0_TEMP_0",
  "Name": "NVIDIA GB200 GPU 0 DRAM 0 TEMP 0",
  "Reading": 30.0,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  },
  "Thresholds": {
    "UpperCritical": {
      "Reading": 95.0
    }
  }
}%
```
Change-Id: I914bb94f85e2d4163397b71a08b4ddd8b171e7d7
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
|
#
bef4d418
|
| 27-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add voltage sensor
This commit introduces a voltage sensor for the GPU.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "voltage_NVIDIA_GB200_GPU_0_Voltage_0",
  "Name": "NVIDIA GB200 GPU 0 Voltage 0",
  "Reading": 0.735,
  "ReadingRangeMax": 4294.967295,
  "ReadingRangeMin": 0.0,
  "ReadingType": "Voltage",
  "ReadingUnits": "V",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}%
```
Change-Id: I3d98f3d7c11221a42460c6f8420c927c1b1711b2
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
|
#
775199d2
|
| 27-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add energy sensor
This commit introduces an energy sensor for the GPU.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "energy_NVIDIA_GB200_GPU_0_Energy_0",
  "Name": "NVIDIA GB200 GPU 0 Energy 0",
  "Reading": 269947.856,
  "ReadingRangeMax": 1.8446744073709552e+16,
  "ReadingRangeMin": 0.0,
  "ReadingType": "EnergyJoules",
  "ReadingUnits": "J",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}%
```
Change-Id: I6f53ab2a83eedd54005bbdcd781dc8d320d7f26a
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
|
#
902c649b
|
| 08-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add power sensor
This patch adds support for fetching the power sensor value from the GPU.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "power_NVIDIA_GB200_GPU_0_Power_0",
  "Name": "NVIDIA GB200 GPU 0 Power 0",
  "Reading": 27.181,
  "ReadingRangeMax": 4294967.295,
  "ReadingRangeMin": 0.0,
  "ReadingType": "Power",
  "ReadingUnits": "W",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}%
```
Change-Id: Ic227a0056daa68ab2239a609ed20c7ed2f6bd2c5
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
|
#
5e7deccd
|
| 07-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add thresholds support to TLimit
This patch introduces support for retrieving GPU TLimit thresholds directly from the GPU device. TLimit Temperature represents the difference in degrees Celsius between the current GPU temperature and the initial throttle threshold. The patch also enables the extraction of three critical throttle thresholds (Warning Low, Critical Low, and Hard Shutdown Low) from the GPU hardware.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "Name": "NVIDIA GB200 GPU 0 TEMP 1",
  "Reading": 57.3984375,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  },
  "Thresholds": {
    "LowerCaution": {
      "Reading": 0.0
    },
    "LowerCritical": {
      "Reading": 0.0
    },
    "LowerFatal": {
      "Reading": 0.0
    }
  }
}%
```
Change-Id: I6f2ff2652ce9246287f9bd63c4297d9ad3229963
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
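The relationship between TLimit and its three lower thresholds can be sketched in Python as follows. This is an illustration under assumptions, not the code from the patch: the function names are hypothetical, the sign convention (threshold minus current temperature, so a positive value means headroom) is inferred from the description, and the mapping of Warning Low, Critical Low, and Hard Shutdown Low onto LowerCaution, LowerCritical, and LowerFatal is inferred from the order of the Redfish output. Treating a reading equal to a threshold as crossing it is also an assumption.

```python
def tlimit(current_temp_c: float, throttle_threshold_c: float) -> float:
    """Margin in degrees Celsius before the initial throttle threshold.

    Assumed sign convention: positive means the GPU is below the threshold.
    """
    return throttle_threshold_c - current_temp_c


def classify(tlimit_c: float, warning_low: float, critical_low: float,
             shutdown_low: float) -> str:
    """Return the most severe lower threshold the TLimit reading has crossed."""
    if tlimit_c <= shutdown_low:
        return "LowerFatal"      # assumed to correspond to Hard Shutdown Low
    if tlimit_c <= critical_low:
        return "LowerCritical"   # assumed to correspond to Critical Low
    if tlimit_c <= warning_low:
        return "LowerCaution"    # assumed to correspond to Warning Low
    return "OK"
```

With the values from the test output (a reading of 57.3984375 and all three thresholds at 0.0), `classify` returns `"OK"`, which matches the reported health.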
|
#
ba138dae
|
| 05-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: add TLimit sensor
This commit introduces a new thermal limit (TLimit) sensor for the GPU, enhancing the existing temperature monitoring capabilities.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
  "Name": "NVIDIA GB200 GPU 0 TEMP 1",
  "Reading": 57.3984375,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}%
```
Change-Id: Ib8e0ef93a4acbb8870671665b098fb61d0205cb2
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
|
#
4ecdfaaa
|
| 22-May-2025 |
Harshit Aghera <haghera@nvidia.com> |
nvidia-gpu: introduce notion of a device
Perform device discovery tasks only once per device to prepare for introducing additional gpu sensors.
In the current implementation, sensor updates and device discovery via MCTP are managed within a single class for simplicity. However, since a GPU device typically includes multiple sensors, performing device discovery for each individual sensor is inefficient. Instead, it would be more effective to execute device discovery once per device.
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
```
$ curl -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_0",
  "Name": "NVIDIA GB200 GPU 0 TEMP 0",
  "Reading": 37.6875,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}%
```
Change-Id: Ie3dcd43caa031b4aaa61d8be3f5d71aefd53bc9a
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
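The structural idea of this commit, one discovery step per device shared by all of that device's sensors, can be sketched in Python. The sketch is only illustrative: `GpuDevice`, `Sensor`, `count_discoveries`, and the fake endpoint ID are hypothetical names and values, and the real change is not written in Python.

```python
from typing import Callable, List, Optional


class GpuDevice:
    """Owns the discovery result and the sensors that share it."""

    def __init__(self, name: str, discover: Callable[[str], int]) -> None:
        self.name = name
        self._discover = discover          # stands in for MCTP discovery
        self._endpoint: Optional[int] = None
        self.sensors: List["Sensor"] = []

    def endpoint(self) -> int:
        # Discovery runs on first use only; later calls reuse the result.
        if self._endpoint is None:
            self._endpoint = self._discover(self.name)
        return self._endpoint

    def add_sensor(self, sensor_name: str) -> "Sensor":
        sensor = Sensor(sensor_name, self)
        self.sensors.append(sensor)
        return sensor


class Sensor:
    """A sensor asks its device for the endpoint instead of discovering it."""

    def __init__(self, name: str, device: GpuDevice) -> None:
        self.name = name
        self.device = device

    def update(self) -> str:
        return f"{self.name}@{self.device.endpoint()}"


def count_discoveries(sensor_names: List[str]) -> int:
    """Update every sensor on one device and count discovery calls."""
    calls = []

    def fake_discover(device_name: str) -> int:
        calls.append(device_name)
        return 8  # hypothetical endpoint ID

    device = GpuDevice("NVIDIA_GB200_GPU_0", fake_discover)
    for name in sensor_names:
        device.add_sensor(name).update()
    return len(calls)
```

In this sketch, updating three sensors on the same device triggers a single discovery call, whereas a design in which each sensor performs its own discovery would trigger three.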
|