History log of /openbmc/dbus-sensors/src/nvidia-gpu/NvidiaGpuDevice.cpp (Results 1 – 9 of 9)
Revision Date Author Comments
# ada6baa9 01-Jul-2025 Rohit PAI <ropai@nvidia.com>

Nvidia-Gpu: Support for Nvidia GPU Serial Number, Part Number

Support for fetching the serial number and part number is added to the
inventory class, which uses the Get Inventory Command. Currently there is
a retry policy of 3 retries to account for any failure to get a response
from the GPU device.
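
The retry policy described above can be sketched roughly as follows. This is
a minimal illustration, not the code from this commit: `fetchWithRetries` and
the `request` callable are hypothetical names, the sketch is synchronous for
brevity, and it assumes "3 retries" means one initial attempt plus up to
three further attempts.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>

// Hypothetical helper: call `request` until it yields a value or the retry
// budget is spent. Assumption: one initial attempt plus `maxRetries` retries.
inline std::optional<std::string> fetchWithRetries(
    const std::function<std::optional<std::string>()>& request,
    std::size_t maxRetries = 3)
{
    for (std::size_t attempt = 0; attempt <= maxRetries; ++attempt)
    {
        if (auto value = request())
        {
            return value; // The device answered; stop retrying.
        }
    }
    return std::nullopt; // Every attempt failed; the caller decides what to do.
}
```

A caller would pass a callable that issues the inventory request and returns
`std::nullopt` when no response arrives.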

Tested
- Able to get Serial Number, Part Number updated from the GPU device

```
busctl introspect xyz.openbmc_project.GpuSensor /xyz/openbmc_project/inventory/NVIDIA_GB200_GPU_0
NAME TYPE SIGNATURE RESULT/VALUE FLAGS
org.freedesktop.DBus.Introspectable interface - - -
.Introspect method - s -
org.freedesktop.DBus.Peer interface - - -
.GetMachineId method - s -
.Ping method - - -
org.freedesktop.DBus.Properties interface - - -
.Get method ss v -
.GetAll method s a{sv} -
.Set method ssv - -
.PropertiesChanged signal sa{sv}as - -
xyz.openbmc_project.Inventory.Decorator.Asset interface - - -
.PartNumber property s "699-2G153-0210-TS1" emits-change
.SerialNumber property s "1330325220002" emits-change
xyz.openbmc_project.Inventory.Item.Accelerator interface - - -
.Type property s "GPU" emits-change

```

Change-Id: Id2b33a66ff6d5480f8e229fa233528afc0bdcfc0
Signed-off-by: Rohit PAI <ropai@nvidia.com>


# 0a88826f 10-Jun-2025 Rohit PAI <ropai@nvidia.com>

Nvidia-gpu: Create GPU Inventory device

The GPU device class implements the Item.Accelerator interface so that it
is identified as a GPU device. This will be used in Redfish to populate
the GPU processor schema.

Tested -
```
root@gb200nvl-obmc:~# busctl introspect xyz.openbmc_project.GpuSensor /xyz/openbmc_project/inventory/NVIDIA_GB200_GPU_0
NAME TYPE SIGNATURE RESULT/VALUE FLAGS
org.freedesktop.DBus.Introspectable interface - - -
.Introspect method - s -
org.freedesktop.DBus.Peer interface - - -
.GetMachineId method - s -
.Ping method - - -
org.freedesktop.DBus.Properties interface - - -
.Get method ss v -
.GetAll method s a{sv} -
.Set method ssv - -
.PropertiesChanged signal sa{sv}as - -
xyz.openbmc_project.Inventory.Item.Accelerator interface - - -
.Type property s "GPU" emits-change
```

Change-Id: I20434529860cb37889e63651bbcd97cadfa9d54e
Signed-off-by: Rohit PAI <ropai@nvidia.com>


# b10a67b2 27-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: add dram temperature sensor

This commit introduces a DRAM temperature sensor for the GPU, enhancing
the existing temperature monitoring capabilities.

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_DRAM_0_TEMP_0
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_DRAM_0_TEMP_0",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "temperature_NVIDIA_GB200_GPU_0_DRAM_0_TEMP_0",
"Name": "NVIDIA GB200 GPU 0 DRAM 0 TEMP 0",
"Reading": 30.0,
"ReadingRangeMax": 127.0,
"ReadingRangeMin": -128.0,
"ReadingType": "Temperature",
"ReadingUnits": "Cel",
"Status": {
"Health": "OK",
"State": "Enabled"
},
"Thresholds": {
"UpperCritical": {
"Reading": 95.0
}
}
}%
```

Change-Id: I914bb94f85e2d4163397b71a08b4ddd8b171e7d7
Signed-off-by: Harshit Aghera <haghera@nvidia.com>


# bef4d418 27-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: add voltage sensor

This commit introduces a voltage sensor for the GPU.

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/voltage_NVIDIA_GB200_GPU_0_Voltage_0",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "voltage_NVIDIA_GB200_GPU_0_Voltage_0",
"Name": "NVIDIA GB200 GPU 0 Voltage 0",
"Reading": 0.735,
"ReadingRangeMax": 4294.967295,
"ReadingRangeMin": 0.0,
"ReadingType": "Voltage",
"ReadingUnits": "V",
"Status": {
"Health": "OK",
"State": "Enabled"
}
}%
```
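
The `ReadingRangeMax` of 4294.967295 in the output above equals the largest
32-bit unsigned value divided by 10^6. That would be consistent with a raw
reading carried as a 32-bit count of microvolts, but this interpretation is
an assumption and is not stated in the commit. The arithmetic can be checked
with a short sketch; `scaledMax` is a hypothetical helper.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: the largest value a raw unsigned counter can report
// once it is divided by a scale factor (raw units per reported unit).
inline double scaledMax(std::uint64_t rawMax, double rawUnitsPerUnit)
{
    return static_cast<double>(rawMax) / rawUnitsPerUnit;
}

// Assumption: a 32-bit raw reading in microvolts, reported in volts.
inline double assumedVoltageRangeMax()
{
    return scaledMax(std::numeric_limits<std::uint32_t>::max(), 1e6);
}
```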

Change-Id: I3d98f3d7c11221a42460c6f8420c927c1b1711b2
Signed-off-by: Harshit Aghera <haghera@nvidia.com>


# 775199d2 27-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: add energy sensor

This commit introduces an energy sensor for the GPU.

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/energy_NVIDIA_GB200_GPU_0_Energy_0",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "energy_NVIDIA_GB200_GPU_0_Energy_0",
"Name": "NVIDIA GB200 GPU 0 Energy 0",
"Reading": 269947.856,
"ReadingRangeMax": 1.8446744073709552e+16,
"ReadingRangeMin": 0.0,
"ReadingType": "EnergyJoules",
"ReadingUnits": "J",
"Status": {
"Health": "OK",
"State": "Enabled"
}
}%
```

Change-Id: I6f53ab2a83eedd54005bbdcd781dc8d320d7f26a
Signed-off-by: Harshit Aghera <haghera@nvidia.com>


# 902c649b 08-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: add power sensor

This patch adds support for fetching the power sensor value from the GPU.

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/power_NVIDIA_GB200_GPU_0_Power_0",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "power_NVIDIA_GB200_GPU_0_Power_0",
"Name": "NVIDIA GB200 GPU 0 Power 0",
"Reading": 27.181,
"ReadingRangeMax": 4294967.295,
"ReadingRangeMin": 0.0,
"ReadingType": "Power",
"ReadingUnits": "W",
"Status": {
"Health": "OK",
"State": "Enabled"
}
}%
```

Change-Id: Ic227a0056daa68ab2239a609ed20c7ed2f6bd2c5
Signed-off-by: Harshit Aghera <haghera@nvidia.com>


# 5e7deccd 07-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: add thresholds support to TLimit

This patch introduces support for retrieving GPU TLimit thresholds
directly from the GPU device. TLimit Temperature represents the
difference in degrees Celsius between the current GPU temperature and
the initial throttle threshold. The patch also enables the extraction of
three critical throttle thresholds — Warning Low, Critical Low, and Hard
Shutdown Low — from the GPU hardware.
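
The relationship described above can be illustrated with a small sketch. All
names here are hypothetical, and two details are assumptions: that TLimit is
the headroom left before throttling (threshold minus current temperature),
and that a low threshold trips when the reading is at or below it.

```cpp
// Hypothetical types; a sketch, not the dbus-sensors implementation.
enum class TLimitState
{
    Ok,
    Warning,
    Critical,
    HardShutdown
};

struct TLimitThresholds
{
    double warningLow;
    double criticalLow;
    double hardShutdownLow;
};

// Assumed sign convention: a positive result means the GPU is still below
// the initial throttle threshold.
inline double tLimit(double throttleThresholdC, double gpuTempC)
{
    return throttleThresholdC - gpuTempC;
}

// Check the most severe threshold first, since the reading shrinks as the
// GPU heats up. Assumes a threshold trips at "reading <= threshold".
inline TLimitState classify(double reading, const TLimitThresholds& t)
{
    if (reading <= t.hardShutdownLow)
    {
        return TLimitState::HardShutdown;
    }
    if (reading <= t.criticalLow)
    {
        return TLimitState::Critical;
    }
    if (reading <= t.warningLow)
    {
        return TLimitState::Warning;
    }
    return TLimitState::Ok;
}
```

Because the reading is a margin rather than an absolute temperature, only
low-side thresholds are meaningful, which matches the three "Low" thresholds
named in the commit message.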

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
"Name": "NVIDIA GB200 GPU 0 TEMP 1",
"Reading": 57.3984375,
"ReadingRangeMax": 127.0,
"ReadingRangeMin": -128.0,
"ReadingType": "Temperature",
"ReadingUnits": "Cel",
"Status": {
"Health": "OK",
"State": "Enabled"
},
"Thresholds": {
"LowerCaution": {
"Reading": 0.0
},
"LowerCritical": {
"Reading": 0.0
},
"LowerFatal": {
"Reading": 0.0
}
}
}%
```

Change-Id: I6f2ff2652ce9246287f9bd63c4297d9ad3229963
Signed-off-by: Harshit Aghera <haghera@nvidia.com>


# ba138dae 05-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: add TLimit sensor

This commit introduces a new thermal limit (TLimit) sensor for the GPU,
enhancing the existing temperature monitoring capabilities.

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.

https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -s -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_1",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_1",
"Name": "NVIDIA GB200 GPU 0 TEMP 1",
"Reading": 57.3984375,
"ReadingRangeMax": 127.0,
"ReadingRangeMin": -128.0,
"ReadingType": "Temperature",
"ReadingUnits": "Cel",
"Status": {
"Health": "OK",
"State": "Enabled"
}
}%
```

Change-Id: Ib8e0ef93a4acbb8870671665b098fb61d0205cb2
Signed-off-by: Harshit Aghera <haghera@nvidia.com>


# 4ecdfaaa 22-May-2025 Harshit Aghera <haghera@nvidia.com>

nvidia-gpu: introduce notion of a device

Perform device discovery tasks only once per device to prepare for
introducing additional GPU sensors.

In the current implementation, sensor updates and device discovery via
MCTP are managed within a single class for simplicity. However, since a
GPU device typically includes multiple sensors, performing device
discovery for each individual sensor is inefficient. Instead, it would
be more effective to execute device discovery once per device.
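
The idea of discovering once per device can be sketched as follows. This is
an illustration under stated assumptions, not the classes from this commit:
`GpuDevice`, `Sensor`, and the `discover` callable are hypothetical, and the
callable stands in for whatever MCTP endpoint lookup the real code performs.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sensor record: a name plus the endpoint ID it talks to.
struct Sensor
{
    std::string name;
    std::uint8_t eid;
};

// Hypothetical device class: discovery runs once, in the constructor, and
// every sensor created afterwards reuses the discovered endpoint ID.
class GpuDevice
{
  public:
    GpuDevice(std::string name, const std::function<std::uint8_t()>& discover) :
        name_(std::move(name)), eid_(discover())
    {}

    // Adding a sensor does not trigger another discovery.
    void addSensor(const std::string& suffix)
    {
        sensors_.push_back(Sensor{name_ + "_" + suffix, eid_});
    }

    const std::vector<Sensor>& sensors() const
    {
        return sensors_;
    }

  private:
    std::string name_;
    std::uint8_t eid_;
    std::vector<Sensor> sensors_;
};
```

With this shape, adding a further sensor costs one `addSensor` call rather
than another round of discovery.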

Tested: Built an image for the gb200nvl-obmc machine with the following
patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422

```
$ curl -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0
{
"@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU_0_TEMP_0",
"@odata.type": "#Sensor.v1_2_0.Sensor",
"Id": "temperature_NVIDIA_GB200_GPU_0_TEMP_0",
"Name": "NVIDIA GB200 GPU 0 TEMP 0",
"Reading": 37.6875,
"ReadingRangeMax": 127.0,
"ReadingRangeMin": -128.0,
"ReadingType": "Temperature",
"ReadingUnits": "Cel",
"Status": {
"Health": "OK",
"State": "Enabled"
}
}%
```

Change-Id: Ie3dcd43caa031b4aaa61d8be3f5d71aefd53bc9a
Signed-off-by: Harshit Aghera <haghera@nvidia.com>
