560e6af7 | 21-Apr-2025 | Harshit Aghera <haghera@nvidia.com>
nvidia-gpu: add support for communication to the endpoint
This commit uses the MCTP VDM protocol to read the temperature sensor value from the GPU.
The MCTP VDM protocol is an extension of the OCP Accelerator Management Interface specification. [1]
Tested: Build an image for the gb200nvl-obmc machine with the following patches cherry-picked. These patches are needed to enable the MCTP stack.
https://gerrit.openbmc.org/c/openbmc/openbmc/+/79422
Restart the nvidiagpusensor service.
```
root@gb200nvl-obmc:~# systemctl start xyz.openbmc_project.nvidiagpusensor.service
```
The app detects the entity-manager configuration on the gb200nvl-obmc machine and all the endpoints from the MCTP service D-Bus tree. It reads the temperature sensor value from the GPU correctly, and the temperature sensor is also present on Redfish.
```
$ curl -k -u 'root:0penBmc' https://10.137.203.137/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU",
  "Name": "NVIDIA GB200 GPU",
  "Reading": 36.4375,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {
    "Health": "OK",
    "State": "Enabled"
  }
}%

root@gb200nvl-obmc:~# busctl tree xyz.openbmc_project.GpuSensor
└─ /xyz
  └─ /xyz/openbmc_project
    └─ /xyz/openbmc_project/sensors
      └─ /xyz/openbmc_project/sensors/temperature
        └─ /xyz/openbmc_project/sensors/temperature/NVIDIA_GB200_GPU

root@gb200nvl-obmc:~# busctl introspect xyz.openbmc_project.GpuSensor /xyz/openbmc_project/sensors/temperature/NVIDIA_GB200_GPU
NAME                                                  TYPE      SIGNATURE  RESULT/VALUE                             FLAGS
org.freedesktop.DBus.Introspectable                   interface -          -                                        -
.Introspect                                           method    -          s                                        -
org.freedesktop.DBus.Peer                             interface -          -                                        -
.GetMachineId                                         method    -          s                                        -
.Ping                                                 method    -          -                                        -
org.freedesktop.DBus.Properties                       interface -          -                                        -
.Get                                                  method    ss         v                                        -
.GetAll                                               method    s          a{sv}                                    -
.Set                                                  method    ssv        -                                        -
.PropertiesChanged                                    signal    sa{sv}as   -                                        -
xyz.openbmc_project.Association.Definitions           interface -          -                                        -
.Associations                                         property  a(sss)     1 "chassis" "all_sensors" "/xyz/openbmc… emits-change
xyz.openbmc_project.Sensor.Value                      interface -          -                                        -
.MaxValue                                             property  d          127                                      emits-change
.MinValue                                             property  d          -128                                     emits-change
.Unit                                                 property  s          "xyz.openbmc_project.Sensor.Value.Unit.… emits-change
.Value                                                property  d          36.3125                                  emits-change writable
xyz.openbmc_project.Sensor.ValueMutability            interface -          -                                        -
.Mutable                                              property  b          true                                     emits-change
xyz.openbmc_project.State.Decorator.Availability      interface -          -                                        -
.Available                                            property  b          true                                     emits-change writable
xyz.openbmc_project.State.Decorator.OperationalStatus interface -          -                                        -
.Functional                                           property  b          true                                     emits-change
```
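The manual check above could also be scripted. The following is a minimal sketch, not part of this change: it validates a Redfish sensor payload of the shape shown above and derives the matching D-Bus object path. The helper names `check_temperature` and `dbus_path_for` are hypothetical, and the `<type>_<name>` split of the Redfish `Id` is an assumption inferred only from the two outputs above.

```python
import json

# Payload copied from the Redfish response shown in this commit message.
SAMPLE = """
{
  "@odata.id": "/redfish/v1/Chassis/NVIDIA_GB200_1/Sensors/temperature_NVIDIA_GB200_GPU",
  "@odata.type": "#Sensor.v1_2_0.Sensor",
  "Id": "temperature_NVIDIA_GB200_GPU",
  "Name": "NVIDIA GB200 GPU",
  "Reading": 36.4375,
  "ReadingRangeMax": 127.0,
  "ReadingRangeMin": -128.0,
  "ReadingType": "Temperature",
  "ReadingUnits": "Cel",
  "Status": {"Health": "OK", "State": "Enabled"}
}
"""

SENSOR_ROOT = "/xyz/openbmc_project/sensors"


def check_temperature(payload):
    """Return the reading if the payload is a healthy, in-range temperature sensor."""
    if payload.get("ReadingType") != "Temperature":
        raise ValueError("not a temperature sensor")
    if payload.get("ReadingUnits") != "Cel":
        raise ValueError("unexpected units")
    status = payload.get("Status", {})
    if status.get("Health") != "OK" or status.get("State") != "Enabled":
        raise ValueError("sensor is not healthy and enabled")
    reading = float(payload["Reading"])
    low = float(payload["ReadingRangeMin"])
    high = float(payload["ReadingRangeMax"])
    if not low <= reading <= high:
        raise ValueError("reading outside advertised range")
    return reading


def dbus_path_for(sensor_id):
    """Map a Redfish sensor Id to a D-Bus path (assumed '<type>_<name>' pattern)."""
    sensor_type, sep, name = sensor_id.partition("_")
    if not sep or not name:
        raise ValueError("sensor id does not match '<type>_<name>'")
    return f"{SENSOR_ROOT}/{sensor_type}/{name}"


payload = json.loads(SAMPLE)
reading = check_temperature(payload)
path = dbus_path_for(payload["Id"])
print(reading, path)
```

On a live system the payload would come from the curl command above rather than from the embedded sample, and the derived path could be compared against the `busctl tree` output.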
[1] https://www.opencompute.org/documents/ocp-gpu-accelerator-management-interfaces-v1-pdf
Change-Id: Ied938b9e5c19751ee283b4b948e16c905c78fb48
Signed-off-by: Harshit Aghera <haghera@nvidia.com>