# NVMe-MI over SMBus

Author: Tony Lee <tony.lee@quantatw.com>

Created: 3-8-2019

## Problem Description

Currently, OpenBMC does not support reading NVMe drive information. The NVMe-MI
specification defines a command that reads NVMe drive information directly over
SMBus. A drive can report identification and status data, such as its vendor ID
and temperature. The aim of this proposal is to allow users to monitor NVMe
drives so that appropriate action can be taken.

## Background and References

The NVMe-MI specification defines a command called
`NVM Express Basic Management Command` that reads NVMe drive information
directly over SMBus [1]. This command uses the SMBus Block Read protocol
defined by the SMBus specification [2].

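To make the Block Read response concrete, the following Python sketch decodes
the status block returned for command code 0. The offsets and the length value
of 6 follow one reading of Appendix A of [1] and should be verified against the
specification; `BasicStatus` and `parse_status_block` are hypothetical names
used only for illustration.

```python
from dataclasses import dataclass

# Assumed payload length reported by the drive for command code 0.
STATUS_BLOCK_LEN = 6


@dataclass
class BasicStatus:
    status_flags: int
    smart_warnings: int
    temperature_raw: int
    drive_life_used: int


def parse_status_block(block):
    """Decode a Block Read response: a length byte followed by the payload."""
    if len(block) < 1 + STATUS_BLOCK_LEN:
        raise ValueError("response is too short")
    if block[0] != STATUS_BLOCK_LEN:
        raise ValueError(f"unexpected length byte {block[0]}")
    # Assumed order: Status Flags, SMART Warnings, Temperature,
    # Percentage Drive Life Used, then two reserved bytes.
    return BasicStatus(block[1], block[2], block[3], block[4])
```

The caller would pass the bytes returned by the I2C transfer, including the
leading length byte. A trailing PEC byte, if present, is ignored here.
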
Because the goal is to retrieve NVMe drive information, this design uses the
NVM Express Basic Management Command described in the NVMe-MI specification to
communicate with the drives. The temperature sensor, presence status, LED, and
power sequence are platform specific and will be customized per platform.

[1]:
  https://nvmexpress.org/wp-content/uploads/NVM_Express_Management_Interface_1_0a_2017.04.08_-_gold.pdf
  "NVM Express Management Interface Revision 1.0a April 8, 2017 in Appendix A."
[2]:
  http://smbus.org/specs/SMBus_3_0_20141220.pdf
  "System Management Bus (SMBus) Specification Version 3.0 20 Dec 2014"

## Requirements

The implementation should:

- Provide a daemon to monitor NVMe drives. The monitored parameters are Status
  Flags, SMART Warnings, Temperature, Percentage Drive Life Used, Vendor ID, and
  Serial Number.
- Provide a D-Bus interface that allows other services to access the data.
- Communicate with NVMe drives over the I2C hardware channel.
- Turn the fault LED of each drive on or off according to `SmartWarnings`, if
  the fault LED object path is defined in the configuration file.

## Proposed Design

Create a D-Bus service "xyz.openbmc_project.nvme.manager" with an object path
for each NVMe sensor: "/xyz/openbmc_project/sensors/temperature/nvme0",
"/xyz/openbmc_project/sensors/temperature/nvme1", etc. A JSON configuration
file provides the drive index, bus ID, fault LED object path, present pin, and
power good pin for each drive. For example:

```json
{
  "NvmeDriveIndex": 0,
  "NVMeDriveBusID": 16,
  "NVMeDriveFaultLEDGroupPath": "/xyz/openbmc_project/led/groups/led_u2_0_fault",
  "NVMeDrivePresentPin": 148,
  "NVMeDrivePwrGoodPin": 161
},
{
  "NvmeDriveIndex": 1,
  "NVMeDriveBusID": 17,
  "NVMeDriveFaultLEDGroupPath": "/xyz/openbmc_project/led/groups/led_u2_0_fault",
  "NVMeDrivePresentPin": 149,
  "NVMeDrivePwrGoodPin": 162
}
```

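A minimal sketch of how such a configuration could be loaded and validated is
shown below. It assumes the per-drive entries are wrapped in a top-level JSON
array, which the fragment above does not show, and it treats the fault LED path
as optional, in line with the requirements. `load_drive_configs` is a
hypothetical helper.

```python
import json

# Keys taken from the example; the fault LED path is treated as optional.
REQUIRED_KEYS = (
    "NvmeDriveIndex",
    "NVMeDriveBusID",
    "NVMeDrivePresentPin",
    "NVMeDrivePwrGoodPin",
)


def load_drive_configs(text):
    """Parse the configuration text and return the entries keyed by index."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of drive entries")
    drives = {}
    for entry in entries:
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"drive entry is missing keys: {missing}")
        index = entry["NvmeDriveIndex"]
        if index in drives:
            raise ValueError(f"duplicate drive index {index}")
        drives[index] = entry
    return drives
```

Rejecting duplicate indexes early avoids two drives competing for the same
`nvmeN` object path.
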
The resulting structure looks like this.

Under the D-Bus name "xyz.openbmc_project.nvme.manager":

```text
/xyz/openbmc_project
└─/xyz/openbmc_project/sensors
  └─/xyz/openbmc_project/sensors/temperature/nvme0
```

The object `/xyz/openbmc_project/sensors/temperature/nvme0` implements:

- xyz.openbmc_project.Sensor.Value
- xyz.openbmc_project.Sensor.Threshold.Warning
- xyz.openbmc_project.Sensor.Threshold.Critical

Under the D-Bus name "xyz.openbmc_project.Inventory.Manager":

```text
/xyz/openbmc_project
└─/xyz/openbmc_project/inventory
  └─/xyz/openbmc_project/inventory/system
    └─/xyz/openbmc_project/inventory/system/chassis
      └─/xyz/openbmc_project/inventory/system/chassis/motherboard
        └─/xyz/openbmc_project/inventory/system/chassis/motherboard/nvme0
```

The object `/xyz/openbmc_project/inventory/system/chassis/motherboard/nvme0`
implements:

- xyz.openbmc_project.Inventory.Item
- xyz.openbmc_project.Inventory.Decorator.Asset
- xyz.openbmc_project.Nvme.Status

Interface `xyz.openbmc_project.Sensor.Value` is used for hwmon-style
temperature monitoring and has the following properties:

| Property | Type   | Description          |
| -------- | ------ | -------------------- |
| MaxValue | int64  | Sensor maximum value |
| MinValue | int64  | Sensor minimum value |
| Scale    | int64  | Sensor value scale   |
| Unit     | string | Sensor unit          |
| Value    | int64  | Sensor value         |

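The raw temperature byte from the drive needs decoding before it can populate
`Value`. The sketch below assumes the encoding from Appendix A of [1] as
recalled here (0x00 to 0x7E as degrees Celsius, 0x7F as 127 or higher, 0x80 and
0x81 as no data or sensor failure, 0xC4 to 0xFF as negative two's complement
values). It also assumes that the real reading equals `Value` multiplied by 10
to the power of `Scale`. Both assumptions and the function names are
illustrative and should be checked before use.

```python
def decode_temperature(raw):
    """Return degrees Celsius, or None when no usable reading is available."""
    if not 0 <= raw <= 0xFF:
        raise ValueError("temperature must be a single byte")
    if raw <= 0x7F:
        return raw          # 0x7F stands for 127 C or higher
    if raw >= 0xC4:
        return raw - 0x100  # two's complement, 0xC4 is -60 C or lower
    return None             # 0x80 no data, 0x81 sensor failure, rest reserved


def to_sensor_value(celsius, scale):
    """Convert Celsius to the integer stored in Value for a given Scale."""
    return round(celsius * 10 ** (-scale))
```

With a `Scale` of -3, a reading of 30 degrees Celsius would be stored as 30000.
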
Interface `xyz.openbmc_project.Nvme.Status` with the following properties:

| Property          | Type   | Description                                                   |
| ----------------- | ------ | ------------------------------------------------------------- |
| SmartWarnings     | string | SMART Warnings value reported by the drive                    |
| StatusFlags       | string | Status Flags value reported by the drive                      |
| DriveLifeUsed     | string | Vendor-specific estimate of the percentage of drive life used |
| TemperatureFault  | bool   | True when a warning about temperature is reported             |
| BackupdrivesFault | bool   | True when a warning about backup drives is reported           |
| CapacityFault     | bool   | True when a warning about capacity is reported                |
| DegradesFault     | bool   | True when a warning about degradation is reported             |
| MediaFault        | bool   | True when a warning about media is reported                   |

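The five boolean properties can be derived from the SMART Warnings byte. The
sketch below assumes the bit assignment from Appendix A of [1] as recalled
here, where a cleared bit (0) signals the warning and 0xFF means no warning.
The mapping of bits to property names is an assumption to verify against the
specification.

```python
# Assumed bit position for each property; a cleared bit means "fault".
FAULT_BITS = {
    "CapacityFault": 0,
    "TemperatureFault": 1,
    "DegradesFault": 2,
    "MediaFault": 3,
    "BackupdrivesFault": 4,
}


def decode_smart_warnings(raw):
    """Map the SMART Warnings byte to the boolean fault properties."""
    return {
        name: ((raw >> bit) & 1) == 0
        for name, bit in FAULT_BITS.items()
    }


def any_fault(raw):
    """Return True if at least one warning is active."""
    return any(decode_smart_warnings(raw).values())
```

A true result from `any_fault` would be the condition for turning on the fault
LED of the drive.
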
Interface `xyz.openbmc_project.Inventory.Item` with the following properties:

| Property   | Type   | Description                         |
| ---------- | ------ | ----------------------------------- |
| PrettyName | string | The human readable name of the item |
| Present    | bool   | Whether or not the item is present  |

Interface `xyz.openbmc_project.Inventory.Decorator.Asset` with the following
properties:

| Property     | Type   | Description                                       |
| ------------ | ------ | ------------------------------------------------- |
| PartNumber   | string | The item part number, typically a stocking number |
| SerialNumber | string | The item serial number                            |
| Manufacturer | string | The item manufacturer                             |
| BuildDate    | string | The date of item manufacture in YYYYMMDD format   |
| Model        | string | The model of the item                             |

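The Vendor ID and Serial Number listed in the requirements come from the
identification block of the Basic Management Command. The sketch below assumes
that this block is read at command code 8 and carries a length byte of 22, a
2-byte Vendor ID sent most significant byte first, and a 20-byte ASCII serial
number, as recalled from Appendix A of [1]. `parse_identification_block` is a
hypothetical helper.

```python
# Assumed payload length for the identification block.
IDENT_BLOCK_LEN = 22


def parse_identification_block(block):
    """Return (vendor_id, serial_number) from a Block Read response."""
    if len(block) < 1 + IDENT_BLOCK_LEN:
        raise ValueError("response is too short")
    if block[0] != IDENT_BLOCK_LEN:
        raise ValueError(f"unexpected length byte {block[0]}")
    # Assumed byte order: most significant byte of the Vendor ID first.
    vendor_id = int.from_bytes(block[1:3], "big")
    # Drop padding spaces and NUL bytes around the ASCII serial number.
    serial = block[3:23].decode("ascii", errors="replace").strip(" \x00")
    return vendor_id, serial
```

The serial number could then be written to `SerialNumber`, and the Vendor ID
could be mapped to a manufacturer name by a lookup table.
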
### xyz.openbmc_project.nvme.manager.service

This service performs the following steps:

1. Register the D-Bus name `xyz.openbmc_project.nvme.manager` described
   above.
2. Obtain the drive index, bus ID, GPIO present pin, power good pin, and fault
   LED object path from the JSON file mentioned above.
3. In each cycle, do the following for each drive:
   1. Check whether the present pin of the target drive is true. If it is, the
      drive exists; go to the next step. If not, the drive does not exist;
      remove its object path from D-Bus by drive index.
   2. Check whether the power good pin of the target drive is true. If it is,
      the drive is ready; create the object path by drive index and go to the
      next step. If not, the drive power is abnormal; turn on the fault LED and
      log to the journal.
   3. Send an NVMe-MI command via the SMBus Block Read protocol on the bus ID
      of the target drive to get data. The data read from the drive are
      "Status Flags", "SMART Warnings", "Temperature", "Percentage Drive Life
      Used", "Vendor ID", and "Serial Number".
   4. Set the data on the corresponding D-Bus properties.

The service starts automatically and polls the NVMe drives every second.

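The cycle described above can be sketched as follows. Hardware access is
injected as callables so that the control flow can be exercised without GPIOs
or an I2C bus. All function and parameter names are hypothetical, and the
dictionary `objects` merely stands in for the D-Bus object paths.

```python
def poll_drive(index, is_present, is_power_good, read_drive,
               set_fault_led, log, objects):
    """Run one monitoring cycle for a single drive and return its state."""
    # Step 1: an absent drive loses its object path.
    if not is_present(index):
        objects.pop(index, None)
        return "absent"
    # Step 2: abnormal power turns on the fault LED and is logged.
    if not is_power_good(index):
        set_fault_led(index, True)
        log(f"nvme{index}: power good pin is not asserted")
        return "power-fault"
    # Steps 3 and 4: read the drive and publish the data.
    objects[index] = read_drive(index)
    return "ok"
```

A scheduler would call `poll_drive` once per second for every configured
drive.
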
### Fault LED

When the value obtained from the command corresponds to one of the warning
types, the service turns on the fault LED of the corresponding device and
issues an event.

### Add SEL related to NVMe

The events `TemperatureFault`, `BackupdrivesFault`, `CapacityFault`,
`DegradesFault`, and `MediaFault` are generated for NVMe errors:

- Temperature Fault log: when the property `TemperatureFault` is set to true
- Backupdrives Fault log: when the property `BackupdrivesFault` is set to true
- Capacity Fault log: when the property `CapacityFault` is set to true
- Degrades Fault log: when the property `DegradesFault` is set to true
- Media Fault log: when the property `MediaFault` is set to true

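Because the service polls the drives every second, logging on every cycle in
which a property is true would repeat the same entry. One possible approach,
which is an assumption rather than something this design specifies, is to log
only when a property changes from false to true. `new_fault_events` is a
hypothetical helper.

```python
def new_fault_events(previous, current):
    """Return the fault names that changed from false to true."""
    return [
        name
        for name, active in current.items()
        if active and not previous.get(name, False)
    ]
```

The caller would keep the fault properties from the previous cycle per drive
and create one log entry for each name returned.
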
## Alternatives Considered

The NVMe-MI specification also defines multiple commands that communicate with
NVMe drives over the MCTP protocol. NVMe-MI over MCTP has the following key
capabilities:

- Discover the drives that are present and learn the capabilities of each
  drive.
- Store data about the host environment so that a Management Controller can
  query the data later.
- A standard format for VPD and defined mechanisms to read/write VPD contents.
- Inventorying, configuring, and monitoring.

For monitoring NVMe drives, using the NVM Express Basic Management Command
directly over SMBus is much simpler than NVMe-MI over MCTP.

## Impacts

This application monitors NVMe drives via SMBus and sets the values on D-Bus.
The impact on the system should be small.

## Testing

This implementation uses the NVMe-MI Basic Management Command over SMBus and
sets the response data on D-Bus. Testing will send the SMBus command to the
drives to get the information and compare it with the D-Bus properties to make
sure they match. The testing can be performed on NVMe drives from different
manufacturers, for example Intel P4500/P4600 and Micron 9200 Max/Pro.

Unit tests are organized by function:

- Test that the length of the response data matches the design in the function
  that gets NVMe information.
- Test that the function that sets values on D-Bus behaves as designed.
- Test that the function that turns the corresponding LED on or off responds
  correctly to different `SmartWarnings` values.

221