### NVMe-MI over SMBus

Author:
  Tony Lee <tony.lee@quantatw.com>
Primary assignee:
  Tony Lee <tony.lee@quantatw.com>
Created:
  3-8-2019

#### Problem Description

Currently, OpenBMC does not expose NVMe drive information. The NVMe-MI
specification defines a command that reads NVMe drive information directly
over SMBus. An NVMe drive can report information and status such as vendor ID
and temperature. The aim of this proposal is to let users monitor NVMe drives
so that appropriate action can be taken.

#### Background and References

The NVMe-MI specification defines a command called
`NVM Express Basic Management Command` that reads NVMe drive information
directly over SMBus [1]. This command uses the SMBus Block Read protocol
defined by the SMBus specification [2].

Because the purpose of this proposal is to retrieve NVMe drive information, it
uses the NVM Express Basic Management Command described in the NVMe-MI
specification to communicate with NVMe drives. The temperature sensor, presence
status, LED, and power sequence handling will be customized per platform.
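As a rough illustration of what the daemon receives, the following Python sketch decodes a block returned by the Basic Management Command. The byte layout (a length byte followed by Status Flags, SMART Warnings, Temperature, and Percentage Drive Life Used) and the temperature encoding are written from memory of Appendix A of [1] and are assumptions to verify against the specification. `DriveStatus`, `decode_temperature`, and `parse_status_block` are hypothetical names, not part of any OpenBMC API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriveStatus:
    status_flags: int
    smart_warnings: int
    temperature_c: Optional[int]  # None when the drive reports no usable value
    drive_life_used_pct: int


def decode_temperature(raw: int) -> Optional[int]:
    """Convert the one-byte temperature field to degrees Celsius.

    Assumed encoding: 0x00-0x7F is 0..127 C, 0xC4-0xFF is a two's
    complement value (-60..-1 C), anything in between means no data,
    sensor failure, or a reserved code.
    """
    if raw <= 0x7F:
        return raw
    if raw >= 0xC4:
        return raw - 256
    return None


def parse_status_block(block: bytes) -> DriveStatus:
    """Parse a block read whose first byte is the SMBus block length."""
    if not block:
        raise ValueError("empty response")
    length = block[0]
    # Four data bytes are needed; the block must be as long as it claims.
    if length < 4 or len(block) < 1 + length:
        raise ValueError("short NVMe basic management response")
    return DriveStatus(
        status_flags=block[1],
        smart_warnings=block[2],
        temperature_c=decode_temperature(block[3]),
        drive_life_used_pct=block[4],
    )
```

Returning `None` for the special temperature codes keeps "no reading" distinct from a real 0 C reading, which matters when the value is later published as a sensor.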

- [1] NVM Express Management Interface, Revision 1.0a, April 8, 2017,
  Appendix A.
  (https://nvmexpress.org/wp-content/uploads/NVM_Express_Management_Interface_1_0a_2017.04.08_-_gold.pdf)
- [2] System Management Bus (SMBus) Specification, Version 3.0, 20 Dec 2014.
  (http://smbus.org/specs/SMBus_3_0_20141220.pdf)

#### Requirements

The implementation should:

- Provide a daemon to monitor NVMe drives. The parameters to be monitored are
  Status Flags, SMART Warnings, Temperature, Percentage Drive Life Used, Vendor
  ID, and Serial Number.
- Provide a D-bus interface to allow other services to access the data.
- Communicate with NVMe drives over the I2C hardware channel.
- Turn the fault LED of each drive on or off based on SmartWarnings, if the
  object path of the fault LED is defined in the configuration file.

#### Proposed Design

Create a D-bus service "xyz.openbmc_project.nvme.manager" with an object path
for each NVMe sensor: "/xyz/openbmc_project/sensors/temperature/nvme0",
"/xyz/openbmc_project/sensors/temperature/nvme1", etc.
A JSON configuration file defines the drive index, bus ID, fault LED object
path, present pin, and power good pin for each drive.
For example, the entries for two drives look like:

```json
{
  "NvmeDriveIndex": 0,
  "NVMeDriveBusID": 16,
  "NVMeDriveFaultLEDGroupPath": "/xyz/openbmc_project/led/groups/led_u2_0_fault",
  "NVMeDrivePresentPin": 148,
  "NVMeDrivePwrGoodPin": 161
},
{
  "NvmeDriveIndex": 1,
  "NVMeDriveBusID": 17,
  "NVMeDriveFaultLEDGroupPath": "/xyz/openbmc_project/led/groups/led_u2_0_fault",
  "NVMeDrivePresentPin": 149,
  "NVMeDrivePwrGoodPin": 162
}
```
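A minimal sketch of how such a configuration could be read is shown here in Python. It assumes the per-drive entries are wrapped in a top-level JSON array, which the fragment above does not show, and it uses the key names exactly as spelled in the example. `load_drives` and `EXAMPLE_CONFIG` are hypothetical names, and the fault LED path is treated as optional because the requirements only act on it when it is defined.

```python
import json

# Hypothetical file content: the entries from the example, wrapped in an array.
# The second entry omits the fault LED path to show the optional case.
EXAMPLE_CONFIG = """
[
  {"NvmeDriveIndex": 0, "NVMeDriveBusID": 16,
   "NVMeDriveFaultLEDGroupPath": "/xyz/openbmc_project/led/groups/led_u2_0_fault",
   "NVMeDrivePresentPin": 148, "NVMeDrivePwrGoodPin": 161},
  {"NvmeDriveIndex": 1, "NVMeDriveBusID": 17,
   "NVMeDrivePresentPin": 149, "NVMeDrivePwrGoodPin": 162}
]
"""


def load_drives(text):
    """Return a dict keyed by drive index; raise on missing or duplicate keys."""
    drives = {}
    for entry in json.loads(text):
        index = entry["NvmeDriveIndex"]
        if index in drives:
            raise ValueError(f"duplicate drive index {index}")
        drives[index] = {
            "bus": entry["NVMeDriveBusID"],
            # None means no fault LED is configured for this drive.
            "fault_led": entry.get("NVMeDriveFaultLEDGroupPath"),
            "present_pin": entry["NVMeDrivePresentPin"],
            "pwr_good_pin": entry["NVMeDrivePwrGoodPin"],
        }
    return drives
```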

The resulting D-bus structure looks like:

Under the D-bus service named "xyz.openbmc_project.nvme.manager":

```
/xyz/openbmc_project
    └─/xyz/openbmc_project/sensors
      └─/xyz/openbmc_project/sensors/temperature/nvme0
```

The object `/xyz/openbmc_project/sensors/temperature/nvme0` implements:

- xyz.openbmc_project.Sensor.Value
- xyz.openbmc_project.Sensor.Threshold.Warning
- xyz.openbmc_project.Sensor.Threshold.Critical

Under the D-bus service named "xyz.openbmc_project.Inventory.Manager":

```
/xyz/openbmc_project
    └─/xyz/openbmc_project/inventory
      └─/xyz/openbmc_project/inventory/system
        └─/xyz/openbmc_project/inventory/system/chassis
          └─/xyz/openbmc_project/inventory/system/chassis/motherboard
            └─/xyz/openbmc_project/inventory/system/chassis/motherboard/nvme0
```

The object `/xyz/openbmc_project/inventory/system/chassis/motherboard/nvme0`
implements:

- xyz.openbmc_project.Inventory.Item
- xyz.openbmc_project.Inventory.Decorator.Asset
- xyz.openbmc_project.Nvme.Status

Interface `xyz.openbmc_project.Sensor.Value` is used, as with hwmon sensors, to
monitor temperature and has the following properties:

| Property | Type   | Description          |
| -------- | ------ | -------------------- |
| MaxValue | int64  | Sensor maximum value |
| MinValue | int64  | Sensor minimum value |
| Scale    | int64  | Sensor value scale   |
| Unit     | string | Sensor unit          |
| Value    | int64  | Sensor value         |

Interface `xyz.openbmc_project.Nvme.Status` with the following properties:

| Property          | Type   | Description                                                     |
| ----------------- | ------ | --------------------------------------------------------------- |
| SmartWarnings     | string | SMART warnings reported by the drive                            |
| StatusFlags       | string | Status flags reported by the drive                              |
| DriveLifeUsed     | string | A vendor-specific estimate of the percentage of drive life used |
| TemperatureFault  | bool   | True if a temperature warning has occurred                      |
| BackupdrivesFault | bool   | True if a backup device warning has occurred                    |
| CapacityFault     | bool   | True if a capacity warning has occurred                         |
| DegradesFault     | bool   | True if a degradation warning has occurred                      |
| MediaFault        | bool   | True if a media warning has occurred                            |

Interface `xyz.openbmc_project.Inventory.Item` with the following properties:

| Property   | Type   | Description                        |
| ---------- | ------ | ---------------------------------- |
| PrettyName | string | The human readable name of the item |
| Present    | bool   | Whether or not the item is present |

Interface `xyz.openbmc_project.Inventory.Decorator.Asset` with the following
properties:

| Property     | Type   | Description                                       |
| ------------ | ------ | ------------------------------------------------- |
| PartNumber   | string | The item part number, typically a stocking number |
| SerialNumber | string | The item serial number                            |
| Manufacturer | string | The item manufacturer                             |
| BuildDate    | string | The date of item manufacture in YYYYMMDD format   |
| Model        | string | The model of the item                             |

##### xyz.openbmc_project.nvme.manager.service

This service performs the following steps:

1. Register the D-bus service `xyz.openbmc_project.nvme.manager` described
   above.
2. Obtain the drive index, bus ID, GPIO present pin, power good pin, and fault
   LED object path from the JSON file mentioned above.
3. In each cycle, do the following for each drive:
   1. Check whether the present pin of the target drive is true. If it is, the
      drive exists; go to the next step. If not, the drive does not exist;
      remove its object path from D-bus by drive index.
   2. Check whether the power good pin of the target drive is true. If it is,
      the drive is ready; create the object path by drive index and go to the
      next step. If not, the drive power is abnormal; turn on the fault LED
      and log the event in the journal.
   3. Send an NVMe-MI command via the SMBus Block Read protocol on the bus ID
      of the target drive to get data. The data obtained from the NVMe drive
      are "Status Flags", "SMART Warnings", "Temperature",
      "Percentage Drive Life Used", "Vendor ID", and "Serial Number".
   4. Set the data to the corresponding properties on D-bus.

This service runs automatically and polls the NVMe drives every second.
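The per-cycle decision described in step 3 can be sketched as follows. This is only an illustration in Python under stated assumptions: `Action`, `next_action`, and `poll_once` are hypothetical names, the `drives` dictionary shape is invented for the example, and GPIO access, the SMBus read, and the D-bus update are passed in as callables rather than tied to any real library.

```python
from enum import Enum
from typing import Callable, Dict


class Action(Enum):
    REMOVE_OBJECT = "remove object path"
    POWER_FAULT = "turn on fault LED and log"
    READ_DRIVE = "create object path and read drive"


def next_action(present: bool, power_good: bool) -> Action:
    """Decide what one cycle does for one drive (steps 3.1 and 3.2)."""
    if not present:
        return Action.REMOVE_OBJECT
    if not power_good:
        return Action.POWER_FAULT
    return Action.READ_DRIVE


def poll_once(
    drives: Dict[int, dict],
    read_pin: Callable[[int], bool],
    read_drive: Callable[[int], dict],
    publish: Callable[[int, dict], None],
) -> Dict[int, Action]:
    """Run one cycle over every configured drive and report what was done."""
    actions = {}
    for index, cfg in drives.items():
        present = read_pin(cfg["present_pin"])
        # The power good pin is only consulted when the drive is present.
        power_good = present and read_pin(cfg["pwr_good_pin"])
        action = next_action(present, power_good)
        if action is Action.READ_DRIVE:
            # Steps 3.3 and 3.4: read over SMBus, then set D-bus properties.
            publish(index, read_drive(cfg["bus"]))
        actions[index] = action
    return actions
```

Keeping the decision in a pure function such as `next_action` makes the present and power good handling easy to unit test without hardware; a real daemon would call something like `poll_once` from a one-second timer.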

##### Fault LED

When the value obtained from the command corresponds to one of the warning
types, the service triggers the fault LED of the corresponding device and
issues events.

##### Add SEL related to NVMe

The events `TemperatureFault`, `BackupdrivesFault`, `CapacityFault`,
`DegradesFault`, and `MediaFault` are generated for NVMe errors.

- Temperature Fault log: when the property `TemperatureFault` is set to true
- Backupdrives Fault log: when the property `BackupdrivesFault` is set to true
- Capacity Fault log: when the property `CapacityFault` is set to true
- Degrades Fault log: when the property `DegradesFault` is set to true
- Media Fault log: when the property `MediaFault` is set to true
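One way to derive these five properties and the fault LED state from the SMART Warnings byte is sketched here in Python. The bit positions and the inverted encoding (a cleared bit signals the warning) are assumptions based on a reading of the Basic Management Command description and should be checked against the NVMe-MI specification; `FAULT_BITS`, `decode_smart_warnings`, and `fault_led_should_be_on` are hypothetical names.

```python
# Assumed bit positions, mirroring the NVMe SMART Critical Warning field.
FAULT_BITS = {
    "CapacityFault": 0,
    "TemperatureFault": 1,
    "DegradesFault": 2,
    "MediaFault": 3,
    "BackupdrivesFault": 4,
}


def decode_smart_warnings(raw: int) -> dict:
    """Map the SMART Warnings byte to the boolean fault properties.

    Assumption: each bit is inverted, so a 0 bit means the warning is active.
    """
    return {name: ((raw >> bit) & 1) == 0 for name, bit in FAULT_BITS.items()}


def fault_led_should_be_on(faults: dict) -> bool:
    """The fault LED is asserted when any warning type is active."""
    return any(faults.values())
```

Under these assumptions a byte of `0xFF` reports no faults, while `0xFD` (bit 1 cleared) sets only `TemperatureFault` and turns the fault LED on.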

#### Alternatives Considered

The NVMe-MI specification defines multiple commands that can communicate with
NVMe drives over the MCTP protocol. NVMe-MI over MCTP has the following key
capabilities:

- Discover the drives that are present and learn the capabilities of each
  drive.
- Store data about the host environment, enabling a Management Controller to
  query the data later.
- A standard format for VPD and defined mechanisms to read/write VPD contents.
- Inventorying, configuring, and monitoring.

For monitoring NVMe drives, using the NVM Express Basic Management Command
directly over SMBus is much simpler than using NVMe-MI over the MCTP protocol.

#### Impacts

This application monitors NVMe drives via SMBus and sets the values on D-bus.
The impact on the system should be small.

#### Testing

This implementation uses the NVMe-MI Basic command over SMBus and then sets the
response data on D-bus.
Testing will send the SMBus command to the drives to get the information and
compare it with the properties on D-bus to make sure they are the same.
The testing can be performed on NVMe drives from different manufacturers, for
example Intel P4500/P4600 and Micron 9200 Max/Pro.

Unit tests will test by function:

- Test that the length of the response data matches the design in the function
  that gets NVMe information.
- Test that the function that sets values on D-bus behaves as designed.
- Test that the function that turns the corresponding LED on or off behaves as
  designed for different SmartWarnings values.