# Guard on BMC

## Problem Description
On systems with multiple processor units and other redundant vital resources,
system downtime can be prevented by isolating faulty components. This
document discusses the design of the BMC support for such isolation procedures.
Defective components can be kept isolated until they are replaced. Most of the
actions required to isolate a part depend on the architecture and are
handled by the host. However, the BMC needs to support a few steps, such as
notifying users about the components in isolation, clearing isolation, isolating
a suspected part, or isolating a part when the host is down due to a severe
fault. This document refers to the isolation process as guard, which means
guarding the system from faulty components.

## Glossary
**Guard**: Guarding the system against failures by permanently isolating faulty
units.

**Guard Records**: A file in persistent storage that lists the
permanently isolated components.

**Manual guard**: An operation to manually add a unit to the list of isolated
units. This operation is helpful for isolating a suspected component without
physically removing it from the server.

## Background and References

Guard on servers manages a record of faulty components to keep
them out of service. The list of faulty but guarded components can be stored in
multiple locations, depending on which sub-system owns the component. The
owning component decides how to store and manage the record.
For example, guard on motherboard components is managed by
the host, guard on fans can be managed by the fan control application, and
guard on power components can be managed by the power
management application.

These records are created when an error is encountered on an element
that can be isolated. The decision to create a record is based on the type of
error, the use case, and the availability of a redundant hardware resource to
keep the system in a usable state. The record created to isolate the
component is called a guard record. The guard record is deleted after
a repair action or manually by service personnel. Most of the time, the host
creates the guard record, since the host is responsible for the initialization
and use of the hardware resources in a server system. The BMC creates guard
records on a limited set of units: during early boot, after a severe
fault that brings the host down, or on components such as a power supply or
fan that the BMC controls. The BMC retrieves the guard records from the
various guard record sources and presents them on an external interface for
review by customers and service personnel.

## Requirements
### Primary requirements
![Guard Usecases](https://user-images.githubusercontent.com/16666879/70852658-0edfda80-1eca-11ea-9d70-c81a690c78f2.jpeg)

- When the user requests it, create a record in the right guard record
  repository, based on the hardware component.
- An option should be given to the user to create a guard record for a DIMM or
  processor core to manually keep it out of service.
- An option should be given to the user to delete a guard record.
- An option should be given to list the guard records.
- An option should be given to delete all guard records.
- Industry standard interfaces should be provided to carry out these operations
  on the guard records.

### Assumptions
- Guard records on units that are owned by the host are applied
  only on a subsequent boot.
- The sub-system that owns the hardware resource applies the guard record
  and isolates the units.
- Clearing the records after the faulty component is replaced
  should be done by the component that owns the guard records.
- There should be a mapping between the key in the guard record and the key
  used on the BMC for the components.

## Proposed Design
The guard proposed here is an aggregator for the guard records from various
sources and provides a common interface for creating, deleting, and listing
those records.

### D-Bus Interfaces
On the BMC, there will be a guard manager object and an object for each entry.
#### Guard Manager
The guard manager provides the external interfaces for managing the guard
records and retrieving information about guard records.
The guard manager has the following methods and properties:
- Create Guard: Create a guard record for a DIMM or processor core.
  - Inputs:
    - Inventory path of the DIMM or processor core to be guarded.

- Delete Guard: Delete an existing guard record.
  Deleting the guard record D-Bus entry should delete the underlying
  record.

- Clear all guard: Clear all guard records in the system.

- List Guard: List all the guarded components.

**Note:** On some platforms, the system or hardware may not be in a state where
a guard operation can be performed, either "permanently" or "temporarily".

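The four guard manager operations listed above can be illustrated with a minimal in-memory sketch. This is not the D-Bus implementation: the class name `GuardManager`, its method names, and the use of an integer record ID are assumptions made for illustration only, and a real manager would forward each call to the guard record store that owns the hardware.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GuardManager:
    """Hypothetical in-memory stand-in for the guard manager object."""

    # Maps record ID -> inventory path of the guarded unit (assumed layout).
    _records: Dict[int, str] = field(default_factory=dict)
    _next_id: int = 1

    def create_guard(self, inventory_path: str) -> int:
        """Create a guard record for an inventory path and return its ID."""
        record_id = self._next_id
        self._records[record_id] = inventory_path
        self._next_id += 1
        return record_id

    def delete_guard(self, record_id: int) -> None:
        """Delete one guard record; raises KeyError for an unknown ID."""
        del self._records[record_id]

    def clear_all_guard(self) -> None:
        """Remove every guard record."""
        self._records.clear()

    def list_guard(self) -> List[str]:
        """Return the guarded inventory paths, ordered by record ID."""
        return [self._records[i] for i in sorted(self._records)]
```

A caller would create records with `create_guard(path)`, keep the returned ID, and pass that ID to `delete_guard` later. The inventory path format is left to the platform.
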
#### Guard Entry
The properties of each guard entry are part of this object.
Properties:

- ID: ID of the record, which is part of the entry object path.
- Associations:
  - Guarded hardware inventory path:
    - Forward Name must be "isolated_hw".
    - Reverse Name must be "isolated_hw_entry".
  - BMC Error Log:
    - Forward Name must be "isolated_hw_errorlog".
    - Reverse Name must be "isolated_hw_entry".

      The error log association is optional, because a user may guard
      hardware manually when there is no error.
- Severity: Type of guard
  - Manual - Manually guarded
  - Critical - Guarded based on a critical error
  - Warning - Guarded based on an error that is not critical but may
              eventually lead to critical failures.
- Resolved: Status of guarded hardware
  - Used to indicate whether guarded hardware is repaired or replaced.

    **Note:** Setting this to "true" will not delete this entry, because
              on some platforms guarded hardware records may not be
              deleted and may be used for further analysis.

The underlying guard function is implemented by the applications managing the
hardware units. An application that implements the common guard entry
interface should map each entry to the underlying guard record in the
original guard record store.

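The guard entry properties described above can be sketched as a small data model. The class and field names (`GuardEntry`, `hardware_path`, `error_log_path`) are hypothetical. Only the association names and severity values come from this document. Representing each association as a (forward, reverse, endpoint) triple is an assumption about how an implementation might expose them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Severity(Enum):
    """Type of guard, as listed in the Guard Entry properties."""

    MANUAL = "Manual"
    CRITICAL = "Critical"
    WARNING = "Warning"


@dataclass
class GuardEntry:
    """Hypothetical model of one guard entry object."""

    entry_id: int
    hardware_path: str
    severity: Severity
    # Optional: a manual guard is not an error case, so it may have no log.
    error_log_path: Optional[str] = None
    # Setting this to True marks the hardware repaired; it does not delete
    # the entry.
    resolved: bool = False

    def associations(self) -> List[Tuple[str, str, str]]:
        """Return (forward, reverse, endpoint) triples for this entry."""
        result = [("isolated_hw", "isolated_hw_entry", self.hardware_path)]
        if self.error_log_path is not None:
            result.append(
                ("isolated_hw_errorlog", "isolated_hw_entry",
                 self.error_log_path)
            )
        return result
```

A manually guarded unit would carry only the hardware association, while an entry created from a fault would also link to its error log.
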
## Redfish interface

### Manual guard
To create a manual guard record, set the "Enabled" property to "false" on
a unit that is present in the inventory.
#### redfish » v1 » Systems » system » Processors » CPU1
```
{
  "@odata.type": "#Processor.v1_7_0.Processor",
  "Id": "CPU1",
  "Name": "Processor",
  "Socket": "CPU 1",
  "ProcessorType": "CPU",
  "ProcessorId": {
    "VendorId": "XXXX",
    "IdentificationRegisters": "XXXX"
  },
  "MaxSpeedMHz": 3700,
  "TotalCores": 8,
  "TotalThreads": 16,
  "Status": {
    "State": "UnavailableOffline",
    "Health": "Critical",
    "Enabled": "False" <--- CPU1 is guarded
  },
  "@odata.id": "/redfish/v1/Systems/system/Processors/CPU1"
}
```

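A client-side request for a manual guard might be built as in the sketch below. This document does not specify the HTTP method or the request body. The use of `PATCH`, the body shape `{"Status": {"Enabled": "False"}}` (mirroring the response above), the function name, and the host name `bmc.example` are all assumptions for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_manual_guard_request(
    base_url: str, processor_id: str
) -> urllib.request.Request:
    """Build (but do not send) a request that guards one processor.

    Assumption: the service accepts a PATCH whose body mirrors the
    "Status" object shown in the example response.
    """
    url = f"{base_url}/redfish/v1/Systems/system/Processors/{processor_id}"
    body = json.dumps({"Status": {"Enabled": "False"}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host name; authentication is omitted from this sketch.
request = build_manual_guard_request("https://bmc.example", "CPU1")
```
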
### Listing the guard record
#### redfish » v1 » Systems » system » LogServices » IsolatedHardware » Entries
```
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of Isolated Hardware Components",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "@odata.type": "#LogEntry.v1_7_0.LogEntry",
      "Created": "2020-10-15T10:30:08+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Resolved": "false",
      "Name": "Processor 1",
      "Links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      },
      "Severity": "Critical",
      "SensorType": "Processor",
      "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
      "AdditionalDataSizeBytes": "1024"
    }
  ],
  "Members@odata.count": 1,
  "Name": "Isolated Hardware Entries"
}
```

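A client could read this collection to find which units are still guarded. The sketch below assumes the response has already been decoded into a Python dictionary; the function name is hypothetical. It accepts `Resolved` as either a boolean or a string and tolerates either capitalization of the links key, since the exact encoding depends on the implementation.

```python
from typing import Any, Dict, List


def unresolved_guarded_units(collection: Dict[str, Any]) -> List[str]:
    """Return the OriginOfCondition URI of every unresolved entry."""
    units: List[str] = []
    for member in collection.get("Members", []):
        # Accept both JSON booleans and "true"/"false" strings.
        resolved = str(member.get("Resolved", "false")).lower() == "true"
        if resolved:
            continue
        links = member.get("Links") or member.get("links") or {}
        origin = links.get("OriginOfCondition", {}).get("@odata.id")
        if origin:
            units.append(origin)
    return units
```

Entries marked as resolved are skipped rather than treated as deleted, which matches the note that resolved records may be kept for further analysis.
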
## Alternatives Considered

Guard records can be created for any components that are redundant and
isolatable, to prevent damage to the hardware or data. Once the record is
created, an isolation procedure is needed to take the units out of service.
Units that are controlled by the host can be isolated only
after a reboot, but units controlled by the BMC can be taken out of
service immediately. The alternatives are:
- The host creates, applies, and presents guard records: In this case, the
  BMC has no control, and the host needs to provide the user interface,
  so there may not be a common interface across different types of hosts.
  Different user interfaces are required for guard records created by
  the BMC and the host.
- The BMC manages the external interfaces for guard: There will be one common
  point for presenting or managing the guard records created by multiple hosts
  or the BMC itself. Some guard records are created after a severe failure
  in the host; as a system control entity, the BMC can more easily
  handle such situations and create the records.

## Impacts
- The guard records will be presented as an extension to logs.
- The Redfish implementation will be impacted, since it must support the
  operations the user needs for guard record management. A request for
  standardization is planned for the method to list the isolated units in
  Redfish, since that is not yet available in the Redfish standard.

## Testing
The necessary tests are creating, deleting, and listing guard
records, and these should be automated. Further tests on the isolation of each
type of unit are implementation-specific.