# Guard on BMC

## Problem Description

On systems with multiple processor units and other redundant vital resources,
system downtime can be prevented by isolating the faulty components. This
document discusses the design of the BMC support for such isolation procedures.
The defective components can be kept isolated until they are replaced. Most of
the actions required to isolate the parts depend on the architecture and are
taken care of by the host. But the BMC needs to support a few steps, such as
notifying users about the components in isolation, clearing isolation, isolating
a suspected part, or isolating when the host is down due to a severe fault. The
process of isolation is referred to as guard in this document, which means
guarding the system from faulty components.

## Glossary

**Guard**: Guarding the system against failures by permanently isolating faulty
units.

**Guard Records**: A file in the persistent storage with the list of permanently
isolated components.

**Manual guard**: An operation to manually add a unit to the list of isolated
units. This operation is helpful in isolating a suspected component without
physically removing it from the server.

## Background and References

The guard in the servers manages a record of faulty components to keep them out
of service. The list of faulty but guarded components can be stored in multiple
locations based on the ownership of the component. How the record is stored and
managed is decided by the component that owns it. For example, the guard on
motherboard components can be managed by the host, the guard on the fans by the
fan control application, and the guard on the power components by the power
management application.

These records will be created when an error is encountered on an element that
can be isolated.
The decision to create a record is based on the type of error, the use case,
and the availability of a redundant hardware resource to keep the system in a
usable state. The record which gets created to isolate the component is called
a Guard Record. The guard record will be deleted after a repair action or
manually by service personnel. Most of the time, the host creates the guard
record since the host is responsible for the initialization and use of the
hardware resources in a server system. The BMC creates guard records on a
limited set of units: during early boot, after a severe fault that brings the
host down, or on components such as a power supply or fan that are controlled
by the BMC. The BMC retrieves the guard records from the various guard record
sources and presents them on an external interface for review by customers and
service personnel.

## Requirements

### Primary requirements

![Guard Usecases](https://user-images.githubusercontent.com/16666879/70852658-0edfda80-1eca-11ea-9d70-c81a690c78f2.jpeg)

- When the user requests it, create a record in the right guard record
  repository, based on the hardware component.
- An option should be given to the user to create a guard record for a DIMM or
  processor core to manually keep it out of service.
- An option should be given to the user to delete a guard record.
- An option should be given to list the guard records.
- An option should be given to delete all guard records.
- Industry standard interfaces should be provided to carry out these operations
  on the guard records.

### Assumptions

- The guard records on the units which are owned by the host will be applied
  only in a subsequent boot.
- The sub-system which owns the hardware resource will apply the guard record
  and isolate the units.
- The clearing of the records after the replacement of the faulty component
  should be done by the component owning the guard records.
- There should be a mapping between the key in the guard record and the key
  used in the BMC for the components.

## Proposed Design

The guard proposed here is an aggregator for the guard records from various
sources and provides a common interface for creating, deleting, and listing
those records.

### D-Bus Interfaces

On the BMC, there will be a guard manager object and an object for each entry.

#### Guard Manager

The guard manager provides the external interfaces for managing the guard
records and retrieving information about them. The methods and properties of
the guard manager are:

- Create Guard: Create a guard record for a DIMM or processor core.

  Inputs:

  - Inventory path of the DIMM or processor core to be guarded.

- Delete Guard: Delete an existing guard record. Deleting the guard record
  D-Bus entry should delete the underlying record.

- Clear all guard: Clear all guard records in the system.

- List Guard: List all the guarded components.

**Note:** On a few platforms, the system or hardware may not be in a state
where the guard operation can be performed, either "permanently" or
"temporarily".

#### Guard Entry

The properties of each guard entry are part of this object.

Properties:

- ID: Id of the record, which is part of the entry object path.
- Associations:

  - Guarded hardware inventory path:
    - Forward Name must be "isolated_hw".
    - Reverse Name must be "isolated_hw_entry".
  - BMC Error Log:

    - Forward Name must be "isolated_hw_errorlog".
    - Reverse Name must be "isolated_hw_entry".

    The error log association is optional because a user may guard hardware
    manually, which is not an error case.

- Severity: Type of Guard
  - Manual - Manually Guarded
  - Critical - Guarded based on a critical error
  - Warning - Guarded based on an error which is not critical but may
    eventually lead to critical failures.
- Resolved: Status of guarded hardware

  - Used to indicate whether guarded hardware is repaired or replaced.

  **Note:** Setting this to "true" will not delete this entry because, on a few
  platforms, guarded hardware records are not deleted but kept for further
  analysis.

The underlying guard function is implemented by the applications managing the
hardware units. The application which implements the common guard entry
interface should map each entry to the underlying guard record in the original
guard record store.

## Redfish interface

### Manual guard

To create a manual guard record, set the "Enabled" property to "false" on a
unit which is present in the inventory. In the example below, the "Enabled"
value under "Status" shows that CPU1 is guarded.

#### redfish » v1 » Systems » system » Processors » CPU1

```
{
  "@odata.type": "#Processor.v1_7_0.Processor",
  "Id": "CPU1",
  "Name": "Processor",
  "Socket": "CPU 1",
  "ProcessorType": "CPU",
  "ProcessorId": {
    "VendorId": "XXXX",
    "IdentificationRegisters": "XXXX"
  },
  "MaxSpeedMHz": 3700,
  "TotalCores": 8,
  "TotalThreads": 16,
  "Status": {
    "State": "UnavailableOffline",
    "Health": "Critical",
    "Enabled": "False"
  },
  "@odata.id": "/redfish/v1/Systems/system/Processors/CPU1"
}
```

### Listing the guard record

#### redfish » v1 » Systems » system » LogServices » IsolatedHardware » Entries

```
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of Isolated Hardware Components",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "@odata.type": "#LogEntry.v1_7_0.LogEntry",
      "Created": "2020-10-15T10:30:08+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Resolved": "false",
      "Name": "Processor 1",
      "Links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      },
      "Severity": "Critical",
      "SensorType": "Processor",
      "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
      "AdditionalDataSizeBytes": "1024"
    }
  ],
  "Members@odata.count": 1,
  "Name": "Isolated Hardware Entries"
}
```

## Alternatives Considered

Guard records can be created for any components which are redundant and
isolatable, to prevent damage to the hardware or data. Once the record is
created, an isolation procedure is needed to take the units out of service.
Some units, such as those controlled by the host, can be isolated only after a
reboot, but the units controlled by the BMC can be taken out of service
immediately. The alternatives are:

- The host creates, applies, and presents guard records: In this case, the BMC
  has no control, and the host needs to provide the user interface, so there
  may not be a common interface across different types of hosts. Different user
  interfaces are required for guard records created by the BMC and by the host.
- The BMC manages the external interfaces for guard: There will be one common
  point for presenting or managing the guard records created by multiple hosts
  or by the BMC itself. Some guard records are created after a severe failure
  in the host; as the system control entity, the BMC can handle such situations
  and create the records more easily.

## Impacts

- The guard records will be presented as an extension to logs.
- The Redfish implementation will need changes to support the guard record
  management operations requested by the user. A request for standardization is
  planned for the method to list the isolated units in Redfish, since that is
  not yet available in the Redfish standard.


## Testing

The necessary tests are creating, deleting, and listing the guard records, and
these should be automated. Further tests on the isolation of each type of unit
are implementation-specific.
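
The create, delete, and list tests described above could be automated along
the following lines. This is only a minimal sketch: `GuardManager`,
`GuardEntry`, `Severity`, and the inventory paths are hypothetical stand-ins
that model the guard manager methods and entry properties from this document
in memory. It does not use the real D-Bus or Redfish interfaces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    """Guard types listed under the Guard Entry properties."""
    MANUAL = "Manual"
    CRITICAL = "Critical"
    WARNING = "Warning"


@dataclass
class GuardEntry:
    """Hypothetical model of one guard entry object."""
    entry_id: int
    inventory_path: str
    severity: Severity
    resolved: bool = False


class GuardManager:
    """Hypothetical in-memory stand-in for the guard manager object."""

    def __init__(self) -> None:
        self._entries: Dict[int, GuardEntry] = {}
        self._next_id = 1

    def create_guard(self, inventory_path: str,
                     severity: Severity = Severity.MANUAL) -> GuardEntry:
        entry = GuardEntry(self._next_id, inventory_path, severity)
        self._entries[entry.entry_id] = entry
        self._next_id += 1
        return entry

    def delete_guard(self, entry_id: int) -> None:
        # Raises KeyError when no record with this id exists.
        del self._entries[entry_id]

    def clear_all(self) -> None:
        self._entries.clear()

    def list_guard(self) -> List[GuardEntry]:
        return sorted(self._entries.values(), key=lambda e: e.entry_id)


def run_guard_tests() -> bool:
    """Exercise create, list, delete, and clear-all on the model."""
    manager = GuardManager()
    dimm = manager.create_guard("/inventory/dimm0")      # illustrative path
    core = manager.create_guard("/inventory/cpu0/core1")  # illustrative path
    assert [e.entry_id for e in manager.list_guard()] == [1, 2]

    # Marking an entry resolved must not delete it.
    core.resolved = True
    assert len(manager.list_guard()) == 2

    manager.delete_guard(dimm.entry_id)
    assert [e.inventory_path for e in manager.list_guard()] == [
        "/inventory/cpu0/core1"
    ]

    manager.clear_all()
    assert manager.list_guard() == []
    return True


if __name__ == "__main__":
    run_guard_tests()
```

In a real test suite the same sequence would be driven through the external
interface instead of this model, with the isolation behavior of each unit type
covered by separate implementation-specific tests.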