# Guard on BMC

## Problem Description
On systems with multiple processor units and other redundant vital resources,
system downtime can be prevented by isolating the faulty components. This
document discusses the design of the BMC support for such isolation procedures.
The defective components can be kept isolated until they are replaced. Most of
the actions required to isolate the parts depend on the architecture and are
taken care of by the host. But the BMC needs to support a few steps, such as
notifying users about the components in isolation, clearing isolation, isolating
a suspected part, or isolating a part when the host is down due to a severe
fault. The process of isolation is referred to as guard in this document, which
means guarding the system from faulty components.

## Glossary
**Guard**: Guarding the system against failures by permanently isolating faulty
units.

**Guard Records**: A file in the persistent storage with the list of
permanently isolated components.

**Manual guard**: An operation to manually add a unit to the list of isolated
units. This operation is helpful in isolating a suspected component without
physically removing it from the server.


## Background and References

The guard in the servers is for managing a record of faulty components to keep
them out of service. The list of faulty but guarded components can be stored in
multiple locations based on the ownership of the component. How to store or
manage the record will be decided by the respective component owner.
For example, guard on motherboard components can be managed by the host, guard
on the fans by the fan control application, and guard on the power components
by the power management application.

These records will be created when an error is encountered on an element
that can be isolated.
The decision to create a record is based on the type of
error, use case, and the availability of a redundant hardware resource to keep
the system in a usable state. The record created to isolate the
component is called a Guard Record. The guard record will be deleted after
a repair action or manually by service personnel. Most of the time, the host
creates the guard record since the host is responsible for the initialization
and use of the hardware resources in a server system. The BMC creates guard
records on a limited set of units during early boot, after a severe
fault that brings the host down, or on components such as a power supply or
fan that the BMC controls. The BMC retrieves the guard records from the
various guard record sources and presents them on an external interface for
review by customers and service personnel.

## Requirements
### Primary requirements
![Guard Usecases](https://user-images.githubusercontent.com/16666879/70852658-0edfda80-1eca-11ea-9d70-c81a690c78f2.jpeg)

- When the user requests it, create a record in the right guard record
  repository, based on the hardware component.
- An option should be given to the user to create a guard record for a DIMM or
  processor core to manually keep it out of service.
- An option should be given to the user to delete a guard record.
- An option should be given to list the guard records.
- An option should be given to delete all guard records.
- Industry standard interfaces should be provided to carry out these operations
  on the guard records.

### Assumptions
- The guard records on the units which are owned by the host will be applied
  only in a subsequent boot.
- The sub-system which owns the hardware resource will apply the guard record
  and isolate the units.
- The clearing of the records after the replacement of the faulty component
  should be done by the component owning the guard records.
- There should be a mapping between the key in the guard record and the key
  used in the BMC for the components.

## Proposed Design
The guard proposed here is an aggregator for the guard records from various
sources and provides a common interface for creating, deleting, and listing
those records.

### D-Bus Interfaces
On the BMC, there will be a guard manager object and an object for each entry.
#### Guard Manager
The guard manager provides the external interfaces for managing the guard
records and retrieving information about guard records.
The methods and properties of the guard manager are:
- Create Guard: Create a guard record for a DIMM or Processor core
  Inputs:
  - Inventory path of the DIMM or Processor core to be guarded.

- Delete Guard: Delete an existing guard record
  Deleting the guard record D-Bus entry should delete the underlying
  record.

- Clear all guard: Clear all guard records in the system.

- List Guard: List all the guarded components.

#### Guard Entry
The properties of each guard entry will be part of this object.
Properties:

- ID: ID of the record, which is part of the entry object path.
- Associations:
  - Guarded hardware inventory path:
    - Forward Name must be "isolated_hw".
    - Reverse Name must be "isolated_hw_entry".
  - BMC Error Log:
    - Forward Name must be "isolated_hw_errorlog".
    - Reverse Name must be "isolated_hw_entry".

  The Error Log association is optional because a user may manually guard
  hardware, which is not an error case.
- Severity: Type of Guard
  - Manual - Manually Guarded
  - Critical - Guarded based on a critical error
  - Warning - Guarded based on an error which is not critical, but which may
    eventually lead to critical failures.
- Resolved: Status of guarded hardware
  - Used to indicate whether guarded hardware is repaired or replaced.

  **Note:** Setting this to "true" will not delete this entry because
  on some platforms guarded hardware records may be retained
  and used for further analysis.

The underlying guard function is implemented by the applications managing the
hardware units. The application which implements the common guard entry
interface should map each entry to the underlying guard record in the
original guard record store.

## Redfish interface

### Manual guard
To create a manual guard record, set the "Enabled" property to "false" on a
unit which is present in the inventory. In the example below, CPU1 is guarded.
#### redfish » v1 » Systems » system » Processors » CPU1
```
{
  "@odata.type": "#Processor.v1_7_0.Processor",
  "Id": "CPU1",
  "Name": "Processor",
  "Socket": "CPU 1",
  "ProcessorType": "CPU",
  "ProcessorId": {
    "VendorId": "XXXX",
    "IdentificationRegisters": "XXXX"
  },
  "MaxSpeedMHz": 3700,
  "TotalCores": 8,
  "TotalThreads": 16,
  "Status": {
    "State": "UnavailableOffline",
    "Health": "Critical",
    "Enabled": "False"
  },
  "@odata.id": "/redfish/v1/Systems/system/Processors/CPU1"
}
```

### Listing the guard record
#### redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
```
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of Isolated Hardware Components",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "@odata.type": "#LogEntry.v1_7_0.LogEntry",
      "Created": "2020-10-15T10:30:08+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Resolved": "false",
      "Name": "Processor 1",
      "Links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      },
      "Severity": "Critical",
      "SensorType": "Processor",
      "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
      "AdditionalDataSizeBytes": "1024"
    }
  ],
  "Members@odata.count": 1,
  "Name": "Isolated Hardware Entries"
}
```

## Alternatives Considered

The guard records can be created for any components which are redundant and
isolatable to prevent any damage to the hardware or data. Once the record is
created, an isolation procedure is needed to isolate the units from service.
Units that are controlled by the host can be isolated only
after a reboot, but the units controlled by the BMC can be taken out of
service immediately. The alternatives are:
- The host creates, applies, and presents guard records: In this case, the
  BMC has no control, and the host needs to provide the user interface,
  so there may not be a common interface across different types of hosts.
  Different user interfaces are required for guard records created by the
  BMC and the host.
- BMC manages the external interfaces for guard: There will be one common
  point for presenting or managing the guard records created by multiple hosts
  or the BMC itself. There are some guard records created after a severe
  failure in the host; as a system control entity, it will be easier for the
  BMC to handle such situations and create the records.


## Impacts
- The guard records will be presented as an extension to logs.
- The Redfish implementation will be impacted, since it must support the
  operations required for guard record management by the user. A request for
  standardization is planned for the method to list the isolated units in
  Redfish, since that is not yet available in the Redfish standard.

## Testing
The necessary tests are creating, deleting, and listing the guard
records, and these should be automated. Further tests on the isolation of each
type of unit are implementation-specific.
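
The create, delete, and list checks could be automated along the following
lines. This is a minimal sketch against a hypothetical in-memory model, not the
actual D-Bus implementation: the `GuardManager` and `GuardEntry` classes, their
method names, and the inventory paths are assumptions made for illustration,
loosely following the guard manager and guard entry described in the proposed
design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GuardEntry:
    """Hypothetical model of a guard entry and its properties."""
    entry_id: int
    inventory_path: str               # target of the "isolated_hw" association
    severity: str = "Manual"          # Manual, Critical, or Warning
    error_log: Optional[str] = None   # optional "isolated_hw_errorlog" target
    resolved: bool = False


class GuardManager:
    """Hypothetical in-memory stand-in for the guard manager object."""

    def __init__(self) -> None:
        self._entries: Dict[int, GuardEntry] = {}
        self._next_id = 1

    def create(self, inventory_path: str) -> GuardEntry:
        # A user-requested record is always a manual guard.
        entry = GuardEntry(self._next_id, inventory_path)
        self._entries[entry.entry_id] = entry
        self._next_id += 1
        return entry

    def delete(self, entry_id: int) -> None:
        # Raises KeyError if no such record exists.
        del self._entries[entry_id]

    def clear_all(self) -> None:
        self._entries.clear()

    def list_entries(self) -> List[GuardEntry]:
        return list(self._entries.values())


def check_create_delete_list() -> None:
    manager = GuardManager()
    dimm = manager.create("/inventory/dimm0")        # hypothetical path
    core = manager.create("/inventory/cpu0/core1")   # hypothetical path
    assert manager.list_entries() == [dimm, core]

    manager.delete(dimm.entry_id)
    assert manager.list_entries() == [core]

    manager.clear_all()
    assert manager.list_entries() == []


if __name__ == "__main__":
    check_create_delete_list()
    print("guard record checks passed")
```

A real test suite would drive the same three operations through the external
interface and then verify that the underlying guard record store changed
accordingly.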