# Guard on BMC

## Problem Description

On systems with multiple processor units and other redundant vital resources,
system downtime can be prevented by isolating faulty components. This document
discusses the design of the BMC support for such isolation procedures. Defective
components can be kept isolated until they are replaced. Most of the actions
required to isolate the parts depend on the architecture and are taken care of
by the host. But the BMC needs to support a few steps, such as notifying users
about the components in isolation, clearing isolation, isolating a suspected
part, or isolating when the host is down due to a severe fault. The process of
isolation is referred to as guard in this document, which means guarding the
system from faulty components.

## Glossary

**Guard**: Guarding the system against failures by permanently isolating faulty
units.

**Guard Records**: A file in the persistent storage with the list of permanently
isolated components.

**Manual guard**: An operation to manually add a unit to the list of isolated
units.
This operation is helpful in isolating a suspected component without
physically removing it from the server.

## Background and References

The guard in the servers is for managing a record of faulty components to keep
them out of service. The list of faulty but guarded components can be stored in
multiple locations based on the ownership of the component. How the record is
stored and managed is decided by the respective component. For example, the
guard on motherboard components is managed by the host, the guard on the fans
can be managed by the fan control application, and the guard on the power
components can be managed by the power management application.

These records will be created when an error is encountered on an element that
can be isolated. The decision to create a record is based on the type of error,
the use case, and the availability of a redundant hardware resource to keep the
system in a usable state. The record created to isolate the component is called
a Guard Record. The guard record will be deleted after a repair action or
manually by service personnel. Most of the time, the host creates the guard
record since the host is responsible for the initialization and use of the
hardware resources in a server system.
The BMC creates guard records on a limited set of units during early boot,
after a severe fault that brings the host down, or on components such as a power
supply or fan that can be controlled by the BMC. The BMC retrieves the guard
records from the various guard record sources and presents them on an external
interface for review by customers and service personnel.

## Requirements

### Primary requirements

- When the user requests it, create a record in the right guard record
  repository, based on the hardware component.
- An option should be given to the user to create a guard record for a DIMM or
  Processor core to manually keep it out of service.
- An option should be given to the user to delete a guard record.
- An option should be given to list the guard records.
- An option should be given to delete all guard records.
- Industry standard interfaces should be provided to carry out these operations
  on the guard records.

### Assumptions

- The guard records on the units which are owned by the host will be applied
  only in a subsequent boot.
- The sub-system which owns the hardware resource will apply the guard record
  and isolate the units.
- The clearing of the records after the replacement of the faulty component
  should be done by the component owning the guard records.
- There should be a mapping between the key in the guard record and the key used
  in BMC for the components.

## Proposed Design

The guard proposed here is an aggregator for the guard records from various
sources and provides a common interface for creating, deleting, and listing
those records.

### D-Bus Interfaces

On the BMC, there will be a guard manager object and an object for each entry.

#### Guard Manager

The guard manager provides the external interfaces for managing the guard
records and retrieving information about them. The methods and properties of
the guard manager are:

- Create Guard: Create a guard record for a DIMM or Processor core.
  - Inputs: Inventory path of the DIMM or Processor core to be guarded.

- Delete Guard: Delete an existing guard record. Deleting the guard record D-Bus
  entry should delete the underlying record.

- Clear all guard: Clear all guard records in the system.

- List Guard: List all the guarded components.

**Note:** On a few platforms, the system or hardware may not be in a state where
a guard operation can be performed, either "permanently" or "temporarily".

#### Guard Entry

The properties of each guard entry are part of this object.

Properties:

- ID: Id of the record which is part of the entry object path.
- Associations:

  - Guarded hardware inventory path:
    - Forward Name must be "isolated_hw".
    - Reverse Name must be "isolated_hw_entry".
  - BMC Error Log:

    - Forward Name must be "isolated_hw_errorlog".
    - Reverse Name must be "isolated_hw_entry".

    The Error Log association is optional because a user may guard hardware
    manually, which is not an error case.

- Severity: Type of Guard
  - Manual - Manually Guarded
  - Critical - Guarded based on a critical error
  - Warning - Guarded based on an error which is not critical, but eventually,
    there can be critical failures.
- Resolved: Status of guarded hardware

  - Used to indicate whether the guarded hardware is repaired or replaced.

  **Note:** Setting this to "true" will not delete this entry because on a few
  system platforms guarded hardware records may not be deleted and are used for
  further analysis.

The underlying guard function is implemented by the applications managing the
hardware units. The application which implements the common guard entry
interface should map each entry to the underlying guard record in the original
guard record store.

## Redfish interface

### Manual guard

To create a manual guard record, set the "Enabled" property to "false". This
manually guards a unit which is present in the inventory.
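
As a rough illustration of the manual guard request, the sketch below builds
(but does not send) an HTTP PATCH against the processor resource. The use of
PATCH, the top-level placement of "Enabled" in the request body, the helper name
`build_manual_guard_request`, and the host `bmc.example.com` are all assumptions
made for this example; this document does not fix the exact request shape.

```python
import json
import urllib.request


def build_manual_guard_request(bmc_host: str, processor_id: str) -> urllib.request.Request:
    """Build a PATCH request that asks the BMC to guard one processor.

    Assumption: the guard is requested by writing Enabled=false to the
    processor resource; the real request shape may differ.
    """
    url = f"https://{bmc_host}/redfish/v1/Systems/system/Processors/{processor_id}"
    body = json.dumps({"Enabled": False}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host name, used only for illustration.
request = build_manual_guard_request("bmc.example.com", "CPU1")
print(request.get_method(), request.full_url)
```

Sending the request would additionally need authentication and TLS settings
appropriate for the target BMC, which are left out here.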

#### redfish » v1 » Systems » system » Processors » CPU1

```
{
  "@odata.type": "#Processor.v1_7_0.Processor",
  "Id": "CPU1",
  "Name": "Processor",
  "Socket": "CPU 1",
  "ProcessorType": "CPU",
  "ProcessorId": {
    "VendorId": "XXXX",
    "IdentificationRegisters": "XXXX"
  },
  "MaxSpeedMHz": 3700,
  "TotalCores": 8,
  "TotalThreads": 16,
  "Status": {
    "State": "UnavailableOffline",
    "Health": "Critical",
    "Enabled": "False"   <--- CPU1 is guarded
  },
  "@odata.id": "/redfish/v1/Systems/system/Processors/CPU1"
}
```

### Listing the guard record

#### redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware

#### >> Entries

```
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of Isolated Hardware Components",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "@odata.type": "#LogEntry.v1_7_0.LogEntry",
      "Created": "2020-10-15T10:30:08+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Resolved": "false",
      "Name": "Processor 1",
      "links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      },
      "Severity": "Critical",
      "SensorType": "Processor",
      "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
      "AdditionalDataSizeBytes": "1024"
    }
  ],
  "Members@odata.count": 1,
  "Name": "Isolated Hardware Entries"
}
```

## Alternatives Considered

The guard records can be created for any components which are redundant and
isolatable to
prevent any damage to the hardware or data. Once the record is created, an
isolation procedure is needed to isolate the units from service. Some of the
units, such as those controlled by the host, can be isolated only after a
reboot, but the units controlled by BMC can be immediately taken out of
service. The alternatives are:

- The host creates, applies, and presents guard records: In this case, BMC has
  no control, and the host needs to provide the user interface, so there may not
  be a common interface across different types of hosts. Different user
  interfaces are required for guard records created by BMC and host.
- BMC manages the external interfaces for guard: There will be one common point
  for presenting or managing the guard records created by multiple hosts or BMC
  itself. There are some guard records created after a severe failure in the
  host; as a system control entity, it will be easier for BMC to handle such
  situations and create the records.

## Impacts

- The guard records will be presented as an extension to logs.
- The Redfish implementation will be impacted, since it must support the
  operations required for guard record management by the user.
  A request for standardization is planned for the method to list the isolated
  units in Redfish, since that is not yet available in the Redfish standard.

## Testing

The necessary tests are creating, deleting, and listing the guard records, and
these should be automated. Further tests on the isolation of each type of unit
are implementation-specific.
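
The create, delete, and list tests could be automated along the following
lines. This is only a sketch against a hypothetical in-memory stand-in: the
class `FakeGuardManager`, its method names, and the inventory paths are
assumptions for illustration and are not the actual D-Bus or Redfish interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GuardEntry:
    """One guard record, loosely mirroring the guard entry properties."""
    entry_id: int
    inventory_path: str        # guarded hardware (hypothetical path format)
    severity: str = "Manual"   # "Manual", "Critical", or "Warning"
    resolved: bool = False


class FakeGuardManager:
    """Hypothetical in-memory stand-in for the guard manager."""

    def __init__(self) -> None:
        self._entries: Dict[int, GuardEntry] = {}
        self._next_id = 1

    def create_guard(self, inventory_path: str) -> GuardEntry:
        entry = GuardEntry(self._next_id, inventory_path)
        self._entries[entry.entry_id] = entry
        self._next_id += 1
        return entry

    def delete_guard(self, entry_id: int) -> None:
        # Raises KeyError if no record with this id exists.
        del self._entries[entry_id]

    def clear_all(self) -> None:
        self._entries.clear()

    def list_guards(self) -> List[GuardEntry]:
        return sorted(self._entries.values(), key=lambda e: e.entry_id)


def test_create_list_delete() -> None:
    mgr = FakeGuardManager()
    dimm = mgr.create_guard("/inventory/dimm0")
    core = mgr.create_guard("/inventory/core1")

    # Listing returns every record that was created.
    assert mgr.list_guards() == [dimm, core]

    # Deleting one record leaves the other untouched.
    mgr.delete_guard(dimm.entry_id)
    assert mgr.list_guards() == [core]

    # Clearing removes all remaining records.
    mgr.clear_all()
    assert mgr.list_guards() == []


if __name__ == "__main__":
    test_create_list_delete()
    print("guard manager sketch tests passed")
```

A real test suite would run the same sequence against the BMC's external
interface and also confirm that the underlying guard record store reflects each
operation.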