# Guard on BMC

## Problem Description

On systems with multiple processor units and other redundant vital resources,
system downtime can be prevented by isolating faulty components. This document
discusses the design of the BMC support for such isolation procedures.
Defective components can be kept isolated until they are replaced. Most of the
actions required to isolate the parts depend on the architecture and are taken
care of by the host. But the BMC needs to support a few steps, such as
notifying users about the components in isolation, clearing isolation,
isolating a suspected part, or isolating when the host is down due to a severe
fault. The process of isolation is referred to as guard in this document, which
means guarding the system from faulty components.

## Glossary

**Guard**: Guarding the system against failures by permanently isolating faulty
units.

**Guard Records**: A file in the persistent storage with the list of permanently
isolated components.

**Manual guard**: An operation to manually add a unit to the list of isolated
units. This operation is helpful in isolating a suspected component without
physically removing it from the server.

## Background and References

Guard on servers manages a record of faulty components to keep them out of
service. The list of faulty but guarded components can be stored in multiple
locations, based on the ownership of the component. The respective component
owner decides how to store and manage the record. For example, guard on
motherboard components can be managed by the host, guard on the fans by the fan
control application, and guard on the power components by the power management
application.

These records are created when an error is encountered on an element that can
be isolated. The decision to create a record is based on the type of error, the
use case, and the availability of a redundant hardware resource to keep the
system in a usable state. The record created to isolate the component is called
a Guard Record. The guard record is deleted after a repair action or manually
by service personnel. Most of the time, the host creates the guard record,
since the host is responsible for the initialization and use of the hardware
resources in a server system. The BMC creates guard records on a limited set of
units: during early boot, after a severe fault that brings the host down, or on
components such as a power supply or fan that can be controlled by the BMC. The
BMC retrieves the guard records from the various guard record sources and
presents them on an external interface for review by customers and service
personnel.

## Requirements

### Primary requirements

![Guard Usecases](https://user-images.githubusercontent.com/16666879/70852658-0edfda80-1eca-11ea-9d70-c81a690c78f2.jpeg)

- When the user requests it, create a record in the right guard record
  repository, based on the hardware component.
- An option should be given to the user to create a guard record for a DIMM or
  processor core to manually keep it out of service.
- An option should be given to the user to delete a guard record.
- An option should be given to list the guard records.
- An option should be given to delete all guard records.
- Industry standard interfaces should be provided to carry out these operations
  on the guard records.

### Assumptions

- The guard records on the units which are owned by the host will be applied
  only in a subsequent boot.
- The sub-system which owns the hardware resource will apply the guard record
  and isolate the units.
- The clearing of the records after the replacement of the faulty component
  should be done by the component owning the guard records.
- There should be a mapping between the key in the guard record and the key
  used in the BMC for the components.

## Proposed Design

The guard proposed here is an aggregator for the guard records from various
sources and provides a common interface for creating, deleting, and listing
those records.

### D-Bus Interfaces

On the BMC, there will be a guard manager object and an object for each entry.

#### Guard Manager

The guard manager provides the external interfaces for managing the guard
records and retrieving information about them. The methods and properties of
the guard manager are:

- Create Guard: Create a guard record for a DIMM or processor core.
  - Inputs: Inventory path of the DIMM or processor core to be guarded.

- Delete Guard: Delete an existing guard record. Deleting the guard record
  D-Bus entry should delete the underlying record.

- Clear all guard: Clear all guard records in the system.

- List Guard: List all the guarded components.

**Note:** On a few platforms, the system or hardware may not be in a state
where a guard operation can be performed, either "permanently" or
"temporarily".

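The four manager operations can be sketched as a small in-memory model. This is
only an illustration of the intended behaviour, not the D-Bus implementation:
the class and method names (`GuardManager`, `create_guard`, and so on), the
counter-based ID allocation, and the inventory paths are all assumptions made
for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GuardManager:
    """In-memory stand-in for the guard manager (illustrative only)."""

    # Maps record ID -> inventory path of the guarded unit.
    records: dict[int, str] = field(default_factory=dict)
    next_id: int = 1  # assumed ID allocation: a simple counter

    def create_guard(self, inventory_path: str) -> int:
        """Create a guard record for a unit and return the new record ID."""
        record_id = self.next_id
        self.records[record_id] = inventory_path
        self.next_id += 1
        return record_id

    def delete_guard(self, record_id: int) -> None:
        """Delete one guard record; raises KeyError if it does not exist."""
        del self.records[record_id]

    def clear_all(self) -> None:
        """Delete every guard record."""
        self.records.clear()

    def list_guard(self) -> list[str]:
        """Return the inventory paths of all guarded units."""
        return sorted(self.records.values())


manager = GuardManager()
core = manager.create_guard("/inventory/cpu0/core1")  # hypothetical path
dimm = manager.create_guard("/inventory/dimm3")       # hypothetical path
manager.delete_guard(core)
print(manager.list_guard())
```

In a real aggregator each method would forward the request to the application
that owns the underlying guard record store instead of updating a local
dictionary.
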
#### Guard Entry

Each guard entry object carries the following properties:

- ID: Id of the record, which is part of the entry object path.
- Associations:

  - Guarded hardware inventory path:
    - Forward Name must be "isolated_hw".
    - Reverse Name must be "isolated_hw_entry".
  - BMC Error Log:

    - Forward Name must be "isolated_hw_errorlog".
    - Reverse Name must be "isolated_hw_entry".

      The Error Log association is optional because a user may guard hardware
      manually, which is not an error case.

- Severity: Type of Guard
  - Manual - Manually Guarded
  - Critical - Guarded based on a critical error
  - Warning - Guarded based on an error which is not critical, but which may
    eventually lead to critical failures.
- Resolved: Status of guarded hardware

  - Used to indicate whether guarded hardware is repaired or replaced.

    **Note:** Setting this to "true" will not delete this entry, because on a
    few platforms guarded hardware records may not be deleted and are kept for
    further analysis.

The underlying guard function is implemented by the applications managing the
hardware units. The application which implements the common guard entry
interface should map each entry to the underlying guard record in the original
guard record store.

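As a rough illustration of the entry layout, the sketch below models the
properties listed above as a Python dataclass. The association names and the
severity values come from this document; the class and field names, the
representation of an association as a `(forward, reverse, path)` tuple, and the
object paths are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    MANUAL = "Manual"
    CRITICAL = "Critical"
    WARNING = "Warning"


@dataclass
class GuardEntry:
    entry_id: int
    severity: Severity
    hw_path: str                          # guarded hardware inventory path
    error_log_path: Optional[str] = None  # optional: absent for manual guard
    resolved: bool = False

    def associations(self) -> list[tuple[str, str, str]]:
        """Return (forward name, reverse name, path) for each association."""
        result = [("isolated_hw", "isolated_hw_entry", self.hw_path)]
        if self.error_log_path is not None:
            result.append(
                ("isolated_hw_errorlog", "isolated_hw_entry", self.error_log_path)
            )
        return result


# Hypothetical paths, for illustration only.
manual = GuardEntry(1, Severity.MANUAL, "/inventory/dimm3")
faulted = GuardEntry(2, Severity.CRITICAL, "/inventory/cpu0/core1", "/logging/entry/42")

# Marking an entry resolved changes its status but keeps the entry itself.
manual.resolved = True
print(manual.associations())
print(faulted.associations())
```
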
## Redfish interface

### Manual guard

To create a manual guard record, set the "Enabled" property to "false" on a
unit which is present in the inventory.

#### redfish » v1 » Systems » system » Processors » CPU1

```
{
  "@odata.type": "#Processor.v1_7_0.Processor",
  "Id": "CPU1",
  "Name": "Processor",
  "Socket": "CPU 1",
  "ProcessorType": "CPU",
  "ProcessorId": {
    "VendorId": "XXXX",
    "IdentificationRegisters": "XXXX"
  },
  "MaxSpeedMHz": 3700,
  "TotalCores": 8,
  "TotalThreads": 16,
  "Status": {
    "State": "UnavailableOffline",
    "Health": "Critical",
    "Enabled": "False"
  },
  "@odata.id": "/redfish/v1/Systems/system/Processors/CPU1"
}
```

In this example, `"Enabled": "False"` marks CPU1 as guarded.

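A client-side request for a manual guard might look like the following sketch,
which only builds the HTTP request and does not send it. Several details are
assumptions and are not specified by this design: the BMC address is a
placeholder, the operation is assumed to be a PATCH on the processor resource,
and "Enabled" is sent as a top-level JSON boolean, whereas the sample response
above shows it as a string under "Status". Authentication is omitted.

```python
import json
import urllib.request

BMC_URL = "https://bmc.example.com"  # hypothetical BMC address


def build_manual_guard_request(resource_path: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that disables a unit."""
    # Assumed payload shape: a single top-level boolean "Enabled" property.
    body = json.dumps({"Enabled": False}).encode("utf-8")
    return urllib.request.Request(
        url=BMC_URL + resource_path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


request = build_manual_guard_request("/redfish/v1/Systems/system/Processors/CPU1")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```
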
### Listing the guard record

#### redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries

```
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
  "@odata.type": "#LogEntryCollection.LogEntryCollection",
  "Description": "Collection of Isolated Hardware Components",
  "Members": [
    {
      "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
      "@odata.type": "#LogEntry.v1_7_0.LogEntry",
      "Created": "2020-10-15T10:30:08+00:00",
      "EntryType": "Event",
      "Id": "1",
      "Resolved": "false",
      "Name": "Processor 1",
      "Links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      },
      "Severity": "Critical",
      "SensorType": "Processor",
      "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
      "AdditionalDataSizeBytes": "1024"
    }
  ],
  "Members@odata.count": 1,
  "Name": "Isolated Hardware Entries"
}
```

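A client could reduce such a collection to the list of guarded units roughly as
follows. This is a sketch under stated assumptions: the embedded sample is a
trimmed copy of the response above, the function name and output format are
invented for the example, and the "Links" key is assumed to use the Redfish
capitalization.

```python
import json

# Trimmed copy of the sample collection, embedded so the sketch is standalone.
SAMPLE = """
{
  "Members": [
    {
      "Id": "1",
      "Name": "Processor 1",
      "Severity": "Critical",
      "Links": {
        "OriginOfCondition": {
          "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
        }
      }
    }
  ],
  "Members@odata.count": 1
}
"""


def summarize_guard_entries(collection: dict) -> list[dict]:
    """Extract ID, name, severity, and guarded resource from each member."""
    summary = []
    for member in collection.get("Members", []):
        origin = member.get("Links", {}).get("OriginOfCondition", {})
        summary.append(
            {
                "id": member.get("Id"),
                "name": member.get("Name"),
                "severity": member.get("Severity"),
                "hardware": origin.get("@odata.id"),
            }
        )
    return summary


entries = summarize_guard_entries(json.loads(SAMPLE))
for entry in entries:
    print(entry["id"], entry["severity"], entry["hardware"])
```
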
## Alternatives Considered

Guard records can be created for any components which are redundant and
isolatable, to prevent any damage to the hardware or data. Once the record is
created, an isolation procedure is needed to isolate the units from service.
Some units, such as those controlled by the host, can be isolated only after a
reboot, but the units controlled by the BMC can be taken out of service
immediately. The alternatives are:

- The host creates, applies, and presents guard records: In this case, the BMC
  has no control, and the host needs to provide the user interface, so there
  may not be a common interface across different types of hosts. Different user
  interfaces are required for guard records created by the BMC and the host.
- BMC manages the external interfaces for guard: There will be one common point
  for presenting or managing the guard records created by multiple hosts or the
  BMC itself. Some guard records are created after a severe failure in the
  host; as a system control entity, the BMC can handle such situations and
  create the records more easily.

## Impacts

- The guard records will be presented as an extension to logs.
- The Redfish implementation will be impacted, since it must support the
  operations required for guard record management by the user. A request for
  standardization is planned for the method to list the isolated units in
  Redfish, since that is not yet available in the Redfish standard.

## Testing

The necessary tests are creating, deleting, and listing the guard records, and
these should be automated. Further tests on the isolation of each type of unit
are implementation-specific.