# OpenBMC platform telemetry

Author: Piotr Matuszczak <piotr.matuszczak@intel.com>

Other contributors: Pawel Rapkiewicz <pawel.rapkiewicz@intel.com> <pawelr>,
Kamil Kowalski <kamil.kowalski@intel.com>

Created: 2019-08-07

## Problem Description

The BMC on a server platform gathers a lot of telemetry data, which has to be
exposed in a clean, human-readable and standardized format. This document
focuses on telemetry over Redfish, since it is the standard API for platform
manageability.

## Background and References

- OpenBMC platform telemetry shall leverage DMTF's [Redfish Telemetry Model][1]
  for exposing platform telemetry over the network.
- OpenBMC platform telemetry shall leverage the [OpenBMC sensors architecture
  implementation][2].
- OpenBMC platform telemetry shall implement a service, called Telemetry, to
  deal with metric report and trigger management. This service is described
  later in this document.
- Although [hwmon][3] is used to gather readings from physical sensors, this
  architecture does not depend on it, because the Telemetry service component
  relies on the [OpenBMC D-Bus sensors][2].

## Requirements

- [OpenBMC D-Bus sensors][2] support. This is also a design limitation, since
  the Telemetry service requires telemetry sources to be implemented as D-Bus
  sensors.

## Proposed Design

Redfish Telemetry Model shall implement Telemetry Service with the following
collection resources:

- Metric Definitions - contains the metadata for metrics (unit, accuracy, etc.)
- Metric Report Definitions - defines how a metric report shall be created
  (which metrics it shall contain, how often it shall be generated, etc.)
- Metric Reports - contains actual metric reports with telemetry data generated
  according to the Metric Report Definitions
- Metric Triggers - contains thresholds and actions that apply to specific
  metrics

The OpenBMC telemetry architecture is shown in the diagram below.

```ascii
   +--------------+               +----------------+     +-----------------+
   |hwmon|        |               |Dbus sensors|   |     |Telemetry|       |
   +-----/        |               +------------/   |     +---------/       |
   |              +--filesystem--->                |     |                 |
   |              |               |                |     |                 |
   +--------------+               +--------^-------+     +--------^--------+
                                           |                      |
                                           |                      |
<------------------------------------------v-----^--DBus----------v----------->
                                                 |
                                                 |
+-------+---------------------------------------------------------------------+
|bmcweb |                                        |                            |
+-------/                                        |                            |
|                                                |                            |
| +--------+-------------------------------------v--------------------------+ |
| |Redfish |                                                                | |
| +--------/                                            +---------+-------+ | |
| |                                                     |Existing |       | | |
| | +------------------------------------------------+  |Redfish  |       | | |
| | |Telemetry Service|                              |  |resources|       | | |
| | +----------------+/                              |  +---------/       | | |
| | |  +----------+  +-----------+  +-------------+  |  |   +---------+   | | |
| | |  |  Metric  |  |  Metric   |  |Metric report|  |  |   | Redfish |   | | |
| | |  | triggers |  |definitions|  |definitions  <---------+ sensors |   | | |
| | |  |          |  |           |  |             |  |  |   |         |   | | |
| | |  +----+-----+  +-----+-----+  +------+------+  |  |   +---------+   | | |
| | |       |              |               |         |  |                 | | |
| | |       |              |               |         |  |                 | | |
| | |       |              |               |         |  |                 | | |
| | |       |        +-----v-----+         |         |  |                 | | |
| | |       |        |   Metric  |         |         |  |                 | | |
| | |       +-------->   report  <---------+         |  |                 | | |
| | |                |           |                   |  |                 | | |
| | |                +-----------+                   |  |                 | | |
| | |                                                |  |                 | | |
| | +------------------------------------------------+  +-----------------+ | |
| |                                                                         | |
| +-------------------------------------------------------------------------+ |
|                                                                             |
+-----------------------------------------------------------------------------+
```

The telemetry service component is a part of Redfish and implements the DMTF's
[Redfish Telemetry Model][1]. Metric report definitions use Redfish sensor URIs
for metric report creation. Those sensors are also used to get the URI->D-Bus
sensor mapping. The Redfish Telemetry Service acts as the presentation layer
for telemetry, while the Telemetry service is responsible for gathering metrics
from D-Bus sensors and exposing them as D-Bus objects. The Telemetry service
supports different monitoring modes (periodic, on change and on demand) along
with aggregated operations:

- SINGLE - current reading value
- AVERAGE - average value over defined time period
- MAX - max reading value during defined time period
- MIN - min reading value during defined time period
- SUM - sum of reading values over defined time period

The time period for calculating an aggregated metric is taken from the Redfish
Metric Report Definition resource for each sensor's metric.

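The aggregated operations can be illustrated with a short Python sketch. It is
not taken from the Telemetry service: the `aggregate` function, the
`(timestamp, value)` sample layout and the use of a plain arithmetic mean of
the collected samples for AVERAGE are assumptions made for illustration only.

```python
from enum import Enum
from statistics import fmean


class Operation(Enum):
    SINGLE = "SINGLE"
    AVERAGE = "AVERAGE"
    MAX = "MAX"
    MIN = "MIN"
    SUM = "SUM"


def aggregate(samples, operation, now, period):
    """Aggregate (timestamp, value) samples from the last `period` seconds.

    `samples` must be ordered oldest to newest. SINGLE ignores the period
    and returns the most recent value.
    """
    if not samples:
        raise ValueError("no readings collected yet")
    if operation is Operation.SINGLE:
        return samples[-1][1]
    # Keep only the readings that fall inside the configured time period.
    window = [value for ts, value in samples if now - period <= ts <= now]
    if not window:
        raise ValueError("no readings inside the time period")
    if operation is Operation.AVERAGE:
        return fmean(window)  # assumption: unweighted mean of the samples
    if operation is Operation.MAX:
        return max(window)
    if operation is Operation.MIN:
        return min(window)
    return sum(window)  # Operation.SUM


# Hypothetical temperature readings taken every 5 seconds.
readings = [(0.0, 40.0), (5.0, 42.0), (10.0, 47.0), (15.0, 45.0)]
latest = aggregate(readings, Operation.SINGLE, now=15.0, period=10.0)
peak = aggregate(readings, Operation.MAX, now=15.0, period=10.0)
```
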
The Telemetry service supports creating and managing metric reports, each of
which may contain one or more metrics from sensors. Such a metric report is
mapped to a Metric Report in the Redfish Telemetry Service.

The diagram below shows the flows for creation and update of a metric report.

```ascii
+----+              +------+              +---------+                 +-------+
|User|              |bmcweb|              |Telemetry|                 | D-Bus |
+-+--+              +--+---+              +----+----+                 |Sensors|
  |                    |                       |                      +---+---+
  |                    |                       |                          |
+-----------------------------------------------------------------------------+
|Metric report definition flow|                |                          |   |
+-----------------------------+                |                          |   |
| |                    |                       |                          |   |
| |                    |                       |                          |   |
| |    POST request    |                       |                          |   |
| |    with metric     |                       |                          |   |
| |    report          |                       |                          |   |
| |    definition      |                       |                          |   |
| +-------------------->  Invoke AddReport     |  Register for D-Bus      |   |
| |                    |  method on D-Bus      |  sensors                 |   |
| |                    +----------------------->  PropertiesChanged       |   |
| |                    |                       |  signals                 |   |
| |                    |                       +-------------------------->   |
| |                    |                       |-------------------------->   |
| |                    |                       +-------------------------->   |
| |                    |                       |                          |   |
| |  HTTP response     |                       +-+Create Report           |   |
| |  code 201 with     |  Return created       | |D-Bus object            |   |
| |  Metric Report     |  Report D-Bus path    <-+                        |   |
| |  Definition's URI  <-----------------------+                          |   |
| <--------------------+                       |                          |   |
| |                    |                       |                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
  |                    |                       |                          |
+-----------------------------------------------------------------------------+
|Periodic metric report update flow|           |                          |   |
+----------------------------------+           +-+Metric report           |   |
| |                    |                       | |timer triggers          |   |
| |                    |                       <-+report update           |   |
| |                    |                       |                          |   |
+----------------------------------Optional-----------------------------------+
| |                    |                       |                          |   |
| |  Send report as SSE or push-style event    |                          |   |
| |  using Redfish Event Service (not shown    |                          |   |
| |  here) if configured to do so.             |                          |   |
| <--------------------------------------------+                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
| |  GET on Metric     |                       |                          |   |
| |  Report URI        |                       |   Sensor's Properties-   |   |
| +-------------------->                       |   Changed signal         |   |
| |                    +-+Map report's URI     <--------------------------+   |
| |                    | |to D-Bus path        |                          |   |
| |                    <-+                     | +----------------------+ |   |
| |                    | Invoke GetAll method  | |Note that sensor's    | |   |
| |                    | on report D-Bus       | |PropertiesChanged     | |   |
| |                    | object                | |signal is asynchronous| |   |
| |                    +-----------------------> |to metric report timer| |   |
| |                    |                       | |This timer is the only| |   |
| |  Return metric     | Return report data    | |thing that triggers   | |   |
| |  report in JSON    <-----------------------+ |metric report update  | |   |
| |  format            |                       | +----------------------+ |   |
| <--------------------+                       |                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
  |                    |                       |                          |
+-----------------------------------------------------------------------------+
|On change metric report update flow|          |   Sensor's Properties-   |   |
+-----------------------------------+          |   Changed signal         |   |
| |                    |                       <--------------------------+   |
| |                    |                       |                          |   |
| |                    |                       +-+Sensor's signal         |   |
| |                    |                       | |triggers report         |   |
| |                    |                       <-+update                  |   |
| |                    |                       |                          |   |
+----------------------------------Optional-----------------------------------+
| |                    |                       |                          |   |
| |  Send report as SSE or push-style event    |                          |   |
| |  using Redfish Event Service (not shown    |                          |   |
| |  here) if configured to do so.             |                          |   |
| <--------------------------------------------+                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
| |  GET on Metric     |                       |                          |   |
| |  Report URI        |                       |                          |   |
| +-------------------->                       |                          |   |
| |                    +-+Map report's URI     |                          |   |
| |                    | |to D-Bus path        | +----------------------+ |   |
| |                    <-+                     | |Note that sensor's    | |   |
| |                    | Invoke GetAll method  | |PropertiesChanged     | |   |
| |                    | on report D-Bus       | |signal triggers the   | |   |
| |                    | object                | |report update. It is  | |   |
| |                    +-----------------------> |sufficient that the   | |   |
| |                    |                       | |signal from only one  | |   |
| |  Return metric     | Return report data    | |sensor triggers report| |   |
| |  report in JSON    <-----------------------+ |update.               | |   |
| |  format            |                       | +----------------------+ |   |
| <--------------------+                       |                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
  |                    |                       |                          |
+-+--------------------+------------------------------------------------------+
|On demand metric report update flow|          |                          |   |
+-+--------------------+------------+          |                          |   |
| |                    |                       |                          |   |
| |  GET on Metric     |                       |                          |   |
| |  Report URI        |                       |                          |   |
| +-------------------->                       |                          |   |
| |                    +-+Map report's URI     |                          |   |
| |                    | |to D-Bus path        |                          |   |
| |                    <-+                     |                          |   |
| |                    |                       |                          |   |
| |                    |  Invoke the Update    |                          |   |
| |                    |  method for report    |                          |   |
| |                    |  D-Bus object         |                          |   |
| |                    +----------------------->                          |   |
| |                    |                       +-+Update method triggers  |   |
| |                    |                       | |report to be updated    |   |
| |                    |                       | |with the latest known   |   |
| |                    |                       | |sensor's readings.      |   |
| |                    |                       | |No additional sensor    |   |
| |                    |                       <-+readings are performed. |   |
+----------------------------------Optional-----------------------------------+
| |                    |                       |                          |   |
| |  Send report as SSE or push-style event    |                          |   |
| |  using Redfish Event Service (not shown    |                          |   |
| |  here) if configured to do so.             |                          |   |
| <--------------------------------------------+                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
| |                    | Update method call    |                          |   |
| |                    | result                |                          |   |
| |                    <-----------------------+                          |   |
| |                    |                       |                          |   |
| |                    | Invoke GetAll method  |                          |   |
| |                    | on report D-Bus       |                          |   |
| |                    | object                |                          |   |
| |                    +----------------------->                          |   |
| |                    |                       |                          |   |
| |  Return metric     | Return report data    |                          |   |
| |  report in JSON    <-----------------------+                          |   |
| |  format            |                       |                          |   |
| <--------------------+                       |                          |   |
| |                    |                       |                          |   |
+-----------------------------------------------------------------------------+
  |                    |                       |                          |
```

The Redfish implementation in bmcweb is stateless, thus it is not able to store
metric reports. All operations on metric reports shall be done in the Telemetry
service. Sending a metric report as SSE or push-style events shall be done via
the [Redfish Event Service][6]. It is marked as optional because a metric
report does not have to be configured for pushing its data through events.

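Because bmcweb keeps no state, every GET has to translate the Metric Report URI
into the D-Bus path of the report object, as in the "Map report's URI to D-Bus
path" step of the flows above. A rough sketch of such a translation follows.
The `/redfish/v1/TelemetryService/MetricReports/` prefix comes from the Redfish
Telemetry Model; placing report objects directly under
`/xyz/openbmc_project/Telemetry/Reports` and reusing the Redfish report id as
the last path element are assumptions, and `report_uri_to_dbus_path` is a
hypothetical helper name.

```python
import string

REDFISH_REPORTS = "/redfish/v1/TelemetryService/MetricReports/"
DBUS_REPORTS = "/xyz/openbmc_project/Telemetry/Reports/"  # assumed layout

# D-Bus object path elements may only contain ASCII letters, digits and "_".
ALLOWED = set(string.ascii_letters + string.digits + "_")


def report_uri_to_dbus_path(uri):
    """Translate a Metric Report URI into a report object path."""
    if not uri.startswith(REDFISH_REPORTS):
        raise ValueError(f"not a Metric Report URI: {uri}")
    report_id = uri[len(REDFISH_REPORTS):].rstrip("/")
    if not report_id or not set(report_id) <= ALLOWED:
        raise ValueError(f"unsupported report id: {report_id!r}")
    return DBUS_REPORTS + report_id


path = report_uri_to_dbus_path(
    "/redfish/v1/TelemetryService/MetricReports/PowerReport"
)
```
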
In the case of an on demand metric report update, the Telemetry service
performs no additional sensor readings because it already has the latest
values, since they are updated on the PropertiesChanged signal from the D-Bus
sensors.

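The relationship between the PropertiesChanged signal and the on demand update
can be sketched as below. This is a toy model rather than the Telemetry
service's code: the `Report` class, its method names and the dictionary-based
cache are all hypothetical.

```python
class Report:
    """Toy stand-in for a Telemetry report object (not the real service)."""

    def __init__(self, sensor_paths):
        # Latest known value per sensor, refreshed by PropertiesChanged.
        self._latest = {path: None for path in sensor_paths}
        self.readings = {}     # snapshot exposed to readers of the report
        self.timestamp = None  # time of the last report update

    def on_properties_changed(self, sensor_path, value):
        """Handle a sensor signal: refresh the cache, not the report."""
        if sensor_path in self._latest:
            self._latest[sensor_path] = value

    def update(self, now):
        """On demand update: publish cached values, no sensor is read."""
        self.readings = dict(self._latest)
        self.timestamp = now


report = Report(["/sensors/temperature/cpu0", "/sensors/power/psu0"])
report.on_properties_changed("/sensors/temperature/cpu0", 54.5)
report.update(now=1000)
```
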
**Telemetry service on [D-Bus][4]**

The Telemetry service exposes specific interfaces on D-Bus. One of them is used
for reading report management. The second one is used for trigger management.

**Reading report management**

The reading report management D-Bus object:

```ascii
xyz.openbmc_project.Telemetry.ReportManager
/xyz/openbmc_project/Telemetry/Reports
```

The `ReportManager` implements the D-Bus interface
[`xyz.openbmc_project.Telemetry.ReportManager`][8] for report management. The
interface is described in phosphor-dbus-interfaces. This interface implements
the `AddReport` method, which is used to create a metric report. The report may
contain one or more sensor readings. How the report is stored by the BMC is
defined by one of this method's parameters. The `ReportManager` object
implements a property that stores the maximum number of reports supported
simultaneously.

The `AddReport` method returns the path to the newly created report object. The
report object implements the [`xyz.openbmc_project.Object.Delete`][10] and
[`xyz.openbmc_project.Telemetry.Report`][9] interfaces. The [`Delete`][10]
interface adds support for removing the Report object, while the [`Report`][9]
interface implements methods and properties for Report management along with
properties containing telemetry readings. Each report object contains the
timestamp of its last update and an array of structures, each holding a reading
with its metadata and the timestamp of that metric's last update. Each report
also has a property that stores the update interval (for periodically updated
reports).

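The object model described above can be pictured with a small in-memory
sketch. It does not reproduce the real `AddReport` signature, which is defined
in phosphor-dbus-interfaces: the `add_report` parameters, the field names, the
millisecond units and the idea that the report name becomes the last path
element are all assumptions for illustration.

```python
from dataclasses import dataclass, field

REPORTS_ROOT = "/xyz/openbmc_project/Telemetry/Reports"


@dataclass
class Reading:
    metric_id: str  # metadata identifying the metric
    value: float
    timestamp: int  # last update of this metric (ms, assumed unit)


@dataclass
class Report:
    path: str
    interval_ms: int    # update interval for periodic reports
    timestamp: int = 0  # last update of the whole report
    readings: list = field(default_factory=list)


class ReportManager:
    def __init__(self, max_reports):
        self.max_reports = max_reports  # property exposed by the manager
        self.reports = {}

    def add_report(self, name, interval_ms):
        """Create a report and return its object path, like AddReport."""
        path = f"{REPORTS_ROOT}/{name}"
        if path in self.reports:
            raise ValueError(f"report already exists: {path}")
        if len(self.reports) >= self.max_reports:
            raise RuntimeError("maximum number of reports reached")
        self.reports[path] = Report(path=path, interval_ms=interval_ms)
        return path

    def delete(self, path):
        """Remove a report, standing in for the Delete interface."""
        del self.reports[path]


manager = ReportManager(max_reports=1)
power_path = manager.add_report("Power", interval_ms=1000)
```
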
**Trigger management**

The trigger management D-Bus object:

```ascii
xyz.openbmc_project.Telemetry.TriggerManager
/xyz/openbmc_project/Telemetry/Triggers
```

The `TriggerManager` supports the `xyz.openbmc_project.Telemetry.TriggerManager`
interface, which implements the `AddTrigger` method. This method shall be used
to create a new trigger for a given metric. The method's parameters define the
type of metric for which the trigger is set (discrete or numeric). Depending on
this setting, the method accepts a different set of trigger parameters.

For the discrete metric type, trigger parameters contain:

| Field            | Type                | Description                                                                                                                                                                                                     |
| ---------------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| TriggerCondition | enum                | Discrete trigger condition: <br> "Changed" - trigger occurs when value of metric has changed; <br> "Specified" - trigger occurs when value of metric becomes one of the values listed in the discrete triggers. |
| DiscreteTriggers | array of structures | Array of discrete trigger structures.                                                                                                                                                                           |

Each member of the DiscreteTriggers array:

| Field     | Type    | Description                                                                                                      |
| --------- | ------- | ---------------------------------------------------------------------------------------------------------------- |
| TriggerId | string  | Unique trigger Id                                                                                                |
| Severity  | enum    | Severity: <br> "OK" - normal<br> "Warning" - requires attention<br> "Critical" - requires immediate attention    |
| Value     | variant | Value of discrete metric, that constitutes a trigger event.                                                      |
| DwellTime | uint64  | Time in milliseconds that a trigger occurrence persists before the action defined for this trigger is performed. |

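How the two `TriggerCondition` values differ can be sketched as follows. The
function name, the tuple it returns and the dictionary layout are hypothetical;
only the field names and the "Changed"/"Specified" semantics come from the
tables above. Dwell time handling is left out of this sketch.

```python
def evaluate_discrete(condition, discrete_triggers, previous, current):
    """Return (fired, matched_ids) for a metric moving previous -> current."""
    if current == previous:
        return False, []  # the value did not change, so nothing occurs
    if condition == "Changed":
        return True, []   # any change of value is a trigger occurrence
    if condition == "Specified":
        # Fire only when the new value is listed in the discrete triggers.
        matched = [
            t["TriggerId"] for t in discrete_triggers if t["Value"] == current
        ]
        return bool(matched), matched
    raise ValueError(f"unknown TriggerCondition: {condition}")


# Hypothetical discrete triggers for a power supply state metric.
triggers = [
    {"TriggerId": "psu_failed", "Severity": "Critical",
     "Value": "Failed", "DwellTime": 0},
    {"TriggerId": "psu_degraded", "Severity": "Warning",
     "Value": "Degraded", "DwellTime": 5000},
]
result = evaluate_discrete("Specified", triggers, "OK", "Failed")
```
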
341For numeric metric type, trigger parameters contain numeric thresholds. Numeric
342thresholds structure shall contain up to 4 thresholds: upper warning, upper
343critical, lower warning and lower critical. Thus it will contain up to 4
344structures shown below:
345
346| Field          | Type    | Description                                                                                                                                                                                                                                                                                                                                                                                                         |
347| -------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
348| ThresholdType  | enum    | Numeric trigger type: <br> "UpperCritical" - reading is above normal range and requires immediate attention<br>"UpperWarning" - reading is above normal range and may require attention<br>"LowerCritical" - reading is below normal range and requires immediate attention<br>"LowerWarning" - reading is below normal range and may require attention                                                             |
349| DwellTime      | uint64  | Time in milliseconds that a trigger occurrence persists before the action defined for this trigger is performed.                                                                                                                                                                                                                                                                                                    |
350| Activation     | enum    | Indicates direction of crossing the threshold value that trigger the threshold's action:<br> "Increasing" - trigger action when reading is changing from below to above the threshold's value<br> "Decreasing" - trigger action when reading is changing from above to below the threshold's value<br> "Either" - trigger action when reading is crossing the threshold's value in either direction described above |
351| ThresholdValue | variant | Value of reading that will trigger the threshold                                                                                                                                                                                                                                                                                                                                                                    |
352
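The `Activation` directions can be illustrated with a small comparison helper.
The function name is hypothetical, and the treatment of a reading that is
exactly equal to the threshold is an assumption, since the table does not
define it: here a crossing is reported on the step that ends strictly beyond
the threshold.

```python
def threshold_crossed(activation, threshold, previous, current):
    """Return True if the step previous -> current crosses the threshold."""
    # Assumption: the crossing is reported on the step that ends strictly
    # above (or strictly below) the threshold value.
    increasing = previous <= threshold < current
    decreasing = previous >= threshold > current
    if activation == "Increasing":
        return increasing
    if activation == "Decreasing":
        return decreasing
    if activation == "Either":
        return increasing or decreasing
    raise ValueError(f"unknown Activation: {activation}")


# Hypothetical upper warning threshold of 50.0 for a temperature reading.
rising = threshold_crossed("Increasing", 50.0, previous=49.0, current=51.0)
falling = threshold_crossed("Increasing", 50.0, previous=51.0, current=49.0)
```
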
The `AddTrigger` method also allows defining the action taken when the trigger
is activated. Upon trigger activation, three actions are possible: logging an
event to the log service, sending an event via the event service and triggering
a metric report update.

In order to assign a trigger to a specific metric, the metric parameter is
defined. Its structure contains the following data:

| Field      | Type        | Description                                                                                                                                                                                        |
| ---------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| SensorPath | object path | D-Bus path to sensor, for which trigger is defined.                                                                                                                                                |
| MetricId   | string      | Contains unique metric id, that can be mapped to Redfish MetricId.                                                                                                                                 |
| ReportPath | object path | D-Bus path to Telemetry's reading report which update shall be triggered when trigger condition occurs. This is optional and shall be filled when trigger's action is set to update metric report. |

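The rule that `ReportPath` is optional unless the action is a report update
can be expressed as a small validation step. The sketch below is illustrative
only: the `Metric` dataclass, the `validate_trigger` helper and the assumption
that a trigger carries a single action named "Log", "Event" or "Update" are
not taken from the interface definition.

```python
from dataclasses import dataclass
from typing import Optional

ACTIONS = {"Log", "Event", "Update"}  # assumed action names


@dataclass
class Metric:
    sensor_path: str                   # SensorPath
    metric_id: str                     # MetricId
    report_path: Optional[str] = None  # ReportPath, needed for "Update"


def validate_trigger(action, metric):
    """Raise ValueError if the metric parameter does not fit the action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if not metric.sensor_path.startswith("/"):
        raise ValueError("SensorPath must be a D-Bus object path")
    if action == "Update" and not metric.report_path:
        raise ValueError("ReportPath is required for the Update action")


# Hypothetical sensor path and metric id, used only for this example.
metric = Metric(
    sensor_path="/xyz/openbmc_project/sensors/temperature/cpu0",
    metric_id="cpu0_temp",
)
validate_trigger("Log", metric)  # passes, ReportPath is not needed here
```
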
The `AddTrigger` method also allows setting the trigger's persistency (whether
the trigger shall be stored in the BMC's non-volatile memory).

The `AddTrigger` method returns:

```ascii
String for created trigger - i.e. '/xyz/openbmc_project/Telemetry/Triggers/{Domain}/{Name}'
```

A trigger created this way implements the `xyz.openbmc_project.Object.Delete`
and the `xyz.openbmc_project.Telemetry.Trigger` interfaces. Each trigger object
contains read-only information about the metric type for which it was created
(discrete or numeric). This information determines which triggers are stored
within the trigger object.

If the trigger is defined for the discrete metric type, then it contains
trigger information that looks like this:

| Type                | Description                                                                                                                                                                                                   |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| enum                | Discrete trigger condition: <br> "Changed" - trigger occurs when value of metric has changed;<br>"Specified" - trigger occurs when value of metric becomes one of the values listed in the discrete triggers. |
| array of structures | Array of discrete trigger structures.                                                                                                                                                                         |

Discrete trigger structure:

| Type    | Description                                                                                                         |
| ------- | ------------------------------------------------------------------------------------------------------------------- |
| string  | Unique trigger Id                                                                                                   |
| enum    | Severity: <br> "OK" - normal<br>"Warning" - requires attention<br> "Critical" - requires immediate attention        |
| variant | Value of discrete metric, that constitutes a trigger event.                                                         |
| uint64  | Time in milliseconds that a trigger occurrence persists before the action defined in the `ActionType` is performed. |

If the trigger is defined for the numeric metric type, then it contains
information about numeric triggers in the form of an array of 4 structures
presented below:

| Type    | Description                                                                                                                           |
| ------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| enum    | Numeric trigger type: <br> "UpperCritical"<br>"UpperWarning"<br>"LowerCritical"<br>"LowerWarning"                                     |
| uint64  | Time in milliseconds that a trigger occurrence persists before the action defined in the `ActionType` is performed.                   |
| enum    | Indicates direction of crossing the threshold value that trigger the threshold's action:<br> "Increasing"<br>"Decreasing"<br>"Either" |
| variant | Value of reading that will trigger the threshold                                                                                      |

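Both discrete and numeric triggers carry a dwell time: the condition has to
persist for that many milliseconds before the action is performed. A minimal
sketch of such a timer is shown below. `DwellTimer` is a hypothetical name,
and two details are assumptions because the document does not specify them:
the action fires once per occurrence, and the timer restarts whenever the
condition stops holding.

```python
class DwellTimer:
    """Tracks how long a trigger condition has held (times in ms)."""

    def __init__(self, dwell_time_ms):
        self.dwell_time_ms = dwell_time_ms
        self._since = None   # when the condition started to hold, if it does
        self._fired = False  # action already performed for this occurrence

    def update(self, condition_holds, now_ms):
        """Feed the current state; return True once when the action is due."""
        if not condition_holds:
            # Assumption: leaving the condition resets the dwell period.
            self._since = None
            self._fired = False
            return False
        if self._since is None:
            self._since = now_ms
        if not self._fired and now_ms - self._since >= self.dwell_time_ms:
            self._fired = True
            return True
        return False


timer = DwellTimer(dwell_time_ms=1000)
events = [timer.update(True, t) for t in (0, 500, 1000, 1500)]
```
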
The trigger object also contains information about the reading for which the
trigger was defined. It takes the form of a structure consisting of three
fields.

| Field type  | Description                                                                                                                                                                              |
| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| object path | D-Bus path of sensor. This is a path to the sensor, for which reading trigger is defined.                                                                                                |
| string      | Unique metric Id                                                                                                                                                                         |
| object path | D-Bus path of existing reading report. This is required when trigger's action is to update metric report. This path shall point to existing reading report within the Telemetry service. |

**Trigger operations**

Triggers support three types of operation: Log, Event and Update. Each is
handled differently.

1. For action Log, the event shall be logged to the system journal. In this case
   the Telemetry service writes data to the system journal using libjournal. The
   Redfish log service shall then retrieve the data by reading the system
   journal. This flow is shown in the diagram below.

```ascii
+---------------------------+
|bmcweb|                    |         +----------------------+
+------/    +-----------+-+ |         |Telemetry|            |
|           |Redfish    | | |         +---------/            |
|           |log service| | |         |                      |
|           +-----------/ | |         |                      |
|           |             | |         |                      |
|           |             | |         |                      |
|           +------^------+ |         +-----------+----------+
+---------------------------+                     |
                   |                              |
                   +----collect----+            event
                     journal entry |      (write to journal)
                                   |              |
       +------------------------------------+     |
       |systemd|                   |        |     |
       +-------/ +----------+  +---+------+ |     |
       |         |journal|  |  |libjournal| |     |
       |         +-------/  <-->          <-------+
       |         |          |  +----------+ |
       |         |          |               |
       |         |          |               |
       |         +----------+               |
       |                                    |
       +------------------------------------+
```

2. For action Event, the Telemetry service shall send an event using the
   [Redfish Event Service][6], either as a push-style event or over SSE.

3. For action Update, the Telemetry service will trigger an update of the
   reading report pointed to by the D-Bus path contained in the trigger object
   properties. The update shall cause the reading report's D-Bus object to emit
   a property change signal. This will cause the Redfish Metric Report to be
   streamed out if it was configured to do so.

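The three operations can be summarized as a simple dispatch. The Python sketch
below is illustrative only: `TriggerAction`, `handle_trigger` and the example
D-Bus path are hypothetical names, and each branch returns a description of the
step instead of touching the journal, the Event Service or D-Bus.

```python
from enum import Enum
from typing import Optional


class TriggerAction(Enum):
    LOG = "Log"        # write an entry to the system journal
    EVENT = "Event"    # send a Redfish event (push-style or SSE)
    UPDATE = "Update"  # update the reading report linked to the trigger


def handle_trigger(action: TriggerAction, trigger_id: str,
                   report_path: Optional[str] = None) -> str:
    """Return a description of the step taken for the given action."""
    if action is TriggerAction.LOG:
        return f"journal: trigger {trigger_id} fired"
    if action is TriggerAction.EVENT:
        return f"event: trigger {trigger_id} fired"
    if action is TriggerAction.UPDATE:
        # Update needs the D-Bus path of an existing reading report.
        if report_path is None:
            raise ValueError("Update requires the D-Bus path of a reading report")
        return f"update: {report_path}"
    raise ValueError(f"unsupported action: {action!r}")


# Hypothetical object path, used only for illustration.
step = handle_trigger(TriggerAction.UPDATE, "SampleTrigger",
                      "/example/telemetry/reports/SampleMetric")
```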
**Redfish Telemetry Service API**

Redfish Telemetry Service shall support 2019.1 Redfish schemas for telemetry
resources. Metric report definitions determine which metrics are to be included
in a metric report. A metric definition is assigned to a particular metric type
and describes how the metric should be interpreted. The following resource
schemas shall be supported:

- TelemetryService 1.1.2
- MetricDefinition 1.0.3
- MetricReportDefinition 1.3.0
- MetricReport 1.2.0
- Triggers 1.1.1

The following diagram shows relations between these resources.

```ascii
 +----------------------------------------------------------------------------+
 |                             Service root                                   |
 +----------------------------------+-------------------------------+---------+
                                    |                               |
                                    |                               |
                                    |                               |
 +----------------------------------v-----------------+  +----------v---------+
 |                                                    |  |Chassis             |
 |                Telemetry Service                   |  |                    |
 |                                                    |  |                    |
 |                                                    |  |  +---------------+ |
 +---------+--------------+------------------+--------+  |  |               | |
           |              |                  |           |  |   Chassis 1   | |
           |              |                  |           |  |               | |
           |              |                  |           |  +---------+-----+ |
           |              |                  |           |            |       |
+----------v--+  +--------v----+  +----------v-----+     +--------------------+
|Triggers     |  |Metric       |  |Metric report   |                  |
|             |  |definition   |  |                |                  |
|             |  | +---------+ |  |                | Reads            |
| +---------+ |  | |Reading  | |  | +-----------+  | ReadingVolts  +--v------+
| |         | |  | |Volts    <------+           +------------------>         |
| |Trigger 1| |  | +---------+ |  | |  Metric   |  |               |         |
| |         | |  |             |  | | report 1  |  | Reads         |  Power  |
| |         | |  | +---------+ |  | |           |  | PowerConsumed |         |
| |         | |  | |         | |  | |           |  | Watts         |         |
| +--+---+--+ |  | |Power    <------+           +------------------>         |
|    |   |    |  | |Consumed | |  | +-----^-----+  |               +----^----+
|    |   |    |  | |Watts    | |  |       |        |                    |
|    |   |    |  | +---------+ |  |       |        |                    |
|    |   |    |  |             |  |       |        |                    |
+-------------+  +-------------+  +----------------+                    |
     |   |                                |                             |
     |   | Triggers report update         |                             |
     |   | (when applicable)              |                             |
     |   +--------------------------------+                             |
     |                                                                  |
     |   Monitors PowerConsumedWatts to check                           |
     |   whether trigger value is exceeded                              |
     +------------------------------------------------------------------+
```

The diagram shows the relations between Redfish resources. A metric report is
defined to be generated periodically, on demand or on change. Each metric in the
Metric Report contains the URI of its metric definition and of the Redfish
sensor whose reading value is presented. Nevertheless, under this presentation
layer, Telemetry gathers D-Bus sensor readings and exposes them in reading
reports over D-Bus for the Telemetry Service. Each D-Bus sensor is mapped to a
Redfish sensor.

Examples of Redfish resources for the Telemetry Service are shown below.

The Telemetry Service Redfish resource example:

```json
{
    "@odata.type": "#TelemetryService.v1_1_2.TelemetryService",
    "Id": "TelemetryService",
    "Name": "Telemetry Service",
    "Status": {
        "State": "Enabled",
        "Health": "OK"
    },
    "MinCollectionInterval": "PT10S",
    "SupportedCollectionFunctions": [],
    "MaxReports": <max_no_of_reports>,
    "MetricDefinitions": {
        "@odata.id": "/redfish/v1/TelemetryService/MetricDefinitions"
    },
    "MetricReportDefinitions": {
        "@odata.id": "/redfish/v1/TelemetryService/MetricReportDefinitions"
    },
    "MetricReports": {
        "@odata.id": "/redfish/v1/TelemetryService/MetricReports"
    },
    "Triggers": {
        "@odata.id": "/redfish/v1/TelemetryService/Triggers"
    },
    "LogService": {
        "@odata.id": "/redfish/v1/Managers/<manager_name>/LogServices/Journal"
    },
    "@odata.context": "/redfish/v1/$metadata#TelemetryService",
    "@odata.id": "/redfish/v1/TelemetryService"
}
```

Sample metric report definition:

```json
{
  "@odata.type": "#MetricReportDefinition.v1_3_0.MetricReportDefinition",
  "Id": "SampleMetric",
  "Name": "Sample Metric Report Definition",
  "MetricReportDefinitionType": "Periodic",
  "Schedule": {
    "RecurrenceInterval": "PT10S"
  },
  "ReportActions": ["LogToMetricReportsCollection"],
  "ReportUpdates": "Overwrite",
  "MetricReport": {
    "@odata.id": "/redfish/v1/TelemetryService/MetricReports/SampleMetric"
  },
  "Status": {
    "State": "Enabled"
  },
  "Metrics": [
    {
      "MetricId": "Test",
      "MetricProperties": [
        "/redfish/v1/Chassis/NC_Baseboard/Power#/PowerControl/0/PowerConsumedWatts"
      ]
    }
  ],
  "@odata.context": "/redfish/v1/$metadata#MetricReportDefinition.MetricReportDefinition",
  "@odata.id": "/redfish/v1/TelemetryService/MetricReportDefinitions/SampleMetric"
}
```

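Since a reading report is created when a Metric Report Definition is posted, a
client-side request might be built as in the following Python sketch. The host
name `bmc.example.com` and the helper `build_create_report_request` are
hypothetical; authentication and TLS settings are omitted, and the request is
only constructed here, not sent.

```python
import json
import urllib.request

BMC_URL = "https://bmc.example.com"  # hypothetical BMC address


def build_create_report_request(base_url: str, definition: dict) -> urllib.request.Request:
    """Build (but do not send) a POST creating a Metric Report Definition."""
    return urllib.request.Request(
        url=f"{base_url}/redfish/v1/TelemetryService/MetricReportDefinitions",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Payload reduced to the fields a client would typically supply.
definition = {
    "Id": "SampleMetric",
    "Name": "Sample Metric Report Definition",
    "MetricReportDefinitionType": "Periodic",
    "Schedule": {"RecurrenceInterval": "PT10S"},
    "ReportActions": ["LogToMetricReportsCollection"],
    "ReportUpdates": "Overwrite",
    "Metrics": [
        {
            "MetricId": "Test",
            "MetricProperties": [
                "/redfish/v1/Chassis/NC_Baseboard/Power#/PowerControl/0/PowerConsumedWatts"
            ],
        }
    ],
}

request = build_create_report_request(BMC_URL, definition)
# urllib.request.urlopen(request) would submit it to a reachable BMC.
```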
Sample metric report:

```json
{
  "@odata.type": "#MetricReport.v1_2_0.MetricReport",
  "Id": "SampleMetric",
  "Name": "Sample Metric Report",
  "ReportSequence": "0",
  "MetricReportDefinition": {
    "@odata.id": "/redfish/v1/TelemetryService/MetricReportDefinitions/SampleMetric"
  },
  "MetricValues": [
    {
      "MetricDefinition": {
        "@odata.id": "/redfish/v1/TelemetryService/MetricDefinitions/SampleMetricDefinition"
      },
      "MetricId": "Test",
      "MetricValue": "100",
      "Timestamp": "2016-11-08T12:25:00-05:00",
      "MetricProperty": "/redfish/v1/Chassis/Tray_1/Power#/0/PowerConsumedWatts"
    }
  ],
  "@odata.context": "/redfish/v1/$metadata#MetricReport.MetricReport",
  "@odata.id": "/redfish/v1/TelemetryService/MetricReports/SampleMetric"
}
```

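A consumer of such a report mostly needs the `MetricValues` array. The Python
sketch below shows one way to extract it; the helper name `metric_values` is
hypothetical, and the conversion to `float` assumes that `MetricValue` carries a
numeric reading encoded as a string, as in the sample.

```python
import json

# Trimmed copy of the sample report, keeping only the fields used below.
REPORT_JSON = """
{
  "Id": "SampleMetric",
  "MetricValues": [
    {
      "MetricId": "Test",
      "MetricValue": "100",
      "Timestamp": "2016-11-08T12:25:00-05:00",
      "MetricProperty": "/redfish/v1/Chassis/Tray_1/Power#/0/PowerConsumedWatts"
    }
  ]
}
"""


def metric_values(report: dict) -> dict:
    """Map each MetricId to a (numeric value, timestamp) pair."""
    return {
        entry["MetricId"]: (float(entry["MetricValue"]), entry["Timestamp"])
        for entry in report["MetricValues"]
    }


values = metric_values(json.loads(REPORT_JSON))
```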
Sample trigger that will cause a metric report update:

```json
{
  "@odata.type": "#Triggers.v1_1_1.Triggers",
  "Id": "SampleTrigger",
  "Name": "Sample Trigger",
  "MetricType": "Numeric",
  "Links": {
    "MetricReportDefinitions": [
      {
        "@odata.id": "/redfish/v1/TelemetryService/MetricReportDefinitions/SampleMetric"
      }
    ]
  },
  "Status": {
    "State": "Enabled"
  },
  "TriggerActions": ["RedfishMetricReport"],
  "NumericThresholds": {
    "UpperCritical": {
      "Reading": 50,
      "Activation": "Increasing",
      "DwellTime": "PT0.001S"
    },
    "UpperWarning": {
      "Reading": 48.1,
      "Activation": "Increasing",
      "DwellTime": "PT0.004S"
    }
  },
  "MetricProperties": [
    "/redfish/v1/Chassis/Tray_1/Power#/0/PowerConsumedWatts"
  ],
  "@odata.context": "/redfish/v1/$metadata#Triggers.Triggers",
  "@odata.id": "/redfish/v1/TelemetryService/Triggers/SampleTrigger"
}
```

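The Redfish resource expresses `DwellTime` as a duration string, while the
trigger structures on D-Bus carry the dwell time as a `uint64` number of
milliseconds, so a conversion is needed between the two. The following Python
sketch handles a subset of ISO 8601 durations (days, hours, minutes, seconds);
the function name `duration_to_ms` is hypothetical and this is not the bmcweb
implementation.

```python
import re
from decimal import Decimal

# Subset of ISO 8601 durations: PnDTnHnMn.nS
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def duration_to_ms(text: str) -> int:
    """Convert a duration such as "PT0.001S" to whole milliseconds."""
    match = _DURATION.match(text)
    if match is None or text in ("P", "PT"):
        raise ValueError(f"not a supported duration: {text!r}")
    parts = match.groupdict(default="0")
    seconds = (
        Decimal(parts["days"]) * 86400
        + Decimal(parts["hours"]) * 3600
        + Decimal(parts["minutes"]) * 60
        + Decimal(parts["seconds"])
    )
    # Decimal avoids binary floating point error on values like 0.001.
    return int(seconds * 1000)


upper_critical_dwell_ms = duration_to_ms("PT0.001S")
upper_warning_dwell_ms = duration_to_ms("PT0.004S")
```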
**Performance tests**

Performance tests were conducted on an AST2500 system with 64 MB of flash and
512 MB of RAM. Flash consumption by the Telemetry service is 197.5 kB. The
runtime statistics are shown in the table below. Each reading report is mapped
to a single Metric Report. The runtime data is collected for the Telemetry
component only. All reports were created with the
`xyz.openbmc_project.Telemetry.Metric.OnChange` property to maximize the
workload. In the configuration with 50 reports and 50 sensors, this amounts to
about 200 new readings per second, generating 200 reading reports per second.
The table shows CPU usage and memory usage. The VSZ is the amount of memory
mapped into the address space of the process. It includes pages backed by the
process' executable file and shared libraries, its heap and stack, as well as
anything else it has mapped.

| Telemetry service state                       | VSZ     | %VSZ | %CPU   |
| --------------------------------------------- | ------- | ---- | ------ |
| Idle (0 reports, 0 sensors)                   | 5188 kB | 1%   | 0%     |
| 1 report, 1 sensor                            | 5188 kB | 1%   | 1%     |
| 2 reports, 1 sensor                           | 5188 kB | 1%   | 1%     |
| 2 reports, 2 sensors (1 sensor per report)    | 5188 kB | 1%   | 1%     |
| 1 report, 10 sensors                          | 5188 kB | 1%   | 1%     |
| 10 reports, 10 sensors (same for each report) | 5320 kB | 1%   | 1-2%   |
| 2 reports, 20 sensors (10 per report)         | 5188 kB | 1%   | 1%     |
| 30 reports, 30 sensors (10 per report)        | 5444 kB | 1%   | 5-9%   |
| 50 reports, 50 sensors (10 per report)        | 5572 kB | 1%   | 11-14% |

The last two configurations use 10 sensors per reading report, which gives 3 or
5 distinct configurations. Each such configuration is used to create 10 reading
reports to obtain the desired total of 30 or 50 reading reports.

In this architecture a reading report is created every time a Redfish Metric
Report Definition is posted (creating a new Metric Report).

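The %VSZ column in the table above is consistent with the VSZ figures being
read as kB against the 512 MB of RAM. A quick check under that assumption (the
helper `vsz_percent` is hypothetical) shows that every measured value rounds to
1%:

```python
RAM_KB = 512 * 1024  # 512 MB of RAM expressed in kB


def vsz_percent(vsz_kb: int, ram_kb: int = RAM_KB) -> float:
    """Share of RAM covered by a VSZ value, assuming both are in kB."""
    return 100.0 * vsz_kb / ram_kb


# Distinct VSZ values from the table, interpreted as kB (assumption).
rounded = {vsz: round(vsz_percent(vsz)) for vsz in (5188, 5320, 5444, 5572)}
```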
## Alternatives Considered

The [framework based on collectd/librrd][5] was considered as an alternative
design. Although it seems to be a versatile and scalable solution, it has some
drawbacks from our point of view:

- Collectd's footprint in the minimal working configuration is around 2.6 MB,
  while available space for the OpenBMC is limited to 64 MB.
- In this design, librrd is used to store metrics on the BMC's non-volatile
  storage, which may be an issue when lots of metrics are captured and stored
  to OpenBMC's limited storage space. Flash wear-out may also occur when
  metrics are captured frequently (like once per second).
- Telemetry service is directly compatible with Redfish Telemetry Service API,
  which means that Telemetry's reading reports can be directly mapped to
  Redfish Metric Reports.
- Telemetry service unifies the way the BMC's telemetry is exposed over Redfish
  and may be used with multiple front-ends, so support for telemetry over IPMI
  or any other API can be added without difficulty.

Since this design assumes flexibility and modularity, there are no obstacles to
using collectd in cooperation with Telemetry. One possible configuration is
shown in the diagram below.

```ascii
   +-----------------+      +-----------------+
   |  D-Bus sensors  |      |   Telemetry     |
   +--------^--------+      +--------^--------+
            |                        |
            |                        |
            |                        |
<--------^--v-----------D-Bus--------v-^---------->
         |                             |
         |                             |
         |                             |
 +-------v------------+     +----------v--------+
 |  collectd metrics  |     |                   |
 |  exposed as D-Bus  |     |     bmcweb        |
 |      sensors       |     |  (with Redfish    |
 +---------^----------+     |    Telemetry      |
           |                |     Service)      |
           |                |                   |
    +------+-------+        +-------------------+
    |              |
    |   collectd   |
    |              |
    +--------------+
```

Here collectd is used as the source of some set of metrics. It exposes them as
D-Bus sensors, which can easily be consumed by both bmcweb and the Telemetry
service without any changes to their D-Bus interfaces. In such a configuration
the Telemetry service provides metric report and trigger management.

Another possible configuration is to use collectd without the Telemetry
service, but in that case collectd does not provide metric report and trigger
support compatible with Redfish. Either the Redfish Telemetry Service won't be
supported, or metric report and trigger support has to be provided by collectd.

## Impacts

This design impacts the architecture of the bmcweb component, since it adds the
Redfish Telemetry Service implementation as a component for the existing Redfish
API implementation.

## Testing

This is a very high-level description of the proposed set of tests. Testing
shall be done on three basic levels:

- Unit tests
- Functional tests
- Performance tests

**Unit tests**

The Telemetry code shall be covered by unit tests. The preferred framework is
[GTest/GMock][7]. The unit tests shall be run before a code change is committed
to make sure that nothing is broken in existing functionality. Also, when new
code is introduced, a new set of unit tests shall be committed with it,
following the test-driven development principle. Unit tests shall also be
carefully reviewed.

**Functional tests**

Functional tests will be divided into two steps.

The first step tests the Telemetry metric report management. The test scenario
shall cover creating a metric report by POSTing a proper metric report
definition, reading the metric report (using GET on the proper URI) and deleting
the metric report. The required configuration for such a test is D-Bus sensors
(at least some of them) and bmcweb with the Redfish Telemetry Service
implemented. The tests shall be performed on real hardware. For ease of metric
testing, dummy D-Bus sensors may be used to supply specifically prepared
metrics. This configuration shall also enable testing aggregated operations
(MIN, MAX, SUM, AVG).

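With dummy sensors feeding known values, the expected results of the aggregated
operations can be computed independently and compared with what the metric
report returns. A minimal Python sketch of such a reference calculation follows;
the helper `expected_aggregates` and the sample readings are hypothetical.

```python
from statistics import fmean
from typing import Dict, Sequence

# Reference implementations of the aggregated operations named above.
AGGREGATES = {"MIN": min, "MAX": max, "SUM": sum, "AVG": fmean}


def expected_aggregates(readings: Sequence[float]) -> Dict[str, float]:
    """Compute the values a test would expect for each aggregated operation."""
    if not readings:
        raise ValueError("at least one reading is required")
    return {name: float(func(readings)) for name, func in AGGREGATES.items()}


# Readings a dummy sensor could be scripted to produce.
expected = expected_aggregates([10.0, 20.0, 30.0, 40.0])
```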
The second step tests trigger and event generation. This also requires the Event
Service to be implemented along with the Log Service. Tests shall cover all
scenarios: sending a metric report as an event, triggering a metric report
update and logging events.

**Performance tests**

Performance tests shall be done using a full OpenBMC configuration with the
complete required feature set. The tests shall create a large number of metric
reports (up to the maximum number) along with all possible triggers.
Measurements shall cover the periodic metric report jitter, delays in event
logging or sending, the BMC's CPU utilization and the performance impact on
other services.

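Periodic metric report jitter can be derived from the timestamps of consecutive
reports and the configured interval. The Python sketch below shows one possible
definition (deviation of each observed interval from the nominal one); the
function names and sample timestamps are hypothetical.

```python
from typing import List, Sequence


def interval_jitter_ms(timestamps_ms: Sequence[int],
                       nominal_interval_ms: int) -> List[int]:
    """Deviation of each observed report interval from the nominal interval."""
    intervals = [later - earlier
                 for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    return [interval - nominal_interval_ms for interval in intervals]


def max_abs_jitter_ms(timestamps_ms: Sequence[int],
                      nominal_interval_ms: int) -> int:
    """Largest absolute deviation, or 0 when fewer than two reports exist."""
    jitter = interval_jitter_ms(timestamps_ms, nominal_interval_ms)
    return max((abs(value) for value in jitter), default=0)


# Arrival times (ms) of four reports configured with a 10 s interval.
arrivals = [0, 10000, 20010, 29995]
jitter = interval_jitter_ms(arrivals, 10000)
worst = max_abs_jitter_ms(arrivals, 10000)
```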
[1]:
  https://www.dmtf.org/sites/default/files/standards/documents/DSP2051_0.1.0a.zip
[2]:
  https://github.com/openbmc/docs/blob/master/architecture/sensor-architecture.md
[3]: https://www.kernel.org/doc/Documentation/hwmon/
[4]: https://www.freedesktop.org/wiki/Software/dbus/
[5]: https://gerrit.openbmc.org/c/openbmc/docs/+/22257
[6]: https://gerrit.openbmc.org/c/openbmc/docs/+/24749
[7]: https://github.com/google/googletest
[8]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Telemetry/ReportManager.interface.yaml
[9]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Telemetry/Report.interface.yaml
[10]:
  https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Object/Delete.interface.yaml