====================================
Coherent Accelerator Interface (CXL)
====================================

Introduction
============

    The coherent accelerator interface is designed to allow the
    coherent connection of accelerators (FPGAs and other devices) to a
    POWER system. These devices need to adhere to the Coherent
    Accelerator Interface Architecture (CAIA).

    IBM refers to this as the Coherent Accelerator Processor Interface
    or CAPI. In the kernel it's referred to by the name CXL to avoid
    confusion with the ISDN CAPI subsystem.

    Coherent in this context means that the accelerator and CPUs can
    both access system memory directly and with the same effective
    addresses.


Hardware overview
=================

    ::

         POWER8/9             FPGA
       +----------+        +---------+
       |          |        |         |
       |   CPU    |        |   AFU   |
       |          |        |         |
       |          |        |         |
       |          |        |         |
       +----------+        +---------+
       |   PHB    |        |         |
       |   +------+        |   PSL   |
       |   | CAPP |<------>|         |
       +---+------+  PCIE  +---------+

    The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP)
    unit which is part of the PCIe Host Bridge (PHB). Linux manages
    the CAPP through calls into OPAL and doesn't program it directly.

    The FPGA (or coherently attached device) consists of two parts:
    the POWER Service Layer (PSL) and the Accelerator Function Unit
    (AFU). The AFU is used to implement specific functionality behind
    the PSL. The PSL, among other things, provides memory address
    translation services to allow each AFU direct access to userspace
    memory.

    The AFU is the core part of the accelerator (e.g. the compression
    or crypto function). The kernel has no knowledge of the function
    of the AFU. Only userspace interacts directly with the AFU.

    The PSL provides the translation and interrupt services that the
    AFU needs. This is what the kernel interacts with. For example, if
    the AFU needs to read a particular effective address, it sends
    that address to the PSL. The PSL then translates it, fetches the
    data from memory and returns it to the AFU. If the PSL has a
    translation miss, it interrupts the kernel and the kernel services
    the fault. The context in which this fault is serviced depends on
    who owns that acceleration function.

    - POWER8 and PSL Version 8 comply with CAIA Version 1.0.
    - POWER9 and PSL Version 9 comply with CAIA Version 2.0.

    PSL Version 9 provides new features such as:

    * Interaction with the nest MMU on the P9 chip.
    * Native DMA support.
    * Sending ASB_Notify messages for host thread wakeup.
    * Atomic operations.
    * etc.

    Cards with a PSL9 won't work on a POWER8 system and cards with a
    PSL8 won't work on a POWER9 system.

AFU Modes
=========

    The AFU supports two programming modes: dedicated and AFU
    directed. An AFU may support one or both modes.

    When using dedicated mode only one MMU context is supported. In
    this mode, only one userspace process can use the accelerator at
    a time.

    When using AFU directed mode, up to 16K simultaneous contexts can
    be supported. This means up to 16K simultaneous userspace
    applications may use the accelerator (although specific AFUs may
    support fewer). In this mode, the AFU sends a 16-bit context ID
    with each of its requests. This tells the PSL which context is
    associated with each operation. If the PSL can't translate an
    operation, the ID can also be accessed by the kernel so it can
    determine the userspace context associated with an operation.


MMIO space
==========

    A portion of the accelerator MMIO space can be directly mapped
    from the AFU to userspace. Either the whole space can be mapped or
    just a per context portion. The hardware is self describing, hence
    the kernel can determine the offset and size of the per context
    portion.


Interrupts
==========

    AFUs may generate interrupts that are destined for userspace. These
    are received by the kernel as hardware interrupts and passed on to
    userspace via the read syscall documented below.

    Data storage faults and error interrupts are handled by the kernel
    driver.


Work Element Descriptor (WED)
=============================

    The WED is a 64-bit parameter passed to the AFU when a context is
    started. Its format is up to the AFU, hence the kernel has no
    knowledge of what it represents. Typically it will be the
    effective address of a work queue or status block where the AFU
    and userspace can share control and status information.




User API
========

1. AFU character devices

    For AFUs operating in AFU directed mode, two character device
    files will be created. /dev/cxl/afu0.0m will correspond to a
    master context and /dev/cxl/afu0.0s will correspond to a slave
    context. Master contexts have access to the full MMIO space an
    AFU provides. Slave contexts have access to only the per process
    MMIO space an AFU provides.

    For AFUs operating in dedicated process mode, the driver will
    only create a single character device per AFU called
    /dev/cxl/afu0.0d. This will have access to the entire MMIO space
    that the AFU provides (like master contexts in AFU directed).
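
    The naming scheme above can be sketched as a small helper. This
    is only an illustration: cxl_dev_path() is a hypothetical name,
    not part of the kernel API or libcxl, and it assumes the two
    numbers in afuX.Y are plain decimal indices.

    ::

        #include <assert.h>
        #include <stddef.h>
        #include <stdio.h>
        #include <string.h>

        /*
         * Hypothetical helper: format /dev/cxl/afu<x>.<y><suffix>,
         * where suffix is 'm' (master), 's' (slave) or 'd' (dedicated).
         * Returns 0 on success, -1 on a bad suffix or short buffer.
         */
        static int cxl_dev_path(char *buf, size_t len, unsigned int x,
                                unsigned int y, char suffix)
        {
                int n;

                assert(buf != NULL);
                /* strchr() also matches the terminator, so reject it first */
                if (suffix == '\0' || !strchr("msd", suffix))
                        return -1;
                n = snprintf(buf, len, "/dev/cxl/afu%u.%u%c", x, y, suffix);
                return (n < 0 || (size_t)n >= len) ? -1 : 0;
        }

    For the first AFU on the first card, suffix 's' yields
    /dev/cxl/afu0.0s, the slave device named above.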

    The types described below are defined in include/uapi/misc/cxl.h.

    A userspace library, libcxl, provides a C interface to this kernel
    API. It is available here:

	https://github.com/ibm-capi/libcxl

    The following file operations are supported on both slave and
    master devices.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API.

    A dedicated mode AFU only has one context and only allows the
    device to be opened once.

    An AFU directed mode AFU can have many contexts; the device can be
    opened once for each context that is available.

    When all available contexts are allocated the open call will fail
    and return -ENOSPC.

    Note:
          IRQs need to be allocated for each context, which may limit
          the number of contexts that can be created, and therefore
          how many times the device can be opened. The POWER8 CAPP
          supports 2040 IRQs and 3 are used by the kernel, so 2037 are
          left. If 1 IRQ is needed per context, then only 2037
          contexts can be allocated. If 4 IRQs are needed per context,
          then only 2037/4 = 509 contexts can be allocated.
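
    The arithmetic in the note can be written out as follows. This is
    a sketch only: the macro and function names are hypothetical, and
    the figures are the POWER8 numbers quoted above.

    ::

        #include <assert.h>

        /* Figures quoted in the note above for the POWER8 CAPP. */
        #define P8_CAPP_TOTAL_IRQS      2040u
        #define P8_CAPP_KERNEL_IRQS     3u

        /*
         * Upper bound on contexts imposed by IRQ allocation alone.
         * Integer division rounds down, since a context cannot be
         * given a partial set of IRQs.
         */
        static unsigned int max_contexts_by_irqs(unsigned int irqs_per_ctx)
        {
                assert(irqs_per_ctx > 0);
                return (P8_CAPP_TOTAL_IRQS - P8_CAPP_KERNEL_IRQS) /
                       irqs_per_ctx;
        }

    With 1 IRQ per context this gives 2037, and with 4 IRQs per
    context it gives 509, matching the note. Other limits, such as the
    number of contexts the AFU itself supports, may be lower.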


ioctl
-----

    CXL_IOCTL_START_WORK:
        Starts the AFU context and associates it with the current
        process. Once this ioctl is successfully executed, all memory
        mapped into this process is accessible to this AFU context
        using the same effective addresses. No additional calls are
        required to map/unmap memory. The AFU memory context will be
        updated as userspace allocates and frees memory. This ioctl
        returns once the AFU context is started.

        Takes a pointer to a struct cxl_ioctl_start_work:

            ::

                struct cxl_ioctl_start_work {
                        __u64 flags;
                        __u64 work_element_descriptor;
                        __u64 amr;
                        __s16 num_interrupts;
                        __s16 reserved1;
                        __s32 reserved2;
                        __u64 reserved3;
                        __u64 reserved4;
                        __u64 reserved5;
                        __u64 reserved6;
                };

            flags:
                Indicates which optional fields in the structure are
                valid.

            work_element_descriptor:
                The Work Element Descriptor (WED) is a 64-bit argument
                defined by the AFU. Typically this is an effective
                address pointing to an AFU specific structure
                describing what work to perform.

            amr:
                Authority Mask Register (AMR), same as the powerpc
                AMR. This field is only used by the kernel when the
                corresponding CXL_START_WORK_AMR value is specified in
                flags. If not specified the kernel will use a default
                value of 0.

            num_interrupts:
                Number of userspace interrupts to request. This field
                is only used by the kernel when the corresponding
                CXL_START_WORK_NUM_IRQS value is specified in flags.
                If not specified the minimum number required by the
                AFU will be allocated. The min and max number can be
                obtained from sysfs.

            reserved fields:
                For ABI padding and future extensions.
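
        A minimal sketch of filling this structure follows. To stay
        buildable without kernel headers it uses a local copy of the
        structure with <stdint.h> types; real code would include
        misc/cxl.h instead. The helper name is hypothetical, and the
        numeric value given for CXL_START_WORK_NUM_IRQS is an
        assumption that should be checked against that header.

        ::

            #include <assert.h>
            #include <stdint.h>
            #include <string.h>

            /* Local copy of the structure listed above. */
            struct cxl_ioctl_start_work {
                    uint64_t flags;
                    uint64_t work_element_descriptor;
                    uint64_t amr;
                    int16_t  num_interrupts;
                    int16_t  reserved1;
                    int32_t  reserved2;
                    uint64_t reserved3;
                    uint64_t reserved4;
                    uint64_t reserved5;
                    uint64_t reserved6;
            };

            /* Assumed value; take the real one from misc/cxl.h. */
            #define CXL_START_WORK_NUM_IRQS 0x0000000000000002ULL

            /*
             * Hypothetical helper: prepare *w for a context whose WED
             * is the address of a userspace structure. Pass
             * num_irqs < 0 to leave the IRQ count to the kernel.
             */
            static void
            cxl_prepare_start_work(struct cxl_ioctl_start_work *w,
                                   const void *wed, int num_irqs)
            {
                    assert(w != NULL);
                    /* clears flags, amr and every reserved field */
                    memset(w, 0, sizeof(*w));
                    w->work_element_descriptor = (uint64_t)(uintptr_t)wed;
                    if (num_irqs >= 0) {
                            w->flags |= CXL_START_WORK_NUM_IRQS;
                            w->num_interrupts = (int16_t)num_irqs;
                    }
            }

        The prepared structure would then be handed to
        ioctl(fd, CXL_IOCTL_START_WORK, &work) on a descriptor
        returned by open().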

    CXL_IOCTL_GET_PROCESS_ELEMENT:
        Get the current context id, also known as the process element.
        The value is returned from the kernel as a __u32.


mmap
----

    An AFU may have an MMIO space to facilitate communication with the
    AFU. If it does, the MMIO space can be accessed via mmap. The size
    and contents of this area are specific to the particular AFU. The
    size can be discovered via sysfs.

    In AFU directed mode, master contexts are allowed to map all of
    the MMIO space and slave contexts are allowed to only map the per
    process MMIO space associated with the context. In dedicated
    process mode the entire MMIO space can always be mapped.

    This mmap call must be done after the START_WORK ioctl.

    Care should be taken when accessing MMIO space. Only 32-bit and
    64-bit accesses are supported by POWER8. Also, the AFU will be
    designed with a specific endianness, so all MMIO accesses should
    consider endianness (the endian(3) helpers such as le64toh() and
    be64toh() are recommended). These endian issues equally apply to
    shared memory queues the WED may describe.
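
    As an illustration of the endianness advice, the sketch below
    wraps 64-bit accesses to a mapped area. It assumes an AFU whose
    registers are big-endian; a little-endian AFU would use le64toh()
    and htole64() instead. The function names are hypothetical.

    ::

        #define _DEFAULT_SOURCE         /* for be64toh()/htobe64() */
        #include <assert.h>
        #include <endian.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Read one naturally aligned 64-bit big-endian register. */
        static uint64_t afu_mmio_read_be64(const volatile void *mmio,
                                           size_t offset)
        {
                const volatile uint64_t *reg;
                uint64_t raw;

                assert(offset % 8 == 0);
                reg = (const volatile uint64_t *)
                        ((const volatile char *)mmio + offset);
                raw = *reg;             /* a single 64-bit load */
                return be64toh(raw);    /* convert to host order */
        }

        /* Write one naturally aligned 64-bit big-endian register. */
        static void afu_mmio_write_be64(volatile void *mmio,
                                        size_t offset, uint64_t val)
        {
                volatile uint64_t *reg;

                assert(offset % 8 == 0);
                reg = (volatile uint64_t *)
                        ((volatile char *)mmio + offset);
                *reg = htobe64(val);    /* a single 64-bit store */
        }

    Here mmio would be the pointer returned by mmap() on the AFU file
    descriptor. Using whole 64-bit loads and stores keeps each access
    within the sizes listed above.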


read
----

    Reads events from the AFU. Blocks if no events are pending
    (unless O_NONBLOCK is supplied). Returns -EIO in the case of an
    unrecoverable error or if the card is removed.

    read() will always return an integral number of events.

    The buffer passed to read() must be at least 4K bytes.

    The result of the read will be a buffer of one or more events.
    Each event is of type struct cxl_event and may vary in size::

            struct cxl_event {
                    struct cxl_event_header header;
                    union {
                            struct cxl_event_afu_interrupt irq;
                            struct cxl_event_data_storage fault;
                            struct cxl_event_afu_error afu_error;
                    };
            };

    The struct cxl_event_header is defined as:

        ::

            struct cxl_event_header {
                    __u16 type;
                    __u16 size;
                    __u16 process_element;
                    __u16 reserved1;
            };

        type:
            This defines the type of event. The type determines how
            the rest of the event is structured. These types are
            described below and defined by enum cxl_event_type.

        size:
            This is the size of the event in bytes including the
            struct cxl_event_header. The start of the next event can
            be found at this offset from the start of the current
            event.

        process_element:
            Context ID of the event.

        reserved field:
            For future extensions and padding.
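
    Because each header carries the size of its event, a buffer
    returned by read() can be walked as sketched below. The header is
    a local copy using <stdint.h> types so that the example builds
    without kernel headers; the function and callback names are
    hypothetical.

    ::

        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Local copy of the header listed above. */
        struct cxl_event_header {
                uint16_t type;
                uint16_t size;
                uint16_t process_element;
                uint16_t reserved1;
        };

        typedef void (*cxl_event_fn)(const struct cxl_event_header *hdr,
                                     const void *payload, void *arg);

        /*
         * Visit every event in the len bytes at buf, calling fn (if
         * not NULL) with the header and the bytes that follow it.
         * Returns the number of events, or -1 on a malformed buffer.
         */
        static int cxl_for_each_event(const void *buf, size_t len,
                                      cxl_event_fn fn, void *arg)
        {
                const unsigned char *p = buf;
                size_t off = 0;
                int count = 0;

                assert(buf != NULL);
                while (off < len) {
                        struct cxl_event_header hdr;

                        if (len - off < sizeof(hdr))
                                return -1;
                        memcpy(&hdr, p + off, sizeof(hdr));
                        /* size includes the header itself */
                        if (hdr.size < sizeof(hdr) ||
                            hdr.size > len - off)
                                return -1;
                        if (fn)
                                fn(&hdr, p + off + sizeof(hdr), arg);
                        off += hdr.size;    /* start of next event */
                        count++;
                }
                return count;
        }

    The len argument would be the byte count returned by read(). The
    bounds checks are defensive; they are not needed if read() only
    ever returns whole events, as stated above.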

    If the event type is CXL_EVENT_AFU_INTERRUPT then the event
    structure is defined as:

        ::

            struct cxl_event_afu_interrupt {
                    __u16 flags;
                    __u16 irq; /* Raised AFU interrupt number */
                    __u32 reserved1;
            };

        flags:
            These flags indicate which optional fields are present
            in this struct. Currently all fields are mandatory.

        irq:
            The IRQ number sent by the AFU.

        reserved field:
            For future extensions and padding.

    If the event type is CXL_EVENT_DATA_STORAGE then the event
    structure is defined as:

        ::

            struct cxl_event_data_storage {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 addr;
                    __u64 dsisr;
                    __u64 reserved3;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        addr:
            The address that the AFU unsuccessfully attempted to
            access. Valid accesses will be handled transparently by the
            kernel but invalid accesses will generate this event.

        dsisr:
            This field gives information on the type of fault. It is a
            copy of the DSISR from the PSL hardware when the address
            fault occurred. The form of the DSISR is as defined in the
            CAIA.

        reserved fields:
            For future extensions.

    If the event type is CXL_EVENT_AFU_ERROR then the event structure
    is defined as:

        ::

            struct cxl_event_afu_error {
                    __u16 flags;
                    __u16 reserved1;
                    __u32 reserved2;
                    __u64 error;
            };

        flags:
            These flags indicate which optional fields are present in
            this struct. Currently all fields are mandatory.

        error:
            Error status from the AFU. Defined by the AFU.

        reserved fields:
            For future extensions and padding.
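
    The three event layouts above can be told apart by switching on
    the header type, as in the sketch below. The structures are local
    copies using <stdint.h> types, the helper name is hypothetical,
    and the numeric values of enum cxl_event_type are assumptions
    that should be taken from misc/cxl.h in real code.

    ::

        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Assumed numbering; check enum cxl_event_type in the header. */
        enum cxl_event_type {
                CXL_EVENT_AFU_INTERRUPT = 1,
                CXL_EVENT_DATA_STORAGE  = 2,
                CXL_EVENT_AFU_ERROR     = 3,
        };

        struct cxl_event_header {
                uint16_t type, size, process_element, reserved1;
        };
        struct cxl_event_afu_interrupt {
                uint16_t flags, irq;
                uint32_t reserved1;
        };
        struct cxl_event_data_storage {
                uint16_t flags, reserved1;
                uint32_t reserved2;
                uint64_t addr, dsisr, reserved3;
        };
        struct cxl_event_afu_error {
                uint16_t flags, reserved1;
                uint32_t reserved2;
                uint64_t error;
        };
        struct cxl_event {
                struct cxl_event_header header;
                union {
                        struct cxl_event_afu_interrupt irq;
                        struct cxl_event_data_storage fault;
                        struct cxl_event_afu_error afu_error;
                };
        };

        /*
         * Return a short name for the event and store its most
         * relevant value (IRQ number, faulting address or AFU error
         * status) in *detail.
         */
        static const char *cxl_event_summary(const struct cxl_event *ev,
                                             uint64_t *detail)
        {
                assert(ev != NULL && detail != NULL);
                switch (ev->header.type) {
                case CXL_EVENT_AFU_INTERRUPT:
                        *detail = ev->irq.irq;
                        return "AFU interrupt";
                case CXL_EVENT_DATA_STORAGE:
                        *detail = ev->fault.addr;
                        return "data storage fault";
                case CXL_EVENT_AFU_ERROR:
                        *detail = ev->afu_error.error;
                        return "AFU error";
                default:
                        *detail = 0;
                        return "unknown event";
                }
        }

    Unknown types are reported rather than rejected, so that a reader
    loop can skip them using the size field in the header.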


2. Card character device (PowerVM guest only)

    In a PowerVM guest, an extra character device is created for the
    card. The device is only used to write (flash) a new image on the
    FPGA accelerator. Once the image is written and verified, the
    device tree is updated and the card is reset to reload the updated
    image.

open
----

    Opens the device and allocates a file descriptor to be used with
    the rest of the API. The device can only be opened once.

ioctl
-----

CXL_IOCTL_DOWNLOAD_IMAGE / CXL_IOCTL_VALIDATE_IMAGE:
    Starts and controls flashing a new FPGA image. Partial
    reconfiguration is not supported (yet), so the image must contain
    a copy of the PSL and AFU(s). Since an image can be quite large,
    the caller may have to iterate, splitting the image into smaller
    chunks.

    Takes a pointer to a struct cxl_adapter_image::

        struct cxl_adapter_image {
            __u64 flags;
            __u64 data;
            __u64 len_data;
            __u64 len_image;
            __u64 reserved1;
            __u64 reserved2;
            __u64 reserved3;
            __u64 reserved4;
        };

    flags:
        These flags indicate which optional fields are present in
        this struct. Currently all fields are mandatory.

    data:
        Pointer to a buffer with part of the image to write to the
        card.

    len_data:
        Size of the buffer pointed to by data.

    len_image:
        Full size of the image.
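
    The chunked iteration mentioned above might be structured as in
    the following sketch. It only shows how data, len_data and
    len_image relate for each chunk. The callback stands in for the
    ioctl call on the card device; which ioctl to issue per chunk, and
    the chunk size, are not specified here and are left to the caller.
    All names other than the structure fields are hypothetical.

    ::

        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Local copy of the structure listed above. */
        struct cxl_adapter_image {
                uint64_t flags, data, len_data, len_image;
                uint64_t reserved1, reserved2, reserved3, reserved4;
        };

        /* Stand-in for the ioctl call; returns 0 on success. */
        typedef int (*cxl_chunk_fn)(const struct cxl_adapter_image *img,
                                    void *arg);

        /*
         * Split an image into chunks of at most chunk_len bytes and
         * pass each one to send(). Returns the number of chunks sent,
         * or -1 as soon as send() fails.
         */
        static int cxl_send_image(const void *image, uint64_t image_len,
                                  uint64_t chunk_len, cxl_chunk_fn send,
                                  void *arg)
        {
                const unsigned char *p = image;
                uint64_t off = 0;
                int chunks = 0;

                assert(image != NULL && send != NULL && chunk_len > 0);
                while (off < image_len) {
                        struct cxl_adapter_image img;
                        uint64_t left = image_len - off;

                        memset(&img, 0, sizeof(img));
                        img.data = (uint64_t)(uintptr_t)(p + off);
                        img.len_data = left < chunk_len ? left : chunk_len;
                        img.len_image = image_len;  /* always full size */
                        if (send(&img, arg) != 0)
                                return -1;
                        off += img.len_data;
                        chunks++;
                }
                return chunks;
        }

        /* Dry-run callback: add up the bytes that would be written. */
        static int cxl_count_bytes(const struct cxl_adapter_image *img,
                                   void *arg)
        {
                *(uint64_t *)arg += img->len_data;
                return 0;
        }

    A 10 byte image sent in 4 byte chunks produces three calls with
    len_data of 4, 4 and 2, each carrying len_image of 10.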


Sysfs Class
===========

    A cxl sysfs class is added under /sys/class/cxl to facilitate
    enumeration and tuning of the accelerators. Its layout is
    described in Documentation/ABI/testing/sysfs-class-cxl.


Udev rules
==========

    The following udev rules could be used to create a symlink to the
    most logical chardev to use in any programming mode (afuX.Yd for
    dedicated, afuX.Ys for AFU directed), since the API is virtually
    identical for each::

	SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b"
	SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \
	                  KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b"
468