Compute Express Link (CXL)
==========================
From the view of a single host, CXL is an interconnect standard that
targets accelerators and memory devices attached to a CXL host.
This description will focus on those aspects visible either to
software running on a QEMU emulated host or to the internals of
functional emulation. As such, it will skip over many of the
electrical and protocol elements that would be of more interest
for real hardware and that dominate more general introductions to CXL.
It will also completely ignore the fabric management aspects of CXL
by considering only a single host and a static configuration.

CXL shares many concepts and much of the infrastructure of PCI Express.
CXL Host Bridges have CXL Root Ports, which may be directly
attached to CXL or PCI End Points. Alternatively there may be CXL Switches
with CXL and PCI Endpoints attached below them.  In many cases additional
control and capabilities are exposed via PCI Express interfaces.
This sharing of interfaces and hence emulation code is reflected
in how the devices are emulated in QEMU. In most cases the various
CXL elements are built upon equivalent PCIe devices.

CXL devices support the following interfaces:

* Most conventional PCIe interfaces

  - Configuration space access
  - BAR mapped memory accesses used for registers and mailboxes
  - MSI/MSI-X
  - AER
  - DOE mailboxes
  - IDE
  - Many other PCI Express defined interfaces

* Memory operations

  - Equivalent of accessing DRAM / NVDIMMs. Any access / feature
    supported by the host for normal memory should also work for
    CXL attached memory devices.

* Cache operations. These are mostly irrelevant to QEMU emulation as
  QEMU is not emulating a coherency protocol. Any emulation related
  to these will be device specific and is out of the scope of this
  document.

CXL 2.0 Device Types
--------------------
CXL 2.0 End Points are often categorized into three types.

**Type 1:** These support coherent caching of host memory.  An example might
be a crypto accelerator.  They may also have device private memory accessible
via means such as PCI memory reads and writes to BARs.

**Type 2:** These support coherent caching of host memory and host
managed device memory (HDM) for which the coherency protocol is managed
by the host. This is a complex topic, so for more information on CXL
coherency see the CXL 2.0 specification.

**Type 3 Memory devices:**  These devices act as a means of attaching
additional memory (HDM) to a CXL host including both volatile and
persistent memory. The CXL topology may support interleaving across a
number of Type 3 memory devices using HDM Decoders in the host, host
bridge, switch upstream port and endpoints.

Scope of CXL emulation in QEMU
------------------------------
The focus of CXL emulation is CXL revision 2.0 and later. Earlier CXL
revisions defined a smaller set of features, leaving much of the control
interface as implementation defined or device specific. This makes generic
emulation challenging, with host specific firmware being responsible
for setup and the Endpoints being presented to operating systems
as Root Complex Integrated End Points. CXL rev 2.0 looks a lot
more like PCI Express, with fully specified discoverability
of the CXL topology.

CXL System components
----------------------
A CXL system is made up of a Host with a number of 'standard components',
the control and capabilities of which are discoverable by system software
using means described in the CXL 2.0 specification.

CXL Fixed Memory Windows (CFMW)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A CFMW consists of a particular range of Host Physical Address space
which is routed to particular CXL Host Bridges.  At the time of generic
software initialization it will have a particular interleaving
configuration and associated Quality of Service Throttling Group (QTG).
This information is available to system software when making
decisions about how to configure interleave across available CXL
memory devices.  It is provided as CFMW Structures (CFMWS) in
the CXL Early Discovery Table (CEDT), an ACPI table.

Note: QTG 0 is the only one currently supported in QEMU.

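As an illustration of how an interleaved CFMW picks a host bridge, the
following is a minimal sketch, assuming simple modulo interleave on the
Host Physical Address with a power-of-two number of targets. The function
name, the window base and the addresses are hypothetical and are not taken
from QEMU or from the specification.

.. code-block:: python

  def cfmw_target_index(hpa, granularity, num_targets):
      """Return the index of the CFMW target (host bridge) that claims hpa.

      Assumes modulo interleave: consecutive granularity sized chunks of
      the address space rotate through the targets in order.
      """
      return (hpa // granularity) % num_targets


  # Hypothetical window base; 8k granularity and two targets as an example.
  WINDOW_BASE = 0x4_0000_0000
  for chunk in range(4):
      hpa = WINDOW_BASE + chunk * 8192
      print(hex(hpa), "-> target", cfmw_target_index(hpa, 8192, 2))

With a single target the function always returns 0, which corresponds to an
uninterleaved window.
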
CXL Host Bridge (CXL HB)
~~~~~~~~~~~~~~~~~~~~~~~~
A CXL host bridge is similar to the PCIe equivalent, but with a
specification defined register interface called CXL Host Bridge
Component Registers (CHBCR). The location of this CHBCR MMIO
space is described to system software via a CXL Host Bridge
Structure (CHBS) in the CEDT ACPI table.  The actual interfaces
are identical to those used for other parts of the CXL hierarchy,
where they appear as CXL Component Registers in PCI BARs.

Interfaces provided include:

* Configuration of HDM Decoders to route CXL Memory accesses with
  a particular Host Physical Address range to the target port
  below which the CXL device servicing that address lies.  This
  may be a mapping to a single Root Port (RP) or across a set of
  target RPs.

CXL Root Ports (CXL RP)
~~~~~~~~~~~~~~~~~~~~~~~
A CXL Root Port serves the same purpose as a PCIe Root Port.
There are a number of CXL specific Designated Vendor Specific
Extended Capabilities (DVSEC) in PCIe Configuration Space
and associated component register access via PCI BARs.

CXL Switch
~~~~~~~~~~
Here we consider a simple CXL switch with only a single
virtual hierarchy. Whilst more complex devices exist, their
visibility to a particular host is generally the same as for
a simple switch design. Hosts often have no awareness
of complex rerouting and device pooling; they simply see
devices being hot added or hot removed.

A CXL switch has a similar architecture to those in PCIe,
with a single upstream port, internal PCI bus and multiple
downstream ports.

Both the CXL upstream and downstream ports have CXL specific
DVSECs in configuration space, and component registers in PCI
BARs.  The Upstream Port has the configuration interfaces for
the HDM decoders which route incoming memory accesses to the
appropriate downstream port.

A CXL switch is created in a similar fashion to PCI switches
by creating an upstream port (cxl-upstream) and a number of
downstream ports on the internal switch bus (cxl-downstream).

CXL Memory Devices - Type 3
~~~~~~~~~~~~~~~~~~~~~~~~~~~
CXL type 3 devices use a PCI class code and are intended to be supported
by a generic operating system driver. They have HDM decoders,
though in these EP devices the decoder is responsible not for
routing but for translation of the incoming Host Physical Address (HPA)
into a Device Physical Address (DPA).

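The HPA to DPA translation can be pictured as removing the address bits
that were consumed by interleave routing further up the topology. The
following is a minimal sketch, assuming modulo interleave, a power-of-two
number of ways and a decoder base aligned to ``granularity * ways``. The
function and parameter names are hypothetical.

.. code-block:: python

  def hpa_to_dpa(hpa, decoder_base, granularity, ways, dpa_base=0):
      """Translate a host physical address to a device physical address.

      Assumes the access has already been routed to this device, so only
      the stripe number and the offset inside the chunk matter here.
      """
      offset = hpa - decoder_base
      stripe = offset // (granularity * ways)  # full rotations over all devices
      within = offset % granularity            # position inside this chunk
      return dpa_base + stripe * granularity + within


  # Hypothetical decoder base; 4 way interleave at 8k granularity.
  BASE = 0x4_0000_0000
  # Second chunk that the first device receives, 0x10 bytes in.
  print(hex(hpa_to_dpa(BASE + 4 * 8192 + 0x10, BASE, 8192, 4)))  # 0x2010

With ``ways=1`` the translation reduces to subtracting the decoder base and
adding ``dpa_base``.
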
CXL Memory Interleave
---------------------
To understand the interaction of different CXL hardware components which
are emulated in QEMU, let us consider a memory read in a fully configured
CXL topology.  Note that system software is responsible for configuration
of all components with the exception of the CFMWs. System software is
responsible for allocating appropriate ranges from within the CFMWs
and exposing those via normal memory configurations as would be done
for system RAM.

Example system Topology. x marks the match in each decoder level::

  |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->|
  |    __________   __________________________________   __________    |
  |   |          | |                                  | |          |   |
  |   | CFMW 0   | |  CXL Fixed Memory Window 1       | | CFMW 2   |   |
  |   | HB0 only | |  Configured to interleave        | | HB1 only |   |
  |   |          | |  memory accesses across HB0/HB1  | |          |   |
  |   |__________| |_____x____________________________| |__________|   |
           |             |                     |             |
           |             |                     |             |
           |             |                     |             |
           |       Interleave Decoder          |             |
           |       Matches this HB             |             |
           \_____________|                     |_____________/
               __________|__________      _____|_______________
              |                     |    |                     |
       (2)    | CXL HB 0            |    | CXL HB 1            |
              | HB IntLv Decoders   |    | HB IntLv Decoders   |
              | PCI/CXL Root Bus 0c |    | PCI/CXL Root Bus de |
              |                     |    |                     |
              |___x_________________|    |_____________________|
                  |                |       |               |
                  |                |       |               |
       A HB 0 HDM Decoder          |       |               |
       matches this Port           |       |               |
                  |                |       |               |
       ___________|___   __________|__   __|_________   ___|_________
   (3)|  Root Port 0  | | Root Port 1 | | Root Port 2| | Root Port 3 |
      |  Appears in   | | Appears in  | | Appears in | | Appears in  |
      |  PCI topology | | PCI Topology| | PCI Topo   | | PCI Topo    |
      |  As 0c:00.0   | | as 0c:01.0  | | as de:00.0 | | as de:01.0  |
      |_______________| |_____________| |____________| |_____________|
            |                  |               |              |
            |                  |               |              |
       _____|_________   ______|______   ______|_____   ______|_______
   (4)|     x         | |             | |            | |              |
      | CXL Type3 0   | | CXL Type3 1 | | CXL type3 2| | CXL Type 3 3 |
      |               | |             | |            | |              |
      | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...)  |
      | Decoder to go | |             | |            | |              |
      | from host PA  | | PCI 0e:00.0 | | PCI df:00.0| | PCI e0:00.0  |
      | to device PA  | |             | |            | |              |
      | PCI as 0d:00.0| |             | |            | |              |
      |_______________| |_____________| |____________| |______________|

Notes:

(1) **3 CXL Fixed Memory Windows (CFMW)** corresponding to different
    ranges of the system physical address map.  Each CFMW has a
    particular interleave setup across the CXL Host Bridges (HB).
    CFMW0 provides uninterleaved access to HB0, CFMW2 provides
    uninterleaved access to HB1. CFMW1 provides interleaved memory access
    across HB0 and HB1.

(2) **Two CXL Host Bridges**. Each of these has 2 CXL Root Ports and
    programmable HDM decoders to route memory accesses either to
    a single port or interleave them across multiple ports.
    A complex configuration here might be to use the following HDM
    decoders in HB0. HDM0 routes CFMW0 requests to RP0 and hence
    part of CXL Type3 0. HDM1 routes CFMW0 requests from a
    different region of the CFMW0 PA range to RP1 and hence part
    of CXL Type 3 1.  HDM2 routes yet another PA range from within
    CFMW0 to be interleaved across RP0 and RP1, providing 2 way
    interleave of part of the memory provided by CXL Type3 0 and
    CXL Type 3 1. HDM3 routes those interleaved accesses from
    CFMW1 that target HB0 to RP 0 and another part of the memory of
    CXL Type 3 0 (as part of a 2 way interleave at the system level
    across for example CXL Type3 0 and CXL Type3 2).
    HDM4 is used to enable system wide 4 way interleave across all
    the present CXL type3 devices, by interleaving those (interleaved)
    requests that HB0 receives from CFMW1 across RP 0 and
    RP 1 and hence to yet more regions of the memory of the
    attached Type3 devices.  Note this is a representative subset
    of the full range of possible HDM decoder configurations in this
    topology.

(3) **Four CXL Root Ports.** In this case the CXL Type 3 devices are
    directly attached to these ports.

(4) **Four CXL Type3 memory expansion devices.**  These will each have
    HDM decoders, but in this case rather than performing interleave
    they will take the Host Physical Addresses of accesses and map
    them to their own local Device Physical Address Space (DPA).

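To make the system wide 4 way interleave case more concrete, the sketch
below walks an offset within CFMW1 through the three decode levels of the
topology above. It is only one possible configuration and rests on several
assumptions that are not stated in this document: modulo interleave at
every level, an 8k granularity at the CFMW, a host bridge decoder
granularity of twice that value, and endpoint decoders programmed for 4
ways. All names are hypothetical.

.. code-block:: python

  GRANULE = 8 * 1024  # assumed CFMW1 interleave granularity


  def route(offset):
      """Map an offset into CFMW1 to (host bridge, root port, DPA)."""
      hb = (offset // GRANULE) % 2              # CFMW1: 2 way over HB0/HB1
      rp_in_hb = (offset // (2 * GRANULE)) % 2  # HB decoder: 2 way over its RPs
      root_port = 2 * hb + rp_in_hb             # RP0/RP1 on HB0, RP2/RP3 on HB1
      # Endpoint decoder: drop the bits used for the 4 way routing above.
      dpa = (offset // (4 * GRANULE)) * GRANULE + offset % GRANULE
      return hb, root_port, dpa


  for chunk in range(8):
      hb, root_port, dpa = route(chunk * GRANULE)
      print(f"chunk {chunk}: HB{hb} RP{root_port} DPA {dpa:#x}")

Under these assumptions consecutive 8k chunks land on RP0, RP2, RP1 and RP3
in turn, and the fifth chunk returns to RP0 at a DPA one granule higher.
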
Example topology involving a switch::

  |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->|
  |    __________   __________________________________   __________    |
  |   |          | |                                  | |          |   |
  |   | CFMW 0   | |  CXL Fixed Memory Window 1       | | CFMW 2   |   |
  |   | HB0 only | |  Configured to interleave        | | HB1 only |   |
  |   |          | |  memory accesses across HB0/HB1  | |          |   |
  |   |____x_____| |__________________________________| |__________|   |
           |             |                     |             |
           |             |                     |             |
           |             |                     |             |
  Interleave Decoder     |                     |             |
   Matches this HB       |                     |             |
           \_____________|                     |_____________/
               __________|__________      _____|_______________
              |                     |    |                     |
              | CXL HB 0            |    | CXL HB 1            |
              | HB IntLv Decoders   |    | HB IntLv Decoders   |
              | PCI/CXL Root Bus 0c |    | PCI/CXL Root Bus de |
              |                     |    |                     |
              |___x_________________|    |_____________________|
                  |              |          |               |
                  |
       A HB 0 HDM Decoder
       matches this Port
       ___________|___
      |  Root Port 0  |
      |  Appears in   |
      |  PCI topology |
      |  As 0c:00.0   |
      |___________x___|
                  |
                  |
                  \_____________________
                                        |
                                        |
            ---------------------------------------------------
           |    Switch 0  USP as PCI 0d:00.0                   |
           |    USP has HDM decoder which directs traffic to   |
           |    appropriate downstream port                    |
           |    Switch BUS appears as 0e                       |
           |x__________________________________________________|
            |                  |               |              |
            |                  |               |              |
       _____|_________   ______|______   ______|_____   ______|_______
   (4)|     x         | |             | |            | |              |
      | CXL Type3 0   | | CXL Type3 1 | | CXL type3 2| | CXL Type 3 3 |
      |               | |             | |            | |              |
      | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...)  |
      | Decoder to go | |             | |            | |              |
      | from host PA  | | PCI 10:00.0 | | PCI 11:00.0| | PCI 12:00.0  |
      | to device PA  | |             | |            | |              |
      | PCI as 0f:00.0| |             | |            | |              |
      |_______________| |_____________| |____________| |______________|

Example command lines
---------------------
A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::

  qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
  ...
  -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
  -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

A very simple setup with just one directly attached CXL Type 3 Volatile Memory device::

  qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
  ...
  -object memory-backend-ram,id=vmem0,share=on,size=256M \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
  -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

The same volatile setup may optionally include an LSA region::

  qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
  ...
  -object memory-backend-ram,id=vmem0,share=on,size=256M \
  -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
  -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,lsa=cxl-lsa0,id=cxl-vmem0 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

A setup suitable for 4 way interleave. Only one fixed window is provided, to enable 2 way
interleave across 2 CXL host bridges.  Each host bridge has 2 CXL Root Ports, each with
a CXL Type3 device directly attached (no switches)::

  qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
  ...
  -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
  -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
  -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
  -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest4.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa4.raw,size=256M \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
  -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
  -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \
  -device cxl-type3,bus=root_port14,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
  -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \
  -device cxl-type3,bus=root_port15,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
  -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \
  -device cxl-type3,bus=root_port16,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k

An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::

  qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
  ...
  -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
  -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M \
  -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
  -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
  -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
  -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
  -device cxl-upstream,bus=root_port0,id=us0 \
  -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
  -device cxl-type3,bus=swport0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
  -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
  -device cxl-type3,bus=swport1,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1 \
  -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
  -device cxl-type3,bus=swport2,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2 \
  -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
  -device cxl-type3,bus=swport3,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3 \
  -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k

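The switch portion of the command line above is repetitive, so when
scripting test setups it may be convenient to generate it. The following is
a minimal sketch of one way to do that; the helper name and its parameters
are hypothetical, and the option strings simply follow the pattern of the
switch example above. The memory backend objects (``cxl-mem0``,
``cxl-lsa0`` and so on) are assumed to be defined elsewhere on the command
line.

.. code-block:: python

  def switch_args(num_devices, root_port="root_port0", first_slot=4):
      """Build -device arguments for one switch with Type 3 devices below it."""
      args = ["-device", f"cxl-upstream,bus={root_port},id=us0"]
      for i in range(num_devices):
          args += [
              "-device",
              f"cxl-downstream,port={i},bus=us0,id=swport{i},"
              f"chassis=0,slot={first_slot + i}",
              "-device",
              f"cxl-type3,bus=swport{i},persistent-memdev=cxl-mem{i},"
              f"lsa=cxl-lsa{i},id=cxl-pmem{i}",
          ]
      return args


  # Print the arguments one option per line, ready to paste into a script.
  generated = switch_args(4)
  for flag, value in zip(generated[0::2], generated[1::2]):
      print(flag, value, "\\")
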
Deprecations
------------

The Type 3 device [memdev] attribute has been deprecated in favor of the
[persistent-memdev] attribute. [memdev] will default to a persistent memory
device for backward compatibility and cannot be used in combination
with [persistent-memdev].

Kernel Configuration Options
----------------------------

In Linux 5.18 the following options are necessary to make use of
OS management of CXL memory devices as described here.

* CONFIG_CXL_BUS
* CONFIG_CXL_PCI
* CONFIG_CXL_ACPI
* CONFIG_CXL_PMEM
* CONFIG_CXL_MEM
* CONFIG_CXL_PORT
* CONFIG_CXL_REGION

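A quick way to check a guest kernel against this list is to scan its
configuration text. The following is a minimal sketch; the function name
and the sample configuration are hypothetical, and it treats both ``y`` and
``m`` as enabled.

.. code-block:: python

  REQUIRED = (
      "CONFIG_CXL_BUS",
      "CONFIG_CXL_PCI",
      "CONFIG_CXL_ACPI",
      "CONFIG_CXL_PMEM",
      "CONFIG_CXL_MEM",
      "CONFIG_CXL_PORT",
      "CONFIG_CXL_REGION",
  )


  def missing_cxl_options(config_text):
      """Return the required options not set to y or m in config_text."""
      enabled = set()
      for line in config_text.splitlines():
          line = line.strip()
          if line.startswith("#") or "=" not in line:
              continue  # comments such as "# CONFIG_FOO is not set"
          name, value = line.split("=", 1)
          if value in ("y", "m"):
              enabled.add(name)
      return [opt for opt in REQUIRED if opt not in enabled]


  # Made-up sample input, for illustration only.
  sample = "CONFIG_CXL_BUS=y\nCONFIG_CXL_PCI=m\n# CONFIG_CXL_ACPI is not set\n"
  print(missing_cxl_options(sample))

The configuration text might be read from a file such as
``/boot/config-<release>`` where the distribution installs one; that
location is an assumption about the guest image rather than something this
document specifies.
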
References
----------

 - Consortium website for specifications etc:
   http://www.computeexpresslink.org
 - Compute Express Link Revision 2 specification, October 2020
 - CEDT CFMWS & QTG _DSM ECN May 2021
416