      1 Compute Express Link (CXL)
      2 ==========================
      3 From the view of a single host, CXL is an interconnect standard that
      4 targets accelerators and memory devices attached to a CXL host.
      5 This description will focus on those aspects visible either to
      6 software running on a QEMU emulated host or to the internals of
      7 functional emulation. As such, it will skip over many of the
      8 electrical and protocol elements that would be more of interest
      9 for real hardware and will dominate more general introductions to CXL.
     10 It will also completely ignore the fabric management aspects of CXL
     11 by considering only a single host and a static configuration.
     12 
     13 CXL shares many concepts and much of the infrastructure of PCI Express.
     14 CXL Host Bridges have CXL Root Ports, which may be directly attached
     15 to CXL or PCI End Points. Alternatively there may be CXL Switches
     16 with CXL and PCI Endpoints attached below them.  In many cases additional
     17 control and capabilities are exposed via PCI Express interfaces.
     18 This sharing of interfaces and hence emulation code is reflected
     19 in how the devices are emulated in QEMU. In most cases the various
     20 CXL elements are built upon equivalent PCIe devices.
     21 
     22 CXL devices support the following interfaces:
     23 
     24 * Most conventional PCIe interfaces
     25 
     26   - Configuration space access
     27   - BAR mapped memory accesses used for registers and mailboxes.
     28   - MSI/MSI-X
     29   - AER
     30   - DOE mailboxes
     31   - IDE
     32   - Many other PCI Express defined interfaces.
     33 
     34 * Memory operations
     35 
     36   - Equivalent of accessing DRAM / NVDIMMs. Any access / feature
     37     supported by the host for normal memory should also work for
     38     CXL attached memory devices.
     39 
     40 * Cache operations. These are mostly irrelevant to QEMU emulation as
     41   QEMU is not emulating a coherency protocol. Any emulation related
     42   to these will be device specific and is out of the scope of this
     43   document.
     44 
     45 CXL 2.0 Device Types
     46 --------------------
     47 CXL 2.0 End Points are often categorized into three types.
     48 
     49 **Type 1:** These support coherent caching of host memory.  An example
     50 might be a crypto accelerator.  May also have device private memory accessible
     51 via means such as PCI memory reads and writes to BARs.
     52 
     53 **Type 2:** These support coherent caching of host memory and host
     54 managed device memory (HDM) for which the coherency protocol is managed
     55 by the host. This is a complex topic, so for more information on CXL
     56 coherency see the CXL 2.0 specification.
     57 
     58 **Type 3 Memory devices:**  These devices act as a means of attaching
     59 additional memory (HDM) to a CXL host including both volatile and
     60 persistent memory. The CXL topology may support interleaving across a
     61 number of Type 3 memory devices using HDM Decoders in the host, host
     62 bridge, switch upstream port and endpoints.
     63 
     64 Scope of CXL emulation in QEMU
     65 ------------------------------
     66 The focus of CXL emulation is CXL revision 2.0 and later. Earlier CXL
     67 revisions defined a smaller set of features, leaving much of the control
     68 interface as implementation defined or device specific. That makes generic
     69 emulation challenging: host specific firmware is responsible
     70 for setup, and the Endpoints are presented to operating systems
     71 as Root Complex Integrated End Points. CXL rev 2.0 looks a lot
     72 more like PCI Express, with fully specified discoverability
     73 of the CXL topology.
     74 
     75 CXL System components
     76 ----------------------
     77 A CXL system is made up of a Host with a number of 'standard components'
     78 the control and capabilities of which are discoverable by system software
     79 using means described in the CXL 2.0 specification.
     80 
     81 CXL Fixed Memory Windows (CFMW)
     82 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     83 A CFMW consists of a particular range of Host Physical Address space
     84 which is routed to particular CXL Host Bridges.  At the time of generic
     85 software initialization it will have a particular interleaving
     86 configuration and associated Quality of Service Throttling Group (QTG).
     87 This information is available to system software when making
     88 decisions about how to configure interleave across available CXL
     89 memory devices.  It is provided as CFMW Structures (CFMWS) in
     90 the CXL Early Discovery Table, an ACPI table.
     91 
     92 Note: QTG 0 is the only one currently supported in QEMU.
     93 
     94 CXL Host Bridge (CXL HB)
     95 ~~~~~~~~~~~~~~~~~~~~~~~~
     96 A CXL host bridge is similar to the PCIe equivalent, but with a
     97 specification defined register interface called CXL Host Bridge
     98 Component Registers (CHBCR). The location of this CHBCR MMIO
     99 space is described to system software via a CXL Host Bridge
    100 Structure (CHBS) in the CEDT ACPI table.  The actual interfaces
    101 are identical to those that other parts of the CXL hierarchy
    102 expose as CXL Component Registers in PCI BARs.
    103 
    104 Interfaces provided include:
    105 
    106 * Configuration of HDM Decoders to route CXL Memory accesses with
    107   a particular Host Physical Address range to the target port
    108   below which the CXL device servicing that address lies.  This
    109   may be a mapping to a single Root Port (RP) or across a set of
    110   target RPs.
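
The routing step described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not QEMU's
implementation: the class name ``RoutingDecoder``, its fields and the example
addresses are all hypothetical, and the decoder base is assumed to be aligned
to ``granularity * number of targets`` so that selecting by address bits is
well defined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoutingDecoder:
    """Hypothetical model of one routing HDM decoder (illustrative names)."""
    base: int            # first Host Physical Address covered
    size: int            # length of the covered range in bytes
    granularity: int     # interleave granularity in bytes
    targets: List[str]   # target ports in interleave order

    def route(self, hpa: int) -> Optional[str]:
        """Return the target port for hpa, or None if the range does not match."""
        if not (self.base <= hpa < self.base + self.size):
            return None
        # Consecutive granules rotate across the target list.
        return self.targets[(hpa // self.granularity) % len(self.targets)]


# Assumed example values: a 256 MiB range interleaved across two root ports.
hdm = RoutingDecoder(base=0x1_0000_0000, size=0x1000_0000,
                     granularity=256, targets=["RP0", "RP1"])
print(hdm.route(0x1_0000_0000))  # first granule
print(hdm.route(0x1_0000_0100))  # next granule goes to the other port
```

A decoder with a single entry in ``targets`` models the uninterleaved case,
where every matching access goes to one Root Port.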
    111 
    112 CXL Root Ports (CXL RP)
    113 ~~~~~~~~~~~~~~~~~~~~~~~
    114 A CXL Root Port serves the same purpose as a PCIe Root Port.
    115 There are a number of CXL specific Designated Vendor Specific
    116 Extended Capabilities (DVSEC) in PCIe Configuration Space
    117 and associated component register access via PCI BARs.
    118 
    119 CXL Switch
    120 ~~~~~~~~~~
    121 Here we consider a simple CXL switch with only a single
    122 virtual hierarchy. Whilst more complex devices exist, their
    123 visibility to a particular host is generally the same as for
    124 a simple switch design. Hosts often have no awareness
    125 of complex rerouting and device pooling; they simply see
    126 devices being hot added or hot removed.
    127 
    128 A CXL switch has a similar architecture to those in PCIe,
    129 with a single upstream port, internal PCI bus and multiple
    130 downstream ports.
    131 
    132 Both the CXL upstream and downstream ports have CXL specific
    133 DVSECs in configuration space, and component registers in PCI
    134 BARs.  The Upstream Port has the configuration interfaces for
    135 the HDM decoders which route incoming memory accesses to the
    136 appropriate downstream port.
    137 
    138 A CXL switch is created in a similar fashion to PCI switches
    139 by creating an upstream port (cxl-upstream) and a number of
    140 downstream ports on the internal switch bus (cxl-downstream).
    141 
    142 CXL Memory Devices - Type 3
    143 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
    144 CXL type 3 devices use a PCI class code and are intended to be supported
    145 by a generic operating system driver. They have HDM decoders,
    146 though in these EP devices the decoder is responsible not for
    147 routing but for translation of the incoming Host Physical Address (HPA)
    148 into a Device Physical Address (DPA).
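
As a rough illustration of that translation, the sketch below removes the
interleave selection from an HPA so that the granules owned by one device
pack densely into its DPA space. It is a simplified model, not the code QEMU
uses: the function name and parameters are hypothetical, the decoder is
assumed to map to DPA 0, and the base is assumed to be aligned to
``granularity * ways``.

```python
def hpa_to_dpa(hpa: int, base: int, granularity: int, ways: int) -> int:
    """Translate a Host Physical Address into a Device Physical Address.

    Assumes the endpoint decoder starts at DPA 0 and that `base` is aligned
    to granularity * ways (assumptions of this sketch).
    """
    offset = hpa - base
    granule = offset // granularity   # index of the granule within the region
    within = offset % granularity     # byte offset inside that granule
    # Only every `ways`-th granule lands on this device.
    return (granule // ways) * granularity + within


base = 0x1_0000_0000  # hypothetical decoder base
print(hex(hpa_to_dpa(base + 0x210, base, granularity=256, ways=2)))
```

With ``ways=1`` the function degenerates to a plain offset from the decoder
base, which matches the uninterleaved case.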
    149 
    150 CXL Memory Interleave
    151 ---------------------
    152 To understand the interaction of different CXL hardware components which
    153 are emulated in QEMU, let us consider a memory read in a fully configured
    154 CXL topology.  Note that system software is responsible for configuration
    155 of all components with the exception of the CFMWs. System software is
    156 responsible for allocating appropriate ranges from within the CFMWs
    157 and exposing those via normal memory configurations as would be done
    158 for system RAM.
    159 
    160 Example system Topology. x marks the match in each decoder level::
    161 
    162   |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->|
    163   |    __________   __________________________________   __________    |
    164   |   |          | |                                  | |          |   |
    165   |   | CFMW 0   | |  CXL Fixed Memory Window 1       | | CFMW 2   |   |
    166   |   | HB0 only | |  Configured to interleave memory | | HB1 only |   |
    167   |   |          | |  accesses across HB0/HB1         | |          |   |
    168   |   |__________| |_____x____________________________| |__________|   |
    169            |             |                     |             |
    170            |             |                     |             |
    171            |             |                     |             |
    172            |       Interleave Decoder          |             |
    173            |       Matches this HB             |             |
    174            \_____________|                     |_____________/
    175                __________|__________      _____|_______________
    176               |                     |    |                     |
    177        (2)    | CXL HB 0            |    | CXL HB 1            |
    178               | HB IntLv Decoders   |    | HB IntLv Decoders   |
    179               | PCI/CXL Root Bus 0c |    | PCI/CXL Root Bus de |
    180               |                     |    |                     |
    181               |___x_________________|    |_____________________|
    182                   |                |       |               |
    183                   |                |       |               |
    184        A HB 0 HDM Decoder          |       |               |
    185        matches this Port           |       |               |
    186                   |                |       |               |
    187        ___________|___   __________|__   __|_________   ___|_________
    188    (3)|  Root Port 0  | | Root Port 1 | | Root Port 2| | Root Port 3 |
    189       |  Appears in   | | Appears in  | | Appears in | | Appears in  |
    190       |  PCI topology | | PCI Topology| | PCI Topo   | | PCI Topo    |
    191       |  As 0c:00.0   | | as 0c:01.0  | | as de:00.0 | | as de:01.0  |
    192       |_______________| |_____________| |____________| |_____________|
    193             |                  |               |              |
    194             |                  |               |              |
    195        _____|_________   ______|______   ______|_____   ______|_______
    196    (4)|     x         | |             | |            | |              |
    197       | CXL Type3 0   | | CXL Type3 1 | | CXL type3 2| | CXL Type 3 3 |
    198       |               | |             | |            | |              |
    199       | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...)  |
    200       | Decoder to go | |             | |            | |              |
    201       | from host PA  | | PCI 0e:00.0 | | PCI df:00.0| | PCI e0:00.0  |
    202       | to device PA  | |             | |            | |              |
    203       | PCI as 0d:00.0| |             | |            | |              |
    204       |_______________| |_____________| |____________| |______________|
    205 
    206 Notes:
    207 
    208 (1) **3 CXL Fixed Memory Windows (CFMW)** corresponding to different
    209     ranges of the system physical address map.  Each CFMW has a
    210     particular interleave setup across the CXL Host Bridges (HB).
    211     CFMW0 provides uninterleaved access to HB0, CFMW2 provides
    212     uninterleaved access to HB1. CFMW1 provides interleaved memory access
    213     across HB0 and HB1.
    214 
    215 (2) **Two CXL Host Bridges**. Each of these has 2 CXL Root Ports and
    216     programmable HDM decoders to route memory accesses either to
    217     a single port or interleave them across multiple ports.
    218     A complex configuration here might be to use the following HDM
    219     decoders in HB0. HDM0 routes CFMW0 requests to RP0 and hence
    220     part of CXL Type3 0. HDM1 routes CFMW0 requests from a
    221     different region of the CFMW0 PA range to RP1 and hence part
    222     of CXL Type 3 1.  HDM2 routes yet another PA range from within
    223     CFMW0 to be interleaved across RP0 and RP1, providing 2 way
    224     interleave of part of the memory provided by CXL Type3 0 and
    225     CXL Type 3 1. HDM3 routes those interleaved accesses from
    226     CFMW1 that target HB0 to RP 0 and another part of the memory of
    227     CXL Type 3 0 (as part of a 2 way interleave at the system level
    228     across for example CXL Type3 0 and CXL Type3 2).
    229     HDM4 is used to enable system wide 4 way interleave across all
    230     the present CXL type3 devices, by interleaving those (interleaved)
    231     requests that HB0 receives from CFMW1 across RP 0 and
    232     RP 1 and hence to yet more regions of the memory of the
    233     attached Type3 devices.  Note this is a representative subset
    234     of the full range of possible HDM decoder configurations in this
    235     topology.
    236 
    237 (3) **Four CXL Root Ports.** In this case the CXL Type 3 devices are
    238     directly attached to these ports.
    239 
    240 (4) **Four CXL Type3 memory expansion devices.**  These will each have
    241     HDM decoders, but in this case rather than performing interleave
    242     they will take the Host Physical Addresses of accesses and map
    243     them to their own local Device Physical Address Space (DPA).
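
To tie notes (1) to (4) together, here is a small worked sketch of a system
wide 4 way interleave through CFMW1. It is a model under stated assumptions
rather than a description of QEMU's code: the window base and the 256 byte
granularity are made up, each host bridge decoder is assumed to use twice the
window granularity (so that a different address bit selects the Root Port),
and Type3 device ``n`` is assumed to sit below Root Port ``n`` as in the
diagram.

```python
WINDOW_BASE = 0x1_0000_0000  # hypothetical base of CFMW1
GRANULE = 256                # hypothetical interleave granularity in bytes


def walk(hpa: int):
    """Follow one access through levels (1) to (4) of the example topology."""
    offset = hpa - WINDOW_BASE
    # (1) CFMW1 interleaves 2 ways across HB0 and HB1.
    hb = (offset // GRANULE) % 2
    # (2) The host bridge decoder interleaves 2 ways across its root ports,
    #     using twice the window granularity (assumption of this sketch).
    rp_in_hb = (offset // (GRANULE * 2)) % 2
    # (3) RP0/RP1 sit below HB0, RP2/RP3 below HB1.
    root_port = hb * 2 + rp_in_hb
    # (4) The Type3 decoder maps the HPA to a DPA for a 4 way interleave.
    dpa = (offset // (GRANULE * 4)) * GRANULE + offset % GRANULE
    return hb, root_port, dpa


for n in range(5):
    hb, rp, dpa = walk(WINDOW_BASE + n * GRANULE)
    print(f"granule {n}: HB{hb} -> RP{rp} -> Type3 {rp}, DPA {dpa:#x}")
```

In this model consecutive granules visit devices 0, 2, 1, 3 before returning
to device 0, where the fifth granule lands one granule further into its DPA
space.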
    244 
    245 Example topology involving a switch::
    246 
    247   |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->|
    248   |    __________   __________________________________   __________    |
    249   |   |          | |                                  | |          |   |
    250   |   | CFMW 0   | |  CXL Fixed Memory Window 1       | | CFMW 2   |   |
    251   |   | HB0 only | |  Configured to interleave memory | | HB1 only |   |
    252   |   |          | |  accesses across HB0/HB1         | |          |   |
    253   |   |____x_____| |__________________________________| |__________|   |
    254            |             |                     |             |
    255            |             |                     |             |
    256            |             |                     |             |
    257   Interleave Decoder     |                     |             |
    258    Matches this HB       |                     |             |
    259            \_____________|                     |_____________/
    260                __________|__________      _____|_______________
    261               |                     |    |                     |
    262               | CXL HB 0            |    | CXL HB 1            |
    263               | HB IntLv Decoders   |    | HB IntLv Decoders   |
    264               | PCI/CXL Root Bus 0c |    | PCI/CXL Root Bus de |
    265               |                     |    |                     |
    266               |___x_________________|    |_____________________|
    267                   |              |          |               |
    268                   |
    269        A HB 0 HDM Decoder
    270        matches this Port
    271        ___________|___
    272       |  Root Port 0  |
    273       |  Appears in   |
    274       |  PCI topology |
    275       |  As 0c:00.0   |
    276       |___________x___|
    277                   |
    278                   |
    279                   \_____________________
    280                                         |
    281                                         |
    282             ---------------------------------------------------
    283            |    Switch 0  USP as PCI 0d:00.0                   |
    284            |    USP has HDM decoder which directs traffic to   |
    285            |    appropriate downstream port                    |
    286            |    Switch BUS appears as 0e                       |
    287            |x__________________________________________________|
    288             |                  |               |              |
    289             |                  |               |              |
    290        _____|_________   ______|______   ______|_____   ______|_______
    291    (4)|     x         | |             | |            | |              |
    292       | CXL Type3 0   | | CXL Type3 1 | | CXL type3 2| | CXL Type 3 3 |
    293       |               | |             | |            | |              |
    294       | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...)  |
    295       | Decoder to go | |             | |            | |              |
    296       | from host PA  | | PCI 10:00.0 | | PCI 11:00.0| | PCI 12:00.0  |
    297       | to device PA  | |             | |            | |              |
    298       | PCI as 0f:00.0| |             | |            | |              |
    299       |_______________| |_____________| |____________| |______________|
    300 
    301 Example command lines
    302 ---------------------
    303 A very simple setup with just one directly attached CXL Type 3 device::
    304 
    305   qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
    306   ...
    307   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
    308   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
    309   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    310   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    311   -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
    312   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
    313 
    314 A setup suitable for 4 way interleave. Only one fixed window is provided, to enable 2 way
    315 interleave across 2 CXL host bridges.  Each host bridge has 2 CXL Root Ports, each with
    316 a CXL Type3 device directly attached (no switches)::
    317 
    318   qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
    319   ...
    320   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
    321   -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
    322   -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
    323   -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest4.raw,size=256M \
    324   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
    325   -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
    326   -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
    327   -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa4.raw,size=256M \
    328   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    329   -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
    330   -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
    331   -device cxl-type3,bus=root_port13,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
    332   -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \
    333   -device cxl-type3,bus=root_port14,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
    334   -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \
    335   -device cxl-type3,bus=root_port15,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
    336   -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \
    337   -device cxl-type3,bus=root_port16,memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
    338   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k
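
The ``bus_nr`` values in this command line are decimal, while the topology
diagrams show PCI bus numbers in hexadecimal. The short sketch below does the
conversion; the helper name ``pci_bus`` is made up for this example. Bus
numbers below the root ports are not covered, since those are assigned during
enumeration.

```python
def pci_bus(bus_nr: int) -> str:
    """Format a decimal pxb-cxl bus_nr as the two digit hex PCI bus number."""
    return f"{bus_nr:02x}"


for bus_nr in (12, 222):
    print(f"bus_nr={bus_nr} -> root ports appear on PCI bus {pci_bus(bus_nr)}")
```

So ``bus_nr=12`` corresponds to the ``0c:00.0`` and ``0c:01.0`` root ports in
the diagram, and ``bus_nr=222`` to ``de:00.0`` and ``de:01.0``.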
    339 
    340 An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
    341 
    342   qemu-system-aarch64 -M virt,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 -cpu max \
    343   ...
    344   -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
    345   -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M \
    346   -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
    347   -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
    348   -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
    349   -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M \
    350   -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
    351   -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
    352   -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
    353   -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
    354   -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
    355   -device cxl-upstream,bus=root_port0,id=us0 \
    356   -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
    357   -device cxl-type3,bus=swport0,memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0,size=256M \
    358   -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
    359   -device cxl-type3,bus=swport1,memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1,size=256M \
    360   -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
    361   -device cxl-type3,bus=swport2,memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2,size=256M \
    362   -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
    363   -device cxl-type3,bus=swport3,memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3,size=256M \
    364   -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
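
The switch example is repetitive enough that it can be convenient to generate
it. The sketch below builds the CXL related arguments for ``n`` devices below
one switch, reusing only the options shown above. The function name
``switch_args`` is hypothetical, the backing file names are arbitrary, and
the unused second root port from the example is left out.

```python
import shlex
from typing import List


def switch_args(n_devices: int, size: str = "256M") -> List[str]:
    """Build the CXL part of a QEMU command line for one switch (a sketch)."""
    args: List[str] = []
    # One memory backend and one LSA backend per Type 3 device.
    for kind, stem in (("mem", "cxltest"), ("lsa", "lsa")):
        for i in range(n_devices):
            args += ["-object",
                     f"memory-backend-file,id=cxl-{kind}{i},share=on,"
                     f"mem-path=/tmp/{stem}{i}.raw,size={size}"]
    # Host bridge, one root port and the switch upstream port.
    args += ["-device", "pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1",
             "-device", "cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0",
             "-device", "cxl-upstream,bus=root_port0,id=us0"]
    # One downstream port plus one Type 3 device per slot, as in the example.
    for i in range(n_devices):
        args += ["-device",
                 f"cxl-downstream,port={i},bus=us0,id=swport{i},chassis=0,slot={4 + i}",
                 "-device",
                 f"cxl-type3,bus=swport{i},memdev=cxl-mem{i},lsa=cxl-lsa{i},"
                 f"id=cxl-pmem{i},size={size}"]
    args += ["-M", "cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,"
                   "cxl-fmw.0.interleave-granularity=4k"]
    return args


print(shlex.join(switch_args(4)))
```

The output still needs the machine, CPU and other options elided as ``...``
in the examples above.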
    365 
    366 Kernel Configuration Options
    367 ----------------------------
    368 
    369 In Linux 5.18 the following options are necessary to make use of
    370 OS management of CXL memory devices as described here.
    371 
    372 * CONFIG_CXL_BUS
    373 * CONFIG_CXL_PCI
    374 * CONFIG_CXL_ACPI
    375 * CONFIG_CXL_PMEM
    376 * CONFIG_CXL_MEM
    377 * CONFIG_CXL_PORT
    378 * CONFIG_CXL_REGION
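
A quick way to check a guest kernel against this list is to scan its
configuration file. The sketch below is one possible approach, not an
official tool: the function name is hypothetical, and where the configuration
text comes from (for example a ``.config`` file in a build tree) is left to
the caller.

```python
from typing import List

REQUIRED = [
    "CONFIG_CXL_BUS", "CONFIG_CXL_PCI", "CONFIG_CXL_ACPI", "CONFIG_CXL_PMEM",
    "CONFIG_CXL_MEM", "CONFIG_CXL_PORT", "CONFIG_CXL_REGION",
]


def missing_options(config_text: str) -> List[str]:
    """Return the required options that are neither built in (y) nor modular (m)."""
    enabled = set()
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in REQUIRED if opt not in enabled]


# Made-up sample configuration text for illustration only.
sample = """CONFIG_CXL_BUS=y
CONFIG_CXL_PCI=m
# CONFIG_CXL_ACPI is not set
CONFIG_CXL_MEM=m
"""
print(missing_options(sample))
```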
    379 
    380 References
    381 ----------
    382 
    383  - Consortium website for specifications etc:
    384    http://www.computeexpresslink.org
    385  - Compute Express Link Revision 2 specification, October 2020
    386  - CEDT CFMWS & QTG _DSM ECN May 2021