qemu

FORK: QEMU emulator
git clone https://git.neptards.moe/neptards/qemu.git
Log | Files | Refs | Submodules | LICENSE

multi-thread-tcg.rst (14807B)


      1 ..
      2   Copyright (c) 2015-2020 Linaro Ltd.
      3 
      4   This work is licensed under the terms of the GNU GPL, version 2 or
      5   later. See the COPYING file in the top-level directory.
      6 
      7 ==================
      8 Multi-threaded TCG
      9 ==================
     10 
     11 This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
     12 system-mode emulation. user-mode emulation has always mirrored the
     13 thread structure of the translated executable although some of the
     14 changes done for MTTCG system emulation have improved the stability of
     15 linux-user emulation.
     16 
     17 The original system-mode TCG implementation was single threaded and
     18 dealt with multiple CPUs with simple round-robin scheduling. This
     19 simplified a lot of things but became increasingly limited as systems
     20 being emulated gained additional cores and per-core performance gains
     21 for host systems started to level off.
     22 
     23 vCPU Scheduling
     24 ===============
     25 
     26 We introduce a new running mode where each vCPU will run on its own
     27 user-space thread. This is enabled by default for all FE/BE
     28 combinations where the host memory model is able to accommodate the
     29 guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
     30 guest has had the required work done to support this safely
     31 (TARGET_SUPPORTS_MTTCG).
     32 
     33 System emulation will fall back to the original round robin approach
     34 if:
     35 
     36 * forced by --accel tcg,thread=single
     37 * enabling --icount mode
     38 * 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
     39 
     40 In the general case of running translated code there should be no
     41 inter-vCPU dependencies and all vCPUs should be able to run at full
     42 speed. Synchronisation will only be required while accessing internal
     43 shared data structures or when the emulated architecture requires a
     44 coherent representation of the emulated machine state.
     45 
     46 Shared Data Structures
     47 ======================
     48 
     49 Main Run Loop
     50 -------------
     51 
     52 Even when there is no code being generated there are a number of
     53 structures associated with the hot-path through the main run-loop.
     54 These are associated with looking up the next translation block to
     55 execute. These include:
     56 
     57     tb_jmp_cache (per-vCPU, cache of recent jumps)
     58     tb_ctx.htable (global hash table, phys address->tb lookup)
     59 
     60 As TB linking only occurs when blocks are in the same page this code
     61 is critical to performance as looking up the next TB to execute is the
     62 most common reason to exit the generated code.
     63 
     64 DESIGN REQUIREMENT: Make access to lookup structures safe with
     65 multiple reader/writer threads. Minimise any lock contention to do it.
     66 
     67 The hot-path avoids using locks where possible. The tb_jmp_cache is
     68 updated with atomic accesses to ensure consistent results. The fall
     69 back QHT based hash table is also designed for lockless lookups. Locks
     70 are only taken when code generation is required or TranslationBlocks
     71 have their block-to-block jumps patched.
     72 
     73 Global TCG State
     74 ----------------
     75 
     76 User-mode emulation
     77 ~~~~~~~~~~~~~~~~~~~
     78 
     79 We need to protect the entire code generation cycle including any post
     80 generation patching of the translated code. This also implies a shared
     81 translation buffer which contains code running on all cores. Any
     82 execution path that comes to the main run loop will need to hold a
     83 mutex for code generation. This also includes times when we need flush
     84 code or entries from any shared lookups/caches. Structures held on a
     85 per-vCPU basis won't need locking unless other vCPUs will need to
     86 modify them.
     87 
     88 DESIGN REQUIREMENT: Add locking around all code generation and TB
     89 patching.
     90 
     91 (Current solution)
     92 
     93 Code generation is serialised with mmap_lock().
     94 
     95 !User-mode emulation
     96 ~~~~~~~~~~~~~~~~~~~~
     97 
     98 Each vCPU has its own TCG context and associated TCG region, thereby
     99 requiring no locking during translation.
    100 
    101 Translation Blocks
    102 ------------------
    103 
    104 Currently the whole system shares a single code generation buffer
    105 which when full will force a flush of all translations and start from
    106 scratch again. Some operations also force a full flush of translations
    107 including:
    108 
    109   - debugging operations (breakpoint insertion/removal)
    110   - some CPU helper functions
    111   - linux-user spawning its first thread
    112 
    113 This is done with the async_safe_run_on_cpu() mechanism to ensure all
    114 vCPUs are quiescent when changes are being made to shared global
    115 structures.
    116 
    117 More granular translation invalidation events are typically due
    118 to a change of the state of a physical page:
    119 
    120   - code modification (self modify code, patching code)
    121   - page changes (new page mapping in linux-user mode)
    122 
    123 While setting the invalid flag in a TranslationBlock will stop it
    124 being used when looked up in the hot-path there are a number of other
    125 book-keeping structures that need to be safely cleared.
    126 
    127 Any TranslationBlocks which have been patched to jump directly to the
    128 now invalid blocks need the jump patches reversing so they will return
    129 to the C code.
    130 
    131 There are a number of look-up caches that need to be properly updated
    132 including the:
    133 
    134   - jump lookup cache
    135   - the physical-to-tb lookup hash table
    136   - the global page table
    137 
    138 The global page table (l1_map) which provides a multi-level look-up
    139 for PageDesc structures which contain pointers to the start of a
    140 linked list of all Translation Blocks in that page (see page_next).
    141 
    142 Both the jump patching and the page cache involve linked lists that
    143 the invalidated TranslationBlock needs to be removed from.
    144 
    145 DESIGN REQUIREMENT: Safely handle invalidation of TBs
    146                       - safely patch/revert direct jumps
    147                       - remove central PageDesc lookup entries
    148                       - ensure lookup caches/hashes are safely updated
    149 
    150 (Current solution)
    151 
    152 The direct jump themselves are updated atomically by the TCG
    153 tb_set_jmp_target() code. Modification to the linked lists that allow
    154 searching for linked pages are done under the protection of tb->jmp_lock,
    155 where tb is the destination block of a jump. Each origin block keeps a
    156 pointer to its destinations so that the appropriate lock can be acquired before
    157 iterating over a jump list.
    158 
    159 The global page table is a lockless radix tree; cmpxchg is used
    160 to atomically insert new elements.
    161 
    162 The lookup caches are updated atomically and the lookup hash uses QHT
    163 which is designed for concurrent safe lookup.
    164 
    165 Parallel code generation is supported. QHT is used at insertion time
    166 as the synchronization point across threads, thereby ensuring that we only
    167 keep track of a single TranslationBlock for each guest code block.
    168 
    169 Memory maps and TLBs
    170 --------------------
    171 
    172 The memory handling code is fairly critical to the speed of memory
    173 access in the emulated system. The SoftMMU code is designed so the
    174 hot-path can be handled entirely within translated code. This is
    175 handled with a per-vCPU TLB structure which once populated will allow
    176 a series of accesses to the page to occur without exiting the
    177 translated code. It is possible to set flags in the TLB address which
    178 will ensure the slow-path is taken for each access. This can be done
    179 to support:
    180 
    181   - Memory regions (dividing up access to PIO, MMIO and RAM)
    182   - Dirty page tracking (for code gen, SMC detection, migration and display)
    183   - Virtual TLB (for translating guest address->real address)
    184 
    185 When the TLB tables are updated by a vCPU thread other than their own
    186 we need to ensure it is done in a safe way so no inconsistent state is
    187 seen by the vCPU thread.
    188 
    189 Some operations require updating a number of vCPUs TLBs at the same
    190 time in a synchronised manner.
    191 
    192 DESIGN REQUIREMENTS:
    193 
    194   - TLB Flush All/Page
    195     - can be across-vCPUs
    196     - cross vCPU TLB flush may need other vCPU brought to halt
    197     - change may need to be visible to the calling vCPU immediately
    198   - TLB Flag Update
    199     - usually cross-vCPU
    200     - want change to be visible as soon as possible
    201   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    202     - This is a per-vCPU table - by definition can't race
    203     - updated by its own thread when the slow-path is forced
    204 
    205 (Current solution)
    206 
    207 We have updated cputlb.c to defer operations when a cross-vCPU
    208 operation with async_run_on_cpu() which ensures each vCPU sees a
    209 coherent state when it next runs its work (in a few instructions
    210 time).
    211 
    212 A new set up operations (tlb_flush_*_all_cpus) take an additional flag
    213 which when set will force synchronisation by setting the source vCPUs
    214 work as "safe work" and exiting the cpu run loop. This ensure by the
    215 time execution restarts all flush operations have completed.
    216 
    217 TLB flag updates are all done atomically and are also protected by the
    218 corresponding page lock.
    219 
    220 (Known limitation)
    221 
    222 Not really a limitation but the wait mechanism is overly strict for
    223 some architectures which only need flushes completed by a barrier
    224 instruction. This could be a future optimisation.
    225 
    226 Emulated hardware state
    227 -----------------------
    228 
    229 Currently thanks to KVM work any access to IO memory is automatically
    230 protected by the global iothread mutex, also known as the BQL (Big
    231 QEMU Lock). Any IO region that doesn't use global mutex is expected to
    232 do its own locking.
    233 
    234 However IO memory isn't the only way emulated hardware state can be
    235 modified. Some architectures have model specific registers that
    236 trigger hardware emulation features. Generally any translation helper
    237 that needs to update more than a single vCPUs of state should take the
    238 BQL.
    239 
    240 As the BQL, or global iothread mutex is shared across the system we
    241 push the use of the lock as far down into the TCG code as possible to
    242 minimise contention.
    243 
    244 (Current solution)
    245 
    246 MMIO access automatically serialises hardware emulation by way of the
    247 BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
    248 and also defer the reset/startup of vCPUs to the vCPU context by way
    249 of async_run_on_cpu().
    250 
    251 Updates to interrupt state are also protected by the BQL as they can
    252 often be cross vCPU.
    253 
    254 Memory Consistency
    255 ==================
    256 
    257 Between emulated guests and host systems there are a range of memory
    258 consistency models. Even emulating weakly ordered systems on strongly
    259 ordered hosts needs to ensure things like store-after-load re-ordering
    260 can be prevented when the guest wants to.
    261 
    262 Memory Barriers
    263 ---------------
    264 
    265 Barriers (sometimes known as fences) provide a mechanism for software
    266 to enforce a particular ordering of memory operations from the point
    267 of view of external observers (e.g. another processor core). They can
    268 apply to any memory operations as well as just loads or stores.
    269 
    270 The Linux kernel has an excellent `write-up
    271 <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
    272 on the various forms of memory barrier and the guarantees they can
    273 provide.
    274 
    275 Barriers are often wrapped around synchronisation primitives to
    276 provide explicit memory ordering semantics. However they can be used
    277 by themselves to provide safe lockless access by ensuring for example
    278 a change to a signal flag will only be visible once the changes to
    279 payload are.
    280 
    281 DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
    282 
    283 This would enforce a strong load/store ordering so all loads/stores
    284 complete at the memory barrier. On single-core non-SMP strongly
    285 ordered backends this could become a NOP.
    286 
    287 Aside from explicit standalone memory barrier instructions there are
    288 also implicit memory ordering semantics which comes with each guest
    289 memory access instruction. For example all x86 load/stores come with
    290 fairly strong guarantees of sequential consistency whereas Arm has
    291 special variants of load/store instructions that imply acquire/release
    292 semantics.
    293 
    294 In the case of a strongly ordered guest architecture being emulated on
    295 a weakly ordered host the scope for a heavy performance impact is
    296 quite high.
    297 
    298 DESIGN REQUIREMENTS: Be efficient with use of memory barriers
    299        - host systems with stronger implied guarantees can skip some barriers
    300        - merge consecutive barriers to the strongest one
    301 
    302 (Current solution)
    303 
    304 The system currently has a tcg_gen_mb() which will add memory barrier
    305 operations if code generation is being done in a parallel context. The
    306 tcg_optimize() function attempts to merge barriers up to their
    307 strongest form before any load/store operations. The solution was
    308 originally developed and tested for linux-user based systems. All
    309 backends have been converted to emit fences when required. So far the
    310 following front-ends have been updated to emit fences when required:
    311 
    312     - target-i386
    313     - target-arm
    314     - target-aarch64
    315     - target-alpha
    316     - target-mips
    317 
    318 Memory Control and Maintenance
    319 ------------------------------
    320 
    321 This includes a class of instructions for controlling system cache
    322 behaviour. While QEMU doesn't model cache behaviour these instructions
    323 are often seen when code modification has taken place to ensure the
    324 changes take effect.
    325 
    326 Synchronisation Primitives
    327 --------------------------
    328 
    329 There are two broad types of synchronisation primitives found in
    330 modern ISAs: atomic instructions and exclusive regions.
    331 
    332 The first type offer a simple atomic instruction which will guarantee
    333 some sort of test and conditional store will be truly atomic w.r.t.
    334 other cores sharing access to the memory. The classic example is the
    335 x86 cmpxchg instruction.
    336 
    337 The second type offer a pair of load/store instructions which offer a
    338 guarantee that a region of memory has not been touched between the
    339 load and store instructions. An example of this is Arm's ldrex/strex
    340 pair where the strex instruction will return a flag indicating a
    341 successful store only if no other CPU has accessed the memory region
    342 since the ldrex.
    343 
    344 Traditionally TCG has generated a series of operations that work
    345 because they are within the context of a single translation block so
    346 will have completed before another CPU is scheduled. However with
    347 the ability to have multiple threads running to emulate multiple CPUs
    348 we will need to explicitly expose these semantics.
    349 
    350 DESIGN REQUIREMENTS:
    351   - Support classic atomic instructions
    352   - Support load/store exclusive (or load link/store conditional) pairs
    353   - Generic enough infrastructure to support all guest architectures
    354 CURRENT OPEN QUESTIONS:
    355   - How problematic is the ABA problem in general?
    356 
    357 (Current solution)
    358 
    359 The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
    360 can be used directly or combined to emulate other instructions like
    361 Arm's ldrex/strex instructions. While they are susceptible to the ABA
    362 problem so far common guests have not implemented patterns where
    363 this may be a problem - typically presenting a locking ABI which
    364 assumes cmpxchg like semantics.
    365 
    366 The code also includes a fall-back for cases where multi-threaded TCG
    367 ops can't work (e.g. guest atomic width > host atomic width). In this
    368 case an EXCP_ATOMIC exit occurs and the instruction is emulated with
    369 an exclusive lock which ensures all emulation is serialised.
    370 
    371 While the atomic helpers look good enough for now there may be a need
    372 to look at solutions that can more closely model the guest
    373 architectures semantics.