1. 16 Jun 2018, 3 commits
    • tcg: remove tb_lock · 0ac20318
      Committed by Emilio G. Cota
      Use mmap_lock in user-mode to protect TCG state and the page descriptors.
      In !user-mode, each vCPU has its own TCG state, so no locks are needed there;
      per-page locks are used to protect the page descriptors instead.
      
      Per-TB locks are used in both modes to protect TB jumps.
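
      To make the scheme concrete, here is a minimal standalone sketch of the
      locking split, using plain pthread mutexes; the struct layouts and the
      helper are illustrative only and do not reproduce QEMU's actual code:

        #include <pthread.h>

        typedef struct TranslationBlock TranslationBlock;

        typedef struct PageDesc {
            pthread_mutex_t lock;        /* per-page lock: guards this descriptor */
            TranslationBlock *first_tb;  /* TBs that intersect this guest page */
        } PageDesc;

        struct TranslationBlock {
            pthread_mutex_t jmp_lock;    /* per-TB lock: guards the TB's jump lists */
            TranslationBlock *page_next; /* chaining within a PageDesc */
        };

        /* Link a new TB into a page descriptor. In user-mode, callers would also
         * hold mmap_lock; in !user-mode the per-page lock alone protects the
         * descriptor. */
        static void page_add_tb(PageDesc *p, TranslationBlock *tb)
        {
            pthread_mutex_lock(&p->lock);
            tb->page_next = p->first_tb;
            p->first_tb = tb;
            pthread_mutex_unlock(&p->lock);
        }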
      
      Some notes:
      
      - tb_lock is removed from notdirty_mem_write by passing a
        locked page_collection to tb_invalidate_phys_page_fast.
      
      - tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
        so there is no need to further serialize access to them.
      
      - do_tb_flush is run in a safe async context, meaning no other
        vCPU threads are running. Therefore acquiring mmap_lock there
        is just to please tools such as thread sanitizer.
      
      - Not visible in the diff, but tb_invalidate_phys_page already
        has an assert_memory_lock.
      
      - cpu_io_recompile is !user-only, so no mmap_lock there.
      
      - Added mmap_unlock()'s before all siglongjmp's that could
        be called in user-mode while mmap_lock is held.
        + Added an assert for !have_mmap_lock() after returning from
          the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.
      
      Performance numbers before/after:
      
      Host: AMD Opteron(tm) Processor 6376
      
                       ubuntu 17.04 ppc64 bootup+shutdown time
      
        700 +-+--+----+------+------------+-----------+------------*--+-+
            |    +    +      +            +           +           *B    |
            |         before ***B***                            ** *    |
            |tb lock removal ###D###                         ***        |
        600 +-+                                           ***         +-+
            |                                           **         #    |
            |                                        *B*          #D    |
            |                                     *** *         ##      |
        500 +-+                                ***           ###      +-+
            |                             * ***           ###           |
            |                            *B*          # ##              |
            |                          ** *          #D#                |
        400 +-+                      **            ##                 +-+
            |                      **           ###                     |
            |                    **           ##                        |
            |                  **         # ##                          |
        300 +-+  *           B*          #D#                          +-+
            |    B         ***        ###                               |
            |    *       **       ####                                  |
            |     *   ***      ###                                      |
        200 +-+   B  *B     #D#                                       +-+
            |     #B* *   ## #                                          |
            |     #*    ##                                              |
            |    + D##D#     +            +           +            +    |
        100 +-+--+----+------+------------+-----------+------------+--+-+
                 1    8      16      Guest CPUs       48           64
        png: https://imgur.com/HwmBHXe
      
                    debian jessie aarch64 bootup+shutdown time
      
        90 +-+--+-----+-----+------------+------------+------------+--+-+
           |    +     +     +            +            +            +    |
           |         before ***B***                                B    |
        80 +tb lock removal ###D###                              **D  +-+
           |                                                   **###    |
           |                                                 **##       |
        70 +-+                                             ** #       +-+
           |                                             ** ##          |
           |                                           **  #            |
        60 +-+                                       *B  ##           +-+
           |                                       **  ##               |
           |                                    ***  #D                 |
        50 +-+                               ***   ##                 +-+
           |                             * **   ###                     |
           |                           **B*  ###                        |
        40 +-+                     ****  # ##                         +-+
           |                   ****     #D#                             |
           |             ***B**      ###                                |
        30 +-+    B***B**        ####                                 +-+
           |    B *   *     # ###                                       |
           |     B       ###D#                                          |
        20 +-+   D  ##D##                                             +-+
           |      D#                                                    |
           |    +     +     +            +            +            +    |
        10 +-+--+-----+-----+------------+------------+------------+--+-+
                1     8     16      Guest CPUs        48           64
        png: https://imgur.com/iGpGFtv
      
      The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
      lock contention significantly hurts scalability.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx · 128ed227
      Committed by Emilio G. Cota
      Thereby making it per-TCGContext. Once we remove tb_lock, this will
      avoid an atomic increment every time a TB is invalidated.
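
      As an illustration of the pattern (a sketch under assumed names, not the
      actual QEMU structures): each context owns its counter and only the
      reporting path sums them, so the hot path needs no atomic
      read-modify-write:

        #include <stddef.h>

        #define MAX_CTXS 64

        struct tcg_ctx_stats {
            size_t tb_phys_invalidate_count;  /* written only by the owning thread */
        };

        static struct tcg_ctx_stats *ctxs[MAX_CTXS];
        static unsigned n_ctxs;

        /* Hot path: plain increment; only this context's thread writes the field. */
        static void tb_invalidated(struct tcg_ctx_stats *s)
        {
            s->tb_phys_invalidate_count++;
        }

        /* Cold path (statistics dump): sum over all contexts. */
        static size_t tb_phys_invalidate_count_total(void)
        {
            size_t total = 0;
            for (unsigned i = 0; i < n_ctxs; i++) {
                total += ctxs[i]->tb_phys_invalidate_count;
            }
            return total;
        }
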
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg: track TBs with per-region BST's · be2cdc5e
      Committed by Emilio G. Cota
      This paves the way for enabling scalable parallel generation of TCG code.
      
      Instead of tracking TBs with a single binary search tree (BST), use a
      BST for each TCG region, protecting it with a lock. This is as scalable
      as it gets, since each TCG thread operates on a separate region.
      
      The core of this change is the introduction of struct tcg_region_tree,
      which contains a pointer to a GTree and an associated lock to serialize
      accesses to it. We then allocate an array of tcg_region_tree's, adding
      the appropriate padding to avoid false sharing based on
      qemu_dcache_linesize.
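
      A minimal sketch of that layout, assuming glib's GTree and a
      posix_memalign-based allocation (the real code uses QEMU's own wrappers,
      and also initializes each element's lock and tree in a loop):

        #include <glib.h>
        #include <pthread.h>
        #include <stdlib.h>

        struct tcg_region_tree {
            pthread_mutex_t lock;   /* serializes access to the tree below */
            GTree *tree;            /* TBs whose code lives in this region */
        };

        static void *region_trees;  /* array base; elements are tree_size apart */
        static size_t tree_size;    /* sizeof(struct tcg_region_tree), padded */

        static void tcg_region_trees_init(size_t n_regions, size_t dcache_linesize)
        {
            /* Pad each element to a cache line so neighbouring locks never
             * share one (avoids false sharing). */
            tree_size = ((sizeof(struct tcg_region_tree) + dcache_linesize - 1) /
                         dcache_linesize) * dcache_linesize;
            if (posix_memalign(&region_trees, dcache_linesize,
                               tree_size * n_regions)) {
                abort();
            }
        }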
      
      Given a tc_ptr, we first find the corresponding region_tree. The first
      and last regions are special-cased, since they might be of size !=
      region.size; otherwise we just divide the offset by region.stride. I was
      worried about this division (several dozen cycles of latency), but
      profiling shows that this is not a fast path.
      Note that region.stride is not required to be a power of two; it
      is only required to be a multiple of the host's page size.
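
      The lookup itself could then look roughly like this (continuing the
      sketch above; the region bookkeeping fields are assumptions for
      illustration):

        #include <stdint.h>

        static struct {
            uintptr_t start_aligned;  /* aligned start used for region indexing */
            size_t stride;            /* a multiple of the host page size */
            size_t n;                 /* number of regions */
        } region;

        static struct tcg_region_tree *tc_ptr_to_region_tree(const void *cp)
        {
            uintptr_t p = (uintptr_t)cp;
            size_t region_idx;

            if (p < region.start_aligned) {
                region_idx = 0;                          /* first region may be smaller */
            } else {
                uintptr_t offset = p - region.start_aligned;

                if (offset > region.stride * (region.n - 1)) {
                    region_idx = region.n - 1;           /* so may the last one */
                } else {
                    region_idx = offset / region.stride; /* the division; not hot */
                }
            }
            return (struct tcg_region_tree *)((char *)region_trees +
                                              region_idx * tree_size);
        }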
      
      Note that with this design we can also take consistent snapshots across
      all region trees at once; for instance, tcg_tb_foreach
      acquires/releases all region_tree locks before/after iterating over them.
      For this reason we now drop tb_lock in dump_exec_info().
      
      As an alternative I considered implementing a concurrent BST, but this
      can be tricky to get right, offers no consistent snapshots of the BST,
      and performance and scalability-wise I don't think it could ever beat
      having separate GTrees, given that our workload is insert-mostly (all
      concurrent BST designs I've seen focus, understandably, on making
      lookups fast, which comes at the expense of convoluted, non-wait-free
      insertions/removals).
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
  2. 25 Oct 2017, 2 commits
    • tcg: take tb_ctx out of TCGContext · 44ded3d0
      Committed by Emilio G. Cota
      Groundwork for supporting multiple TCG contexts.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • translate-all: use a binary search tree to track TBs in TBContext · 2ac01d6d
      Committed by Emilio G. Cota
      This is a prerequisite for supporting multiple TCG contexts, since
      we will have threads generating code in separate regions of
      code_gen_buffer.
      
      For this we need a new field (.size) in struct tb_tc to keep
      track of the size of the translated code. This field uses a size_t
      to avoid adding a hole to the struct, although really an unsigned
      int would have been enough.
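
      For reference, the struct in question is essentially the following
      (sketch): with the pointer first, a size_t keeps the struct hole-free on
      64-bit hosts, whereas an unsigned int would leave 4 bytes of padding.

        #include <stddef.h>

        struct tb_tc {
            void *ptr;    /* pointer into code_gen_buffer: the translated code */
            size_t size;  /* size of the translated code */
        };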
      
      The comparison function we use is optimized for the common case:
      insertions. Profiling shows that upon booting debian-arm, 98%
      of comparisons are between existing tb's (i.e. a->size and b->size
      are both !0), which happens during insertions (and removals, but
      those are rare). The remaining cases are lookups. From reading the glib
      sources we see that the first key is always the lookup key; however, the
      code does not assume this is always the case, since the glib docs do not
      guarantee it. Instead, we only embed this knowledge in the code as a
      branch hint for the compiler.
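
      A sketch of such a comparator, reusing struct tb_tc from the sketch above
      (illustrative only; likely() stands in for __builtin_expect and the exact
      QEMU code differs in detail):

        #define likely(x)  __builtin_expect(!!(x), 1)

        /* Compare a raw code pointer against an existing TB's [ptr, ptr + size). */
        static int ptr_cmp_tb_tc(const void *ptr, const struct tb_tc *s)
        {
            if (ptr >= (const void *)((const char *)s->ptr + s->size)) {
                return 1;
            }
            if (ptr < s->ptr) {
                return -1;
            }
            return 0;
        }

        static int tb_tc_cmp(const void *ap, const void *bp)
        {
            const struct tb_tc *a = ap;
            const struct tb_tc *b = bp;

            /* Common case: both keys are existing TBs (inserts and removals). */
            if (likely(a->size && b->size)) {
                return (a->ptr > b->ptr) - (a->ptr < b->ptr);
            }
            /* Lookup: glib usually passes the lookup key first, so hint for that. */
            if (likely(a->size == 0)) {
                return ptr_cmp_tb_tc(a->ptr, b);
            }
            /* Lookup key in the second position: flip the sign. */
            return -ptr_cmp_tb_tc(b->ptr, a);
        }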
      
      Note that tb_free does not free space in the code_gen_buffer anymore,
      since we cannot easily know whether the tb is the last one inserted
      in code_gen_buffer. The next patch in this series renames tb_free
      to tb_remove to reflect this.
      
      Performance-wise, lookups in tb_find_pc are the same as before:
      O(log n). However, insertions are O(log n) instead of O(1), which
      results in a small slowdown when booting debian-arm:
      
      Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
      	-machine type=virt -nographic -smp 1 -m 4096 \
      	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
      	-device virtio-net-device,netdev=unet \
      	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
      	-device virtio-blk-device,drive=myblock \
      	-kernel img/arm/aarch32-current-linux-kernel-only.img \
      	-append console=ttyAMA0 root=/dev/vda1 \
      	-name arm,debug-threads=on -smp 1' (10 runs):
      
      - Before:
      
             8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
                  16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
                       0      cpu-migrations            #    0.000 K/sec
                  10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
          35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
          65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
          10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
             192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )
      
             8.640869419 seconds time elapsed                                          ( +-  0.57% )
      
      - After:
             8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
                  17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
                       0      cpu-migrations            #    0.000 K/sec
                  18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
          35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
          65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
          10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
             195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )
      
             8.828660235 seconds time elapsed                                          ( +-  0.38% )
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
  3. 20 Jun 2017, 1 commit
    • tcg: allocate TB structs before the corresponding translated code · 6e3b2bfd
      Committed by Emilio G. Cota
      Allocating an arbitrarily-sized array of tbs results in either
      (a) a lot of memory wasted or (b) unnecessary flushes of the code
      cache when we run out of TB structs in the array.
      
      An obvious solution would be to just malloc a TB struct when needed,
      and keep the TB array as an array of pointers (recall that tb_find_pc()
      needs the TB array to run in O(log n)).
      
      Perhaps a better solution, which is implemented in this patch, is to
      allocate TB's right before the translated code they describe. This
      results in some memory waste due to padding to have code and TBs in
      separate cache lines--for instance, I measured 4.7% of padding in the
      used portion of code_gen_buffer when booting aarch64 Linux on a
      host with 64-byte cache lines. However, it can allow for optimizations
      in some host architectures, since TCG backends could safely assume that
      the TB and the corresponding translated code are very close to each
      other in memory. See this message by rth for a detailed explanation:
      
        https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
        Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
        Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>
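
      A simplified sketch of that allocation (field and helper names are
      assumptions, and error handling is reduced to returning NULL so the
      caller can flush the code cache):

        #include <stdint.h>
        #include <stddef.h>

        /* a must be a power of two (cache line sizes are) */
        #define ROUND_UP(x, a)  (((x) + (a) - 1) & ~((uintptr_t)(a) - 1))

        typedef struct TranslationBlock {  /* abbreviated: the real TB has many fields */
            uint64_t pc;
            void *tc_ptr;                  /* points just past this struct */
        } TranslationBlock;

        struct tcg_ctx_sketch {
            void *code_gen_ptr;            /* bump pointer into code_gen_buffer */
            void *code_gen_highwater;      /* keep room for the translator to finish */
            uintptr_t icache_linesize;
        };

        static TranslationBlock *tb_alloc_sketch(struct tcg_ctx_sketch *s)
        {
            uintptr_t align = s->icache_linesize;
            TranslationBlock *tb;
            void *code;

            /* Carve the TB struct out of code_gen_buffer itself... */
            tb = (TranslationBlock *)ROUND_UP((uintptr_t)s->code_gen_ptr, align);
            /* ...and start the translated code on the next cache-line boundary;
             * the gap between the two is the padding cost mentioned above. */
            code = (void *)ROUND_UP((uintptr_t)(tb + 1), align);
            if (code > s->code_gen_highwater) {
                return NULL;               /* buffer exhausted: caller must flush */
            }
            tb->tc_ptr = code;
            s->code_gen_ptr = code;        /* code emission continues from here */
            return tb;
        }
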
      Suggested-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1496790745-314-3-git-send-email-cota@braap.org>
      [rth: Simplify the arithmetic in tcg_tb_alloc]
      Signed-off-by: Richard Henderson <rth@twiddle.net>
  4. 27 Sep 2016, 1 commit
  5. 12 Jul 2016, 1 commit
  6. 12 Jun 2016, 1 commit
    • tb hash: track translated blocks with qht · 909eaac9
      Committed by Emilio G. Cota
      Having a fixed-size hash table for keeping track of all translation blocks
      is suboptimal: some workloads are just too big or too small to get maximum
      performance from the hash table. The MRU promotion policy helps improve
      performance when the hash table is a little undersized, but it cannot
      make up for severely undersized hash tables.
      
      Furthermore, frequent MRU promotions result in writes that are a scalability
      bottleneck. For scalability, lookups should only perform reads, not writes.
      This is not a big deal for now, but it will become one once MTTCG matures.
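
      To illustrate the point (a standalone sketch, not qht's actual
      implementation): an MRU-promoting lookup stores to the bucket on every
      hit, so concurrent readers keep pulling the cache line away from each
      other, whereas a read-only lookup scales with the number of reader
      threads.

        #include <stddef.h>
        #include <stdint.h>

        #define BUCKET_ENTRIES 4

        struct bucket {
            uint32_t hashes[BUCKET_ENTRIES];
            void *ptrs[BUCKET_ENTRIES];
        };

        /* Read-only lookup: no stores, so readers never invalidate each other. */
        static void *lookup_readonly(struct bucket *b, uint32_t hash)
        {
            for (size_t i = 0; i < BUCKET_ENTRIES; i++) {
                if (b->hashes[i] == hash) {
                    return b->ptrs[i];
                }
            }
            return NULL;
        }

        /* MRU lookup: every hit rewrites the bucket, turning readers into writers. */
        static void *lookup_mru(struct bucket *b, uint32_t hash)
        {
            for (size_t i = 0; i < BUCKET_ENTRIES; i++) {
                if (b->hashes[i] == hash) {
                    void *p = b->ptrs[i];
                    b->hashes[i] = b->hashes[0];
                    b->ptrs[i] = b->ptrs[0];
                    b->hashes[0] = hash;   /* promote the hit to the front */
                    b->ptrs[0] = p;
                    return p;
                }
            }
            return NULL;
        }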
      
      This patch fixes these issues by using qht as the implementation of
      the TB hash table. This solution is superior to the other alternatives
      considered, namely:
      
      - master: implementation in QEMU before this patchset
      - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
      - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
                    MRU is implemented here by adding an intermediate struct
                    that contains the u32 hash and a pointer to the TB; this
                    allows us, on an MRU promotion, to copy said struct (that is not
                    at the head), and put this new copy at the head. After a grace
                    period, the original non-head struct can be eliminated, and
                    after another grace period, freed.
      - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                         no MRU for lookups; MRU for inserts.
      The solution implemented in this patch is the following:
      - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                       no MRU for lookups; MRU for inserts.
      
      The plots below compare the considered solutions. The Y axis shows the
      boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
      sweeps the number of buckets (or initial number of buckets for qht-autoresize).
      The plots in PNG format (and with errorbars) can be seen here:
        http://imgur.com/a/Awgnq
      
      Each test runs 5 times, and the entire QEMU process is pinned to a
      single core for repeatability of results.
      
                                  Host: Intel Xeon E5-2690
      
        28 ++------------+-------------+-------------+-------------+------------++
           A*****        +             +             +             master **A*** +
        27 ++    *                                                 xxhash ##B###++
           |      A******A******                               xxhash-rcu $$C$$$ |
        26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
           D%%$$                              A******A******A*qht-dyn-mru A*E****A
        25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
           B#####%                                                               |
        24 ++    #C$$$$$                                                        ++
           |      B###  $                                                        |
           |          ## C$$$$$$                                                 |
        23 ++           #       C$$$$$$                                         ++
           |             B######       C$$$$$$                                %%%D
        22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
           |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
        21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
           +             E@@@   F&&&   +      E@     +      F&&&   +             +
        20 ++------------+-------------+-------------+-------------+------------++
           14            16            18            20            22            24
                                   log2 number of buckets
      
                                       Host: Intel i7-4790K
      
        14.5 ++------------+------------+-------------+------------+------------++
             A**           +            +             +            master **A*** +
          14 ++ **                                                 xxhash ##B###++
        13.5 ++   **                                           xxhash-rcu $$C$$$++
             |                                            qht-fixed-nomru %%D%%% |
          13 ++     A******                                   qht-dyn-mru @@E@@@++
             |             A*****A******A******             qht-dyn-nomru &&F&&& |
        12.5 C$$                               A******A******A*****A******    ***A
          12 ++ $$                                                        A***  ++
             D%%% $$                                                             |
        11.5 ++  %%                                                             ++
             B###  %C$$$$$$                                                      |
          11 ++  ## D%%%%% C$$$$$                                               ++
             |     #      %      C$$$$$$                                         |
        10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
          10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
             +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
         9.5 ++------------+------------+-------------+------------+------------++
             14            16           18            20           22            24
                                    log2 number of buckets
      
      Note that the original point before this patch series is X=15 for "master";
      the little sensitivity to the increased number of buckets is due to the
      poor hashing function in master.
      
      xxhash-rcu has significant overhead due to the constant churn of allocating
      and deallocating intermediate structs for implementing MRU. An alternative
      would be to consider failed lookups as "maybe not there", and then
      acquire the external lock (tb_lock in this case) to really confirm that
      there was indeed a failed lookup. This, however, would not be enough
      to implement dynamic resizing--this is more complex: see
      "Resizable, Scalable, Concurrent Hash Tables via Relativistic
      Programming" by Triplett, McKenney and Walpole. This solution was
      discarded due to the very coarse RCU read critical sections that we have
      in MTTCG; resizing requires waiting for readers after every pointer update,
      and resizes require many pointer updates, so this would quickly become
      prohibitive.
      
      qht-fixed-nomru shows that MRU promotion is advisable for undersized
      hash tables.
      
      However, qht-dyn-mru shows that MRU promotion is not important if the
      hash table is properly sized: there is virtually no difference in
      performance between qht-dyn-nomru and qht-dyn-mru.
      
      Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
      X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
      can achieve with optimum sizing of the hash table, while keeping the hash
      table scalable for readers.
      
      The improvement we get before and after this patch for booting debian jessie
      with arm-softmmu is:
      
      - Intel Xeon E5-2690: 10.5% less time
      - Intel i7-4790K: 5.2% less time
      
      We could get this same improvement _for this particular workload_ by
      statically increasing the size of the hash table. But this would hurt
      workloads that do not need a large hash table. The dynamic (upward)
      resizing allows us to start small and enlarge the hash table as needed.
      
      A quick note on downsizing: the table is resized back to 2**15 buckets
      on every tb_flush. This makes sense because there is no guarantee that the
      table will reach the same number of TBs later on (e.g. most bootup code is
      thrown away after boot); it is better to start small again and grow the
      hash table as more code blocks are translated. This also avoids the
      complication of having to build downsizing hysteresis logic into qht.
      Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
  7. 19 May 2016, 1 commit
  8. 05 Jun 2015, 2 commits
  9. 03 Jun 2015, 1 commit
  10. 14 Mar 2014, 1 commit
  11. 10 Dec 2013, 1 commit
    • split definitions for exec.c and translate-all.c radix trees · 03f49957
      Committed by Paolo Bonzini
      The exec.c and translate-all.c radix trees are quite different, and
      the exec.c one in particular is not limited to the CPU: it can also be
      used by devices that do DMA, and in that case the address space is not
      limited to TARGET_PHYS_ADDR_SPACE_BITS bits.
      
      We want to make exec.c's radix trees 64-bit wide.  As a first step,
      stop sharing the constants between exec.c and translate-all.c.
      exec.c gets P_L2_* constants, translate-all.c gets V_L2_*, for
      consistency with the existing V_L1_* symbols.  Though actually
      in the softmmu case translate-all.c is also indexed by physical
      addresses...
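
      Sketch of the kind of split being described (the values shown here are
      placeholders, not necessarily those in the patch):

        /* exec.c: physical-address radix tree (can later grow to 64-bit wide) */
        #define P_L2_BITS 10
        #define P_L2_SIZE (1 << P_L2_BITS)

        /* translate-all.c: virtual-address radix tree, named for consistency
         * with the existing V_L1_* symbols */
        #define V_L2_BITS 10
        #define V_L2_SIZE (1 << V_L2_BITS)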
      
      This patch has no semantic change.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  12. 16 Feb 2013, 1 commit
  13. 16 Dec 2012, 1 commit
  14. 16 Oct 2011, 1 commit
  15. 04 Feb 2011, 1 commit
  16. 06 Dec 2009, 1 commit