1. 13 Jul, 2018 (1 commit)
  2. 16 Jun, 2018 (4 commits)
    • tcg: remove tb_lock · 0ac20318
      Authored by Emilio G. Cota
      Use mmap_lock in user-mode to protect TCG state and the page descriptors.
      In !user-mode, each vCPU has its own TCG state, so no locks needed.
      Per-page locks are used to protect the page descriptors.
      
      Per-TB locks are used in both modes to protect TB jumps.
      
      Some notes:
      
      - tb_lock is removed from notdirty_mem_write by passing a
        locked page_collection to tb_invalidate_phys_page_fast.
      
      - tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
        so there is no need to further serialize access to them.
      
      - do_tb_flush is run in a safe async context, meaning no other
        vCPU threads are running. Therefore acquiring mmap_lock there
        is just to please tools such as thread sanitizer.
      
      - Not visible in the diff, but tb_invalidate_phys_page already
        has an assert_memory_lock.
      
      - cpu_io_recompile is !user-only, so no mmap_lock there.
      
      - Added mmap_unlock()'s before all siglongjmp's that could
        be called in user-mode while mmap_lock is held.
        + Added an assert for !have_mmap_lock() after returning from
          the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.
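      The unlock-before-longjmp rule in the last note can be sketched in isolation as follows. This is a hypothetical standalone illustration using pthread/setjmp, not QEMU code; all names are made up, and the boolean flag stands in for QEMU's have_mmap_lock() check:

```c
#include <pthread.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-in for mmap_lock (not real QEMU code) */
static pthread_mutex_t lock_sketch = PTHREAD_MUTEX_INITIALIZER;
static bool lock_held;
static jmp_buf unwind;

static void lock_acquire(void) {
    pthread_mutex_lock(&lock_sketch);
    lock_held = true;
}

static void lock_release(void) {
    lock_held = false;
    pthread_mutex_unlock(&lock_sketch);
}

/* A fault path mirroring the commit's rule: drop the lock
 * before any longjmp taken while it may be held. */
static void fault_path(void) {
    lock_release();
    longjmp(unwind, 1);
}

/* Returns true iff the lock is no longer held after unwinding,
 * the invariant the commit asserts after the longjmp in cpu_exec. */
static bool unwind_leaves_lock_free(void) {
    if (setjmp(unwind) == 0) {
        lock_acquire();
        fault_path();   /* does not return */
    }
    return !lock_held;
}
```

      Asserting the invariant at the unwind target, rather than auditing every jump site, catches any future fault path that forgets to unlock.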
      
      Performance numbers before/after:
      
      Host: AMD Opteron(tm) Processor 6376
      
                       ubuntu 17.04 ppc64 bootup+shutdown time
      
        700 +-+--+----+------+------------+-----------+------------*--+-+
            |    +    +      +            +           +           *B    |
            |         before ***B***                            ** *    |
            |tb lock removal ###D###                         ***        |
        600 +-+                                           ***         +-+
            |                                           **         #    |
            |                                        *B*          #D    |
            |                                     *** *         ##      |
        500 +-+                                ***           ###      +-+
            |                             * ***           ###           |
            |                            *B*          # ##              |
            |                          ** *          #D#                |
        400 +-+                      **            ##                 +-+
            |                      **           ###                     |
            |                    **           ##                        |
            |                  **         # ##                          |
        300 +-+  *           B*          #D#                          +-+
            |    B         ***        ###                               |
            |    *       **       ####                                  |
            |     *   ***      ###                                      |
        200 +-+   B  *B     #D#                                       +-+
            |     #B* *   ## #                                          |
            |     #*    ##                                              |
            |    + D##D#     +            +           +            +    |
        100 +-+--+----+------+------------+-----------+------------+--+-+
                 1    8      16      Guest CPUs       48           64
        png: https://imgur.com/HwmBHXe
      
                    debian jessie aarch64 bootup+shutdown time
      
        90 +-+--+-----+-----+------------+------------+------------+--+-+
           |    +     +     +            +            +            +    |
           |         before ***B***                                B    |
        80 +tb lock removal ###D###                              **D  +-+
           |                                                   **###    |
           |                                                 **##       |
        70 +-+                                             ** #       +-+
           |                                             ** ##          |
           |                                           **  #            |
        60 +-+                                       *B  ##           +-+
           |                                       **  ##               |
           |                                    ***  #D                 |
        50 +-+                               ***   ##                 +-+
           |                             * **   ###                     |
           |                           **B*  ###                        |
        40 +-+                     ****  # ##                         +-+
           |                   ****     #D#                             |
           |             ***B**      ###                                |
        30 +-+    B***B**        ####                                 +-+
           |    B *   *     # ###                                       |
           |     B       ###D#                                          |
        20 +-+   D  ##D##                                             +-+
           |      D#                                                    |
           |    +     +     +            +            +            +    |
        10 +-+--+-----+-----+------------+------------+------------+--+-+
                1     8     16      Guest CPUs        48           64
        png: https://imgur.com/iGpGFtv
      
      The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
      lock contention significantly hurts scalability.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • translate-all: protect TB jumps with a per-destination-TB lock · 194125e3
      Authored by Emilio G. Cota
      This applies to both user-mode and !user-mode emulation.
      
      Instead of relying on a global lock, protect the list of incoming
      jumps with tb->jmp_lock. This lock also protects tb->cflags,
      so update all tb->cflags readers outside tb->jmp_lock to use
      atomic reads via tb_cflags().
      
      In order to find the destination TB (and therefore its jmp_lock)
      from the origin TB, we introduce tb->jmp_dest[].
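      The cflags discipline described above can be sketched roughly like this. It is a hypothetical C11 sketch whose field and function names are modeled on the commit message, not taken from QEMU's source:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical TB with only the fields relevant here */
struct TBSketch {
    pthread_mutex_t jmp_lock;   /* protects incoming-jump list and cflags */
    _Atomic uint32_t cflags;
};

/* Reader outside jmp_lock: an atomic load, in the spirit of tb_cflags() */
static uint32_t tb_cflags_sketch(struct TBSketch *tb) {
    return atomic_load_explicit(&tb->cflags, memory_order_relaxed);
}

/* Writer: updates cflags only while holding the per-TB jmp_lock */
static void tb_set_cflags_sketch(struct TBSketch *tb, uint32_t val) {
    pthread_mutex_lock(&tb->jmp_lock);
    atomic_store_explicit(&tb->cflags, val, memory_order_relaxed);
    pthread_mutex_unlock(&tb->jmp_lock);
}
```

      The lock serializes writers against each other and against jump-list updates, while lock-free readers still see a consistent value through the atomic access.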
      
      I considered not using a linked list of jumps, which simplifies
      code and makes the struct smaller. However, it unnecessarily increases
      memory usage, which results in a performance decrease. See for
      instance these numbers booting+shutting down debian-arm:
                            Time (s)  Rel. err (%)  Abs. err (s)  Rel. slowdown (%)
      ------------------------------------------------------------------------------
       before                  20.88          0.74      0.154512                 0.
       after                   20.81          0.38      0.079078        -0.33524904
       GTree                   21.02          0.28      0.058856         0.67049808
       GHashTable + xxhash     21.63          1.08      0.233604          3.5919540
      
      Using a hash table or a binary tree to keep track of the jumps
      doesn't really pay off, not only due to the increased memory usage,
      but also because most TBs have only 0 or 1 jumps to them. The maximum
      number of jumps when booting debian-arm that I measured is 35, but
      as we can see in the histogram below a TB with that many incoming jumps
      is extremely rare; the average TB has 0.80 incoming jumps.
      
      n_jumps: 379208; avg jumps/tb: 0.801099
      dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • translate-all: discard TB when tb_link_page returns an existing matching TB · 95590e24
      Authored by Emilio G. Cota
      Use the recently-gained QHT feature of returning the matching TB if it
      already exists. This allows us to get rid of the lookup we perform
      right after acquiring tb_lock.
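      The insert-returns-existing pattern might look like this in isolation. A toy chained hash table stands in for QHT here; nothing below is QEMU's actual API:

```c
#include <stddef.h>
#include <stdint.h>

#define N_BUCKETS 16

struct TBSketch {
    uint64_t pc;                /* key: guest PC of the TB */
    struct TBSketch *next;
};

static struct TBSketch *buckets[N_BUCKETS];

/* Insert tb keyed by pc. If a matching TB is already present, leave
 * the table unchanged and return the existing one so the caller can
 * discard tb -- no separate lookup is needed before inserting. */
static struct TBSketch *tb_insert_sketch(struct TBSketch *tb) {
    struct TBSketch **head = &buckets[tb->pc % N_BUCKETS];
    for (struct TBSketch *p = *head; p != NULL; p = p->next) {
        if (p->pc == tb->pc) {
            return p;           /* existing match: caller discards tb */
        }
    }
    tb->next = *head;
    *head = tb;
    return NULL;                /* tb was inserted */
}
```

      Folding the lookup into the insertion makes the check-then-insert step a single operation, which is what removes the need for the extra lookup under tb_lock.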
      Suggested-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • translate-all: make l1_map lockless · 78722ed0
      Authored by Emilio G. Cota
      Groundwork for supporting parallel TCG generation.
      
      We never remove entries from the radix tree, so we can use cmpxchg
      to implement lockless insertions.
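      The cmpxchg idea can be illustrated in miniature: installing one tree level with a single compare-and-swap. This is a hypothetical C11 sketch, not the actual l1_map code, and the names are invented:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define LEVEL_SIZE 8

/* Return the level stored in *slot, installing a freshly allocated
 * one with a single compare-and-swap if the slot is still empty.
 * Because levels are never removed, a non-NULL result stays valid
 * forever, so readers need no lock. */
static void **level_get_sketch(_Atomic(void **) *slot) {
    void **level = atomic_load(slot);
    if (level == NULL) {
        void **fresh = calloc(LEVEL_SIZE, sizeof(void *));
        if (fresh == NULL) {
            return NULL;        /* allocation failure */
        }
        void **expected = NULL;
        if (atomic_compare_exchange_strong(slot, &expected, fresh)) {
            level = fresh;      /* we won the race and installed it */
        } else {
            free(fresh);        /* another thread won; use its level */
            level = expected;
        }
    }
    return level;
}
```

      The loser of a racing insertion simply frees its copy and adopts the winner's, which is exactly why never-removed entries make the cmpxchg approach safe.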
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
  3. 08 Jun, 2017 (1 commit)
  4. 24 Feb, 2017 (1 commit)