1. 19 Oct 2018, 2 commits
  2. 27 Sep 2018, 1 commit
    • tcg/i386: fix vector operations on 32-bit hosts · 93bf9a42
      Roman Kapl authored
      The TCG backend uses LOWREGMASK to get the low 3 bits of register numbers.
      This was defined as no-op for 32-bit x86, with the assumption that we have
      eight registers anyway. This assumption is not true once we have xmm regs.
      
      Since LOWREGMASK was a no-op, xmm register indices were wrong in the
      opcodes and overflowed into other opcode fields, wreaking havoc.
      
      To trigger these problems, you can try running the "movi d8, #0x0" AArch64
      instruction on 32-bit x86. "vpxor %xmm0, %xmm0, %xmm0" should be generated,
      but instead TCG generated "vpxor %xmm0, %xmm0, %xmm2".
      
      Fixes: 770c2fc7 ("Add vector operations")
      Signed-off-by: Roman Kapl <rka@sysgo.com>
      Message-Id: <20180824131734.18557-1-rka@sysgo.com>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
  3. 06 Aug 2018, 1 commit
  4. 24 Jul 2018, 1 commit
  5. 20 Jul 2018, 1 commit
  6. 09 Jul 2018, 1 commit
  7. 16 Jun 2018, 5 commits
    • tcg: Reduce max TB opcode count · 9f754620
      Richard Henderson authored
      Also, assert that we don't overflow either of two different offsets into
      the TB: both unwind and goto_tb record a uint16_t for later use.
      
      This fixes an arm-softmmu test case utilizing NEON in which there is
      a TB generated that runs to 7800 opcodes, and compiles to 96k on an
      x86_64 host.  This overflows the 16-bit offset in which we record the
      goto_tb reset offset.  Because of that overflow, we install a jump
      destination that goes to neverland.  Boom.
      
      With this reduced op count, the same TB compiles to about 48k for
      aarch64, ppc64le, and x86_64 hosts, and neither assertion fires.
      
      Cc: qemu-stable@nongnu.org
      Reported-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg: remove tb_lock · 0ac20318
      Emilio G. Cota authored
      Use mmap_lock in user-mode to protect TCG state and the page descriptors.
      In !user-mode, each vCPU has its own TCG state, so no locks needed.
      Per-page locks are used to protect the page descriptors.
      
      Per-TB locks are used in both modes to protect TB jumps.
      
      Some notes:
      
      - tb_lock is removed from notdirty_mem_write by passing a
        locked page_collection to tb_invalidate_phys_page_fast.
      
      - tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
        so there is no need to further serialize access to them.
      
      - do_tb_flush is run in a safe async context, meaning no other
        vCPU threads are running. Therefore acquiring mmap_lock there
        is just to please tools such as thread sanitizer.
      
      - Not visible in the diff, but tb_invalidate_phys_page already
        has an assert_memory_lock.
      
      - cpu_io_recompile is !user-only, so no mmap_lock there.
      
      - Added mmap_unlock()'s before all siglongjmp's that could
        be called in user-mode while mmap_lock is held.
        + Added an assert for !have_mmap_lock() after returning from
          the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.
      
      Performance numbers before/after:
      
      Host: AMD Opteron(tm) Processor 6376
      
                       ubuntu 17.04 ppc64 bootup+shutdown time
      
        700 +-+--+----+------+------------+-----------+------------*--+-+
            |    +    +      +            +           +           *B    |
            |         before ***B***                            ** *    |
            |tb lock removal ###D###                         ***        |
        600 +-+                                           ***         +-+
            |                                           **         #    |
            |                                        *B*          #D    |
            |                                     *** *         ##      |
        500 +-+                                ***           ###      +-+
            |                             * ***           ###           |
            |                            *B*          # ##              |
            |                          ** *          #D#                |
        400 +-+                      **            ##                 +-+
            |                      **           ###                     |
            |                    **           ##                        |
            |                  **         # ##                          |
        300 +-+  *           B*          #D#                          +-+
            |    B         ***        ###                               |
            |    *       **       ####                                  |
            |     *   ***      ###                                      |
        200 +-+   B  *B     #D#                                       +-+
            |     #B* *   ## #                                          |
            |     #*    ##                                              |
            |    + D##D#     +            +           +            +    |
        100 +-+--+----+------+------------+-----------+------------+--+-+
                 1    8      16      Guest CPUs       48           64
        png: https://imgur.com/HwmBHXe
      
                    debian jessie aarch64 bootup+shutdown time
      
        90 +-+--+-----+-----+------------+------------+------------+--+-+
           |    +     +     +            +            +            +    |
           |         before ***B***                                B    |
        80 +tb lock removal ###D###                              **D  +-+
           |                                                   **###    |
           |                                                 **##       |
        70 +-+                                             ** #       +-+
           |                                             ** ##          |
           |                                           **  #            |
        60 +-+                                       *B  ##           +-+
           |                                       **  ##               |
           |                                    ***  #D                 |
        50 +-+                               ***   ##                 +-+
           |                             * **   ###                     |
           |                           **B*  ###                        |
        40 +-+                     ****  # ##                         +-+
           |                   ****     #D#                             |
           |             ***B**      ###                                |
        30 +-+    B***B**        ####                                 +-+
           |    B *   *     # ###                                       |
           |     B       ###D#                                          |
        20 +-+   D  ##D##                                             +-+
           |      D#                                                    |
           |    +     +     +            +            +            +    |
        10 +-+--+-----+-----+------------+------------+------------+--+-+
                1     8     16      Guest CPUs        48           64
        png: https://imgur.com/iGpGFtv
      
      The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
      lock contention significantly hurts scalability.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx · 128ed227
      Emilio G. Cota authored
      Thereby making it per-TCGContext. Once we remove tb_lock, this will
      avoid an atomic increment every time a TB is invalidated.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg: track TBs with per-region BST's · be2cdc5e
      Emilio G. Cota authored
      This paves the way for enabling scalable parallel generation of TCG code.
      
      Instead of tracking TBs with a single binary search tree (BST), use a
      BST for each TCG region, protecting it with a lock. This is as scalable
      as it gets, since each TCG thread operates on a separate region.
      
      The core of this change is the introduction of struct tcg_region_tree,
      which contains a pointer to a GTree and an associated lock to serialize
      accesses to it. We then allocate an array of tcg_region_tree's, adding
      the appropriate padding to avoid false sharing based on
      qemu_dcache_linesize.
      
      Given a tc_ptr, we first find the corresponding region_tree. This
      is done by special-casing the first and last regions first, since they
      might be of size != region.size; otherwise we just divide the offset
      by region.stride. I was worried about this division (several dozen
      cycles of latency), but profiling shows that this is not a fast path.
      Note that region.stride is not required to be a power of two; it
      is only required to be a multiple of the host's page size.
      
      Note that with this design we can also provide consistent snapshots
      about all region trees at once; for instance, tcg_tb_foreach
      acquires/releases all region_tree locks before/after iterating over them.
      For this reason we now drop tb_lock in dump_exec_info().
      
      As an alternative I considered implementing a concurrent BST, but this
      can be tricky to get right, offers no consistent snapshots of the BST,
      and performance and scalability-wise I don't think it could ever beat
      having separate GTrees, given that our workload is insert-mostly (all
      concurrent BST designs I've seen focus, understandably, on making
      lookups fast, which comes at the expense of convoluted, non-wait-free
      insertions/removals).
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg/i386: Use byte form of xgetbv instruction · 1019242a
      John Arbuckle authored
      The assembler in most versions of Mac OS X is pretty old and does not
      support the xgetbv instruction.  To go around this problem, the raw
      encoding of the instruction is used instead.
      Signed-off-by: John Arbuckle <programmingkidx@gmail.com>
      Message-Id: <20180604215102.11002-1-programmingkidx@gmail.com>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
  8. 02 Jun 2018, 1 commit
  9. 01 Jun 2018, 1 commit
  10. 20 May 2018, 1 commit
  11. 11 May 2018, 2 commits
  12. 09 May 2018, 2 commits
    • tcg: Limit the number of ops in a TB · abebf925
      Richard Henderson authored
      In 6001f772 we partially attempted to address the branch
      displacement overflow caused by 15fa08f8.
      
      However, gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vqtbX.c
      is a testcase that contains a TB so large as to overflow anyway.
      The limit here of 8000 ops produces a maximum output TB size of
      24112 bytes on a ppc64le host with that test case.  This is still
      much less than the maximum forward branch distance of 32764 bytes.
      
      Cc: qemu-stable@nongnu.org
      Fixes: 15fa08f8 ("tcg: Dynamically allocate TCGOps")
      Reviewed-by: Laurent Vivier <laurent@vivier.eu>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
    • tcg/i386: Fix dup_vec in non-AVX2 codepath · 7eb30ef0
      Peter Maydell authored
      The VPUNPCKLD* instructions are all "non-destructive source",
      indicated by "NDS" in the encoding string in the x86 ISA manual.
      This means that they take two source operands, one of which is
      encoded in the VEX.vvvv field. We were incorrectly treating them
      as if they were destructive-source and passing 0 as the 'v'
      argument of tcg_out_vex_modrm(). This meant we were always
      using %xmm0 as one of the source operands, causing incorrect
      results if the register allocator happened to want to use
      something else. For instance the input AArch64 insn:
       DUP v26.16b, w21
      which becomes TCG IR ops:
       dup_vec v128,e8,tmp2,x21
       st_vec v128,e8,tmp2,env,$0xa40
      was assembled to:
      0x607c568c:  c4 c1 7a 7e 86 e8 00 00  vmovq    0xe8(%r14), %xmm0
      0x607c5694:  00
      0x607c5695:  c5 f9 60 c8              vpunpcklbw %xmm0, %xmm0, %xmm1
      0x607c5699:  c5 f9 61 c9              vpunpcklwd %xmm1, %xmm0, %xmm1
      0x607c569d:  c5 f9 70 c9 00           vpshufd  $0, %xmm1, %xmm1
      0x607c56a2:  c4 c1 7a 7f 8e 40 0a 00  vmovdqu  %xmm1, 0xa40(%r14)
      0x607c56aa:  00
      
      when the vpunpcklwd insn should be "%xmm1, %xmm1, %xmm1".
      This resulted in our incorrectly setting the output vector to
      q26=0000320000003200:0000320000003200
      when given an input of x21 == 0000000002803200
      rather than the expected all-zeroes.
      
      Pass the correct source register number to tcg_out_vex_modrm()
      for these insns.
      
      Fixes: 770c2fc7
      Cc: qemu-stable@nongnu.org
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      Message-Id: <20180504153431.5169-1-peter.maydell@linaro.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
  13. 02 May 2018, 5 commits
  14. 16 Apr 2018, 1 commit
    • tcg/mips: Handle large offsets from target env to tlb_table · 161dfd1e
      Peter Maydell authored
      The MIPS TCG target makes the assumption that the offset from the
      target env pointer to the tlb_table is less than about 64K. This
      used to be true, but gradual addition of features to the Arm
      target means that it's no longer true there. This results in
      the build-time assertion failing:
      
      In file included from /home/pm215/qemu/include/qemu/osdep.h:36:0,
                       from /home/pm215/qemu/tcg/tcg.c:28:
      /home/pm215/qemu/tcg/mips/tcg-target.inc.c: In function ‘tcg_out_tlb_load’:
      /home/pm215/qemu/include/qemu/compiler.h:90:36: error: static assertion failed: "not expecting: offsetof(CPUArchState, tlb_table[NB_MMU_MODES - 1][1]) > 0x7ff0 + 0x7fff"
       #define QEMU_BUILD_BUG_MSG(x, msg) _Static_assert(!(x), msg)
                                          ^
      /home/pm215/qemu/include/qemu/compiler.h:98:30: note: in expansion of macro ‘QEMU_BUILD_BUG_MSG’
       #define QEMU_BUILD_BUG_ON(x) QEMU_BUILD_BUG_MSG(x, "not expecting: " #x)
                                    ^
      /home/pm215/qemu/tcg/mips/tcg-target.inc.c:1236:9: note: in expansion of macro ‘QEMU_BUILD_BUG_ON’
               QEMU_BUILD_BUG_ON(offsetof(CPUArchState,
               ^
      /home/pm215/qemu/rules.mak:66: recipe for target 'tcg/tcg.o' failed
      
      An ideal long term approach would be to rearrange the CPU state
      so that the tlb_table was not so far along it, but this is tricky
      because it would move it from the "not cleared on CPU reset" part
      of the struct to the "cleared on CPU reset" part. As a simple fix
      for the 2.12 release, make the MIPS TCG target handle an arbitrary
      offset by emitting more add instructions. This will mean an extra
      instruction in the fastpath for TCG loads and stores for the
      affected guests (currently just aarch64-softmmu).
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Message-id: 20180413142336.32163-1-peter.maydell@linaro.org
  15. 10 Apr 2018, 1 commit
  16. 28 Mar 2018, 1 commit
  17. 16 Mar 2018, 3 commits
  18. 08 Feb 2018, 10 commits