1. 19 October 2018, 1 commit
  2. 16 June 2018, 3 commits
    • tcg: Reduce max TB opcode count · 9f754620
      Authored by Richard Henderson
      Also, assert that we don't overflow either of two different offsets into
      the TB: both unwind and goto_tb record a uint16_t for later use.
      
      This fixes an arm-softmmu test case utilizing NEON in which there is
      a TB generated that runs to 7800 opcodes, and compiles to 96k on an
      x86_64 host.  This overflows the 16-bit offset in which we record the
      goto_tb reset offset.  Because of that overflow, we install a jump
      destination that goes to neverland.  Boom.
      
      With this reduced op count, the same TB compiles to about 48k for
      aarch64, ppc64le, and x86_64 hosts, and neither assertion fires.
      
      Cc: qemu-stable@nongnu.org
      Reported-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
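      The failure mode above comes down to storing a code-buffer offset in a
      uint16_t. A minimal sketch of the guard being added, with an invented
      helper name (record_code_offset) rather than QEMU's actual code:

          /* Both the unwind data and the goto_tb reset offset are recorded
           * as 16-bit offsets from the start of the TB's generated code.
           * If one TB compiles to more than 64 KiB of host code, the value
           * silently truncates and a later jump is patched to the wrong
           * address. */
          #include <assert.h>
          #include <stddef.h>
          #include <stdint.h>

          static uint16_t record_code_offset(const void *tb_start,
                                             const void *here)
          {
              ptrdiff_t off = (const char *)here - (const char *)tb_start;
              /* Fail loudly instead of truncating: this is the shape of
               * the assertions the commit adds. */
              assert(off >= 0 && off <= UINT16_MAX);
              return (uint16_t)off;
          }

      With the opcode count capped, generated code stays well under the
      64 KiB limit on the measured hosts, so the assertion is a backstop
      rather than a check that fires in practice.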
    • tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx · 128ed227
      Authored by Emilio G. Cota
      Thereby making it per-TCGContext. Once we remove tb_lock, this will
      avoid an atomic increment every time a TB is invalidated.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
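      The point of the move is that a counter owned by one TCG thread can be
      bumped with a plain increment, while readers sum across contexts. A
      hedged sketch with invented names (tcg_context, the next link); the
      real code keeps the counter in TCGContext and may still use atomic
      accessors for cross-thread visibility, which this omits for brevity:

          #include <stddef.h>

          struct tcg_context {
              size_t tb_phys_invalidate_count; /* written only by owner */
              struct tcg_context *next;        /* list of all contexts */
          };

          /* Hot path: plain increment on the invalidating thread's own
           * context, no atomic read-modify-write needed. */
          static void tb_phys_invalidate_count_inc(struct tcg_context *s)
          {
              s->tb_phys_invalidate_count++;
          }

          /* Cold path (statistics): sum over all contexts. */
          static size_t
          tb_phys_invalidate_count_total(struct tcg_context *ctxs)
          {
              size_t total = 0;
              for (struct tcg_context *s = ctxs; s; s = s->next) {
                  total += s->tb_phys_invalidate_count;
              }
              return total;
          }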
    • tcg: track TBs with per-region BST's · be2cdc5e
      Authored by Emilio G. Cota
      This paves the way for enabling scalable parallel generation of TCG code.
      
      Instead of tracking TBs with a single binary search tree (BST), use a
      BST for each TCG region, protecting it with a lock. This is as scalable
      as it gets, since each TCG thread operates on a separate region.
      
      The core of this change is the introduction of struct tcg_region_tree,
      which contains a pointer to a GTree and an associated lock to serialize
      accesses to it. We then allocate an array of tcg_region_tree's, adding
      the appropriate padding to avoid false sharing based on
      qemu_dcache_linesize.
      
      Given a tc_ptr, we first find the corresponding region_tree. We
      special-case the first and last regions, since they
      might be of size != region.size; otherwise we just divide the offset
      by region.stride. I was worried about this division (several dozen
      cycles of latency), but profiling shows that this is not a fast path.
      Note that region.stride is not required to be a power of two; it
      is only required to be a multiple of the host's page size.
      
      Note that with this design we can also provide consistent snapshots
      across all region trees at once; for instance, tcg_tb_foreach
      acquires/releases all region_tree locks before/after iterating over them.
      For this reason we now drop tb_lock in dump_exec_info().
      
      As an alternative I considered implementing a concurrent BST, but this
      can be tricky to get right, offers no consistent snapshots of the BST,
      and performance and scalability-wise I don't think it could ever beat
      having separate GTrees, given that our workload is insert-mostly (all
      concurrent BST designs I've seen focus, understandably, on making
      lookups fast, which comes at the expense of convoluted, non-wait-free
      insertions/removals).
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
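      A hedged sketch of the layout and lookup described above, using
      pthread and GLib stand-ins (pthread_mutex_t in place of QemuMutex)
      and invented bookkeeping names (region, region_trees, tree_size,
      tree_at); QEMU's internals differ in detail:

          #include <glib.h>
          #include <pthread.h>
          #include <stddef.h>
          #include <stdint.h>

          struct tcg_region_tree {
              pthread_mutex_t lock;
              GTree *tree;
              /* each array entry is padded out to qemu_dcache_linesize
               * when allocated, so threads working on different regions
               * don't false-share a cache line */
          };

          /* Invented stand-ins for the region bookkeeping. */
          static struct { uintptr_t start; size_t stride; size_t n; } region;
          static void *region_trees; /* array of padded entries */
          static size_t tree_size;   /* entry size, rounded up to line size */

          static struct tcg_region_tree *tree_at(size_t i)
          {
              return (struct tcg_region_tree *)
                     ((char *)region_trees + i * tree_size);
          }

          /* First and last regions may be smaller than region.stride, so
           * they are special-cased; everything else is one division, which
           * profiling shows is not on a fast path. */
          static struct tcg_region_tree *tc_ptr_to_region_tree(const void *p)
          {
              uintptr_t ptr = (uintptr_t)p;
              size_t i;

              if (ptr < region.start + region.stride) {
                  i = 0;
              } else if (ptr >= region.start +
                                (region.n - 1) * region.stride) {
                  i = region.n - 1;
              } else {
                  i = (ptr - region.start) / region.stride;
              }
              return tree_at(i);
          }

          /* Consistent snapshot: take every lock, walk every tree, then
           * release, mirroring the tcg_tb_foreach behaviour described
           * above. */
          static void tcg_tb_foreach_sketch(GTraverseFunc func, gpointer data)
          {
              for (size_t i = 0; i < region.n; i++) {
                  pthread_mutex_lock(&tree_at(i)->lock);
              }
              for (size_t i = 0; i < region.n; i++) {
                  g_tree_foreach(tree_at(i)->tree, func, data);
              }
              for (size_t i = 0; i < region.n; i++) {
                  pthread_mutex_unlock(&tree_at(i)->lock);
              }
          }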
  3. 09 May 2018, 1 commit
  4. 02 May 2018, 2 commits
  5. 08 February 2018, 5 commits
  6. 30 December 2017, 3 commits
  7. 03 November 2017, 1 commit
  8. 25 October 2017, 19 commits
  9. 11 October 2017, 1 commit
  10. 10 October 2017, 2 commits
  11. 17 September 2017, 2 commits