1. 26 Oct, 2017 17 commits
    • E
      qemu-img: Add find_nonzero() · debb38a4
      Eric Blake committed
      During 'qemu-img compare', when we are checking that an allocated
      portion of one file is all zeros, we don't need to waste time
      computing how many additional sectors after the first non-zero
      byte are also non-zero.  Create a new helper find_nonzero() to do
      the check for a first non-zero sector, and rebase
      check_empty_sectors() to use it.
      
      The new helper intentionally uses bytes in its interface, even
      though it still crawls the buffer a sector at a time; it is robust
      to a partial sector at the end of the buffer.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      debb38a4
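The helper described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the name `find_nonzero_sketch`, the constant `SKETCH_SECTOR_SIZE`, and the choice of 512 bytes per sector are assumptions. It takes a byte count, walks the buffer one sector at a time, and tolerates a partial sector at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512 /* assumed sector size for this sketch */

/*
 * Return the byte offset of the first sector that contains a non-zero
 * byte, or -1 if the first n bytes of buf are all zero.  The final
 * sector may be shorter than SKETCH_SECTOR_SIZE.
 */
static int64_t find_nonzero_sketch(const uint8_t *buf, int64_t n)
{
    assert(buf != NULL && n >= 0);
    for (int64_t off = 0; off < n; off += SKETCH_SECTOR_SIZE) {
        /* Clamp the last chunk so a partial trailing sector is handled. */
        int64_t len = n - off < SKETCH_SECTOR_SIZE ? n - off
                                                   : SKETCH_SECTOR_SIZE;
        for (int64_t i = 0; i < len; i++) {
            if (buf[off + i] != 0) {
                return off; /* stop at the first non-zero sector */
            }
        }
    }
    return -1;
}
```

The function stops as soon as one non-zero sector is found, which is the point of the commit: the caller does not need to know how many further sectors are also non-zero.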
    • E
      qemu-img: Speed up compare on pre-allocated larger file · 391cb1aa
      Eric Blake committed
      Compare the following images with all-zero contents:
      $ truncate --size 1M A
      $ qemu-img create -f qcow2 -o preallocation=off B 1G
      $ qemu-img create -f qcow2 -o preallocation=metadata C 1G
      
      On my machine, the difference is noticeable for pre-patch speeds,
      with more than an order of magnitude in difference caused by the
      choice of preallocation in the qcow2 file:
      
      $ time ./qemu-img compare -f raw -F qcow2 A B
      Warning: Image size mismatch!
      Images are identical.
      
      real	0m0.014s
      user	0m0.007s
      sys	0m0.007s
      
      $ time ./qemu-img compare -f raw -F qcow2 A C
      Warning: Image size mismatch!
      Images are identical.
      
      real	0m0.341s
      user	0m0.144s
      sys	0m0.188s
      
      Why? Because bdrv_is_allocated() returns false for image B but
      true for image C, throwing away the fact that both images know
      via lseek(SEEK_HOLE) that the entire image still reads as zero.
      From there, qemu-img ends up calling bdrv_pread() for every byte
      of the tail, instead of quickly looking for the next allocation.
      The solution: use block_status instead of is_allocated, giving:
      
      $ time ./qemu-img compare -f raw -F qcow2 A C
      Warning: Image size mismatch!
      Images are identical.
      
      real	0m0.014s
      user	0m0.011s
      sys	0m0.003s
      
      which is on par with the speeds for no pre-allocation.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      391cb1aa
    • E
      qemu-img: Simplify logic in img_compare() · 7daddc61
      Eric Blake committed
      As long as we are querying the status for a chunk smaller than
      the known image size, we are guaranteed that a successful return
      will have set pnum to a non-zero size (pnum is zero only for
      queries beyond the end of the file).  Use that to slightly
      simplify the calculation of the current chunk size being compared.
      Likewise, we don't have to shrink the amount of data operated on
      until we know we have to read the file, and therefore have to fit
      in the bounds of our buffer.  Also, note that 'total_sectors_over'
      is equivalent to 'progress_base'.
      
      With these changes in place, sectors_to_process() is now dead code,
      and can be removed.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      7daddc61
    • E
      block: Convert bdrv_get_block_status_above() to bytes · 31826642
      Eric Blake committed
      We are gradually moving away from sector-based interfaces, towards
      byte-based.  In the common case, allocation is unlikely to ever use
      values that are not naturally sector-aligned, but it is possible
      that byte-based values will let us be more precise about allocation
      at the end of an unaligned file that can do byte-based access.
      
      Changing the name of the function from bdrv_get_block_status_above()
      to bdrv_block_status_above() ensures that the compiler enforces that
      all callers are updated.  Likewise, since a byte interface allows
      an offset mapping that might not be sector aligned, split the mapping
      out of the return value and into a pass-by-reference parameter.  For
      now, the io.c layer still assert()s that all uses are sector-aligned,
      but that can be relaxed when a later patch implements byte-based
      block status in the drivers.
      
      For the most part this patch is just the addition of scaling at the
      callers followed by inverse scaling at bdrv_block_status(), plus
      updates for the new split return interface.  But some code,
      particularly bdrv_block_status(), gets a lot simpler because it no
      longer has to mess with sectors.  Likewise, mirror code no longer
      computes s->granularity >> BDRV_SECTOR_BITS, and can therefore drop
      an assertion about alignment because the loop no longer depends on
      alignment (never mind that we don't really have a driver that
      reports sub-sector alignments, so it's not really possible to test
      the effect of sub-sector mirroring).  Fix a neighboring assertion to
      use is_power_of_2 while there.
      
      For ease of review, bdrv_get_block_status() was tackled separately.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      31826642
    • E
      block: Switch bdrv_co_get_block_status_above() to byte-based · 5b648c67
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Convert another internal
      type (no semantic change), and rename it to match the corresponding
      public function rename.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      5b648c67
    • E
      block: Switch bdrv_common_block_status_above() to byte-based · 7ddb99b9
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Convert another internal
      function (no semantic change).
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      7ddb99b9
    • E
      block: Switch BdrvCoGetBlockStatusData to byte-based · 4bcd936e
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Convert another internal
      type (no semantic change), and rename it to match the corresponding
      public function rename.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      4bcd936e
    • E
      block: Switch bdrv_co_get_block_status() to byte-based · 2e8bc787
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Convert another internal
      function (no semantic change); and as with its public counterpart,
      rename to bdrv_co_block_status() and split the offset return, to
      make the compiler enforce that we catch all uses.  For now, we
      assert that callers and the return value still use aligned data,
      but ultimately, this will be the function where we hand off to a
      byte-based driver callback, and will eventually need to add logic
      to ensure we round calls according to the driver's
      request_alignment then touch up the result handed back to the
      caller, to start permitting a caller to pass unaligned offsets.
      
      Note that we are now prepared to accept 'bytes' larger than INT_MAX;
      this is okay as long as we clamp things internally before violating
      any 32-bit limits, and makes no difference to how a client will
      use the information (clients looping over the entire file must
      already be prepared for consecutive calls to return the same status,
      as drivers are already free to return shorter-than-maximal status
      due to any other convenient split points, such as when the L2 table
      crosses cluster boundaries in qcow2).
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      2e8bc787
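The requirement that clients tolerate shorter-than-maximal answers can be illustrated with a small loop. This is a sketch under stated assumptions: `sketch_block_status()` is a hypothetical stand-in for a status probe, not a QEMU function. It pretends that the first 1 MiB of the file is allocated and it never reports more than 64 KiB per call, so consecutive calls often return the same status.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_RUN  (64 * 1024)    /* hypothetical per-call clamp */
#define SKETCH_DATA_END (1024 * 1024)  /* hypothetical end of allocated data */

/* Hypothetical probe: returns 1 if [offset, offset + *pnum) is allocated. */
static int sketch_block_status(int64_t offset, int64_t bytes, int64_t *pnum)
{
    int allocated = offset < SKETCH_DATA_END;
    int64_t limit = allocated ? SKETCH_DATA_END - offset : bytes;
    int64_t n = bytes < limit ? bytes : limit;

    if (n > SKETCH_MAX_RUN) {
        n = SKETCH_MAX_RUN; /* clamp internally, as a driver is free to do */
    }
    *pnum = n;
    return allocated;
}

/* Walk the whole file and add up the allocated bytes. */
static int64_t count_allocated(int64_t total)
{
    int64_t offset = 0;
    int64_t allocated = 0;

    while (offset < total) {
        int64_t pnum;
        int status = sketch_block_status(offset, total - offset, &pnum);

        /* Queries inside the file always make progress. */
        assert(pnum > 0);
        if (status) {
            allocated += pnum;
        }
        offset += pnum;
    }
    return allocated;
}
```

Because the loop advances by whatever `pnum` comes back, the caller can pass a request far larger than any internal limit and still get the same total.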
    • E
      block: Convert bdrv_get_block_status() to bytes · 237d78f8
      Eric Blake committed
      We are gradually moving away from sector-based interfaces, towards
      byte-based.  In the common case, allocation is unlikely to ever use
      values that are not naturally sector-aligned, but it is possible
      that byte-based values will let us be more precise about allocation
      at the end of an unaligned file that can do byte-based access.
      
      Changing the name of the function from bdrv_get_block_status() to
      bdrv_block_status() ensures that the compiler enforces that all
      callers are updated.  For now, the io.c layer still assert()s that
      all callers are sector-aligned, but that can be relaxed when a later
      patch implements byte-based block status in the drivers.
      
      There was an inherent limitation in returning the offset via the
      return value: we only have room for BDRV_BLOCK_OFFSET_MASK bits, which
      means an offset can only be mapped for sector-aligned queries (or,
      if we declare that non-aligned input is at the same relative position
      modulo 512 of the answer), so the new interface also changes things to
      return the offset through a pass-by-reference parameter rather
      than mashing it into the return value.  We'll have some glue code that
      munges between the two styles until we finish converting all uses.
      
      For the most part this patch is just the addition of scaling at the
      callers followed by inverse scaling at bdrv_block_status(), coupled
      with the tweak in calling convention.  But some code, particularly
      bdrv_is_allocated(), gets a lot simpler because it no longer has to
      mess with sectors.
      
      For ease of review, bdrv_get_block_status_above() will be tackled
      separately.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      237d78f8
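The limitation of packing the offset into the return value can be sketched as below. All names and flag values here (`SKETCH_OFFSET_MASK`, `SKETCH_FLAG_DATA`, `status_packed()`, `status_split()`) are illustrative assumptions, not the QEMU definitions. The sketch only shows why sharing one integer between flags and offset loses the low bits, and how an out-parameter avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE       512
/* High bits carry the offset; the low 9 bits are left for flags. */
#define SKETCH_OFFSET_MASK       (~(int64_t)(SKETCH_SECTOR_SIZE - 1))
#define SKETCH_FLAG_DATA         0x01
#define SKETCH_FLAG_OFFSET_VALID 0x04

/* Old style: flags and host offset share one return value. */
static int64_t status_packed(int64_t host_offset)
{
    return SKETCH_FLAG_DATA | SKETCH_FLAG_OFFSET_VALID |
           (host_offset & SKETCH_OFFSET_MASK); /* sub-sector bits are lost */
}

/* New style: flags in the return value, offset through a pointer. */
static int status_split(int64_t host_offset, int64_t *map)
{
    assert(map != NULL);
    *map = host_offset; /* exact, even when not sector-aligned */
    return SKETCH_FLAG_DATA | SKETCH_FLAG_OFFSET_VALID;
}
```

With an unaligned host offset such as 0x10000 + 100, the packed form can only report 0x10000, while the split form reports the exact value.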
    • E
      qemu-img: Switch get_block_status() to byte-based · 5e344dd8
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Continue by converting
      an internal function (no semantic change), and simplifying its
      caller accordingly.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      5e344dd8
    • E
      block: Switch bdrv_make_zero() to byte-based · 7286d610
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Change the internal
      loop iteration of zeroing a device to track by bytes instead of
      sectors (although we are still guaranteed that we iterate by steps
      that are sector-aligned).
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      7286d610
    • E
      qcow2: Switch is_zero_sectors() to byte-based · f06f6b66
      Eric Blake committed
      We are gradually converting to byte-based interfaces, as they are
      easier to reason about than sector-based.  Convert another internal
      function (no semantic change), and rename it to is_zero() in the
      process.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Fam Zheng <famz@redhat.com>
      Reviewed-by: John Snow <jsnow@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      f06f6b66
    • E
      block: Make bdrv_round_to_clusters() signature more useful · 7cfd5275
      Eric Blake committed
      In the process of converting sector-based interfaces to bytes,
      I'm finding it easier to represent a byte count as a 64-bit
      integer at the block layer (even if we are internally capped
      by SIZE_MAX or even INT_MAX for individual transactions, it's
      still nicer to not have to worry about truncation/overflow
      issues on as many variables).  Update the signature of
      bdrv_round_to_clusters() to uniformly use int64_t, matching
      the signature already chosen for bdrv_is_allocated and the
      fact that off_t is also a signed type, then adjust clients
      according to the required fallout (even where the result could
      now exceed 32 bits, no client is directly assigning the result
      into a 32-bit value without breaking things into a loop first).
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      7cfd5275
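Rounding a byte range out to cluster boundaries with uniformly 64-bit arithmetic can be sketched like this. The function name and the explicit `cluster_size` parameter are assumptions made for the example; the real function takes a block driver state and obtains the cluster size from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand [offset, offset + bytes) to the smallest enclosing range whose
 * start and end are multiples of cluster_size.  All quantities are
 * int64_t so that ranges beyond 4 GiB need no special handling.
 */
static void round_to_clusters_sketch(int64_t cluster_size,
                                     int64_t offset, int64_t bytes,
                                     int64_t *cluster_offset,
                                     int64_t *cluster_bytes)
{
    assert(cluster_size > 0 && offset >= 0 && bytes >= 0);
    assert(cluster_offset != NULL && cluster_bytes != NULL);

    int64_t end = offset + bytes;
    int64_t rounded_end =
        (end + cluster_size - 1) / cluster_size * cluster_size;

    *cluster_offset = offset - offset % cluster_size; /* round start down */
    *cluster_bytes = rounded_end - *cluster_offset;   /* round end up */
}
```

A caller that later feeds the result into a 32-bit quantity would still need to split the work into a loop, as the commit message notes.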
    • E
      block: Add flag to avoid wasted work in bdrv_is_allocated() · c9ce8c4d
      Eric Blake committed
      Not all callers care about which BDS owns the mapping for a given
      range of the file, or where the zeroes lie within that mapping.  In
      particular, bdrv_is_allocated() cares more about finding the
      largest run of allocated data from the guest perspective, whether
      or not that data is consecutive from the host perspective, and
      whether or not the data reads as zero.  Therefore, doing subsequent
      refinements such as checking how much of the format-layer
      allocation also satisfies BDRV_BLOCK_ZERO at the protocol layer is
      wasted work - in the best case, it just costs extra CPU cycles
      during a single bdrv_is_allocated(), but in the worst case, it
      results in a smaller *pnum, and forces callers to iterate through
      more status probes when visiting the entire file for even more
      extra CPU cycles.
      
      This patch only optimizes the block layer (no behavior change when
      want_zero is true, but skip unnecessary effort when it is false).
      Then when subsequent patches tweak the driver callback to be
      byte-based, we can also pass this hint through to the driver.
      
      Tweak BdrvCoGetBlockStatusData to declare arguments in parameter
      order, rather than mixing things up (minimizing padding is not
      necessary here).
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      c9ce8c4d
    • E
      block: Allow NULL file for bdrv_get_block_status() · 298a1665
      Eric Blake committed
      Not all callers care about which BDS owns the mapping for a given
      range of the file.  This patch merely simplifies the callers by
      consolidating the logic in the common call point, while guaranteeing
      a non-NULL file to all the driver callbacks, for no semantic change.
      The only caller that does not care about pnum is bdrv_is_allocated,
      as invoked by vvfat; we can likewise add assertions that the rest
      of the stack does not have to worry about a NULL pnum.
      
      Furthermore, this will also set the stage for a future cleanup: when
      a caller does not care about which BDS owns an offset, it would be
      nice to allow the driver to optimize things to not have to return
      BDRV_BLOCK_OFFSET_VALID in the first place.  In the case of fragmented
      allocation (for example, it's fairly easy to create a qcow2 image
      where consecutive guest addresses are not at consecutive host
      addresses), the current contract requires bdrv_get_block_status()
      to clamp *pnum to the limit where host addresses are no longer
      consecutive, but allowing a NULL file means that *pnum could be
      set to the full length of known-allocated data.
      Signed-off-by: Eric Blake <eblake@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      298a1665
    • K
      qemu-iotests: Test backing_fmt with backing node reference · 760c4d43
      Kevin Wolf committed
      This changes test case 191 to include a backing image that has
      backing_fmt set in the image file, but is referenced by node name in the
      qemu command line.
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      760c4d43
    • P
      block: don't add 'driver' to options when referring to backing via node name · 6bff597b
      Peter Krempa committed
      When referring to a backing file of an image via node name
      bdrv_open_backing_file would add the 'driver' option to the option list
      filling it with the backing format driver. This breaks construction of
      the backing chain via -blockdev, as bdrv_open_inherit reports an error
      if both 'reference' and 'options' are provided.
      
      $ qemu-img create -f raw /tmp/backing.raw 64M
      $ qemu-img create -f qcow2 -F raw -b /tmp/backing.raw /tmp/test.qcow2
      $ qemu-system-x86_64 \
        -blockdev driver=file,filename=/tmp/backing.raw,node-name=backing \
        -blockdev driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2,node-name=root,backing=backing
      qemu-system-x86_64: -blockdev driver=qcow2,file.driver=file,file.filename=/tmp/test.qcow2,node-name=root,backing=backing: Could not open backing file: Cannot reference an existing block device with additional options or a new filename
      Signed-off-by: Peter Krempa <pkrempa@redhat.com>
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      6bff597b
  2. 25 Oct, 2017 23 commits
    • P
      Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20171025' into staging · ae49fbbc
      Peter Maydell committed
      TCG patch queue
      
      # gpg: Signature made Wed 25 Oct 2017 10:30:18 BST
      # gpg:                using RSA key 0x64DF38E8AF7E215F
      # gpg: Good signature from "Richard Henderson <richard.henderson@linaro.org>"
      # Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A  05C0 64DF 38E8 AF7E 215F
      
      * remotes/rth/tags/pull-tcg-20171025: (51 commits)
        translate-all: exit from tb_phys_invalidate if qht_remove fails
        tcg: Initialize cpu_env generically
        tcg: enable multiple TCG contexts in softmmu
        tcg: introduce regions to split code_gen_buffer
        translate-all: use qemu_protect_rwx/none helpers
        osdep: introduce qemu_mprotect_rwx/none
        tcg: allocate optimizer temps with tcg_malloc
        tcg: distribute profiling counters across TCGContext's
        tcg: introduce **tcg_ctxs to keep track of all TCGContext's
        gen-icount: fold exitreq_label into TCGContext
        tcg: define tcg_init_ctx and make tcg_ctx a pointer
        tcg: take tb_ctx out of TCGContext
        translate-all: report correct avg host TB size
        exec-all: rename tb_free to tb_remove
        translate-all: use a binary search tree to track TBs in TBContext
        tcg: Remove CF_IGNORE_ICOUNT
        tcg: Add CF_LAST_IO + CF_USE_ICOUNT to CF_HASH_MASK
        cpu-exec: lookup/generate TB outside exclusive region during step_atomic
        tcg: check CF_PARALLEL instead of parallel_cpus
        target/sparc: check CF_PARALLEL instead of parallel_cpus
        ...
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      ae49fbbc
    • P
      Merge remote-tracking branch 'remotes/juanquintela/tags/migration/20171023' into staging · 4e1b31db
      Peter Maydell committed
      migration/next for 20171023
      
      # gpg: Signature made Mon 23 Oct 2017 17:05:14 BST
      # gpg:                using RSA key 0xF487EF185872D723
      # gpg: Good signature from "Juan Quintela <quintela@redhat.com>"
      # gpg:                 aka "Juan Quintela <quintela@trasno.org>"
      # Primary key fingerprint: 1899 FF8E DEBF 58CC EE03  4B82 F487 EF18 5872 D723
      
      * remotes/juanquintela/tags/migration/20171023: (21 commits)
        migration: Improve migration thread error handling
        qapi: Fix grammar in x-multifd-page-count descriptions
        migration: add bitmap for received page
        migration: introduce qemu_ufd_copy_ioctl helper
        migration: postcopy_place_page factoring out
        migration: new ram_init_bitmaps()
        migration: clean up xbzrle cache init/destroy
        migration: provide ram_state_cleanup
        migration: provide ram_state_init()
        migration: pause-before-switchover for postcopy
        migration: allow cancel to unpause
        migrate: HMP migate_continue
        migration: migrate-continue
        migration: Wait for semaphore before completing migration
        migration: Add 'pre-switchover' and 'device' statuses
        migration: Add 'pause-before-switchover' capability
        migration: Make cache_init() take an error parameter
        migration: Move xbzrle cache resize error handling to xbzrle_cache_resize
        migration: Make cache size elements use the right types
        migratiom: Remove max_item_age parameter
        ...
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      4e1b31db
    • E
      translate-all: exit from tb_phys_invalidate if qht_remove fails · cc689485
      Emilio G. Cota committed
      Two or more threads might race while invalidating the same TB. We currently
      do not check for this at all despite taking tb_lock, which means we would
      wrongly invalidate the same TB more than once. This bug has actually been
      hit by users: I recently saw a report on IRC, although I have yet to see
      the corresponding test case.
      
      Fix this by using qht_remove as the synchronization point; if it fails,
      that means the TB has already been invalidated, and therefore there
      is nothing left to do in tb_phys_invalidate.
      
      Note that this solution works now, while we still have tb_lock, and will
      continue working once we remove tb_lock.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1508445114-4717-1-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      cc689485
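The idea of using the removal itself as the synchronization point can be sketched with C11 atomics. This is an analogy only: `SketchTB`, `sketch_table_remove()`, and the single `in_table` flag are hypothetical stand-ins for the hash table and `qht_remove`, whose real interface is not shown in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_bool in_table; /* stands in for membership in the hash table */
    int invalidations;    /* how many times teardown actually ran */
} SketchTB;

/* Returns true only for the one caller that actually removes the entry. */
static bool sketch_table_remove(SketchTB *tb)
{
    return atomic_exchange(&tb->in_table, false);
}

static void sketch_invalidate(SketchTB *tb)
{
    assert(tb != NULL);
    if (!sketch_table_remove(tb)) {
        return; /* someone else already invalidated this TB */
    }
    tb->invalidations++; /* the rest of the teardown runs exactly once */
}
```

Whichever caller wins the exchange performs the teardown; every later caller sees the removal fail and returns early, so the work is never done twice.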
    • R
      tcg: Initialize cpu_env generically · 1c2adb95
      Richard Henderson committed
      This is identical for each target.  So, move the initialization to
      common code.  Move the variable itself out of tcg_ctx and name it
      cpu_env to minimize changes within targets.
      
      This also means we can remove tcg_global_reg_new_{ptr,i32,i64},
      since there are no longer global-register temps created by targets.
      Reviewed-by: Emilio G. Cota <cota@braap.org>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      1c2adb95
    • E
      tcg: enable multiple TCG contexts in softmmu · 3468b59e
      Emilio G. Cota committed
      This enables parallel TCG code generation. However, we do not take
      advantage of it yet since tb_lock is still held during tb_gen_code.
      
      In user-mode we use a single TCG context; see the documentation
      added to tcg_region_init for the rationale.
      
      Note that targets do not need any conversion: targets initialize a
      TCGContext (e.g. defining TCG globals), and after this initialization
      has finished, the context is cloned by the vCPU threads, each of
      them keeping a separate copy.
      
      TCG threads claim one entry in tcg_ctxs[] by atomically increasing
      n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's
      of that variable and tcg_ctxs; they are there just to play nice with
      analysis tools such as thread sanitizer.
      
      Note that we do not allocate an array of contexts (we allocate
      an array of pointers instead) because when tcg_context_init
      is called, we do not know yet how many contexts we'll use since
      the bool behind qemu_tcg_mttcg_enabled() isn't set yet.
      
      Previous patches folded some TCG globals into TCGContext. The non-const
      globals remaining are only set at init time, i.e. before the TCG
      threads are spawned. Here is a list of these set-at-init-time globals
      under tcg/:
      
      Only written by tcg_context_init:
      - indirect_reg_alloc_order
      - tcg_op_defs
      Only written by tcg_target_init (called from tcg_context_init):
      - tcg_target_available_regs
      - tcg_target_call_clobber_regs
      - arm: arm_arch, use_idiv_instructions
      - i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
              have_movbe, have_popcnt
      - mips: use_movnz_instructions, use_mips32_instructions,
              use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
      - ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
      - s390: tb_ret_addr, s390_facilities
      - sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
               use_vis3_instructions
      
      Only written by tcg_prologue_init:
      - 'struct jit_code_entry one_entry'
      - aarch64: tb_ret_addr
      - arm: tb_ret_addr
      - i386: tb_ret_addr, guest_base_flags
      - ia64: tb_ret_addr
      - mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      3468b59e
    • E
      tcg: introduce regions to split code_gen_buffer · e8feb96f
      Emilio G. Cota committed
      This is groundwork for supporting multiple TCG contexts.
      
      The naive solution here is to split code_gen_buffer statically
      among the TCG threads; this however results in poor utilization
      if translation needs are different across TCG threads.
      
      What we do here is to add an extra layer of indirection, assigning
      regions that act just like pages do in virtual memory allocation.
      (BTW if you are wondering about the chosen naming, I did not want
      to use blocks or pages because those are already heavily used in QEMU).
      
      We use a global lock to serialize allocations as well as statistics
      reporting (we now export the size of the used code_gen_buffer with
      tcg_code_size()). Note that for the allocator we could just use
      a counter and atomic_inc; however, that would complicate the gathering
      of tcg_code_size()-like stats. So given that the region operations are
      not a fast path, a lock seems the most reasonable choice.
      
      The effectiveness of this approach is clear after seeing some numbers.
      I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
      Note that I'm evaluating this after enabling per-thread TCG (which
      is done by a subsequent commit).
      
      * -smp 1, 1 region (entire buffer):
          qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
          qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
          qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
          qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
          qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
          qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
          qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
          qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367
      
      That is, 8 flushes.
      
      * -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:
      
          qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
          qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
          qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
          qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
          qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
          qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
          qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
          qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368
      
      Again, 8 flushes. Note how buffer utilization is not 100%, but it
      is close. Smaller region sizes would yield higher utilization,
      but we want region allocation to be rare (it acquires a lock), so
      we do not want to go too small.
      
      * -smp 8, static partitioning of 8 regions (10 MB per region):
          qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
          qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
          qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
          qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
          qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
          qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
          qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
          qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
          qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
          qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
          qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
          qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
          qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
          qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
          qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
          qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
          qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
          qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
          qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
          qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364
      
      That is, 20 flushes. Note how a static partitioning approach uses
      the code buffer poorly, leading to many unnecessary flushes.
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      e8feb96f
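The region scheme described above can be sketched as a pool that hands out fixed-size slices of one buffer under a single lock. The type `SketchRegionPool` and both functions are hypothetical names for this illustration, and a POSIX mutex is assumed as the global lock; the commit's actual data structures are not reproduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock; /* serializes allocation and statistics */
    size_t region_size;   /* bytes per region */
    size_t n_regions;     /* total regions carved out of the buffer */
    size_t next_free;     /* index of the first unassigned region */
} SketchRegionPool;

/*
 * Hand out the next unused region.  Returns its byte offset within the
 * buffer, or -1 when no region is left (the caller would then flush).
 */
static int64_t sketch_region_alloc(SketchRegionPool *pool)
{
    int64_t start = -1;

    assert(pool != NULL);
    pthread_mutex_lock(&pool->lock);
    if (pool->next_free < pool->n_regions) {
        start = (int64_t)(pool->next_free * pool->region_size);
        pool->next_free++;
    }
    pthread_mutex_unlock(&pool->lock);
    return start;
}

/* Statistics take the same lock, so they see a consistent count. */
static size_t sketch_regions_used(SketchRegionPool *pool)
{
    size_t used;

    assert(pool != NULL);
    pthread_mutex_lock(&pool->lock);
    used = pool->next_free;
    pthread_mutex_unlock(&pool->lock);
    return used;
}
```

A thread that translates more code simply comes back for more regions, which is why utilization stays close to the single-region case in the numbers above, whereas static partitioning strands the unused share of idle threads.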
    • E
      translate-all: use qemu_protect_rwx/none helpers · f51f315a
      Emilio G. Cota committed
      The helpers require the address and size to be page-aligned, so
      do that before calling them.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      f51f315a
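Page-aligning an address range before handing it to a protection helper can be sketched as below. The function name and the explicit `page_size` parameter are assumptions for the example; the sketch only shows the alignment arithmetic (start rounded down, end rounded up), not the helpers themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand [addr, addr + size) to whole pages.  page_size must be a power
 * of two.  On return, *start is page-aligned and *len is a multiple of
 * page_size covering the original range.
 */
static void sketch_page_align(uintptr_t addr, size_t size, size_t page_size,
                              uintptr_t *start, size_t *len)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(start != NULL && len != NULL);

    uintptr_t mask = ~(uintptr_t)(page_size - 1);
    uintptr_t end = (addr + size + page_size - 1) & mask; /* round up */

    *start = addr & mask;                                 /* round down */
    *len = (size_t)(end - *start);
}
```

For example, with 4 KiB pages the range starting at 0x1ff0 with length 0x20 straddles a page boundary and expands to two full pages starting at 0x1000.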
    • E
      tcg: allocate optimizer temps with tcg_malloc · 34184b07
      Emilio G. Cota committed
      Groundwork for supporting multiple TCG contexts.
      
      While at it, also allocate temps_used directly as a bitmap of the
      required size, instead of using a bitmap of TCG_MAX_TEMPS via
      TCGTempSet.
      
      Performance-wise we lose about 1.12% in a translation-heavy workload
      such as booting+shutting down debian-arm:
      
      Performance counter stats for 'taskset -c 0 arm-softmmu/qemu-system-arm \
      	-machine type=virt -nographic -smp 1 -m 4096 \
      	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
      	-device virtio-net-device,netdev=unet \
      	-drive file=die-on-boot.qcow2,id=myblock,index=0,if=none \
      	-device virtio-blk-device,drive=myblock \
      	-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
      	-name arm,debug-threads=on -smp 1' (10 runs):
      
                   exec time (s)  Relative slowdown wrt original (%)
      ---------------------------------------------------------------
       original     20.213321616                                  0.
       tcg_malloc   20.441130078                           1.1270214
       TCGContext   20.477846517                           1.3086662
       g_malloc     20.780527895                           2.8061013
      
      The other two alternatives shown in the table are:
      - TCGContext: embed temps[TCG_MAX_TEMPS] and TCGTempSet used_temps
        in TCGContext. This is simple enough but it isn't faster than using
        tcg_malloc; moreover, it wastes memory.
      - g_malloc: allocate/deallocate both temps and used_temps every time
        tcg_optimize is executed.
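The relative slowdown column in the table above can be reproduced from the execution times alone. The helper below is a hypothetical name introduced for this example; it simply computes (variant - baseline) / baseline as a percentage.

```c
#include <assert.h>

/* Relative slowdown of variant_s with respect to baseline_s, in percent. */
static double sketch_relative_slowdown_pct(double baseline_s, double variant_s)
{
    assert(baseline_s > 0.0);
    return (variant_s - baseline_s) / baseline_s * 100.0;
}
```

Feeding in the original (20.213321616 s) and tcg_malloc (20.441130078 s) times gives roughly 1.127%, which matches the table.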
      Suggested-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      34184b07
    • E
      tcg: distribute profiling counters across TCGContext's · c3fac113
      Emilio G. Cota committed
      This is groundwork for supporting multiple TCG contexts.
      
      To avoid scalability issues when profiling info is enabled, this patch
      makes the profiling info counters distributed via the following changes:
      
      1) Consolidate profile info into its own struct, TCGProfile, which
         TCGContext also includes. Note that tcg_table_op_count is brought
         into TCGProfile after dropping the tcg_ prefix.
      2) Iterate over the TCG contexts in the system to obtain the total counts.
      
      This change also requires updating the accessors to TCGProfile fields to
      use atomic_read/set whenever there may be conflicting accesses (as defined
      in C11) to them.
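A minimal sketch of the two changes listed above, assuming one profile struct per context and a counted array of contexts. All `Sketch*` and `sketch_*` names are hypothetical, and relaxed C11 atomics are used here as a stand-in for the atomic_read/set accessors; the real TCGProfile has many more fields.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-context profile with a single counter. */
typedef struct {
    atomic_long tb_count;
} SketchProfile;

typedef struct {
    SketchProfile prof;
} SketchContext;

static void sketch_ctx_init(SketchContext *ctx)
{
    atomic_init(&ctx->prof.tb_count, 0);
}

/*
 * Only the thread owning ctx writes its counter, so a relaxed load
 * followed by a relaxed store suffices; no atomic read-modify-write.
 */
static void sketch_count_tb(SketchContext *ctx)
{
    long v = atomic_load_explicit(&ctx->prof.tb_count, memory_order_relaxed);
    atomic_store_explicit(&ctx->prof.tb_count, v + 1, memory_order_relaxed);
}

/* Any thread may read all counters to obtain the total. */
static long sketch_total_tb_count(SketchContext **ctxs, size_t n)
{
    assert(ctxs != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += atomic_load_explicit(&ctxs[i]->prof.tb_count,
                                      memory_order_relaxed);
    }
    return total;
}
```

Because each context only touches its own counters, the hot path never contends on a shared cache line; the cost is moved to the (rare) reader that sums over all contexts.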
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      c3fac113
    • E
      tcg: introduce **tcg_ctxs to keep track of all TCGContext's · df2cce29
      Emilio G. Cota committed
      Groundwork for supporting multiple TCG contexts.
      
      Note that having n_tcg_ctxs is unnecessary. However, it is
      convenient to have it, since it will simplify iterating over the
      array: we'll have just a for loop instead of having to iterate
      over a NULL-terminated array (which would require n+1 elems)
      or having to check with ifdef's for usermode/softmmu.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      df2cce29
    • E
      gen-icount: fold exitreq_label into TCGContext · 26689780
      Emilio G. Cota committed
      Groundwork for supporting multiple TCG contexts.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      26689780
    • E
      tcg: define tcg_init_ctx and make tcg_ctx a pointer · b1311c4a
      Emilio G. Cota committed
      Groundwork for supporting multiple TCG contexts.
      
      The core of this patch is this change to tcg/tcg.h:
      
      > -extern TCGContext tcg_ctx;
      > +extern TCGContext tcg_init_ctx;
      > +extern TCGContext *tcg_ctx;
      
      Note that for now we set *tcg_ctx to whatever TCGContext is passed
      to tcg_context_init -- in this case &tcg_init_ctx.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      b1311c4a
    • E
      tcg: take tb_ctx out of TCGContext · 44ded3d0
      Emilio G. Cota committed
      Groundwork for supporting multiple TCG contexts.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      44ded3d0
    • E
      translate-all: report correct avg host TB size · f19c6cc6
      Emilio G. Cota committed
      Since commit 6e3b2bfd ("tcg: allocate TB structs before the
      corresponding translated code") we are not fully utilizing
      code_gen_buffer for translated code, and therefore are
      incorrectly reporting the amount of translated code as well as
      the average host TB size. Address this by:
      
      - Making the conscious choice of misreporting the total translated code;
        doing otherwise would mislead users into thinking "-tb-size" is not
        honoured.
      
      - Expanding tb_tree_stats to accurately count the bytes of translated code on
        the host, and using this for reporting the average tb host size,
        as well as the expansion ratio.
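The second point above amounts to summing per-TB host code sizes and deriving the averages from that sum. The sketch below illustrates the arithmetic only; `SketchTB`, `SketchStats` and the `sketch_*` functions are hypothetical and do not mirror the actual tb_tree_stats layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-TB record: bytes of host code and of guest code. */
typedef struct {
    size_t host_size;
    size_t target_size;
} SketchTB;

typedef struct {
    size_t nb_tbs;
    size_t host_bytes;
    size_t target_bytes;
} SketchStats;

/* Walk all TBs and accumulate exact byte counts. */
static SketchStats sketch_collect(const SketchTB *tbs, size_t n)
{
    assert(tbs != NULL || n == 0);
    SketchStats st = { 0, 0, 0 };
    for (size_t i = 0; i < n; i++) {
        st.nb_tbs++;
        st.host_bytes += tbs[i].host_size;
        st.target_bytes += tbs[i].target_size;
    }
    return st;
}

/* Average host code size per TB; 0 when there are no TBs. */
static double sketch_avg_host_size(const SketchStats *st)
{
    return st->nb_tbs ? (double)st->host_bytes / (double)st->nb_tbs : 0.0;
}

/* Host bytes generated per guest byte; 0 when no guest code was seen. */
static double sketch_expansion_ratio(const SketchStats *st)
{
    return st->target_bytes
        ? (double)st->host_bytes / (double)st->target_bytes : 0.0;
}
```

Counting the translated bytes directly keeps the average independent of how much of the buffer is taken up by TB structs.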
      
      In the future we might want to consider reporting the accurate numbers for
      the total translated code, together with a "bookkeeping/overhead" field to
      account for the TB structs.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      f19c6cc6
    • E
      exec-all: rename tb_free to tb_remove · be1e0117
      Emilio G. Cota committed
      We don't really free anything in this function anymore; we just remove
      the TB from the binary search tree.
      Suggested-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      be1e0117
    • E
      translate-all: use a binary search tree to track TBs in TBContext · 2ac01d6d
      Emilio G. Cota committed
      This is a prerequisite for supporting multiple TCG contexts, since
      we will have threads generating code in separate regions of
      code_gen_buffer.
      
      For this we need a new field (.size) in struct tb_tc to keep
      track of the size of the translated code. This field uses a size_t
      to avoid adding a hole to the struct, although really an unsigned
      int would have been enough.
      
      The comparison function we use is optimized for the common case:
      insertions. Profiling shows that upon booting debian-arm, 98%
      of comparisons are between existing tb's (i.e. a->size and b->size
      are both !0), which happens during insertions (and removals, but
      those are rare). The remaining cases are lookups. From reading the glib
      sources we see that the first key is always the lookup key. The code
      does not rely on this, because the behaviour is not guaranteed in the
      glib docs; it only embeds this knowledge as a branch hint for the
      compiler.
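The comparison scheme described above can be sketched as follows, assuming that a lookup key is encoded as a range with size 0 and that stored entries never overlap. The `SketchRange` type, the `sketch_*` functions and the `SKETCH_LIKELY` macro are names made up for this example; it is not the comparison function from the patch and it does not use the glib tree API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define SKETCH_LIKELY(x) (x)
#endif

/* A stored entry has size != 0; a lookup key has size == 0. */
typedef struct {
    uintptr_t start;
    size_t size;
} SketchRange;

/* Compare a bare address against a stored [start, start + size) range. */
static int sketch_addr_cmp_range(uintptr_t addr, const SketchRange *r)
{
    if (addr >= r->start + r->size) {
        return 1;
    }
    if (addr < r->start) {
        return -1;
    }
    return 0;
}

static int sketch_range_cmp(const SketchRange *a, const SketchRange *b)
{
    /* Common case: both are stored entries (insertion or removal). */
    if (SKETCH_LIKELY(a->size && b->size)) {
        if (a->start > b->start) {
            return 1;
        }
        if (a->start < b->start) {
            return -1;
        }
        assert(a->size == b->size);
        return 0;
    }
    /* Lookup: hint that the key comes first, but handle both orders. */
    if (SKETCH_LIKELY(a->size == 0)) {
        return sketch_addr_cmp_range(a->start, b);
    }
    return -sketch_addr_cmp_range(b->start, a);
}
```

The branch hints affect only code layout; the function returns the same result whichever argument carries the lookup key.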
      
      Note that tb_free does not free space in the code_gen_buffer anymore,
      since we cannot easily know whether the tb is the last one inserted
      in code_gen_buffer. The next patch in this series renames tb_free
      to tb_remove to reflect this.
      
      Performance-wise, lookups in tb_find_pc are the same as before:
      O(log n). However, insertions are O(log n) instead of O(1), which
      results in a small slowdown when booting debian-arm:
      
      Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
      	-machine type=virt -nographic -smp 1 -m 4096 \
      	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
      	-device virtio-net-device,netdev=unet \
      	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
      	-device virtio-blk-device,drive=myblock \
      	-kernel img/arm/aarch32-current-linux-kernel-only.img \
      	-append console=ttyAMA0 root=/dev/vda1 \
      	-name arm,debug-threads=on -smp 1' (10 runs):
      
      - Before:
      
             8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
                  16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
                       0      cpu-migrations            #    0.000 K/sec
                  10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
          35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
          65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
          10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
             192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )
      
             8.640869419 seconds time elapsed                                          ( +-  0.57% )
      
      - After:
             8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
                  17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
                       0      cpu-migrations            #    0.000 K/sec
                  18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
          35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
          65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
          10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
             195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )
      
             8.828660235 seconds time elapsed                                          ( +-  0.38% )
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      2ac01d6d
    • R
      tcg: Remove CF_IGNORE_ICOUNT · 416986d3
      Richard Henderson committed
      Now that we have curr_cflags, we can include CF_USE_ICOUNT
      early and then remove it as necessary.
      Reviewed-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      416986d3
    • R
      tcg: Add CF_LAST_IO + CF_USE_ICOUNT to CF_HASH_MASK · 0cf8a44c
      Richard Henderson committed
      These flags are used by target/*/translate.c,
      and affect code generation.
      Reviewed-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      0cf8a44c
    • E
      cpu-exec: lookup/generate TB outside exclusive region during step_atomic · ac03ee53
      Emilio G. Cota committed
      Now that all code generation has been converted to check CF_PARALLEL, we can
      generate !CF_PARALLEL code without having yet set !parallel_cpus --
      and therefore without having to be in the exclusive region during
      cpu_exec_step_atomic.
      
      While at it, merge cpu_exec_step into cpu_exec_step_atomic.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      ac03ee53
    • E
      tcg: check CF_PARALLEL instead of parallel_cpus · e82d5a24
      Emilio G. Cota committed
      Thereby decoupling the resulting translated code from the current state
      of the system.
      
      The tb->cflags field is not passed to tcg generation functions. So
      we add a field to TCGContext, storing there a copy of tb->cflags.
      
      Most architectures have <= 32 registers, which results in a 4-byte hole
      in TCGContext. Use this hole for the new field.
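A minimal sketch of the decoupling described above: the generator context keeps its own copy of the TB's cflags, and code generation consults that copy rather than a global. Everything here is illustrative: `SketchGenContext`, `SKETCH_CF_PARALLEL` (an arbitrary bit, not QEMU's real value) and the choice between a "plain" and an "atomic" operation are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; QEMU's actual CF_PARALLEL value is not assumed. */
#define SKETCH_CF_PARALLEL (UINT32_C(1) << 0)

/* Hypothetical generator context holding a copy of the TB's cflags. */
typedef struct {
    uint32_t tb_cflags;
} SketchGenContext;

typedef enum {
    SKETCH_OP_PLAIN,
    SKETCH_OP_ATOMIC
} SketchOpKind;

/* Record the flags of the TB about to be translated. */
static void sketch_begin_translation(SketchGenContext *ctx, uint32_t cflags)
{
    assert(ctx != NULL);
    ctx->tb_cflags = cflags;
}

/* Decide from the stored flags, not from any global system state. */
static SketchOpKind sketch_pick_op(const SketchGenContext *ctx)
{
    return (ctx->tb_cflags & SKETCH_CF_PARALLEL)
        ? SKETCH_OP_ATOMIC : SKETCH_OP_PLAIN;
}
```

Since the decision depends only on the flags stored with the TB, the same TB always produces the same code regardless of when it is translated.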
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      e82d5a24
    • E
      target/sparc: check CF_PARALLEL instead of parallel_cpus · 87d757d6
      Emilio G. Cota committed
      Thereby decoupling the resulting translated code from the current state
      of the system.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      87d757d6
    • E
      target/sh4: check CF_PARALLEL instead of parallel_cpus · 671f9a85
      Emilio G. Cota committed
      Thereby decoupling the resulting translated code from the current state
      of the system.
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
      671f9a85