1. 24 Feb 2017 (8 commits)
    • cpu-exec: remove unnecessary check of cpu->exit_request · 55ac0a9b
      Paolo Bonzini authored
      The cpu->exit_request check in cpu_loop_exec_tb is unnecessary,
      because cpu->tcg_exit_req is always set after cpu->exit_request.
      So let the TB exit and we will pick up the exit request later
      in cpu_handle_interrupt.
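
      A minimal sketch of the resulting flow, as a hypothetical stand-alone
      rendering (field and function names follow QEMU's cpu-exec.c of this
      period, from memory):

        #include <stdatomic.h>
        #include <stdbool.h>

        typedef struct CPUState {
            atomic_bool exit_request;  /* set by others to request an exit */
            atomic_uint tcg_exit_req;  /* set right after exit_request */
        } CPUState;

        /* The TB now runs to its exit (tcg_exit_req pops it out), and the
         * pending request is noticed here, not in cpu_loop_exec_tb. */
        static bool cpu_handle_interrupt(CPUState *cpu)
        {
            if (atomic_load(&cpu->exit_request)) {
                atomic_store(&cpu->exit_request, false);
                return true;           /* leave the execution loop */
            }
            return false;
        }
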
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • replay: check icount in cpu exec loop · cfb2d02b
      Pavel Dovgalyuk authored
      This patch adds a check to break out of the CPU loop when icount
      expires without setting the TB_EXIT_ICOUNT_EXPIRED flag. This happens
      when no translated block is available and all the requested
      instructions have been executed. In icount replay mode an unnecessary
      tb_find would otherwise be called (which may cause an exception) and
      execution would become non-deterministic.
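
      A hedged sketch of where such a check sits in the loop; the budget
      field stands in for QEMU's icount_decr.u16.low, and the helpers are
      stubs:

        #include <stdbool.h>
        #include <stdint.h>

        typedef struct CPUState {
            uint16_t icount_budget;    /* stand-in for icount_decr.u16.low */
            bool exit_request;
        } CPUState;

        static void execute_one_tb(CPUState *cpu)
        {
            if (cpu->icount_budget) {
                cpu->icount_budget--;  /* stub: pretend one insn retired */
            }
        }

        static void cpu_loop(CPUState *cpu)
        {
            while (!cpu->exit_request) {
                execute_one_tb(cpu);
                if (cpu->icount_budget == 0) {
                    /* icount expired without TB_EXIT_ICOUNT_EXPIRED:
                     * break instead of calling tb_find again, keeping
                     * replay deterministic. */
                    break;
                }
            }
        }
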
      Because cpu_loop_exec_tb cannot longjmp anymore, we can remove
      the anticipated call to align_clocks in cpu_loop_exec_tb, as
      well as the SyncClocks *sc argument.
      Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
      Message-Id: <002801d2810f$18809c20$4981d460$@ru>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Pavel Dovgalyuk <dovgaluk@ispras.ru>
    • tcg: handle EXCP_ATOMIC exception for system emulation · 08e73c48
      Pranith Kumar authored
      This patch enables handling atomic code in the guest. This would
      preferably be done in cpu_handle_exception(), but the current
      assumptions regarding when we can execute atomic sections cause a
      deadlock.
      
      The current mechanism discards the flags which were set in atomic
      execution. We ensure they are properly saved by calling the
      cc->cpu_exec_enter/leave() functions around the loop.
      
      As we are running cpu_exec_step_atomic() from the outermost loop we
      need to avoid an abort() when single-stepping over atomic code, since
      the debug exception's longjmp would otherwise land on the sigsetjmp in
      cpu_exec(). We do this by setting a new jmp_env so that the exception
      jumps back here instead.
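
      A sketch of the jmp_env trick described above, assuming (as in QEMU's
      CPUState) a sigjmp_buf member; the stepping helper is a stub:

        #include <setjmp.h>

        typedef struct CPUState {
            sigjmp_buf jmp_env;        /* as in QEMU's CPUState */
        } CPUState;

        static void step_atomic_insn(CPUState *cpu) { (void)cpu; /* stub */ }

        static void cpu_exec_step_atomic(CPUState *cpu)
        {
            if (sigsetjmp(cpu->jmp_env, 0) == 0) {
                step_atomic_insn(cpu); /* runs exclusively of other vCPUs */
            }
            /* A debug exception siglongjmp()s back here instead of into
             * cpu_exec()'s handler, avoiding the abort(). */
        }
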
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      [AJB: tweak title, merge with new patches, add mmap_lock]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      CC: Paolo Bonzini <pbonzini@redhat.com>
    • tcg: enable thread-per-vCPU · 37257942
      Alex Bennée authored
      There are a couple of changes that occur at the same time here:
      
        - introduce a single vCPU qemu_tcg_cpu_thread_fn
      
        One of these is spawned per vCPU with its own Thread and Condition
        variables. qemu_tcg_rr_cpu_thread_fn is the new name for the old
        single threaded function.
      
        - the TLS current_cpu variable is now live for the lifetime of MTTCG
          vCPU threads. This is for future work where async jobs need to know
          the vCPU context they are operating in.
      
      The user can now switch on multi-threaded behaviour and spawn a
      thread per vCPU. A simple kvm-unit-test like:
      
        ./arm/run ./arm/locking-test.flat -smp 4 -accel tcg,thread=multi
      
      will now use 4 vCPU threads and produce an expected FAIL (instead of
      the unexpected PASS), as the default mode of the test has no
      protection when incrementing a shared variable.
      
      We enable the parallel_cpus flag to ensure we generate correct barrier
      and atomic code if supported by the front and backends. This doesn't
      automatically enable MTTCG until default_mttcg_enabled() is updated to
      check the configuration is supported.
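
      A minimal sketch of the thread-per-vCPU shape using plain pthreads;
      QEMU itself uses qemu_thread_create() plus per-CPU halt condition
      variables, so the names below are illustrative:

        #include <pthread.h>

        typedef struct CPUState {
            pthread_t thread;
            int cpu_index;
        } CPUState;

        /* TLS pointer, now live for the whole lifetime of the thread. */
        static __thread CPUState *current_cpu;

        static void *qemu_tcg_cpu_thread_fn(void *arg)
        {
            CPUState *cpu = arg;
            current_cpu = cpu;         /* async jobs can find their vCPU */
            /* ... per-vCPU execution loop ... */
            return NULL;
        }

        static void spawn_vcpu_thread(CPUState *cpu)
        {
            pthread_create(&cpu->thread, NULL, qemu_tcg_cpu_thread_fn, cpu);
        }
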
      Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [AJB: Some fixes, conditionally, commit rewording]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
    • tcg: remove global exit_request · e5143e30
      Alex Bennée authored
      There are now only two uses of the global exit_request left.
      
      The first ensures we exit the run_loop when we first start to process
      pending work and in the kick handler. This is just as easily done by
      setting the first_cpu->exit_request flag.
      
      The second use is in the round-robin kick routine. The global
      exit_request ensured every vCPU would set its local exit_request and
      cause a full exit of the loop. Now that the iothread lock is no longer
      held while running, we can simply rely on the kick handler to push us
      out as intended.
      
      We lightly re-factor the main vCPU thread to ensure a
      cpu->exit_request causes us to exit the main loop and process any IO
      requests that might come along. As a cpu->exit_request may
      legitimately get squashed while processing the EXCP_INTERRUPT
      exception, we also check cpu->queued_work_first to ensure queued work
      is expedited as soon as possible.
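
      A sketch of the replacement for the first use, with C11 atomics
      standing in for QEMU's atomic_set/atomic_read and the structure
      simplified:

        #include <stdatomic.h>
        #include <stdbool.h>

        typedef struct CPUState {
            atomic_bool exit_request;
        } CPUState;

        static CPUState cpus[4];
        static CPUState *first_cpu = &cpus[0];

        /* Instead of raising a global exit_request, flag the first vCPU;
         * the run loop checks it when it starts processing work. */
        static void request_run_loop_exit(void)
        {
            atomic_store(&first_cpu->exit_request, true);
        }
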
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
    • tcg: drop global lock during TCG code execution · 8d04fb55
      Jan Kiszka authored
      This finally allows TCG to benefit from the iothread introduction: Drop
      the global mutex while running pure TCG CPU code. Reacquire the lock
      when entering MMIO or PIO emulation, or when leaving the TCG loop.
      
      We have to revert a few optimizations for the current TCG threading
      model, namely kicking the TCG thread in qemu_mutex_lock_iothread and
      not kicking it in qemu_cpu_kick. We also need to disable RAM block
      reordering until we have a more efficient locking mechanism at hand.
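
      A hedged sketch of the resulting locking pattern; the lock/unlock
      entry points are QEMU's real ones, the wrapper is simplified:

        typedef struct CPUState CPUState;

        void qemu_mutex_lock_iothread(void);   /* QEMU's BQL helpers */
        void qemu_mutex_unlock_iothread(void);
        int cpu_exec(CPUState *cpu);

        static int tcg_cpu_exec(CPUState *cpu)
        {
            int ret;
            qemu_mutex_unlock_iothread();      /* pure TCG runs lock-free */
            ret = cpu_exec(cpu);
            qemu_mutex_lock_iothread();        /* back under the BQL for IO */
            return ret;
        }
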
      
      Still, a Linux x86 UP guest and my Musicpal ARM model boot fine here.
      These numbers demonstrate where we gain something:
      
      20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
      20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm
      
      The guest CPU was fully loaded, but the iothread could still run
      mostly independently on a second core. Without the patch we don't get
      beyond:
      
      32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
      32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm
      
      We don't benefit significantly, though, when the guest is not fully
      loading a host CPU.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Message-Id: <1439220437-23957-10-git-send-email-fred.konrad@greensocs.com>
      [FK: Rebase, fix qemu_devices_reset deadlock, rm address_space_* mutex]
      Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
      [EGC: fixed iothread lock for cpu-exec IRQ handling]
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      [AJB: -smp single-threaded fix, clean commit msg, BQL fixes]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
      [PM: target-arm changes]
      Acked-by: Peter Maydell <peter.maydell@linaro.org>
    • tcg: rename tcg_current_cpu to tcg_current_rr_cpu · 791158d9
      Alex Bennée authored
      ...and make the definition local to cpus.c. In preparation for MTTCG
      the concept of a global tcg_current_cpu will no longer make sense;
      however, we still need to keep track of it in the single-threaded case
      to be able to exit quickly when required.
      
      qemu_cpu_kick_no_halt() moves and becomes qemu_cpu_kick_rr_cpu() to
      emphasise its use-case. qemu_cpu_kick now kicks the relevant vCPU as
      well as calling qemu_cpu_kick_rr_cpu(), which will become a no-op
      under MTTCG.
      
      For the time being the setting of the global exit_request remains.
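
      A reconstruction from memory of the renamed kick helper (treat the
      details as illustrative); the loop re-reads the pointer in case the
      round-robin scheduler moved on mid-kick:

        #include <stdatomic.h>

        typedef struct CPUState CPUState;
        void cpu_exit(CPUState *cpu);          /* QEMU: request loop exit */

        static CPUState *_Atomic tcg_current_rr_cpu;  /* local to cpus.c */

        static void qemu_cpu_kick_rr_cpu(void)
        {
            CPUState *cpu;
            do {
                cpu = atomic_load(&tcg_current_rr_cpu);
                if (cpu) {
                    cpu_exit(cpu);
                }
            } while (cpu != atomic_load(&tcg_current_rr_cpu));
        }
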
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
    • mttcg: Add missing tb_lock/unlock() in cpu_exec_step() · 4ec66704
      Pranith Kumar authored
      A recent patch enabling lock assertions uncovered missing
      tb_lock/unlock calls in cpu_exec_step(). This patch adds them.
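
      A sketch of the shape of the fix; tb_lock()/tb_unlock() are QEMU's
      names, the generation step is a stub:

        void tb_lock(void);                    /* QEMU's translation lock */
        void tb_unlock(void);

        typedef struct TranslationBlock TranslationBlock;
        TranslationBlock *generate_tb(void);   /* stub for tb_gen_code() */

        static TranslationBlock *cpu_exec_step_gen(void)
        {
            TranslationBlock *tb;
            tb_lock();                         /* codegen touches shared state */
            tb = generate_tb();
            tb_unlock();
            return tb;
        }
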
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
  2. 22 Feb 2017 (1 commit)
  3. 17 Feb 2017 (1 commit)
  4. 16 Feb 2017 (5 commits)
  5. 01 Feb 2017 (1 commit)
  6. 28 Jan 2017 (1 commit)
  7. 02 Nov 2016 (1 commit)
    • log: Add locking to large logging blocks · 1ee73216
      Richard Henderson authored
      Reuse the existing locking provided by stdio to keep in_asm, cpu,
      op, op_opt, op_ind, and out_asm as contiguous blocks.
      
      While it isn't possible to interleave e.g. in_asm or op_opt logs
      because of the TB lock protecting all code generation, it is
      possible to interleave cpu logs, or to interleave a cpu dump with
      an out_asm dump.
      
      For mingw32, we appear to have no viable solution for this.  The locking
      functions are not properly exported from the system runtime library.
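
      A minimal sketch of the approach, assuming POSIX
      flockfile()/funlockfile() (the stdio locking that mingw32 does not
      usably export):

        #include <stdio.h>

        /* Hold the stream's built-in lock across a multi-line dump so
         * concurrent threads cannot interleave their own blocks. */
        static void log_contiguous(FILE *f, const char *tag, const char *body)
        {
            flockfile(f);
            fprintf(f, "---- %s ----\n%s\n", tag, body);
            funlockfile(f);
        }
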
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
  8. 31 Oct 2016 (2 commits)
  9. 26 Oct 2016 (2 commits)
  10. 04 Oct 2016 (1 commit)
  11. 27 Sep 2016 (1 commit)
  12. 16 Sep 2016 (1 commit)
  13. 14 Sep 2016 (8 commits)
  14. 17 Jul 2016 (1 commit)
  15. 12 Jun 2016 (2 commits)
    • tb hash: track translated blocks with qht · 909eaac9
      Emilio G. Cota authored
      Having a fixed-size hash table for keeping track of all translation blocks
      is suboptimal: some workloads are just too big or too small to get maximum
      performance from the hash table. The MRU promotion policy helps improve
      performance when the hash table is a little undersized, but it cannot
      make up for severely undersized hash tables.
      
      Furthermore, frequent MRU promotions result in writes that are a scalability
      bottleneck. For scalability, lookups should only perform reads, not writes.
      This is not a big deal for now, but it will become one once MTTCG matures.
      
      The appended patch fixes these issues by using qht as the
      implementation of the TB hash table. This solution is superior to the
      other alternatives considered, namely:
      
      - master: implementation in QEMU before this patchset
      - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
      - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
                    MRU is implemented here by adding an intermediate struct
                    that contains the u32 hash and a pointer to the TB; this
                    allows us, on an MRU promotion, to copy said struct (that is not
                    at the head), and put this new copy at the head. After a grace
                    period, the original non-head struct can be eliminated, and
                    after another grace period, freed.
      - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                         no MRU for lookups; MRU for inserts.
      The appended solution is the following:
      - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                       no MRU for lookups; MRU for inserts.
      
      The plots below compare the considered solutions. The Y axis shows the
      boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
      sweeps the number of buckets (or initial number of buckets for qht-autoresize).
      The plots in PNG format (and with errorbars) can be seen here:
        http://imgur.com/a/Awgnq
      
      Each test runs 5 times, and the entire QEMU process is pinned to a
      single core for repeatability of results.
      
                                  Host: Intel Xeon E5-2690
      
        28 ++------------+-------------+-------------+-------------+------------++
           A*****        +             +             +             master **A*** +
        27 ++    *                                                 xxhash ##B###++
           |      A******A******                               xxhash-rcu $$C$$$ |
        26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
           D%%$$                              A******A******A*qht-dyn-mru A*E****A
        25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
           B#####%                                                               |
        24 ++    #C$$$$$                                                        ++
           |      B###  $                                                        |
           |          ## C$$$$$$                                                 |
        23 ++           #       C$$$$$$                                         ++
           |             B######       C$$$$$$                                %%%D
        22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
           |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
        21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
           +             E@@@   F&&&   +      E@     +      F&&&   +             +
        20 ++------------+-------------+-------------+-------------+------------++
           14            16            18            20            22            24
                                   log2 number of buckets
      
                                       Host: Intel i7-4790K
      
        14.5 ++------------+------------+-------------+------------+------------++
             A**           +            +             +            master **A*** +
          14 ++ **                                                 xxhash ##B###++
        13.5 ++   **                                           xxhash-rcu $$C$$$++
             |                                            qht-fixed-nomru %%D%%% |
          13 ++     A******                                   qht-dyn-mru @@E@@@++
             |             A*****A******A******             qht-dyn-nomru &&F&&& |
        12.5 C$$                               A******A******A*****A******    ***A
          12 ++ $$                                                        A***  ++
             D%%% $$                                                             |
        11.5 ++  %%                                                             ++
             B###  %C$$$$$$                                                      |
          11 ++  ## D%%%%% C$$$$$                                               ++
             |     #      %      C$$$$$$                                         |
        10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
          10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
             +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
         9.5 ++------------+------------+-------------+------------+------------++
             14            16           18            20           22            24
                                    log2 number of buckets
      
      Note that the original point before this patch series is X=15 for "master";
      the little sensitivity to the increased number of buckets is due to the
      poor hashing function in master.
      
      xxhash-rcu has significant overhead due to the constant churn of allocating
      and deallocating intermediate structs for implementing MRU. An alternative
      would be to consider failed lookups as "maybe not there", and then
      acquire the external lock (tb_lock in this case) to really confirm that
      there was indeed a failed lookup. This, however, would not be enough
      to implement dynamic resizing--this is more complex: see
      "Resizable, Scalable, Concurrent Hash Tables via Relativistic
      Programming" by Triplett, McKenney and Walpole. This solution was
      discarded due to the very coarse RCU read critical sections that we have
      in MTTCG; resizing requires waiting for readers after every pointer update,
      and resizes require many pointer updates, so this would quickly become
      prohibitive.
      
      qht-fixed-nomru shows that MRU promotion is advisable for undersized
      hash tables.
      
      However, qht-dyn-mru shows that MRU promotion is not important if the
      hash table is properly sized: there is virtually no difference in
      performance between qht-dyn-nomru and qht-dyn-mru.
      
      Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
      X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
      can achieve with optimum sizing of the hash table, while keeping the hash
      table scalable for readers.
      
      The improvement we get before and after this patch for booting debian jessie
      with arm-softmmu is:
      
      - Intel Xeon E5-2690: 10.5% less time
      - Intel i7-4790K: 5.2% less time
      
      We could get this same improvement _for this particular workload_ by
      statically increasing the size of the hash table. But this would hurt
      workloads that do not need a large hash table. The dynamic (upward)
      resizing allows us to start small and enlarge the hash table as needed.
      
      A quick note on downsizing: the table is resized back to 2**15 buckets
      on every tb_flush; this makes sense because it is not guaranteed that the
      table will reach the same number of TBs later on (e.g. most bootup code is
      thrown away after boot); it makes sense to grow the hash table as
      more code blocks are translated. This also avoids the complication of
      having to build downsizing hysteresis logic into qht.
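
      A hedged usage sketch of the qht interface this series introduces;
      the signatures are reproduced from memory and may differ in detail
      from the real include/qemu/qht.h:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        struct qht;                            /* opaque here */
        typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);

        void  qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
        void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
                         const void *userp, uint32_t hash);
        bool  qht_insert(struct qht *ht, void *p, uint32_t hash);

        static bool tb_cmp(const void *obj, const void *userp)
        {
            return obj == userp;               /* stub comparison */
        }

        /* Lookups stay read-only (no MRU writes), which is what keeps the
         * table scalable for readers; resizing happens on the insert side. */
        static void *tb_find_or_add(struct qht *ht, void *tb, uint32_t hash)
        {
            void *found = qht_lookup(ht, tb_cmp, tb, hash);
            if (!found) {
                qht_insert(ht, tb, hash);
                found = tb;
            }
            return found;
        }
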
      Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
    • tb hash: hash phys_pc, pc, and flags with xxhash · 42bd3228
      Emilio G. Cota authored
      For some workloads such as arm bootup, tb_phys_hash is
      performance-critical. This is due to the high frequency of accesses
      to the hash table, caused by (frequent) TLB flushes that wipe out the
      cpu-private tb_jmp_caches.
      More info:
        https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
      
      To dig further into this I modified an arm image booting debian jessie to
      immediately shut down after boot. Analysis revealed that quite a bit of time
      is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
      results in very uneven loading of chains in the hash table's buckets;
      the longest observed chain had ~550 elements.
      
      The appended patch addresses this with two changes:
      
      1) Use xxhash as the hash table's hash function. xxhash is a fast,
         high-quality hashing function.
      
      2) Feed the hashing function with not just tb_phys, but also pc and flags.
      
      This improves performance over using just tb_phys for hashing, since
      that resulted in some hash buckets having many TBs while others got
      very few; with these changes, the longest observed chain in a single
      hash bucket is brought down from ~550 to ~40.
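
      A sketch of the idea of hashing all three identifying fields; QEMU's
      real implementation is an xxhash-derived tb_hash_func5(), and the
      mixer below is only a labeled stand-in, not the real xxhash:

        #include <stdint.h>

        /* Placeholder avalanche step; NOT the real xxhash round. */
        static uint32_t mix32(uint32_t h, uint32_t v)
        {
            h ^= v * 0x9e3779b1u;
            h = (h << 13) | (h >> 19);
            return h * 5u + 0xe6546b64u;
        }

        static uint32_t tb_hash(uint64_t phys_pc, uint64_t pc, uint32_t flags)
        {
            uint32_t h = 1u;
            h = mix32(h, (uint32_t)phys_pc);
            h = mix32(h, (uint32_t)(phys_pc >> 32));
            h = mix32(h, (uint32_t)pc);
            h = mix32(h, (uint32_t)(pc >> 32));
            h = mix32(h, flags);               /* pc and flags now matter too */
            return h;
        }
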
      
      Tests show that the other element checked for in tb_find_physical,
      cs_base, is always a match when tb_phys+pc+flags are a match,
      so hashing cs_base is wasteful. It could be that this is an ARM-only
      thing, though. UPDATE:
      On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
      > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
      > consisting of only a delay slot).
      > It may well still turn out to be reasonable to ignore cs_base for hashing.
      
      BTW, after this change the hash table should not be called
      "tb_phys_hash" anymore; this is addressed later in this series.
      
      This change gives consistent bootup time improvements. I tested two
      host machines:
      - Intel Xeon E5-2690: 11.6% less time
      - Intel i7-4790K: 19.2% less time
      
      Increasing the number of hash buckets yields further improvements. However,
      using a larger, fixed number of buckets can degrade performance for other
      workloads that do not translate as many blocks (600K+ for debian-jessie arm
      bootup). This is dealt with later in this series.
      Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
  16. 26 May 2016 (1 commit)
  17. 19 May 2016 (1 commit)
  18. 13 May 2016 (2 commits)