1. 24 Oct 2016 (1 commit)
  2. 27 Sep 2016 (1 commit)
  3. 16 Sep 2016 (1 commit)
    • tcg: Merge GETPC and GETRA · 01ecaf43
      Authored by Richard Henderson
      The return address argument to the softmmu template helpers was
      confused.  In the legacy case, we wanted to indicate that there
      is no return address, and so passed in NULL.  However, we then
      immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero
      value, indicating the presence of an (invalid) return address.
      
      Push the GETPC_ADJ subtraction down to the only point it's required:
      immediately before use within cpu_restore_state_from_tb, after all
      NULL pointer checks have been completed.
      
      This makes GETPC and GETRA identical.  Remove GETRA as the
      lesser-used macro, replacing all uses with GETPC.
      Signed-off-by: Richard Henderson <rth@twiddle.net>
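      [Editor's note: a minimal C sketch of the shape of this change. The
      GETPC_ADJ value and the restore function's signature are simplified
      stand-ins, not verbatim QEMU code.]

      #include <stdint.h>

      /* GETPC_ADJ compensates for the call instruction: the raw return
       * address points just past the call, but we want a PC value that
       * falls inside the TB. (2 is an assumed value for illustration.) */
      #define GETPC_ADJ 2

      /* After this commit, GETPC returns the unadjusted return address,
       * so a 0 "no return address" sentinel survives intact; GETRA, which
       * was this exact expression, is removed. */
      #define GETPC() ((uintptr_t)__builtin_return_address(0))

      static void cpu_restore_state_from_tb(uintptr_t retaddr)
      {
          if (retaddr == 0) {
              return;              /* legacy "no return address" case */
          }
          retaddr -= GETPC_ADJ;    /* adjust only here, after the 0 check */
          /* ... map retaddr back to a guest PC within the TB ... */
      }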
  4. 14 Sep 2016 (4 commits)
  5. 30 Aug 2016 (1 commit)
  6. 02 Aug 2016 (1 commit)
    • qht: do not segfault when gathering stats from an uninitialized qht · 7266ae91
      Authored by Emilio G. Cota
      So far, QHT functions assume that the passed qht has previously been
      initialized--otherwise they segfault.
      
      This patch makes an exception for qht_statistics_init, with the goal
      of simplifying calling code. For instance, qht_statistics_init is
      called from the 'info jit' dump, and given that under KVM the TB qht
      is never initialized, we get a segfault. Thus, instead of complicating
      the 'info jit' code with additional checks, let's allow passing an
      uninitialized qht to qht_statistics_init.
      
      While at it, add a test for this to test-qht.
      
      Before the patch (for $ qemu -enable-kvm [...]):
      (qemu) info jit
      [...]
      direct jump count   0 (0%) (2 jumps=0 0%)
      Program received signal SIGSEGV, Segmentation fault.
      
      After the patch, the "TB hash buckets", "TB hash occupancy"
      and "TB hash avg chain" lines are printed with empty (NaN/null)
      values instead of crashing:
      (qemu) info jit
      [...]
      direct jump count   0 (0%) (2 jumps=0 0%)
      TB hash buckets     0/0 (-nan% head buckets used)
      TB hash occupancy   nan% avg chain occ. Histogram: (null)
      TB hash avg chain   nan buckets. Histogram: (null)
      [...]
      
      Reported-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1469205390-14369-1-git-send-email-cota@braap.org>
      [Extract printing statistics to an entirely separate function. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
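      [Editor's note: a sketch of the guard this commit adds, using
      simplified stand-ins for the qht types; the field names are
      illustrative, not QEMU's exact definitions.]

      #include <stddef.h>

      struct qht_map;                       /* opaque here */
      struct qht { struct qht_map *map; };
      struct qht_stats {
          size_t head_buckets;
          size_t used_head_buckets;
          size_t entries;
      };

      void qht_statistics_init(struct qht *ht, struct qht_stats *stats)
      {
          struct qht_map *map = ht->map;    /* QEMU reads this via atomic_rcu_read() */

          stats->head_buckets = 0;
          stats->used_head_buckets = 0;
          stats->entries = 0;
          /* The fix: an uninitialized qht (e.g. the TB qht under KVM) has
           * no map yet, so report empty stats instead of dereferencing NULL. */
          if (map == NULL) {
              return;
          }
          /* ... otherwise walk the buckets and accumulate statistics ... */
      }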
  7. 09 Jul 2016 (1 commit)
    • translate-all: Fix user-mode self-modifying code in 2 page long TB · 7399a337
      Authored by Stanislav Shmarov
      In user-mode emulation a Translation Block (TB) can span 2 guest pages.
      In that case QEMU also mprotects the 2 host pages that back the guest
      memory containing the instructions. QEMU detects self-modifying code
      by handling the resulting SEGFAULT signal.
      
      If an instruction in the 1st page modifies memory in the 2nd
      page (or vice versa), QEMU will mark the 2nd page with PAGE_WRITE,
      invalidate the TB, generate a new TB containing 1 guest instruction,
      and exit to the CPU loop. QEMU won't call mprotect, so the new TB will
      cause the same SEGFAULT. The page will have both PAGE_WRITE_ORG and
      PAGE_WRITE flags, so QEMU will treat the signal as a problem in the
      guest binary and exit with a guest SEGFAULT.
      
      The solution is the following: even if the current TB was invalidated,
      continue to invalidate TBs from the remaining guest pages and mark the
      pages as PAGE_WRITE. After that, disable host page protection with
      mprotect. Only then, if the current TB was invalidated, longjmp to the
      main loop. That is more efficient, since we won't get a SEGFAULT when
      executing the new TB. (See the sketch after this entry.)
      Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
      Signed-off-by: Stanislav Shmarov <snarpix@gmail.com>
      Message-Id: <1467880392-1043630-1-git-send-email-snarpix@gmail.com>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
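      [Editor's note: a hedged sketch of the recovery path described above;
      the function names and signatures below are simplified stand-ins for
      QEMU's page_unprotect()/tb_invalidate_phys_page() machinery.]

      #include <setjmp.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>
      #include <sys/mman.h>

      extern sigjmp_buf cpu_loop_jmp;                /* stand-in for the CPU loop */
      bool tb_invalidate_phys_page(uintptr_t addr);  /* true if the TB currently
                                                        executing was invalidated */

      /* Invoked from the SIGSEGV handler when a write hits a page that was
       * protected to catch self-modifying code. */
      void handle_smc_write(void *host_start, size_t host_len,
                            uintptr_t guest_page1, uintptr_t guest_page2)
      {
          bool current_tb_invalidated = false;

          /* Keep going even if the current TB is hit: invalidate TBs on
           * *both* guest pages backing the TB and mark them PAGE_WRITE... */
          current_tb_invalidated |= tb_invalidate_phys_page(guest_page1);
          current_tb_invalidated |= tb_invalidate_phys_page(guest_page2);

          /* ...then drop the host-side write protection exactly once. */
          mprotect(host_start, host_len, PROT_READ | PROT_WRITE | PROT_EXEC);

          /* If the TB being executed is gone we cannot return into it:
           * longjmp back to the CPU loop to pick up a fresh TB, avoiding
           * a second SEGFAULT. */
          if (current_tb_invalidated) {
              siglongjmp(cpu_loop_jmp, 1);
          }
      }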
  8. 20 Jun 2016 (1 commit)
  9. 17 Jun 2016 (1 commit)
  10. 12 Jun 2016 (3 commits)
    • translate-all: add tb hash bucket info to 'info jit' dump · 329844d4
      Authored by Emilio G. Cota
      Examples:
      
      - Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags):
      TB count            715135/2684354
      [...]
      TB hash buckets     388775/524288 (74.15% head buckets used)
      TB hash occupancy   33.04% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
      TB hash avg chain   1.017 buckets. Histogram: 1|█▁▁|3
      
      - Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0):
      TB count            712636/2684354
      [...]
      TB hash buckets     344924/524288 (65.79% head buckets used)
      TB hash occupancy   31.64% avg chain occ. Histogram: [0,10)%|█ ▆  ▅▁▃▁▂|[90,100]%
      TB hash avg chain   1.047 buckets. Histogram: 1|█▁▁▁|4
      
      - Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0):
      TB count            702818/2684354
      [...]
      TB hash buckets     112741/524288 (21.50% head buckets used)
      TB hash occupancy   10.15% avg chain occ. Histogram: [0,10)%|█ ▁  ▁▁▁▁▁|[90,100]%
      TB hash avg chain   2.107 buckets. Histogram: [1.0,10.2)|█▁▁▁▁▁▁▁▁▁|[83.8,93.0]
      
      - Good hashing, but no auto-resize:
      TB count            715634/2684354
      TB hash buckets     8192/8192 (100.00% head buckets used)
      TB hash occupancy   98.30% avg chain occ. Histogram: [95.3,95.8)%|▁▁▃▄▃▄▁▇▁█|[99.5,100.0]%
      TB hash avg chain   22.070 buckets. Histogram: [15.0,16.7)|▁▂▅▄█▅▁▁▁▁|[30.3,32.0]
      Acked-by: Sergey Fedorov <sergey.fedorov@linaro.org>
      Suggested-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-16-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
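      [Editor's note: a sketch of how the three stats lines above might be
      produced; the struct fields are simplified stand-ins, whereas QEMU's
      real code fills a qht_stats via qht_statistics_init() and renders the
      histograms with qdist helpers.]

      #include <stdio.h>
      #include <stddef.h>

      struct tb_hash_stats {           /* illustrative stand-in */
          size_t head_buckets;         /* total head buckets in the table */
          size_t used_head_buckets;    /* head buckets with >= 1 entry */
          double avg_chain_occupancy;  /* mean fill of used chains, 0..1 */
          double avg_chain_length;     /* mean buckets per used chain */
      };

      static void dump_tb_hash_info(const struct tb_hash_stats *s)
      {
          printf("TB hash buckets     %zu/%zu (%0.2f%% head buckets used)\n",
                 s->used_head_buckets, s->head_buckets,
                 s->head_buckets
                     ? 100.0 * s->used_head_buckets / s->head_buckets : 0.0);
          printf("TB hash occupancy   %0.2f%% avg chain occ.\n",
                 100.0 * s->avg_chain_occupancy);
          printf("TB hash avg chain   %0.3f buckets.\n", s->avg_chain_length);
      }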
    • tb hash: track translated blocks with qht · 909eaac9
      Authored by Emilio G. Cota
      Having a fixed-size hash table for keeping track of all translation blocks
      is suboptimal: some workloads are just too big or too small to get maximum
      performance from the hash table. The MRU promotion policy helps improve
      performance when the hash table is a little undersized, but it cannot
      make up for severely undersized hash tables.
      
      Furthermore, frequent MRU promotions result in writes that are a scalability
      bottleneck. For scalability, lookups should only perform reads, not writes.
      This is not a big deal for now, but it will become one once MTTCG matures.
      
      The appended patch fixes these issues by using qht as the implementation
      of the TB hash table. This solution is superior to the other alternatives
      considered, namely:
      
      - master: implementation in QEMU before this patchset
      - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
      - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
                    MRU is implemented here by adding an intermediate struct
                    that contains the u32 hash and a pointer to the TB; this
                    allows us, on an MRU promotion, to copy said struct (that is not
                    at the head), and put this new copy at the head. After a grace
                    period, the original non-head struct can be eliminated, and
                    after another grace period, freed.
      - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                         no MRU for lookups; MRU for inserts.
      The appended solution is the following:
      - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                       no MRU for lookups; MRU for inserts.
      
      The plots below compare the considered solutions. The Y axis shows the
      boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
      sweeps the number of buckets (or initial number of buckets for qht-autoresize).
      The plots in PNG format (and with error bars) can be seen here:
        http://imgur.com/a/Awgnq
      
      Each test runs 5 times, and the entire QEMU process is pinned to a
      single core for repeatability of results.
      
                                  Host: Intel Xeon E5-2690
      
        28 ++------------+-------------+-------------+-------------+------------++
           A*****        +             +             +             master **A*** +
        27 ++    *                                                 xxhash ##B###++
           |      A******A******                               xxhash-rcu $$C$$$ |
        26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
           D%%$$                              A******A******A*qht-dyn-mru A*E****A
        25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
           B#####%                                                               |
        24 ++    #C$$$$$                                                        ++
           |      B###  $                                                        |
           |          ## C$$$$$$                                                 |
        23 ++           #       C$$$$$$                                         ++
           |             B######       C$$$$$$                                %%%D
        22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
           |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
        21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
           +             E@@@   F&&&   +      E@     +      F&&&   +             +
        20 ++------------+-------------+-------------+-------------+------------++
           14            16            18            20            22            24
                                   log2 number of buckets
      
                                       Host: Intel i7-4790K
      
        14.5 ++------------+------------+-------------+------------+------------++
             A**           +            +             +            master **A*** +
          14 ++ **                                                 xxhash ##B###++
        13.5 ++   **                                           xxhash-rcu $$C$$$++
             |                                            qht-fixed-nomru %%D%%% |
          13 ++     A******                                   qht-dyn-mru @@E@@@++
             |             A*****A******A******             qht-dyn-nomru &&F&&& |
        12.5 C$$                               A******A******A*****A******    ***A
          12 ++ $$                                                        A***  ++
             D%%% $$                                                             |
        11.5 ++  %%                                                             ++
             B###  %C$$$$$$                                                      |
          11 ++  ## D%%%%% C$$$$$                                               ++
             |     #      %      C$$$$$$                                         |
        10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
          10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
             +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
         9.5 ++------------+------------+-------------+------------+------------++
             14            16           18            20           22            24
                                    log2 number of buckets
      
      Note that the original point before this patch series is X=15 for "master";
      master's low sensitivity to the increased number of buckets is due to its
      poor hashing function.
      
      xxhash-rcu has significant overhead due to the constant churn of allocating
      and deallocating intermediate structs for implementing MRU. An alternative
      would be to consider failed lookups as "maybe not there", and then
      acquire the external lock (tb_lock in this case) to really confirm that
      there was indeed a failed lookup. This, however, would not be enough
      to implement dynamic resizing--this is more complex: see
      "Resizable, Scalable, Concurrent Hash Tables via Relativistic
      Programming" by Triplett, McKenney and Walpole. This solution was
      discarded due to the very coarse RCU read critical sections that we have
      in MTTCG; resizing requires waiting for readers after every pointer update,
      and resizes require many pointer updates, so this would quickly become
      prohibitive.
      
      qht-fixed-nomru shows that MRU promotion is advisable for undersized
      hash tables.
      
      However, qht-dyn-mru shows that MRU promotion is not important if the
      hash table is properly sized: there is virtually no difference in
      performance between qht-dyn-nomru and qht-dyn-mru.
      
      Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
      X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
      can achieve with optimum sizing of the hash table, while keeping the hash
      table scalable for readers.
      
      The improvement we get before and after this patch for booting debian jessie
      with arm-softmmu is:
      
      - Intel Xeon E5-2690: 10.5% less time
      - Intel i7-4790K: 5.2% less time
      
      We could get this same improvement _for this particular workload_ by
      statically increasing the size of the hash table. But this would hurt
      workloads that do not need a large hash table. The dynamic (upward)
      resizing allows us to start small and enlarge the hash table as needed.
      
      A quick note on downsizing: the table is resized back to 2**15 buckets
      on every tb_flush; this makes sense because it is not guaranteed that the
      table will reach the same number of TBs later on (e.g. most bootup code is
      thrown away after boot); it makes sense to grow the hash table as
      more code blocks are translated. This also avoids the complication of
      having to build downsizing hysteresis logic into qht.
      Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
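      [Editor's note: a sketch of the read-only lookup pattern this commit
      enables; the qht declarations below are simplified stand-ins for
      include/qemu/qht.h, and TranslationBlock/tb_hash_func are reduced to
      just the fields relevant here.]

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      struct qht;
      typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
      void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
                       const void *userp, uint32_t hash);

      typedef struct TranslationBlock {
          uint64_t phys_pc;
          uint64_t pc;
          uint32_t flags;
      } TranslationBlock;

      struct tb_desc { uint64_t phys_pc, pc; uint32_t flags; };

      uint32_t tb_hash_func(uint64_t phys_pc, uint64_t pc, uint32_t flags);

      /* A lookup only *reads* the table (no MRU promotion), which is what
       * keeps concurrent readers scalable; inserts and resizes happen
       * under an external lock (tb_lock in this series). */
      static bool tb_cmp(const void *obj, const void *userp)
      {
          const TranslationBlock *tb = obj;
          const struct tb_desc *d = userp;
          return tb->phys_pc == d->phys_pc && tb->pc == d->pc &&
                 tb->flags == d->flags;
      }

      TranslationBlock *tb_find_qht(struct qht *htable, uint64_t phys_pc,
                                    uint64_t pc, uint32_t flags)
      {
          struct tb_desc desc = { .phys_pc = phys_pc, .pc = pc, .flags = flags };
          return qht_lookup(htable, tb_cmp, &desc,
                            tb_hash_func(phys_pc, pc, flags));
      }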
    • tb hash: hash phys_pc, pc, and flags with xxhash · 42bd3228
      Authored by Emilio G. Cota
      For some workloads, such as arm bootup, tb_phys_hash is performance-critical.
      This is due to the high frequency of accesses to the hash table, caused
      by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
      More info:
        https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
      
      To dig further into this I modified an arm image booting debian jessie to
      immediately shut down after boot. Analysis revealed that quite a bit of time
      is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
      results in very uneven loading of chains in the hash table's buckets;
      the longest observed chain had ~550 elements.
      
      The appended patch addresses this with two changes:
      
      1) Use xxhash as the hash table's hash function. xxhash is a fast,
         high-quality hashing function.
      
      2) Feed the hashing function with not just tb_phys, but also pc and flags.
      
      This improves performance over using just tb_phys for hashing, since that
      resulted in some hash buckets having many TBs while others got very few;
      with these changes, the longest observed chain on a single hash bucket is
      brought down from ~550 to ~40.
      
      Tests show that the other element checked for in tb_find_physical,
      cs_base, is always a match when tb_phys+pc+flags are a match,
      so hashing cs_base is wasteful. It could be that this is an ARM-only
      thing, though. UPDATE:
      On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
      > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
      > consisting of only a delay slot).
      > It may well still turn out to be reasonable to ignore cs_base for hashing.
      
      BTW, after this change the hash table should not be called "tb_phys_hash"
      anymore; this is addressed later in this series.
      
      This change gives consistent bootup time improvements. I tested two
      host machines:
      - Intel Xeon E5-2690: 11.6% less time
      - Intel i7-4790K: 19.2% less time
      
      Increasing the number of hash buckets yields further improvements. However,
      using a larger, fixed number of buckets can degrade performance for other
      workloads that do not translate as many blocks (600K+ for debian-jessie arm
      bootup). This is dealt with later in this series.
      Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
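      [Editor's note: a self-contained sketch in the spirit of the fixed-size
      xxh32 variant this series adds; the constants are xxhash's published
      32-bit primes, while the seed and the order in which the 64-bit halves
      are fed in are illustrative assumptions.]

      #include <stdint.h>

      #define PRIME32_1 0x9E3779B1u
      #define PRIME32_2 0x85EBCA77u
      #define PRIME32_3 0xC2B2AE3Du
      #define PRIME32_4 0x27D4EB2Fu
      #define PRIME32_5 0x165667B1u
      #define TB_HASH_SEED 1u          /* assumed seed */

      static inline uint32_t rotl32(uint32_t x, int r)
      {
          return (x << r) | (x >> (32 - r));
      }

      /* xxh32 specialized for exactly five 32-bit words (20 bytes):
       * one full 16-byte stripe plus a single 4-byte tail word. */
      static uint32_t tb_hash5(uint32_t a, uint32_t b, uint32_t c,
                               uint32_t d, uint32_t e)
      {
          uint32_t v1 = TB_HASH_SEED + PRIME32_1 + PRIME32_2;
          uint32_t v2 = TB_HASH_SEED + PRIME32_2;
          uint32_t v3 = TB_HASH_SEED;
          uint32_t v4 = TB_HASH_SEED - PRIME32_1;
          uint32_t h;

          v1 += a * PRIME32_2; v1 = rotl32(v1, 13); v1 *= PRIME32_1;
          v2 += b * PRIME32_2; v2 = rotl32(v2, 13); v2 *= PRIME32_1;
          v3 += c * PRIME32_2; v3 = rotl32(v3, 13); v3 *= PRIME32_1;
          v4 += d * PRIME32_2; v4 = rotl32(v4, 13); v4 *= PRIME32_1;

          h = rotl32(v1, 1) + rotl32(v2, 7) + rotl32(v3, 12) + rotl32(v4, 18);
          h += 20;                              /* total input length, bytes */

          h += e * PRIME32_3;                   /* 4-byte tail word */
          h = rotl32(h, 17) * PRIME32_4;

          h ^= h >> 15;                         /* final avalanche */
          h *= PRIME32_2;
          h ^= h >> 13;
          h *= PRIME32_3;
          h ^= h >> 16;
          return h;
      }

      /* Feed phys_pc, pc and flags -- but not cs_base -- into the hash. */
      static uint32_t tb_hash_func(uint64_t phys_pc, uint64_t pc, uint32_t flags)
      {
          return tb_hash5((uint32_t)phys_pc, (uint32_t)(phys_pc >> 32),
                          (uint32_t)pc, (uint32_t)(pc >> 32), flags);
      }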
  11. 09 Jun 2016 (3 commits)
  12. 23 May 2016 (1 commit)
  13. 19 May 2016 (1 commit)
  14. 13 May 2016 (15 commits)
  15. 08 Apr 2016 (1 commit)
  16. 23 Mar 2016 (2 commits)
  17. 03 Feb 2016 (1 commit)
  18. 29 Jan 2016 (1 commit)
    • exec: Clean up includes · 7b31bbc2
      Authored by Peter Maydell
      Clean up includes so that osdep.h is included first and headers
      which it implies are not included manually.
      
      This commit was created with scripts/clean-includes.
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
      Message-id: 1453832250-766-4-git-send-email-peter.maydell@linaro.org
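      [Editor's note: by way of illustration, the convention the script
      enforces looks like this in a hypothetical .c file.]

      /* qemu/osdep.h must come first in every .c file... */
      #include "qemu/osdep.h"
      /* ...followed by the file's own subsystem headers. */
      #include "cpu.h"
      /* <stdio.h>, <stdint.h>, <stdbool.h> etc. are NOT included manually,
       * because osdep.h already pulls them in. */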