1. 24 Feb, 2017 5 commits
    • cputlb: introduce tlb_flush_*_all_cpus[_synced] · c3b9a07a
      Committed by Alex Bennée
      This introduces support to the cputlb API for flushing all CPUs' TLBs
      with one call. This avoids the need for target helpers to iterate
      through the vCPUs themselves.
      
      An additional variant of the API (_synced) will cause the source vCPU's
      work to be scheduled as "safe work". As a result, all the flush
      operations will be complete by the time the originating vCPU executes
      its safe work. The calling implementation can either end the TB
      straight away (which will then pick up the cpu->exit_request on
      entering the next block) or defer the exit until the architectural
      sync point (usually a barrier instruction).
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
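The difference between the plain and `_synced` variants described above can be sketched with a toy model. This is not QEMU code: `ToyCPU`, the two flags standing in for the per-vCPU work queue, and all `toy_*` functions are hypothetical names chosen for illustration, and the sketch is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_VCPUS 4
#define TOY_TLB_SIZE  8

/* Toy vCPU. The two flags stand in for QEMU's per-vCPU work queue;
 * every name in this sketch is hypothetical. */
typedef struct ToyCPU {
    bool tlb_valid[TOY_TLB_SIZE];
    bool async_flush_queued;  /* ordinary deferred work */
    bool safe_flush_queued;   /* "safe work": runs after ordinary work */
} ToyCPU;

ToyCPU toy_cpus[TOY_NUM_VCPUS];

void toy_tlb_fill(ToyCPU *cpu)
{
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        cpu->tlb_valid[i] = true;
    }
}

static void toy_tlb_flush_local(ToyCPU *cpu)
{
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        cpu->tlb_valid[i] = false;
    }
}

/* Queue a flush on every vCPU other than the source. */
static void toy_queue_on_others(ToyCPU *src)
{
    for (size_t i = 0; i < TOY_NUM_VCPUS; i++) {
        if (&toy_cpus[i] != src) {
            toy_cpus[i].async_flush_queued = true;
        }
    }
}

/* Plain variant: the source flushes itself at once, others catch up later. */
void toy_tlb_flush_all_cpus(ToyCPU *src)
{
    toy_queue_on_others(src);
    toy_tlb_flush_local(src);
}

/* Synced variant: the source's own flush is deferred as safe work. */
void toy_tlb_flush_all_cpus_synced(ToyCPU *src)
{
    toy_queue_on_others(src);
    src->safe_flush_queued = true;
}

/* Drain ordinary work on all vCPUs first, then safe work, so every
 * remote flush has completed before the source runs its safe work. */
void toy_process_queued_work(void)
{
    for (size_t i = 0; i < TOY_NUM_VCPUS; i++) {
        if (toy_cpus[i].async_flush_queued) {
            toy_tlb_flush_local(&toy_cpus[i]);
            toy_cpus[i].async_flush_queued = false;
        }
    }
    for (size_t i = 0; i < TOY_NUM_VCPUS; i++) {
        if (toy_cpus[i].safe_flush_queued) {
            /* All remote flushes are done by the time safe work runs. */
            for (size_t j = 0; j < TOY_NUM_VCPUS; j++) {
                assert(!toy_cpus[j].async_flush_queued);
            }
            toy_tlb_flush_local(&toy_cpus[i]);
            toy_cpus[i].safe_flush_queued = false;
        }
    }
}
```

In this model a target helper makes one call instead of looping over vCPUs itself, and the synced variant guarantees ordering only at the point where the source's safe work runs.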
    • cputlb and arm/sparc targets: convert mmuidx flushes from varg to bitmap · 0336cbf8
      Committed by Alex Bennée
      While the vargs approach was flexible, the original MTTCG ended up
      having to munge the bits into a bitmap so the data could be used in
      deferred work helpers. Instead of hiding that in cputlb we push the
      change to the API to make it take a bitmap of MMU indexes instead.
      
      For ARM some of the resulting flushes end up being quite long, so to aid
      readability I've tended to move the index shifting to a new line so
      all the bits being or-ed together line up nicely, for example:
      
          tlb_flush_page_by_mmuidx(other_cs, pageaddr,
                                   (1 << ARMMMUIdx_S1SE1) |
                                   (1 << ARMMMUIdx_S1SE0));
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      [AT: SPARC parts only]
      Reviewed-by: Artyom Tarasenko <atar4qemu@gmail.com>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      [PM: ARM parts only]
      Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
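As a rough illustration of why a bitmap suits deferred work better than varargs: the whole set of MMU indexes fits in one integer that can be stored and passed by value. The sketch below is a toy, not the cputlb implementation; `ToyTLB`, `TOY_NB_MMU_MODES`, the `TOY_IDX_*` names and `toy_tlb_flush_by_mmuidx()` are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NB_MMU_MODES 8

/* Hypothetical MMU index numbers, for illustration only. */
enum { TOY_IDX_USER = 0, TOY_IDX_KERNEL = 1, TOY_IDX_SECURE = 4 };

typedef struct ToyTLB {
    int entries[TOY_NB_MMU_MODES]; /* live entries per MMU index */
} ToyTLB;

/* Flush every MMU index whose bit is set in idxmap.
 * Returns the number of indexes that were flushed. */
int toy_tlb_flush_by_mmuidx(ToyTLB *tlb, uint16_t idxmap)
{
    int flushed = 0;

    for (int idx = 0; idx < TOY_NB_MMU_MODES; idx++) {
        if (idxmap & (1u << idx)) {
            tlb->entries[idx] = 0;
            flushed++;
        }
    }
    return flushed;
}
```

A caller would build the map the same way as in the commit message's example, for instance `toy_tlb_flush_by_mmuidx(&tlb, (1 << TOY_IDX_USER) | (1 << TOY_IDX_SECURE))`.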
    • cputlb: introduce tlb_flush_* async work. · e3b9ca81
      Committed by KONRAD Frederic
      Some architectures allow flushing the TLB of other vCPUs. This is not a problem
      when we have only one thread for all vCPUs, but it definitely needs to be done as
      asynchronous work when we are truly multithreaded.
      
      We take the tb_lock() when doing this to avoid racing with other threads
      which may be invalidating TBs at the same time. The alternative would
      be to use proper atomic primitives to clear the TLB entries en masse.
      
      This patch doesn't do anything to protect other cputlb functions being
      called in MTTCG mode making cross-vCPU changes.
      Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
      [AJB: remove need for g_malloc on defer, make check fixes, tb_lock]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
    • tcg: remove global exit_request · e5143e30
      Committed by Alex Bennée
      There are now only two uses of the global exit_request left.
      
      The first ensures we exit the run_loop when we first start to process
      pending work and in the kick handler. This is just as easily done by
      setting the first_cpu->exit_request flag.
      
      The second use is in the round robin kick routine. The global
      exit_request ensured every vCPU would set its local exit_request and
      cause a full exit of the loop. Now that the iothread isn't being held while
      running, we can just rely on the kick handler to push us out as intended.
      
      We lightly refactor the main vCPU thread to ensure cpu->exit_requests
      cause us to exit the main loop and process any IO requests that might
      come along. As a cpu->exit_request may legitimately get squashed
      while processing the EXCP_INTERRUPT exception, we also check
      cpu->queued_work_first to ensure queued work is expedited as soon as
      possible.
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
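The loop condition this message describes, leaving the round-robin loop on either a per-vCPU exit request or pending queued work, might look roughly like the sketch below. It is a simplified model: `ToyVCPU`, its `slices_run` and `next` fields, and `toy_rr_run()` are hypothetical, and only the two field names `exit_request` and `queued_work_first` are taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyVCPU {
    bool exit_request;             /* per-vCPU flag, no global needed */
    const void *queued_work_first; /* non-NULL when work is queued */
    int slices_run;                /* how often this vCPU got to execute */
    struct ToyVCPU *next;          /* next vCPU in round-robin order */
} ToyVCPU;

/* Run vCPUs round-robin starting at `cpu` until one of them needs
 * servicing. Returns the vCPU that stopped the loop, or NULL if every
 * vCPU in the list ran its slice. */
ToyVCPU *toy_rr_run(ToyVCPU *cpu)
{
    while (cpu && !cpu->queued_work_first && !cpu->exit_request) {
        cpu->slices_run++; /* stand-in for executing translated code */
        cpu = cpu->next;
    }
    return cpu;
}
```

Checking `queued_work_first` as well as `exit_request` means a squashed exit request does not delay queued work: the loop still stops at that vCPU.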
    • tcg: rename tcg_current_cpu to tcg_current_rr_cpu · 791158d9
      Committed by Alex Bennée
      ...and make the definition local to cpus. In preparation for MTTCG the
      concept of a global tcg_current_cpu will no longer make sense. However,
      we still need to keep track of it in the single-threaded case to be able
      to exit quickly when required.
      
      qemu_cpu_kick_no_halt() moves and becomes qemu_cpu_kick_rr_cpu() to
      emphasise its use-case. qemu_cpu_kick now kicks the relevant cpu and
      also calls qemu_cpu_kick_rr_cpu(), which will become a no-op in MTTCG.
      
      For the time being the setting of the global exit_request remains.
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
  2. 16 Feb, 2017 1 commit
  3. 13 Jan, 2017 1 commit
  4. 31 Oct, 2016 2 commits
  5. 26 Oct, 2016 1 commit
  6. 25 Oct, 2016 1 commit
  7. 27 Sep, 2016 1 commit
  8. 16 Sep, 2016 1 commit
    • tcg: Merge GETPC and GETRA · 01ecaf43
      Committed by Richard Henderson
      The return address argument to the softmmu template helpers was
      confused.  In the legacy case, we wanted to indicate that there
      is no return address, and so passed in NULL.  However, we then
      immediately subtracted GETPC_ADJ from NULL, resulting in a non-zero
      value, indicating the presence of an (invalid) return address.
      
      Push the GETPC_ADJ subtraction down to the only point it's required:
      immediately before use within cpu_restore_state_from_tb, after all
      NULL pointer checks have been completed.
      
      This makes GETPC and GETRA identical.  Remove GETRA as the lesser
      used macro, replacing all uses with GETPC.
      Signed-off-by: Richard Henderson <rth@twiddle.net>
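The bug described here comes down to unsigned arithmetic: subtracting a constant from a zero "no return address" value wraps around to a non-zero one. A minimal sketch, assuming an illustrative adjustment of 2 and hypothetical function names (neither function exists in QEMU under these names):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GETPC_ADJ 2 /* illustrative value only */

/* Old pattern: adjust first. A zero "no return address" value wraps
 * around and becomes a non-zero, invalid return address. */
uintptr_t toy_retaddr_adjusted_early(uintptr_t retaddr)
{
    return retaddr - TOY_GETPC_ADJ;
}

/* New pattern: carry the raw value around and adjust only at the point
 * of use, after the zero check. Returns 0 when there is no return
 * address to search for. */
uintptr_t toy_search_pc(uintptr_t retaddr)
{
    if (retaddr == 0) {
        return 0;
    }
    return retaddr - TOY_GETPC_ADJ;
}
```

With the adjustment made only at the point of use, the macro that fetches the return address no longer needs a separate adjusted form, which is why the two macros can be merged.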
  9. 14 Sep, 2016 1 commit
  10. 27 Jul, 2016 1 commit
  11. 12 Jul, 2016 2 commits
  12. 12 Jun, 2016 1 commit
    • tb hash: track translated blocks with qht · 909eaac9
      Committed by Emilio G. Cota
      Having a fixed-size hash table for keeping track of all translation blocks
      is suboptimal: some workloads are just too big or too small to get maximum
      performance from the hash table. The MRU promotion policy helps improve
      performance when the hash table is a little undersized, but it cannot
      make up for severely undersized hash tables.
      
      Furthermore, frequent MRU promotions result in writes that are a scalability
      bottleneck. For scalability, lookups should only perform reads, not writes.
      This is not a big deal for now, but it will become one once MTTCG matures.
      
      The appended patch fixes these issues by using qht as the implementation of
      the TB hash table. This solution is superior to other alternatives considered,
      namely:
      
      - master: implementation in QEMU before this patchset
      - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
      - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
                    MRU is implemented here by adding an intermediate struct
                    that contains the u32 hash and a pointer to the TB; this
                    allows us, on an MRU promotion, to copy said struct (that is not
                    at the head), and put this new copy at the head. After a grace
                    period, the original non-head struct can be eliminated, and
                    after another grace period, freed.
      - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                         no MRU for lookups; MRU for inserts.
      The appended solution is the following:
      - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                       no MRU for lookups; MRU for inserts.
      
      The plots below compare the considered solutions. The Y axis shows the
      boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
      sweeps the number of buckets (or initial number of buckets for qht-autoresize).
      The plots in PNG format (and with errorbars) can be seen here:
        http://imgur.com/a/Awgnq
      
      Each test runs 5 times, and the entire QEMU process is pinned to a
      single core for repeatability of results.
      
                                  Host: Intel Xeon E5-2690
      
        28 ++------------+-------------+-------------+-------------+------------++
           A*****        +             +             +             master **A*** +
        27 ++    *                                                 xxhash ##B###++
           |      A******A******                               xxhash-rcu $$C$$$ |
        26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
           D%%$$                              A******A******A*qht-dyn-mru A*E****A
        25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
           B#####%                                                               |
        24 ++    #C$$$$$                                                        ++
           |      B###  $                                                        |
           |          ## C$$$$$$                                                 |
        23 ++           #       C$$$$$$                                         ++
           |             B######       C$$$$$$                                %%%D
        22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
           |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
        21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
           +             E@@@   F&&&   +      E@     +      F&&&   +             +
        20 ++------------+-------------+-------------+-------------+------------++
           14            16            18            20            22            24
                                   log2 number of buckets
      
                                       Host: Intel i7-4790K
      
        14.5 ++------------+------------+-------------+------------+------------++
             A**           +            +             +            master **A*** +
          14 ++ **                                                 xxhash ##B###++
        13.5 ++   **                                           xxhash-rcu $$C$$$++
             |                                            qht-fixed-nomru %%D%%% |
          13 ++     A******                                   qht-dyn-mru @@E@@@++
             |             A*****A******A******             qht-dyn-nomru &&F&&& |
        12.5 C$$                               A******A******A*****A******    ***A
          12 ++ $$                                                        A***  ++
             D%%% $$                                                             |
        11.5 ++  %%                                                             ++
             B###  %C$$$$$$                                                      |
          11 ++  ## D%%%%% C$$$$$                                               ++
             |     #      %      C$$$$$$                                         |
        10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
          10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
             +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
         9.5 ++------------+------------+-------------+------------+------------++
             14            16           18            20           22            24
                                    log2 number of buckets
      
      Note that the original point before this patch series is X=15 for "master";
      the little sensitivity to the increased number of buckets is due to the
      poor hashing function in master.
      
      xxhash-rcu has significant overhead due to the constant churn of allocating
      and deallocating intermediate structs for implementing MRU. An alternative
      would be to consider failed lookups as "maybe not there", and then
      acquire the external lock (tb_lock in this case) to really confirm that
      there was indeed a failed lookup. This, however, would not be enough
      to implement dynamic resizing--this is more complex: see
      "Resizable, Scalable, Concurrent Hash Tables via Relativistic
      Programming" by Triplett, McKenney and Walpole. This solution was
      discarded due to the very coarse RCU read critical sections that we have
      in MTTCG; resizing requires waiting for readers after every pointer update,
      and resizes require many pointer updates, so this would quickly become
      prohibitive.
      
      qht-fixed-nomru shows that MRU promotion is advisable for undersized
      hash tables.
      
      However, qht-dyn-mru shows that MRU promotion is not important if the
      hash table is properly sized: there is virtually no difference in
      performance between qht-dyn-nomru and qht-dyn-mru.
      
      Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
      X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
      can achieve with optimum sizing of the hash table, while keeping the hash
      table scalable for readers.
      
      The improvement we get before and after this patch for booting debian jessie
      with arm-softmmu is:
      
      - Intel Xeon E5-2690: 10.5% less time
      - Intel i7-4790K: 5.2% less time
      
      We could get this same improvement _for this particular workload_ by
      statically increasing the size of the hash table. But this would hurt
      workloads that do not need a large hash table. The dynamic (upward)
      resizing allows us to start small and enlarge the hash table as needed.
      
      A quick note on downsizing: the table is resized back to 2**15 buckets
      on every tb_flush; this makes sense because it is not guaranteed that the
      table will reach the same number of TBs later on (e.g. most bootup code is
      thrown away after boot); it makes sense to grow the hash table as
      more code blocks are translated. This also avoids the complication of
      having to build downsizing hysteresis logic into qht.
      Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
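The sizing policy argued for above (read-only lookups, growth on insert, reset to the initial size on flush) can be sketched with a small chained hash table. This is not qht and has no concurrency support; `ToyTable`, `ToyNode` and the `toy_*` functions are hypothetical, the initial size of 4 buckets is chosen only to keep the example small (the message uses 2**15), and the "grow when entries reach the bucket count" rule is an assumption rather than qht's actual trigger.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_INITIAL_BUCKETS 4 /* small for illustration */

typedef struct ToyNode {
    uint32_t hash;
    void *tb;                 /* opaque payload, e.g. a translated block */
    struct ToyNode *next;
} ToyNode;

typedef struct ToyTable {
    ToyNode **buckets;
    size_t n_buckets;         /* always a power of two */
    size_t n_entries;
} ToyTable;

void toy_init(ToyTable *t)
{
    t->n_buckets = TOY_INITIAL_BUCKETS;
    t->n_entries = 0;
    t->buckets = calloc(t->n_buckets, sizeof(*t->buckets));
    if (!t->buckets) {
        abort();
    }
}

/* Lookups only read: there is no MRU promotion on a hit. */
void *toy_lookup(const ToyTable *t, uint32_t hash)
{
    const ToyNode *n = t->buckets[hash & (t->n_buckets - 1)];

    for (; n; n = n->next) {
        if (n->hash == hash) {
            return n->tb;
        }
    }
    return NULL;
}

/* Double the number of buckets and relink every node. */
static void toy_grow(ToyTable *t)
{
    size_t new_n = t->n_buckets * 2;
    ToyNode **nb = calloc(new_n, sizeof(*nb));

    if (!nb) {
        abort();
    }
    for (size_t i = 0; i < t->n_buckets; i++) {
        ToyNode *n = t->buckets[i];
        while (n) {
            ToyNode *next = n->next;
            size_t b = n->hash & (new_n - 1);
            n->next = nb[b];
            nb[b] = n;
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->n_buckets = new_n;
}

void toy_insert(ToyTable *t, uint32_t hash, void *tb)
{
    ToyNode *n;
    size_t b;

    if (t->n_entries >= t->n_buckets) {
        toy_grow(t);          /* keep the load factor at or below 1 */
    }
    n = malloc(sizeof(*n));
    if (!n) {
        abort();
    }
    n->hash = hash;
    n->tb = tb;
    b = hash & (t->n_buckets - 1);
    n->next = t->buckets[b];  /* newest entry goes to the head */
    t->buckets[b] = n;
    t->n_entries++;
}

/* Drop everything and shrink back to the initial size, as on a flush. */
void toy_reset(ToyTable *t)
{
    for (size_t i = 0; i < t->n_buckets; i++) {
        ToyNode *n = t->buckets[i];
        while (n) {
            ToyNode *next = n->next;
            free(n);
            n = next;
        }
    }
    free(t->buckets);
    toy_init(t);
}
```

Because shrinking happens only in `toy_reset()`, the table needs no hysteresis logic: it starts small after each flush and grows again as entries are inserted.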
  13. 09 Jun, 2016 1 commit
  14. 19 May, 2016 2 commits
  15. 13 May, 2016 10 commits
  16. 23 Mar, 2016 2 commits
  17. 21 Jan, 2016 6 commits
  18. 10 Nov, 2015 1 commit