1. 24 Feb 2017 (8 commits)
    • cpu-exec: remove unnecessary check of cpu->exit_request · 55ac0a9b
      Paolo Bonzini authored
      The cpu->exit_request check in cpu_loop_exec_tb is unnecessary,
      because cpu->tcg_exit_req is always set after cpu->exit_request.
      So let the TB exit and we will pick up the exit request later
      in cpu_handle_interrupt.
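
      A minimal sketch of the resulting flow, as a hypothetical stand-alone
      rendering (field and function names follow QEMU's cpu-exec.c of this
      period, from memory):

        #include <stdatomic.h>
        #include <stdbool.h>

        typedef struct CPUState {
            atomic_bool exit_request;  /* set by others to request an exit */
            atomic_uint tcg_exit_req;  /* set right after exit_request */
        } CPUState;

        /* The TB now runs to its exit (tcg_exit_req pops it out), and the
         * pending request is noticed here, not in cpu_loop_exec_tb. */
        static bool cpu_handle_interrupt(CPUState *cpu)
        {
            if (atomic_load(&cpu->exit_request)) {
                atomic_store(&cpu->exit_request, false);
                return true;           /* leave the execution loop */
            }
            return false;
        }
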
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • replay: check icount in cpu exec loop · cfb2d02b
      Pavel Dovgalyuk authored
      This patch adds a check to break out of the CPU loop when icount
      expires without setting the TB_EXIT_ICOUNT_EXPIRED flag. This happens
      when no translated block is available and all the requested
      instructions have been executed. In icount replay mode an unnecessary
      tb_find would otherwise be called (which may cause an exception) and
      execution would become non-deterministic.
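
      A hedged sketch of where such a check sits in the loop; the budget
      field stands in for QEMU's icount_decr.u16.low, and the helpers are
      stubs:

        #include <stdbool.h>
        #include <stdint.h>

        typedef struct CPUState {
            uint16_t icount_budget;    /* stand-in for icount_decr.u16.low */
            bool exit_request;
        } CPUState;

        static void execute_one_tb(CPUState *cpu)
        {
            if (cpu->icount_budget) {
                cpu->icount_budget--;  /* stub: pretend one insn retired */
            }
        }

        static void cpu_loop(CPUState *cpu)
        {
            while (!cpu->exit_request) {
                execute_one_tb(cpu);
                if (cpu->icount_budget == 0) {
                    /* icount expired without TB_EXIT_ICOUNT_EXPIRED:
                     * break instead of calling tb_find again, keeping
                     * replay deterministic. */
                    break;
                }
            }
        }
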
      Because cpu_loop_exec_tb cannot longjmp anymore, we can remove
      the anticipated call to align_clocks in cpu_loop_exec_tb, as
      well as the SyncClocks *sc argument.
      Signed-off-by: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
      Message-Id: <002801d2810f$18809c20$4981d460$@ru>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Pavel Dovgalyuk <dovgaluk@ispras.ru>
    • tcg: handle EXCP_ATOMIC exception for system emulation · 08e73c48
      Pranith Kumar authored
      This patch enables handling atomic code in the guest. This would
      preferably be done in cpu_handle_exception(), but the current
      assumptions regarding when we can execute atomic sections cause a
      deadlock.
      
      The current mechanism discards the flags which were set in atomic
      execution. We ensure they are properly saved by calling the
      cc->cpu_exec_enter/leave() functions around the loop.
      
      As we are running cpu_exec_step_atomic() from the outermost loop we
      need to avoid an abort() when single-stepping over atomic code, since
      the debug exception's longjmp would otherwise land on the sigsetjmp in
      cpu_exec(). We do this by setting a new jmp_env so that the exception
      jumps back here instead.
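
      A sketch of the jmp_env trick described above, assuming (as in QEMU's
      CPUState) a sigjmp_buf member; the stepping helper is a stub:

        #include <setjmp.h>

        typedef struct CPUState {
            sigjmp_buf jmp_env;        /* as in QEMU's CPUState */
        } CPUState;

        static void step_atomic_insn(CPUState *cpu) { (void)cpu; /* stub */ }

        static void cpu_exec_step_atomic(CPUState *cpu)
        {
            if (sigsetjmp(cpu->jmp_env, 0) == 0) {
                step_atomic_insn(cpu); /* runs exclusively of other vCPUs */
            }
            /* A debug exception siglongjmp()s back here instead of into
             * cpu_exec()'s handler, avoiding the abort(). */
        }
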
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      [AJB: tweak title, merge with new patches, add mmap_lock]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      CC: Paolo Bonzini <pbonzini@redhat.com>
    • tcg: enable thread-per-vCPU · 37257942
      Alex Bennée authored
      There are a couple of changes that occur at the same time here:
      
        - introduce a single vCPU qemu_tcg_cpu_thread_fn
      
        One of these is spawned per vCPU with its own Thread and Condition
        variables. qemu_tcg_rr_cpu_thread_fn is the new name for the old
        single threaded function.
      
        - the TLS current_cpu variable is now live for the lifetime of MTTCG
          vCPU threads. This is for future work where async jobs need to know
          the vCPU context they are operating in.
      
      The user can now switch on multi-threaded behaviour and spawn a
      thread per vCPU. A simple kvm-unit-test like:
      
        ./arm/run ./arm/locking-test.flat -smp 4 -accel tcg,thread=multi
      
      will now use 4 vCPU threads and produce an expected FAIL (instead of
      the unexpected PASS), as the default mode of the test has no
      protection when incrementing a shared variable.
      
      We enable the parallel_cpus flag to ensure we generate correct barrier
      and atomic code if supported by the front and backends. This doesn't
      automatically enable MTTCG until default_mttcg_enabled() is updated to
      check the configuration is supported.
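
      A minimal sketch of the thread-per-vCPU shape using plain pthreads;
      QEMU itself uses qemu_thread_create() plus per-CPU halt condition
      variables, so the names below are illustrative:

        #include <pthread.h>

        typedef struct CPUState {
            pthread_t thread;
            int cpu_index;
        } CPUState;

        /* TLS pointer, now live for the whole lifetime of the thread. */
        static __thread CPUState *current_cpu;

        static void *qemu_tcg_cpu_thread_fn(void *arg)
        {
            CPUState *cpu = arg;
            current_cpu = cpu;         /* async jobs can find their vCPU */
            /* ... per-vCPU execution loop ... */
            return NULL;
        }

        static void spawn_vcpu_thread(CPUState *cpu)
        {
            pthread_create(&cpu->thread, NULL, qemu_tcg_cpu_thread_fn, cpu);
        }
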
      Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [AJB: Some fixes, conditionally, commit rewording]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
    • tcg: remove global exit_request · e5143e30
      Alex Bennée authored
      There are now only two uses of the global exit_request left.
      
      The first ensures we exit the run_loop when we first start to process
      pending work and in the kick handler. This is just as easily done by
      setting the first_cpu->exit_request flag.
      
      The second use is in the round-robin kick routine. The global
      exit_request ensured every vCPU would set its local exit_request and
      cause a full exit of the loop. Now that the iothread lock is no longer
      held while running, we can simply rely on the kick handler to push us
      out as intended.
      
      We lightly re-factor the main vCPU thread to ensure a
      cpu->exit_request causes us to exit the main loop and process any IO
      requests that might come along. As a cpu->exit_request may
      legitimately get squashed while processing the EXCP_INTERRUPT
      exception, we also check cpu->queued_work_first to ensure queued work
      is expedited as soon as possible.
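
      A sketch of the replacement for the first use, with C11 atomics
      standing in for QEMU's atomic_set/atomic_read and the structure
      simplified:

        #include <stdatomic.h>
        #include <stdbool.h>

        typedef struct CPUState {
            atomic_bool exit_request;
        } CPUState;

        static CPUState cpus[4];
        static CPUState *first_cpu = &cpus[0];

        /* Instead of raising a global exit_request, flag the first vCPU;
         * the run loop checks it when it starts processing work. */
        static void request_run_loop_exit(void)
        {
            atomic_store(&first_cpu->exit_request, true);
        }
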
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
    • tcg: drop global lock during TCG code execution · 8d04fb55
      Jan Kiszka authored
      This finally allows TCG to benefit from the iothread introduction: Drop
      the global mutex while running pure TCG CPU code. Reacquire the lock
      when entering MMIO or PIO emulation, or when leaving the TCG loop.
      
      We have to revert a few optimizations for the current TCG threading
      model, namely kicking the TCG thread in qemu_mutex_lock_iothread and
      not kicking it in qemu_cpu_kick. We also need to disable RAM block
      reordering until we have a more efficient locking mechanism at hand.
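
      A hedged sketch of the resulting locking pattern; the lock/unlock
      entry points are QEMU's real ones, the wrapper is simplified:

        typedef struct CPUState CPUState;

        void qemu_mutex_lock_iothread(void);   /* QEMU's BQL helpers */
        void qemu_mutex_unlock_iothread(void);
        int cpu_exec(CPUState *cpu);

        static int tcg_cpu_exec(CPUState *cpu)
        {
            int ret;
            qemu_mutex_unlock_iothread();      /* pure TCG runs lock-free */
            ret = cpu_exec(cpu);
            qemu_mutex_lock_iothread();        /* back under the BQL for IO */
            return ret;
        }
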
      
      Still, a Linux x86 UP guest and my Musicpal ARM model boot fine here.
      These numbers demonstrate where we gain something:
      
      20338 jan       20   0  331m  75m 6904 R   99  0.9   0:50.95 qemu-system-arm
      20337 jan       20   0  331m  75m 6904 S   20  0.9   0:26.50 qemu-system-arm
      
      The guest CPU was fully loaded, but the iothread could still run
      mostly independently on a second core. Without the patch we don't get
      beyond:
      
      32206 jan       20   0  330m  73m 7036 R   82  0.9   1:06.00 qemu-system-arm
      32204 jan       20   0  330m  73m 7036 S   21  0.9   0:17.03 qemu-system-arm
      
      We don't benefit significantly, though, when the guest is not fully
      loading a host CPU.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Message-Id: <1439220437-23957-10-git-send-email-fred.konrad@greensocs.com>
      [FK: Rebase, fix qemu_devices_reset deadlock, rm address_space_* mutex]
      Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
      [EGC: fixed iothread lock for cpu-exec IRQ handling]
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      [AJB: -smp single-threaded fix, clean commit msg, BQL fixes]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
      [PM: target-arm changes]
      Acked-by: Peter Maydell <peter.maydell@linaro.org>
    • tcg: rename tcg_current_cpu to tcg_current_rr_cpu · 791158d9
      Alex Bennée authored
      ...and make the definition local to cpus.c. In preparation for MTTCG
      the concept of a global tcg_current_cpu will no longer make sense;
      however, we still need to keep track of it in the single-threaded case
      to be able to exit quickly when required.
      
      qemu_cpu_kick_no_halt() moves and becomes qemu_cpu_kick_rr_cpu() to
      emphasise its use-case. qemu_cpu_kick now kicks the relevant vCPU as
      well as calling qemu_cpu_kick_rr_cpu(), which will become a no-op
      under MTTCG.
      
      For the time being the setting of the global exit_request remains.
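
      A reconstruction from memory of the renamed kick helper (treat the
      details as illustrative); the loop re-reads the pointer in case the
      round-robin scheduler moved on mid-kick:

        #include <stdatomic.h>

        typedef struct CPUState CPUState;
        void cpu_exit(CPUState *cpu);          /* QEMU: request loop exit */

        static CPUState *_Atomic tcg_current_rr_cpu;  /* local to cpus.c */

        static void qemu_cpu_kick_rr_cpu(void)
        {
            CPUState *cpu;
            do {
                cpu = atomic_load(&tcg_current_rr_cpu);
                if (cpu) {
                    cpu_exit(cpu);
                }
            } while (cpu != atomic_load(&tcg_current_rr_cpu));
        }
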
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
    • mttcg: Add missing tb_lock/unlock() in cpu_exec_step() · 4ec66704
      Pranith Kumar authored
      A recent patch enabling lock assertions uncovered missing
      tb_lock/unlock calls in cpu_exec_step(). This patch adds them.
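
      A sketch of the shape of the fix; tb_lock()/tb_unlock() are QEMU's
      names, the generation step is a stub:

        void tb_lock(void);                    /* QEMU's translation lock */
        void tb_unlock(void);

        typedef struct TranslationBlock TranslationBlock;
        TranslationBlock *generate_tb(void);   /* stub for tb_gen_code() */

        static TranslationBlock *cpu_exec_step_gen(void)
        {
            TranslationBlock *tb;
            tb_lock();                         /* codegen touches shared state */
            tb = generate_tb();
            tb_unlock();
            return tb;
        }
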
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
  2. 22 Feb 2017 (1 commit)
  3. 17 Feb 2017 (1 commit)
  4. 16 Feb 2017 (5 commits)
  5. 01 Feb 2017 (1 commit)
  6. 28 Jan 2017 (1 commit)
  7. 02 Nov 2016 (1 commit)
    • log: Add locking to large logging blocks · 1ee73216
      Richard Henderson authored
      Reuse the existing locking provided by stdio to keep in_asm, cpu,
      op, op_opt, op_ind, and out_asm as contiguous blocks.
      
      While it isn't possible to interleave e.g. in_asm or op_opt logs
      because of the TB lock protecting all code generation, it is
      possible to interleave cpu logs, or to interleave a cpu dump with
      an out_asm dump.
      
      For mingw32, we appear to have no viable solution for this.  The locking
      functions are not properly exported from the system runtime library.
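
      A minimal sketch of the approach, assuming POSIX
      flockfile()/funlockfile() (the stdio locking that mingw32 does not
      usably export):

        #include <stdio.h>

        /* Hold the stream's built-in lock across a multi-line dump so
         * concurrent threads cannot interleave their own blocks. */
        static void log_contiguous(FILE *f, const char *tag, const char *body)
        {
            flockfile(f);
            fprintf(f, "---- %s ----\n%s\n", tag, body);
            funlockfile(f);
        }
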
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
  8. 31 Oct 2016 (2 commits)
  9. 26 Oct 2016 (2 commits)
  10. 04 Oct 2016 (1 commit)
  11. 27 Sep 2016 (1 commit)
  12. 16 Sep 2016 (1 commit)
  13. 14 Sep 2016 (8 commits)
  14. 17 Jul 2016 (1 commit)
  15. 12 Jun 2016 (2 commits)
    • tb hash: track translated blocks with qht · 909eaac9
      Emilio G. Cota authored
      Having a fixed-size hash table for keeping track of all translation blocks
      is suboptimal: some workloads are just too big or too small to get maximum
      performance from the hash table. The MRU promotion policy helps improve
      performance when the hash table is a little undersized, but it cannot
      make up for severely undersized hash tables.
      
      Furthermore, frequent MRU promotions result in writes that are a scalability
      bottleneck. For scalability, lookups should only perform reads, not writes.
      This is not a big deal for now, but it will become one once MTTCG matures.
      
      The appended patch fixes these issues by using qht as the
      implementation of the TB hash table. This solution is superior to the
      other alternatives considered, namely:
      
      - master: implementation in QEMU before this patchset
      - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
      - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
                    MRU is implemented here by adding an intermediate struct
                    that contains the u32 hash and a pointer to the TB; this
                    allows us, on an MRU promotion, to copy said struct (that is not
                    at the head), and put this new copy at the head. After a grace
                    period, the original non-head struct can be eliminated, and
                    after another grace period, freed.
      - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                         no MRU for lookups; MRU for inserts.
      The appended solution is the following:
      - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                       no MRU for lookups; MRU for inserts.
      
      The plots below compare the considered solutions. The Y axis shows the
      boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
      sweeps the number of buckets (or initial number of buckets for qht-autoresize).
      The plots in PNG format (and with errorbars) can be seen here:
        http://imgur.com/a/Awgnq
      
      Each test runs 5 times, and the entire QEMU process is pinned to a
      single core for repeatability of results.
      
                                  Host: Intel Xeon E5-2690
      
        28 ++------------+-------------+-------------+-------------+------------++
           A*****        +             +             +             master **A*** +
        27 ++    *                                                 xxhash ##B###++
           |      A******A******                               xxhash-rcu $$C$$$ |
        26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
           D%%$$                              A******A******A*qht-dyn-mru A*E****A
        25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
           B#####%                                                               |
        24 ++    #C$$$$$                                                        ++
           |      B###  $                                                        |
           |          ## C$$$$$$                                                 |
        23 ++           #       C$$$$$$                                         ++
           |             B######       C$$$$$$                                %%%D
        22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
           |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
        21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
           +             E@@@   F&&&   +      E@     +      F&&&   +             +
        20 ++------------+-------------+-------------+-------------+------------++
           14            16            18            20            22            24
                                   log2 number of buckets
      
                                       Host: Intel i7-4790K
      
        14.5 ++------------+------------+-------------+------------+------------++
             A**           +            +             +            master **A*** +
          14 ++ **                                                 xxhash ##B###++
        13.5 ++   **                                           xxhash-rcu $$C$$$++
             |                                            qht-fixed-nomru %%D%%% |
          13 ++     A******                                   qht-dyn-mru @@E@@@++
             |             A*****A******A******             qht-dyn-nomru &&F&&& |
        12.5 C$$                               A******A******A*****A******    ***A
          12 ++ $$                                                        A***  ++
             D%%% $$                                                             |
        11.5 ++  %%                                                             ++
             B###  %C$$$$$$                                                      |
          11 ++  ## D%%%%% C$$$$$                                               ++
             |     #      %      C$$$$$$                                         |
        10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
          10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
             +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
         9.5 ++------------+------------+-------------+------------+------------++
             14            16           18            20           22            24
                                    log2 number of buckets
      
      Note that the original point before this patch series is X=15 for "master";
      the little sensitivity to the increased number of buckets is due to the
      poor hashing function in master.
      
      xxhash-rcu has significant overhead due to the constant churn of allocating
      and deallocating intermediate structs for implementing MRU. An alternative
      would be to consider failed lookups as "maybe not there", and then
      acquire the external lock (tb_lock in this case) to really confirm that
      there was indeed a failed lookup. This, however, would not be enough
      to implement dynamic resizing--this is more complex: see
      "Resizable, Scalable, Concurrent Hash Tables via Relativistic
      Programming" by Triplett, McKenney and Walpole. This solution was
      discarded due to the very coarse RCU read critical sections that we have
      in MTTCG; resizing requires waiting for readers after every pointer update,
      and resizes require many pointer updates, so this would quickly become
      prohibitive.
      
      qht-fixed-nomru shows that MRU promotion is advisable for undersized
      hash tables.
      
      However, qht-dyn-mru shows that MRU promotion is not important if the
      hash table is properly sized: there is virtually no difference in
      performance between qht-dyn-nomru and qht-dyn-mru.
      
      Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
      X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
      can achieve with optimum sizing of the hash table, while keeping the hash
      table scalable for readers.
      
      The improvement we get before and after this patch for booting debian jessie
      with arm-softmmu is:
      
      - Intel Xeon E5-2690: 10.5% less time
      - Intel i7-4790K: 5.2% less time
      
      We could get this same improvement _for this particular workload_ by
      statically increasing the size of the hash table. But this would hurt
      workloads that do not need a large hash table. The dynamic (upward)
      resizing allows us to start small and enlarge the hash table as needed.
      
      A quick note on downsizing: the table is resized back to 2**15 buckets
      on every tb_flush; this makes sense because it is not guaranteed that the
      table will reach the same number of TBs later on (e.g. most bootup code is
      thrown away after boot); it makes sense to grow the hash table as
      more code blocks are translated. This also avoids the complication of
      having to build downsizing hysteresis logic into qht.
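
      A hedged usage sketch of the qht interface this series introduces;
      the signatures are reproduced from memory and may differ in detail
      from the real include/qemu/qht.h:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        struct qht;                            /* opaque here */
        typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);

        void  qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
        void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
                         const void *userp, uint32_t hash);
        bool  qht_insert(struct qht *ht, void *p, uint32_t hash);

        static bool tb_cmp(const void *obj, const void *userp)
        {
            return obj == userp;               /* stub comparison */
        }

        /* Lookups stay read-only (no MRU writes), which is what keeps the
         * table scalable for readers; resizing happens on the insert side. */
        static void *tb_find_or_add(struct qht *ht, void *tb, uint32_t hash)
        {
            void *found = qht_lookup(ht, tb_cmp, tb, hash);
            if (!found) {
                qht_insert(ht, tb, hash);
                found = tb;
            }
            return found;
        }
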
      Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
    • tb hash: hash phys_pc, pc, and flags with xxhash · 42bd3228
      Emilio G. Cota authored
      For some workloads such as arm bootup, tb_phys_hash is
      performance-critical. This is due to the high frequency of accesses
      to the hash table, caused by (frequent) TLB flushes that wipe out the
      cpu-private tb_jmp_caches.
      More info:
        https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
      
      To dig further into this I modified an arm image booting debian jessie to
      immediately shut down after boot. Analysis revealed that quite a bit of time
      is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
      results in very uneven loading of chains in the hash table's buckets;
      the longest observed chain had ~550 elements.
      
      The appended patch addresses this with two changes:
      
      1) Use xxhash as the hash table's hash function. xxhash is a fast,
         high-quality hashing function.
      
      2) Feed the hashing function with not just tb_phys, but also pc and flags.
      
      This improves performance over using just tb_phys for hashing, since
      that resulted in some hash buckets having many TBs while others got
      very few; with these changes, the longest observed chain in a single
      hash bucket is brought down from ~550 to ~40.
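
      A sketch of the idea of hashing all three identifying fields; QEMU's
      real implementation is an xxhash-derived tb_hash_func5(), and the
      mixer below is only a labeled stand-in, not the real xxhash:

        #include <stdint.h>

        /* Placeholder avalanche step; NOT the real xxhash round. */
        static uint32_t mix32(uint32_t h, uint32_t v)
        {
            h ^= v * 0x9e3779b1u;
            h = (h << 13) | (h >> 19);
            return h * 5u + 0xe6546b64u;
        }

        static uint32_t tb_hash(uint64_t phys_pc, uint64_t pc, uint32_t flags)
        {
            uint32_t h = 1u;
            h = mix32(h, (uint32_t)phys_pc);
            h = mix32(h, (uint32_t)(phys_pc >> 32));
            h = mix32(h, (uint32_t)pc);
            h = mix32(h, (uint32_t)(pc >> 32));
            h = mix32(h, flags);               /* pc and flags now matter too */
            return h;
        }
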
      
      Tests show that the other element checked for in tb_find_physical,
      cs_base, is always a match when tb_phys+pc+flags are a match,
      so hashing cs_base is wasteful. It could be that this is an ARM-only
      thing, though. UPDATE:
      On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
      > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
      > consisting of only a delay slot).
      > It may well still turn out to be reasonable to ignore cs_base for hashing.
      
      BTW, after this change the hash table should not be called
      "tb_phys_hash" anymore; this is addressed later in this series.
      
      This change gives consistent bootup time improvements. I tested two
      host machines:
      - Intel Xeon E5-2690: 11.6% less time
      - Intel i7-4790K: 19.2% less time
      
      Increasing the number of hash buckets yields further improvements. However,
      using a larger, fixed number of buckets can degrade performance for other
      workloads that do not translate as many blocks (600K+ for debian-jessie arm
      bootup). This is dealt with later in this series.
      Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
      Reviewed-by: Richard Henderson <rth@twiddle.net>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
      Signed-off-by: Richard Henderson <rth@twiddle.net>
  16. 26 May 2016 (1 commit)
  17. 19 May 2016 (1 commit)
  18. 13 May 2016 (2 commits)