- 16 6月, 2018 14 次提交
-
-
由 Emilio G. Cota 提交于
This applies to both user-mode and !user-mode emulation. Instead of relying on a global lock, protect the list of incoming jumps with tb->jmp_lock. This lock also protects tb->cflags, so update all tb->cflags readers outside tb->jmp_lock to use atomic reads via tb_cflags(). In order to find the destination TB (and therefore its jmp_lock) from the origin TB, we introduce tb->jmp_dest[]. I considered not using a linked list of jumps, which simplifies code and makes the struct smaller. However, it unnecessarily increases memory usage, which results in a performance decrease. See for instance these numbers booting+shutting down debian-arm: Time (s) Rel. err (%) Abs. err (s) Rel. slowdown (%) ------------------------------------------------------------------------------ before 20.88 0.74 0.154512 0. after 20.81 0.38 0.079078 -0.33524904 GTree 21.02 0.28 0.058856 0.67049808 GHashTable + xxhash 21.63 1.08 0.233604 3.5919540 Using a hash table or a binary tree to keep track of the jumps doesn't really pay off, not only due to the increased memory usage, but also because most TBs have only 0 or 1 jumps to them. The maximum number of jumps when booting debian-arm that I measured is 35, but as we can see in the histogram below a TB with that many incoming jumps is extremely rare; the average TB has 0.80 incoming jumps. n_jumps: 379208; avg jumps/tb: 0.801099 dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁ ▁▁▁ ▁|[34.0,35.0] Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Use the recently-gained QHT feature of returning the matching TB if it already exists. This allows us to get rid of the lookup we perform right after acquiring tb_lock. Suggested-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
The appended adds assertions to make sure we do not longjmp with page locks held. Note that user-mode has nothing to check, since page_locks are !user-mode only. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This is only compiled under CONFIG_DEBUG_TCG to avoid bloating the binary. In user-mode, assert_page_locked is equivalent to assert_mmap_lock. Note: There are some tb_lock assertions left that will be removed by later patches. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Suggested-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Groundwork for supporting parallel TCG generation. Instead of using a global lock (tb_lock) to protect changes to pages, use fine-grained, per-page locks in !user-mode. User-mode stays with mmap_lock. Sometimes changes need to happen atomically on more than one page (e.g. when a TB that spans across two pages is added/invalidated, or when a range of pages is invalidated). We therefore introduce struct page_collection, which helps us keep track of a set of pages that have been locked in the appropriate locking order (i.e. by ascending page index). This commit first introduces the structs and the function helpers, to then convert the calling code to use per-page locking. Note that tb_lock is not removed yet. While at it, rename tb_alloc_page to tb_page_add, which pairs with tb_page_remove. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This greatly simplifies next commit's diff. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
So that we pass a same-page range to tb_invalidate_phys_page_range, instead of always passing an end address that could be on a different page. As discussed with Peter Maydell on the list [1], tb_invalidate_phys_page_range doesn't actually do much with 'end', which explains why we have never hit a bug despite going against what the comment on top of tb_invalidate_phys_page_range requires: > * Invalidate all TBs which intersect with the target physical address range > * [start;end[. NOTE: start and end must refer to the *same* physical page. The appended honours the comment, which avoids confusion. While at it, rework the loop into a for loop, which is less error prone (e.g. "continue" won't result in an infinite loop). [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg09165.htmlReviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Groundwork for supporting parallel TCG generation. Move the hole to the end of the struct, so that a u32 field can be added there without bloating the struct. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Groundwork for supporting parallel TCG generation. We never remove entries from the radix tree, so we can use cmpxchg to implement lockless insertions. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This commit does several things, but to avoid churn I merged them all into the same commit. To wit: - Use uintptr_t instead of TranslationBlock * for the list of TBs in a page. Just like we did in (c37e6d7e "tcg: Use uintptr_t type for jmp_list_{next|first} fields of TB"), the rationale is the same: these are tagged pointers, not pointers. So use a more appropriate type. - Only check the least significant bit of the tagged pointers. Masking with 3/~3 is unnecessary and confusing. - Introduce the TB_FOR_EACH_TAGGED macro, and use it to define PAGE_FOR_EACH_TB, which improves readability. Note that TB_FOR_EACH_TAGGED will gain another user in a subsequent patch. - Update tb_page_remove to use PAGE_FOR_EACH_TB. In case there is a bug and we attempt to remove a TB that is not in the list, instead of segfaulting (since the list is NULL-terminated) we will reach g_assert_not_reached(). Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Thereby making it per-TCGContext. Once we remove tb_lock, this will avoid an atomic increment every time a TB is invalidated. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This paves the way for enabling scalable parallel generation of TCG code. Instead of tracking TBs with a single binary search tree (BST), use a BST for each TCG region, protecting it with a lock. This is as scalable as it gets, since each TCG thread operates on a separate region. The core of this change is the introduction of struct tcg_region_tree, which contains a pointer to a GTree and an associated lock to serialize accesses to it. We then allocate an array of tcg_region_tree's, adding the appropriate padding to avoid false sharing based on qemu_dcache_linesize. Given a tc_ptr, we first find the corresponding region_tree. This is done by special-casing the first and last regions first, since they might be of size != region.size; otherwise we just divide the offset by region.stride. I was worried about this division (several dozen cycles of latency), but profiling shows that this is not a fast path. Note that region.stride is not required to be a power of two; it is only required to be a multiple of the host's page size. Note that with this design we can also provide consistent snapshots about all region trees at once; for instance, tcg_tb_foreach acquires/releases all region_tree locks before/after iterating over them. For this reason we now drop tb_lock in dump_exec_info(). As an alternative I considered implementing a concurrent BST, but this can be tricky to get right, offers no consistent snapshots of the BST, and performance and scalability-wise I don't think it could ever beat having separate GTrees, given that our workload is insert-mostly (all concurrent BST designs I've seen focus, understandably, on making lookups fast, which comes at the expense of convoluted, non-wait-free insertions/removals). Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
The meaning of "existing" is now changed to "matches in hash and ht->cmp result". This is saner than just checking the pointer value. Suggested-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
qht_lookup now uses the default cmp function. qht_lookup_custom is defined to retain the old behaviour, that is a cmp function is explicitly provided. qht_insert will gain use of the default cmp in the next patch. Note that we move qht_lookup_custom's @func to be the last argument, which makes the new qht_lookup as simple as possible. Instead of this (i.e. keeping @func 2nd): 0000000000010750 <qht_lookup>: 10750: 89 d1 mov %edx,%ecx 10752: 48 89 f2 mov %rsi,%rdx 10755: 48 8b 77 08 mov 0x8(%rdi),%rsi 10759: e9 22 ff ff ff jmpq 10680 <qht_lookup_custom> 1075e: 66 90 xchg %ax,%ax We get: 0000000000010740 <qht_lookup>: 10740: 48 8b 4f 08 mov 0x8(%rdi),%rcx 10744: e9 37 ff ff ff jmpq 10680 <qht_lookup_custom> 10749: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
- 31 5月, 2018 2 次提交
-
-
由 Peter Maydell 提交于
As part of plumbing MemTxAttrs down to the IOMMU translate method, add MemTxAttrs as an argument to address_space_translate() and address_space_translate_cached(). Callers either have an attrs value to hand, or don't care and can use MEMTXATTRS_UNSPECIFIED. Signed-off-by: NPeter Maydell <peter.maydell@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Message-id: 20180521140402.23318-4-peter.maydell@linaro.org
-
由 Peter Maydell 提交于
As part of plumbing MemTxAttrs down to the IOMMU translate method, add MemTxAttrs as an argument to tb_invalidate_phys_addr(). Its callers either have an attrs value to hand, or don't care and can use MEMTXATTRS_UNSPECIFIED. Signed-off-by: NPeter Maydell <peter.maydell@linaro.org> Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Message-id: 20180521140402.23318-3-peter.maydell@linaro.org
-
- 20 5月, 2018 1 次提交
-
-
由 Laurent Vivier 提交于
Re-run Coccinelle script scripts/coccinelle/return_directly.cocci Signed-off-by: NLaurent Vivier <lvivier@redhat.com> ppc part Acked-by: NDavid Gibson <david@gibson.dropbear.id.au> Signed-off-by: NMichael Tokarev <mjt@tls.msk.ru>
-
- 11 4月, 2018 1 次提交
-
-
由 Pavel Dovgalyuk 提交于
In icount mode, instructions that access io memory spaces in the middle of the translation block invoke TB recompilation. After recompilation, such instructions become last in the TB and are allowed to access io memory spaces. When the code includes instruction like i386 'xchg eax, 0xffffd080' which accesses APIC, QEMU goes into an infinite loop of the recompilation. This instruction includes two memory accesses - one read and one write. After the first access, APIC calls cpu_report_tpr_access, which restores the CPU state to get the current eip. But cpu_restore_state_from_tb resets the cpu->can_do_io flag which makes the second memory access invalid. Therefore the second memory access causes a recompilation of the block. Then these operations repeat again and again. This patch moves resetting cpu->can_do_io flag from cpu_restore_state_from_tb to cpu_loop_exit* functions. It also adds a parameter for cpu_restore_state which controls restoring icount. There is no need to restore icount when we only query CPU state without breaking the TB. Restoring it in such cases leads to the incorrect flow of the virtual time. In most cases new parameter is true (icount should be recalculated). But there are two cases in i386 and openrisc when the CPU state is only queried without the need to break the TB. This patch fixes both of these cases. Signed-off-by: NPavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru> Message-Id: <20180409091320.12504.35329.stgit@pasha-VirtualBox> [rth: Make can_do_io setting unconditional; move from cpu_exec; make cpu_loop_exit_{noexc,restore} call cpu_loop_exit.] Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
- 26 3月, 2018 1 次提交
-
-
由 Richard Henderson 提交于
We have confused the number of instructions that have been executed in the TB with the number of instructions needed to repeat the I/O instruction. We have used cpu_restore_state_from_tb, which means that the guest pc is pointing to the I/O instruction. The only time the answer to the later question is not 1 is when MIPS or SH4 need to re-execute the branch for the delay slot as well. We must rely on cpu->cflags_next_tb to generate the next TB, as otherwise we have a race condition with other guest cpus within the TB cache. Fixes: 0790f868Signed-off-by: NRichard Henderson <richard.henderson@linaro.org> Message-Id: <20180319031545.29359-1-richard.henderson@linaro.org> Tested-by: NPavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 13 3月, 2018 1 次提交
-
-
由 Pavel Dovgalyuk 提交于
cpu_io_recompile() function was broken by the commit 9b990ee5. Instead of regenerating the block starting from PC of the original block, it just set the instruction counter for TCG. In most cases this was unnoticed, but in icount mode there was an exception for incorrect usage of CF_LAST_IO flag. This patch recovers recompilation of the original block and also configures translation for executing single IO instruction which caused a recompilation. Signed-off-by: NPavel Dovgalyuk <pavel.dovgaluk@ispras.ru> Message-Id: <20180227095338.1060.27385.stgit@pasha-VirtualBox> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NPavel Dovgalyuk <Pavel.Dovgaluk@ispras.ru>
-
- 23 1月, 2018 1 次提交
-
-
由 Peter Maydell 提交于
If multiple guest threads in user-mode emulation write to a page which QEMU has marked read-only because of cached TCG translations, the threads can race in page_unprotect: * threads A & B both try to do a write to a page with code in it at the same time (ie which we've made non-writeable, so SEGV) * they race into the signal handler with this faulting address * thread A happens to get to page_unprotect() first and takes the mmap lock, so thread B sits waiting for it to be done * A then finds the page, marks it PAGE_WRITE and mprotect()s it writable * A can then continue OK (returns from signal handler to retry the memory access) * ...but when B gets the mmap lock it finds that the page is already PAGE_WRITE, and so it exits page_unprotect() via the "not due to protected translation" code path, and wrongly delivers the signal to the guest rather than just retrying the access In particular, this meant that trying to run 'javac' in user-mode emulation would fail with a spurious guest SIGSEGV. Handle this by making page_unprotect() assume that a call for a page which is already PAGE_WRITE is due to a race of this sort and return a "fault handled" indication. Since this would cause an infinite loop if we ever called page_unprotect() for some other kind of fault than "write failed due to bad access permissions", tighten the condition in handle_cpu_signal() to check the signal number and si_code, and add a comment so that if somebody does ever find themselves debugging an infinite loop of faults they have some clue about why. (The trick for identifying the correct setting for current_tb_invalidated for thread B (needed to handle the precise-SMC case) is due to Richard Henderson. Paolo Bonzini suggested just relying on si_code rather than trying anything more complicated.) Signed-off-by: NPeter Maydell <peter.maydell@linaro.org> Message-Id: <1511879725-9576-3-git-send-email-peter.maydell@linaro.org> Signed-off-by: NLaurent Vivier <laurent@vivier.eu>
-
- 18 12月, 2017 2 次提交
-
-
由 Philippe Mathieu-Daudé 提交于
exec: housekeeping (funny since 02d0e095) applied using ./scripts/clean-includes Signed-off-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by: NPeter Maydell <peter.maydell@linaro.org> Acked-by: NCornelia Huck <cohuck@redhat.com> Reviewed-by: NAnthony PERARD <anthony.perard@citrix.com> Signed-off-by: NMichael Tokarev <mjt@tls.msk.ru>
-
由 Emilio G. Cota 提交于
Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NMichael Tokarev <mjt@tls.msk.ru>
-
- 13 11月, 2017 1 次提交
-
-
由 Alex Bennée 提交于
We are still seeing signals during translation time when we walk over a page protection boundary. This expands the check to ensure the host PC is inside the code generation buffer. The original suggestion was to check versus tcg_ctx.code_gen_ptr but as we now segment the translation buffer we have to settle for just a general check for being inside. I've also fixed up the declaration to make it clear it can deal with invalid addresses. A later patch will fix up the call sites. Signed-off-by: NAlex Bennée <alex.bennee@linaro.org> Reported-by: NPeter Maydell <peter.maydell@linaro.org> Reviewed-by: NLaurent Vivier <laurent@vivier.eu> Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Message-id: 20171108153245.20740-2-alex.bennee@linaro.org Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Cc: Richard Henderson <rth@twiddle.net> Tested-by: NPeter Maydell <peter.maydell@linaro.org> Signed-off-by: NPeter Maydell <peter.maydell@linaro.org>
-
- 25 10月, 2017 14 次提交
-
-
由 Emilio G. Cota 提交于
Two or more threads might race while invalidating the same TB. We currently do not check for this at all despite taking tb_lock, which means we would wrongly invalidate the same TB more than once. This bug has actually been hit by users: I recently saw a report on IRC, although I have yet to see the corresponding test case. Fix this by using qht_remove as the synchronization point; if it fails, that means the TB has already been invalidated, and therefore there is nothing left to do in tb_phys_invalidate. Note that this solution works now that we still have tb_lock, and will continue working once we remove tb_lock. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Message-Id: <1508445114-4717-1-git-send-email-cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This enables parallel TCG code generation. However, we do not take advantage of it yet since tb_lock is still held during tb_gen_code. In user-mode we use a single TCG context; see the documentation added to tcg_region_init for the rationale. Note that targets do not need any conversion: targets initialize a TCGContext (e.g. defining TCG globals), and after this initialization has finished, the context is cloned by the vCPU threads, each of them keeping a separate copy. TCG threads claim one entry in tcg_ctxs[] by atomically increasing n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's of that variable and tcg_ctxs; they are there just to play nice with analysis tools such as thread sanitizer. Note that we do not allocate an array of contexts (we allocate an array of pointers instead) because when tcg_context_init is called, we do not know yet how many contexts we'll use since the bool behind qemu_tcg_mttcg_enabled() isn't set yet. Previous patches folded some TCG globals into TCGContext. The non-const globals remaining are only set at init time, i.e. before the TCG threads are spawned. Here is a list of these set-at-init-time globals under tcg/: Only written by tcg_context_init: - indirect_reg_alloc_order - tcg_op_defs Only written by tcg_target_init (called from tcg_context_init): - tcg_target_available_regs - tcg_target_call_clobber_regs - arm: arm_arch, use_idiv_instructions - i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt, have_movbe, have_popcnt - mips: use_movnz_instructions, use_mips32_instructions, use_mips32r2_instructions, got_sigill (tcg_target_detect_isa) - ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr - s390: tb_ret_addr, s390_facilities - sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines), use_vis3_instructions Only written by tcg_prologue_init: - 'struct jit_code_entry one_entry' - aarch64: tb_ret_addr - arm: tb_ret_addr - i386: tb_ret_addr, guest_base_flags - ia64: tb_ret_addr - mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This is groundwork for supporting multiple TCG contexts. The naive solution here is to split code_gen_buffer statically among the TCG threads; this however results in poor utilization if translation needs are different across TCG threads. What we do here is to add an extra layer of indirection, assigning regions that act just like pages do in virtual memory allocation. (BTW if you are wondering about the chosen naming, I did not want to use blocks or pages because those are already heavily used in QEMU). We use a global lock to serialize allocations as well as statistics reporting (we now export the size of the used code_gen_buffer with tcg_code_size()). Note that for the allocator we could just use a counter and atomic_inc; however, that would complicate the gathering of tcg_code_size()-like stats. So given that the region operations are not a fast path, a lock seems the most reasonable choice. The effectiveness of this approach is clear after seeing some numbers. I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark. Note that I'm evaluating this after enabling per-thread TCG (which is done by a subsequent commit). * -smp 1, 1 region (entire buffer): qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357 qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363 qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364 qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373 qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373 qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360 qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370 qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367 That is, 8 flushes. * -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]: qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356 qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361 qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361 qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375 qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375 qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360 qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365 qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368 Again, 8 flushes. Note how buffer utilization is not 100%, but it is close. Smaller region sizes would yield higher utilization, but we want region allocation to be rare (it acquires a lock), so we do not want to go too small. * -smp 8, static partitioning of 8 regions (10 MB per region): qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354 qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370 qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365 qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377 qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358 qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367 qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364 qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358 qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362 qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372 qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374 qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376 qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374 qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372 qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359 qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362 qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368 qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378 qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367 qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364 That is, 20 flushes. Note how a static partitioning approach uses the code buffer poorly, leading to many unnecessary flushes. Reviewed-by: NRichard Henderson <richard.henderson@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
The helpers require the address and size to be page-aligned, so do that before calling them. Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This is groundwork for supporting multiple TCG contexts. To avoid scalability issues when profiling info is enabled, this patch makes the profiling info counters distributed via the following changes: 1) Consolidate profile info into its own struct, TCGProfile, which TCGContext also includes. Note that tcg_table_op_count is brought into TCGProfile after dropping the tcg_ prefix. 2) Iterate over the TCG contexts in the system to obtain the total counts. This change also requires updating the accessors to TCGProfile fields to use atomic_read/set whenever there may be conflicting accesses (as defined in C11) to them. Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Groundwork for supporting multiple TCG contexts. The core of this patch is this change to tcg/tcg.h: > -extern TCGContext tcg_ctx; > +extern TCGContext tcg_init_ctx; > +extern TCGContext *tcg_ctx; Note that for now we set *tcg_ctx to whatever TCGContext is passed to tcg_context_init -- in this case &tcg_init_ctx. Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Groundwork for supporting multiple TCG contexts. Reviewed-by: NRichard Henderson <rth@twiddle.net> Reviewed-by: NAlex Bennée <alex.bennee@linaro.org> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Since commit 6e3b2bfd ("tcg: allocate TB structs before the corresponding translated code") we are not fully utilizing code_gen_buffer for translated code, and therefore are incorrectly reporting the amount of translated code as well as the average host TB size. Address this by: - Making the conscious choice of misreporting the total translated code; doing otherwise would mislead users into thinking "-tb-size" is not honoured. - Expanding tb_tree_stats to accurately count the bytes of translated code on the host, and using this for reporting the average tb host size, as well as the expansion ratio. In the future we might want to consider reporting the accurate numbers for the total translated code, together with a "bookkeeping/overhead" field to account for the TB structs. Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
We don't really free anything in this function anymore; we just remove the TB from the binary search tree. Suggested-by: NAlex Bennée <alex.bennee@linaro.org> Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This is a prerequisite for supporting multiple TCG contexts, since we will have threads generating code in separate regions of code_gen_buffer. For this we need a new field (.size) in struct tb_tc to keep track of the size of the translated code. This field uses a size_t to avoid adding a hole to the struct, although really an unsigned int would have been enough. The comparison function we use is optimized for the common case: insertions. Profiling shows that upon booting debian-arm, 98% of comparisons are between existing tb's (i.e. a->size and b->size are both !0), which happens during insertions (and removals, but those are rare). The remaining cases are lookups. From reading the glib sources we see that the first key is always the lookup key. However, the code does not assume this to always be the case because this behaviour is not guaranteed in the glib docs. However, we embed this knowledge in the code as a branch hint for the compiler. Note that tb_free does not free space in the code_gen_buffer anymore, since we cannot easily know whether the tb is the last one inserted in code_gen_buffer. The next patch in this series renames tb_free to tb_remove to reflect this. Performance-wise, lookups in tb_find_pc are the same as before: O(log n). However, insertions are O(log n) instead of O(1), which results in a small slowdown when booting debian-arm: Performance counter stats for 'build/arm-softmmu/qemu-system-arm \ -machine type=virt -nographic -smp 1 -m 4096 \ -netdev user,id=unet,hostfwd=tcp::2222-:22 \ -device virtio-net-device,netdev=unet \ -drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \ -device virtio-blk-device,drive=myblock \ -kernel img/arm/aarch32-current-linux-kernel-only.img \ -append console=ttyAMA0 root=/dev/vda1 \ -name arm,debug-threads=on -smp 1' (10 runs): - Before: 8048.598422 task-clock (msec) # 0.931 CPUs utilized ( +- 0.28% ) 16,974 context-switches # 0.002 M/sec ( +- 0.12% ) 0 cpu-migrations # 0.000 K/sec 10,125 page-faults # 0.001 M/sec ( +- 1.23% ) 35,144,901,879 cycles # 4.367 GHz ( +- 0.14% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 65,758,252,643 instructions # 1.87 insns per cycle ( +- 0.33% ) 10,871,298,668 branches # 1350.707 M/sec ( +- 0.41% ) 192,322,212 branch-misses # 1.77% of all branches ( +- 0.32% ) 8.640869419 seconds time elapsed ( +- 0.57% ) - After: 8146.242027 task-clock (msec) # 0.923 CPUs utilized ( +- 1.23% ) 17,016 context-switches # 0.002 M/sec ( +- 0.40% ) 0 cpu-migrations # 0.000 K/sec 18,769 page-faults # 0.002 M/sec ( +- 0.45% ) 35,660,956,120 cycles # 4.378 GHz ( +- 1.22% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 65,095,366,607 instructions # 1.83 insns per cycle ( +- 1.73% ) 10,803,480,261 branches # 1326.192 M/sec ( +- 1.95% ) 195,601,289 branch-misses # 1.81% of all branches ( +- 0.39% ) 8.828660235 seconds time elapsed ( +- 0.38% ) Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Richard Henderson 提交于
Now that we have curr_cflags, we can include CF_USE_ICOUNT early and then remove it as necessary. Reviewed-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
Thereby decoupling the resulting translated code from the current state of the system. The tb->cflags field is not passed to tcg generation functions. So we add a field to TCGContext, storing there a copy of tb->cflags. Most architectures have <= 32 registers, which results in a 4-byte hole in TCGContext. Use this hole for the new field. Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Richard Henderson 提交于
We were generating code during tb_invalidate_phys_page_range, check_watchpoint, cpu_io_recompile, and (seemingly) discarding the TB, assuming that it would magically be picked up during the next iteration through the cpu_exec loop. Instead, record the desired cflags in CPUState so that we request the proper TB so that there is no more magic. Reviewed-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
由 Emilio G. Cota 提交于
This will enable us to decouple code translation from the value of parallel_cpus at any given time. It will also help us minimize TB flushes when generating code via EXCP_ATOMIC. Note that the declaration of parallel_cpus is brought to exec-all.h to be able to define there the "curr_cflags" inline. Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-
- 16 10月, 2017 1 次提交
-
-
由 Richard Henderson 提交于
Most of the users of page_set_flags offset (page, page + len) as the end points. One might consider this an error, since the other users do supply an endpoint as the last byte of the region. However, the first thing that page_set_flags does is round end UP to the start of the next page. Which means computing page + len - 1 is in the end pointless. Therefore, accept this usage and do not assert when given the exact size of the vm as the endpoint. Signed-off-by: NRichard Henderson <rth@twiddle.net> Reviewed-by: NPhilippe Mathieu-Daudé <f4bug@amsat.org> Message-Id: <20170708025030.15845-2-rth@twiddle.net> Signed-off-by: NRiku Voipio <riku.voipio@linaro.org>
-
- 10 10月, 2017 1 次提交
-
-
由 Emilio G. Cota 提交于
In preparation for adding tc.size to be able to keep track of TB's using the binary search tree implementation from glib. Reviewed-by: NRichard Henderson <rth@twiddle.net> Signed-off-by: NEmilio G. Cota <cota@braap.org> Signed-off-by: NRichard Henderson <richard.henderson@linaro.org>
-