- 30 6月, 2017 2 次提交
-
-
由 Andy Lutomirski 提交于
The comment describes the old explicit IPI-based flush logic, which is long gone. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/55e44997e56086528140c5180f8337dc53fb7ffc.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
It was historically possible to have two concurrent TLB flushes targetting the same CPU: one initiated locally and one initiated remotely. This can now cause an OOPS in leave_mm() at arch/x86/mm/tlb.c:47: if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); with this call trace: flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline] flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317 Without reentrancy, this OOPS is impossible: leave_mm() is only called if we're not in TLBSTATE_OK, but then we're unexpectedly in TLBSTATE_OK in leave_mm(). This can be caused by flush_tlb_func_remote() happening between the two checks and calling leave_mm(), resulting in two consecutive leave_mm() calls on the same CPU with no intervening switch_mm() calls. We never saw this OOPS before because the old leave_mm() implementation didn't put us back in TLBSTATE_OK, so the assertion didn't fire. Nadav noticed the reentrancy issue in a different context, but neither of us realized that it caused a problem yet. Reported-by: NLevin, Alexander (Sasha Levin) <alexander.levin@verizon.com> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Fixes: 3d28ebce ("x86/mm: Rework lazy TLB to track the actual loaded mm") Link: http://lkml.kernel.org/r/855acf733268d521c9f2e191faee2dcc23a29729.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 22 6月, 2017 1 次提交
-
-
由 Andy Lutomirski 提交于
Originally, Linux reloaded the LDT whenever the prev mm or the next mm had an LDT. It was changed in 2002 in: 0bbed3beb4f2 ("[PATCH] Thread-Local Storage (TLS) support") (commit from the historical tree), like this: - /* load_LDT, if either the previous or next thread - * has a non-default LDT. + /* + * load the LDT, if the LDT is different: */ - if (next->context.size+prev->context.size) + if (unlikely(prev->context.ldt != next->context.ldt)) load_LDT(&next->context); The current code is unlikely to avoid any LDT reloads, since different mms won't share an LDT. When we redo lazy mode to stop flush IPIs without switching to init_mm, though, the current logic would become incorrect: it will be possible to have real_prev == next but nonetheless have a stale LDT descriptor. Simplify the code to update LDTR if either the previous or the next mm has an LDT, i.e. effectively restore the historical logic.. While we're at it, clean up the code by moving all the ifdeffery to a header where it belongs. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NBorislav Petkov <bp@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2a859ac01245f9594c58f9d0a8b2ed8a7cd2507e.1498022414.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 6月, 2017 7 次提交
-
-
由 Andy Lutomirski 提交于
Nadav pointed out that some code used PAGE_SIZE and other code used PAGE_SHIFT. Use PAGE_SHIFT instead of multiplying or dividing by PAGE_SIZE. Requested-by: NNadav Amit <nadav.amit@gmail.com> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Lazy TLB state is currently managed in a rather baroque manner. AFAICT, there are three possible states: - Non-lazy. This means that we're running a user thread or a kernel thread that has called use_mm(). current->mm == current->active_mm == cpu_tlbstate.active_mm and cpu_tlbstate.state == TLBSTATE_OK. - Lazy with user mm. We're running a kernel thread without an mm and we're borrowing an mm_struct. We have current->mm == NULL, current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0). The current cpu is set in mm_cpumask(current->active_mm). CR3 points to current->active_mm->pgd. The TLB is up to date. - Lazy with init_mm. This happens when we call leave_mm(). We have current->mm == NULL, current->active_mm == cpu_tlbstate.active_mm, but that mm is only relelvant insofar as the scheduler is tracking it for refcounting. cpu_tlbstate.state != TLBSTATE_OK. The current cpu is clear in mm_cpumask(current->active_mm). CR3 points to swapper_pg_dir, i.e. init_mm->pgd. This patch simplifies the situation. Other than perf, x86 stops caring about current->active_mm at all. We have cpu_tlbstate.loaded_mm pointing to the mm that CR3 references. The TLB is always up to date for that mm. leave_mm() just switches us to init_mm. There are no longer any special cases for mm_cpumask, and switch_mm() switches mms without worrying about laziness. After this patch, cpu_tlbstate.state serves only to tell the TLB flush code whether it may switch to init_mm instead of doing a normal flush. This makes fairly extensive changes to xen_exit_mmap(), which used to look a bit like black magic. Perf is unchanged. With or without this change, perf may behave a bit erratically if it tries to read user memory in kernel thread context. We should build on this patch to teach perf to never look at user memory when cpu_tlbstate.loaded_mm != current->mm. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The UP asm/tlbflush.h generates somewhat nicer code than the SMP version. Aside from that, it's fallen quite a bit behind the SMP code: - flush_tlb_mm_range() didn't flush individual pages if the range was small. - The lazy TLB code was much weaker. This usually wouldn't matter, but, if a kernel thread flushed its lazy "active_mm" more than once (due to reclaim or similar), it wouldn't be unlazied and would instead pointlessly flush repeatedly. - Tracepoints were missing. Aside from that, simply having the UP code around was a maintanence burden, since it means that any change to the TLB flush code had to make sure not to break it. Simplify everything by deleting the UP code. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Now there's only one copy of the local tlb flush logic for non-kernel pages on SMP kernels. The only functional change is that arch_tlbbatch_flush() will now leave_mm() on the local CPU if that CPU is in the batch and is in TLBSTATE_LAZY mode. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The local flush path is very similar to the remote flush path. Merge them. This is intended to make no difference to behavior whatsoever. It removes some code and will make future changes to the flushing mechanics simpler. This patch does remove one small optimization: flush_tlb_mm_range() now has an unconditional smp_mb() instead of using MOV to CR3 or INVLPG as a full barrier when applicable. I think this is okay for a few reasons. First, smp_mb() is quite cheap compared to the cost of a TLB flush. Second, this rearrangement makes a bigger optimization available: with some work on the SMP function call code, we could do the local and remote flushes in parallel. Third, I'm planning a rework of the TLB flush algorithm that will require an atomic operation at the beginning of each flush, and that operation will replace the smp_mb(). Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY. For a local flush_tlb_mm_range(), we leave_mm() if !current->mm. These are approximately the same condition -- the scheduler sets lazy TLB mode when switching to a thread with no mm. I'm about to merge the local and remote flush code, but for ease of verifying and bisecting the patch, I want the local and remote flush behavior to match first. This patch changes the local code to match the remote code. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Rather than passing all the contents of flush_tlb_info to flush_tlb_others(), pass a pointer to the structure directly. For consistency, this also removes the unnecessary cpu parameter from uv_flush_tlb_others() to make its signature match the other *flush_tlb_others() functions. This serves two purposes: - It will dramatically simplify future patches that change struct flush_tlb_info, which I'm planning to do. - struct flush_tlb_info is an adequate description of what to do for a local flush, too, so by reusing it we can remove duplicated code between local and remove flushes in a future patch. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org [ Fix build warning. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 5月, 2017 3 次提交
-
-
由 Andy Lutomirski 提交于
try_to_unmap_flush() used to open-code a rather x86-centric flush sequence: local_flush_tlb() + flush_tlb_others(). Rearrange the code so that the arch (only x86 for now) provides arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code calls those functions instead. I'll want this for x86 because, to enable address space ids, I can't support the flush_tlb_others() mode used by exising try_to_unmap_flush() implementation with good performance. I can support the new API fairly easily, though. I imagine that other architectures may be in a similar position. Architectures with strong remote flush primitives (arm64?) may have even worse performance problems with flush_tlb_others() the way that try_to_unmap_flush() uses it. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The leave_mm() case can just exit the function early so we don't need to indent the entire remainder of the function. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/97901ddcc9821d7bc7b296d2918d1179f08aaf22.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
flush_tlb_page() was very similar to flush_tlb_mm_range() except that it had a couple of issues: - It was missing an smp_mb() in the case where current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit) - It was missing tracepoints and vm counter updates. The only reason that I can see for keeping it at as a separate function is that it could avoid a few branches that flush_tlb_mm_range() needs to decide to flush just one page. This hardly seems worthwhile. If we decide we want to get rid of those branches again, a better way would be to introduce an __flush_tlb_mm_range() helper and make both flush_tlb_page() and flush_tlb_mm_range() use it. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 26 4月, 2017 3 次提交
-
-
由 Andy Lutomirski 提交于
flush_tlb_page() passes a bogus range to flush_tlb_others() and expects the latter to fix it up. native_flush_tlb_others() has the fixup but Xen's version doesn't. Move the fixup to flush_tlb_others(). AFAICS the only real effect is that, without this fix, Xen would flush everything instead of just the one page on remote vCPUs in when flush_tlb_page() was called. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: e7b52ffd ("x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range") Link: http://lkml.kernel.org/r/10ed0e4dfea64daef10b87fb85df1746999b4dba.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
I'm about to rewrite the function almost completely, but first I want to get a functional change out of the way. Currently, if flush_tlb_mm_range() does not flush the local TLB at all, it will never do individual page flushes on remote CPUs. This seems to be an accident, and preserving it will be awkward. Let's change it first so that any regressions in the rewrite will be easier to bisect and so that the rewrite can attempt to change no visible behavior at all. The fix is simple: we can simply avoid short-circuiting the calculation of base_pages_to_flush. As a side effect, this also eliminates a potential corner case: if tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range() could have ended up flushing the entire address space one page at a time. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NDave Hansen <dave.hansen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
I was trying to figure out what how flush_tlb_current_task() would possibly work correctly if current->mm != current->active_mm, but I realized I could spare myself the effort: it has no callers except the unused flush_tlb() macro. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 8月, 2016 1 次提交
-
-
由 Andy Lutomirski 提交于
This allows x86_64 kernels to enable vmapped stacks by setting HAVE_ARCH_VMAP_STACK=y - which enables the CONFIG_VMAP_STACK=y high level Kconfig option. There are a couple of interesting bits: First, x86 lazily faults in top-level paging entries for the vmalloc area. This won't work if we get a page fault while trying to access the stack: the CPU will promote it to a double-fault and we'll die. To avoid this problem, probe the new stack when switching stacks and forcibly populate the pgd entry for the stack when switching mms. Second, once we have guard pages around the stack, we'll want to detect and handle stack overflow. I didn't enable it on x86_32. We'd need to rework the double-fault code a bit and I'm concerned about running out of vmalloc virtual addresses under some workloads. This patch, by itself, will behave somewhat erratically when the stack overflows while RSP is still more than a few tens of bytes above the bottom of the stack. Specifically, we'll get #PF and make it to no_context and them oops without reliably triggering a double-fault, and no_context doesn't know about stack overflows. The next patch will improve that case. Thank you to Nadav and Brian for helping me pay enough attention to the SDM to hopefully get this right. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/c88f3e2920b18e6cc621d772a04a62c06869037e.1470907718.git.luto@kernel.org [ Minor edits. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 7月, 2016 1 次提交
-
-
由 Paul Gortmaker 提交于
Historically a lot of these existed because we did not have a distinction between what was modular code and what was providing support to modules via EXPORT_SYMBOL and friends. That changed when we forked out support for the latter into the export.h file. This means we should be able to reduce the usage of module.h in code that is obj-y Makefile or bool Kconfig. The advantage in doing so is that module.h itself sources about 15 other headers; adding significantly to what we feed cpp, and it can obscure what headers we are effectively using. Since module.h was the source for init.h (for __init) and for export.h (for EXPORT_SYMBOL) we consider each obj-y/bool instance for the presence of either and replace accordingly where needed. Note that some bool/obj-y instances remain since module.h is the header for some exception table entry stuff, and for things like __init_or_module (code that is tossed when MODULES=n). Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20160714001901.31603-3-paul.gortmaker@windriver.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 28 4月, 2016 3 次提交
-
-
由 Andy Lutomirski 提交于
Potential races between switch_mm() and TLB-flush or LDT-flush IPIs could be very messy. AFAICT the code is currently okay, whether by accident or by careful design, but enabling PCID will make it considerably more complicated and will no longer be obviously safe. Fix it with a big hammer: run switch_mm() with IRQs off. To avoid a performance hit in the scheduler, we take advantage of our knowledge that the scheduler already has IRQs disabled when it calls switch_mm(). Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
It's fairly large and it has quite a few callers. This may also help untangle some headers down the road. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/54f3367803e7f80b2be62c8a21879aa74b1a5f57.1461688545.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Currently all of the functions that live in tlb.c are inlined on !SMP builds. One can debate whether this is a good idea (in many respects the code in tlb.c is better than the inlined UP code). Regardless, I want to add code that needs to be built on UP and SMP kernels and relates to tlb flushing, so arrange for tlb.c to be compiled unconditionally. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f0d778f0d828fc46e5d1946bca80f0aaf9abf032.1461688545.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 02 4月, 2016 2 次提交
-
-
由 Nadav Amit 提交于
The recently introduced batched invalidations mechanism uses its own mechanism for shootdown. However, it does wrong accounting of interrupts (e.g., inc_irq_stat is called for local invalidations), trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and may break some platforms as it bypasses the invalidation mechanisms of Xen and SGI UV. This patch reuses the existing TLB flushing mechnaisms instead. We use NULL as mm to indicate a global invalidation is required. Fixes 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages") Signed-off-by: NNadav Amit <namit@vmware.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nadav Amit 提交于
TLB_REMOTE_SEND_IPI was recently introduced, but it counts bytes instead of pages. In addition, it does not report correctly the case in which flush_tlb_page flushes a page. Fix it to be consistent with other TLB counters. Fixes: 5b74283a ("x86, mm: trace when an IPI is about to be sent") Signed-off-by: NNadav Amit <namit@vmware.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 1月, 2016 1 次提交
-
-
由 Andy Lutomirski 提交于
When switch_mm() activates a new PGD, it also sets a bit that tells other CPUs that the PGD is in use so that TLB flush IPIs will be sent. In order for that to work correctly, the bit needs to be visible prior to loading the PGD and therefore starting to fill the local TLB. Document all the barriers that make this work correctly and add a couple that were missing. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 9月, 2015 1 次提交
-
-
由 Mel Gorman 提交于
When unmapping pages it is necessary to flush the TLB. If that page was accessed by another CPU then an IPI is used to flush the remote CPU. That is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second. There already is a window between when a page is unmapped and when it is TLB flushed. This series increases the window so multiple pages can be flushed using a single IPI. This should be safe or the kernel is hosed already. Patch 1 simply made the rest of the series easier to write as ftrace could identify all the senders of TLB flush IPIS. Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI to flush the entire TLB. Patch 3 tracks when there potentially are writable TLB entries that need to be batched differently Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes The performance impact is documented in the changelogs but in the optimistic case on a 4-socket machine the full series reduces interrupts from 900K interrupts/second to 60K interrupts/second. This patch (of 4): It is easy to trace when an IPI is received to flush a TLB but harder to detect what event sent it. This patch makes it easy to identify the source of IPIs being transmitted for TLB flushes on x86. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NDave Hansen <dave.hansen@intel.com> Acked-by: NIngo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 7月, 2015 1 次提交
-
-
由 Dave Hansen 提交于
flush_tlb_info->flush_start/end are both normal virtual addresses. When calculating 'nr_pages' (only used for the tracepoint), I neglected to put parenthesis in. Thanks to David Koufaty for pointing this out. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@sr71.net Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20150720230153.9E834081@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 2月, 2015 1 次提交
-
-
由 Andy Lutomirski 提交于
Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 10 8月, 2014 1 次提交
-
-
由 Jeremiah Mahler 提交于
A sparse warning is generated about 'tlb_single_page_flush_ceiling' not being declared. arch/x86/mm/tlb.c:177:15: warning: symbol 'tlb_single_page_flush_ceiling' was not declared. Should it be static? Since it isn't used anywhere outside this file, fix the warning by making it static. Also, optimize the use of this variable by adding the __read_mostly directive, as suggested by David Rientjes. Suggested-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NJeremiah Mahler <jmmahler@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/1407569913-4035-1-git-send-email-jmmahler@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 8月, 2014 1 次提交
-
-
由 Dave Hansen 提交于
Dave Jones reported seeing a bug from one of my TLB tracepoints: http://lkml.kernel.org/r/20140806181801.GA4605@redhat.com According to Paul McKenney, the right way to fix this is adding an _rcuidle suffix to the tracepoint. http://lkml.kernel.org/r/20140807065055.GA5821@linux.vnet.ibm.com This patch does just that. Reported-by: Dave Jones <davej@redhat.com>, Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Dave Hansen <dave@sr71.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140807175841.5C92D878@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 31 7月, 2014 7 次提交
-
-
由 Dave Hansen 提交于
This has been run through Intel's LKP tests across a wide range of modern sytems and workloads and it wasn't shown to make a measurable performance difference positive or negative. Now that we have some shiny new tracepoints, we can actually figure out what the heck is going on. During a kernel compile, 60% of the flush_tlb_mm_range() calls are for a single page. It breaks down like this: size percent percent<= V V V GLOBAL: 2.20% 2.20% avg cycles: 2283 1: 56.92% 59.12% avg cycles: 1276 2: 13.78% 72.90% avg cycles: 1505 3: 8.26% 81.16% avg cycles: 1880 4: 7.41% 88.58% avg cycles: 2447 5: 1.73% 90.31% avg cycles: 2358 6: 1.32% 91.63% avg cycles: 2563 7: 1.14% 92.77% avg cycles: 2862 8: 0.62% 93.39% avg cycles: 3542 9: 0.08% 93.47% avg cycles: 3289 10: 0.43% 93.90% avg cycles: 3570 11: 0.20% 94.10% avg cycles: 3767 12: 0.08% 94.18% avg cycles: 3996 13: 0.03% 94.20% avg cycles: 4077 14: 0.02% 94.23% avg cycles: 4836 15: 0.04% 94.26% avg cycles: 5699 16: 0.06% 94.32% avg cycles: 5041 17: 0.57% 94.89% avg cycles: 5473 18: 0.02% 94.91% avg cycles: 5396 19: 0.03% 94.95% avg cycles: 5296 20: 0.02% 94.96% avg cycles: 6749 21: 0.18% 95.14% avg cycles: 6225 22: 0.01% 95.15% avg cycles: 6393 23: 0.01% 95.16% avg cycles: 6861 24: 0.12% 95.28% avg cycles: 6912 25: 0.05% 95.32% avg cycles: 7190 26: 0.01% 95.33% avg cycles: 7793 27: 0.01% 95.34% avg cycles: 7833 28: 0.01% 95.35% avg cycles: 8253 29: 0.08% 95.42% avg cycles: 8024 30: 0.03% 95.45% avg cycles: 9670 31: 0.01% 95.46% avg cycles: 8949 32: 0.01% 95.46% avg cycles: 9350 33: 3.11% 98.57% avg cycles: 8534 34: 0.02% 98.60% avg cycles: 10977 35: 0.02% 98.62% avg cycles: 11400 We get in to dimishing returns pretty quickly. On pre-IvyBridge CPUs, we used to set the limit at 8 pages, and it was set at 128 on IvyBrige. That 128 number looks pretty silly considering that less than 0.5% of the flushes are that large. The previous code tried to size this number based on the size of the TLB. Good idea, but it's error-prone, needs maintenance (which it didn't get up to now), and probably would not matter in practice much. Settting it to 33 means that we cover the mallopt M_TRIM_THRESHOLD, which is the most universally common size to do flushes. That's the short version. Here's the long one for why I chose 33: 1. These numbers have a constant bias in the timestamps from the tracing. Probably counts for a couple hundred cycles in each of these tests, but it should be fairly _even_ across all of them. The smallest delta between the tracepoints I have ever seen is 335 cycles. This is one reason the cycles/page cost goes down in general as the flushes get larger. The true cost is nearer to 100 cycles. 2. A full flush is more expensive than a single invlpg, but not by much (single percentages). 3. A dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. 4. 22,000 cycles is approximately the equivalent of doing 85 invlpg operations. But, the odds are that the TLB can actually be filled up faster than that because TLB misses that are close in time also tend to leverage the same caches. 6. ~98% of flushes are <=33 pages. There are a lot of flushes of 33 pages, probably because libc's M_TRIM_THRESHOLD is set to 128k (32 pages) 7. I've found no consistent data to support changing the IvyBridge vs. SandyBridge tunable by a factor of 16 I used the performance counters on this hardware (IvyBridge i5-3320M) to figure out the tlb miss costs: ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush 7,720,030,970 dtlb_load_misses_walk_duration [57.13%] 169,856,353 dtlb_load_misses_walk_completed [57.15%] 708,832,859 dtlb_store_misses_walk_duration [57.17%] 19,346,823 dtlb_store_misses_walk_completed [57.17%] 2,779,687,402 itlb_misses_walk_duration [57.15%] 82,241,148 itlb_misses_walk_completed [57.13%] 770,717 itlb_itlb_flush [57.11%] Show that a dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns (~34 cycles). At those rates, refilling the 512-entry dTLB takes 22,000 cycles. On a SandyBridge system with more cores and larger caches, those are dtlb=13.4ns and itlb=9.5ns. cat perf.stat.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, " ----- ", icyc,imiss, dcyc,dmiss } On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any The assumptions that this code went in under: https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are about 100ns. Being generous, that is over by a factor of 6 on the refill side, although it is fairly close on the cost of an invlpg. An increase of a single invlpg operation seems to lengthen the flush range operation by about 200 cycles. Here is one example of the data collected for flushing 10 and 11 pages (full data are below): 10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714 11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145 How to generate this table: echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb echo x86-tsc > /sys/kernel/debug/tracing/trace_clock echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable Pipe the trace output in to this script: http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt Note that these data were gathered with the invlpg threshold set to 150 pages. Only data points with >=50 of samples were printed: Flush % of %<= in flush this pages es size ------------------------------------------------------------------------------ -1: 2.20% 2.20% avg cycles: 2283 cycles/page: xxxx samples: 23960 1: 56.92% 59.12% avg cycles: 1276 cycles/page: 1276 samples: 620895 2: 13.78% 72.90% avg cycles: 1505 cycles/page: 752 samples: 150335 3: 8.26% 81.16% avg cycles: 1880 cycles/page: 626 samples: 90131 4: 7.41% 88.58% avg cycles: 2447 cycles/page: 611 samples: 80877 5: 1.73% 90.31% avg cycles: 2358 cycles/page: 471 samples: 18885 6: 1.32% 91.63% avg cycles: 2563 cycles/page: 427 samples: 14397 7: 1.14% 92.77% avg cycles: 2862 cycles/page: 408 samples: 12441 8: 0.62% 93.39% avg cycles: 3542 cycles/page: 442 samples: 6721 9: 0.08% 93.47% avg cycles: 3289 cycles/page: 365 samples: 917 10: 0.43% 93.90% avg cycles: 3570 cycles/page: 357 samples: 4714 11: 0.20% 94.10% avg cycles: 3767 cycles/page: 342 samples: 2145 12: 0.08% 94.18% avg cycles: 3996 cycles/page: 333 samples: 864 13: 0.03% 94.20% avg cycles: 4077 cycles/page: 313 samples: 289 14: 0.02% 94.23% avg cycles: 4836 cycles/page: 345 samples: 236 15: 0.04% 94.26% avg cycles: 5699 cycles/page: 379 samples: 390 16: 0.06% 94.32% avg cycles: 5041 cycles/page: 315 samples: 643 17: 0.57% 94.89% avg cycles: 5473 cycles/page: 321 samples: 6229 18: 0.02% 94.91% avg cycles: 5396 cycles/page: 299 samples: 224 19: 0.03% 94.95% avg cycles: 5296 cycles/page: 278 samples: 367 20: 0.02% 94.96% avg cycles: 6749 cycles/page: 337 samples: 185 21: 0.18% 95.14% avg cycles: 6225 cycles/page: 296 samples: 1964 22: 0.01% 95.15% avg cycles: 6393 cycles/page: 290 samples: 83 23: 0.01% 95.16% avg cycles: 6861 cycles/page: 298 samples: 61 24: 0.12% 95.28% avg cycles: 6912 cycles/page: 288 samples: 1307 25: 0.05% 95.32% avg cycles: 7190 cycles/page: 287 samples: 533 26: 0.01% 95.33% avg cycles: 7793 cycles/page: 299 samples: 94 27: 0.01% 95.34% avg cycles: 7833 cycles/page: 290 samples: 66 28: 0.01% 95.35% avg cycles: 8253 cycles/page: 294 samples: 73 29: 0.08% 95.42% avg cycles: 8024 cycles/page: 276 samples: 846 30: 0.03% 95.45% avg cycles: 9670 cycles/page: 322 samples: 296 31: 0.01% 95.46% avg cycles: 8949 cycles/page: 288 samples: 79 32: 0.01% 95.46% avg cycles: 9350 cycles/page: 292 samples: 60 33: 3.11% 98.57% avg cycles: 8534 cycles/page: 258 samples: 33936 34: 0.02% 98.60% avg cycles: 10977 cycles/page: 322 samples: 268 35: 0.02% 98.62% avg cycles: 11400 cycles/page: 325 samples: 177 36: 0.01% 98.63% avg cycles: 11504 cycles/page: 319 samples: 161 37: 0.02% 98.65% avg cycles: 11596 cycles/page: 313 samples: 182 38: 0.02% 98.66% avg cycles: 11850 cycles/page: 311 samples: 195 39: 0.01% 98.68% avg cycles: 12158 cycles/page: 311 samples: 128 40: 0.01% 98.68% avg cycles: 11626 cycles/page: 290 samples: 78 41: 0.04% 98.73% avg cycles: 11435 cycles/page: 278 samples: 477 42: 0.01% 98.73% avg cycles: 12571 cycles/page: 299 samples: 74 43: 0.01% 98.74% avg cycles: 12562 cycles/page: 292 samples: 78 44: 0.01% 98.75% avg cycles: 12991 cycles/page: 295 samples: 108 45: 0.01% 98.76% avg cycles: 13169 cycles/page: 292 samples: 78 46: 0.02% 98.78% avg cycles: 12891 cycles/page: 280 samples: 261 47: 0.01% 98.79% avg cycles: 13099 cycles/page: 278 samples: 67 48: 0.01% 98.80% avg cycles: 13851 cycles/page: 288 samples: 77 49: 0.01% 98.80% avg cycles: 13749 cycles/page: 280 samples: 66 50: 0.01% 98.81% avg cycles: 13949 cycles/page: 278 samples: 73 52: 0.00% 98.82% avg cycles: 14243 cycles/page: 273 samples: 52 54: 0.01% 98.83% avg cycles: 15312 cycles/page: 283 samples: 87 55: 0.01% 98.84% avg cycles: 15197 cycles/page: 276 samples: 109 56: 0.02% 98.86% avg cycles: 15234 cycles/page: 272 samples: 208 57: 0.00% 98.86% avg cycles: 14888 cycles/page: 261 samples: 53 58: 0.01% 98.87% avg cycles: 15037 cycles/page: 259 samples: 59 59: 0.01% 98.87% avg cycles: 15752 cycles/page: 266 samples: 63 62: 0.00% 98.89% avg cycles: 16222 cycles/page: 261 samples: 54 64: 0.02% 98.91% avg cycles: 17179 cycles/page: 268 samples: 248 65: 0.12% 99.03% avg cycles: 18762 cycles/page: 288 samples: 1324 85: 0.00% 99.10% avg cycles: 21649 cycles/page: 254 samples: 50 127: 0.01% 99.18% avg cycles: 32397 cycles/page: 255 samples: 75 128: 0.13% 99.31% avg cycles: 31711 cycles/page: 247 samples: 1466 129: 0.18% 99.49% avg cycles: 33017 cycles/page: 255 samples: 1927 181: 0.33% 99.84% avg cycles: 2489 cycles/page: 13 samples: 3547 256: 0.05% 99.91% avg cycles: 2305 cycles/page: 9 samples: 550 512: 0.03% 99.95% avg cycles: 2133 cycles/page: 4 samples: 304 1512: 0.01% 99.99% avg cycles: 3038 cycles/page: 2 samples: 65 Here are the tlb counters during a 10-second slice of a kernel compile for a SandyBridge system. It's better than IvyBridge, but probably due to the larger caches since this was one of the 'X' extreme parts. 10,873,007,282 dtlb_load_misses_walk_duration 250,711,333 dtlb_load_misses_walk_completed 1,212,395,865 dtlb_store_misses_walk_duration 31,615,772 dtlb_store_misses_walk_completed 5,091,010,274 itlb_misses_walk_duration 163,193,511 itlb_misses_walk_completed 1,321,980 itlb_itlb_flush 10.008045158 seconds time elapsed # cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, " ----- ", icyc,imiss, dcyc,dmiss }' itlb ns/miss: 9.45338 dtlb ns/miss: 12.9716 Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
Most of the logic here is in the documentation file. Please take a look at it. I know we've come full-circle here back to a tunable, but this new one is *WAY* simpler. I challenge anyone to describe in one sentence how the old one worked. Here's the way the new one works: If we are flushing more pages than the ceiling, we use the full flush, otherwise we use per-page flushes. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
We don't have any good way to figure out what kinds of flushes are being attempted. Right now, we can try to use the vm counters, but those only tell us what we actually did with the hardware (one-by-one vs full) and don't tell us what was actually _requested_. This allows us to select out "interesting" TLB flushes that we might want to optimize (like the ranged ones) and ignore the ones that we have very little control over (the ones at context switch). Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
There are currently three paths through the remote flush code: 1. full invalidation 2. single page invalidation using invlpg 3. ranged invalidation using invlpg This takes 2 and 3 and combines them in to a single path by making the single-page one just be the start and end be start plus a single page. This makes placement of our tracepoint easier. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
If we take the if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) { local_flush_tlb(); goto out; } path out of flush_tlb_mm_range(), we will have flushed the tlb, but not incremented NR_TLB_LOCAL_FLUSH_ALL. This unifies the way out of the function so that we always take a single path when doing a full tlb flush. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
I think the flush_tlb_mm_range() code that tries to tune the flush sizes based on the CPU needs to get ripped out for several reasons: 1. It is obviously buggy. It uses mm->total_vm to judge the task's footprint in the TLB. It should certainly be using some measure of RSS, *NOT* ->total_vm since only resident memory can populate the TLB. 2. Haswell, and several other CPUs are missing from the intel_tlb_flushall_shift_set() function. Thus, it has been demonstrated to bitrot quickly in practice. 3. It is plain wrong in my vm: [ 0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0 [ 0.037444] tlb_flushall_shift: 6 Which leads to it to never use invlpg. 4. The assumptions about TLB refill costs are wrong: http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com (more on this in later patches) 5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59 I believe the sample times were too short. Running the benchmark in a loop yields times that vary quite a bit. Note that this leaves us with a static ceiling of 1 page. This is a conservative, dumb setting, and will be revised in a later patch. This also removes the code which attempts to predict whether we are flushing data or instructions. We expect instruction flushes to be relatively rare and not worth tuning for explicitly. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
由 Dave Hansen 提交于
The if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) line of code is not exactly the easiest to audit, especially when it ends up at two different indentation levels. This eliminates one of the the copy-n-paste versions. It also gives us a unified exit point for each path through this function. We need this in a minute for our tracepoint. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.comAcked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
-
- 25 1月, 2014 3 次提交
-
-
由 Mel Gorman 提交于
When choosing between doing an address space or ranged flush, the x86 implementation of flush_tlb_mm_range takes into account whether there are any large pages in the range. A per-page flush typically requires fewer entries than would covered by a single large page and the check is redundant. There is one potential exception. THP migration flushes single THP entries and it conceivably would benefit from flushing a single entry instead of the mm. However, this flush is after a THP allocation, copy and page table update potentially with any other threads serialised behind it. In comparison to that, the flush is noise. It makes more sense to optimise balancing to require fewer flushes than to optimise the flush itself. This patch deletes the redundant huge page check. Signed-off-by: NMel Gorman <mgorman@suse.de> Tested-by: NDavidlohr Bueso <davidlohr@hp.com> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and the comparison with total_vm is done before taking tlb_flushall_shift into account. Clean it up. Signed-off-by: NMel Gorman <mgorman@suse.de> Tested-by: NDavidlohr Bueso <davidlohr@hp.com> Reviewed-by: NAlex Shi <alex.shi@linaro.org> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: NMel Gorman <mgorman@suse.de> Tested-by: NDavidlohr Bueso <davidlohr@hp.com> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 12 9月, 2013 1 次提交
-
-
由 Dave Hansen 提交于
The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled for SMP. UP systems do not do remote TLB flushes, so compile those counters out on UP. arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is probably an optimization since both the mtrr code and __flush_tlb() write cr4. It would probably be safe to make that a flush_tlb_all() (and then get these statistics), but the mtrr code is ancient and I'm hesitant to touch it other than to just stick in the counters. [akpm@linux-foundation.org: tweak comments] Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-