- 05 7月, 2017 4 次提交
-
-
由 Andy Lutomirski 提交于
Now that lazy TLB suppresses all flush IPIs (as opposed to all but the first), there's no need to leave_mm() when going idle. This means we can get rid of the rcuidle hack in switch_mm_irqs_off() and we can unexport leave_mm(). This also removes acpi_unlazy_tlb() from the x86 and ia64 headers, since it has no callers any more. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/03c699cfd6021e467be650d6b73deaccfe4b4bd7.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
x86's lazy TLB mode used to be fairly weak -- it would switch to init_mm the first time it tried to flush a lazy TLB. This meant an unnecessary CR3 write and, if the flush was remote, an unnecessary IPI. Rewrite it entirely. When we enter lazy mode, we simply remove the CPU from mm_cpumask. This means that we need a way to figure out whether we've missed a flush when we switch back out of lazy mode. I use the tlb_gen machinery to track whether a context is up to date. Note to reviewers: this patch, my itself, looks a bit odd. I'm using an array of length 1 containing (ctx_id, tlb_gen) rather than just storing tlb_gen, and making it at array isn't necessary yet. I'm doing this because the next few patches add PCID support, and, with PCID, we need ctx_id, and the array will end up with a length greater than 1. Making it an array now means that there will be less churn and therefore less stress on your eyeballs. NB: This is dubious but, AFAICT, still correct on Xen and UV. xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this patch changes the way that mm_cpumask() works. This should be okay, since Xen *also* iterates all online CPUs to find all the CPUs it needs to twiddle. The UV tlbflush code is rather dated and should be changed. Here are some benchmark results, done on a Skylake laptop at 2.3 GHz (turbo off, intel_pstate requesting max performance) under KVM with the guest using idle=poll (to avoid artifacts when bouncing between CPUs). I haven't done any real statistics here -- I just ran them in a loop and picked the fastest results that didn't look like outliers. Unpatched means commit a4eb8b99, so all the bookkeeping overhead is gone. MADV_DONTNEED; touch the page; switch CPUs using sched_setaffinity. In an unpatched kernel, MADV_DONTNEED will send an IPI to the previous CPU. This is intended to be a nearly worst-case test. patched: 13.4µs unpatched: 21.6µs Vitaly's pthread_mmap microbenchmark with 8 threads (on four cores), nrounds = 100, 256M data patched: 1.1 seconds or so unpatched: 1.9 seconds or so The sleepup on Vitaly's test appearss to be because it spends a lot of time blocked on mmap_sem, and this patch avoids sending IPIs to blocked CPUs. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Banman <abanman@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Travis <travis@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/ddf2c92962339f4ba39d8fc41b853936ec0b44f1.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
There are two kernel features that would benefit from tracking how up-to-date each CPU's TLB is in the case where IPIs aren't keeping it up to date in real time: - Lazy mm switching currently works by switching to init_mm when it would otherwise flush. This is wasteful: there isn't fundamentally any need to update CR3 at all when going lazy or when returning from lazy mode, nor is there any need to receive flush IPIs at all. Instead, we should just stop trying to keep the TLB coherent when we go lazy and, when unlazying, check whether we missed any flushes. - PCID will let us keep recent user contexts alive in the TLB. If we start doing this, we need a way to decide whether those contexts are up to date. On some paravirt systems, remote TLBs can be flushed without IPIs. This won't update the target CPUs' tlb_gens, which may cause unnecessary local flushes later on. We can address this if it becomes a problem by carefully updating the target CPU's tlb_gen directly. By itself, this patch is a very minor optimization that avoids unnecessary flushes when multiple TLB flushes targetting the same CPU race. The complexity in this patch would not be worth it on its own, but it will enable improved lazy TLB tracking and PCID. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1210fb244bc9cbe7677f7f0b72db4d359675f24b.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
This adds two new variables to mmu_context_t: ctx_id and tlb_gen. ctx_id uniquely identifies the mm_struct and will never be reused. For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic count of the number of times that a TLB flush has been requested. The pair (ctx_id, tlb_gen) can be used as an identifier for TLB flush actions and will be used in subsequent patches to reliably determine whether all needed TLB flushes have occurred on a given CPU. This patch is split out for ease of review. By itself, it has no real effect other than creating and updating the new variables. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/413a91c24dab3ed0caa5f4e4d017d87b0857f920.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 30 6月, 2017 2 次提交
-
-
由 Andy Lutomirski 提交于
The comment describes the old explicit IPI-based flush logic, which is long gone. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/55e44997e56086528140c5180f8337dc53fb7ffc.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
It was historically possible to have two concurrent TLB flushes targetting the same CPU: one initiated locally and one initiated remotely. This can now cause an OOPS in leave_mm() at arch/x86/mm/tlb.c:47: if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); with this call trace: flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline] flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317 Without reentrancy, this OOPS is impossible: leave_mm() is only called if we're not in TLBSTATE_OK, but then we're unexpectedly in TLBSTATE_OK in leave_mm(). This can be caused by flush_tlb_func_remote() happening between the two checks and calling leave_mm(), resulting in two consecutive leave_mm() calls on the same CPU with no intervening switch_mm() calls. We never saw this OOPS before because the old leave_mm() implementation didn't put us back in TLBSTATE_OK, so the assertion didn't fire. Nadav noticed the reentrancy issue in a different context, but neither of us realized that it caused a problem yet. Reported-by: NLevin, Alexander (Sasha Levin) <alexander.levin@verizon.com> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NNadav Amit <nadav.amit@gmail.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Fixes: 3d28ebce ("x86/mm: Rework lazy TLB to track the actual loaded mm") Link: http://lkml.kernel.org/r/855acf733268d521c9f2e191faee2dcc23a29729.1498751203.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 26 6月, 2017 1 次提交
-
-
由 Jérôme Glisse 提交于
Since commit: af2cf278 ("x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()") we no longer free PUDs so that we do not have to synchronize all PGDs on hot-remove/vfree(). But the new 5-level page table patchset reverted that for 4-level page tables, in the following commit: f2a6a705: ("x86: Convert the rest of the code to support p4d_t") This patch restores the damage and disables free_pud() if we are in the 4-level page table case, thus avoiding BUG_ON() after hot-remove. Signed-off-by: NJérôme Glisse <jglisse@redhat.com> [ Clarified the changelog and the code comments. ] Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170624180514.3821-1-jglisse@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 6月, 2017 1 次提交
-
-
由 Michal Hocko 提交于
Since the following commit in 2008: cc503c1b ("x86: PIE executable randomization") We added a heuristics to treat applications with RLIMIT_STACK configured to unlimited as legacy. This means: a) set the mmap_base to 1/3 of address space + randomization and b) mmap from bottom to top. This makes some sense as it allows the stack to grow really large. On the other hand it reduces the address space usable for default mmaps (without address hint) quite a lot. We have received a bug report that SAP HANA workload has hit into this limitation. We could argue that the user just got what he asked for when setting up the unlimited stack but to be realistic growing stack up to 1/6 TASK_SIZE (allowed by mmap_base) is pretty much unimited in the real life. This would give mmap 20TB of additional address space which is quite nice. Especially when it is much more likely to use that address space than the reserved stack. Digging into the history the original implementation of the randomization: 8817210d ("[PATCH] x86_64: Flexmap for 32bit and randomized mappings for 64bit") didn't have this restriction. So let's try and remove this assumption - hopefully nothing breaks. Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJiri Kosina <jkosina@suse.cz> Acked-by: NOleg Nesterov <oleg@redhat.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: hughd@google.com Cc: linux-mm@kvack.org Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/tip-86b110d2ae6365ce91cabd37588bc8611770421a@git.kernel.org [ So I've applied this to tip:x86/mm with a wider Cc: list - if anyone objects to this change please holler. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 22 6月, 2017 1 次提交
-
-
由 Andy Lutomirski 提交于
Originally, Linux reloaded the LDT whenever the prev mm or the next mm had an LDT. It was changed in 2002 in: 0bbed3beb4f2 ("[PATCH] Thread-Local Storage (TLS) support") (commit from the historical tree), like this: - /* load_LDT, if either the previous or next thread - * has a non-default LDT. + /* + * load the LDT, if the LDT is different: */ - if (next->context.size+prev->context.size) + if (unlikely(prev->context.ldt != next->context.ldt)) load_LDT(&next->context); The current code is unlikely to avoid any LDT reloads, since different mms won't share an LDT. When we redo lazy mode to stop flush IPIs without switching to init_mm, though, the current logic would become incorrect: it will be possible to have real_prev == next but nonetheless have a stale LDT descriptor. Simplify the code to update LDTR if either the previous or the next mm has an LDT, i.e. effectively restore the historical logic.. While we're at it, clean up the code by moving all the ifdeffery to a header where it belongs. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NBorislav Petkov <bp@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2a859ac01245f9594c58f9d0a8b2ed8a7cd2507e.1498022414.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 19 6月, 2017 1 次提交
-
-
由 Hugh Dickins 提交于
Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: NOleg Nesterov <oleg@redhat.com> Original-patch-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 6月, 2017 8 次提交
-
-
由 Kirill A. Shutemov 提交于
With 5-level paging randomization happens on P4D level instead of PUD. Maximum amount of physical memory also bumped to 52-bits for 5-level paging. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-13-kirill.shutemov@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill A. Shutemov 提交于
Populate additional page table level if CONFIG_X86_5LEVEL is enabled. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-12-kirill.shutemov@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill A. Shutemov 提交于
This basically restores slightly modified version of original sync_global_pgds() which we had before folded p4d was introduced. The only modification is protection against 'addr' overflow. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-11-kirill.shutemov@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill A. Shutemov 提交于
With CONFIG_X86_5LEVEL=y, level 4 is no longer top level of page tables. Let's give these variable more generic names: init_top_pgt and early_top_pgt. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NJuergen Gross <jgross@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-9-kirill.shutemov@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill A. Shutemov 提交于
This patch provides all required callbacks required by the generic get_user_pages_fast() code and switches x86 over - and removes the platform specific implementation. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The kernel has several code paths that read CR3. Most of them assume that CR3 contains the PGD's physical address, whereas some of them awkwardly use PHYSICAL_PAGE_MASK to mask off low bits. Add explicit mask macros for CR3 and convert all of the CR3 readers. This will keep them from breaking when PCID is enabled. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: xen-devel <xen-devel@lists.xen.org> Link: http://lkml.kernel.org/r/883f8fb121f4616c1c1427ad87350bb2f5ffeca1.1497288170.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vlastimil Babka 提交于
The kmemleak and debug_pagealloc features both disable using huge pages for direct mappings so they can do cpa() on page level granularity in any context. However they only do that for 2MB pages, which means 1GB pages can still be used if the CPU supports it, unless disabled by a boot param, which is non-obvious. Disable also 1GB pages when disabling 2MB pages. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/2be70c78-6130-855d-3dfa-d87bd1dd4fda@suse.czSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Hans managed to trigger a WARN very early in the boot which killed his (Virtual) box. The reason is that the recent rework of WARN() to use UD0 forgot to add the fixup_bug() call to early_fixup_exception(). As a result the kernel does not handle the WARN_ON injected UD0 exception and panics. Add the missing fixup call, so early UD's injected by WARN() get handled. Fixes: 9a93848f ("x86/debug: Implement __WARN() using UD0") Reported-and-tested-by: NHans de Goede <hdegoede@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Frank Mehnert <frank.mehnert@oracle.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Michael Thayer <michael.thayer@oracle.com> Link: http://lkml.kernel.org/r/20170612180108.w4vgu2ckucmllf3a@hirez.programming.kicks-ass.net
-
- 05 6月, 2017 7 次提交
-
-
由 Andy Lutomirski 提交于
Nadav pointed out that some code used PAGE_SIZE and other code used PAGE_SHIFT. Use PAGE_SHIFT instead of multiplying or dividing by PAGE_SIZE. Requested-by: NNadav Amit <nadav.amit@gmail.com> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Lazy TLB state is currently managed in a rather baroque manner. AFAICT, there are three possible states: - Non-lazy. This means that we're running a user thread or a kernel thread that has called use_mm(). current->mm == current->active_mm == cpu_tlbstate.active_mm and cpu_tlbstate.state == TLBSTATE_OK. - Lazy with user mm. We're running a kernel thread without an mm and we're borrowing an mm_struct. We have current->mm == NULL, current->active_mm == cpu_tlbstate.active_mm, cpu_tlbstate.state != TLBSTATE_OK (i.e. TLBSTATE_LAZY or 0). The current cpu is set in mm_cpumask(current->active_mm). CR3 points to current->active_mm->pgd. The TLB is up to date. - Lazy with init_mm. This happens when we call leave_mm(). We have current->mm == NULL, current->active_mm == cpu_tlbstate.active_mm, but that mm is only relelvant insofar as the scheduler is tracking it for refcounting. cpu_tlbstate.state != TLBSTATE_OK. The current cpu is clear in mm_cpumask(current->active_mm). CR3 points to swapper_pg_dir, i.e. init_mm->pgd. This patch simplifies the situation. Other than perf, x86 stops caring about current->active_mm at all. We have cpu_tlbstate.loaded_mm pointing to the mm that CR3 references. The TLB is always up to date for that mm. leave_mm() just switches us to init_mm. There are no longer any special cases for mm_cpumask, and switch_mm() switches mms without worrying about laziness. After this patch, cpu_tlbstate.state serves only to tell the TLB flush code whether it may switch to init_mm instead of doing a normal flush. This makes fairly extensive changes to xen_exit_mmap(), which used to look a bit like black magic. Perf is unchanged. With or without this change, perf may behave a bit erratically if it tries to read user memory in kernel thread context. We should build on this patch to teach perf to never look at user memory when cpu_tlbstate.loaded_mm != current->mm. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The UP asm/tlbflush.h generates somewhat nicer code than the SMP version. Aside from that, it's fallen quite a bit behind the SMP code: - flush_tlb_mm_range() didn't flush individual pages if the range was small. - The lazy TLB code was much weaker. This usually wouldn't matter, but, if a kernel thread flushed its lazy "active_mm" more than once (due to reclaim or similar), it wouldn't be unlazied and would instead pointlessly flush repeatedly. - Tracepoints were missing. Aside from that, simply having the UP code around was a maintanence burden, since it means that any change to the TLB flush code had to make sure not to break it. Simplify everything by deleting the UP code. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Now there's only one copy of the local tlb flush logic for non-kernel pages on SMP kernels. The only functional change is that arch_tlbbatch_flush() will now leave_mm() on the local CPU if that CPU is in the batch and is in TLBSTATE_LAZY mode. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The local flush path is very similar to the remote flush path. Merge them. This is intended to make no difference to behavior whatsoever. It removes some code and will make future changes to the flushing mechanics simpler. This patch does remove one small optimization: flush_tlb_mm_range() now has an unconditional smp_mb() instead of using MOV to CR3 or INVLPG as a full barrier when applicable. I think this is okay for a few reasons. First, smp_mb() is quite cheap compared to the cost of a TLB flush. Second, this rearrangement makes a bigger optimization available: with some work on the SMP function call code, we could do the local and remote flushes in parallel. Third, I'm planning a rework of the TLB flush algorithm that will require an atomic operation at the beginning of each flush, and that operation will replace the smp_mb(). Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
On a remote TLB flush, we leave_mm() if we're TLBSTATE_LAZY. For a local flush_tlb_mm_range(), we leave_mm() if !current->mm. These are approximately the same condition -- the scheduler sets lazy TLB mode when switching to a thread with no mm. I'm about to merge the local and remote flush code, but for ease of verifying and bisecting the patch, I want the local and remote flush behavior to match first. This patch changes the local code to match the remote code. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Rather than passing all the contents of flush_tlb_info to flush_tlb_others(), pass a pointer to the structure directly. For consistency, this also removes the unnecessary cpu parameter from uv_flush_tlb_others() to make its signature match the other *flush_tlb_others() functions. This serves two purposes: - It will dramatically simplify future patches that change struct flush_tlb_info, which I'm planning to do. - struct flush_tlb_info is an adequate description of what to do for a local flush, too, so by reusing it we can remove duplicated code between local and remove flushes in a future patch. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org [ Fix build warning. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 01 6月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
This reverts commit cbed27cd. As Andy Lutomirski observed: "I think this patch is bogus. pat_enabled() sure looks like it's supposed to return true if PAT is *enabled*, and these days PAT is 'enabled' even if there's no HW PAT support." Reported-by: NBernhard Held <berny156@gmx.de> Reported-by: NChris Wilson <chris@chris-wilson.co.uk> Acked-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: stable@vger.kernel.org # v4.2+ Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 5月, 2017 1 次提交
-
-
由 Steven Rostedt (VMware) 提交于
With function tracing starting in early bootup and having its trampoline pages being read only, a bug triggered with the following: kernel BUG at arch/x86/mm/pageattr.c:189! invalid opcode: 0000 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.12.0-rc2-test+ #3 Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014 task: ffffffffb4222500 task.stack: ffffffffb4200000 RIP: 0010:change_page_attr_set_clr+0x269/0x302 RSP: 0000:ffffffffb4203c88 EFLAGS: 00010046 RAX: 0000000000000046 RBX: 0000000000000000 RCX: 00000001b6000000 RDX: ffffffffb4203d40 RSI: 0000000000000000 RDI: ffffffffb4240d60 RBP: ffffffffb4203d18 R08: 00000001b6000000 R09: 0000000000000001 R10: ffffffffb4203aa8 R11: 0000000000000003 R12: ffffffffc029b000 R13: ffffffffb4203d40 R14: 0000000000000001 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff9a639ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff9a636b384000 CR3: 00000001ea21d000 CR4: 00000000000406b0 Call Trace: change_page_attr_clear+0x1f/0x21 set_memory_ro+0x1e/0x20 arch_ftrace_update_trampoline+0x207/0x21c ? ftrace_caller+0x64/0x64 ? 0xffffffffc029b000 ftrace_startup+0xf4/0x198 register_ftrace_function+0x26/0x3c function_trace_init+0x5e/0x73 tracer_init+0x1e/0x23 tracing_set_tracer+0x127/0x15a register_tracer+0x19b/0x1bc init_function_trace+0x90/0x92 early_trace_init+0x236/0x2b3 start_kernel+0x200/0x3f5 x86_64_start_reservations+0x29/0x2b x86_64_start_kernel+0x17c/0x18f secondary_startup_64+0x9f/0x9f ? secondary_startup_64+0x9f/0x9f Interrupts should not be enabled at this early in the boot process. It is also fine to leave interrupts enabled during this time as there's only one CPU running, and on_each_cpu() means to only run on the current CPU. If early_boot_irqs_disabled is set, it is safe to run cpu_flush_range() with interrupts disabled. Don't trigger a BUG_ON() in that case. Link: http://lkml.kernel.org/r/20170526093717.0be3b849@gandalf.local.homeSuggested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 24 5月, 2017 4 次提交
-
-
由 Andy Lutomirski 提交于
try_to_unmap_flush() used to open-code a rather x86-centric flush sequence: local_flush_tlb() + flush_tlb_others(). Rearrange the code so that the arch (only x86 for now) provides arch_tlbbatch_add_mm() and arch_tlbbatch_flush() and the core code calls those functions instead. I'll want this for x86 because, to enable address space ids, I can't support the flush_tlb_others() mode used by exising try_to_unmap_flush() implementation with good performance. I can support the new API fairly easily, though. I imagine that other architectures may be in a similar position. Architectures with strong remote flush primitives (arm64?) may have even worse performance problems with flush_tlb_others() the way that try_to_unmap_flush() uses it. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/19f25a8581f9fb77876b7ff3b001f89835e34ea3.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The leave_mm() case can just exit the function early so we don't need to indent the entire remainder of the function. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/97901ddcc9821d7bc7b296d2918d1179f08aaf22.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
flush_tlb_page() was very similar to flush_tlb_mm_range() except that it had a couple of issues: - It was missing an smp_mb() in the case where current->active_mm != mm. (This is a longstanding bug reported by Nadav Amit) - It was missing tracepoints and vm counter updates. The only reason that I can see for keeping it at as a separate function is that it could avoid a few branches that flush_tlb_mm_range() needs to decide to flush just one page. This hardly seems worthwhile. If we decide we want to get rid of those branches again, a better way would be to introduce an __flush_tlb_mm_range() helper and make both flush_tlb_page() and flush_tlb_mm_range() use it. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mikulas Patocka 提交于
In the file arch/x86/mm/pat.c, there's a '__pat_enabled' variable. The variable is set to 1 by default and the function pat_init() sets __pat_enabled to 0 if the CPU doesn't support PAT. However, on AMD K6-3 CPUs, the processor initialization code never calls pat_init() and so __pat_enabled stays 1 and the function pat_enabled() returns true, even though the K6-3 CPU doesn't support PAT. The result of this bug is that a kernel warning is produced when attempting to start the Xserver and the Xserver doesn't start (fork() returns ENOMEM). Another symptom of this bug is that the framebuffer driver doesn't set the K6-3 MTRR registers: x86/PAT: Xorg:3891 map pfn expected mapping type uncached-minus for [mem 0xe4000000-0xe5ffffff], got write-combining ------------[ cut here ]------------ WARNING: CPU: 0 PID: 3891 at arch/x86/mm/pat.c:1020 untrack_pfn+0x5c/0x9f ... x86/PAT: Xorg:3891 map pfn expected mapping type uncached-minus for [mem 0xe4000000-0xe5ffffff], got write-combining To fix the bug change pat_enabled() so that it returns true only if PAT initialization was actually done. Also, I changed boot_cpu_has(X86_FEATURE_PAT) to this_cpu_has(X86_FEATURE_PAT) in pat_ap_init(), so that we check the PAT feature on the processor that is being initialized. Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: stable@vger.kernel.org # v4.2+ Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1704181501450.26399@file01.intranet.prod.int.rdu2.redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 09 5月, 2017 2 次提交
-
-
由 Laura Abbott 提交于
'__vmalloc_start_set' currently only gets set in initmem_init() when !CONFIG_NEED_MULTIPLE_NODES. This breaks detection of vmalloc address with virt_addr_valid() with CONFIG_NEED_MULTIPLE_NODES=y, causing a kernel crash: [mm/usercopy] 517e1fbe: kernel BUG at arch/x86/mm/physaddr.c:78! Set '__vmalloc_start_set' appropriately for that case as well. Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NLaura Abbott <labbott@redhat.com> Reviewed-by: NKees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dc16ecf7 ("x86-32: use specific __vmalloc_start_set flag in __virt_addr_valid") Link: http://lkml.kernel.org/r/1494278596-30373-1-git-send-email-labbott@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Laura Abbott 提交于
set_memory_* functions have moved to set_memory.h. Switch to this explicitly. Link: http://lkml.kernel.org/r/1488920133-27229-6-git-send-email-labbott@redhat.comSigned-off-by: NLaura Abbott <labbott@redhat.com> Acked-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 5月, 2017 1 次提交
-
-
由 Xunlei Pang 提交于
Kernel identity mappings on x86-64 kernels are created in two ways: by the early x86 boot code, or by kernel_ident_mapping_init(). Native kernels (which is the dominant usecase) use the former, but the kexec and the hibernation code uses kernel_ident_mapping_init(). There's a subtle difference between these two ways of how identity mappings are created, the current kernel_ident_mapping_init() code creates identity mappings always using 2MB page(PMD level) - while the native kernel boot path also utilizes gbpages where available. This difference is suboptimal both for performance and for memory usage: kernel_ident_mapping_init() needs to allocate pages for the page tables when creating the new identity mappings. This patch adds 1GB page(PUD level) support to kernel_ident_mapping_init() to address these concerns. The primary advantage would be better TLB coverage/performance, because we'd utilize 1GB TLBs instead of 2MB ones. It is also useful for machines with large number of memory to save paging structure allocations(around 4MB/TB using 2MB page) when setting identity mappings for all the memory, after using 1GB page it will consume only 8KB/TB. ( Note that this change alone does not activate gbpages in kexec, we are doing that in a separate patch. ) Signed-off-by: NXunlei Pang <xlpang@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: akpm@linux-foundation.org Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/1493862171-8799-1-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 5月, 2017 1 次提交
-
-
由 Baoquan He 提交于
Jeff Moyer reported that on his system with two memory regions 0~64G and 1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling KASLR will make the system hang intermittently during boot. While adding 'nokaslr' won't. The back trace is: Oops: 0000 [#1] SMP RIP: memcpy_erms() [ .... ] Call Trace: pmem_rw_page() bdev_read_page() do_mpage_readpage() mpage_readpages() blkdev_readpages() __do_page_cache_readahead() force_page_cache_readahead() page_cache_sync_readahead() generic_file_read_iter() blkdev_read_iter() __vfs_read() vfs_read() SyS_read() entry_SYSCALL_64_fastpath() This crash happens because the for loop count calculation in sync_global_pgds() is not correct. When a mapping area crosses PGD entries, we should calculate the starting address of region which next PGD covers and assign it to next for loop count, but not add PGDIR_SIZE directly. The old code works right only if the mapping area is an exact multiple of PGDIR_SIZE, otherwize the end region could be skipped so that it can't be synchronized to all other processes from kernel PGD init_mm.pgd. In Jeff's system, emulated pmem area [1024G, 1216G) is smaller than PGDIR_SIZE. While 'nokaslr' works because PAGE_OFFSET is 1T aligned, it makes this area be mapped inside one PGD entry. With KASLR enabled, this area could cross two PGD entries, then the next PGD entry won't be synced to all other processes. That is why we saw empty PGD. Fix it. Reported-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NBaoquan He <bhe@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jinbum Park <jinb.park7@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com> Cc: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1493864747-8506-1-git-send-email-bhe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 26 4月, 2017 4 次提交
-
-
由 Andy Lutomirski 提交于
flush_tlb_page() passes a bogus range to flush_tlb_others() and expects the latter to fix it up. native_flush_tlb_others() has the fixup but Xen's version doesn't. Move the fixup to flush_tlb_others(). AFAICS the only real effect is that, without this fix, Xen would flush everything instead of just the one page on remote vCPUs in when flush_tlb_page() was called. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: e7b52ffd ("x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range") Link: http://lkml.kernel.org/r/10ed0e4dfea64daef10b87fb85df1746999b4dba.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
I'm about to rewrite the function almost completely, but first I want to get a functional change out of the way. Currently, if flush_tlb_mm_range() does not flush the local TLB at all, it will never do individual page flushes on remote CPUs. This seems to be an accident, and preserving it will be awkward. Let's change it first so that any regressions in the rewrite will be easier to bisect and so that the rewrite can attempt to change no visible behavior at all. The fix is simple: we can simply avoid short-circuiting the calculation of base_pages_to_flush. As a side effect, this also eliminates a potential corner case: if tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range() could have ended up flushing the entire address space one page at a time. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NDave Hansen <dave.hansen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
I was trying to figure out what how flush_tlb_current_task() would possibly work correctly if current->mm != current->active_mm, but I realized I could spare myself the effort: it has no callers except the unused flush_tlb() macro. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill A. Shutemov 提交于
remove_pagetable() does page walk using p*d_page_vaddr() plus cast. It's not canonical approach -- we usually use p*d_offset() for that. It works fine as long as all page table levels are present. We broke the invariant by introducing folded p4d page table level. As result, remove_pagetable() interprets PMD as PUD and it leads to crash: BUG: unable to handle kernel paging request at ffff880300000000 IP: memchr_inv+0x60/0x110 PGD 317d067 P4D 317d067 PUD 3180067 PMD 33f102067 PTE 8000000300000060 Let's fix this by using p*d_offset() instead of p*d_page_vaddr() for page walk. Reported-by: NDan Williams <dan.j.williams@intel.com> Tested-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: f2a6a705 ("x86: Convert the rest of the code to support p4d_t") Link: http://lkml.kernel.org/r/20170425092557.21852-1-kirill.shutemov@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 4月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
This reverts commit 2947ba05. Dan Williams reported dax-pmem kernel warnings with the following signature: WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200 percpu ref (dax_pmem_percpu_release [dax_pmem]) <= 0 (0) after switching to atomic ... and bisected it to this commit, which suggests possible memory corruption caused by the x86 fast-GUP conversion. He also pointed out: " This is similar to the backtrace when we were not properly handling pud faults and was fixed with this commit: 220ced16 "mm: fix get_user_pages() vs device-dax pud mappings" I've found some missing _devmap checks in the generic get_user_pages_fast() path, but this does not fix the regression [...] " So given that there are known bugs, and a pretty robust looking bisection points to this commit suggesting that are unknown bugs in the conversion as well, revert it for the time being - we'll re-try in v4.13. Reported-by: NDan Williams <dan.j.williams@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: aneesh.kumar@linux.vnet.ibm.com Cc: dann.frazier@canonical.com Cc: dave.hansen@intel.com Cc: steve.capper@linaro.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-