- 05 Mar, 2015 2 commits
-
-
Committed by Luis R. Rodriguez
direct_gbpages can be force enabled as an early parameter but not really have taken effect when DEBUG_PAGEALLOC or KMEMCHECK is enabled. You can also enable direct_gbpages right now if you have an x86_64 architecture but your CPU doesn't really have support for this feature. In both cases PG_LEVEL_1G won't actually be enabled but direct_gbpages is used in other areas under the assumptions that PG_LEVEL_1G was set. Fix this by putting together all requirements which make this feature sensible to enable under, and only enable both finally flipping on PG_LEVEL_1G and leaving PG_LEVEL_1G set when this is true. We only enable this feature then to be possible on sensible builds defined by the new ENABLE_DIRECT_GBPAGES. If the CPU has support for it you can either enable this by using the DIRECT_GBPAGES option or using the early kernel parameter. If a platform had support for this you can always force disable it as well. Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1425518654-3403-3-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
Committed by Luis R. Rodriguez
Replace #ifdef eyesore with IS_ENABLED() use. Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: JBeulich@suse.com Cc: Jan Beulich <JBeulich@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1425518654-3403-2-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
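For readers unfamiliar with the pattern, a minimal before/after sketch of the kind of cleanup IS_ENABLED() allows (the config symbol and variable are illustrative, not the exact hunk from this patch):

    /* Before: preprocessor conditional scattered through the C file */
    #ifdef CONFIG_DIRECT_GBPAGES
    int direct_gbpages = 1;
    #else
    int direct_gbpages;
    #endif

    /* After: ordinary C; IS_ENABLED() evaluates to 1 or 0 at compile time,
     * so the compiler still removes the dead value, but the code stays
     * visible to the parser and to static checkers. */
    int direct_gbpages = IS_ENABLED(CONFIG_DIRECT_GBPAGES);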
-
- 28 Feb, 2015 1 commit
-
-
Committed by Daniel Borkmann
This effectively unexports set_memory_ro() and set_memory_rw() functions, and thus reverts: a03352d2 ("x86: export set_memory_ro and set_memory_rw"). They have been introduced for debugging purposes in e1000e, but no module user is in mainline kernel (anymore?) and we explicitly do not want modules to use these functions, as they i.e. protect eBPF (interpreted & JIT'ed) images from malicious modifications or bugs. Outside of eBPF scope, I believe also other set_memory_*() functions should be unexported on x86 for modules. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Bruce Allan <bruce.w.allan@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Link: http://lkml.kernel.org/r/a064393a0a5d319eebde5c761cfd743132d4f213.1425040940.git.daniel@iogearbox.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 19 Feb, 2015 4 commits
-
-
Committed by Hector Marco-Gisbert
The issue is that the stack for processes is not properly randomized on 64 bit architectures due to an integer overflow. The affected function is randomize_stack_top() in file "fs/binfmt_elf.c":

    static unsigned long randomize_stack_top(unsigned long stack_top)
    {
            unsigned int random_variable = 0;

            if ((current->flags & PF_RANDOMIZE) &&
                    !(current->personality & ADDR_NO_RANDOMIZE)) {
                    random_variable = get_random_int() & STACK_RND_MASK;
                    random_variable <<= PAGE_SHIFT;
            }
    #ifdef CONFIG_STACK_GROWSUP
            return PAGE_ALIGN(stack_top) + random_variable;
    #else
            return PAGE_ALIGN(stack_top) - random_variable;
    #endif
    }

Note that it declares the "random_variable" variable as "unsigned int". Since the result of the shifting operation between STACK_RND_MASK (which is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):

    random_variable <<= PAGE_SHIFT;

the two leftmost bits are dropped when storing the result in the "random_variable". This variable shall be at least 34 bits long to hold the (22+12) result.

These two dropped bits have an impact on the entropy of the process stack. Concretely, the total stack entropy is reduced by a factor of four: from 2^30 to 2^28 (one fourth of the expected entropy).

This patch restores the entropy by correcting the types involved in the operations in the functions randomize_stack_top() and stack_maxrandom_size().

The successful fix can be tested with:

    $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
    7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0    [stack]
    7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0    [stack]
    7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0    [stack]
    7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0    [stack]
    ...

Once corrected, the leading bytes should be between 7ffc and 7fff, rather than always being 7fff.

Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Ismael Ripoll <iripoll@upv.es>
[ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: CVE-2015-1593
Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
Signed-off-by: Borislav Petkov <bp@suse.de>
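A sketch of the corrected function after widening the types, following the description above (the exact upstream hunk may differ slightly):

    static unsigned long randomize_stack_top(unsigned long stack_top)
    {
            unsigned long random_variable = 0;      /* was: unsigned int */

            if ((current->flags & PF_RANDOMIZE) &&
                    !(current->personality & ADDR_NO_RANDOMIZE)) {
                    random_variable = (unsigned long) get_random_int();
                    random_variable &= STACK_RND_MASK;
                    random_variable <<= PAGE_SHIFT; /* 22 + 12 bits now fit */
            }
    #ifdef CONFIG_STACK_GROWSUP
            return PAGE_ALIGN(stack_top) + random_variable;
    #else
            return PAGE_ALIGN(stack_top) - random_variable;
    #endif
    }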
-
Committed by Dave Hansen
With 32-bit non-PAE kernels, we have 2 page sizes available (at most): 4k and 4M. Enabling PAE replaces that 4M size with a 2M one (which 64-bit systems use too). But, when booting a 32-bit non-PAE kernel, in one of our early-boot printouts, we say:

    init_memory_mapping: [mem 0x00000000-0x000fffff]
     [mem 0x00000000-0x000fffff] page 4k
    init_memory_mapping: [mem 0x37000000-0x373fffff]
     [mem 0x37000000-0x373fffff] page 2M
    init_memory_mapping: [mem 0x00100000-0x36ffffff]
     [mem 0x00100000-0x003fffff] page 4k
     [mem 0x00400000-0x36ffffff] page 2M
    init_memory_mapping: [mem 0x37400000-0x377fdfff]
     [mem 0x37400000-0x377fdfff] page 4k

Which is obviously wrong. There is no 2M page available. This is probably because of a badly-named variable in the map_range code: PG_LEVEL_2M. Instead of renaming all the PG_LEVEL_2M's, this patch just fixes the printout:

    init_memory_mapping: [mem 0x00000000-0x000fffff]
     [mem 0x00000000-0x000fffff] page 4k
    init_memory_mapping: [mem 0x37000000-0x373fffff]
     [mem 0x37000000-0x373fffff] page 4M
    init_memory_mapping: [mem 0x00100000-0x36ffffff]
     [mem 0x00100000-0x003fffff] page 4k
     [mem 0x00400000-0x36ffffff] page 4M
    init_memory_mapping: [mem 0x37400000-0x377fdfff]
     [mem 0x37400000-0x377fdfff] page 4k
    BRK [0x03206000, 0x03206fff] PGTABLE

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20150210212030.665EC267@viggo.jf.intel.com
Signed-off-by: Borislav Petkov <bp@suse.de>
-
Committed by Pavel Machek
STRICT_DEVMEM and PAT produce the same failure accessing /dev/mem, which is quite confusing to the user. Make the printk messages different to lessen confusion.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Fenghua Yu
With more embedded systems emerging using Quark, among other things, 32-bit kernel matters again. 32-bit machine and kernel uses PAE paging, which currently wastes at least 4K of memory per process on Linux where we have to reserve an entire page to support a single 32-byte PGD structure. It would be a very good thing if we could eliminate that wastage. PAE paging is used to access more than 4GB memory on x86-32. And it is required for NX. In this patch, we still allocate one page for pgd for a Xen domain and 64-bit kernel because one page pgd is assumed in these cases. But we can save memory space by only allocating 32-byte pgd for 32-bit PAE kernel when it is not running as a Xen domain. Signed-off-by: NFenghua Yu <fenghua.yu@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Glenn Williamson <glenn.p.williamson@intel.com> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1421382601-46912-1-git-send-email-fenghua.yu@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 Feb, 2015 4 commits
-
-
Committed by Andrey Ryabinin
This feature let us to detect accesses out of bounds of global variables. This will work as for globals in kernel image, so for globals in modules. Currently this won't work for symbols in user-specified sections (e.g. __init, __read_mostly, ...) The idea of this is simple. Compiler increases each global variable by redzone size and add constructors invoking __asan_register_globals() function. Information about global variable (address, size, size with redzone ...) passed to __asan_register_globals() so we could poison variable's redzone. This patch also forces module_alloc() to return 8*PAGE_SIZE aligned address making shadow memory handling ( kasan_module_alloc()/kasan_module_free() ) more simple. Such alignment guarantees that each shadow page backing modules address space correspond to only one module_alloc() allocation. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrey Ryabinin
Stack instrumentation allows to detect out of bounds memory accesses for variables allocated on stack. Compiler adds redzones around every variable on stack and poisons redzones in function's prologue. Such approach significantly increases stack usage, so all in-kernel stacks size were doubled. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrey Ryabinin
This patch adds arch specific code for kernel address sanitizer. 16TB of virtual addressed used for shadow memory. It's located in range [ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup stacks. At early stage we map whole shadow region with zero page. Latter, after pages mapped to direct mapping address range we unmap zero pages from corresponding shadow (see kasan_map_shadow()) and allocate and map a real shadow memory reusing vmemmap_populate() function. Also replace __pa with __pa_nodebug before shadow initialized. __pa with CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr) __phys_addr is instrumented, so __asan_load could be called before shadow area initialized. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jim Davis <jim.epost@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Tejun Heo
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively, which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask.

* Unnecessary buffer size calculation and condition on the length removed from intel_cacheinfo.c::show_shared_cpu_map_func().

* uv_nmi_nr_cpus_pr() got overly smart and implemented "..." abbreviation if the output stretched over the predefined 1024 byte buffer. Replaced with plain printk.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
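A short usage sketch of the new format specifier together with the cpumask helper (illustrative, not taken from this patch's hunks):

    /* "%*pbl" prints the mask as a range list, e.g. "0-3,8-11" */
    printk(KERN_INFO "online cpus: %*pbl\n", cpumask_pr_args(cpu_online_mask));

    /* "%*pb" prints the same mask as a hex bitmap instead */
    printk(KERN_INFO "online cpus: %*pb\n", cpumask_pr_args(cpu_online_mask));

cpumask_pr_args(mask) expands to the two arguments the width-prefixed format needs: the number of valid bits and the bitmap pointer, so callers no longer size or fill an intermediate string buffer.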
-
- 13 Feb, 2015 1 commit
-
-
Committed by Mel Gorman
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Acked-by: NAneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: NSasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 Feb, 2015 4 commits
-
-
Committed by Andrea Arcangeli
This allows the get_user_pages_fast slow path to release the mmap_sem before blocking. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Committed by Kirill A. Shutemov
Dave noticed that an unprivileged process can allocate a significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and memory cgroup. The trick is to allocate a lot of PMD page tables. The Linux kernel doesn't account PMD tables to the process, only PTE.

The use-case below uses a few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
        char *addr = NULL;
        unsigned long i;

        prctl(PR_SET_THP_DISABLE);
        for (i = 0; i < NR_PUD ; i++) {
            addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                    MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
            if (addr == MAP_FAILED) {
                perror("mmap");
                break;
            }
            *addr = 'x';
            munmap(addr, PMD_SIZE);
            mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                    MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
            if (addr == MAP_FAILED)
                perror("re-mmap"), exit(1);
        }
        printf("PID %d consumed %lu KiB in PMD page tables\n",
                getpid(), i * 4096 >> 10);
        return pause();
    }

The patch addresses the issue by accounting PMD tables to the process the same way we account PTE.

The main places where PMD tables are accounted are __pmd_alloc() and free_pmd_range(). But there are a few corner cases:

 - HugeTLB can share PMD page tables. The patch handles this by accounting the table to all processes who share it.

 - x86 PAE pre-allocates a few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity check on exit(2).

Accounting only happens on configurations where the PMD page table level is present (PMD is not folded). As with nr_ptes, we use a per-mm counter. The counter value is used to calculate the baseline for the badness score by the oom-killer.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Naoya Horiguchi
Migrating hugepages and hwpoisoned hugepages are considered as non-present hugepages, and they are referenced via migration entries and hwpoison entries in their page table slots. This behavior causes race condition because pmd_huge() doesn't tell non-huge pages from migrating/hwpoisoned hugepages. follow_page_mask() is one example where the kernel would call follow_page_pte() for such hugepage while this function is supposed to handle only normal pages. To avoid this, this patch makes pmd_huge() return true when pmd_none() is true *and* pmd_present() is false. We don't have to worry about mixing up non-present pmd entry with normal pmd (pointing to leaf level pte entry) because pmd_present() is true in normal pmd. The same race condition could happen in (x86-specific) gup_pmd_range(), where this patch simply adds pmd_present() check instead of pmd_huge(). This is because gup_pmd_range() is fast path. If we have non-present hugepage in this function, we will go into gup_huge_pmd(), then return 0 at flag mask check, and finally fall back to the slow path. Fixes: 290408d4 ("hugetlb: hugepage migration core") Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: <stable@vger.kernel.org> [2.6.36+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
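The check this leads to on x86 treats a non-none entry that is either PSE-mapped or not present as huge; a hedged sketch of what that looks like (illustration of the idea, not necessarily the verbatim hunk):

    int pmd_huge(pmd_t pmd)
    {
            /*
             * A non-none entry that is not marked present is a migration
             * or hwpoison entry for a hugepage, so report it as huge too;
             * a normal (present, non-PSE) pmd pointing to a pte page
             * still returns 0.
             */
            return !pmd_none(pmd) &&
                    (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT;
    }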
-
Committed by Naoya Horiguchi
Currently we have many duplicates in definitions around follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove them. The basic idea is to put the default implementation for these functions in mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement arch-specific code only when the arch needs it.

For follow_huge_addr(), only powerpc and ia64 have their own implementation, and in all other architectures this function just returns ERR_PTR(-EINVAL). So this patch sets returning ERR_PTR(-EINVAL) as the default.

As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always return 0 in your architecture (like in ia64 or sparc,) it's never called (the callsite is optimized away) no matter how it is implemented. So in such architectures, we don't need an arch-specific implementation.

In some architectures (like mips, s390 and tile,) their current arch-specific follow_huge_(pmd|pud)() are effectively identical with the common code, so this patch lets these architectures use the common code.

One exception is metag, where pmd_huge() could return non-zero but it expects follow_huge_pmd() to always return NULL. This means that we need an arch-specific implementation which returns NULL. This behavior looks strange to me (because non-zero pmd_huge() implies that the architecture supports PMD-based hugepage, so follow_huge_pmd() can/should return some relevant value,) but that's beyond this cleanup patch, so let's keep it.

Justification of non-trivial changes:
- in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE is true when follow_huge_pmd() can be called (note that pmd_huge() has the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
- in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common code. This patch forces these archs to use PMD_MASK, but it's OK because they are identical in both archs. In s390, both HPAGE_SHIFT and PMD_SHIFT are 20. In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but PTE_ORDER is always 0, so these are identical.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 11 Feb, 2015 1 commit
-
-
Committed by Kirill A. Shutemov
After commit 944d9fec ("hugetlb: add support for gigantic page allocation at runtime") we can allocate 1G pages at runtime if CMA is enabled. Let's register 1G pages into hugetlb even if the user hasn't requested them explicitly at boot time with hugepagesz=1G. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NLuiz Capitulino <lcapitulino@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 Feb, 2015 2 commits
-
-
Committed by Andy Lutomirski
Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
Committed by Andy Lutomirski
CR4 manipulation was split, seemingly at random, between direct (write_cr4) and using a helper (set/clear_in_cr4). Unfortunately, the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code, which only a small subset of users actually wanted. This patch replaces all cr4 access in functions that don't leave cr4 exactly the way they found it with new helpers cr4_set_bits, cr4_clear_bits, and cr4_set_bits_and_update_boot. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
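Combining the two CR4 patches listed here, a minimal sketch of what such a helper looks like on top of the per-cpu shadow copy (names follow the descriptions above; treat this as an illustration rather than the exact upstream code):

    static inline void cr4_set_bits(unsigned long mask)
    {
            unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

            /* Skip the expensive CR4 write if the bits are already set. */
            if ((cr4 | mask) != cr4) {
                    cr4 |= mask;
                    this_cpu_write(cpu_tlbstate.cr4, cr4);
                    __write_cr4(cr4);
            }
    }

cr4_clear_bits() is the mirror image with "cr4 & ~mask", and cr4_set_bits_and_update_boot() additionally records the bits so they are restored in the boot/trampoline CR4 image.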
-
- 30 Jan, 2015 1 commit
-
-
Committed by Linus Torvalds
The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de> Tested-by: NJan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
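The per-architecture change is mostly mechanical; roughly how a fault handler's error path looks after it, modeled on the x86 pattern (a sketch, with the surrounding helpers assumed to exist in that arch):

    /* inside the arch fault handler's error path, after handle_mm_fault() */
    if (fault & VM_FAULT_OOM) {
            pagefault_out_of_memory();
    } else if (fault & VM_FAULT_SIGSEGV) {
            /* new case: the VM explicitly asked for a SIGSEGV,
             * e.g. an access to the stack guard page */
            bad_area_nosemaphore(regs, error_code, address);
    } else if (fault & (VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)) {
            do_sigbus(regs, error_code, address, fault);
    }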
-
- 23 Jan, 2015 2 commits
-
-
Committed by Juergen Gross
Commit 281d4078 ("x86: Make page cache mode a real type") introduced the symbols __cachemode2pte_tbl and __pte2cachemode_tbl and exported them via EXPORT_SYMBOL_GPL. The exports are part of a replacement of code which had been EXPORT_SYMBOL before these changes, resulting in build breakage of out-of-tree non-GPL modules.

Change EXPORT_SYMBOL_GPL to EXPORT_SYMBOL for these two symbols.

Fixes: 281d4078 "x86: Make page cache mode a real type"
Reported-and-tested-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Link: http://lkml.kernel.org/r/1421926997-28615-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
Committed by Dave Hansen
We had originally planned on submitting MPX support in one patch set. We eventually broke it up in to two pieces for easier review. One of the features that didn't make the first round was supporting 32-bit binaries on 64-bit kernels. Once we split the set up, we never added code to restrict 32-bit binaries from _using_ MPX on 64-bit kernels. The 32-bit bounds tables are a different format than the 64-bit ones. Without this patch, the kernel will try to read a 32-bit binary's tables as if they were the 64-bit version. They will likely be noticed as being invalid rather quickly and the app will get killed, but that's kinda mean. This patch adds an explicit check, and will make a 64-bit kernel essentially behave as if it has no MPX support when called from a 32-bit binary. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20150108223020.9E9AA511@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 20 Jan, 2015 1 commit
-
-
Committed by Juergen Gross
VMWare seems not to emulate the PAT MSR correctly: reading MSR_IA32_CR_PAT returns 0 even after writing another value to it. Commit bd809af1 triggers this VMWare bug when the kernel is booted as a VMWare guest.

Detect this bug and don't use the read value if it is 0.

Fixes: bd809af1 "x86: Enable PAT to use cache mode translation tables"
Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
Acked-by: Alok N Kataria <akataria@vmware.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Link: http://lkml.kernel.org/r/1421039745-14335-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
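A sketch of the guard described above, with the fallback variable named hypothetically (the real code lives in arch/x86/mm/pat.c and may structure this differently):

    u64 pat;

    rdmsrl(MSR_IA32_CR_PAT, pat);
    if (!pat) {
            /*
             * A zero readback means the hypervisor (seen with VMware) is
             * not emulating the PAT MSR properly; ignore the read value
             * and keep using the PAT layout the kernel programmed earlier.
             */
            pat = default_pat_value;    /* hypothetical: cached boot-time PAT */
    }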
-
- 02 Jan, 2015 1 commit
-
-
Committed by Pavel Machek
Use capital BIOS in the comment. It's cleaner, and allows telling the difference between BIOS and BIOs.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
-
- 23 Dec, 2014 1 commit
-
-
Committed by Jan Beulich
The old scheme can lead to failure in certain cases - the problem is that after bumping step_size the next (non-final) iteration is only guaranteed to make available a memory block the size of what step_size was before. E.g. for a memory block [0,3004600000) we'd have:

    iter  start        end         step  amount
    1     3004400000   30045fffff  2M    2M
    2     3004000000   30043fffff  64M   4M
    3     3000000000   3003ffffff  2G    64M
    4     2000000000   2fffffffff  64G   64G

Yet to map 64G with 4k pages (as happens e.g. under PV Xen) we need slightly over 128M, but the first three iterations made only about 70M available.

The condition (new_mapped_ram_size > mapped_ram_size) for bumping step_size is just not suitable. Instead we want to bump it when we know we have enough memory available to cover a block of the new step_size. And rather than making that condition more complicated than needed, simply adjust step_size by the largest possible factor we know we can cover at that point - which is shifting it left by one less than the difference between page table level shifts. (Interestingly the original STEP_SIZE_SHIFT definition had a comment hinting at that having been the intention, just that it should have been PUD_SHIFT-PMD_SHIFT-1 instead of (PUD_SHIFT-PMD_SHIFT)/2, and of course for non-PAE 32-bit we can't really use these two constants as they're equal there.)

Furthermore the comment in get_new_step_size() didn't get updated when the bottom-up mapping logic got added. Yet while an overflow (flushing step_size to zero) of the shift doesn't matter for the top-down method, it does for bottom-up because round_up(x, 0) = 0, and an upper range boundary of zero can't really work well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/54945C1E020000780005114E@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
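Under the scheme described above, the helper reduces to a single shift; a sketch that matches the description, offered as an illustration rather than the exact hunk:

    static unsigned long __init get_new_step_size(unsigned long step_size)
    {
            /*
             * Grow only by as much as the already-mapped step can provide
             * page tables for: one less than the per-level entry shift,
             * so the next step is always fully covered and the shift can
             * never flush step_size to zero.
             */
            return step_size << (PMD_SHIFT - PAGE_SHIFT - 1);
    }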
-
- 18 Dec, 2014 2 commits
-
-
Committed by Christian Borntraeger
ACCESS_ONCE does not work reliably on non-scalar types. For example gcc 4.6 and 4.7 might remove the volatile tag for such accesses during the SRA (scalar replacement of aggregates) step (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145) Change the gup code to replace ACCESS_ONCE with READ_ONCE. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
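The change pattern in the gup code is small; an illustrative before/after:

    /* Before: ACCESS_ONCE() on a non-scalar pte_t/pmd_t may lose its
     * volatile qualifier under gcc 4.6/4.7 SRA and be re-read. */
    pmd_t pmd = ACCESS_ONCE(*pmdp);

    /* After: READ_ONCE() performs a size-appropriate one-shot copy and
     * keeps the once-only semantics even for aggregate types. */
    pmd_t pmd = READ_ONCE(*pmdp);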
-
Committed by Linus Torvalds
My commit 26178ec1 ("x86: mm: consolidate VM_FAULT_RETRY handling") had a really stupid typo: the FAULT_FLAG_USER bit is in the 'flags' variable, not the 'fault' variable. Duh.

The one silver lining in this is that Dave finding this at least confirms that trinity actually triggers this special path easily, in a way normal use does not.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 Dec, 2014 2 commits
-
-
Committed by Linus Torvalds
The VM_FAULT_RETRY handling was confusing and incorrect for the case of returning to kernel mode. We need to handle the exception table fixup if we return to kernel mode due to a fatal signal - it will basically look to the kernel user mode access like the access failed due to the VM going away from under it. Which is correct - the process is dying - and avoids the whole "repeat endless kernel page faults" case.

Handling the VM_FAULT_RETRY early and in just one place also simplifies the mmap_sem handling, since once we've taken care of VM_FAULT_RETRY we know that we can just drop the lock. The remaining accounting and possible error handling is thread-local and does not need the mmap_sem.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Linus Torvalds
This replaces four copies in various stages of mm_fault_error() handling with just a single one. It will also allow for more natural placement of the unlocking after some further cleanup. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 Dec, 2014 1 commit
-
-
Committed by Joonsoo Kim
Now, we have prepared to avoid using debug-pagealloc in boottime. So introduce new kernel-parameter to disable debug-pagealloc in boottime, and makes related functions to be disabled in this case. Only non-intuitive part is change of guard page functions. Because guard page is effective only if debug-pagealloc is enabled, turning off according to debug-pagealloc is reasonable thing to do. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 Dec, 2014 1 commit
-
-
Committed by Xishi Qiu
This is the usual physical memory layout boot printout:

    ...
    [    0.000000] Zone ranges:
    [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
    [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
    [    0.000000]   Normal   [mem 0x100000000-0xc3fffffff]
    [    0.000000] Movable zone start for each node
    [    0.000000] Early memory node ranges
    [    0.000000]   node   0: [mem 0x00001000-0x00099fff]
    [    0.000000]   node   0: [mem 0x00100000-0xbf78ffff]
    [    0.000000]   node   0: [mem 0x100000000-0x63fffffff]
    [    0.000000]   node   1: [mem 0x640000000-0xc3fffffff]
    ...

This is the log when we set "mem=2G" on the boot cmdline:

    ...
    [    0.000000] Zone ranges:
    [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
    [    0.000000]   DMA32    [mem 0x01000000-0xffffffff]  // should be 0x7fffffff, right?
    [    0.000000]   Normal   empty
    [    0.000000] Movable zone start for each node
    [    0.000000] Early memory node ranges
    [    0.000000]   node   0: [mem 0x00001000-0x00099fff]
    [    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
    ...

This patch fixes the printout, the following log shows the right ranges:

    ...
    [    0.000000] Zone ranges:
    [    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
    [    0.000000]   DMA32    [mem 0x01000000-0x7fffffff]
    [    0.000000]   Normal   empty
    [    0.000000] Movable zone start for each node
    [    0.000000] Early memory node ranges
    [    0.000000]   node   0: [mem 0x00001000-0x00099fff]
    [    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
    ...

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Linux MM <linux-mm@kvack.org>
Cc: <dave@sr71.net>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/5487AB3D.6070306@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 08 Dec, 2014 1 commit
-
-
Committed by Rasmus Villemoes
seq_puts is a lot cheaper than seq_printf, so use that to print literal strings. Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Link: http://lkml.kernel.org/r/1417208622-12264-1-git-send-email-linux@rasmusvillemoes.dkSigned-off-by: NIngo Molnar <mingo@kernel.org>
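The conversion is mechanical; an illustrative example (the strings here are made up, not taken from this patch):

    /* Keep seq_printf() where formatting is genuinely needed ... */
    seq_printf(m, "Node %d ", nid);

    /* ... but a constant string has no conversions to parse, so: */
    seq_printf(m, "\nUser-space page tables:\n");    /* before */
    seq_puts(m, "\nUser-space page tables:\n");      /* after  */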
-
- 04 Dec, 2014 1 commit
-
-
Committed by Juergen Gross
Introduces lookup_pmd_address() to get the address of the pmd entry related to a virtual address in the current address space. This function is needed for support of a virtual mapped sparse p2m list in xen pv domains, as we need the address of the pmd entry, not the one of the pte in that case. Signed-off-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
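A sketch of the shape of such a lookup on the kernel page tables of that era (a three/four-level walk without the p4d level later kernels added; treat as an illustration, not the verbatim function):

    pmd_t *lookup_pmd_address(unsigned long address)
    {
            pgd_t *pgd;
            pud_t *pud;

            pgd = pgd_offset_k(address);
            if (pgd_none(*pgd))
                    return NULL;

            pud = pud_offset(pgd, address);
            /* No pmd level to return if the pud is absent or a large mapping. */
            if (pud_none(*pud) || pud_large(*pud) || !pud_present(*pud))
                    return NULL;

            return pmd_offset(pud, address);
    }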
-
- 19 Nov, 2014 2 commits
-
-
Committed by Dave Hansen
get_reg_offset() used to return the register contents themselves instead of the register offset. When it did that, it was an unsigned long. I changed it to return an integer _offset_ instead of the register. But, I neglected to change the return type of the function or the variables in which we store the result of the call.

This fixes up the code to clear up the warnings from the smatch bot:

    New smatch warnings:
    arch/x86/mm/mpx.c:178 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
    arch/x86/mm/mpx.c:184 mpx_get_addr_ref() warn: unsigned 'base_offset' is never less than zero.
    arch/x86/mm/mpx.c:188 mpx_get_addr_ref() warn: unsigned 'indx_offset' is never less than zero.
    arch/x86/mm/mpx.c:196 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20141118182343.C3E0C629@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
Committed by Kees Cook
When setting up permissions on kernel memory at boot, the end of the PMD that was split from bss remained executable. It should be NX like the rest. This performs a PMD alignment instead of a PAGE alignment to get the correct span of memory.

Before:
    ---[ High Kernel Mapping ]---
    ...
    0xffffffff8202d000-0xffffffff82200000  1868K  RW  GLB NX pte
    0xffffffff82200000-0xffffffff82c00000    10M  RW  PSE GLB NX pmd
    0xffffffff82c00000-0xffffffff82df5000  2004K  RW  GLB NX pte
    0xffffffff82df5000-0xffffffff82e00000    44K  RW  GLB x  pte
    0xffffffff82e00000-0xffffffffc0000000   978M  pmd

After:
    ---[ High Kernel Mapping ]---
    ...
    0xffffffff8202d000-0xffffffff82200000  1868K  RW  GLB NX pte
    0xffffffff82200000-0xffffffff82e00000    12M  RW  PSE GLB NX pmd
    0xffffffff82e00000-0xffffffffc0000000   978M  pmd

[ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment. We really should unmap the remainder along with the holes caused by init, initdata etc. but that's a different issue ]

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
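Per the tglx note above, the fix boils down to aligning the end of the brk area to a PMD boundary before computing the span to mark NX; a sketch of the relevant lines in mark_rodata_ro() (illustrative, assuming the surrounding variables of that function):

    /* Extend the NX range to the end of the PMD that the brk/bss tail
     * occupies, instead of stopping at a 4k page boundary and leaving
     * the rest of that PMD executable. */
    unsigned long all_end = roundup(_brk_end, PMD_SIZE);

    set_memory_nx(rodata_start, (all_end - rodata_start) >> PAGE_SHIFT);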
-
- 18 Nov, 2014 4 commits
-
-
Committed by Dave Hansen
The previous patch allocates bounds tables on-demand. As noted in an earlier description, these can add up to *HUGE* amounts of memory. This has caused OOMs in practice when running tests. This patch adds support for freeing bounds tables when they are no longer in use. There are two types of mappings in play when unmapping tables: 1. The mapping with the actual data, which userspace is munmap()ing or brk()ing away, etc... 2. The mapping for the bounds table *backing* the data (is tagged with VM_MPX, see the patch "add MPX specific mmap interface"). If userspace use the prctl() indroduced earlier in this patchset to enable the management of bounds tables in kernel, when it unmaps the first type of mapping with the actual data, the kernel needs to free the mapping for the bounds table backing the data. This patch hooks in at the very end of do_unmap() to do so. We look at the addresses being unmapped and find the bounds directory entries and tables which cover those addresses. If an entire table is unused, we clear associated directory entry and free the table. Once we unmap the bounds table, we would have a bounds directory entry pointing at empty address space. That address space might now be allocated for some other (random) use, and the MPX hardware might now try to walk it as if it were a bounds table. That would be bad. So any unmapping of an enture bounds table has to be accompanied by a corresponding write to the bounds directory entry to invalidate it. That write to the bounds directory can fault, which causes the following problem: Since we are doing the freeing from munmap() (and other paths like it), we hold mmap_sem for write. If we fault, the page fault handler will attempt to acquire mmap_sem for read and we will deadlock. To avoid the deadlock, we pagefault_disable() when touching the bounds directory entry and use a get_user_pages() to resolve the fault. The unmapping of bounds tables happends under vm_munmap(). We also (indirectly) call vm_munmap() to _do_ the unmapping of the bounds tables. We avoid unbounded recursion by disallowing freeing of bounds tables *for* bounds tables. This would not occur normally, so should not have any practical impact. Being strict about it here helps ensure that we do not have an exploitable stack overflow. Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
Committed by Dave Hansen
This is really the meat of the MPX patch set. If there is one patch to review in the entire series, this is the one. There is a new ABI here and this kernel code also interacts with userspace memory in a relatively unusual manner. (small FAQ below). Long Description: This patch adds two prctl() commands to provide enable or disable the management of bounds tables in kernel, including on-demand kernel allocation (See the patch "on-demand kernel allocation of bounds tables") and cleanup (See the patch "cleanup unused bound tables"). Applications do not strictly need the kernel to manage bounds tables and we expect some applications to use MPX without taking advantage of this kernel support. This means the kernel can not simply infer whether an application needs bounds table management from the MPX registers. The prctl() is an explicit signal from userspace. PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to require kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel won't allocate and free bounds tables even if the CPU supports MPX. PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds directory out of a userspace register (bndcfgu) and then cache it into a new field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT will set "bd_addr" to an invalid address. Using this scheme, we can use "bd_addr" to determine whether the management of bounds tables in kernel is enabled. Also, the only way to access that bndcfgu register is via an xsaves, which can be expensive. Caching "bd_addr" like this also helps reduce the cost of those xsaves when doing table cleanup at munmap() time. Unfortunately, we can not apply this optimization to #BR fault time because we need an xsave to get the value of BNDSTATUS. ==== Why does the hardware even have these Bounds Tables? ==== MPX only has 4 hardware registers for storing bounds information. If MPX-enabled code needs more than these 4 registers, it needs to spill them somewhere. It has two special instructions for this which allow the bounds to be moved between the bounds registers and some new "bounds tables". They are similar conceptually to a page fault and will be raised by the MPX hardware during both bounds violations or when the tables are not present. This patch handles those #BR exceptions for not-present tables by carving the space out of the normal processes address space (essentially calling the new mmap() interface indroduced earlier in this patch set.) and then pointing the bounds-directory over to it. The tables *need* to be accessed and controlled by userspace because the instructions for moving bounds in and out of them are extremely frequent. They potentially happen every time a register pointing to memory is dereferenced. Any direct kernel involvement (like a syscall) to access the tables would obviously destroy performance. ==== Why not do this in userspace? ==== This patch is obviously doing this allocation in the kernel. However, MPX does not strictly *require* anything in the kernel. It can theoretically be done completely from userspace. Here are a few ways this *could* be done. I don't think any of them are practical in the real-world, but here they are. Q: Can virtual space simply be reserved for the bounds tables so that we never have to allocate them? A: As noted earlier, these tables are *HUGE*. 
An X-GB virtual area needs 4*X GB of virtual space, plus 2GB for the bounds directory. If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved ahead of time. Also, a single process's pre-popualated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. Q: Can we preallocate bounds table space at the same time memory is allocated which might contain pointers that might eventually need bounds tables? A: This would work if we could hook the site of each and every memory allocation syscall. This can be done for small, constrained applications. But, it isn't practical at a larger scale since a given app has no way of controlling how all the parts of the app might allocate memory (think libraries). The kernel is really the only place to intercept these calls. Q: Could a bounds fault be handed to userspace and the tables allocated there in a signal handler instead of in the kernel? A: (thanks to tglx) mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. Having ruled out all of the userspace-only approaches for managing bounds tables that we could think of, we create them on demand in the kernel. Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
Committed by Dave Hansen
This patch sets bound violation fields of siginfo struct in #BR exception handler by decoding the user instruction and constructing the faulting pointer. We have to be very careful when decoding these instructions. They are completely controlled by userspace and may be changed at any time up to and including the point where we try to copy them in to the kernel. They may or may not be MPX instructions and could be completely invalid for all we know. Note: This code is based on Qiaowei Ren's specialized MPX decoder, but uses the generic decoder whenever possible. It was tested for robustness by generating a completely random data stream and trying to decode that stream. I also unmapped random pages inside the stream to test the "partial instruction" short read code. We kzalloc() the siginfo instead of stack allocating it because we need to memset() it anyway, and doing this makes it much more clear when it got initialized by the MPX instruction decoder. Changes from the old decoder: * Use the generic decoder instead of custom functions. Saved ~70 lines of code overall. * Remove insn->addr_bytes code (never used??) * Make sure never to possibly overflow the regoff[] array, plus check the register range correctly in 32 and 64-bit modes. * Allow get_reg() to return an error and have mpx_get_addr_ref() handle when it sees errors. * Only call insn_get_*() near where we actually use the values instead if trying to call them all at once. * Handle short reads from copy_from_user() and check the actual number of read bytes against what we expect from insn_get_length(). If a read stops in the middle of an instruction, we error out. * Actually check the opcodes intead of ignoring them. * Dynamically kzalloc() siginfo_t so we don't leak any stack data. * Detect and handle decoder failures instead of ignoring them. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151828.5BDD0915@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
Committed by Qiaowei Ren
We have chosen to perform the allocation of bounds tables in kernel (See the patch "on-demand kernel allocation of bounds tables") and to mark these VMAs with VM_MPX. However, there is currently no suitable interface to actually do this. Existing interfaces, like do_mmap_pgoff(), have no way to set a modified ->vm_ops or ->vm_flags and don't hold mmap_sem long enough to let a caller do it. This patch wraps mmap_region() and hold mmap_sem long enough to make the modifications to the VMA which we need. Also note the 32/64-bit #ifdef in the header. We actually need to do this at runtime eventually. But, for now, we don't support running 32-bit binaries on 64-bit kernels. Support for this will come in later patches. Signed-off-by: NQiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151827.CE440F67@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 17 Nov, 2014 1 commit
-
-
Committed by Thomas Gleixner
Commit e00c8cc9 "x86: Use new cache mode type in memtype related functions" broke the ARCH=um build. arch/x86/include/asm/cacheflush.h:67:36: error: return type is an incomplete type static inline enum page_cache_mode get_page_memtype(struct page *pg) The reason is simple. get_page_memtype() and set_page_memtype() require enum page_cache_mode now, which is defined in asm/pgtable_types.h. UM does not include that file for obvious reasons. The simple solution is to move that functions to arch/x86/mm/pat.c where the only callsites of this are located. They should have been there in the first place. Fixes: e00c8cc9 "x86: Use new cache mode type in memtype related functions" Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Juergen Gross <jgross@suse.com> Cc: Richard Weinberger <richard@nod.at>
-