- 19 2月, 2016 1 次提交
-
-
由 Dave Hansen 提交于
We try to enforce protection keys in software the same way that we do in hardware. (See long example below). But, we only want to do this when accessing our *own* process's memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then tried to PTRACE_POKE a target process which just happened to have some mprotect_pkey(pkey=6) memory, we do *not* want to deny the debugger access to that memory. PKRU is fundamentally a thread-local structure and we do not want to enforce it on access to _another_ thread's data. This gets especially tricky when we have workqueues or other delayed-work mechanisms that might run in a random process's context. We can check that we only enforce pkeys when operating on our *own* mm, but delayed work gets performed when a random user context is active. We might end up with a situation where a delayed-work gup fails when running randomly under its "own" task but succeeds when running under another process. We want to avoid that. To avoid that, we use the new GUP flag: FOLL_REMOTE and add a fault flag: FAULT_FLAG_REMOTE. They indicate that we are walking an mm which is not guranteed to be the same as current->mm and should not be subject to protection key enforcement. Thanks to Jerome Glisse for pointing out this scenario. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: Dominik Vogt <vogt@linux.vnet.ibm.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Geliang Tang <geliangtang@163.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Low <jason.low2@hp.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: iommu@lists.linux-foundation.org Cc: linux-arch@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: linux-s390@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 2月, 2016 2 次提交
-
-
由 Dave Hansen 提交于
Lots of things seem to do: vma->vm_page_prot = vm_get_page_prot(flags); and the ptes get created right from things we pull out of ->vm_page_prot. So it is very convenient if we can store the protection key in flags and vm_page_prot, just like the existing permission bits (_PAGE_RW/PRESENT). It greatly reduces the amount of plumbing and arch-specific hacking we have to do in generic code. This also takes the new PROT_PKEY{0,1,2,3} flags and turns *those* in to VM_ flags for vma->vm_flags. The protection key values are stored in 4 places: 1. "prot" argument to system calls 2. vma->vm_flags, filled from the mmap "prot" 3. vma->vm_page prot, filled from vma->vm_flags 4. the PTE itself. The pseudocode for these for steps are as follows: mmap(PROT_PKEY*) vma->vm_flags = ... | arch_calc_vm_prot_bits(mmap_prot); vma->vm_page_prot = ... | arch_vm_get_page_prot(vma->vm_flags); pte = pfn | vma->vm_page_prot Note that this provides a new definitions for x86: arch_vm_get_page_prot() Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210210.FE483A42@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Dave Hansen 提交于
vma->vm_flags is an 'unsigned long', so has space for 32 flags on 32-bit architectures. The high 32 bits are unused on 64-bit platforms. We've steered away from using the unused high VMA bits for things because we would have difficulty supporting it on 32-bit. Protection Keys are not available in 32-bit mode, so there is no concern about supporting this feature in 32-bit mode or on 32-bit CPUs. This patch carves out 4 bits from the high half of vma->vm_flags and allows architectures to set config option to make them available. Sparse complains about these constants unless we explicitly call them "UL". Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Valentin Rothberg <valentinrothberg@gmail.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210208.81AF00D5@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 16 2月, 2016 2 次提交
-
-
由 Dave Hansen 提交于
The concept here was a suggestion from Ingo. The implementation horrors are all mine. This allows get_user_pages(), get_user_pages_unlocked(), and get_user_pages_locked() to be called with or without the leading tsk/mm arguments. We will give a compile-time warning about the old style being __deprecated and we will also WARN_ON() if the non-remote version is used for a remote-style access. Doing this, folks will get nice warnings and will not break the build. This should be nice for -next and will hopefully let developers fix up their own code instead of maintainers needing to do it at merge time. The way we do this is hideous. It uses the __VA_ARGS__ macro functionality to call different functions based on the number of arguments passed to the macro. There's an additional hack to ensure that our EXPORT_SYMBOL() of the deprecated symbols doesn't trigger a warning. We should be able to remove this mess as soon as -rc1 hits in the release after this is merged. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: Geliang Tang <geliangtang@163.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Leon Romanovsky <leon@leon.nu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Dave Hansen 提交于
For protection keys, we need to understand whether protections should be enforced in software or not. In general, we enforce protections when working on our own task, but not when on others. We call these "current" and "remote" operations. This patch introduces a new get_user_pages() variant: get_user_pages_remote() Which is a replacement for when get_user_pages() is called on non-current tsk/mm. We also introduce a new gup flag: FOLL_REMOTE which can be used for the "__" gup variants to get this new behavior. The uprobes is_trap_at_addr() location holds mmap_sem and calls get_user_pages(current->mm) on an instruction address. This makes it a pretty unique gup caller. Being an instruction access and also really originating from the kernel (vs. the app), I opted to consider this a 'remote' access where protection keys will not be enforced. Without protection keys, this patch should not change any behavior. Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: jack@suse.cz Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 2月, 2016 2 次提交
-
-
由 Konstantin Khlebnikov 提交于
* add VM_STACK as alias for VM_GROWSUP/DOWN depending on architecture * always account VMAs with flag VM_STACK as stack (as it was before) * cleanup classifying helpers * update comments and documentation Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Tested-by: NSudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Commit b7643757 ("procfs: mark thread stack correctly in proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps. Finding the task of a stack VMA requires walking the entire thread list, turning this into quadratic behavior: a thousand threads means a thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a million combinations. The cost is not in proportion to the usefulness as described in the patch. Drop the [stack:TID] annotation to make /proc/<pid>/maps (and /proc/<pid>/numa_maps) usable again for higher thread counts. The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as identifying the stack VMA there is an O(1) operation. Siddesh said: "The end users needed a way to identify thread stacks programmatically and there wasn't a way to do that. I'm afraid I no longer remember (or have access to the resources that would aid my memory since I changed employers) the details of their requirement. However, I did do this on my own time because I thought it was an interesting project for me and nobody really gave any feedback then as to its utility, so as far as I am concerned you could roll back the main thread maps information since the information is available in the thread-specific files" Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 1月, 2016 14 次提交
-
-
由 Wang Xiaoqiang 提交于
Since can_do_mlock only return 1 or 0, so make it boolean. No functional change. [akpm@linux-foundation.org: update declaration in mm.h] Signed-off-by: NWang Xiaoqiang <wangxq10@lzu.edu.cn> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dominik Dingel 提交于
During Jason's work with postcopy migration support for s390 a problem regarding gmap faults was discovered. The gmap code will call fixup_user_fault which will end up always in handle_mm_fault. Till now we never cared about retries, but as the userfaultfd code kind of relies on it. this needs some fix. This patchset does not take care of the futex code. I will now look closer at this. This patch (of 2): With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the faulting we ever unlocked mmap_sem. This patch brings in the logic to handle retries as well as it cleans up the current documentation. fixup_user_fault was not having the same semantics as filemap_fault. It never indicated if a retry happened and so a caller wasn't able to handle that case. So we now changed the behaviour to always retry a locked mmap_sem. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Eric B Munson <emunson@akamai.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver has established a devm_memremap_pages() mapping, i.e. when the pfn_t return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when encountering _PAGE_DEVMAP during a page table walk we lookup and pin a struct dev_pagemap instance to keep the result of pfn_to_page() valid until put_page(). Signed-off-by: NDan Williams <dan.j.williams@intel.com> Tested-by: NLogan Gunthorpe <logang@deltatee.com> Cc: Dave Hansen <dave@sr71.net> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
A dax-huge-page mapping while it uses some thp helpers is ultimately not a transparent huge page. The distinction is especially important in the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds from pmd_huge() and pmd_trans_huge() which have slightly different semantics. Explicitly mark the pmd_trans_huge() helpers that dax needs by adding pmd_devmap() checks. [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()] Signed-off-by: NDan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers _PAGE_DEVMAP to be set in the resulting pte. There are no functional changes to the gpu drivers as a result of this conversion. Signed-off-by: NDan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: David Airlie <airlied@linux.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
In support of providing struct page for large persistent memory capacities, use struct vmem_altmap to change the default policy for allocating memory for the memmap array. The default vmemmap_populate() allocates page table storage area from the page allocator. Given persistent memory capacities relative to DRAM it may not be feasible to store the memmap in 'System Memory'. Instead vmem_altmap represents pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() requests. Signed-off-by: NDan Williams <dan.j.williams@intel.com> Reported-by: Nkbuild test robot <lkp@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
Prevent userspace from trying and failing to online ZONE_DEVICE pages which are meant to never be onlined. For example on platforms with a udev rule like the following: SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online" ...will generate futile attempts to online the ZONE_DEVICE sections. Example kernel messages: Built 1 zonelists in Node order, mobility grouping on. Total pages: 1004747 Policy zone: Normal online_pages [mem 0x248000000-0x24fffffff] failed Signed-off-by: NDan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Both page_referenced() and page_idle_clear_pte_refs_one() assume that THP can only be mapped with PMD, so there's no reason to look on PTEs for PageTransHuge() pages. That's no true anymore: THP can be mapped with PTEs too. The patch removes PageTransHuge() test from the functions and opencode page table check. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Currently we don't split huge page on partial unmap. It's not an ideal situation. It can lead to memory overhead. Furtunately, we can detect partial unmap on page_remove_rmap(). But we cannot call split_huge_page() from there due to locking context. It's also counterproductive to do directly from munmap() codepath: in many cases we will hit this from exit(2) and splitting the huge page just to free it up in small pages is not what we really want. The patch introduce deferred_split_huge_page() which put the huge page into queue for splitting. The splitting itself will happen when we get memory pressure via shrinker interface. The page will be dropped from list on freeing through compound page destructor. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Some mm-related BUG_ON()s could trigger from hwpoison code due to recent changes in thp refcounting rule. This patch fixes them up. In the new refcounting, we no longer use tail->_mapcount to keep tail's refcount, and thereby we can simplify get/put_hwpoison_page(). And another change is that tail's refcount is not transferred to the raw page during thp split (more precisely, in new rule we don't take refcount on tail page any more.) So when we need thp split, we have to transfer the refcount properly to the 4kB soft-offlined page before migration. thp split code goes into core code only when precheck (total_mapcount(head) == page_count(head) - 1) passes to avoid useless split, where we assume that one refcount is held by the caller of thp split and the others are taken via mapping. To meet this assumption, this patch moves thp split part in soft_offline_page() after get_any_page(). [akpm@linux-foundation.org: remove unneeded #define, per Kirill] Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Let's define page_mapped() to be true for compound pages if any sub-pages of the compound page is mapped (with PMD or PTE). On other hand page_mapcount() return mapcount for this particular small page. This will make cases like page_get_anon_vma() behave correctly once we allow huge pages to be mapped with PTE. Most users outside core-mm should use page_mapcount() instead of page_mapped(). Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We're going to allow mapping of individual 4k pages of THP compound. It means we need to track mapcount on per small page basis. Straight-forward approach is to use ->_mapcount in all subpages to track how many time this subpage is mapped with PMDs or PTEs combined. But this is rather expensive: mapping or unmapping of a THP page with PMD would require HPAGE_PMD_NR atomic operations instead of single we have now. The idea is to store separately how many times the page was mapped as whole -- compound_mapcount. This frees up ->_mapcount in subpages to track PTE mapcount. We use the same approach as with compound page destructor and compound order to store compound_mapcount: use space in first tail page, ->mapping this time. Any time we map/unmap whole compound page (THP or hugetlb) -- we increment/decrement compound_mapcount. When we map part of compound page with PTE we operate on ->_mapcount of the subpage. page_mapcount() counts both: PTE and PMD mappings of the page. Basically, we have mapcount for a subpage spread over two counters. It makes tricky to detect when last mapcount for a page goes away. We introduced PageDoubleMap() for this. When we split THP PMD for the first time and there's other PMD mapping left we offset up ->_mapcount in all subpages by one and set PG_double_map on the compound page. These additional references go away with last compound_mapcount. This approach provides a way to detect when last mapcount goes away on per small page basis without introducing new overhead for most common cases. [akpm@linux-foundation.org: fix typo in comment] [mhocko@suse.com: ignore partial THP when moving task] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We are going to use migration entries to stabilize page counts. It means we don't need compound_lock() for that. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Tail page refcounting is utterly complicated and painful to support. It uses ->_mapcount on tail pages to store how many times this page is pinned. get_page() bumps ->_mapcount on tail page in addition to ->_count on head. This information is required by split_huge_page() to be able to distribute pins from head of compound page to tails during the split. We will need ->_mapcount to account PTE mappings of subpages of the compound page. We eliminate need in current meaning of ->_mapcount in tail pages by forbidding split entirely if the page is pinned. The only user of tail page refcounting is THP which is marked BROKEN for now. Let's drop all this mess. It makes get_page() and put_page() much simpler. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 1月, 2016 4 次提交
-
-
由 Konstantin Khlebnikov 提交于
When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which testing the RLIMIT_DATA value to figure out if we're allowed to assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been commited that RLIMIT_DATA in a form it's implemented now doesn't do anything useful because most of user-space libraries use mmap() syscall for dynamic memory allocations. Linus suggested to convert RLIMIT_DATA rlimit into something suitable for anonymous memory accounting. But in this patch we go further, and the changes are bundled together as: * keep vma counting if CONFIG_PROC_FS=n, will be used for limits * replace mm->shared_vm with better defined mm->data_vm * account anonymous executable areas as executable * account file-backed growsdown/up areas as stack * drop struct file* argument from vm_stat_account * enforce RLIMIT_DATA for size of data areas This way code looks cleaner: now code/stack/data classification depends only on vm_flags state: VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc) VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk) VM_WRITE & ~VM_SHARED & !stack -> data (VmData) The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared", but that might be strange beast like readonly-private or VM_IO area. - RLIMIT_AS limits whole address space "VmSize" - RLIMIT_STACK limits stack "VmStk" (but each vma individually) - RLIMIT_DATA now limits "VmData" Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kees Cook <keescook@google.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
page_cache_read has been historically using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems are setting this mask to GFP_NOFS to prevent from fs recursion issues. page_cache_read is called from the vm_operations_struct::fault() context during the page fault. This context doesn't need the reclaim protection normally. ceph and ocfs2 which call filemap_fault from their fault handlers seem to be OK because they are not taking any fs lock before invoking generic implementation. xfs which takes XFS_MMAPLOCK_SHARED is safe from the reclaim recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim. There is simply no reason to deliberately use a weaker allocation context when a __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection might be even harmful. There is a push to fail GFP_NOFS allocations rather than loop within allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and end up in pagefault_out_of_memory. We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on NOFS semantic from the stored gfp_mask. The mask is also inode proper so it would even be a layering violation. What we can do instead is to push the gfp_mask into struct vm_fault and allow fs layer to overwrite it should the callback need to be called with a different allocation context. Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO) because this should be safe from the page fault path normally. Why do we care about mapping_gfp_mask at all then? Because this doesn't hold only reclaim protection flags but it also might contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others) so we have to respect those. Signed-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NJan Kara <jack@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daniel Cashman 提交于
Address Space Layout Randomization (ASLR) provides a barrier to exploitation of user-space processes in the presence of security vulnerabilities by making it more difficult to find desired code/data which could help an attack. This is done by adding a random offset to the location of regions in the process address space, with a greater range of potential offset values corresponding to better protection/a larger search-space for brute force, but also to greater potential for fragmentation. The offset added to the mmap_base address, which provides the basis for the majority of the mappings for a process, is set once on process exec in arch_pick_mmap_layout() and is done via hard-coded per-arch values, which reflect, hopefully, the best compromise for all systems. The trade-off between increased entropy in the offset value generation and the corresponding increased variability in address space fragmentation is not absolute, however, and some platforms may tolerate higher amounts of entropy. This patch introduces both new Kconfig values and a sysctl interface which may be used to change the amount of entropy used for offset generation on a system. The direct motivation for this change was in response to the libstagefright vulnerabilities that affected Android, specifically to information provided by Google's project zero at: http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html The attack presented therein, by Google's project zero, specifically targeted the limited randomness used to generate the offset added to the mmap_base address in order to craft a brute-force-based attack. Concretely, the attack was against the mediaserver process, which was limited to respawning every 5 seconds, on an arm device. The hard-coded 8 bits used resulted in an average expected success rate of defeating the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a piece). With this patch, and an accompanying increase in the entropy value to 16 bits, the same attack would take an average expected time of over 45 hours (32768 tries), which makes it both less feasible and more likely to be noticed. The introduced Kconfig and sysctl options are limited by per-arch minimum and maximum values, the minimum of which was chosen to match the current hard-coded value and the maximum of which was chosen so as to give the greatest flexibility without generating an invalid mmap_base address, generally a 3-4 bits less than the number of bits in the user-space accessible virtual address space. When decided whether or not to change the default value, a system developer should consider that mmap_base address could be placed anywhere up to 2^(value) bits away from the non-randomized location, which would introduce variable-sized areas above and below the mmap_base address such that the maximum vm_area_struct size may be reduced, preventing very large allocations. This patch (of 4): ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Signed-off-by: NDaniel Cashman <dcashman@google.com> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: NKees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hector Marco-Gisbert <hecmargi@upv.es> Cc: Borislav Petkov <bp@suse.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jerome Marchand 提交于
Currently looking at /proc/<pid>/status or statm, there is no way to distinguish shmem pages from pages mapped to a regular file (shmem pages are mapped to /dev/zero), even though their implication in actual memory use is quite different. The internal accounting currently counts shmem pages together with regular files. As a preparation to extend the userspace interfaces, this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for shmem pages separately from MM_FILEPAGES. The next patch will expose it to userspace - this patch doesn't change the exported values yet, by adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was used before. The only user-visible change after this patch is the OOM killer message that separates the reported "shmem-rss" from "file-rss". [vbabka@suse.cz: forward-porting, tweak changelog] Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 1月, 2016 1 次提交
-
-
由 Andy Lutomirski 提交于
The x86 vvar vma contains pages with differing cacheability flags. x86 currently implements this by manually inserting all the ptes using (io_)remap_pfn_range when the vma is set up. x86 wants to move to using .fault with VM_FAULT_NOPAGE to set up the mappings as needed. The correct API to use to insert a pfn in .fault is vm_insert_pfn(), but vm_insert_pfn() can't override the vma's cache mode, and the HPET page in particular needs to be uncached despite the fact that the rest of the VMA is cached. Add vm_insert_pfn_prot() to support varying cacheability within the same non-COW VMA in a more sane manner. x86 could alternatively use multiple VMAs, but that's messy, would break CRIU, and would create unnecessary VMAs that would waste memory. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Reviewed-by: NKees Cook <keescook@chromium.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/d2938d1eb37be7a5e4f86182db646551f11e45aa.1451446564.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 07 11月, 2015 3 次提交
-
-
由 Kirill A. Shutemov 提交于
Let's try to be consistent about data type of page order. [sfr@canb.auug.org.au: fix build (type of pageblock_order)] [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
The patch halves space occupied by compound_dtor and compound_order in struct page. For compound_order, it's trivial long -> short conversion. For get_compound_page_dtor(), we now use hardcoded table for destructor lookup and store its index in the struct page instead of direct pointer to destructor. It shouldn't be a big trouble to maintain the table: we have only two destructor and NULL currently. This patch free up one word in tail pages for reuse. This is preparation for the next patch. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 11月, 2015 3 次提交
-
-
由 Eric B Munson 提交于
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: NEric B Munson <emunson@akamai.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we shouldn't increment NR_PAGETABLE. Since small allocations, such as ptlock, normally do not fail (currently they can fail if kmemcg is used though), this patch does not really fix anything and should be considered as a code cleanup. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Roman Gushchin 提交于
Maximal readahead size is limited now by two values: 1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead()) 2) by configurable per-device value* (bdi->ra_pages) There are devices, which require custom readahead limit. For instance, for RAIDs it's calculated as number of devices multiplied by chunk size times 2. Readahead size can never be larger than bdi->ra_pages * 2 value (POSIX_FADV_SEQUNTIAL doubles readahead size). If so, why do we need two limits? I suggest to completely remove this max_sane_readahead() stuff and use per-device readahead limit everywhere. Also, using right readahead size for RAID disks can significantly increase i/o performance: before: dd if=/dev/md2 of=/dev/null bs=100M count=100 100+0 records in 100+0 records out 10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s after: $ dd if=/dev/md2 of=/dev/null bs=100M count=100 100+0 records in 100+0 records out 10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s (It's an 8-disks RAID5 storage). This patch doesn't change sys_readahead and madvise(MADV_WILLNEED) behavior introduced by 6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages"). Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: onstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 10月, 2015 1 次提交
-
-
由 Greg Thelen 提交于
The problem starts with a file backed dirty page which is charged to a memcg. Then page migration is used to move oldpage to newpage. Migration: - copies the oldpage's data to newpage - clears oldpage.PG_dirty - sets newpage.PG_dirty - uncharges oldpage from memcg - charges newpage to memcg Clearing oldpage.PG_dirty decrements the charged memcg's dirty page count. However, because newpage is not yet charged, setting newpage.PG_dirty does not increment the memcg's dirty page count. After migration completes newpage.PG_dirty is eventually cleared, often in account_page_cleaned(). At this time newpage is charged to a memcg so the memcg's dirty page count is decremented which causes underflow because the count was not previously incremented by migration. This underflow causes balance_dirty_pages() to see a very large unsigned number of dirty memcg pages which leads to aggressive throttling of buffered writes by processes in non root memcg. This issue: - can harm performance of non root memcg buffered writes. - can report too small (even negative) values in memory.stat[(total_)dirty] counters of all memcg, including the root. To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce page_memcg() and set_page_memcg() helpers. Test: 0) setup and enter limited memcg mkdir /sys/fs/cgroup/test echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes echo $$ > /sys/fs/cgroup/test/cgroup.procs 1) buffered writes baseline dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat 2) buffered writes with compaction antagonist to induce migration yes 1 > /proc/sys/vm/compact_memory & rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k kill % sync grep ^dirty /sys/fs/cgroup/test/memory.stat 3) buffered writes without antagonist, should match baseline rm -rf /data/tmp/foo dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k sync grep ^dirty /sys/fs/cgroup/test/memory.stat (speed, dirty residue) unpatched patched 1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages 2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages 3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages Notice that unpatched baseline performance (1) fell after migration (3): 841 -> 114 MB/s. In the patched kernel, post migration performance matches baseline. Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting") Signed-off-by: NGreg Thelen <gthelen@google.com> Reported-by: NDave Hansen <dave.hansen@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 9月, 2015 1 次提交
-
-
由 Oleg Nesterov 提交于
Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(), rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple wrapper on top of do_mmap(). Perhaps we should update the callers of do_mmap_pgoff() and kill it later. This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) and do not play with vm internals. After this change mmap_region() has a single user outside of mmap.c, arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to change arch/tile/ and unexport mmap_region(). [kirill@shutemov.name: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NDave Hansen <dave.hansen@linux.intel.com> Tested-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Minchan Kim <minchan@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 9月, 2015 5 次提交
-
-
由 Wanpeng Li 提交于
Introduce put_hwpoison_page to put refcount for memory error handling. Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com> Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The pair of get/set_freepage_migratetype() functions are used to cache pageblock migratetype for a page put on a pcplist, so that it does not have to be retrieved again when the page is put on a free list (e.g. when pcplists become full). Historically it was also assumed that the value is accurate for pages on freelists (as the functions' names unfortunately suggest), but that cannot be guaranteed without affecting various allocator fast paths. It is in fact not needed and all such uses have been removed. The last remaining (but pointless) usage related to pages of freelists is in move_freepages(), which this patch removes. To prevent further confusion, rename the functions to get/set_pcppage_migratetype() and expand their description. Since all the users are now in mm/page_alloc.c, move the functions there from the shared header. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Seungho Park <seungho1.park@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vineet Gupta 提交于
It has no callers. Signed-off-by: NVineet Gupta <vgupta@synopsys.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
Allow non-anonymous VMAs to provide huge pages in response to a page fault. Signed-off-by: NMatthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
special_mapping_fault() is absolutely broken. It seems it was always wrong, but this didn't matter until vdso/vvar started to use more than one page. And after this change vma_is_anonymous() becomes really trivial, it simply checks vm_ops == NULL. However, I do think the helper makes sense. There are a lot of ->vm_ops != NULL checks, the helper makes the caller's code more understandable (self-documented) and this is more grep-friendly. This patch (of 3): Preparation. Add the new simple helper, vma_is_anonymous(vma), and change handle_pte_fault() to use it. It will have more users. The name is not accurate, say a hpet_mmap()'ed vma is not anonymous. Perhaps it should be named vma_has_fault() instead. But it matches the logic in mmap.c/memory.c (see next changes). "True" just means that a page fault will use do_anonymous_page(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 9月, 2015 1 次提交
-
-
由 Oleg Nesterov 提交于
vma->vm_ops->mremap() looks more natural and clean in move_vma(), and this way ->mremap() can have more users. Say, vdso. While at it, s/aio_ring_remap/aio_ring_mremap/. Note: this is the minimal change before ->mremap() finds another user in file_operations; this method should have more arguments, and it can be used to kill arch_remap(). Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NPavel Emelyanov <xemul@parallels.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-