- 11 2月, 2015 2 次提交
-
-
由 Kirill A. Shutemov 提交于
We don't create non-linear mappings anymore. Let's drop code which handles them on page fault. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We have remap_file_pages(2) emulation in -mm tree for few release cycles and we plan to have it mainline in v3.20. This patchset removes rest of VM_NONLINEAR infrastructure. Patches 1-8 take care about generic code. They are pretty straight-forward and can be applied without other of patches. Rest patches removes pte_file()-related stuff from architecture-specific code. It usually frees up one bit in non-present pte. I've tried to reuse that bit for swap offset, where I was able to figure out how to do that. For obvious reason I cannot test all that arch-specific code and would like to see acks from maintainers. In total, remap_file_pages(2) required about 1.4K lines of not-so-trivial kernel code. That's too much for functionality nobody uses. Tested-by: NFelipe Balbi <balbi@ti.com> This patch (of 38): We don't create non-linear mappings anymore. Let's drop code which handles them on unmap/zap. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 1月, 2015 1 次提交
-
-
由 Linus Torvalds 提交于
The stack guard page error case has long incorrectly caused a SIGBUS rather than a SIGSEGV, but nobody actually noticed until commit fee7e49d ("mm: propagate error from stack expansion even for guard page") because that error case was never actually triggered in any normal situations. Now that we actually report the error, people noticed the wrong signal that resulted. So far, only the test suite of libsigsegv seems to have actually cared, but there are real applications that use libsigsegv, so let's not wait for any of those to break. Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de> Tested-by: NJan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 1月, 2015 1 次提交
-
-
由 Will Deacon 提交于
When batching up address ranges for TLB invalidation, we check tlb->end != 0 to indicate that some pages have actually been unmapped. As of commit f045bbb9 ("mmu_gather: fix over-eager tlb_flush_mmu_free() calling"), we use the same check for freeing these pages in order to avoid a performance regression where we call free_pages_and_swap_cache even when no pages are actually queued up. Unfortunately, the range could have been reset (tlb->end = 0) by tlb_end_vma, which has been shown to cause memory leaks on arm64. Furthermore, investigation into these leaks revealed that the fullmm case on task exit no longer invalidates the TLB, by virtue of tlb->end == 0 (in 3.18, need_flush would have been set). This patch resolves the problem by reverting commit f045bbb9, using instead tlb->local.nr as the predicate for page freeing in tlb_flush_mmu_free and ensuring that tlb->end is initialised to a non-zero value in the fullmm case. Tested-by: NMark Langsdorf <mlangsdo@redhat.com> Tested-by: NDave Hansen <dave@sr71.net> Signed-off-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 1月, 2015 1 次提交
-
-
由 Johannes Weiner 提交于
Tejun, while reviewing the code, spotted the following race condition between the dirtying and truncation of a page: __set_page_dirty_nobuffers() __delete_from_page_cache() if (TestSetPageDirty(page)) page->mapping = NULL if (PageDirty()) dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); if (page->mapping) account_page_dirtied(page) __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE. Dirtiers usually lock out truncation, either by holding the page lock directly, or in case of zap_pte_range(), by pinning the mapcount with the page table lock held. The notable exception to this rule, though, is do_wp_page(), for which this race exists. However, do_wp_page() already waits for a locked page to unlock before setting the dirty bit, in order to prevent a race where clear_page_dirty() misses the page bit in the presence of dirty ptes. Upgrade that wait to a fully locked set_page_dirty() to also cover the situation explained above. Afterwards, the code in set_page_dirty() dealing with a truncation race is no longer needed. Remove it. Reported-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 1月, 2015 1 次提交
-
-
由 Linus Torvalds 提交于
Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: NJay Foad <jay.foad@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 12月, 2014 1 次提交
-
-
由 Kirill A. Shutemov 提交于
This reverts commit c8475d14. There are several[1][2] of bug reports which points to this commit as potential cause[3]. Let's revert it until we figure out what's going on. [1] https://lkml.org/lkml/2014/11/14/342 [2] https://lkml.org/lkml/2014/12/22/213 [3] https://lkml.org/lkml/2014/12/9/741Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NSasha Levin <sasha.levin@oracle.com> Acked-by: NDavidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 12月, 2014 1 次提交
-
-
由 Andrew Morton 提交于
Belatedly document the changes in commit f0c6d4d2 ("mm: introduce do_shared_fault() and drop do_fault()"). Cc: Andi Kleen <ak@linux.intel.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 12月, 2014 2 次提交
-
-
由 Christian Borntraeger 提交于
ACCESS_ONCE does not work reliably on non-scalar types. For example gcc 4.6 and 4.7 might remove the volatile tag for such accesses during the SRA (scalar replacement of aggregates) step (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145) Let's change the code to access the page table elements with READ_ONCE that does implicit scalar accesses for the gup code. mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t as array of longs. This code requires just that the pmd_present and pmd_trans_huge check are done on the same value, so a barrier is sufficent. A similar case is in handle_pte_fault. On ppc44x the word size is 32 bit, but a pte is 64 bit. A barrier is ok as well. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: linux-mm@kvack.org Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
-
由 Linus Torvalds 提交于
Dave Hansen reports that commit fb7332a9 ("mmu_gather: move minimal range calculations into generic code") caused a performance problem: "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch applied, but does not show up at all on the commit before" and the reason is that Will moved the test for whether we need to flush from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that tlb_flush_mmu_free() basically lost that check. Move it back into tlb_flush_mmu() where it belongs, so that it covers both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free(). Reported-and-tested-by: NDave Hansen <dave@sr71.net> Acked-by: NWill Deacon <will.deacon@arm.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 12月, 2014 3 次提交
-
-
由 Jesse Barnes 提交于
This lets drivers like the AMD IOMMUv2 driver handle faults a bit more simply, rather than doing tricks with page refs and get_user_pages(). Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org> Cc: Oded Gabbay <oded.gabbay@amd.com> Cc: Joerg Roedel <jroedel@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
The unmap_mapping_range family of functions do the unmapping of user pages (ultimately via zap_page_range_single) without touching the actual interval tree, thus share the lock. Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: NHugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: NHugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 12月, 2014 1 次提交
-
-
由 Hugh Dickins 提交于
I've been seeing swapoff hangs in recent testing: it's cycling around trying unsuccessfully to find an mm for some remaining pages of swap. I have been exercising swap and page migration more heavily recently, and now notice a long-standing error in copy_one_pte(): it's trying to add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing so even when it's a migration entry or an hwpoison entry. Which wouldn't matter much, except it adds dst_mm next to src_mm, assuming src_mm is already on the mmlist: which may not be so. Then if pages are later swapped out from dst_mm, swapoff won't be able to find where to replace them. There's already a !non_swap_entry() test for stats: move that up before the swap_duplicate() and the addition to mmlist. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: <stable@vger.kernel.org> [2.6.18+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 11月, 2014 1 次提交
-
-
由 Will Deacon 提交于
On architectures with hardware broadcasting of TLB invalidation messages , it makes sense to reduce the range of the mmu_gather structure when unmapping page ranges based on the dirty address information passed to tlb_remove_tlb_entry. arm64 already does this by directly manipulating the start/end fields of the gather structure, but this confuses the generic code which does not expect these fields to change and can end up calculating invalid, negative ranges when forcing a flush in zap_pte_range. This patch moves the minimal range calculation out of the arm64 code and into the generic implementation, simplifying zap_pte_range in the process (which no longer needs to care about start/end, since they will point to the appropriate ranges already). With the range being tracked by core code, the need_flush flag is dropped in favour of checking that the end of the range has actually been set. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Michal Simek <monstr@monstr.eu> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NWill Deacon <will.deacon@arm.com>
-
- 29 10月, 2014 1 次提交
-
-
由 Will Deacon 提交于
When unmapping a range of pages in zap_pte_range, the page being unmapped is added to an mmu_gather_batch structure for asynchronous freeing. If we run out of space in the batch structure before the range has been completely unmapped, then we break out of the loop, force a TLB flush and free the pages that we have batched so far. If there are further pages to unmap, then we resume the loop where we left off. Unfortunately, we forget to update addr when we break out of the loop, which causes us to truncate the range being invalidated as the end address is exclusive. When we re-enter the loop at the same address, the page has already been freed and the pte_present test will fail, meaning that we do not reconsider the address for invalidation. This patch fixes the problem by incrementing addr by the PAGE_SIZE before breaking out of the loop on batch failure. Signed-off-by: NWill Deacon <will.deacon@arm.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 10月, 2014 1 次提交
-
-
由 Dominik Dingel 提交于
Add a new function stub to allow architectures to disable for an mm_structthe backing of non-present, anonymous pages with read-only empty zero pages. Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 14 10月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
For VMAs that don't want write notifications, PTEs created for read faults have their write bit set. If the read fault happens after VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain clear after subsequent writes. Here's a simple code snippet to demonstrate the bug: char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0); system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */ assert(*m == '\0'); /* new PTE allows write access */ assert(!soft_dirty(x)); *m = 'x'; /* should dirty the page */ assert(soft_dirty(x)); /* fails */ With this patch, write notifications are enabled when VM_SOFTDIRTY is cleared. Furthermore, to avoid unnecessary faults, write notifications are disabled when VM_SOFTDIRTY is set. As a side effect of enabling and disabling write notifications with care, this patch fixes a bug in mprotect where vm_page_prot bits set by drivers were zapped on mprotect. An analogous bug was fixed in mmap by commit c9d0bf24 ("mm: uncached vma support with writenotify"). Signed-off-by: NPeter Feiner <pfeiner@google.com> Reported-by: NPeter Feiner <pfeiner@google.com> Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 9月, 2014 1 次提交
-
-
由 Peter Feiner 提交于
This fixes the same bug as b43790ee ("mm: softdirty: don't forget to save file map softdiry bit on unmap") and 9aed8614 ("mm/memory.c: don't forget to set softdirty on file mapped fault") where the return value of pte_*mksoft_dirty was being ignored. To be sure that no other pte/pmd "mk" function return values were being ignored, I annotated the functions in arch/x86/include/asm/pgtable.h with __must_check and rebuilt. The userspace effect of this bug is that the softdirty mark might be lost if a file mapped pte get zapped. Signed-off-by: NPeter Feiner <pfeiner@google.com> Acked-by: NCyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 9月, 2014 1 次提交
-
-
由 Ard Biesheuvel 提交于
In order to make the static inline function is_zero_pfn() callable by modules, export its symbol dependencies 'zero_pfn' and (for s390 and mips) 'zero_page_mask'. We need this for KVM, as CONFIG_KVM is a tristate for all supported architectures except ARM and arm64, and testing a pfn whether it refers to the zero page is required to correctly distinguish the zero page from other special RAM ranges that may also have the PG_reserved bit set, but need to be treated as MMIO memory. Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 30 8月, 2014 1 次提交
-
-
由 Hugh Dickins 提交于
Sasha Levin has shown oopses on ffffea0003480048 and ffffea0003480008 at mm/memory.c:1132, running Trinity on different 3.16-rc-next kernels: where zap_pte_range() checks page->mapping to see if PageAnon(page). Those addresses fit struct pages for pfns d2001 and d2000, and in each dump a register or a stack slot showed d2001730 or d2000730: pte flags 0x730 are PCD ACCESSED PROTNONE SPECIAL IOMAP; and Sasha's e820 map has a hole between cfffffff and 100000000, which would need special access. Commit c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") has broken vm_normal_page(): a PROTNONE SPECIAL pte no longer passes the pte_special() test, so zap_pte_range() goes on to try to access a non-existent struct page. Fix this by refining pte_special() (SPECIAL with PRESENT or PROTNONE) to complement pte_numa() (SPECIAL with neither PRESENT nor PROTNONE). A hint that this was a problem was that c46a7c81 added pte_numa() test to vm_normal_page(), and moved its is_zero_pfn() test from slow to fast path: This was papering over a pte_special() snag when the zero page was encountered during zap. This patch reverts vm_normal_page() to how it was before, relying on pte_special(). It still appears that this patch may be incomplete: aren't there other places which need to be handling PROTNONE along with PRESENT? For example, pte_mknuma() clears _PAGE_PRESENT and sets _PAGE_NUMA, but on a PROT_NONE area, that would make it pte_special(). This is side-stepped by the fact that NUMA hinting faults skipped PROT_NONE VMAs and there are no grounds where a NUMA hinting fault on a PROT_NONE VMA would be interesting. Fixes: c46a7c81 ("x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels") Reported-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: <stable@vger.kernel.org> [3.16] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 8月, 2014 3 次提交
-
-
由 Andy Lutomirski 提交于
The core mm code will provide a default gate area based on FIXADDR_USER_START and FIXADDR_USER_END if !defined(__HAVE_ARCH_GATE_AREA) && defined(AT_SYSINFO_EHDR). This default is only useful for ia64. arm64, ppc, s390, sh, tile, 64-bit UML, and x86_32 have their own code just to disable it. arm, 32-bit UML, and x86_64 have gate areas, but they have their own implementations. This gets rid of the default and moves the code into ia64. This should save some code on architectures without a gate area: it's now possible to inline the gate_area functions in the default case. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Acked-by: NNathan Lynch <nathan_lynch@mentor.com> Acked-by: NH. Peter Anvin <hpa@linux.intel.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [in principle] Acked-by: Richard Weinberger <richard@nod.at> [for um] Acked-by: Will Deacon <will.deacon@arm.com> [for arm64] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Nathan Lynch <Nathan_Lynch@mentor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable] [vdavydov@parallels.com: fix flags definition] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Tested-by: NJet Chen <jet.chen@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Tested-by: NFelipe Balbi <balbi@ti.com> Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [hughd@google.com: fix shmem_unuse] [hughd@google.com: Add comments on the private use of -EAGAIN] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 8月, 2014 7 次提交
-
-
由 Rik van Riel 提交于
This patch changes confusing #ifdef use in __access_remote_vm into merely ugly #ifdef use. Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651Signed-off-by: NRik van Riel <riel@redhat.com> Reported-by: NDavid Binderman <dcb314@hotmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
fault_around_bytes can only be changed via debugfs. Let's mark it read-mostly. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: NDavid Rientjes <rientjes@google.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Things can go wrong if fault_around_bytes will be changed under do_fault_around(): between fault_around_mask() and fault_around_pages(). Let's read fault_around_bytes only once during do_fault_around() and calculate mask based on the reading. Note: fault_around_bytes can only be updated via debug interface. Also I've tried but was not able to trigger a bad behaviour without the patch. So I would not consider this patch as urgent. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Paul Cassella 提交于
Add a comment describing the circumstances in which __lock_page_or_retry() will or will not release the mmap_sem when returning 0. Add comments to lock_page_or_retry()'s callers (filemap_fault(), do_swap_page()) noting the impact on VM_FAULT_RETRY returns. Add comments on up the call tree, particularly replacing the false "We return with mmap_sem still held" comments. Signed-off-by: NPaul Cassella <cassella@cray.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
Otherwise we may not notice that pte was softdirty because pte_mksoft_dirty helper _returns_ new pte but doesn't modify the argument. In case if page fault happend on dirty filemapping the newly created pte may loose softdirty bit thus if a userspace program is tracking memory changes with help of a memory tracker (CONFIG_MEM_SOFT_DIRTY) it might miss modification of a memory page (which in worts case may lead to data inconsistency). Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NPavel Emelyanov <xemul@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jerome Marchand 提交于
Commit 71e3aac0 ("thp: transparent hugepage core") adds copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this function have been used outside of memory.c, but it currently isn't. This patch makes copy_pte_range() static again. Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte upon which all subsequent decisions and pte_same() tests will be made. I have no evidence that its lack is responsible for the mm/filemap.c:202 BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity, and I am not optimistic that it will fix it. But I have found no other explanation, and ACCESS_ONCE() here will surely not hurt. If gcc does re-access the pte before passing it down, then that would be disastrous for correct page fault handling, and certainly could explain the page_mapped() BUGs seen (concurrent fault causing page to be mapped in a second time on top of itself: mapcount 2 for a single pte). Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 7月, 2014 1 次提交
-
-
由 Andrey Ryabinin 提交于
do_fault_around() expects fault_around_bytes rounded down to nearest page order. Instead of calling rounddown_pow_of_two every time in fault_around_pages()/fault_around_mask() we could do round down when user changes fault_around_bytes via debugfs interface. This also fixes bug when user set fault_around_bytes to 0. Result of rounddown_pow_of_two(0) is not defined, therefore fault_around_bytes == 0 doesn't work without this patch. Let's set fault_around_bytes to PAGE_SIZE if user sets to something less than PAGE_SIZE [akpm@linux-foundation.org: tweak code layout] Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order") Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Reported-by: NSasha Levin <sasha.levin@oracle.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [3.15.x] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 7月, 2014 1 次提交
-
-
由 Konstantin Khlebnikov 提交于
Ingo Korb reported that "repeated mapping of the same file on tmpfs using remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when the process exits". He bisected the bug to d7c17551 ("mm: implement ->map_pages for shmem/tmpfs"), although the bug was actually added by commit 8c6e50b0 ("mm: introduce vm_ops->map_pages()"). The problem is caused by calling do_fault_around for a _non-linear_ fault. In this case pgoff is shifted and might become negative during calculation. Faulting around non-linear page-fault makes no sense and breaks the logic in do_fault_around because pgoff is shifted. Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com> Reported-by: NIngo Korb <ingo.korb@tu-dortmund.de> Tested-by: NIngo Korb <ingo.korb@tu-dortmund.de> Cc: Hugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Ning Qu <quning@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [3.15.x] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 6月, 2014 5 次提交
-
-
由 Kirill A. Shutemov 提交于
Some clarification on how faultaround works. [akpm@linux-foundation.org: tweak comment text] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
There is evidencs that the faultaround feature is less relevant on architectures with page size bigger then 4k. Which makes sense since page fault overhead per byte of mapped area should be less there. Let's rework the feature to specify faultaround area in bytes instead of page order. It's 64 kilobytes for now. The patch effectively disables faultaround on architectures with page size >= 64k (like ppc64). It's possible that some other size of faultaround area is relevant for a platform. We can expose `fault_around_bytes' variable to arch-specific code once such platforms will be found. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Hugh Dickins <hughd@google.com> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
mm/memory.c is overloaded: over 4k lines. get_user_pages() code is pretty much self-contained let's move it to separate file. No other changes made. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
_PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting faults on x86. Care is taken such that _PAGE_NUMA is used only in situations where the VMA flags distinguish between NUMA hinting faults and prot_none faults. This decision was x86-specific and conceptually it is difficult requiring special casing to distinguish between PROTNONE and NUMA ptes based on context. Fundamentally, we only need the _PAGE_NUMA bit to tell the difference between an entry that is really unmapped and a page that is protected for NUMA hinting faults as if the PTE is not present then a fault will be trapped. Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset. This patch shrinks the maximum possible swap size and uses the bit to uniquely distinguish between NUMA hinting ptes and swap ptes. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Anvin <hpa@zytor.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Noonan <steven@uplinklabs.net> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 5月, 2014 1 次提交
-
-
由 Rik van Riel 提交于
Changing PTEs and PMDs to pte_numa & pmd_numa is done with the mmap_sem held for reading, which means a pmd can be instantiated and turned into a numa one while __handle_mm_fault() is examining the value of old_pmd. If that happens, __handle_mm_fault() should just return and let the page fault retry, instead of throwing an oops. This is handled by the test for pmd_trans_huge(*pmd) below. Signed-off-by: NRik van Riel <riel@redhat.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: NSunil Pandey <sunil.k.pandey@intel.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org Cc: lwoodman@redhat.com Cc: dave.hansen@intel.com Link: http://lkml.kernel.org/r/20140429153615.2d72098e@annuminas.surriel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 26 4月, 2014 1 次提交
-
-
由 Linus Torvalds 提交于
The mmu-gather operation 'tlb_flush_mmu()' has done two things: the actual tlb flush operation, and the batched freeing of the pages that the TLB entries pointed at. This splits the operation into separate phases, so that the forced batched flushing done by zap_pte_range() can now do the actual TLB flush while still holding the page table lock, but delay the batched freeing of all the pages to after the lock has been dropped. This in turn allows us to avoid a race condition between set_page_dirty() (as called by zap_pte_range() when it finds a dirty shared memory pte) and page_mkclean(): because we now flush all the dirty page data from the TLB's while holding the pte lock, page_mkclean() will be held up walking the (recently cleaned) page tables until after the TLB entries have been flushed from all CPU's. Reported-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: NDave Hansen <dave.hansen@intel.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-