1. 09 10月, 2013 1 次提交
    • M
      mm: numa: Sanitize task_numa_fault() callsites · 8191acbd
      Mel Gorman 提交于
      There are three callers of task_numa_fault():
      
       - do_huge_pmd_numa_page():
           Accounts against the current node, not the node where the
           page resides, unless we migrated, in which case it accounts
           against the node we migrated to.
      
       - do_numa_page():
           Accounts against the current node, not the node where the
           page resides, unless we migrated, in which case it accounts
           against the node we migrated to.
      
       - do_pmd_numa_page():
           Accounts not at all when the page isn't migrated, otherwise
           accounts against the node we migrated towards.
      
      This seems wrong to me; all three sites should have the same
      sementaics, furthermore we should accounts against where the page
      really is, we already know where the task is.
      
      So modify all three sites to always account; we did after all receive
      the fault; and always account to where the page is after migration,
      regardless of success.
      
      They all still differ on when they clear the PTE/PMD; ideally that
      would get sorted too.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8191acbd
  2. 13 9月, 2013 3 次提交
    • K
      thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() · c0292554
      Kirill A. Shutemov 提交于
      do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
      to handle fallback path.
      
      Let's consolidate code back by introducing VM_FAULT_FALLBACK return
      code.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0292554
    • J
      mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner 提交于
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NazurIt <azurit@pobox.sk>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
    • J
      mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Johannes Weiner 提交于
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      519e5247
  3. 12 9月, 2013 2 次提交
  4. 16 8月, 2013 1 次提交
    • L
      Fix TLB gather virtual address range invalidation corner cases · 2b047252
      Linus Torvalds 提交于
      Ben Tebulin reported:
      
       "Since v3.7.2 on two independent machines a very specific Git
        repository fails in 9/10 cases on git-fsck due to an SHA1/memory
        failures.  This only occurs on a very specific repository and can be
        reproduced stably on two independent laptops.  Git mailing list ran
        out of ideas and for me this looks like some very exotic kernel issue"
      
      and bisected the failure to the backport of commit 53a59fc6 ("mm:
      limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
      
      That commit itself is not actually buggy, but what it does is to make it
      much more likely to hit the partial TLB invalidation case, since it
      introduces a new case in tlb_next_batch() that previously only ever
      happened when running out of memory.
      
      The real bug is that the TLB gather virtual memory range setup is subtly
      buggered.  It was introduced in commit 597e1c35 ("mm/mmu_gather:
      enable tlb flush range in generic mmu_gather"), and the range handling
      was already fixed at least once in commit e6c495a9 ("mm: fix the TLB
      range flushed when __tlb_remove_page() runs out of slots"), but that fix
      was not complete.
      
      The problem with the TLB gather virtual address range is that it isn't
      set up by the initial tlb_gather_mmu() initialization (which didn't get
      the TLB range information), but it is set up ad-hoc later by the
      functions that actually flush the TLB.  And so any such case that forgot
      to update the TLB range entries would potentially miss TLB invalidates.
      
      Rather than try to figure out exactly which particular ad-hoc range
      setup was missing (I personally suspect it's the hugetlb case in
      zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
      did), this patch just gets rid of the problem at the source: make the
      TLB range information available to tlb_gather_mmu(), and initialize it
      when initializing all the other tlb gather fields.
      
      This makes the patch larger, but conceptually much simpler.  And the end
      result is much more understandable; even if you want to play games with
      partial ranges when invalidating the TLB contents in chunks, now the
      range information is always there, and anybody who doesn't want to
      bother with it won't introduce subtle bugs.
      
      Ben verified that this fixes his problem.
      Reported-bisected-and-tested-by: NBen Tebulin <tebulin@googlemail.com>
      Build-testing-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Build-testing-by: NRichard Weinberger <richard.weinberger@gmail.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b047252
  5. 14 8月, 2013 2 次提交
  6. 13 8月, 2013 1 次提交
  7. 10 7月, 2013 1 次提交
  8. 04 7月, 2013 3 次提交
  9. 06 6月, 2013 1 次提交
    • P
      arch, mm: Remove tlb_fast_mode() · 29eb7782
      Peter Zijlstra 提交于
      Since the introduction of preemptible mmu_gather TLB fast mode has been
      broken. TLB fast mode relies on there being absolutely no concurrency;
      it frees pages first and invalidates TLBs later.
      
      However now we can get concurrency and stuff goes *bang*.
      
      This patch removes all tlb_fast_mode() code; it was found the better
      option vs trying to patch the hole by entangling tlb invalidation with
      the scheduler.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Reported-by: NMax Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29eb7782
  10. 28 5月, 2013 2 次提交
    • M
      mm, sched: Allow uaccess in atomic with pagefault_disable() · 662bbcb2
      Michael S. Tsirkin 提交于
      This changes might_fault() so that it does not
      trigger a false positive diagnostic for e.g. the following
      sequence:
      
      	spin_lock_irqsave()
      	pagefault_disable()
      	copy_to_user()
      	pagefault_enable()
      	spin_unlock_irqrestore()
      
      In particular vhost wants to do this, to call
      socket ops from under a lock.
      
      There are 3 cases to consider:
      
       - CONFIG_PROVE_LOCKING - might_fault is non-inline
         so it's easy to move the in_atomic test to fix
         up the false positive warning.
      
       - CONFIG_DEBUG_ATOMIC_SLEEP - might_fault
         is currently inline, but we are calling a
         non-inline __might_sleep anyway,
         so let's use the non-line version of might_fault
         that does the right thing.
      
       - !CONFIG_DEBUG_ATOMIC_SLEEP && !CONFIG_PROVE_LOCKING
         __might_sleep is a nop so might_fault is a nop.
      
      Make this explicit.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1369577426-26721-11-git-send-email-mst@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      662bbcb2
    • M
      mm, sched: Drop voluntary schedule from might_fault() · 114276ac
      Michael S. Tsirkin 提交于
      might_fault() is called from functions like copy_to_user()
      which most callers expect to be very fast, like a couple of
      instructions.
      
      So functions like memcpy_toiovec() call them many times in a loop.
      
      But might_fault() calls might_sleep() and with CONFIG_PREEMPT_VOLUNTARY
      this results in a function call.
      
      Let's not do this - just call __might_sleep() that produces
      a diagnostic for sleep within atomic, but drop
      might_preempt().
      
      Here's a test sending traffic between the VM and the host,
      host is built with CONFIG_PREEMPT_VOLUNTARY:
      
       before:
      	incoming: 7122.77   Mb/s
      	outgoing: 8480.37   Mb/s
      
       after:
      	incoming: 8619.24   Mb/s
      	outgoing: 9455.42   Mb/s
      
      As a side effect, this fixes an issue pointed
      out by Ingo: might_fault might schedule differently
      depending on PROVE_LOCKING. Now there's no
      preemption point in both cases, so it's consistent.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1369577426-26721-10-git-send-email-mst@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      114276ac
  11. 30 4月, 2013 1 次提交
  12. 29 4月, 2013 1 次提交
  13. 17 4月, 2013 1 次提交
    • L
      vm: add vm_iomap_memory() helper function · b4cbb197
      Linus Torvalds 提交于
      Various drivers end up replicating the code to mmap() their memory
      buffers into user space, and our core memory remapping function may be
      very flexible but it is unnecessarily complicated for the common cases
      to use.
      
      Our internal VM uses pfn's ("page frame numbers") which simplifies
      things for the VM, and allows us to pass physical addresses around in a
      denser and more efficient format than passing a "phys_addr_t" around,
      and having to shift it up and down by the page size.  But it just means
      that drivers end up doing that shifting instead at the interface level.
      
      It also means that drivers end up mucking around with internal VM things
      like the vma details (vm_pgoff, vm_start/end) way more than they really
      need to.
      
      So this just exports a function to map a certain physical memory range
      into user space (using a phys_addr_t based interface that is much more
      natural for a driver) and hides all the complexity from the driver.
      Some drivers will still end up tweaking the vm_page_prot details for
      things like prefetching or cacheability etc, but that's actually
      relevant to the driver, rather than caring about what the page offset of
      the mapping is into the particular IO memory region.
      Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4cbb197
  14. 13 4月, 2013 1 次提交
    • D
      x86-32: Fix possible incomplete TLB invalidate with PAE pagetables · 1de14c3c
      Dave Hansen 提交于
      This patch attempts to fix:
      
      	https://bugzilla.kernel.org/show_bug.cgi?id=56461
      
      The symptom is a crash and messages like this:
      
      	chrome: Corrupted page table at address 34a03000
      	*pdpt = 0000000000000000 *pde = 0000000000000000
      	Bad pagetable: 000f [#1] PREEMPT SMP
      
      Ingo guesses this got introduced by commit 611ae8e3 ("x86/tlb:
      enable tlb flush range support for x86") since that code started to free
      unused pagetables.
      
      On x86-32 PAE kernels, that new code has the potential to free an entire
      PMD page and will clear one of the four page-directory-pointer-table
      (aka pgd_t entries).
      
      The hardware aggressively "caches" these top-level entries and invlpg
      does not actually affect the CPU's copy.  If we clear one we *HAVE* to
      do a full TLB flush, otherwise we might continue using a freed pmd page.
      (note, we do this properly on the population side in pud_populate()).
      
      This patch tracks whenever we clear one of these entries in the 'struct
      mmu_gather', and ensures that we follow up with a full tlb flush.
      
      BTW, I disassembled and checked that:
      
      	if (tlb->fullmm == 0)
      and
      	if (!tlb->fullmm && !tlb->need_flush_all)
      
      generate essentially the same code, so there should be zero impact there
      to the !PAE case.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Artem S Tashkinov <t.artem@mailcity.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1de14c3c
  15. 24 2月, 2013 8 次提交
    • H
      mm: cleanup "swapcache" in do_swap_page · 56f31801
      Hugh Dickins 提交于
      I dislike the way in which "swapcache" gets used in do_swap_page():
      there is always a page from swapcache there (even if maybe uncached by
      the time we lock it), but tests are made according to "swapcache".
      Rework that with "page != swapcache", as has been done in unuse_pte().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56f31801
    • H
      mm,ksm: FOLL_MIGRATION do migration_entry_wait · 5117b3b8
      Hugh Dickins 提交于
      In "ksm: remove old stable nodes more thoroughly" I said that I'd never
      seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
      but it soon appeared once I tried fuller tests on the whole series.
      
      It turned out to be due to the KSM page migration itself: unmerge_and_
      remove_all_rmap_items() failed to locate and replace all the KSM pages,
      because of that hiatus in page migration when old pte has been replaced
      by migration entry, but not yet by new pte.  follow_page() finds no page
      at that instant, but a KSM page reappears shortly after, without a
      fault.
      
      Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
      for KSM's break_cow().  I'd have preferred to avoid another flag, and do
      it every time, in case someone else makes the same easy mistake; but did
      not find another transgressor (the common get_user_pages() is of course
      safe), and cannot be sure that every follow_page() caller is prepared to
      sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
      already sleep there, since anon_vma locking was changed to mutex, but
      maybe that's somehow excluded.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5117b3b8
    • M
      mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse 提交于
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • M
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse 提交于
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • H
      ksm: remove old stable nodes more thoroughly · cbf86cfe
      Hugh Dickins 提交于
      Switching merge_across_nodes after running KSM is liable to oops on stale
      nodes still left over from the previous stable tree.  It's not something
      that people will often want to do, but it would be lame to demand a reboot
      when they're trying to determine which merge_across_nodes setting is best.
      
      How can this happen?  We only permit switching merge_across_nodes when
      pages_shared is 0, and usually set run 2 to force that beforehand, which
      ought to unmerge everything: yet oopses still occur when you then run 1.
      
      Three causes:
      
      1. The old stable tree (built according to the inverse
         merge_across_nodes) has not been fully torn down.  A stable node
         lingers until get_ksm_page() notices that the page it references no
         longer references it: but the page is not necessarily freed as soon as
         expected, particularly when swapcache.
      
         Fix this with a pass through the old stable tree, applying
         get_ksm_page() to each of the remaining nodes (most found stale and
         removed immediately), with forced removal of any left over.  Unless the
         page is still mapped: I've not seen that case, it shouldn't occur, but
         better to WARN_ON_ONCE and EBUSY than BUG.
      
      2. __ksm_enter() has a nice little optimization, to insert the new mm
         just behind ksmd's cursor, so there's a full pass for it to stabilize
         (or be removed) before ksmd addresses it.  Nice when ksmd is running,
         but not so nice when we're trying to unmerge all mms: we were missing
         those mms forked and inserted behind the unmerge cursor.  Easily fixed
         by inserting at the end when KSM_RUN_UNMERGE.
      
      3.  It is possible for a KSM page to be faulted back from swapcache
         into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
         it.  Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
         private to ksm.c, so dissolve the distinction between
         ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
         the one call into ksm.c.
      
      A long outstanding, unrelated bugfix sneaks in with that third fix:
      ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
      error when read in from swap) to a page which it then marks Uptodate.  Fix
      this case by not copying, letting do_swap_page() discover the error.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbf86cfe
    • P
      mm: fold page->_last_nid into page->flags where possible · 75980e97
      Peter Zijlstra 提交于
      page->_last_nid fits into page->flags on 64-bit.  The unlikely 32-bit
      NUMA configuration with NUMA Balancing will still need an extra page
      field.  As Peter notes "Completely dropping 32bit support for
      CONFIG_NUMA_BALANCING would simplify things, but it would also remove
      the warning if we grow enough 64bit only page-flags to push the last-cpu
      out."
      
      [mgorman@suse.de: minor modifications]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75980e97
    • M
      mm: directly use __mlock_vma_pages_range() in find_extend_vma() · cea10a19
      Michel Lespinasse 提交于
      In find_extend_vma(), we don't need mlock_vma_pages_range() to verify
      the vma type - we know we're working with a stack.  So, we can call
      directly into __mlock_vma_pages_range(), and remove the last
      make_pages_present() call site.
      
      Note that we don't use mm_populate() here, so we can't release the
      mmap_sem while allocating new stack pages.  This is deemed acceptable,
      because the stack vmas grow by a bounded number of pages at a time, and
      these are anon pages so we don't have to read from disk to populate
      them.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Tested-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cea10a19
    • J
      mm: reduce rmap overhead for ex-KSM page copies created on swap faults · af34770e
      Johannes Weiner 提交于
      When ex-KSM pages are faulted from swap cache, the fault handler is not
      capable of re-establishing anon_vma-spanning KSM pages.  In this case, a
      copy of the page is created instead, just like during a COW break.
      
      These freshly made copies are known to be exclusive to the faulting VMA
      and there is no reason to go look for this page in parent and sibling
      processes during rmap operations.
      
      Use page_add_new_anon_rmap() for these copies.  This also puts them on
      the proper LRU lists and marks them SwapBacked, so we can get rid of
      doing this ad-hoc in the KSM copy code.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af34770e
  16. 21 1月, 2013 1 次提交
  17. 10 1月, 2013 1 次提交
  18. 05 1月, 2013 1 次提交
    • M
      mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT · 53a59fc6
      Michal Hocko 提交于
      Since commit e303297e ("mm: extended batches for generic
      mmu_gather") we are batching pages to be freed until either
      tlb_next_batch cannot allocate a new batch or we are done.
      
      This works just fine most of the time but we can get in troubles with
      non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY)
      on large machines where too aggressive batching might lead to soft
      lockups during process exit path (exit_mmap) because there are no
      scheduling points down the free_pages_and_swap_cache path and so the
      freeing can take long enough to trigger the soft lockup.
      
      The lockup is harmless except when the system is setup to panic on
      softlockup which is not that unusual.
      
      The simplest way to work around this issue is to limit the maximum
      number of batches in a single mmu_gather.  10k of collected pages should
      be safe to prevent from soft lockups (we would have 2ms for one) even if
      they are all freed without an explicit scheduling point.
      
      This patch doesn't add any new explicit scheduling points because it
      relies on zap_pmd_range during page tables zapping which calls
      cond_resched per PMD.
      
      The following lockup has been reported for 3.0 kernel with a huge
      process (in order of hundreds gigs but I do know any more details).
      
        BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053]
        Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod
        Supported: Yes
        CPU 56
        Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7
        RIP: 0010:  _raw_spin_unlock_irqrestore+0x8/0x10
        RSP: 0018:ffff883ec1037af0  EFLAGS: 00000206
        RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80
        RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206
        RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400
        R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e
        R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e
        FS:  00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100)
        Call Trace:
          release_pages+0xc5/0x260
          free_pages_and_swap_cache+0x9d/0xc0
          tlb_flush_mmu+0x5c/0x80
          tlb_finish_mmu+0xe/0x50
          exit_mmap+0xbd/0x120
          mmput+0x49/0x120
          exit_mm+0x122/0x160
          do_exit+0x17a/0x430
          do_group_exit+0x3d/0xb0
          get_signal_to_deliver+0x247/0x480
          do_signal+0x71/0x1b0
          do_notify_resume+0x98/0xb0
          int_signal+0x12/0x17
        DWARF2 unwinder stuck at int_signal+0x12/0x17
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.0+]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53a59fc6
  19. 18 12月, 2012 2 次提交
  20. 13 12月, 2012 4 次提交
  21. 12 12月, 2012 1 次提交
  22. 11 12月, 2012 1 次提交
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman 提交于
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      b8593bfd