1. 07 Sep, 2017 2 commits
    • dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Ross Zwisler committed
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add vm_insert_mixed_mkwrite() · b2770da6
      Ross Zwisler committed
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.  This has three major
      drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory.
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it.
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      This series solves these issues by following the lead of the DAX PMD
      code and using a common 4k zero page instead.  This reduces memory usage
      and decreases latencies for some workloads, and it simplifies the DAX
      code, removing over 100 lines in total.
      
      This patch (of 5):
      
      To be able to use the common 4k zero page in DAX we need to have our PTE
      fault path look more like our PMD fault path where a PTE entry can be
      marked as dirty and writeable as it is first inserted rather than
      waiting for a follow-up dax_pfn_mkwrite() => finish_mkwrite_fault()
      call.
      
      Right now we can rely on having a dax_pfn_mkwrite() call because we can
      distinguish between these two cases in do_wp_page():
      
      	case 1: 4k zero page => writable DAX storage
      	case 2: read-only DAX storage => writeable DAX storage
      
      This distinction is made via vm_normal_page().  vm_normal_page()
      returns false for the common 4k zero page, though, just as it does for
      DAX ptes.  Instead of special casing the DAX + 4k zero page case we will
      simplify our DAX PTE page fault sequence so that it matches our DAX PMD
      sequence, and get rid of the dax_pfn_mkwrite() helper.  We will instead
      use dax_iomap_fault() to handle write-protection faults.
      
      This means that insert_pfn() needs to follow the lead of
      insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag.  If 'mkwrite'
      is set insert_pfn() will do the work that was previously done by
      wp_page_reuse() as part of the dax_pfn_mkwrite() call path.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
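A minimal sketch of the arithmetic behind drawback 1 in b2770da6, assuming 4 KiB pages. The helper names are hypothetical and this is not kernel code; it only compares how much zeroed memory is held when every faulting page gets its own zeroed page versus when all faults share one common zero page.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages, as in the commit text */

/* Number of pages touched when reading `bytes` from a file hole. */
static unsigned long sketch_pages_touched(unsigned long bytes)
{
    return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Old scheme: one freshly zeroed page cache page per faulting page. */
static unsigned long sketch_zeroed_bytes_per_page(unsigned long bytes)
{
    return sketch_pages_touched(bytes) * SKETCH_PAGE_SIZE;
}

/* New scheme: every read fault maps the same shared zero page. */
static unsigned long sketch_zeroed_bytes_shared(unsigned long bytes)
{
    return bytes ? SKETCH_PAGE_SIZE : 0;
}
```

For a 1 GiB read, the first helper reports 262144 pages, the per-page scheme holds the full 1 GiB of zeroed memory, and the shared scheme holds a single 4 KiB page.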
  2. 01 Sep, 2017 5 commits
    • mm, madvise: ensure poisoned pages are removed from per-cpu lists · c461ad6a
      Mel Gorman committed
      Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
      and bisected it to the commit 479f854a ("mm, page_alloc: defer
      debugging checks of pages allocated from the PCP").
      
      The problem is that a page that was poisoned with madvise() is reused.
      The commit removed a check that would trigger if DEBUG_VM was enabled
      but re-enabling the check only fixes the problem as a side-effect by
      printing a bad_page warning and recovering.
      
      The root of the problem is that an madvise() can leave a poisoned page
      on the per-cpu list.  This patch drains all per-cpu lists after pages
      are poisoned so that they will not be reused.  Wendy reports that the
      test case in question passes with this patch applied.  While this could
      be done in a targeted fashion, it is over-complicated for such a rare
      operation.
      
      Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
      Fixes: 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Wang, Wendy <wendy.wang@intel.com>
      Tested-by: Wang, Wendy <wendy.wang@intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Hansen, Dave" <dave.hansen@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,page_alloc: don't call __node_reclaim() with oom_lock held. · e746bf73
      Tetsuo Handa committed
      We make a last-second memory allocation attempt before calling
      out_of_memory().  But since slab shrinker functions might indirectly
      wait for another thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory
      allocations via sleeping locks, calling slab shrinker functions from
      node_reclaim() from get_page_from_freelist() with oom_lock held can
      deadlock.  Therefore, make sure that the last-second memory
      allocation attempt does not call slab shrinker functions.
      
      Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: kill invalidate_page · 5f32b265
      Jérôme Glisse committed
      The invalidate_page callback suffered from two pitfalls.  First, it used
      to happen after the page table lock was released, and thus a new page
      might have been set up before the call to invalidate_page() happened.
      
      This is in a weird way fixed by commit c7ab0d2f ("mm: convert
      try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
      callback under the page table lock but this also broke several existing
      users of the mmu_notifier API that assumed they could sleep inside this
      callback.
      
      The second pitfall was that invalidate_page() was the only callback that
      took an address and a page rather than a range of addresses to
      invalidate.  Many callback implementers assumed this
      could never be THP and thus failed to invalidate the appropriate range
      for THP.
      
      By killing this callback we unify the mmu_notifier callback API to
      always take a virtual address range as input.
      
      Finally, this also simplifies life for end users, as there are now two
      clear choices:
        - invalidate_range_start()/end() callbacks (which allow you to sleep)
        - invalidate_range(), where you cannot sleep but which happens right
          after the page table update, under the page table lock
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/rmap: update to new mmu_notifier semantic v2 · 369ea824
      Jérôme Glisse committed
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      
      Note that because we cannot presume the pmd value or pte value we have
      to assume the worst and unconditionally report an invalidation as
      happening.
      
      Changed since v2:
        - try_to_unmap_one() only one call to mmu_notifier_invalidate_range()
        - compute end with PAGE_SIZE << compound_order(page)
        - fix PageHuge() case in try_to_unmap_one()
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: update to new mmu_notifier semantic · a4d1a885
      Jérôme Glisse committed
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      
      Note that because we cannot presume the pmd value or pte value we have
      to assume the worst and unconditionally report an invalidation as
      happening.
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 30 Aug, 2017 1 commit
    • Revert "rmap: do not call mmu_notifier_invalidate_page() under ptl" · 785373b4
      Linus Torvalds committed
      This reverts commit aac2fea9.
      
      It turns out that that patch was complete and utter garbage, and broke
      KVM, resulting in odd oopses.
      
      Quoting Andrea Arcangeli:
       "The aforementioned commit has 3 bugs.
      
        1) mmu_notifier_invalidate_range cannot be used in replacement of
           mmu_notifier_invalidate_range_start/end.
      
           For KVM mmu_notifier_invalidate_range is a noop and rightfully so.
      
           A MMU notifier implementation has to implement either
           ->invalidate_range method or the invalidate_range_start/end
           methods, not both. And if you implement invalidate_range_start/end
           like KVM is forced to do, calling mmu_notifier_invalidate_range in
           common code is a noop for KVM.
      
           For those MMU notifiers that can get away only implementing
           ->invalidate_range, the ->invalidate_range is implicitly called by
           mmu_notifier_invalidate_range_end(). And only those secondary MMUs
           that share the same pagetable with the primary MMU (like AMD
           iommuv2) can get away only implementing ->invalidate_range.
      
           So all cases (THP on/off) are broken right now.
      
           To fix this is enough to replace mmu_notifier_invalidate_range with
           mmu_notifier_invalidate_range_start;mmu_notifier_invalidate_range_end.
           Either that or call multiple mmu_notifier_invalidate_page like
           before.
      
        2) address + (1UL << compound_order(page) is buggy, it should be
           PAGE_SIZE << compound_order(page), it's bytes not pages, 2M not
           512.
      
        3) The whole invalidate_range thing was an attempt to call a single
           invalidate while walking multiple 4k ptes that maps the same THP
           (after a pmd virtual split without physical compound page THP
           split).
      
           It's unclear if the rmap_walk will always provide an address that
           is 2M aligned as parameter to try_to_unmap_one, in presence of THP.
           I think it needs also an address &= (PAGE_SIZE <<
           compound_order(page)) - 1 to be safe"
      
      In general, we should stop making excuses for horrible MMU notifier
      users.  It's much more important that the core VM is sane and safe, than
      letting MMU notifiers sleep.
      
      So if some MMU notifier is sleeping under a spinlock, we need to fix the
      notifier, not try to make excuses for that garbage in the core VM.
      Reported-and-tested-by: Bernhard Held <berny156@gmx.de>
      Reported-and-tested-by: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
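A minimal sketch of bugs 2 and 3 quoted in 785373b4, assuming 4 KiB base pages and a compound order of 9 for a 2M THP. The function names are hypothetical. The quoted mail writes the alignment mask without a `~`; this sketch assumes the intent is to round the address down to the start of the compound page, so it uses the complemented mask.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB base pages */

/* Buggy form: adds a page count (512 for order 9) to a byte address. */
static unsigned long sketch_range_end_buggy(unsigned long address, unsigned int order)
{
    return address + (1UL << order);
}

/* Fixed form: the range length is in bytes (2M for order 9). */
static unsigned long sketch_range_end_fixed(unsigned long address, unsigned int order)
{
    return address + (SKETCH_PAGE_SIZE << order);
}

/* Round an address down to the start of its compound page. */
static unsigned long sketch_range_start_aligned(unsigned long address, unsigned int order)
{
    unsigned long size = SKETCH_PAGE_SIZE << order;

    return address & ~(size - 1);
}
```

With address 0x200000 and order 9, the buggy form ends the range at 0x200200 (512 bytes later), while the fixed form ends it at 0x400000 (2M later). An unaligned address such as 0x2ff000 rounds down to 0x200000.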
  4. 29 Aug, 2017 1 commit
    • page waitqueue: always add new entries at the end · 9c3a815f
      Linus Torvalds committed
      Commit 3510ca20 ("Minor page waitqueue cleanups") made the page
      queue code always add new waiters to the back of the queue, which helps
      upcoming patches to batch the wakeups for some horrid loads where the
      wait queues grow to thousands of entries.
      
      However, I forgot about the nasty add_page_wait_queue() special case
      code that is only used by the cachefiles code.  That one still continued
      to add the new wait queue entries at the beginning of the list.
      
      Fix it, because any sane batched wakeup will require that we don't
      suddenly start getting new entries at the beginning of the list that we
      already handled in a previous batch.
      
      [ The current code always does the whole list while holding the lock, so
        wait queue ordering doesn't matter for correctness, but even then it's
        better to add later entries at the end from a fairness standpoint ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
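A minimal sketch of why 9c3a815f wants new waiters at the tail once wakeups are batched. The fixed-size FIFO and all names below are hypothetical stand-ins, not the kernel's wait queue types. Because new entries only ever land behind the ones not yet processed, a later batch never has to revisit the part of the queue an earlier batch already handled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_WAITERS 16

/* Hypothetical FIFO of waiter ids; not the kernel's wait queue. */
struct sketch_waitqueue {
    int ids[SKETCH_MAX_WAITERS];
    size_t head; /* next waiter to wake */
    size_t tail; /* slot for the next new waiter */
};

/* New waiters always go to the end. Returns 0 on success, -1 if full. */
static int sketch_add_tail(struct sketch_waitqueue *q, int id)
{
    if (q->tail == SKETCH_MAX_WAITERS)
        return -1;
    q->ids[q->tail++] = id;
    return 0;
}

/* Wake up to `batch` waiters from the front; returns how many were woken. */
static size_t sketch_wake_batch(struct sketch_waitqueue *q, size_t batch, int *woken)
{
    size_t n = 0;

    while (n < batch && q->head < q->tail)
        woken[n++] = q->ids[q->head++];
    assert(q->head <= q->tail);
    return n;
}
```

If waiters 1, 2, 3 are queued, a batch of two wakes 1 and 2. A waiter 4 added afterwards is woken together with 3 in the next batch, in arrival order.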
  5. 28 Aug, 2017 2 commits
    • Avoid page waitqueue race leaving possible page locker waiting · a8b169af
      Linus Torvalds committed
      The "lock_page_killable()" function waits for exclusive access to the
      page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
      set.
      
      That means that if it gets woken up, other waiters may have been
      skipped.
      
      That, in turn, means that if it sees the page being unlocked, it *must*
      take that lock and return success, even if a lethal signal is also
      pending.
      
      So instead of checking for lethal signals first, we need to check for
      them after we've checked the actual bit that we were waiting for.  Even
      if that might then delay the killing of the process.
      
      This matches the order of the old "wait_on_bit_lock()" infrastructure
      that the page locking used to use (and is still used in a few other
      areas).
      
      Note that if we still return an error after having unsuccessfully tried
      to acquire the page lock, that is ok: that means that some other thread
      was able to get ahead of us and lock the page, and when that other
      thread then unlocks the page, the wakeup event will be repeated.  So any
      other pending waiters will now get properly woken up.
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
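A minimal sketch of the check ordering that a8b169af describes, modelling one wake-up of an exclusive waiter. The function name, the boolean lock flag, and the SKETCH_KEEP_WAITING return value are assumptions for illustration; the real code operates on page flag bits and wait queue entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_KEEP_WAITING 1 /* hypothetical "go back to sleep" result */

/*
 * One wake-up of an exclusive waiter. The lock bit is tested first;
 * only if the lock cannot be taken is a pending fatal signal honoured.
 * Returns 0 when the lock was acquired, -EINTR when giving up, or
 * SKETCH_KEEP_WAITING when the waiter should sleep again.
 */
static int sketch_exclusive_wakeup(bool *page_locked, bool fatal_signal_pending)
{
    if (!*page_locked) {
        *page_locked = true; /* other waiters may have been skipped: take it */
        return 0;
    }
    if (fatal_signal_pending)
        return -EINTR;
    return SKETCH_KEEP_WAITING;
}
```

When the page is unlocked and a fatal signal is pending, the sketch still takes the lock and returns 0. Returning -EINTR only happens when some other thread holds the lock, in which case that thread's unlock repeats the wakeup for the remaining waiters.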
    • Minor page waitqueue cleanups · 3510ca20
      Linus Torvalds committed
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the two first members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 26 Aug, 2017 4 commits
    • mm/memblock.c: reversed logic in memblock_discard() · 91b540f9
      Pavel Tatashin committed
      The recently introduced memblock_discard() has a reversed-logic bug:
      memory is freed from the static array instead of the dynamically
      allocated one.
      
      Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com
      Fixes: 3010f876 ("mm: discard memblock data later")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reported-by: Woody Suwalski <terraluna977@gmail.com>
      Tested-by: Woody Suwalski <terraluna977@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
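A minimal userspace sketch of the corrected logic in 91b540f9: release a region table only when it is no longer the statically allocated initial array. The struct layout, the array size, and every name below are hypothetical, and malloc()/free() stand in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_INIT_REGIONS 4

struct sketch_region {
    unsigned long base;
    unsigned long size;
};

/* Statically allocated initial table; it must never be freed. */
static struct sketch_region sketch_init_regions[SKETCH_INIT_REGIONS];

struct sketch_type {
    struct sketch_region *regions;
};

/* Free the table only if it was dynamically allocated. Returns true if freed. */
static bool sketch_discard(struct sketch_type *t)
{
    if (t->regions == sketch_init_regions)
        return false; /* static storage: leave it alone */
    free(t->regions);
    t->regions = NULL;
    return true;
}
```

The reversed form of the comparison would free the static array and leak the dynamic one, which matches the bug description in the commit message.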
    • mm/madvise.c: fix freeing of locked page with MADV_FREE · 263630e8
      Eric Biggers committed
      If madvise(..., MADV_FREE) split a transparent hugepage, it called
      put_page() before unlock_page().
      
      This was wrong because put_page() can free the page, e.g. if a
      concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
      mapping. put_page() then rightfully complained about freeing a locked
      page.
      
      Fix this by moving the unlock_page() before put_page().
      
      This bug was found by syzkaller, which encountered the following splat:
      
          BUG: Bad page state in process syzkaller412798  pfn:1bd800
          page:ffffea0006f60000 count:0 mapcount:0 mapping:          (null) index:0x20a00
          flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
          raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
          raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
          page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
          bad because of flags: 0x1(locked)
          Modules linked in:
          CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
          Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
          Call Trace:
           __dump_stack lib/dump_stack.c:16 [inline]
           dump_stack+0x194/0x257 lib/dump_stack.c:52
           bad_page+0x230/0x2b0 mm/page_alloc.c:565
           free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
           free_pages_check mm/page_alloc.c:952 [inline]
           free_pages_prepare mm/page_alloc.c:1043 [inline]
           free_pcp_prepare mm/page_alloc.c:1068 [inline]
           free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
           __put_single_page mm/swap.c:79 [inline]
           __put_page+0xfb/0x160 mm/swap.c:113
           put_page include/linux/mm.h:814 [inline]
           madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
           walk_pmd_range mm/pagewalk.c:50 [inline]
           walk_pud_range mm/pagewalk.c:108 [inline]
           walk_p4d_range mm/pagewalk.c:134 [inline]
           walk_pgd_range mm/pagewalk.c:160 [inline]
           __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
           walk_page_range+0x200/0x470 mm/pagewalk.c:326
           madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
           madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
           madvise_dontneed_free mm/madvise.c:555 [inline]
           madvise_vma mm/madvise.c:664 [inline]
           SYSC_madvise mm/madvise.c:832 [inline]
           SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
           entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Here is a C reproducer:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <sys/mman.h>
          #include <unistd.h>
      
          #define MADV_FREE	8
          #define PAGE_SIZE	4096
      
          static void *mapping;
          static const size_t mapping_size = 0x1000000;
      
          static void *madvise_thrproc(void *arg)
          {
              madvise(mapping, mapping_size, (long)arg);
              return NULL;
          }
      
          int main(void)
          {
              pthread_t t[2];
      
              for (;;) {
                  mapping = mmap(NULL, mapping_size, PROT_WRITE,
                                 MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      
                  munmap(mapping + mapping_size / 2, PAGE_SIZE);
      
                  pthread_create(&t[0], 0, madvise_thrproc, (void*)MADV_DONTNEED);
                  pthread_create(&t[1], 0, madvise_thrproc, (void*)MADV_FREE);
                  pthread_join(t[0], NULL);
                  pthread_join(t[1], NULL);
                  munmap(mapping, mapping_size);
              }
          }
      
      Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
      CONFIG_DEBUG_VM=y are needed.
      
      Google Bug Id: 64696096
      
      Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
      Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[v4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
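A minimal sketch of the ordering problem fixed by 263630e8, using a toy reference-counted page. The struct and helpers are hypothetical; the `bad_state` flag merely stands in for the "Bad page state" report quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: a reference count plus a lock flag. */
struct sketch_page {
    int refcount;
    bool locked;
    bool freed;
    bool bad_state; /* set when a locked page is freed */
};

static void sketch_unlock_page(struct sketch_page *p)
{
    p->locked = false;
}

/* Drop one reference; the last reference frees the page. */
static void sketch_put_page(struct sketch_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0) {
        if (p->locked)
            p->bad_state = true; /* freeing a locked page is a bug */
        p->freed = true;
    }
}
```

Calling sketch_put_page() on a locked page that holds the last reference sets `bad_state`; calling sketch_unlock_page() first and sketch_put_page() second frees the page cleanly.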
    • mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled · 435c0b87
      Kirill A. Shutemov committed
      /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
      allocate huge pages when allocating pages for the private in-kernel
      shmem mount.
      
      Unfortunately, as Dan noticed, I've screwed it up and the only way to
      make the kernel allocate huge pages for the mount is to use "force"
      there.  All other values will be effectively ignored.
      
      Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
      Fixes: 5a6e75f8 ("shmem: prepare huge= mount option and sysfs knob")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: stable <stable@vger.kernel.org> [4.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • PM/hibernate: touch NMI watchdog when creating snapshot · 556b969a
      Chen Yu committed
      Counting the pages for the hibernation snapshot can take a
      significant amount of time, especially on systems with large memory.
      Since the counting is performed with IRQs disabled, this may lead to
      an NMI lockup.  The following warning was seen on a system with
      1.5TB of DRAM:
      
        Freezing user space processes ... (elapsed 0.002 seconds) done.
        OOM killer disabled.
        PM: Preallocating image memory...
        NMI watchdog: Watchdog detected hard LOCKUP on cpu 27
        CPU: 27 PID: 3128 Comm: systemd-sleep Not tainted 4.13.0-0.rc2.git0.1.fc27.x86_64 #1
        task: ffff9f01971ac000 task.stack: ffffb1a3f325c000
        RIP: 0010:memory_bm_find_bit+0xf4/0x100
        Call Trace:
         swsusp_set_page_free+0x2b/0x30
         mark_free_pages+0x147/0x1c0
         count_data_pages+0x41/0xa0
         hibernate_preallocate_memory+0x80/0x450
         hibernation_snapshot+0x58/0x410
         hibernate+0x17c/0x310
         state_store+0xdf/0xf0
         kobj_attr_store+0xf/0x20
         sysfs_kf_write+0x37/0x40
         kernfs_fop_write+0x11c/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xb1/0x1a0
         SyS_write+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        ...
        done (allocated 6590003 pages)
        PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
      
      It took nearly 20 seconds (2.10GHz CPU), so the NMI lockup was
      triggered.  If the timeout of the NMI watchdog has been set to 1
      second, a safe interval should be 6590003/20 = 320k pages in theory.
      However, some platforms may run at a lower frequency,
      so feed the watchdog every 100k pages.
      
      [yu.c.chen@intel.com: simplification]
        Link: http://lkml.kernel.org/r/1503460079-29721-1-git-send-email-yu.c.chen@intel.com
      [yu.c.chen@intel.com: use interval of 128k instead of 100k to avoid modulus]
      Link: http://lkml.kernel.org/r/1503328098-5120-1-git-send-email-yu.c.chen@intel.com
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Reported-by: Jan Filipcewicz <jan.filipcewicz@intel.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
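A minimal sketch of the interval arithmetic in 556b969a. It uses the 128k interval mentioned in the bracketed note, which is a power of two, so the periodic check can use a bit mask instead of a modulus. All names are hypothetical and the counter is a stand-in for actually touching the NMI watchdog; this is not the code from the patch.

```c
#include <assert.h>

#define SKETCH_WD_PAGE_COUNT (128UL * 1024UL) /* 131072 pages, a power of two */

static unsigned long sketch_touches; /* how many times the watchdog was fed */

static void sketch_touch_watchdog(void)
{
    sketch_touches++;
}

/* Walk `nr_pages` pages, feeding the watchdog once every 128k pages. */
static unsigned long sketch_count_pages(unsigned long nr_pages)
{
    unsigned long done;

    sketch_touches = 0;
    for (done = 1; done <= nr_pages; done++) {
        /* power-of-two interval: a mask replaces the modulus */
        if ((done & (SKETCH_WD_PAGE_COUNT - 1)) == 0)
            sketch_touch_watchdog();
    }
    return sketch_touches;
}

/* Pages processed per second, rounded down. */
static unsigned long sketch_pages_per_second(unsigned long pages, unsigned long seconds)
{
    assert(seconds > 0);
    return pages / seconds;
}
```

For the 6590003 pages in the log above, 20 seconds works out to 329500 pages per second (about 320k when k is 1024), and a 128k interval feeds the watchdog 50 times over the whole walk.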
  7. 21 Aug, 2017 1 commit
    • Sanitize 'move_pages()' permission checks · 197e7e52
      Linus Torvalds committed
      The 'move_pages()' system call was introduced long long ago with the
      same permission checks as for sending a signal (except using
      CAP_SYS_NICE instead of CAP_KILL for the overriding capability).
      
      That turns out to not be a great choice - while the system call really
      only moves physical page allocations around (and you need other
      capabilities to do a lot of it), you can check the return value to map
      out some of the virtual address choices and defeat ASLR of a binary that
      still shares your uid.
      
      So change the access checks to the more common 'ptrace_may_access()'
      model instead.
      
      This tightens the access checks for the uid, and also effectively
      changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
      anybody really _uses_ this legacy system call any more (we have better
      NUMA placement models these days), so I expect nobody to notice.
      
      Famous last words.

      Reported-by: Otto Ebeling <otto.ebeling@iki.fi>
      Acked-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      197e7e52
  8. 19 Aug, 2017 8 commits
    • L
      mm/vmalloc.c: don't unconditionally use __GFP_HIGHMEM · 704b862f
      Laura Abbott committed
      Commit 19809c2d ("mm, vmalloc: use __GFP_HIGHMEM implicitly") added
      use of __GFP_HIGHMEM for allocations.  vmalloc_32 may use
      GFP_DMA/GFP_DMA32 which does not play nice with __GFP_HIGHMEM and will
      trigger a BUG in gfp_zone.
      
      Only add __GFP_HIGHMEM if we aren't using GFP_DMA/GFP_DMA32.
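
      The rule above can be sketched in plain C. The DEMO_* flag values and
      the helper name are hypothetical and chosen only for illustration; the
      kernel's real gfp_t bits differ:

```c
#include <stdint.h>

typedef uint32_t demo_gfp_t;

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define DEMO_GFP_DMA      ((demo_gfp_t)0x01u)
#define DEMO_GFP_HIGHMEM  ((demo_gfp_t)0x02u)
#define DEMO_GFP_DMA32    ((demo_gfp_t)0x04u)
#define DEMO_GFP_KERNEL   ((demo_gfp_t)0x10u)

/*
 * Return the mask to allocate with: add the highmem flag only when the
 * caller did not ask for a DMA or DMA32 zone, since the two requests
 * cannot be combined.
 */
static demo_gfp_t demo_vmalloc_gfp(demo_gfp_t gfp_mask)
{
	const demo_gfp_t highmem_mask =
		(gfp_mask & (DEMO_GFP_DMA | DEMO_GFP_DMA32)) ?
		0 : DEMO_GFP_HIGHMEM;

	return gfp_mask | highmem_mask;
}
```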
      
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1482249
      Link: http://lkml.kernel.org/r/20170816220705.31374-1-labbott@redhat.com
      Fixes: 19809c2d ("mm, vmalloc: use __GFP_HIGHMEM implicitly")
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      704b862f
    • Z
      mm/mempolicy: fix use after free when calling get_mempolicy · 73223e4e
      zhong jiang committed
      I hit a use after free issue when executing trinity and reproduced it
      with KASAN enabled.  The related call trace is as follows.
      
        BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
        Read of size 2 by task syz-executor1/798
      
        INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
           __slab_alloc+0x768/0x970
           kmem_cache_alloc+0x2e7/0x450
           mpol_new.part.2+0x74/0x160
           mpol_new+0x66/0x80
           SyS_mbind+0x267/0x9f0
           system_call_fastpath+0x16/0x1b
        INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
           __slab_free+0x495/0x8e0
           kmem_cache_free+0x2f3/0x4c0
           __mpol_put+0x2b/0x40
           SyS_mbind+0x383/0x9f0
           system_call_fastpath+0x16/0x1b
        INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
        INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600
      
        Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
        Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
        Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5                          kkkkkkk.
        Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb                          ........
        Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
        Memory state around the buggy address:
        ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc
      
      A non-shared (!shared) memory policy is not protected against parallel
      removal by another thread; that protection normally comes from
      mmap_sem.  do_get_mempolicy, however, drops the lock midway while we
      can still access the policy later.

      The premature up_read is a historical artifact from the time when
      put_user was called in this path (see https://lwn.net/Articles/124754/),
      but that call has been gone since 8bccd85f ("[PATCH] Implement sys_*
      do_* layering in the memory policy layer.").  The issue was introduced
      when the current mempolicy ref count model was adopted.
      
      Fix the issue by removing the premature release.
      
      Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[2.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73223e4e
    • P
      mm/cma_debug.c: fix stack corruption due to sprintf usage · da094e42
      Prakash Gupta committed
      name[] in cma_debugfs_add_one() can only accommodate 16 chars, including
      the terminating NUL, to store the sprintf output.  It's common for a cma
      device name to be longer than 15 chars, which can cause stack
      corruption.  If the gcc stack protector is turned on, this can cause a
      panic due to stack corruption.
      
      Below is one example trace:
      
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:
        ffffff8e69a75730
        Call trace:
           dump_backtrace+0x0/0x2c4
           show_stack+0x20/0x28
           dump_stack+0xb8/0xf4
           panic+0x154/0x2b0
           print_tainted+0x0/0xc0
           cma_debugfs_init+0x274/0x290
           do_one_initcall+0x5c/0x168
           kernel_init_freeable+0x1c8/0x280
      
      Fix the short sprintf buffer in cma_debugfs_add_one() by using
      scnprintf() instead of sprintf().
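
      A user-space sketch of the bounded formatting this fix relies on.
      scnprintf() is kernel-only, so scnprintf_like() below is a hypothetical
      stand-in built on vsnprintf(), and the "cma-%s" format string and the
      helper names are assumptions for illustration:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format into buf without ever writing more than size bytes, and return
 * the number of characters actually stored (excluding the NUL), which is
 * the behaviour the kernel's scnprintf() is documented to have.
 */
static int scnprintf_like(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0) {
		buf[0] = '\0';
		return 0;
	}
	if ((size_t)n >= size)
		return (int)(size - 1);	/* output was truncated */
	return n;
}

/* Build a debugfs-style name in a 16-byte buffer; long names are cut. */
static size_t format_cma_name(char out[16], const char *dev_name)
{
	return (size_t)scnprintf_like(out, 16, "cma-%s", dev_name);
}
```

      With a 16-byte buffer, a long device name is truncated to 15 characters
      plus the terminating NUL instead of overrunning the stack.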
      
      Link: http://lkml.kernel.org/r/1502446217-21840-1-git-send-email-guptap@codeaurora.org
      Fixes: f318dd08 ("cma: Store a name in the cma structure")
      Signed-off-by: Prakash Gupta <guptap@codeaurora.org>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da094e42
    • M
      mm, oom: fix potential data corruption when oom_reaper races with writer · 6b31d595
      Michal Hocko committed
      Wenwei Tao has noticed that our current assumption that the oom victim
      is dying and never doing any visible changes after it dies, and so the
      oom_reaper can tear it down, is not entirely true.
      
      __task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT is set
      but do_group_exit sends SIGKILL to all threads _after_ the flag is set.
      So there is a race window when some threads won't have
      fatal_signal_pending while the oom_reaper could start unmapping the
      address space.  Moreover some paths might not check for fatal signals
      before each PF/g-u-p/copy_from_user.
      
      We already have a protection for oom_reaper vs.  PF races by checking
      MMF_UNSTABLE.  This has been, however, checked only for kernel threads
      (use_mm users) which can outlive the oom victim.  A simple fix would be
      to extend the current check in handle_mm_fault for all tasks but that
      wouldn't be sufficient because the current check assumes that a kernel
      thread would bail out after EFAULT from get_user*/copy_from_user and
      never re-read the same address which would succeed because the PF path
      has established page tables already.  This seems to be the case for the
      only existing use_mm user currently (virtio driver) but it is rather
      fragile in general.
      
      This is even more fragile in general for more complex paths such as
      generic_perform_write which can re-read the same address more times
      (e.g.  iov_iter_copy_from_user_atomic to fail and then
      iov_iter_fault_in_readable on retry).
      
      Therefore we have to implement MMF_UNSTABLE protection in a robust way
      and never make potentially corrupted content visible.  That requires
      hooking deeper into the PF path and checking for the flag _every time_
      before a pte for anonymous memory is established (that means all
      !VM_SHARED mappings).
      
      The corruption can be triggered artificially
      (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
      but there doesn't seem to be any real life bug report.  The race window
      should be quite tight to trigger most of the time.
      
      Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Wenwei Tao <wenwei.tww@alibaba-inc.com>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Arcangeli <andrea@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b31d595
    • M
      mm: fix double mmap_sem unlock on MMF_UNSTABLE enforced SIGBUS · 5b53a6ea
      Michal Hocko committed
      Tetsuo Handa has noticed that the MMF_UNSTABLE SIGBUS path in
      handle_mm_fault causes a lockdep splat:
      
        Out of memory: Kill process 1056 (a.out) score 603 or sacrifice child
        Killed process 1056 (a.out) total-vm:4268108kB, anon-rss:2246048kB, file-rss:0kB, shmem-rss:0kB
        a.out (1169) used greatest stack depth: 11664 bytes left
        DEBUG_LOCKS_WARN_ON(depth <= 0)
        ------------[ cut here ]------------
        WARNING: CPU: 6 PID: 1339 at kernel/locking/lockdep.c:3617 lock_release+0x172/0x1e0
        CPU: 6 PID: 1339 Comm: a.out Not tainted 4.13.0-rc3-next-20170803+ #142
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
        RIP: 0010:lock_release+0x172/0x1e0
        Call Trace:
           up_read+0x1a/0x40
           __do_page_fault+0x28e/0x4c0
           do_page_fault+0x30/0x80
           page_fault+0x28/0x30
      
      The reason is that the page fault path might have dropped the mmap_sem
      and returned with VM_FAULT_RETRY.  The MMF_UNSTABLE check, however,
      rewrites the error path to VM_FAULT_SIGBUS, and we always expect
      mmap_sem to be taken in that path.  Fix this by taking mmap_sem when
      VM_FAULT_RETRY is returned in the MMF_UNSTABLE path.
      
      We cannot simply add VM_FAULT_SIGBUS to the existing error code because
      all arch specific page fault handlers and g-u-p would have to learn a
      new error code combination.
      
      Link: http://lkml.kernel.org/r/20170807113839.16695-2-mhocko@kernel.org
      Fixes: 3f70dc38 ("mm: make sure that kthreads will not refault oom reaped memory")
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <andrea@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Wenwei Tao <wenwei.tww@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>	[4.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b53a6ea
    • V
      slub: fix per memcg cache leak on css offline · f6ba4880
      Vladimir Davydov committed
      To avoid a possible deadlock, sysfs_slab_remove() schedules an
      asynchronous work to delete sysfs entries corresponding to the kmem
      cache.  To ensure the cache isn't freed before the work function is
      called, it takes a reference to the cache kobject.  The reference is
      supposed to be released by the work function.
      
      However, the work function (sysfs_slab_remove_workfn()) does nothing in
      case the cache sysfs entry has already been deleted, leaking the kobject
      and the corresponding cache.
      
      This may happen on a per memcg cache destruction, because sysfs entries
      of a per memcg cache are deleted on memcg offline if the cache is empty
      (see __kmemcg_cache_deactivate()).
      
      The kmemleak report looks like this:
      
        unreferenced object 0xffff9f798a79f540 (size 32):
          comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.554s)
          hex dump (first 32 bytes):
            6b 6d 61 6c 6c 6f 63 2d 31 36 28 31 35 39 39 3a  kmalloc-16(1599:
            6e 65 77 72 6f 6f 74 29 00 23 6b c0 ff ff ff ff  newroot).#k.....
          backtrace:
             kmemleak_alloc+0x4a/0xa0
             __kmalloc_track_caller+0x148/0x2c0
             kvasprintf+0x66/0xd0
             kasprintf+0x49/0x70
             memcg_create_kmem_cache+0xe6/0x160
             memcg_kmem_cache_create_func+0x20/0x110
             process_one_work+0x205/0x5d0
             worker_thread+0x4e/0x3a0
             kthread+0x109/0x140
             ret_from_fork+0x2a/0x40
        unreferenced object 0xffff9f79b6136840 (size 416):
          comm "kworker/1:4", pid 15416, jiffies 4307432429 (age 28687.573s)
          hex dump (first 32 bytes):
            40 fb 80 c2 3e 33 00 00 00 00 00 40 00 00 00 00  @...>3.....@....
            00 00 00 00 00 00 00 00 10 00 00 00 10 00 00 00  ................
          backtrace:
             kmemleak_alloc+0x4a/0xa0
             kmem_cache_alloc+0x128/0x280
             create_cache+0x3b/0x1e0
             memcg_create_kmem_cache+0x118/0x160
             memcg_kmem_cache_create_func+0x20/0x110
             process_one_work+0x205/0x5d0
             worker_thread+0x4e/0x3a0
             kthread+0x109/0x140
             ret_from_fork+0x2a/0x40
      
      Fix the leak by adding the missing call to kobject_put() to
      sysfs_slab_remove_workfn().
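
      The reference-count imbalance can be modelled in a few lines of
      user-space C. All demo_* names are hypothetical, and the struct is a toy
      stand-in for the cache and its kobject, not the slub code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a cache whose kobject is reference counted. */
struct demo_cache {
	int kobj_refs;			/* references on the kobject */
	bool sysfs_entry_present;	/* false once the entry is deleted */
};

static void demo_kobject_get(struct demo_cache *s)
{
	s->kobj_refs++;
}

static void demo_kobject_put(struct demo_cache *s)
{
	assert(s->kobj_refs > 0);
	s->kobj_refs--;
}

/* Scheduling side: pin the kobject until the work function has run. */
static void demo_schedule_remove(struct demo_cache *s)
{
	demo_kobject_get(s);
}

/* Work function: every exit path must drop the reference taken above. */
static void demo_remove_workfn(struct demo_cache *s)
{
	if (!s->sysfs_entry_present)
		goto out;	/* the leaky version returned here without a put */

	s->sysfs_entry_present = false;	/* stands in for deleting the entry */
out:
	demo_kobject_put(s);
}
```

      Routing the early exit through the same label as the normal path keeps
      the get and the put paired regardless of whether the sysfs entry still
      exists.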
      
      Link: http://lkml.kernel.org/r/20170812181134.25027-1-vdavydov.dev@gmail.com
      Fixes: 3b7b3140 ("slub: make sysfs file removal asynchronous")
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Andrei Vagin <avagin@gmail.com>
      Tested-by: Andrei Vagin <avagin@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[4.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6ba4880
    • P
      mm: discard memblock data later · 3010f876
      Pavel Tatashin committed
      There is an existing use-after-free bug when deferred struct pages are
      enabled:
      
      The memblock_add() allocates memory for the memory array if more than
      128 entries are needed.  See comment in e820__memblock_setup():
      
        * The bootstrap memblock region count maximum is 128 entries
        * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
        * than that - so allow memblock resizing.
      
      This memblock memory is freed here:
              free_low_memory_core_early()
      
      We access the freed memblock.memory later in boot when deferred pages
      are initialized in this path:
      
              deferred_init_memmap()
                      for_each_mem_pfn_range()
                        __next_mem_pfn_range()
                          type = &memblock.memory;
      
      One possible explanation for why this use-after-free hasn't been hit
      before is that the limit of INIT_MEMBLOCK_REGIONS has never been
      exceeded at least on systems where deferred struct pages were enabled.
      
      Tested by reducing INIT_MEMBLOCK_REGIONS down to 4 from the current 128,
      and verifying in qemu that this code is getting executed and that the
      freed pages are sane.
      
      Link: http://lkml.kernel.org/r/1502485554-318703-2-git-send-email-pasha.tatashin@oracle.com
      Fixes: 7e18adb4 ("mm: meminit: initialise remaining struct pages in parallel with kswapd")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3010f876
    • J
      mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() · 739f79fc
      Johannes Weiner committed
      Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
      to update the memcg stats:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
          IP: test_clear_page_writeback+0x12e/0x2c0
          [...]
          RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
          Call Trace:
           <IRQ>
           end_page_writeback+0x47/0x70
           f2fs_write_end_io+0x76/0x180 [f2fs]
           bio_endio+0x9f/0x120
           blk_update_request+0xa8/0x2f0
           scsi_end_request+0x39/0x1d0
           scsi_io_completion+0x211/0x690
           scsi_finish_command+0xd9/0x120
           scsi_softirq_done+0x127/0x150
           __blk_mq_complete_request_remote+0x13/0x20
           flush_smp_call_function_queue+0x56/0x110
           generic_smp_call_function_single_interrupt+0x13/0x30
           smp_call_function_single_interrupt+0x27/0x40
           call_function_single_interrupt+0x89/0x90
          RIP: 0010:native_safe_halt+0x6/0x10
      
          (gdb) l *(test_clear_page_writeback+0x12e)
          0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
          614		mod_node_page_state(page_pgdat(page), idx, val);
          615		if (mem_cgroup_disabled() || !page->mem_cgroup)
          616			return;
          617		mod_memcg_state(page->mem_cgroup, idx, val);
          618		pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
          619		this_cpu_add(pn->lruvec_stat->count[idx], val);
          620	}
          621
          622	unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
          623							gfp_t gfp_mask,
      
      The issue is that writeback doesn't hold a page reference and the page
      might get freed after PG_writeback is cleared (and the mapping is
      unlocked) in test_clear_page_writeback().  The stat functions looking up
      the page's node or zone are safe, as those attributes are static across
      allocation and free cycles.  But page->mem_cgroup is not, and it will
      get cleared if we race with truncation or migration.
      
      It appears this race window has been around for a while, but was less
      likely to trigger when the memcg stats were updated first thing after
      PG_writeback is cleared.  Recent changes reshuffled this code to update
      the global node stats before the memcg ones, though, stretching the race
      window out to an extent where people can reproduce the problem.
      
      Update test_clear_page_writeback() to look up and pin page->mem_cgroup
      before clearing PG_writeback, then not use that pointer afterward.  It
      is a partial revert of 62cccb8c ("mm: simplify lock_page_memcg()")
      but leaves the pageref-holding callsites that aren't affected alone.
      
      Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
      Fixes: 62cccb8c ("mm: simplify lock_page_memcg()")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Reported-by: Bradley Bolen <bradleybolen@gmail.com>
      Tested-by: Brad Bolen <bradleybolen@gmail.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      739f79fc
  9. 17 Aug, 2017 1 commit
    • T
      x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages · ce0fa3e5
      Tony Luck committed
      Speculative processor accesses may reference any memory that has a
      valid page table entry.  While a speculative access won't generate
      a machine check, it will log the error in a machine check bank. That
      could cause escalation of a subsequent error since the overflow bit
      will be then set in the machine check bank status register.
      
      Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual
      address of the page we want to map out, otherwise we may trigger the
      very problem we are trying to avoid.  We use a non-canonical address
      that passes through the usual Linux table walking code to get to the
      same "pte".
      
      Thanks to Dave Hansen for reviewing several iterations of this.
      
      Also see:
      
        http://marc.info/?l=linux-mm&m=149860136413338&w=2

      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ce0fa3e5
  10. 11 Aug, 2017 13 commits
    • P
      mm, locking: Fix up flush_tlb_pending() related merge in do_huge_pmd_numa_page() · ccde85ba
      Peter Zijlstra committed
      Merge commit:
      
        040cca3a ("Merge branch 'linus' into locking/core, to resolve conflicts")
      
      overlooked the fact that do_huge_pmd_numa_page() now does two TLB
      flushes. Commit:
      
        8b1b436d ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
      
      and commit:
      
        a9b80250 ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")
      
      Both moved the TLB flush around, but slightly differently, the end
      result being that what was one flush became two.

      Clean this up.

      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ccde85ba
    • K
      rmap: do not call mmu_notifier_invalidate_page() under ptl · aac2fea9
      Kirill A. Shutemov committed
      MMU notifiers can sleep, but in page_mkclean_one() we call
      mmu_notifier_invalidate_page() under page table lock.
      
      Let's instead use mmu_notifier_invalidate_range() outside
      page_vma_mapped_walk() loop.
      
      [jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl]
        Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com
      Fixes: c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reported-by: axie <axie@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Writer, Tim" <Tim.Writer@amd.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aac2fea9
    • C
      mm: fix list corruptions on shmem shrinklist · d041353d
      Cong Wang committed
      We saw many list corruption warnings on shmem shrinklist:
      
        WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
        list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
          dump_stack+0x4d/0x66
          __warn+0xcb/0xf0
          warn_slowpath_fmt+0x4f/0x60
          __list_del_entry+0x9e/0xc0
          shmem_unused_huge_shrink+0xfa/0x2e0
          shmem_unused_huge_scan+0x20/0x30
          super_cache_scan+0x193/0x1a0
          shrink_slab.part.41+0x1e3/0x3f0
          shrink_slab+0x29/0x30
          shrink_node+0xf9/0x2f0
          kswapd+0x2d8/0x6c0
          kthread+0xd7/0xf0
          ret_from_fork+0x22/0x30
      
        WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
        list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
          dump_stack+0x4d/0x66
          __warn+0xcb/0xf0
          warn_slowpath_fmt+0x4f/0x60
          __list_add+0x89/0xb0
          shmem_setattr+0x204/0x230
          notify_change+0x2ef/0x440
          do_truncate+0x5d/0x90
          path_openat+0x331/0x1190
          do_filp_open+0x7e/0xe0
          do_sys_open+0x123/0x200
          SyS_open+0x1e/0x20
          do_syscall_64+0x61/0x170
          entry_SYSCALL64_slow_path+0x25/0x25
      
      The problem is that shmem_unused_huge_shrink() moves entries from the
      global sbinfo->shrinklist to its local lists and then releases the
      spinlock.  However, a parallel shmem_setattr() could access one of these
      entries directly and add it back to the global shrinklist if it is
      removed, with the spinlock held.
      
      The logic itself looks solid, since an entry is either on a local list
      or on the global list; otherwise it has been removed from one of them by
      list_del_init().  So the race condition is probably that one CPU is in
      the middle of INIT_LIST_HEAD() while the other CPU calls list_empty(),
      which returns true too early, and the following list_add_tail() then
      sees a corrupted entry.
      
      list_empty_careful() is designed to fix this situation.
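
      A single-threaded sketch of why the careful variant helps. The two
      halves of INIT_LIST_HEAD() are split into separate hypothetical helpers
      so the intermediate state can be observed; the real kernel code also
      relies on READ_ONCE()/WRITE_ONCE(), which this model leaves out:

```c
#include <stdbool.h>

/* Minimal doubly linked list node, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Plain emptiness test: looks at the next pointer only. */
static bool list_empty_plain(const struct list_head *head)
{
	return head->next == head;
}

/* Careful test: both next and prev must point back at the head. */
static bool list_empty_careful_demo(const struct list_head *head)
{
	const struct list_head *next = head->next;

	return next == head && next == head->prev;
}

/* First store of a list re-initialisation: next points to itself. */
static void init_list_head_first_half(struct list_head *entry)
{
	entry->next = entry;
}

/* Second store: prev points to itself, completing the re-initialisation. */
static void init_list_head_second_half(struct list_head *entry)
{
	entry->prev = entry;
}
```

      Between the two stores the plain test already reports "empty" while the
      careful test does not, which is the window the paragraph above
      describes.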
      
      [akpm@linux-foundation.org: add comments]
      Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d041353d
    • W
      mm/balloon_compaction.c: don't zero ballooned pages · af54aed9
      Wei Wang committed
      Revert commit bb01b64c ("mm/balloon_compaction.c: enqueue zero page
      to balloon device").

      Zeroing balloon pages is rather time consuming, especially when a lot of
      pages are in flight. E.g. 7GB worth of ballooned memory takes 2.8s with
      __GFP_ZERO while it takes ~491ms without it.

      The original commit argued that zeroing will help ksmd to merge these
      pages on the host, but this argument assumes that the host actually
      marks balloon pages for ksm, which is not universally true.  So we pay
      a performance penalty for something that might not even be used in the
      end, which is wrong.  The host can zero out pages on its own when there
      is a need.
      
      [mhocko@kernel.org: new changelog text]
      Link: http://lkml.kernel.org/r/1501761557-9758-1-git-send-email-wei.w.wang@intel.com
      Fixes: bb01b64c ("mm/balloon_compaction.c: enqueue zero page to balloon device")
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: zhenwei.pi <zhenwei.pi@youruncloud.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af54aed9
    • M
      mm: fix KSM data corruption · b3a81d08
      Minchan Kim committed
      Nadav reported that KSM can corrupt user data through the TLB batching
      race[1].  That means data written by the user can be lost.
      
      Quote from Nadav Amit:
       "For this race we need 4 CPUs:
      
        CPU0: Caches a writable and dirty PTE entry, and uses the stale value
        for write later.
      
        CPU1: Runs madvise_free on the range that includes the PTE. It would
        clear the dirty-bit. It batches TLB flushes.
      
        CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
        We care about the fact that it clears the PTE write-bit, and of
        course, batches TLB flushes.
      
        CPU3: Runs KSM. Our purpose is to pass the following test in
        write_protect_page():
      
      	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
      	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
      
        Since it will avoid TLB flush. And we want to do it while the PTE is
        stale. Later, and before replacing the page, we would be able to
        change the page.
      
        Note that all the operations the CPU1-3 perform can happen in parallel
        since they only acquire mmap_sem for read.
      
        We start with two identical pages. Everything below regards the same
        page/PTE.
      
        CPU0        CPU1        CPU2        CPU3
        ----        ----        ----        ----
        Write the same
        value on page
      
        [cache PTE as
         dirty in TLB]
      
                    MADV_FREE
                    pte_mkclean()
      
                                4 > clear_refs
                                pte_wrprotect()
      
                                            write_protect_page()
                                            [ success, no flush ]
      
                                            pages_identical()
                                            [ ok ]
      
        Write to page
        different value
      
        [Ok, using stale
         PTE]
      
                                            replace_page()
      
        Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
        CPU0 already wrote on the page, but KSM ignored this write, and it got
        lost"
      
      In the above scenario, MADV_FREE is fixed by changing the TLB batching
      API, including [set|clear]_tlb_flush_pending.  What remains is the
      soft-dirty part.
      
      This patch makes soft-dirty use the TLB batching API instead of
      flush_tlb_mm, and makes KSM check for a pending TLB flush with
      mm_tlb_flush_pending, so that KSM flushes the TLB and avoids data loss
      when other parallel threads still have TLB flushes pending.
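      
      The KSM-side check described above can be modelled in user space.  This
      is a minimal sketch, not the kernel code: every toy_* name and structure
      is hypothetical, and a C11 atomic counter stands in for the
      pending-flush state kept in mm_struct.  It shows why a PTE that already
      looks clean and read-only must still trigger a flush while another
      thread has batched flushes outstanding.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins for kernel state; all names here are illustrative only. */
struct toy_mm {
    atomic_int tlb_flush_pending;   /* batched flushes still in flight */
};

struct toy_pte {
    bool write;
    bool dirty;
};

/* A thread that batches TLB flushes brackets its PTE updates with these. */
static void toy_inc_tlb_flush_pending(struct toy_mm *mm)
{
    atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

static void toy_dec_tlb_flush_pending(struct toy_mm *mm)
{
    atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}

static bool toy_mm_tlb_flush_pending(struct toy_mm *mm)
{
    return atomic_load(&mm->tlb_flush_pending) > 0;
}

/*
 * Decide whether write-protecting a page must flush the TLB.
 * The PTE bits alone are not enough: another thread may already have
 * cleared them while its own flush is still only batched.
 */
static bool toy_needs_flush(struct toy_mm *mm, const struct toy_pte *pte)
{
    return pte->write || pte->dirty || toy_mm_tlb_flush_pending(mm);
}
```

      With the counter at zero a clean, read-only PTE needs no flush; once any
      thread has called toy_inc_tlb_flush_pending() the same PTE reports that
      a flush is required until the matching decrement.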
      
      [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Reported-by: Nadav Amit <namit@vmware.com>
      Tested-by: Nadav Amit <namit@vmware.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3a81d08
    • M
      mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Minchan Kim committed
      Nadav reported that parallel MADV_DONTNEED calls on the same range have
      a stale TLB problem; Mel fixed it [1] and found the same problem with
      MADV_FREE [2].
      
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        flush.
      
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
        happening."
      
      This patch aims to solve both problems at once and also prepares for
      the remaining problem involving KSM, MADV_FREE and soft-dirty [3].
      
      The TLB batch API (tlb_[gather|finish]_mmu) uses
      [inc|dec]_tlb_flush_pending and mm_tlb_flush_pending so that when
      tlb_finish_mmu is called, we can detect that parallel threads are in
      flight.  In that case the TLB is flushed forcefully, to prevent the user
      from accessing memory via a stale TLB entry even though this thread
      failed to gather any page table entry.
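      
      The forced flush can be sketched as a user-space model.  This is an
      illustration under stated assumptions, not the kernel implementation:
      the toy_* names are hypothetical, "flushes" is only a counter standing
      in for a real TLB flush, and "more than one batch in flight" is assumed
      to be the condition for forcing the flush.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_mm {
    atomic_int flush_pending;   /* batching threads currently active */
    atomic_int flushes;         /* flushes issued; illustration only */
};

struct toy_gather {
    struct toy_mm *mm;
    bool gathered;              /* true if this thread changed any PTE */
};

/* Model of starting a batch: announce that a flush will be pending. */
static void toy_gather_begin(struct toy_gather *tlb, struct toy_mm *mm)
{
    tlb->mm = mm;
    tlb->gathered = false;
    atomic_fetch_add(&mm->flush_pending, 1);
}

/* Model of finishing a batch: flush if needed, then drop the count. */
static void toy_gather_end(struct toy_gather *tlb)
{
    /*
     * More than one active batch means another thread may already have
     * changed the PTEs this thread skipped, so flush even if nothing
     * was gathered here.
     */
    bool force = atomic_load(&tlb->mm->flush_pending) > 1;

    if (tlb->gathered || force)
        atomic_fetch_add(&tlb->mm->flushes, 1);

    atomic_fetch_sub(&tlb->mm->flush_pending, 1);
}
```

      A batch that runs alone and gathers nothing issues no flush; a batch
      that gathers nothing but ends while a second batch is still open issues
      one anyway.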
      
      I confirmed that this patch works with the test program Nadav
      provided [4], so this patch supersedes "mm: Always flush VMA ranges
      affected by zap_page_range v2" in current mmotm.
      
      NOTE:
      
      This patch modifies the arch-specific TLB gathering interface (x86,
      ia64, s390, sh, um).  Most architectures seem straightforward, but
      s390 needs care because tlb_flush_mmu works only if
      mm->context.flush_mm is set to non-zero, which happens only when a pte
      entry really is cleared by ptep_get_and_clear and friends.  In this
      problem, however, the pte entries are never changed, yet a flush is
      still needed to prevent memory access through a stale TLB entry.
      
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
      Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Reported-by: Nadav Amit <namit@vmware.com>
      Reported-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99baac21
    • M
      mm: make tlb_flush_pending global · 0a2dd266
      Minchan Kim committed
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION], but upcoming patches that solve a subtle TLB flush batching
      problem will use it regardless of compaction/NUMA, so this patch
      removes the dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a2dd266
    • M
      mm: refactor TLB gathering API · 56236a59
      Minchan Kim committed
      This is a preparatory patch for solving race problems caused by TLB
      batching.  For that, we will increase/decrease the TLB flush pending
      count of mm_struct whenever tlb_[gather|finish]_mmu is called.
      
      To keep that change simple, this patch first separates the
      architecture-specific part, renames it to arch_tlb_[gather|finish]_mmu,
      and has the generic part simply call it.
      
      It shouldn't change any behavior.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56236a59
    • N
      Revert "mm: numa: defer TLB flush for THP migration as long as possible" · a9b80250
      Nadav Amit committed
      While deferring TLB flushes is good practice, the reverted patch
      caused pending TLB flushes to be checked while the page-table lock is
      not taken.  As a result, on architectures with a weak memory model
      (PPC), Linux may miss a memory barrier, miss the fact that TLB flushes
      are pending, and cause (in theory) memory corruption.
      
      Since the alternative of using smp_mb__after_unlock_lock() was
      considered a bit open-coded, and the performance impact is expected to
      be small, the previous patch is reverted.
      
      This reverts b0943d61 ("mm: numa: defer TLB flush for THP migration
      as long as possible").
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.com
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Suggested-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9b80250
    • N
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit committed
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races caused by batching during reclamation were recently
      handled by Mel, and this patch-set deals with the others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
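      
      The lost update on a shared flag can be replayed sequentially in user
      space.  This is a sketch under the assumption that the flag is replaced
      by a counter; the toy_* names are hypothetical and nothing here is the
      kernel's code.  Two threads A and B both announce a pending flush, then
      A finishes first.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Boolean flag: the first thread to finish clears it for everyone. */
struct toy_flag_mm {
    atomic_bool pending;
};

static void toy_flag_set(struct toy_flag_mm *mm)
{
    atomic_store(&mm->pending, true);
}

static void toy_flag_clear(struct toy_flag_mm *mm)
{
    atomic_store(&mm->pending, false);
}

static bool toy_flag_pending(struct toy_flag_mm *mm)
{
    return atomic_load(&mm->pending);
}

/* Counter: stays non-zero until the last batching thread is done. */
struct toy_count_mm {
    atomic_int pending;
};

static void toy_count_inc(struct toy_count_mm *mm)
{
    atomic_fetch_add(&mm->pending, 1);
}

static void toy_count_dec(struct toy_count_mm *mm)
{
    atomic_fetch_sub(&mm->pending, 1);
}

static bool toy_count_pending(struct toy_count_mm *mm)
{
    return atomic_load(&mm->pending) > 0;
}
```

      Replaying "A sets, B sets, A clears" leaves the flag false although B
      is still batching, whereas "A inc, B inc, A dec" leaves the counter at
      one, so a reader still sees the pending flush.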
      
      Actual data corruption was not observed, yet the race was confirmed by
      adding an assertion that tlb_flush_pending is not set by two threads,
      adding artificial latency in change_protection_range(), and using
      sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and
      change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16af97dc
    • A
      userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case · 5af10dfd
      Andrea Arcangeli committed
      huge_add_to_page_cache->add_to_page_cache implicitly unlocks the page
      before returning in case of errors.
      
      The error returned was -EEXIST, produced by running UFFDIO_COPY on a
      non-hole offset of a VM_SHARED hugetlbfs mapping.  A userland bug
      triggered it, and the kernel must cope with it by returning -EEXIST
      from ioctl(UFFDIO_COPY) as expected.
      
        page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
        kernel BUG at mm/filemap.c:964!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 22582 Comm: qemu-system-x86 Not tainted 4.11.11-300.fc26.x86_64 #1
        RIP: unlock_page+0x4a/0x50
        Call Trace:
          hugetlb_mcopy_atomic_pte+0xc0/0x320
          mcopy_atomic+0x96f/0xbe0
          userfaultfd_ioctl+0x218/0xe90
          do_vfs_ioctl+0xa5/0x600
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x1a/0xa9
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5af10dfd
    • J
      mm: ratelimit PFNs busy info message · 75dddef3
      Jonathan Toppins committed
      The RDMA subsystem can generate several thousand of these messages per
      second eventually leading to a kernel crash.  Ratelimit these messages
      to prevent this crash.
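      
      The idea of rate limiting can be sketched as a fixed-window limiter.
      This is an illustrative user-space model, not the kernel's ratelimit
      code: the toy_ratelimit structure, its field names and the use of a
      caller-supplied tick value are all assumptions made for the example.

```c
#include <stdbool.h>

struct toy_ratelimit {
    long interval;   /* window length, in caller-defined ticks */
    int burst;       /* messages allowed per window */
    long begin;      /* tick at which the current window started */
    int printed;     /* messages emitted in the current window */
    int missed;      /* messages suppressed in the current window */
};

/* Return true if the caller may emit a message at tick "now". */
static bool toy_ratelimit_ok(struct toy_ratelimit *rs, long now)
{
    /* Start a fresh window once the old one has expired. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }

    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }

    /* Over budget: drop the message but remember that we did. */
    rs->missed++;
    return false;
}
```

      With interval = 5 and burst = 2, only the first two calls inside a
      window succeed; later calls in that window are counted as missed until
      the window rolls over, which bounds the logging cost no matter how
      often the condition fires.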
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the printk has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
      Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: Doug Ledford <dledford@redhat.com>
      Tested-by: Doug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75dddef3
    • J
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner committed
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters in OOM/allocation failure info dumps, can cause early -ENOMEM
      from overcommit protection, and can miscalculate image size
      requirements during suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
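      
      The mismatch can be illustrated with a small model.  This is a sketch
      with hypothetical toy_* names and made-up page counts, not the vmstat
      code: slab pages are accounted per node, while the stale reader still
      sums the per-zone slots that nothing updates any more.

```c
#define TOY_NR_NODES        2
#define TOY_ZONES_PER_NODE  2

struct toy_node {
    long node_slab_pages;                      /* where slab is counted now */
    long zone_slab_pages[TOY_ZONES_PER_NODE];  /* old location, never updated */
};

/* Stale accessor: aggregates the zone-level slots. */
static long toy_global_zone_slab(const struct toy_node *nodes)
{
    long sum = 0;

    for (int n = 0; n < TOY_NR_NODES; n++)
        for (int z = 0; z < TOY_ZONES_PER_NODE; z++)
            sum += nodes[n].zone_slab_pages[z];
    return sum;
}

/* Corrected accessor: aggregates the node-level counters. */
static long toy_global_node_slab(const struct toy_node *nodes)
{
    long sum = 0;

    for (int n = 0; n < TOY_NR_NODES; n++)
        sum += nodes[n].node_slab_pages;
    return sum;
}
```

      With 100 and 50 slab pages accounted on the two nodes, the zone-level
      sum is 0 and the node-level sum is 150, which mirrors the "nearly 0kB"
      symptom quoted above.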
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
  11. 10 Aug, 2017 2 commits
    • P
      locking/lockdep: Rework FS_RECLAIM annotation · d92a8cfc
      Peter Zijlstra committed
      A while ago someone, and I cannot find the email just now, asked if we
      could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
      like we use for other things like workqueues etc. I think this should
      be possible which allows reducing the 'irq' states and will reduce the
      amount of __bfs() lookups we do.
      
      Removing the 1 IRQ state results in 4 fewer __bfs() walks per
      dependency, improving lockdep performance. And by moving this
      annotation out of the lockdep code it becomes easier for the mm people
      to extend.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: iamjoonsoo.kim@lge.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d92a8cfc
    • P
      mm, locking: Rework {set,clear,mm}_tlb_flush_pending() · 8b1b436d
      Peter Zijlstra committed
      Commit:
      
        af2c1401 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
      
      added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
      can solve the same problem without this barrier.
      
      If instead we mandate that mm_tlb_flush_pending() is used while
      holding the PTL we're guaranteed to observe prior
      set_tlb_flush_pending() instances.
      
      For this to work we need to rework migrate_misplaced_transhuge_page()
      a little and move the test up into do_huge_pmd_numa_page().
      
      NOTE: this relies on flush_tlb_range() to guarantee:
      
         (1) it ensures that prior page table updates are visible to the
             page table walker and
         (2) it ensures that subsequent memory accesses are only made
             visible after the invalidation has completed
      
      This is required for architectures that implement TRANSPARENT_HUGEPAGE
      (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
      mm_tlb_flush_pending() in their page-table operations (arm, arm64,
      x86).
      
      This appears true for:
      
       - arm (DSB ISB before and after),
       - arm64 (DSB ISHST before, and DSB ISH after),
       - powerpc (PTESYNC before and after),
       - s390 and x86 TLB invalidate are serializing instructions
      
      But I failed to understand the situation for:
      
       - arc, mips, sparc
      
      Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
      and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
      inside the PTL. It still needs to guarantee the PTL unlock happens
      _after_ the invalidate completes.
      
      Vineet, Ralf and Dave could you guys please have a look?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8b1b436d