1. 23 Feb 2017, 1 commit
  2. 04 Feb 2017, 1 commit
    • shmem: fix sleeping from atomic context · 253fd0f0
      Kirill A. Shutemov committed
      Syzkaller fuzzer managed to trigger this:
      
          BUG: sleeping function called from invalid context at mm/shmem.c:852
          in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged
          3 locks held by khugepaged/529:
           #0:  (shrinker_rwsem){++++..}, at: [<ffffffff818d7ef1>] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451
           #1:  (&type->s_umount_key#29){++++..}, at: [<ffffffff81a63630>] trylock_super+0x20/0x100 fs/super.c:392
           #2:  (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] spin_lock include/linux/spinlock.h:302 [inline]
           #2:  (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427
          CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ #201
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Call Trace:
             shmem_undo_range+0xb20/0x2710 mm/shmem.c:852
             shmem_truncate_range+0x27/0xa0 mm/shmem.c:939
             shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030
             evict+0x46e/0x980 fs/inode.c:553
             iput_final fs/inode.c:1515 [inline]
             iput+0x589/0xb20 fs/inode.c:1542
             shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446
             shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512
             super_cache_scan+0x376/0x450 fs/super.c:106
             do_shrink_slab mm/vmscan.c:378 [inline]
             shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481
             shrink_slab mm/vmscan.c:2592 [inline]
             shrink_node+0x2c7/0x870 mm/vmscan.c:2592
             shrink_zones mm/vmscan.c:2734 [inline]
             do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776
             try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982
             __perform_reclaim mm/page_alloc.c:3301 [inline]
             __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline]
             __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683
             __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848
             __alloc_pages include/linux/gfp.h:426 [inline]
             __alloc_pages_node include/linux/gfp.h:439 [inline]
             khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750
             collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955
             khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208
             khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline]
             khugepaged_do_scan mm/khugepaged.c:1808 [inline]
             khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853
             kthread+0x326/0x3f0 kernel/kthread.c:227
             ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
      
      The iput() from atomic context was a bad idea: if, after igrab(),
      somebody else calls iput() and we are left with the last inode
      reference, our iput() would lead to inode eviction and therefore to
      sleeping.
      
      This patch should fix the situation.
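
      As an aside, a minimal sketch of the pattern such a fix follows (the
      function name and exact structure here are illustrative, not the literal
      diff): detach the entry under shrinklist_lock, drop the lock, and only
      then call iput().

          /*
           * Illustrative sketch only: the final iput() may evict the inode
           * and sleep, so it must not run under shrinklist_lock.
           */
          static void shmem_shrinklist_drop_one(struct shmem_sb_info *sbinfo)
          {
              struct shmem_inode_info *info;
              struct inode *inode = NULL;

              spin_lock(&sbinfo->shrinklist_lock);
              if (!list_empty(&sbinfo->shrinklist)) {
                  info = list_first_entry(&sbinfo->shrinklist,
                                          struct shmem_inode_info, shrinklist);
                  inode = igrab(&info->vfs_inode); /* NULL if already being evicted */
                  list_del_init(&info->shrinklist);
              }
              spin_unlock(&sbinfo->shrinklist_lock);

              if (inode)
                  iput(inode);    /* may sleep in evict(); no spinlock held here */
          }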
      
      Link: http://lkml.kernel.org/r/20170131093141.GA15899@node.shutemov.name
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 25 Dec 2016, 1 commit
  4. 15 Dec 2016, 2 commits
  5. 13 Dec 2016, 4 commits
  6. 09 Dec 2016, 1 commit
  7. 07 Dec 2016, 1 commit
    • shmem: fix shm fallocate() list corruption · 10d20bd2
      Linus Torvalds committed
      The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
      want to race with generating new pages by faulting them in.
      
      However, the wait-queue used to delay the page faulting has a serious
      problem: the wait queue head (in shmem_fallocate()) is allocated on the
      stack, and the code expects that "wake_up_all()" will make sure that all
      the queue entries are gone before the stack frame is de-allocated.
      
      And that is not at all necessarily the case.
      
      Yes, a normal wake-up sequence will remove the wait-queue entry that
      caused the wakeup (see "autoremove_wake_function()"), but the key
      wording there is "that caused the wakeup".  When there are multiple
      possible wakeup sources, the wait queue entry may well stay around.
      
      And _particularly_ in a page fault path, we may be faulting in new pages
      from user space while we also have other things going on, and there may
      well be other pending wakeups.
      
      So despite the "wake_up_all()", it's not at all guaranteed that all list
      entries are removed from the wait queue head on the stack.
      
      Fix this by introducing a new wakeup function that removes the list
      entry unconditionally, even if the target process had already woken up
      for other reasons.  Use that "synchronous" function to set up the
      waiters in shmem_fault().
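
      For illustration, a wake function of the kind described above looks
      roughly like this (a sketch assuming the 4.9-era wait_queue_t /
      task_list naming; details may differ from the actual patch):

          /*
           * Unlike autoremove_wake_function(), remove the wait queue entry
           * unconditionally, even if the task had already been woken for some
           * other reason, so nothing can be left pointing at the on-stack
           * wait queue head in shmem_fallocate().
           */
          static int synchronous_wake_function(wait_queue_t *wait, unsigned mode,
                                               int sync, void *key)
          {
              int ret = default_wake_function(wait, mode, sync, key);

              list_del_init(&wait->task_list);
              return ret;
          }

      shmem_fault() would then register its waiter with
      DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function) rather
      than with the autoremove variant.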
      
      This problem has never been seen in the wild afaik, but Dave Jones has
      reported it on and off while running trinity.  We thought we fixed the
      stack corruption with the blk-mq rq_list locking fix (commit
      7fe31130: "blk-mq: update hardware and software queues for sleeping
      alloc"), but it turns out there was _another_ stack corruptor hiding
      in the trinity runs.
      
      Vegard Nossum (also running trinity) was able to trigger this one fairly
      consistently, and made us look once again at the shmem code due to the
      faults often being in that area.
      
      Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 12 Nov 2016, 1 commit
  9. 08 Oct 2016, 2 commits
  10. 06 Oct 2016, 1 commit
  11. 28 Sep 2016, 1 commit
  12. 27 Sep 2016, 1 commit
  13. 25 Sep 2016, 2 commits
  14. 22 Sep 2016, 1 commit
  15. 11 Aug 2016, 1 commit
  16. 04 Aug 2016, 1 commit
  17. 29 Jul 2016, 1 commit
  18. 27 Jul 2016, 8 commits
  19. 11 Jul 2016, 1 commit
    • tmpfs: fix regression hang in fallocate undo · 7f556567
      Hugh Dickins committed
      The well-spotted fallocate undo fix is good in most cases, but not when
      fallocate failed on the very first page.  index 0 then passes lend -1
      to shmem_undo_range(), and that has two bad effects: (a) it will
      undo every fallocation throughout the file, unrestricted by the current
      range; but more importantly (b) it can cause the undo to hang, because
      lend -1 is treated as truncation, which makes it keep on retrying until
      every page has gone, but those already fully instantiated will never go
      away.  Big thank you to xfstests generic/269 which demonstrates this.
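
      The shape of the fix is roughly the following (illustrative, using the
      variable names of shmem_fallocate()): only undo when at least one page
      was actually instantiated, so lend can never become -1.

          if (error) {
              /* Remove the !PageUptodate pages we added */
              if (index > start) {
                  shmem_undo_range(inode,
                                   (loff_t)start << PAGE_SHIFT,
                                   ((loff_t)index << PAGE_SHIFT) - 1, true);
              }
              goto undone;
          }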
      
      Fixes: b9b4bb26 ("tmpfs: don't undo fallocate past its last page")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 25 Jun 2016, 1 commit
  21. 28 May 2016, 1 commit
  22. 20 May 2016, 3 commits
    • tmpfs: mem_cgroup charge fault to vm_mm not current mm · 9e18eb29
      Andres Lagar-Cavilla committed
      Although shmem_fault() has been careful to count a major fault to vm_mm,
      shmem_getpage_gfp() has been careless in charging a remote access fault
      to current->mm owner's memcg instead of to vma->vm_mm owner's memcg:
      that is inconsistent with all the mem_cgroup charging on remote access
      faults in mm/memory.c.
      
      Fix it by passing fault_mm along with fault_type to
      shmem_getpage_gfp(); but in that case, now knowing the right mm, it's
      better for it to handle the PGMAJFAULT updates itself.
      
      And let's keep this clutter out of most callers' way: change the common
      shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp.
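
      A sketch of what the slimmed-down wrapper then looks like (signature
      details are illustrative): common callers keep the short form, and only
      the fault path supplies fault_mm and fault_type.

          static inline int shmem_getpage(struct inode *inode, pgoff_t index,
                  struct page **pagep, enum sgp_type sgp)
          {
              /* Most callers need neither gfp nor the fault bookkeeping. */
              return shmem_getpage_gfp(inode, index, pagep, sgp,
                      mapping_gfp_mask(inode->i_mapping), NULL, NULL);
          }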
      Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: preliminary minor tidyups · 75edd345
      Hugh Dickins committed
      Make a few cleanups in mm/shmem.c, before going on to complicate it.
      
      shmem_alloc_page() will become more complicated: we can't afford to
      have that complication duplicated between a CONFIG_NUMA version and a
      !CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a
      single shmem_swapin() and a single shmem_alloc_page().
      
      Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
      configurations, but eliminating it is a larger cleanup: I have an
      alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and
      bug-prone, and changed yet again since my last version.
      
      Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to
      shmem_alloc_page(): that SwapBacked flag will be useful in future, to
      help to distinguish different cases appropriately.
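
      Sketched out, the flag move looks something like this (the pseudo-vma
      and NUMA plumbing are elided and the allocation call simplified):

          static struct page *shmem_alloc_page(gfp_t gfp,
                  struct shmem_inode_info *info, pgoff_t index)
          {
              struct page *page;

              /* The real code allocates through a pseudo-vma so mempolicy is
               * honoured; a plain alloc_page() stands in for that here. */
              page = alloc_page(gfp);
              if (page) {
                  /* Freshly allocated, not yet visible to other users, so
                   * the non-atomic __Set variants are safe and cheaper. */
                  __SetPageLocked(page);
                  __SetPageSwapBacked(page);
              }
              return page;
          }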
      
      And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
      little use (IIRC it dates back to when shmem_getpage() returned the page
      unlocked): kill it and do the necessary in shmem_file_read_iter().
      
      But an arm64 build then complained that info may be uninitialized (where
      shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and
      advancing to an "sgp <= SGP_CACHE" test jogged it back to reality.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use __SetPageSwapBacked and dont ClearPageSwapBacked · fa9949da
      Hugh Dickins committed
      v3.16 commit 07a42788 ("mm: shmem: avoid atomic operation during
      shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
      by __SetPageSwapBacked, pointing out that the newly allocated page is
      not yet visible to other users (except speculative
      get_page_unless_zero-ers, who may not update page flags before their
      further checks).
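
      For reference, the difference behind the two macro families (simplified
      from the SETPAGEFLAG/__SETPAGEFLAG macros in include/linux/page-flags.h):

          static inline void SetPageSwapBacked(struct page *page)
          {
              set_bit(PG_swapbacked, &page->flags);    /* atomic RMW */
          }

          static inline void __SetPageSwapBacked(struct page *page)
          {
              __set_bit(PG_swapbacked, &page->flags);  /* plain non-atomic store */
          }

      On a page nobody else can see yet, the non-atomic store is all that is
      needed, which is the whole point of the conversion.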
      
      That was part of a series in which Mel was focused on tmpfs profiles:
      but almost all SetPageSwapBacked uses can be so optimized, with the same
      justification.
      
      Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
      it's not an error to free a page with PG_swapbacked set.
      
      Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
      doing it differently in different places; but that's for tidiness - if
      the ordering actually mattered, we should not be using the __variants.
      
      There's probably scope for further __SetPageFlags in other places, but
      SwapBacked is the one I'm interested in at the moment.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 03 May 2016, 1 commit
    • parallel lookups machinery, part 2 · 84e710da
      Al Viro committed
      We'll need to verify that there's neither a hashed nor an in-lookup
      dentry with the desired parent/name before adding to the in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we have to
      watch out for the following scenario:
      	* we verify that there's no hashed match
      	* an existing in-lookup match gets hashed by another process
      	* we verify that there are no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock; it would be less painful in terms
      of contention, but for __d_add() it would be rather inconvenient to
      grab.  We could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
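
      A sketch of the writer side described above (helper and field names
      here are illustrative and may differ from the patch):

          static unsigned start_dir_add(struct inode *dir)
          {
              for (;;) {
                  unsigned n = dir->i_dir_seq;

                  /* Only an even (unlocked) value may be bumped to odd. */
                  if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
                      return n;
                  cpu_relax();
              }
          }

          static void end_dir_add(struct inode *dir, unsigned n)
          {
              /* Release store back to even: a reader that sees the new value
               * also sees the hashing done in between. */
              smp_store_release(&dir->i_dir_seq, n + 2);
          }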
      
      This commit adds the counter and the updating logic; the readers will be
      added in the next commit.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  24. 11 Apr 2016, 1 commit
  25. 05 Apr 2016, 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to
      implement the page cache with chunks bigger than PAGE_SIZE.

      That promise never materialized, and likely never will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it's a constant source of confusion about whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward; a short before/after
      illustration follows the list:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
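
      For instance, an illustrative fragment (not an actual hunk from the
      patch) goes from

          index = pos >> PAGE_CACHE_SHIFT;
          offset = pos & ~PAGE_CACHE_MASK;
          page_cache_get(page);
          page_cache_release(page);

      to

          index = pos >> PAGE_SHIFT;
          offset = pos & ~PAGE_MASK;
          get_page(page);
          put_page(page);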
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>