1. 15 December 2016, 1 commit
    • radix-tree: improve multiorder iterators · 148deab2
      Committed by Matthew Wilcox
      This fixes several interlinked problems with the iterators in the
      presence of multiorder entries.
      
      1. radix_tree_iter_next() would only advance by one slot, which would
         result in the iterators returning the same entry more than once if
         there were sibling entries.
      
      2. radix_tree_next_slot() could return an internal pointer instead of
         a user pointer if a tagged multiorder entry was immediately followed by
         an entry of lower order.
      
      3. radix_tree_next_slot() expanded to a lot more code than it used to
         when multiorder support was compiled in.  And I wasn't comfortable with
         entry_to_node() being in a header file.
      
      Fixing radix_tree_iter_next() for the presence of sibling entries
      necessarily involves examining the contents of the radix tree, so we now
      need to pass 'slot' to radix_tree_iter_next(), and we need to change the
      calling convention so it is called *before* dropping the lock which
      protects the tree.  Also rename it to radix_tree_iter_resume(), as some
      people thought it was necessary to call radix_tree_iter_next() each time
      around the loop.
      
      radix_tree_next_slot() becomes closer to how it looked before multiorder
      support was introduced.  It only checks to see if the next entry in the
      chunk is a sibling entry or a pointer to a node; this should be rare
      enough that handling this case out of line is not a performance impact
      (and such impact is amortised by the fact that the entry we just
      processed was a multiorder entry).  Also, radix_tree_next_slot() used to
      force a new chunk lookup for untagged entries, which is more expensive
      than the out of line sibling entry skipping.
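      
      A minimal usage sketch of the new convention, for illustration only:
      the mapping, the start index and the loop body are assumed from
      typical callers of that era and are not part of this commit.
      
          void **slot;
          struct radix_tree_iter iter;
      
          rcu_read_lock();
          radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
                  void *entry = radix_tree_deref_slot(slot);
      
                  if (!entry)
                          continue;
                  /* ... examine entry ... */
      
                  if (need_resched()) {
                          /* resume *before* the lock protecting the tree is dropped */
                          slot = radix_tree_iter_resume(slot, &iter);
                          cond_resched_rcu();
                  }
          }
          rcu_read_unlock();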
      
      Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 13 December 2016, 4 commits
  3. 07 December 2016, 1 commit
    • shmem: fix shm fallocate() list corruption · 10d20bd2
      Committed by Linus Torvalds
      The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
      want to race with generating new pages by faulting them in.
      
      However, the wait-queue used to delay the page faulting has a serious
      problem: the wait queue head (in shmem_fallocate()) is allocated on the
      stack, and the code expects that "wake_up_all()" will make sure that all
      the queue entries are gone before the stack frame is de-allocated.
      
      And that is not at all necessarily the case.
      
      Yes, a normal wake-up sequence will remove the wait-queue entry that
      caused the wakeup (see "autoremove_wake_function()"), but the key
      wording there is "that caused the wakeup".  When there are multiple
      possible wakeup sources, the wait queue entry may well stay around.
      
      And _particularly_ in a page fault path, we may be faulting in new pages
      from user space while we also have other things going on, and there may
      well be other pending wakeups.
      
      So despite the "wake_up_all()", it's not at all guaranteed that all list
      entries are removed from the wait queue head on the stack.
      
      Fix this by introducing a new wakeup function that removes the list
      entry unconditionally, even if the target process had already woken up
      for other reasons.  Use that "synchronous" function to set up the
      waiters in shmem_fault().
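      
      A sketch along those lines, with field and helper names as in kernels
      of that era; treat it as illustrative rather than the exact patch:
      
          /*
           * Wake the waiter and unconditionally remove its entry from the
           * wait queue, so that a wait-queue head on the stack of
           * shmem_fallocate() cannot be left pointing at it once
           * wake_up_all() has returned.
           */
          static int synchronous_wake_function(wait_queue_t *wait,
                          unsigned mode, int sync, void *key)
          {
                  int ret = default_wake_function(wait, mode, sync, key);
      
                  list_del_init(&wait->task_list);
                  return ret;
          }
      
      In shmem_fault() the waiter is then set up with something like
      DEFINE_WAIT_FUNC(shmem_fault_wait, synchronous_wake_function) instead
      of relying on the autoremove default.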
      
      This problem has never been seen in the wild afaik, but Dave Jones has
      reported it on and off while running trinity.  We thought we fixed the
      stack corruption with the blk-mq rq_list locking fix (commit
      7fe31130: "blk-mq: update hardware and software queues for sleeping
      alloc"), but it turns out there was _another_ stack corruptor hiding
      in the trinity runs.
      
      Vegard Nossum (also running trinity) was able to trigger this one fairly
      consistently, and made us look once again at the shmem code due to the
      faults often being in that area.
      
      Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 November 2016, 1 commit
  5. 08 October 2016, 2 commits
  6. 06 October 2016, 1 commit
  7. 28 September 2016, 1 commit
  8. 27 September 2016, 1 commit
  9. 25 September 2016, 2 commits
  10. 22 September 2016, 1 commit
  11. 11 August 2016, 1 commit
  12. 04 August 2016, 1 commit
  13. 29 July 2016, 1 commit
  14. 27 July 2016, 8 commits
  15. 11 July 2016, 1 commit
    • tmpfs: fix regression hang in fallocate undo · 7f556567
      Committed by Hugh Dickins
      The well-spotted fallocate undo fix is good in most cases, but not when
      fallocate failed on the very first page.  index 0 then passes lend -1
      to shmem_undo_range(), and that has two bad effects: (a) it will
      undo every fallocation throughout the file, unrestricted by the current
      range; but more importantly (b) it can cause the undo to hang, because
      lend -1 is treated as truncation, which makes it keep on retrying until
      every page has gone, but those already fully instantiated will never go
      away.  Big thank you to xfstests generic/269 which demonstrates this.
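      
      A hedged sketch of the shape of the fix in shmem_fallocate()'s error
      path; the variable names are assumed from mm/shmem.c of that era:
      
          if (error) {
                  /*
                   * If not even the first page was instantiated, index is
                   * still equal to start and there is nothing to undo; in
                   * particular, index 0 would make the arithmetic below
                   * yield lend == -1, which shmem_undo_range() treats as
                   * truncation of the whole file.
                   */
                  if (index > start)
                          shmem_undo_range(inode,
                                  (loff_t)start << PAGE_SHIFT,
                                  ((loff_t)index << PAGE_SHIFT) - 1, true);
                  goto undone;
          }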
      
      Fixes: b9b4bb26 ("tmpfs: don't undo fallocate past its last page")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 25 June 2016, 1 commit
  17. 28 May 2016, 1 commit
  18. 20 May 2016, 3 commits
    • tmpfs: mem_cgroup charge fault to vm_mm not current mm · 9e18eb29
      Committed by Andres Lagar-Cavilla
      Although shmem_fault() has been careful to count a major fault to vm_mm,
      shmem_getpage_gfp() has been careless in charging a remote access fault
      to current->mm owner's memcg instead of to vma->vm_mm owner's memcg:
      that is inconsistent with all the mem_cgroup charging on remote access
      faults in mm/memory.c.
      
      Fix it by passing fault_mm along with fault_type to
      shmem_getpage_gfp(); but in that case, now knowing the right mm, it's
      better for it to handle the PGMAJFAULT updates itself.
      
      And let's keep this clutter out of most callers' way: change the common
      shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp.
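      
      A sketch of what the slimmed-down common wrapper looks like under this
      scheme; the exact signature is assumed from the description above.
      
          /*
           * Non-faulting callers need neither a special gfp nor the fault
           * bookkeeping, so the wrapper hides all three extra arguments.
           */
          static inline int shmem_getpage(struct inode *inode, pgoff_t index,
                          struct page **pagep, enum sgp_type sgp)
          {
                  return shmem_getpage_gfp(inode, index, pagep, sgp,
                          mapping_gfp_mask(inode->i_mapping), NULL, NULL);
          }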
      Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: preliminary minor tidyups · 75edd345
      Committed by Hugh Dickins
      Make a few cleanups in mm/shmem.c, before going on to complicate it.
      
      shmem_alloc_page() will become more complicated: we can't afford to
      have that complication duplicated between a CONFIG_NUMA version and a
      !CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a
      single shmem_swapin() and a single shmem_alloc_page().
      
      Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
      configurations, but eliminating it is a larger cleanup: I have an
      alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and
      bug-prone, and changed yet again since my last version.
      
      Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to
      shmem_alloc_page(): that SwapBacked flag will be useful in future, to
      help to distinguish different cases appropriately.
      
      And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
      little use (IIRC it dates back to when shmem_getpage() returned the page
      unlocked): kill it and do the necessary in shmem_file_read_iter().
      
      But an arm64 build then complained that info may be uninitialized (where
      shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and
      advancing to an "sgp <= SGP_CACHE" test jogged it back to reality.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use __SetPageSwapBacked and dont ClearPageSwapBacked · fa9949da
      Committed by Hugh Dickins
      v3.16 commit 07a42788 ("mm: shmem: avoid atomic operation during
      shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
      by __SetPageSwapBacked, pointing out that the newly allocated page is
      not yet visible to other users (except speculative get_page_unless_zero-
      ers, who may not update page flags before their further checks).
      
      That was part of a series in which Mel was focused on tmpfs profiles:
      but almost all SetPageSwapBacked uses can be so optimized, with the same
      justification.
      
      Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
      it's not an error to free a page with PG_swapbacked set.
      
      Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
      doing it differently in different places; but that's for tidiness - if
      the ordering actually mattered, we should not be using the __variants.
      
      There's probably scope for further __SetPageFlags in other places, but
      SwapBacked is the one I'm interested in at the moment.
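      
      A minimal sketch of the convention on a freshly allocated page (the
      allocation context is illustrative):
      
          struct page *page = alloc_page(gfp);
      
          if (!page)
                  return NULL;
          /*
           * The page is not yet visible to other users, so the non-atomic
           * flag setters are safe; the __SetPageLocked, __SetPageSwapBacked
           * order is kept purely as a convention, since ordering cannot
           * matter wherever the __variants are legitimate.
           */
          __SetPageLocked(page);
          __SetPageSwapBacked(page);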
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 03 May 2016, 1 commit
    • parallel lookups machinery, part 2 · 84e710da
      Committed by Al Viro
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock: it would be less painful in terms
      of contention, but it would be rather inconvenient for __d_add() to
      grab.  We could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and updating logics; the readers will be
      added in the next commit.
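      
      A hedged sketch of the even/odd update protocol described above; the
      i_dir_seq field name follows the in-tree implementation, but treat the
      code as a sketch rather than the exact patch:
      
          /* Take the counter from even to odd; spin while an update is in flight. */
          static unsigned start_dir_add(struct inode *dir)
          {
                  for (;;) {
                          unsigned n = dir->i_dir_seq;
      
                          if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
                                  return n;
                          cpu_relax();
                  }
          }
      
          /* Publish the update: release-store back to the next even value. */
          static inline void end_dir_add(struct inode *dir, unsigned n)
          {
                  smp_store_release(&dir->i_dir_seq, n + 2);
          }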
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 11 April 2016, 1 commit
  21. 05 April 2016, 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to
      implement the page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE, and it is a constant source of confusion whether the
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
      much breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header
      files; I've called spatch on them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 18 March 2016, 3 commits
  23. 16 March 2016, 1 commit
    • mm: migrate: do not touch page->mem_cgroup of live pages · 6a93ca8f
      Committed by Johannes Weiner
      Changing a page's memcg association complicates dealing with the page,
      so we want to limit this as much as possible.  Page migration e.g.  does
      not have to do that.  Just like page cache replacement, it can forcibly
      charge a replacement page, and then uncharge the old page when it gets
      freed.  Temporarily overcharging the cgroup by a single page is not an
      issue in practice, and charging is so cheap nowadays that this is much
      preferable to the headache of messing with live pages.
      
      The only place that still changes the page->mem_cgroup binding of live
      pages is when pages move along with a task to another cgroup.  But that
      path isolates the page from the LRU, takes the page lock, and the move
      lock (lock_page_memcg()).  That means page->mem_cgroup is always stable
      in callers that have the page isolated from the LRU or locked.  Lighter
      unlocked paths, like writeback accounting, can use lock_page_memcg().
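      
      A small sketch of that locking rule, assuming the lock_page_memcg()
      interface this series settles on; the statistic update itself is only
      illustrative:
      
          struct mem_cgroup *memcg;
      
          lock_page_memcg(page);        /* blocks cgroup moves from rebinding the page */
          memcg = page->mem_cgroup;     /* stable until unlock_page_memcg() */
          if (memcg) {
                  /* ... update a per-memcg dirty/writeback statistic ... */
          }
          unlock_page_memcg(page);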
      
      [akpm@linux-foundation.org: fix build]
      [vdavydov@virtuozzo.com: fix lockdep splat]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 23 January 2016, 1 commit