1. 20 Jun, 2017 2 commits
  2. 09 May, 2017 2 commits
  3. 04 May, 2017 2 commits
    • A
      fs: fix data invalidation in the cleancache during direct IO · 55635ba7
      Andrey Ryabinin committed
      Patch series "Properly invalidate data in the cleancache", v2.
      
      We've noticed that after direct IO write, buffered read sometimes gets
      stale data which is coming from the cleancache.  The reason for this is
      that some direct write hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero, so we may not invalidate
      data in the cleancache.
      
      Another odd thing is that we check only for ->nrpages and don't check
      for ->nrexceptional, but invalidate_inode_pages2[_range] also
      invalidates exceptional entries as well.  So we invalidate exceptional
      entries only if ->nrpages != 0? This doesn't feel right.
      
       - Patch 1 fixes direct IO writes by removing ->nrpages check.
       - Patch 2 fixes similar case in invalidate_bdev().
           Note: I only fixed conditional cleancache_invalidate_inode() here.
             Do we also need to add a ->nrexceptional check to invalidate_bdev()?
      
       - Patches 3-4: some optimizations.
      
      This patch (of 4):
      
      Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero.  This can't be right,
      because invalidate_inode_pages2[_range]() also invalidates data in the
      cleancache via cleancache_invalidate_inode() call.  So if page cache is
      empty but there is some data in the cleancache, buffered read after
      direct IO write would get stale data from the cleancache.
      
      Also it doesn't feel right to check only for ->nrpages because
      invalidate_inode_pages2[_range] invalidates exceptional entries as well.
      
      Fix this by calling invalidate_inode_pages2[_range]() regardless of
      nrpages state.
      
      Note: nfs, cifs and 9p don't need a similar fix because they never call
      cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
      they are not affected by this bug.
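
      The stale-read scenario can be modelled in user space. The sketch below
      is not kernel code: `struct toy_mapping`, its `cleancache_has_data`
      field and all `toy_*` helpers are hypothetical stand-ins, used only to
      show why skipping invalidation when nrpages is zero leaves old data
      behind in the cleancache.

```c
#include <stdbool.h>

/* Toy stand-in for struct address_space (hypothetical, not the kernel type). */
struct toy_mapping {
    unsigned long nrpages;     /* pages currently held in the page cache */
    bool cleancache_has_data;  /* hypothetical: cleancache holds an old copy */
};

/* Models invalidate_inode_pages2(): drops page cache and cleancache data. */
static void toy_invalidate(struct toy_mapping *m)
{
    m->nrpages = 0;
    m->cleancache_has_data = false;
}

/* Old behaviour: invalidation is skipped when the page cache is empty. */
static void toy_dio_write_old(struct toy_mapping *m)
{
    if (m->nrpages)
        toy_invalidate(m);
}

/* Fixed behaviour: invalidate regardless of the nrpages state. */
static void toy_dio_write_fixed(struct toy_mapping *m)
{
    toy_invalidate(m);
}

/* A later buffered read is stale if the cleancache still has the old copy. */
static bool toy_buffered_read_is_stale(const struct toy_mapping *m)
{
    return m->cleancache_has_data;
}
```

      With an empty page cache and data in the cleancache, the old path
      leaves the mapping stale while the fixed path does not.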
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: tighten up the fault path a little · 9ab2594f
      Matthew Wilcox committed
      The round_up() macro generates a couple of unnecessary instructions
      in this usage:
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 83 e8 01             sub    $0x1,%rax
          48d5:       48 0d ff 0f 00 00       or     $0xfff,%rax
          48db:       48 83 c0 01             add    $0x1,%rax
          48df:       48 c1 f8 0c             sar    $0xc,%rax
          48e3:       48 39 c3                cmp    %rax,%rbx
          48e6:       72 2e                   jb     4916 <filemap_fault+0x96>
      
      If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
      then GCC can see through it and remove the mask (because that would be
      dead code given the subsequent shift):
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 05 ff 0f 00 00       add    $0xfff,%rax
          48d7:       48 c1 e8 0c             shr    $0xc,%rax
          48db:       48 39 c3                cmp    %rax,%rbx
          48de:       72 2e                   jb     490e <filemap_fault+0x8e>
      
      But that's problematic because we'd evaluate 'y' twice.  Converting
      round_up into an inline function prevents it from being used in other
      definitions.  The easiest thing to do is just change these three usages
      of round_up to use DIV_ROUND_UP.  Also add an unlikely() because GCC's
      heuristic is wrong in this case.
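
      As a user-space illustration of why the swap is safe, the sketch below
      compares the two ways of turning a byte size into a page count. It
      assumes a 4 KiB page (shift of 12) and unsigned 64-bit sizes; the
      `toy_*` helpers are hypothetical re-implementations, not the kernel
      macros themselves.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/* Same shape as round_up(): ((x - 1) | (y - 1)) + 1, y a power of two. */
static uint64_t toy_round_up(uint64_t x, uint64_t y)
{
    return ((x - 1) | (y - 1)) + 1;
}

/* Same shape as DIV_ROUND_UP(): (n + d - 1) / d. */
static uint64_t toy_div_round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) / d;
}

/* Page count via round up, then shift: sub, or, add, shift. */
static uint64_t toy_pages_via_round_up(uint64_t size)
{
    return toy_round_up(size, TOY_PAGE_SIZE) >> TOY_PAGE_SHIFT;
}

/* Page count via DIV_ROUND_UP: a single add and a shift after folding. */
static uint64_t toy_pages_via_div_round_up(uint64_t size)
{
    return toy_div_round_up(size, TOY_PAGE_SIZE);
}
```

      Both helpers return the same page count for file sizes in the normal
      range; only the generated instruction sequence differs.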
      
      Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 Apr, 2017 2 commits
  5. 03 Apr, 2017 1 commit
    • M
      kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      mchehab@s-opensource.com committed
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
  6. 02 Mar, 2017 1 commit
  7. 25 Feb, 2017 2 commits
  8. 23 Feb, 2017 2 commits
  9. 04 Feb, 2017 1 commit
  10. 11 Jan, 2017 1 commit
    • R
      dax: fix deadlock with DAX 4k holes · 965d004a
      Ross Zwisler committed
      Currently in DAX if we have three read faults on the same hole address we
      can end up with the following:
      
      Thread 0		Thread 1		Thread 2
      --------		--------		--------
      dax_iomap_fault
       grab_mapping_entry
        lock_slot
         <locks empty DAX entry>
      
        			dax_iomap_fault
      			 grab_mapping_entry
      			  get_unlocked_mapping_entry
      			   <sleeps on empty DAX entry>
      
      						dax_iomap_fault
      						 grab_mapping_entry
      						  get_unlocked_mapping_entry
      						   <sleeps on empty DAX entry>
        dax_load_hole
         find_or_create_page
         ...
          page_cache_tree_insert
           dax_wake_mapping_entry_waiter
            <wakes one sleeper>
           __radix_tree_replace
            <swaps empty DAX entry with 4k zero page>
      
      			<wakes>
      			get_page
      			lock_page
      			...
      			put_locked_mapping_entry
      			unlock_page
      			put_page
      
      						<sleeps forever on the DAX
      						 wait queue>
      
      The crux of the problem is that once we insert a 4k zero page, all
      locking from then on is done in terms of that 4k zero page and any
      additional threads sleeping on the empty DAX entry will never be woken.
      
      Fix this by waking all sleepers when we replace the DAX radix tree entry
      with a 4k zero page.  This will allow all sleeping threads to
      successfully transition from locking based on the DAX empty entry to
      locking on the 4k zero page.
      
      With the test case reported by Xiong this happens very regularly in my
      test setup, with some runs resulting in 9+ threads in this deadlocked
      state.  With this fix I've been able to run that same test dozens of
      times in a loop without issue.
      
      Fixes: ac401cc7 ("dax: New fault locking")
      Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Xiong Zhou <xzhou@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 30 Dec, 2016 2 commits
    • O
      mm/filemap: fix parameters to test_bit() · 98473f9f
      Olof Johansson committed
       mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
        mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
          return test_bit(PG_waiters);
               ^~~~~~~~
      
      Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()')
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Brown-paper-bag-by: Linus Torvalds <dummy@duh.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Linus Torvalds committed
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
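
      The semantics of the primitive can be sketched in portable user-space
      C. This is an illustration under stated assumptions, not the kernel
      implementation: it uses C11 `atomic_fetch_and_explicit()` as a stand-in
      for the architecture-specific operation, and the `TOY_*` names and the
      lock bit position are hypothetical. Only the waiters bit at position 7
      comes from the text above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_PG_LOCKED  0  /* assumption: position of the lock bit */
#define TOY_PG_WAITERS 7  /* sign bit of the low byte, as described above */

/*
 * Clear bit 'nr' with release ordering and report whether bit 7 of the
 * same word is set. 'nr' must not be 7, so bit 7 of the old value equals
 * bit 7 of the new value and a single atomic operation is enough.
 */
static bool toy_clear_bit_unlock_is_negative_byte(unsigned int nr,
                                                  atomic_ulong *word)
{
    unsigned long old = atomic_fetch_and_explicit(word, ~(1UL << nr),
                                                  memory_order_release);

    return (old & (1UL << TOY_PG_WAITERS)) != 0;
}
```

      A caller modelled on unlock_page() would only look at the wait queue
      when the helper returns true, so the word is touched exactly once in
      the uncontended case.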
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, just the unlock_page() costs were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 26 Dec, 2016 1 commit
    • N
      mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Nicholas Piggin committed
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 15 Dec, 2016 2 commits
  14. 13 Dec, 2016 4 commits
  15. 12 Nov, 2016 1 commit
    • E
      mm/filemap: don't allow partially uptodate page for pipes · 60da81ea
      Eryu Guan committed
      Starting from 4.9-rc1 kernel, I started noticing some test failures of
      sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
      testing on sub-page block size filesystems (tested both XFS and ext4),
      these syscalls start to return EIO in the tests.  e.g.
      
        sendfile02    1  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
        sendfile02    2  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
        sendfile02    3  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
        sendfile02    4  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
      
      This is because in sub-page block size cases we don't need the
      whole page to be uptodate; it is enough that the part we care about is
      uptodate (if the fs has ->is_partially_uptodate defined).
      
      But page_cache_pipe_buf_confirm() doesn't have the ability to check the
      partially-uptodate case, it needs the whole page to be uptodate.  So it
      returns EIO in this case.
      
      This is a regression introduced by commit 82c156f8 ("switch
      generic_file_splice_read() to use of ->read_iter()").  Prior to the
      change, generic_file_splice_read() doesn't allow partially-uptodate page
      either, so it worked fine.
      
      Fix it by skipping the partially-uptodate check if we're working on a
      pipe in do_generic_file_read(), so we read the whole page from disk as
      long as the page is not uptodate.
      
      I think the other way to fix it is to add the ability to check & allow
      partially-uptodate page to page_cache_pipe_buf_confirm(), but that is
      much harder to do and seems to gain little.
      
      Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.com
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 08 Nov, 2016 2 commits
    • R
      dax: add struct iomap based DAX PMD support · 642261ac
      Ross Zwisler committed
      DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
      locking.  This patch allows DAX PMDs to participate in the DAX radix tree
      based locking scheme so that they can be re-enabled using the new struct
      iomap based fault handlers.
      
      There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
      mappings that have an associated block allocation, and 4k DAX empty
      entries.  The empty entries exist to provide locking for the duration of a
      given page fault.
      
      This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
      entries, PMD DAX entries that have associated block allocations, and 2 MiB
      DAX empty entries.
      
      Unlike the 4k case where we insert a struct page* into the radix tree for
      4k zero pages, for HZP we insert a DAX exceptional entry with the new
      RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
      every 2MiB hole mapping, and it doesn't make sense to have that same struct
      page* with multiple entries in multiple trees.  This would cause contention
      on the single page lock for the one Huge Zero Page, and it would break the
      page->index and page->mapping associations that are assumed to be valid in
      many other places in the kernel.
      
      One difficult use case is when one thread is trying to use 4k entries in
      radix tree for a given offset, and another thread is using 2 MiB entries
      for that same offset.  The current code handles this by making the 2 MiB
      user fall back to 4k entries for most cases.  This was done because it is
      the simplest solution, and because the use of 2MiB pages is already
      opportunistic.
      
      If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
      we run into the problem of how we lock out 4k page faults for the entire
      2MiB range while we clean out the radix tree so we can insert the 2MiB
      entry.  We can solve this problem if we need to, but I think that the cases
      where both 2MiB entries and 4K entries are being used for the same range
      will be rare enough and the gain small enough that it probably won't be
      worth the complexity.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • R
      dax: coordinate locking for offsets in PMD range · 63e95b5c
      Ross Zwisler committed
      DAX radix tree locking currently locks entries based on the unique
      combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
      This works for PTEs, but as we move to PMDs we will need to have all the
      offsets within the range covered by the PMD to map to the same bit lock.
      To accomplish this, for ranges covered by a PMD entry we will instead lock
      based on the page offset of the beginning of the PMD entry.  The 'mapping'
      pointer is still used in the same way.
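
      The index calculation described above can be sketched as follows. The
      constants assume 4 KiB pages and 2 MiB PMDs (512 pages per PMD), and
      `toy_lock_index()` is a hypothetical helper written for illustration,
      not the function this patch adds.

```c
#define DAX_TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */
#define DAX_TOY_PMD_SHIFT  21  /* assumption: 2 MiB PMD entries */
#define DAX_TOY_PAGES_PER_PMD \
    (1UL << (DAX_TOY_PMD_SHIFT - DAX_TOY_PAGE_SHIFT))  /* 512 */

/*
 * Return the page offset used to pick the lock. For a PMD entry every
 * offset inside the 2 MiB range collapses to the offset of the first
 * page in that range; for a PTE entry the offset is used unchanged.
 */
static unsigned long toy_lock_index(unsigned long index, int is_pmd_entry)
{
    if (is_pmd_entry)
        index &= ~(DAX_TOY_PAGES_PER_PMD - 1);
    return index;
}
```

      With these constants, offsets 512 through 1023 all map to lock index
      512 when covered by a PMD entry, so every fault in that range contends
      on the same lock.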
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  17. 07 Nov, 2016 1 commit
    • E
      mm/filemap: don't allow partially uptodate page for pipes · 6d6d36bc
      Eryu Guan committed
      Starting from 4.9-rc1 kernel, I started noticing some test failures
      of sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
      testing on sub-page block size filesystems (tested both XFS and
      ext4), these syscalls start to return EIO in the tests. e.g.
      
      sendfile02    1  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
      sendfile02    2  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
      sendfile02    3  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
      sendfile02    4  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
      
      This is because in sub-page block size cases we don't need the
      whole page to be uptodate; it is enough that the part we care about
      is uptodate (if the fs has ->is_partially_uptodate defined). But
      page_cache_pipe_buf_confirm() doesn't have the ability to check the
      partially-uptodate case, it needs the whole page to be uptodate. So
      it returns EIO in this case.
      
      This is a regression introduced by commit 82c156f8 ("switch
      generic_file_splice_read() to use of ->read_iter()"). Prior to the
      change, generic_file_splice_read() doesn't allow partially-uptodate
      page either, so it worked fine.
      
      Fix it by skipping the partially-uptodate check if we're working on
      a pipe in do_generic_file_read(), so we read the whole page from
      disk as long as the page is not uptodate.
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  18. 28 Oct, 2016 1 commit
    • L
      mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Linus Torvalds committed
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 11 Oct, 2016 1 commit
    • A
      fix ITER_PIPE interaction with direct_IO · c3a69024
      Al Viro committed
      by making sure we call iov_iter_advance() on original
      iov_iter even if direct_IO (done on its copy) has returned 0.
      It's a no-op for old iov_iter flavours and does the right thing
      (== truncation of the stuff we'd allocated, but not filled) in
      ITER_PIPE case.  Failures (e.g. -EIO) get caught and dealt with
      by cleanup in generic_file_read_iter().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 08 Oct, 2016 2 commits
    • W
      vfs,mm: fix a dead loop in truncate_inode_pages_range() · c2a9737f
      Wei Fang committed
      We triggered a dead loop in truncate_inode_pages_range() on 32-bit
      architectures with the test case below:
      
      	...
      	fd = open();
      	write(fd, buf, 4096);
      	preadv64(fd, &iovec, 1, 0xffffffff000);
      	ftruncate(fd, 0);
      	...
      
      ftruncate() then never returns.
      
      The filesystem used in this case is ubifs, but it can be triggered on
      many other filesystems.
      
      When preadv64() is called with offset=0xffffffff000, a page with
      index=0xffffffff will be added to the radix tree of ->mapping.  Then
      this page can be found in ->mapping with pagevec_lookup().  After that,
      truncate_inode_pages_range(), which is called in ftruncate(), will fall
      into an infinite loop:
      
       - find a page with index=0xffffffff, since index>=end, this page won't
         be truncated
      
       - index++, and index becomes 0
      
       - the page with index=0xffffffff will be found again
      
      The data type of index is unsigned long, so index won't overflow to 0 on
      64-bit architectures in this case, and the dead loop won't happen.
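
      The wraparound can be reproduced in user space. The sketch below
      models a 32-bit `unsigned long` with `uint32_t` and assumes 4 KiB
      pages (shift of 12); the `toy_*` helpers are hypothetical and exist
      only to show the arithmetic.

```c
#include <stdint.h>

/* Byte offset to page index, truncated to 32 bits as on a 32-bit arch. */
static uint32_t toy_offset_to_index_32(uint64_t pos)
{
    return (uint32_t)(pos >> 12);  /* assumption: 4 KiB pages */
}

/* index++ with a 32-bit index: 0xffffffff wraps around to 0. */
static uint32_t toy_next_index_32(uint32_t index)
{
    return index + 1;
}

/* index++ with a 64-bit index: no wrap at 0xffffffff. */
static uint64_t toy_next_index_64(uint64_t index)
{
    return index + 1;
}
```

      Offset 0xffffffff000 yields index 0xffffffff; incrementing that in 32
      bits gives 0, so the lookup starts over and finds the same page again.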
      
      Since truncate_inode_pages_range() runs while holding
      inode->i_rwsem, any operation that needs this lock will be blocked,
      and a hung task will be reported, e.g.:
      
        INFO: task truncate_test:3364 blocked for more than 120 seconds.
        ...
           call_rwsem_down_write_failed+0x17/0x30
           generic_file_write_iter+0x32/0x1c0
           ubifs_write_iter+0xcc/0x170
           __vfs_write+0xc4/0x120
           vfs_write+0xb2/0x1b0
           SyS_write+0x46/0xa0
      
      The page with index=0xffffffff added to ->mapping is useless.  Fix this
      by checking the read position before allocating pages.
      
      Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
      Signed-off-by: Wei Fang <fangwei1@huawei.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • B
      do_generic_file_read(): fail immediately if killed · c4b209a4
      Bart Van Assche committed
      If a fatal signal has been received, fail immediately instead of trying
      to read more data.
      
      If wait_on_page_locked_killable() was interrupted then this page is most
      likely not PageUptodate(), and in this case do_generic_file_read()
      will fail after lock_page_killable().
      
      See also commit ebded027 ("mm: filemap: avoid unnecessary calls to
      lock_page when waiting for IO to complete during a read")
      
      [oleg@redhat.com: changelog addition]
      Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 06 Oct, 2016 2 commits
    • J
      mm: filemap: fix mapping->nrpages double accounting in fuse · 3ddf40e8
      Johannes Weiner committed
      Commit 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker
      caused by replace_page_cache_page()") switched replace_page_cache() from
      raw radix tree operations to page_cache_tree_insert() but didn't take
      into account that the latter function, unlike the raw radix tree op,
      handles mapping->nrpages.  As a result, that counter is bumped for each
      page replacement rather than balanced out even.
      
      The mapping->nrpages counter is used to skip needless radix tree walks
      when invalidating, truncating, syncing inodes without pages, as well as
      statistics for userspace.  Since the error is positive, we'll do more
      page cache tree walks than necessary; we won't miss a necessary one.
      And we'll report more buffer pages to userspace than there are.  The
      error is limited to fuse inodes.
      
      Fixes: 22f2ac51 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: filemap: don't plant shadow entries without radix tree node · d3798ae8
      Johannes Weiner committed
      When the underflow checks were added to workingset_node_shadow_dec(),
      they triggered immediately:
      
        kernel BUG at ./include/linux/swap.h:276!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
         soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
        CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60b #1
        Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
        task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
        RIP: page_cache_tree_insert+0xf1/0x100
        Call Trace:
          __add_to_page_cache_locked+0x12e/0x270
          add_to_page_cache_lru+0x4e/0xe0
          mpage_readpages+0x112/0x1d0
          blkdev_readpages+0x1d/0x20
          __do_page_cache_readahead+0x1ad/0x290
          force_page_cache_readahead+0xaa/0x100
          page_cache_sync_readahead+0x3f/0x50
          generic_file_read_iter+0x5af/0x740
          blkdev_read_iter+0x35/0x40
          __vfs_read+0xe1/0x130
          vfs_read+0x96/0x130
          SyS_read+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x13/0x8f
        Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
        RIP  page_cache_tree_insert+0xf1/0x100
      
      This is a long-standing bug in the way shadow entries are accounted in
      the radix tree nodes. The shrinker needs to know when radix tree nodes
      contain only shadow entries, no pages, so node->count is split in half
      to count shadows in the upper bits and pages in the lower bits.
      
      Unfortunately, the radix tree implementation doesn't know of this and
      assumes all entries are in node->count. When there is a shadow entry
      directly in root->rnode and the tree is later extended, the radix tree
      implementation will copy that entry into the new node and bump its
      node->count, i.e. increase the page count bits. Once the shadow gets
      removed and we subtract from the upper counter, node->count underflows
      and triggers the warning. Afterwards, without node->count reaching 0
      again, the radix tree node is leaked.
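
The accounting problem described above can be reproduced in a small userspace model. This is only a sketch, not the kernel code: the `model_*` names are hypothetical, and the shift value of 7 is an assumption chosen for illustration. It shows a counter split into shadow bits (upper) and page bits (lower), and what happens when generic tree code bumps the whole counter for an entry that is really a shadow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the names and the shift value are assumptions. */
#define MODEL_COUNT_SHIFT 7u
#define MODEL_COUNT_MASK  ((1u << MODEL_COUNT_SHIFT) - 1u)

struct model_node {
    unsigned int count;   /* upper bits: shadows, lower bits: pages */
};

static unsigned int model_pages(const struct model_node *n)
{
    return n->count & MODEL_COUNT_MASK;
}

static unsigned int model_shadows(const struct model_node *n)
{
    return n->count >> MODEL_COUNT_SHIFT;
}

static void model_shadow_inc(struct model_node *n)
{
    n->count += 1u << MODEL_COUNT_SHIFT;
}

/* Checked decrement: reports the underflow instead of performing it. */
static bool model_shadow_dec(struct model_node *n)
{
    if (model_shadows(n) == 0)
        return false;
    n->count -= 1u << MODEL_COUNT_SHIFT;
    return true;
}

/* Generic tree code knows nothing about the split: every entry is +1. */
static void model_generic_copy_entry(struct model_node *n)
{
    n->count++;
}

/* Shadow accounted by the split-aware helpers: the decrement succeeds. */
static bool model_accounted_roundtrip(void)
{
    struct model_node node = { 0 };

    model_shadow_inc(&node);
    return model_shadow_dec(&node);
}

/* Shadow copied from the root by generic code, then removed: underflow. */
static bool model_extend_then_remove_shadow(void)
{
    struct model_node node = { 0 };

    model_generic_copy_entry(&node);       /* lands in the page bits */
    assert(model_pages(&node) == 1 && model_shadows(&node) == 0);
    return model_shadow_dec(&node);        /* no shadow bit to subtract */
}
```

In this model the checked decrement returns `false` where an unchecked subtraction would wrap the counter; the page bits also never return to zero, which mirrors the leaked node described above.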
      
      Limit shadow entries to when we have actual radix tree nodes and can
      count them properly. That means we lose the ability to detect refaults
      from files that had only the first page faulted in at eviction time.
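
The rule in the paragraph above can be sketched as a single decision at eviction time. This is a simplified illustration under stated assumptions, not the patch itself: `model_node` and `model_eviction_entry()` are hypothetical names, and a NULL node stands for an entry that sits directly in the tree root.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a tree node that can account for shadows. */
struct model_node {
    unsigned int shadows;
};

/*
 * Decide what to leave in the slot when a page is evicted.
 * With no node to do the accounting, the shadow is dropped and the
 * slot is simply emptied; otherwise the shadow is stored and counted.
 */
static void *model_eviction_entry(struct model_node *node, void *shadow)
{
    if (node == NULL)
        return NULL;
    if (shadow != NULL)
        node->shadows++;
    return shadow;
}
```

The cost is the one the commit message names: a file whose only cached page lived directly in the root leaves no shadow behind, so a later refault of that page cannot be detected.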
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 03 Oct, 2016 1 commit
  23. 01 Oct, 2016 1 commit
    • J
      mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() · 22f2ac51
      Johannes Weiner committed on
      Antonio reports the following crash when using fuse under memory pressure:
      
        kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: all of them
        CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
        Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
        task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
        RIP: shadow_lru_isolate+0x181/0x190
        Call Trace:
          __list_lru_walk_one.isra.3+0x8f/0x130
          list_lru_walk_one+0x23/0x30
          scan_shadow_nodes+0x34/0x50
          shrink_slab.part.40+0x1ed/0x3d0
          shrink_zone+0x2ca/0x2e0
          kswapd+0x51e/0x990
          kthread+0xd8/0xf0
          ret_from_fork+0x3f/0x70
      
      which corresponds to the following sanity check in the shadow node
      tracking:
      
        BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
      
      The workingset code tracks radix tree nodes that exclusively contain
      shadow entries of evicted pages in them, and this (somewhat obscure)
      line checks whether there are real pages left that would interfere with
      reclaim of the radix tree node under memory pressure.
      
      While discussing how fuse might sneak pages into the radix tree
      past the workingset code, Miklos pointed to replace_page_cache_page(),
      and indeed there is a problem there: it properly accounts for the old
      page being removed - __delete_from_page_cache() does that - but then
      does a raw radix_tree_insert(), not accounting for the replacement
      page.  Eventually the page count bits in node->count underflow while
      leaving the node incorrectly linked to the shadow node LRU.
      
      To address this, make sure replace_page_cache_page() uses the tracked
      page insertion code, page_cache_tree_insert().  This fixes the page
      accounting and makes sure page-containing nodes are properly unlinked
      from the shadow node LRU again.
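
The difference between the raw and the tracked insertion can be shown with a toy model. It is a sketch under assumptions, not the kernel implementation: every `model_*` name is hypothetical, the node has only four slots, and the shadow node LRU is reduced to a boolean flag. The invariant checked at the end plays the role of the sanity check quoted earlier: a node linked to the shadow LRU must not contain pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 4

enum model_entry { ENTRY_EMPTY, ENTRY_PAGE, ENTRY_SHADOW };

/* Hypothetical model of a tree node plus its shadow-LRU linkage. */
struct model_node {
    enum model_entry slots[MODEL_SLOTS];
    bool on_shadow_lru;
};

static size_t model_count(const struct model_node *n, enum model_entry kind)
{
    size_t i, total = 0;

    for (i = 0; i < MODEL_SLOTS; i++)
        if (n->slots[i] == kind)
            total++;
    return total;
}

/* Tracked removal: a node left with only shadows goes onto the LRU. */
static void model_delete_page(struct model_node *n, size_t idx)
{
    n->slots[idx] = ENTRY_EMPTY;
    if (model_count(n, ENTRY_PAGE) == 0 && model_count(n, ENTRY_SHADOW) > 0)
        n->on_shadow_lru = true;
}

/* Raw insertion: stores the page, does no workingset bookkeeping. */
static void model_raw_insert(struct model_node *n, size_t idx)
{
    n->slots[idx] = ENTRY_PAGE;
}

/* Tracked insertion: stores the page and unlinks the node from the LRU. */
static void model_tracked_insert(struct model_node *n, size_t idx)
{
    n->slots[idx] = ENTRY_PAGE;
    n->on_shadow_lru = false;
}

/* Invariant: a node on the shadow LRU holds no pages. */
static bool model_lru_invariant_holds(const struct model_node *n)
{
    return !n->on_shadow_lru || model_count(n, ENTRY_PAGE) == 0;
}

/* Replace the page at idx; returns whether the invariant still holds. */
static bool model_replace(struct model_node *n, size_t idx, bool tracked)
{
    model_delete_page(n, idx);
    if (tracked)
        model_tracked_insert(n, idx);
    else
        model_raw_insert(n, idx);
    return model_lru_invariant_holds(n);
}
```

Starting from a node that holds one page and one shadow, replacing the page with `tracked` set to `false` leaves a page-containing node on the LRU, so the invariant fails; with `tracked` set to `true` it holds.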
      
      Also, make the sanity checks a bit less obscure by using the helpers for
      checking the number of pages and shadows in a radix tree node.
      
      Fixes: 449dd698 ("mm: keep page cache radix tree nodes in check")
      Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
      Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
      Cc: <stable@vger.kernel.org>	[3.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 08 Aug, 2016 1 commit
  25. 05 Aug, 2016 1 commit
  26. 29 Jul, 2016 1 commit