1. 07 Sep 2017, 1 commit
  2. 29 Aug 2017, 1 commit
    • page waitqueue: always add new entries at the end · 9c3a815f
      Committed by Linus Torvalds
      Commit 3510ca20 ("Minor page waitqueue cleanups") made the page wait
      queue code always add new waiters to the back of the queue, which helps
      upcoming patches to batch the wakeups for some horrid loads where the
      wait queues grow to thousands of entries.
      
      However, I forgot about the nasty add_page_wait_queue() special case
      code that is only used by the cachefiles code.  That one still continued
      to add the new wait queue entries at the beginning of the list.
      
      Fix it, because any sane batched wakeup will require that we don't
      suddenly start getting new entries at the beginning of the list that we
      already handled in a previous batch.
      
      [ The current code always does the whole list while holding the lock, so
        wait queue ordering doesn't matter for correctness, but even then it's
        better to add later entries at the end from a fairness standpoint ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c3a815f
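      A minimal sketch of the fix in 9c3a815f, assuming the post-rename wait-queue API
      (wait_queue_entry_t, __add_wait_queue_entry_tail()); the upstream body may differ
      in detail:

      	#include <linux/pagemap.h>
      	#include <linux/page-flags.h>
      	#include <linux/wait.h>

      	/*
      	 * Sketch: the cachefiles-only helper now queues new waiters at the
      	 * tail, so a batched wakeup never finds fresh entries in front of
      	 * ones it has already handled.
      	 */
      	void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter)
      	{
      		wait_queue_head_t *q = page_waitqueue(page);
      		unsigned long flags;

      		spin_lock_irqsave(&q->lock, flags);
      		__add_wait_queue_entry_tail(q, waiter);	/* was __add_wait_queue() */
      		SetPageWaiters(page);
      		spin_unlock_irqrestore(&q->lock, flags);
      	}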
  3. 28 Aug 2017, 2 commits
    • Avoid page waitqueue race leaving possible page locker waiting · a8b169af
      Committed by Linus Torvalds
      The "lock_page_killable()" function waits for exclusive access to the
      page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
      set.
      
      That means that if it gets woken up, other waiters may have been
      skipped.
      
      That, in turn, means that if it sees the page being unlocked, it *must*
      take that lock and return success, even if a lethal signal is also
      pending.
      
      So instead of checking for lethal signals first, we need to check for
      them after we've checked the actual bit that we were waiting for.  Even
      if that might then delay the killing of the process.
      
      This matches the order of the old "wait_on_bit_lock()" infrastructure
      that the page locking used to use (and is still used in a few other
      areas).
      
      Note that if we still return an error after having unsuccessfully tried
      to acquire the page lock, that is ok: that means that some other thread
      was able to get ahead of us and lock the page, and when that other
      thread then unlocks the page, the wakeup event will be repeated.  So any
      other pending waiters will now get properly woken up.
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8b169af
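      A rough sketch (not verbatim) of the loop ordering in wait_on_page_bit_common()
      after a8b169af; waitqueue setup and finish_wait() are omitted and the local names
      are approximate:

      	/* Fragment: wait loop only, surrounding setup omitted. */
      	for (;;) {
      		set_current_state(state);

      		if (likely(test_bit(bit_nr, &page->flags)))
      			io_schedule();

      		/*
      		 * Test the bit we were woken for first: an exclusive (lock)
      		 * waiter that can take the page must report success, because
      		 * other waiters may have been skipped on its behalf.
      		 */
      		if (lock) {
      			if (!test_and_set_bit_lock(bit_nr, &page->flags))
      				break;
      		} else if (!test_bit(bit_nr, &page->flags))
      			break;

      		/* Only then react to a pending fatal signal. */
      		if (unlikely(signal_pending_state(state, current))) {
      			ret = -EINTR;
      			break;
      		}
      	}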
    • Minor page waitqueue cleanups · 3510ca20
      Committed by Linus Torvalds
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the first two members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3510ca20
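      An illustrative copy of the layout constraint noted in 3510ca20; the member names
      are written from memory, so the upstream definitions may differ slightly:

      	/* From the wait_bit infrastructure (illustrative copy). */
      	struct wait_bit_key {
      		void		*flags;
      		int		bit_nr;
      		unsigned long	timeout;
      	};

      	/*
      	 * Wake-up key for page waitqueues.  The cachefiles code can see
      	 * either key type, so the first two members must line up; they do,
      	 * because page->flags is the first member of struct page, so a
      	 * 'struct page *' and an 'unsigned long *flags' carry the same value.
      	 */
      	struct wait_page_key {
      		struct page	*page;		/* overlays wait_bit_key.flags  */
      		int		bit_nr;		/* overlays wait_bit_key.bit_nr */
      		int		page_match;	/* page-waitqueue specific      */
      	};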
  4. 11 Jul 2017, 1 commit
  5. 07 Jul 2017, 1 commit
  6. 06 Jul 2017, 4 commits
    • fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Committed by Jeff Layton
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      value.
      
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      5660e13d
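      A hedged sketch of how the two errseq_t cursors are meant to interact at fsync
      time; filemap_report_wb_err() is a hypothetical helper name, and the wb_err /
      f_wb_err fields and errseq_* calls reflect my reading of this series:

      	#include <linux/errseq.h>
      	#include <linux/fs.h>

      	/* The writeback failure path records the error once in the mapping,
      	 * e.g. errseq_set(&mapping->wb_err, -EIO). */

      	/*
      	 * Hypothetical fsync-side helper: return any writeback error seen
      	 * since this struct file last checked, and advance its private
      	 * cursor so the same error is reported only once per fd.
      	 */
      	static int filemap_report_wb_err(struct file *file)
      	{
      		struct address_space *mapping = file->f_mapping;

      		return errseq_check_and_advance(&mapping->wb_err, &file->f_wb_err);
      	}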
    • mm: don't TestClearPageError in __filemap_fdatawait_range · 5e8fcc1a
      Committed by Jeff Layton
      The -EIO returned here can end up overriding whatever error is marked in
      the address space, and be returned at fsync time, even when there is a
      more appropriate error stored in the mapping.
      
      Read errors are also sometimes tracked on a per-page level using
      PG_error. Suppose we have a read error on a page, and then that page is
      subsequently dirtied by overwriting the whole page. Writeback doesn't
      clear PG_error, so we can then end up successfully writing back that
      page and still return -EIO on fsync.
      
      Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
      return from that is silently discarded. Any subsystem that is relying on
      PG_error to report errors during fsync can easily lose writeback errors
      due to this. All you need is a stray sync() call to wait for writeback
      to complete and you've lost the error.
      
      Since the handling of the PG_error flag is somewhat inconsistent across
      subsystems, let's just rely on marking the address space when there are
      writeback errors. Change the TestClearPageError call to ClearPageError,
      and make __filemap_fdatawait_range a void return function.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      5e8fcc1a
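      A before/after sketch of the page walk in __filemap_fdatawait_range() as
      described in 5e8fcc1a (surrounding context approximate); the function itself
      now returns void, so callers rely on the error stored in the mapping:

      	/* Before: a stale PG_error overrode the mapping's own error. */
      	if (TestClearPageError(page))
      		ret = -EIO;

      	/* After: just drop the flag; fsync reports via the mapping. */
      	ClearPageError(page);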
    • mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails · cbeaf951
      Committed by Jeff Layton
      filemap_write_and_wait{_range} will return an error if writeback
      initiation fails, but won't clear errors in the address_space. This is
      particularly problematic on DAX, as filemap_fdatawrite* is
      effectively synchronous there. Ensure that we clear the AS_EIO/AS_ENOSPC
      flags when filemap_fdatawrite* returns an error.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      cbeaf951
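      A sketch of the intent in filemap_write_and_wait{_range}(), assuming
      filemap_check_errors() is the existing helper that tests and clears
      AS_EIO/AS_ENOSPC (details approximate):

      	err = filemap_fdatawrite_range(mapping, lstart, lend);
      	if (err != -EIO) {
      		int err2 = filemap_fdatawait_range(mapping, lstart, lend);

      		if (!err)
      			err = err2;
      	} else {
      		/*
      		 * Writeback initiation failed hard; don't wait, but make sure
      		 * AS_EIO/AS_ENOSPC are not left behind to be reported (and
      		 * cleared) by some unrelated later caller.
      		 */
      		filemap_check_errors(mapping);
      	}
      	return err;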
    • jbd2: don't clear and reset errors after waiting on writeback · 76341cab
      Committed by Jeff Layton
      Resetting this flag is almost certainly racy, and will be problematic
      with some coming changes.
      
      Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
      Have jbd2 call it instead of filemap_fdatawait and don't attempt to
      re-set the error flag if it fails.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      76341cab
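      A before/after sketch of the jbd2 call site (journal_finish_inode_data_buffers()
      as I recall it); treat the surrounding details as illustrative:

      	/* Before: AS_EIO was cleared by the wait, then racily re-set. */
      	err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
      	if (err)
      		set_bit(AS_EIO, &jinode->i_vfs_inode->i_mapping->flags);

      	/* After: leave the flags alone and just collect the return value. */
      	err = filemap_fdatawait_keep_errors(jinode->i_vfs_inode->i_mapping);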
  7. 20 Jun 2017, 4 commits
    • fs: return if direct I/O will trigger writeback · 6be96d3a
      Committed by Goldwyn Rodrigues
      Find out if the I/O will trigger a wait due to writeback. If yes,
      return -EAGAIN.
      
      Return -EINVAL for buffered AIO: there are multiple causes of
      delay such as page locks, dirty throttling logic, page loading
      from disk etc. which cannot be taken care of.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6be96d3a
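      A sketch of the two checks described in 6be96d3a, assuming they live in the
      generic write path (generic_file_write_iter() / generic_file_direct_write());
      the exact placement and bounds may differ:

      	/* Buffered AIO can block in too many places: refuse IOCB_NOWAIT. */
      	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
      		return -EINVAL;

      	/*
      	 * Direct I/O: if flushing existing page cache would make us wait,
      	 * tell the caller to retry instead of blocking.
      	 */
      	if (iocb->ki_flags & IOCB_NOWAIT) {
      		if (filemap_range_has_page(mapping, pos, pos + count - 1))
      			return -EAGAIN;
      	} else {
      		written = filemap_write_and_wait_range(mapping, pos,
      						       pos + count - 1);
      		if (written)
      			goto out;
      	}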
    • fs: Introduce filemap_range_has_page() · 7fc9e472
      Committed by Goldwyn Rodrigues
      filemap_range_has_page() returns true if the file's mapping has
      a page within the range mentioned. This function will be used
      to check if a write() call will cause a writeback of previous
      writes.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7fc9e472
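      A hedged sketch of what filemap_range_has_page() boils down to; the
      pagevec-based lookup is written from memory of the 4.12-era API, so treat
      those calls as approximate:

      	#include <linux/pagemap.h>
      	#include <linux/pagevec.h>

      	/* Is any page cached in the mapping within [start_byte, end_byte]? */
      	bool filemap_range_has_page(struct address_space *mapping,
      				    loff_t start_byte, loff_t end_byte)
      	{
      		pgoff_t index = start_byte >> PAGE_SHIFT;
      		pgoff_t end = end_byte >> PAGE_SHIFT;
      		struct pagevec pvec;
      		bool ret;

      		if (end_byte < start_byte || mapping->nrpages == 0)
      			return false;

      		pagevec_init(&pvec, 0);
      		if (!pagevec_lookup(&pvec, mapping, index, 1))
      			return false;
      		ret = (pvec.pages[0]->index <= end);
      		pagevec_release(&pvec);
      		return ret;
      	}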
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Committed by Ingo Molnar
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2055da97
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Committed by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac6424b9
  8. 09 May 2017, 2 commits
  9. 04 May 2017, 2 commits
    • fs: fix data invalidation in the cleancache during direct IO · 55635ba7
      Committed by Andrey Ryabinin
      Patch series "Properly invalidate data in the cleancache", v2.
      
      We've noticed that after direct IO write, buffered read sometimes gets
      stale data which is coming from the cleancache.  The reason for this is
      that some direct write hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero, so we may not invalidate
      data in the cleancache.
      
      Another odd thing is that we check only for ->nrpages and don't check
      for ->nrexceptional, but invalidate_inode_pages2[_range] also
      invalidates exceptional entries as well.  So we invalidate exceptional
      entries only if ->nrpages != 0? This doesn't feel right.
      
       - Patch 1 fixes direct IO writes by removing ->nrpages check.
       - Patch 2 fixes similar case in invalidate_bdev().
           Note: I only fixed conditional cleancache_invalidate_inode() here.
             Do we also need to add ->nrexceptional check in into invalidate_bdev()?
      
       - Patches 3-4: some optimizations.
      
      This patch (of 4):
      
      Some direct IO write fs hooks call invalidate_inode_pages2[_range]()
      conditionally iff mapping->nrpages is not zero.  This can't be right,
      because invalidate_inode_pages2[_range]() also invalidates data in the
      cleancache via cleancache_invalidate_inode() call.  So if page cache is
      empty but there is some data in the cleancache, buffered read after
      direct IO write would get stale data from the cleancache.
      
      Also it doesn't feel right to check only for ->nrpages because
      invalidate_inode_pages2[_range] invalidates exceptional entries as well.
      
      Fix this by calling invalidate_inode_pages2[_range]() regardless of
      nrpages state.
      
      Note: nfs, cifs and 9p don't need a similar fix because they never call
      cleancache_get_page() (neither directly nor via mpage_readpage[s]()), so
      they are not affected by this bug.
      
      Fixes: c515e1fd ("mm/fs: add hooks to support cleancache")
      Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexey Kuznetsov <kuznet@virtuozzo.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55635ba7
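      A before/after sketch of the pattern fixed in 55635ba7, as it appears in a
      typical direct-IO write hook (illustrative, not any particular filesystem):

      	/*
      	 * Before: skipped entirely when the page cache was empty, which
      	 * also skipped cleancache_invalidate_inode() and left stale data
      	 * in the cleancache.
      	 */
      	if (mapping->nrpages) {
      		ret = invalidate_inode_pages2_range(mapping,
      				pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
      		if (ret)
      			goto out;
      	}

      	/* After: always invalidate; the helper copes with an empty cache. */
      	ret = invalidate_inode_pages2_range(mapping,
      			pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
      	if (ret)
      		goto out;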
    • mm: tighten up the fault path a little · 9ab2594f
      Committed by Matthew Wilcox
      The round_up() macro generates a couple of unnecessary instructions
      in this usage:
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 83 e8 01             sub    $0x1,%rax
          48d5:       48 0d ff 0f 00 00       or     $0xfff,%rax
          48db:       48 83 c0 01             add    $0x1,%rax
          48df:       48 c1 f8 0c             sar    $0xc,%rax
          48e3:       48 39 c3                cmp    %rax,%rbx
          48e6:       72 2e                   jb     4916 <filemap_fault+0x96>
      
      If we change round_up() to ((x) + __round_mask(x, y)) & ~__round_mask(x, y)
      then GCC can see through it and remove the mask (because that would be
      dead code given the subsequent shift):
      
          48cd:       49 8b 47 50             mov    0x50(%r15),%rax
          48d1:       48 05 ff 0f 00 00       add    $0xfff,%rax
          48d7:       48 c1 e8 0c             shr    $0xc,%rax
          48db:       48 39 c3                cmp    %rax,%rbx
          48de:       72 2e                   jb     490e <filemap_fault+0x8e>
      
      But that's problematic because we'd evaluate 'y' twice.  Converting
      round_up into an inline function prevents it from being used in other
      definitions.  The easiest thing to do is just change these three usages
      of round_up to use DIV_ROUND_UP.  Also add an unlikely() because GCC's
      heuristic is wrong in this case.
      
      Link: http://lkml.kernel.org/r/20170207192812.5281-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ab2594f
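      A sketch of the resulting bound check in filemap_fault() (variable names
      approximate):

      	pgoff_t max_off;

      	/* Was: round_up(i_size_read(inode), PAGE_SIZE) >> PAGE_SHIFT */
      	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
      	if (unlikely(vmf->pgoff >= max_off))
      		return VM_FAULT_SIGBUS;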
  10. 22 Apr 2017, 2 commits
  11. 03 Apr 2017, 1 commit
    • kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      Committed by mchehab@s-opensource.com
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      0e056eb5
  12. 02 Mar 2017, 1 commit
  13. 25 Feb 2017, 2 commits
  14. 23 Feb 2017, 2 commits
  15. 04 Feb 2017, 1 commit
  16. 11 Jan 2017, 1 commit
    • dax: fix deadlock with DAX 4k holes · 965d004a
      Committed by Ross Zwisler
      Currently in DAX if we have three read faults on the same hole address we
      can end up with the following:
      
      Thread 0		Thread 1		Thread 2
      --------		--------		--------
      dax_iomap_fault
       grab_mapping_entry
        lock_slot
         <locks empty DAX entry>
      
        			dax_iomap_fault
      			 grab_mapping_entry
      			  get_unlocked_mapping_entry
      			   <sleeps on empty DAX entry>
      
      						dax_iomap_fault
      						 grab_mapping_entry
      						  get_unlocked_mapping_entry
      						   <sleeps on empty DAX entry>
        dax_load_hole
         find_or_create_page
         ...
          page_cache_tree_insert
           dax_wake_mapping_entry_waiter
            <wakes one sleeper>
           __radix_tree_replace
            <swaps empty DAX entry with 4k zero page>
      
      			<wakes>
      			get_page
      			lock_page
      			...
      			put_locked_mapping_entry
      			unlock_page
      			put_page
      
      						<sleeps forever on the DAX
      						 wait queue>
      
      The crux of the problem is that once we insert a 4k zero page, all
      locking from then on is done in terms of that 4k zero page and any
      additional threads sleeping on the empty DAX entry will never be woken.
      
      Fix this by waking all sleepers when we replace the DAX radix tree entry
      with a 4k zero page.  This will allow all sleeping threads to
      successfully transition from locking based on the DAX empty entry to
      locking on the 4k zero page.
      
      With the test case reported by Xiong this happens very regularly in my
      test setup, with some runs resulting in 9+ threads in this deadlocked
      state.  With this fix I've been able to run that same test dozens of
      times in a loop without issue.
      
      Fixes: ac401cc7 ("dax: New fault locking")
      Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Xiong Zhou <xzhou@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      965d004a
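      A hedged sketch of the fix in page_cache_tree_insert(): when the DAX
      exceptional entry is swapped for the 4k zero page, wake every sleeper so they
      can re-key onto the page lock. The dax_wake_mapping_entry_waiter() signature
      shown is my recollection and may not match the tree exactly:

      	if (dax_mapping(mapping)) {
      		/*
      		 * 'p' is the old exceptional entry being replaced.
      		 * wake_all = true: a single wakeup is no longer enough,
      		 * since remaining sleepers would wait forever on an entry
      		 * that will never be unlocked again.
      		 */
      		dax_wake_mapping_entry_waiter(mapping, page->index, p, true);
      	}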
  17. 30 Dec 2016, 2 commits
    • mm/filemap: fix parameters to test_bit() · 98473f9f
      Committed by Olof Johansson
       mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
        mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
          return test_bit(PG_waiters);
               ^~~~~~~~
      
      Fixes: b91e1302 ('mm: optimize PageWaiters bit use for unlock_page()')
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Brown-paper-bag-by: Linus Torvalds <dummy@duh.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98473f9f
    • mm: optimize PageWaiters bit use for unlock_page() · b91e1302
      Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
      better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
      Before all these optimizations, just the unlock_page() costs were just
      over 3% of all CPU overhead on "make test".  After this, it's down to
      0.66%, so just a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
      Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b91e1302
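      A sketch of the generic fallback and its caller, matching the description
      above (treat the exact code as approximate):

      	#include <linux/bitops.h>
      	#include <linux/page-flags.h>

      	#ifndef clear_bit_unlock_is_negative_byte
      	/* Generic fallback: clear the lock bit, then test bit 7 (PG_waiters). */
      	static inline bool clear_bit_unlock_is_negative_byte(long nr,
      						volatile unsigned long *mem)
      	{
      		clear_bit_unlock(nr, mem);
      		return test_bit(PG_waiters, mem);
      	}
      	#endif

      	void unlock_page(struct page *page)
      	{
      		BUILD_BUG_ON(PG_waiters != 7);
      		page = compound_head(page);
      		VM_BUG_ON_PAGE(!PageLocked(page), page);
      		if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
      			wake_up_page_bit(page, PG_locked);
      	}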
  18. 26 Dec 2016, 1 commit
    • mm: add PageWaiters indicating tasks are waiting for a page bit · 62906027
      Committed by Nicholas Piggin
      Add a new page flag, PageWaiters, to indicate the page waitqueue has
      tasks waiting. This can be tested rather than testing waitqueue_active
      which requires another cacheline load.
      
      This bit is always set when the page has tasks on page_waitqueue(page),
      and is set and cleared under the waitqueue lock. It may be set when
      there are no tasks on the waitqueue, which will cause a harmless extra
      wakeup check that will clear the bit.
      
      The generic bit-waitqueue infrastructure is no longer used for pages.
      Instead, waitqueues are used directly with a custom key type. The
      generic code was not flexible enough to have PageWaiters manipulation
      under the waitqueue lock (which simplifies concurrency).
      
      This improves the performance of page lock intensive microbenchmarks by
      2-3%.
      
      Putting two bits in the same word opens the opportunity to remove the
      memory barrier between clearing the lock bit and testing the waiters
      bit, after some work on the arch primitives (e.g., ensuring memory
      operand widths match and cover both bits).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62906027
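      A sketch of the wake-up side test the new flag enables (approximate):

      	static void wake_up_page(struct page *page, int bit)
      	{
      		/*
      		 * PG_waiters is set and cleared under the hashed waitqueue
      		 * lock, so if it is clear the waitqueue is guaranteed empty
      		 * and we can skip loading that extra cacheline entirely.
      		 */
      		if (!PageWaiters(page))
      			return;
      		wake_up_page_bit(page, bit);
      	}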
  19. 15 Dec 2016, 2 commits
  20. 13 Dec 2016, 4 commits
  21. 12 Nov 2016, 1 commit
    • mm/filemap: don't allow partially uptodate page for pipes · 60da81ea
      Committed by Eryu Guan
      Starting from 4.9-rc1 kernel, I started noticing some test failures of
      sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
      testing on sub-page block size filesystems (tested both XFS and ext4),
      these syscalls start to return EIO in the tests.  e.g.
      
        sendfile02    1  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
        sendfile02    2  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
        sendfile02    3  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
        sendfile02    4  TFAIL  :  sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1
      
      This is because in sub-page block size cases we don't need the whole
      page to be uptodate; it is enough that the part we care about is
      uptodate (if the fs has ->is_partially_uptodate defined).
      
      But page_cache_pipe_buf_confirm() doesn't have the ability to check the
      partially-uptodate case, it needs the whole page to be uptodate.  So it
      returns EIO in this case.
      
      This is a regression introduced by commit 82c156f8 ("switch
      generic_file_splice_read() to use of ->read_iter()").  Prior to the
      change, generic_file_splice_read() doesn't allow partially-uptodate page
      either, so it worked fine.
      
      Fix it by skipping the partially-uptodate check if we're working on a
      pipe in do_generic_file_read(), so we read the whole page from disk as
      long as the page is not uptodate.
      
      I think the other way to fix it is to add the ability to check & allow
      partially-uptodate page to page_cache_pipe_buf_confirm(), but that is
      much harder to do and seems to gain little.
      
      Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.com
      Signed-off-by: Eryu Guan <guaneryu@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60da81ea
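      A sketch of the fix in do_generic_file_read(); the ITER_PIPE test reflects my
      reading of the 4.9-era iov_iter API, so treat the fragment as approximate:

      	if (!mapping->a_ops->is_partially_uptodate)
      		goto page_not_up_to_date;
      	/* pipes can't handle partially uptodate pages */
      	if (unlikely(iter->type & ITER_PIPE))
      		goto page_not_up_to_date;
      	if (!trylock_page(page))
      		goto page_not_up_to_date;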
  22. 08 Nov 2016, 2 commits
    • dax: add struct iomap based DAX PMD support · 642261ac
      Committed by Ross Zwisler
      DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
      locking.  This patch allows DAX PMDs to participate in the DAX radix tree
      based locking scheme so that they can be re-enabled using the new struct
      iomap based fault handlers.
      
      There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
      mappings that have an associated block allocation, and 4k DAX empty
      entries.  The empty entries exist to provide locking for the duration of a
      given page fault.
      
      This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
      entries, PMD DAX entries that have associated block allocations, and 2 MiB
      DAX empty entries.
      
      Unlike the 4k case where we insert a struct page* into the radix tree for
      4k zero pages, for HZP we insert a DAX exceptional entry with the new
      RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
      every 2MiB hole mapping, and it doesn't make sense to have that same struct
      page* with multiple entries in multiple trees.  This would cause contention
      on the single page lock for the one Huge Zero Page, and it would break the
      page->index and page->mapping associations that are assumed to be valid in
      many other places in the kernel.
      
      One difficult use case is when one thread is trying to use 4k entries in
      radix tree for a given offset, and another thread is using 2 MiB entries
      for that same offset.  The current code handles this by making the 2 MiB
      user fall back to 4k entries for most cases.  This was done because it is
      the simplest solution, and because the use of 2MiB pages is already
      opportunistic.
      
      If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
      we run into the problem of how we lock out 4k page faults for the entire
      2MiB range while we clean out the radix tree so we can insert the 2MiB
      entry.  We can solve this problem if we need to, but I think that the cases
      where both 2MiB entries and 4K entries are being used for the same range
      will be rare enough and the gain small enough that it probably won't be
      worth the complexity.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      642261ac
    • dax: coordinate locking for offsets in PMD range · 63e95b5c
      Committed by Ross Zwisler
      DAX radix tree locking currently locks entries based on the unique
      combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
      This works for PTEs, but as we move to PMDs we will need to have all the
      offsets within the range covered by the PMD to map to the same bit lock.
      To accomplish this, for ranges covered by a PMD entry we will instead lock
      based on the page offset of the beginning of the PMD entry.  The 'mapping'
      pointer is still used in the same way.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      63e95b5c
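      A hedged sketch of the keying rule: waiters and wakers on any index inside a
      PMD entry hash on the first index of that 2 MiB range. The PG_PMD_COLOUR name
      and dax_is_pmd_entry() test stand in for however the upstream code expresses
      them:

      	/* Number of 4k indices covered by one PMD entry, minus one (illustrative). */
      	#define PG_PMD_COLOUR	((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)

      	static pgoff_t dax_entry_start(pgoff_t index, void *entry)
      	{
      		/* dax_is_pmd_entry() stands in for the real entry-type test. */
      		if (dax_is_pmd_entry(entry))
      			index &= ~PG_PMD_COLOUR;
      		return index;
      	}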