1. 01 2月, 2018 1 次提交
  2. 16 11月, 2017 11 次提交
  3. 13 11月, 2017 1 次提交
  4. 04 10月, 2017 1 次提交
  5. 25 9月, 2017 1 次提交
    • L
      fs: Fix page cache inconsistency when mixing buffered and AIO DIO · 332391a9
      Lukas Czerner 提交于
      Currently when mixing buffered reads and asynchronous direct writes it
      is possible to end up with the situation where we have stale data in the
      page cache while the new data is already written to disk. This is
      permanent until the affected pages are flushed away. Despite the fact
      that mixing buffered and direct IO is ill-advised it does pose a thread
      for a data integrity, is unexpected and should be fixed.
      
      Fix this by deferring completion of asynchronous direct writes to a
      process context in the case that there are mapped pages to be found in
      the inode. Later before the completion in dio_complete() invalidate
      the pages in question. This ensures that after the completion the pages
      in the written area are either unmapped, or populated with up-to-date
      data. Also do the same for the iomap case which uses
      iomap_dio_complete() instead.
      
      This has a side effect of deferring the completion to a process context
      for every AIO DIO that happens on inode that has pages mapped. However
      since the consensus is that this is ill-advised practice the performance
      implication should not be a problem.
      
      This was based on proposal from Jeff Moyer, thanks!
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      332391a9
  6. 15 9月, 2017 1 次提交
    • T
      sched/wait: Introduce wakeup boomark in wake_up_page_bit · 11a19c7b
      Tim Chen 提交于
      Now that we have added breaks in the wait queue scan and allow bookmark
      on scan position, we put this logic in the wake_up_page_bit function.
      
      We can have very long page wait list in large system where multiple
      pages share the same wait list. We break the wake up walk here to allow
      other cpus a chance to access the list, and not to disable the interrupts
      when traversing the list for too long.  This reduces the interrupt and
      rescheduling latency, and excessive page wait queue lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11a19c7b
  7. 07 9月, 2017 4 次提交
  8. 05 9月, 2017 2 次提交
  9. 29 8月, 2017 1 次提交
    • L
      page waitqueue: always add new entries at the end · 9c3a815f
      Linus Torvalds 提交于
      Commit 3510ca20 ("Minor page waitqueue cleanups") made the page
      queue code always add new waiters to the back of the queue, which helps
      upcoming patches to batch the wakeups for some horrid loads where the
      wait queues grow to thousands of entries.
      
      However, I forgot about the nasrt add_page_wait_queue() special case
      code that is only used by the cachefiles code.  That one still continued
      to add the new wait queue entries at the beginning of the list.
      
      Fix it, because any sane batched wakeup will require that we don't
      suddenly start getting new entries at the beginning of the list that we
      already handled in a previous batch.
      
      [ The current code always does the whole list while holding the lock, so
        wait queue ordering doesn't matter for correctness, but even then it's
        better to add later entries at the end from a fairness standpoint ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c3a815f
  10. 28 8月, 2017 2 次提交
    • L
      Avoid page waitqueue race leaving possible page locker waiting · a8b169af
      Linus Torvalds 提交于
      The "lock_page_killable()" function waits for exclusive access to the
      page lock bit using the WQ_FLAG_EXCLUSIVE bit in the waitqueue entry
      set.
      
      That means that if it gets woken up, other waiters may have been
      skipped.
      
      That, in turn, means that if it sees the page being unlocked, it *must*
      take that lock and return success, even if a lethal signal is also
      pending.
      
      So instead of checking for lethal signals first, we need to check for
      them after we've checked the actual bit that we were waiting for.  Even
      if that might then delay the killing of the process.
      
      This matches the order of the old "wait_on_bit_lock()" infrastructure
      that the page locking used to use (and is still used in a few other
      areas).
      
      Note that if we still return an error after having unsuccessfully tried
      to acquire the page lock, that is ok: that means that some other thread
      was able to get ahead of us and lock the page, and when that other
      thread then unlocks the page, the wakeup event will be repeated.  So any
      other pending waiters will now get properly woken up.
      
      Fixes: 62906027 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8b169af
    • L
      Minor page waitqueue cleanups · 3510ca20
      Linus Torvalds 提交于
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the two first members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3510ca20
  11. 01 8月, 2017 2 次提交
    • J
      mm: remove optimizations based on i_size in mapping writeback waits · ffb959bb
      Jeff Layton 提交于
      Marcelo added this i_size based optimization with a patch in 2004
      (commitid is from the linux-history tree):
      
          commit 765dad09b4ac101a32d87af2bb793c3060497d3c
          Author: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
          Date:   Tue Sep 7 17:51:17 2004 -0700
      
      	small wait_on_page_writeback_range() optimization
      
      	filemap_fdatawait() calls wait_on_page_writeback_range() with -1
      	as "end" parameter.  This is not needed since we know the EOF
      	from the inode.  Use that instead.
      
      There may be races here, particularly with clustered or network
      filesystems. It also seems like a bit of a layering violation since
      we're operating on an address_space here, not an inode.
      
      Finally, it's also questionable whether this optimization really helps
      on workloads that we care about. Should we be optimizing for writeback
      vs. truncate races in a codepath where we expect to wait anyway? It
      doesn't seem worth the risk.
      
      Remove this optimization from the filemap_fdatawait codepaths. This
      means that filemap_fdatawait becomes a trivial wrapper around
      filemap_fdatawait_range.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      ffb959bb
    • J
      mm: add file_fdatawait_range and file_write_and_wait · a823e458
      Jeff Layton 提交于
      Necessary now for gfs2_fsync and sync_file_range, but there will
      eventually be other callers.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      a823e458
  12. 29 7月, 2017 1 次提交
  13. 27 7月, 2017 1 次提交
  14. 11 7月, 2017 1 次提交
  15. 07 7月, 2017 1 次提交
  16. 06 7月, 2017 4 次提交
    • J
      fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Jeff Layton 提交于
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      value.
      
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      5660e13d
    • J
      mm: don't TestClearPageError in __filemap_fdatawait_range · 5e8fcc1a
      Jeff Layton 提交于
      The -EIO returned here can end up overriding whatever error is marked in
      the address space, and be returned at fsync time, even when there is a
      more appropriate error stored in the mapping.
      
      Read errors are also sometimes tracked on a per-page level using
      PG_error. Suppose we have a read error on a page, and then that page is
      subsequently dirtied by overwriting the whole page. Writeback doesn't
      clear PG_error, so we can then end up successfully writing back that
      page and still return -EIO on fsync.
      
      Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
      return from that is silently discarded. Any subsystem that is relying on
      PG_error to report errors during fsync can easily lose writeback errors
      due to this. All you need is a stray sync() call to wait for writeback
      to complete and you've lost the error.
      
      Since the handling of the PG_error flag is somewhat inconsistent across
      subsystems, let's just rely on marking the address space when there are
      writeback errors. Change the TestClearPageError call to ClearPageError,
      and make __filemap_fdatawait_range a void return function.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      5e8fcc1a
    • J
      mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails · cbeaf951
      Jeff Layton 提交于
      filemap_write_and_wait{_range} will return an error if writeback
      initiation fails, but won't clear errors in the address_space. This is
      particularly problematic on DAX, as filemap_fdatawrite* is
      effectively synchronous there. Ensure that we clear the AS_EIO/AS_ENOSPC
      flags when filemap_fdatawrite* returns an error.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      cbeaf951
    • J
      jbd2: don't clear and reset errors after waiting on writeback · 76341cab
      Jeff Layton 提交于
      Resetting this flag is almost certainly racy, and will be problematic
      with some coming changes.
      
      Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
      Have jbd2 call it instead of filemap_fdatawait and don't attempt to
      re-set the error flag if it fails.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      76341cab
  17. 20 6月, 2017 4 次提交
    • G
      fs: return if direct I/O will trigger writeback · 6be96d3a
      Goldwyn Rodrigues 提交于
      Find out if the I/O will trigger a wait due to writeback. If yes,
      return -EAGAIN.
      
      Return -EINVAL for buffered AIO: there are multiple causes of
      delay such as page locks, dirty throttling logic, page loading
      from disk etc. which cannot be taken care of.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6be96d3a
    • G
      fs: Introduce filemap_range_has_page() · 7fc9e472
      Goldwyn Rodrigues 提交于
      filemap_range_has_page() return true if the file's mapping has
      a page within the range mentioned. This function will be used
      to check if a write() call will cause a writeback of previous
      writes.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7fc9e472
    • I
      sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Ingo Molnar 提交于
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... is was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2055da97
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  18. 09 5月, 2017 1 次提交