1. 20 7月, 2016 2 次提交
  2. 18 5月, 2016 1 次提交
    • B
      xfs: buffer ->bi_end_io function requires irq-safe lock · 9bdd9bd6
      Brian Foster 提交于
      Reports have surfaced of a lockdep splat complaining about an
      irq-safe -> irq-unsafe locking order in the xfs_buf_bio_end_io() bio
      completion handler. This only occurs when I/O errors are present
      because bp->b_lock is only acquired in this context to protect
      setting an error on the buffer. The problem is that this lock can be
      acquired with the (request_queue) q->queue_lock held. See
      scsi_end_request() or ata_qc_schedule_eh(), for example.
      
      Replace the locked test/set of b_io_error with a cmpxchg() call.
      This eliminates the need for the lock and thus the lock ordering
      problem goes away.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      9bdd9bd6
  3. 10 2月, 2016 1 次提交
  4. 19 1月, 2016 1 次提交
    • D
      xfs: log mount failures don't wait for buffers to be released · 85bec546
      Dave Chinner 提交于
      Recently I've been seeing xfs/051 fail on 1k block size filesystems.
      Trying to trace the events during the test lead to the problem going
      away, indicating that it was a race condition that lead to this
      ASSERT failure:
      
      XFS: Assertion failed: atomic_read(&pag->pag_ref) == 0, file: fs/xfs/xfs_mount.c, line: 156
      .....
      [<ffffffff814e1257>] xfs_free_perag+0x87/0xb0
      [<ffffffff814e21b9>] xfs_mountfs+0x4d9/0x900
      [<ffffffff814e5dff>] xfs_fs_fill_super+0x3bf/0x4d0
      [<ffffffff811d8800>] mount_bdev+0x180/0x1b0
      [<ffffffff814e3ff5>] xfs_fs_mount+0x15/0x20
      [<ffffffff811d90a8>] mount_fs+0x38/0x170
      [<ffffffff811f4347>] vfs_kern_mount+0x67/0x120
      [<ffffffff811f7018>] do_mount+0x218/0xd60
      [<ffffffff811f7e5b>] SyS_mount+0x8b/0xd0
      
      When I finally caught it with tracing enabled, I saw that AG 2 had
      an elevated reference count and a buffer was responsible for it. I
      tracked down the specific buffer, and found that it was missing the
      final reference count release that would put it back on the LRU and
      hence be found by xfs_wait_buftarg() calls in the log mount failure
      handling.
      
      The last four traces for the buffer before the assert were (trimmed
      for relevance)
      
      kworker/0:1-5259   xfs_buf_iodone:        hold 2  lock 0 flags ASYNC
      kworker/0:1-5259   xfs_buf_ioerror:       hold 2  lock 0 error -5
      mount-7163	   xfs_buf_lock_done:     hold 2  lock 0 flags ASYNC
      mount-7163	   xfs_buf_unlock:        hold 2  lock 1 flags ASYNC
      
      This is an async write that is completing, so there's nobody waiting
      for it directly.  Hence we call xfs_buf_relse() once all the
      processing is complete. That does:
      
      static inline void xfs_buf_relse(xfs_buf_t *bp)
      {
      	xfs_buf_unlock(bp);
      	xfs_buf_rele(bp);
      }
      
      Now, it's clear that mount is waiting on the buffer lock, and that
      it has been released by xfs_buf_relse() and gained by mount. This is
      expected, because at this point the mount process is in
      xfs_buf_delwri_submit() waiting for all the IO it submitted to
      complete.
      
      The mount process, however, is waiting on the lock for the buffer
      because it is in xfs_buf_delwri_submit(). This waits for IO
      completion, but it doesn't wait for the buffer reference owned by
      the IO to go away. The mount process collects all the completions,
      fails the log recovery, and the higher level code then calls
      xfs_wait_buftarg() to free all the remaining buffers in the
      filesystem.
      
      The issue is that on unlocking the buffer, the scheduler has decided
      that the mount process has higher priority than the the kworker
      thread that is running the IO completion, and so immediately
      switched contexts to the mount process from the semaphore unlock
      code, hence preventing the kworker thread from finishing the IO
      completion and releasing the IO reference to the buffer.
      
      Hence by the time that xfs_wait_buftarg() is run, the buffer still
      has an active reference and so isn't on the LRU list that the
      function walks to free the remaining buffers. Hence we miss that
      buffer and continue onwards to tear down the mount structures,
      at which time we get find a stray reference count on the perag
      structure. On a non-debug kernel, this will be ignored and the
      structure torn down and freed. Hence when the kworker thread is then
      rescheduled and the buffer released and freed, it will access a
      freed perag structure.
      
      The problem here is that when the log mount fails, we still need to
      quiesce the log to ensure that the IO workqueues have returned to
      idle before we run xfs_wait_buftarg(). By synchronising the
      workqueues, we ensure that all IO completions are fully processed,
      not just to the point where buffers have been unlocked. This ensures
      we don't end up in the situation above.
      
      cc: <stable@vger.kernel.org> # 3.18
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      85bec546
  5. 12 1月, 2016 1 次提交
    • D
      xfs: inode recovery readahead can race with inode buffer creation · b79f4a1c
      Dave Chinner 提交于
      When we do inode readahead in log recovery, we do can do the
      readahead before we've replayed the icreate transaction that stamps
      the buffer with inode cores. The inode readahead verifier catches
      this and marks the buffer as !done to indicate that it doesn't yet
      contain valid inodes.
      
      In adding buffer error notification  (i.e. setting b_error = -EIO at
      the same time as as we clear the done flag) to such a readahead
      verifier failure, we can then get subsequent inode recovery failing
      with this error:
      
      XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
      
      This occurs when readahead completion races with icreate item replay
      such as:
      
      	inode readahead
      		find buffer
      		lock buffer
      		submit RA io
      	....
      	icreate recovery
      	    xfs_trans_get_buffer
      		find buffer
      		lock buffer
      		<blocks on RA completion>
      	.....
      	<ra completion>
      		fails verifier
      		clear XBF_DONE
      		set bp->b_error = -EIO
      		release and unlock buffer
      	<icreate gains lock>
      	icreate initialises buffer
      	marks buffer as done
      	adds buffer to delayed write queue
      	releases buffer
      
      At this point, we have an initialised inode buffer that is up to
      date but has an -EIO state registered against it. When we finally
      get to recovering an inode in that buffer:
      
      	inode item recovery
      	    xfs_trans_read_buffer
      		find buffer
      		lock buffer
      		sees XBF_DONE is set, returns buffer
      	    sees bp->b_error is set
      		fail log recovery!
      
      Essentially, we need xfs_trans_get_buf_map() to clear the error status of
      the buffer when doing a lookup. This function returns uninitialised
      buffers, so the buffer returned can not be in an error state and
      none of the code that uses this function expects b_error to be set
      on return. Indeed, there is an ASSERT(!bp->b_error); in the
      transaction case in xfs_trans_get_buf_map() that would have caught
      this if log recovery used transactions....
      
      This patch firstly changes the inode readahead failure to set -EIO
      on the buffer, and secondly changes xfs_buf_get_map() to never
      return a buffer with an error state set so this first change doesn't
      cause unexpected log recovery failures.
      
      cc: <stable@vger.kernel.org> # 3.12 - current
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b79f4a1c
  6. 07 1月, 2016 1 次提交
  7. 04 1月, 2016 1 次提交
  8. 12 10月, 2015 2 次提交
  9. 25 8月, 2015 1 次提交
  10. 29 7月, 2015 2 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
    • J
      xfs: Use consistent logging message prefixes · f41febd2
      Joe Perches 提交于
      The second and subsequent lines of multi-line logging messages
      are not prefixed with the same information as the first line.
      
      Separate messages with newlines into multiple calls to ensure
      consistent prefixing and allow easier grep use.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      f41febd2
  11. 22 6月, 2015 1 次提交
  12. 13 2月, 2015 2 次提交
    • V
      list_lru: add helpers to isolate items · 3f97b163
      Vladimir Davydov 提交于
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f97b163
    • V
      list_lru: introduce list_lru_shrink_{count,walk} · 503c358c
      Vladimir Davydov 提交于
      Kmem accounting of memcg is unusable now, because it lacks slab shrinker
      support.  That means when we hit the limit we will get ENOMEM w/o any
      chance to recover.  What we should do then is to call shrink_slab, which
      would reclaim old inode/dentry caches from this cgroup.  This is what
      this patch set is intended to do.
      
      Basically, it does two things.  First, it introduces the notion of
      per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
      cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
      passed the memory cgroup to scan from in shrink_control->memcg.  For
      such shrinkers shrink_slab iterates over the whole cgroup subtree under
      the target cgroup and calls the shrinker for each kmem-active memory
      cgroup.
      
      Secondly, this patch set makes the list_lru structure per-memcg.  It's
      done transparently to list_lru users - everything they have to do is to
      tell list_lru_init that they want memcg-aware list_lru.  Then the
      list_lru will automatically distribute objects among per-memcg lists
      basing on which cgroup the object is accounted to.  This way to make FS
      shrinkers (icache, dcache) memcg-aware we only need to make them use
      memcg-aware list_lru, and this is what this patch set does.
      
      As before, this patch set only enables per-memcg kmem reclaim when the
      pressure goes from memory.limit, not from memory.kmem.limit.  Handling
      memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
      it is still unclear whether we will have this knob in the unified
      hierarchy.
      
      This patch (of 9):
      
      NUMA aware slab shrinkers use the list_lru structure to distribute
      objects coming from different NUMA nodes to different lists.  Whenever
      such a shrinker needs to count or scan objects from a particular node,
      it issues commands like this:
      
              count = list_lru_count_node(lru, sc->nid);
              freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                         isolate_arg, &sc->nr_to_scan);
      
      where sc is an instance of the shrink_control structure passed to it
      from vmscan.
      
      To simplify this, let's add special list_lru functions to be used by
      shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
      consolidate the nid and nr_to_scan arguments in the shrink_control
      structure.
      
      This will also allow us to avoid patching shrinkers that use list_lru
      when we make shrink_slab() per-memcg - all we will have to do is extend
      the shrink_control structure to include the target memcg and make
      list_lru_shrink_{count,walk} handle this appropriately.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      503c358c
  13. 04 12月, 2014 1 次提交
    • B
      xfs: split metadata and log buffer completion to separate workqueues · b29c70f5
      Brian Foster 提交于
      XFS traditionally sends all buffer I/O completion work to a single
      workqueue. This includes metadata buffer completion and log buffer
      completion. The log buffer completion requires a high priority queue to
      prevent stalls due to log forces getting stuck behind other queued work.
      
      Rather than continue to prioritize all buffer I/O completion due to the
      needs of log completion, split log buffer completion off to
      m_log_workqueue and move the high priority flag from m_buf_workqueue to
      m_log_workqueue.
      
      Add a b_ioend_wq wq pointer to xfs_buf to allow completion workqueue
      customization on a per-buffer basis. Initialize b_ioend_wq to
      m_buf_workqueue by default in the generic buffer I/O submission path.
      Finally, override the default wq with the high priority m_log_workqueue
      in the log buffer I/O submission path.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b29c70f5
  14. 28 11月, 2014 3 次提交
    • C
      xfs: merge xfs_ag.h into xfs_format.h · 4fb6e8ad
      Christoph Hellwig 提交于
      More on-disk format consolidation.  A few declarations that weren't on-disk
      format related move into better suitable spots.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4fb6e8ad
    • E
      xfs: catch invalid negative blknos in _xfs_buf_find() · db52d09e
      Eric Sandeen 提交于
      Here blkno is a daddr_t, which is a __s64; it's possible to hold
      a value which is negative, and thus pass the (blkno >= eofs)
      test.  Then we try to do a xfs_perag_get() for a ridiculous
      agno via xfs_daddr_to_agno(), and bad things happen when that
      fails, and returns a null pag which is dereferenced shortly
      thereafter.
      
      Found via a user-supplied fuzzed image...
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      db52d09e
    • B
      xfs: replace global xfslogd wq with per-mount wq · 78c931b8
      Brian Foster 提交于
      The xfslogd workqueue is a global, single-job workqueue for buffer ioend
      processing. This means we allow for a single work item at a time for all
      possible XFS mounts on a system. fsstress testing in loopback XFS over
      XFS configurations has reproduced xfslogd deadlocks due to the single
      threaded nature of the queue and dependencies introduced between the
      separate XFS instances by online discard (-o discard).
      
      Discard over a loopback device converts the discard request to a hole
      punch (fallocate) on the underlying file. Online discard requests are
      issued synchronously and from xfslogd context in XFS, hence the xfslogd
      workqueue is blocked in the upper fs waiting on a hole punch request to
      be servied in the lower fs. If the lower fs issues I/O that depends on
      xfslogd to complete, both filesystems end up hung indefinitely. This is
      reproduced reliabily by generic/013 on XFS->loop->XFS test devices with
      the '-o discard' mount option.
      
      Further, docker implementations appear to use this kind of configuration
      for container instance filesystems by default (container fs->dm->
      loop->base fs) and therefore are subject to this deadlock when running
      on XFS.
      
      Replace the global xfslogd workqueue with a per-mount variant. This
      guarantees each mount access to a single worker and prevents deadlocks
      due to inter-fs dependencies introduced by discard. Since the queue is
      only responsible for buffer iodone processing at this point in time,
      rename xfslogd to xfs-buf.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      78c931b8
  15. 02 10月, 2014 9 次提交
    • D
      xfs: check xfs_buf_read_uncached returns correctly · ba372674
      Dave Chinner 提交于
      xfs_buf_read_uncached() has two failure modes. If can either return
      NULL or bp->b_error != 0 depending on the type of failure, and not
      all callers check for both. Fix it so that xfs_buf_read_uncached()
      always returns the error status, and the buffer is returned as a
      function parameter. The buffer will only be returned on success.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ba372674
    • D
      xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Dave Chinner 提交于
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO smeantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle on asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      595bff75
    • D
      xfs: kill xfs_bioerror_relse · 8b131973
      Dave Chinner 提交于
      There is only one caller now - xfs_trans_read_buf_map() - and it has
      very well defined call semantics - read, synchronous, and b_iodone
      is NULL. Hence it's pretty clear what error handling is necessary
      for this case. The bigger problem of untangling
      xfs_trans_read_buf_map error handling is left to a future patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8b131973
    • D
      xfs: xfs_bioerror can die. · 27187754
      Dave Chinner 提交于
      Internal buffer write error handling is a mess due to the unnatural
      split between xfs_bioerror and xfs_bioerror_relse().
      
      xfs_bwrite() only does sync IO and determines the handler to
      call based on b_iodone, so for this caller the only difference
      between xfs_bioerror() and xfs_bioerror_release() is the XBF_DONE
      flag. We don't care what the XBF_DONE flag state is because we stale
      the buffer in both paths - the next buffer lookup will clear
      XBF_DONE because XBF_STALE is set. Hence we can use common
      error handling for xfs_bwrite().
      
      __xfs_buf_delwri_submit() is a similar - it's only ever called
      on writes - all sync or async - and again there's no reason to
      handle them any differently at all.
      
      Clean up the nasty error handling and remove xfs_bioerror().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      27187754
    • D
      xfs: kill xfs_bdstrat_cb · 8dac3921
      Dave Chinner 提交于
      Only has two callers, and is just a shutdown check and error handler
      around xfs_buf_iorequest. However, the error handling is a mess of
      read and write semantics, and both internal callers only call it for
      writes. Hence kill the wrapper, and follow up with a patch to
      sanitise the error handling.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      8dac3921
    • D
      xfs: rework xfs_buf_bio_endio error handling · 61be9c52
      Dave Chinner 提交于
      Currently the report of a bio error from completion
      immediately marks the buffer with an error. The issue is that this
      is racy w.r.t. synchronous IO - the submitter can see b_error being
      set before the IO is complete, and hence we cannot differentiate
      between submission failures and completion failures.
      
      Add an internal b_io_error field protected by the b_lock to catch IO
      completion errors, and only propagate that to the buffer during
      final IO completion handling. Hence we can tell in xfs_buf_iorequest
      if we've had a submission failure bey checking bp->b_error before
      dropping our b_io_remaining reference - that reference will prevent
      b_io_error values from being propagated to b_error in the event that
      completion races with submission.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      61be9c52
    • D
      xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality · e8aaba9a
      Dave Chinner 提交于
      We do some work in xfs_buf_ioend, and some work in
      xfs_buf_iodone_work, but much of that functionality is the same.
      This work can all be done in a single function, leaving
      xfs_buf_iodone just a wrapper to determine if we should execute it
      by workqueue or directly. hence rename xfs_buf_iodone_work to
      xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
      need async processing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e8aaba9a
    • D
      xfs: synchronous buffer IO needs a reference · e11bb805
      Dave Chinner 提交于
      When synchronous IO runs IO completion work, it does so without an
      IO reference or a hold reference on the buffer. The IO "hold
      reference" is owned by the submitter, and released when the
      submission is complete. The IO reference is released when both the
      submitter and the bio end_io processing is run, and so if the io
      completion work is run from IO completion context, it is run without
      an IO reference.
      
      Hence we can get the situation where the submitter can submit the
      IO, see an error on the buffer and unlock and free the buffer while
      there is still IO in progress. This leads to use-after-free and
      memory corruption.
      
      Fix this by taking a "sync IO hold" reference that is owned by the
      IO and not released until after the buffer completion calls are run
      to wake up synchronous waiters. This means that the buffer will not
      be freed in any circumstance until all IO processing is completed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      e11bb805
    • D
      xfs: Don't use xfs_buf_iowait in the delwri buffer code · cf53e99d
      Dave Chinner 提交于
      For the special case of delwri buffer submission and waiting, we
      don't need to issue IO synchronously at all. The second pass to call
      xfs_buf_iowait() can be replaced with  blocking on xfs_buf_lock() -
      the buffer will be unlocked when the async IO is complete.
      
      This formalises a sane the method of waiting for async IO - take an
      extra reference, submit the IO, call xfs_buf_lock() when you want to
      wait for IO completion. i.e.:
      
      	bp = xfs_buf_find();
      	xfs_buf_hold(bp);
      	bp->b_flags |= XBF_ASYNC;
      	xfs_buf_iosubmit(bp);
      	xfs_buf_lock(bp)
      	error = bp->b_error;
      	....
      	xfs_buf_relse(bp);
      
      While this is somewhat racy for gathering IO errors, none of the
      code that calls xfs_buf_delwri_submit() will race against other
      users of the buffers being submitted. Even if they do, we don't
      really care if the error is detected by the delwri code or the user
      we raced against. Either way, the error will be detected and
      handled.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      cf53e99d
  16. 09 9月, 2014 2 次提交
    • B
      xfs: mark all internal workqueues as freezable · 8018ec08
      Brian Foster 提交于
      Workqueues must be explicitly set as freezable to ensure they are frozen
      in the assocated part of the hibernation/suspend sequence. Freezing of
      workqueues and kernel threads is important to ensure that modifications
      are not made on-disk after the hibernation image has been created.
      Otherwise, the in-memory state can become inconsistent with what is on
      disk and eventually lead to filesystem corruption. We have reports of
      free space btree corruptions that occur immediately after restore from
      hibernate that suggest the xfs-eofblocks workqueue could be causing
      such problems if it races with hibernation.
      
      Mark all of the internal XFS workqueues as freezable to ensure nothing
      changes on-disk once the freezer infrastructure freezes kernel threads
      and creates the hibernation image.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reported-by: NCarlos E. R. <carlos.e.r@opensuse.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8018ec08
    • T
      block, bdi: an active gendisk always has a request_queue associated with it · ff9ea323
      Tejun Heo 提交于
      bdev_get_queue() returns the request_queue associated with the
      specified block_device.  blk_get_backing_dev_info() makes use of
      bdev_get_queue() to determine the associated bdi given a block_device.
      
      All the callers of bdev_get_queue() including
      blk_get_backing_dev_info() assume that bdev_get_queue() may return
      NULL and implement NULL handling; however, bdev_get_queue() requires
      the passed in block_device is opened and attached to its gendisk.
      Because an active gendisk always has a valid request_queue associated
      with it, bdev_get_queue() can never return NULL and neither can
      blk_get_backing_dev_info().
      
      Make it clear that neither of the two functions can return NULL and
      remove NULL handling from all the callers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ff9ea323
  17. 04 8月, 2014 1 次提交
    • D
      xfs: catch buffers written without verifiers attached · 400b9d88
      Dave Chinner 提交于
      We recently had a bug where buffers were slipping through log
      recovery without any verifier attached to them. This was resulting
      in on-disk CRC mismatches for valid data. Add some warning code to
      catch this occurrence so that we catch such bugs during development
      rather than not being aware they exist.
      
      Note that we cannot do this verification unconditionally as non-CRC
      filesystems don't always attach verifiers to the buffers being
      written. e.g. during log recovery we cannot identify all the
      different types of buffers correctly on non-CRC filesystems, so we
      can't attach the correct verifiers in all cases and so we don't
      attach any. Hence we don't want on non-CRC filesystems to avoid
      spamming the logs with false indications.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      400b9d88
  18. 25 6月, 2014 1 次提交
    • D
      xfs: global error sign conversion · 2451337d
      Dave Chinner 提交于
      Convert all the errors the core XFs code to negative error signs
      like the rest of the kernel and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2451337d
  19. 17 4月, 2014 1 次提交
    • E
      xfs: fix buffer use after free on IO error · 8d6c1210
      Eric Sandeen 提交于
      When testing exhaustion of dm snapshots, the following appeared
      with CONFIG_DEBUG_OBJECTS_FREE enabled:
      
      ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]
      
      indicating that we'd freed a buffer which still had a pending reference,
      down this path:
      
      [  190.867975]  [<ffffffff8133e6fb>] debug_check_no_obj_freed+0x22b/0x270
      [  190.880820]  [<ffffffff811da1d0>] kmem_cache_free+0xd0/0x370
      [  190.892615]  [<ffffffffa02c5924>] xfs_buf_free+0xe4/0x210 [xfs]
      [  190.905629]  [<ffffffffa02c6167>] xfs_buf_rele+0xe7/0x270 [xfs]
      [  190.911770]  [<ffffffffa034c826>] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]
      
      At issue is the fact that if IO fails in xfs_buf_iorequest,
      we'll queue completion unconditionally, and then call
      xfs_buf_rele; but if IO failed, there are no IOs remaining,
      and xfs_buf_rele will free the bp while work is still queued.
      
      Fix this by not scheduling completion if the buffer has
      an error on it; run it immediately.  The rest is only comment
      changes.
      
      Thanks to dchinner for spotting the root cause.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8d6c1210
  20. 14 4月, 2014 3 次提交
  21. 07 3月, 2014 1 次提交
    • D
      xfs: use NOIO contexts for vm_map_ram · ae687e58
      Dave Chinner 提交于
      When we map pages in the buffer cache, we can do so in GFP_NOFS
      contexts. However, the vmap interfaces do not provide any method of
      communicating this information to memory reclaim, and hence we get
      lockdep complaining about it regularly and occassionally see hangs
      that may be vmap related reclaim deadlocks. We can also see these
      same problems from anywhere where we use vmalloc for a large buffer
      (e.g. attribute code) inside a transaction context.
      
      A typical lockdep report shows up as a reclaim state warning like so:
      
      [14046.101458] =================================
      [14046.102850] [ INFO: inconsistent lock state ]
      [14046.102850] 3.14.0-rc4+ #2 Not tainted
      [14046.102850] ---------------------------------
      [14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      [14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [14046.102850]  (&xfs_dir_ilock_class){++++?+}, at: [<791a04bb>] xfs_ilock+0xff/0x16a
      [14046.102850] {RECLAIM_FS-ON-W} state was registered at:
      [14046.102850]   [<7904cdb1>] mark_held_locks+0x81/0xe7
      [14046.102850]   [<7904d390>] lockdep_trace_alloc+0x5c/0xb4
      [14046.102850]   [<790c2c28>] kmem_cache_alloc_trace+0x2b/0x11e
      [14046.102850]   [<790ba7f4>] vm_map_ram+0x119/0x3e6
      [14046.102850]   [<7914e124>] _xfs_buf_map_pages+0x5b/0xcf
      [14046.102850]   [<7914ed74>] xfs_buf_get_map+0x67/0x13f
      [14046.102850]   [<7917506f>] xfs_attr_rmtval_set+0x396/0x4d5
      [14046.102850]   [<7916e8bb>] xfs_attr_leaf_addname+0x18f/0x37d
      [14046.102850]   [<7916ed9e>] xfs_attr_set_int+0x2f5/0x3e8
      [14046.102850]   [<7916eefc>] xfs_attr_set+0x6b/0x74
      [14046.102850]   [<79168355>] xfs_xattr_set+0x61/0x81
      [14046.102850]   [<790e5b10>] generic_setxattr+0x59/0x68
      [14046.102850]   [<790e4c06>] __vfs_setxattr_noperm+0x58/0xce
      [14046.102850]   [<790e4d0a>] vfs_setxattr+0x8e/0x92
      [14046.102850]   [<790e4ddd>] setxattr+0xcf/0x159
      [14046.102850]   [<790e5423>] SyS_lsetxattr+0x88/0xbb
      [14046.102850]   [<79268438>] sysenter_do_call+0x12/0x36
      
      Now, we can't completely remove these traces - mainly because
      vm_map_ram() will do GFP_KERNEL allocation and that generates the
      above warning before we get into the reclaim code, but we can turn
      them all into false positive warnings.
      
      To do that, use the method that DM and other IO context code uses to
      avoid this problem: there is a process flag to tell memory reclaim
      not to do IO that we can set appropriately. That prevents GFP_KERNEL
      context reclaim being done from deep inside the vmalloc code in
      places we can't directly pass a GFP_NOFS context to. That interface
      has a pair of wrapper functions: memalloc_noio_save() and
      memalloc_noio_restore().
      
      Adding them around vm_map_ram and the vzalloc call in
      kmem_alloc_large() will prevent deadlocks and most lockdep reports
      for this issue. Also, convert the vzalloc() call in
      kmem_alloc_large() to use __vmalloc() so that we can pass the
      correct gfp context to the data page allocation routine inside
      __vmalloc() so that it is clear that GFP_NOFS context is important
      to this vmalloc call.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ae687e58
  22. 25 1月, 2014 2 次提交
    • E
      xfs: allow logical-sector sized O_DIRECT · 7c71ee78
      Eric Sandeen 提交于
      Some time ago, mkfs.xfs started picking the storage physical
      sector size as the default filesystem "sector size" in order
      to avoid RMW costs incurred by doing IOs at logical sector
      size alignments.
      
      However, this means that for a filesystem made with i.e.
      a 4k sector size on an "advanced format" 4k/512 disk,
      512-byte direct IOs are no longer allowed.  This means
      that XFS has essentially turned this AF drive into a hard
      4K device, from the filesystem on up.
      
      XFS's mkfs-specified "sector size" is really just controlling
      the minimum size & alignment of filesystem metadata.
      
      There is no real need to tightly couple XFS's minimal
      metadata size to the minimum allowed direct IO size;
      XFS can continue doing metadata in optimal sizes, but
      still allow smaller DIOs for apps which issue them,
      for whatever reason.
      
      This patch adds a new field to the xfs_buftarg, so that
      we now track 2 sizes:
      
       1) The metadata sector size, which is the minimum unit and
          alignment of IO which will be performed by metadata operations.
       2) The device logical sector size
      
      The first is used internally by the file system for metadata
      alignment and IOs.
      The second is used for the minimum allowed direct IO alignment.
      
      This has passed xfstests on filesystems made with 4k sectors,
      including when run under the patch I sent to ignore
      XFS_IOC_DIOINFO, and issue 512 DIOs anyway.  I also directly
      tested end of block behavior on preallocated, sparse, and
      existing files when we do a 512 IO into a 4k file on a 
      4k-sector filesystem, to be sure there were no unexpected
      behaviors.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      7c71ee78
    • E
      xfs: rename xfs_buftarg structure members · 6da54179
      Eric Sandeen 提交于
      In preparation for adding new members to the structure,
      give these old ones more descriptive names:
      
      	bt_ssize -> bt_meta_sectorsize
      	bt_smask -> bt_meta_sectormask
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      6da54179