1. 02 10月, 2014 10 次提交
    • D
      xfs: check xfs_buf_read_uncached returns correctly · ba372674
      Dave Chinner 提交于
      xfs_buf_read_uncached() has two failure modes. If can either return
      NULL or bp->b_error != 0 depending on the type of failure, and not
      all callers check for both. Fix it so that xfs_buf_read_uncached()
      always returns the error status, and the buffer is returned as a
      function parameter. The buffer will only be returned on success.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ba372674
    • D
      xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Dave Chinner 提交于
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO smeantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle on asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      595bff75
    • D
      xfs: kill xfs_bioerror_relse · 8b131973
      Dave Chinner 提交于
      There is only one caller now - xfs_trans_read_buf_map() - and it has
      very well defined call semantics - read, synchronous, and b_iodone
      is NULL. Hence it's pretty clear what error handling is necessary
      for this case. The bigger problem of untangling
      xfs_trans_read_buf_map error handling is left to a future patch.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8b131973
    • D
      xfs: xfs_bioerror can die. · 27187754
      Dave Chinner 提交于
      Internal buffer write error handling is a mess due to the unnatural
      split between xfs_bioerror and xfs_bioerror_relse().
      
      xfs_bwrite() only does sync IO and determines the handler to
      call based on b_iodone, so for this caller the only difference
      between xfs_bioerror() and xfs_bioerror_release() is the XBF_DONE
      flag. We don't care what the XBF_DONE flag state is because we stale
      the buffer in both paths - the next buffer lookup will clear
      XBF_DONE because XBF_STALE is set. Hence we can use common
      error handling for xfs_bwrite().
      
      __xfs_buf_delwri_submit() is a similar - it's only ever called
      on writes - all sync or async - and again there's no reason to
      handle them any differently at all.
      
      Clean up the nasty error handling and remove xfs_bioerror().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      27187754
    • D
      xfs: kill xfs_bdstrat_cb · 8dac3921
      Dave Chinner 提交于
      Only has two callers, and is just a shutdown check and error handler
      around xfs_buf_iorequest. However, the error handling is a mess of
      read and write semantics, and both internal callers only call it for
      writes. Hence kill the wrapper, and follow up with a patch to
      sanitise the error handling.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      8dac3921
    • D
      xfs: rework xfs_buf_bio_endio error handling · 61be9c52
      Dave Chinner 提交于
      Currently the report of a bio error from completion
      immediately marks the buffer with an error. The issue is that this
      is racy w.r.t. synchronous IO - the submitter can see b_error being
      set before the IO is complete, and hence we cannot differentiate
      between submission failures and completion failures.
      
      Add an internal b_io_error field protected by the b_lock to catch IO
      completion errors, and only propagate that to the buffer during
      final IO completion handling. Hence we can tell in xfs_buf_iorequest
      if we've had a submission failure bey checking bp->b_error before
      dropping our b_io_remaining reference - that reference will prevent
      b_io_error values from being propagated to b_error in the event that
      completion races with submission.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      61be9c52
    • D
      xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality · e8aaba9a
      Dave Chinner 提交于
      We do some work in xfs_buf_ioend, and some work in
      xfs_buf_iodone_work, but much of that functionality is the same.
      This work can all be done in a single function, leaving
      xfs_buf_iodone just a wrapper to determine if we should execute it
      by workqueue or directly. hence rename xfs_buf_iodone_work to
      xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
      need async processing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e8aaba9a
    • D
      xfs: synchronous buffer IO needs a reference · e11bb805
      Dave Chinner 提交于
      When synchronous IO runs IO completion work, it does so without an
      IO reference or a hold reference on the buffer. The IO "hold
      reference" is owned by the submitter, and released when the
      submission is complete. The IO reference is released when both the
      submitter and the bio end_io processing is run, and so if the io
      completion work is run from IO completion context, it is run without
      an IO reference.
      
      Hence we can get the situation where the submitter can submit the
      IO, see an error on the buffer and unlock and free the buffer while
      there is still IO in progress. This leads to use-after-free and
      memory corruption.
      
      Fix this by taking a "sync IO hold" reference that is owned by the
      IO and not released until after the buffer completion calls are run
      to wake up synchronous waiters. This means that the buffer will not
      be freed in any circumstance until all IO processing is completed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      e11bb805
    • D
      xfs: Don't use xfs_buf_iowait in the delwri buffer code · cf53e99d
      Dave Chinner 提交于
      For the special case of delwri buffer submission and waiting, we
      don't need to issue IO synchronously at all. The second pass to call
      xfs_buf_iowait() can be replaced with  blocking on xfs_buf_lock() -
      the buffer will be unlocked when the async IO is complete.
      
      This formalises a sane the method of waiting for async IO - take an
      extra reference, submit the IO, call xfs_buf_lock() when you want to
      wait for IO completion. i.e.:
      
      	bp = xfs_buf_find();
      	xfs_buf_hold(bp);
      	bp->b_flags |= XBF_ASYNC;
      	xfs_buf_iosubmit(bp);
      	xfs_buf_lock(bp)
      	error = bp->b_error;
      	....
      	xfs_buf_relse(bp);
      
      While this is somewhat racy for gathering IO errors, none of the
      code that calls xfs_buf_delwri_submit() will race against other
      users of the buffers being submitted. Even if they do, we don't
      really care if the error is detected by the delwri code or the user
      we raced against. Either way, the error will be detected and
      handled.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      cf53e99d
    • D
      xfs: force the log before shutting down · a870fe6d
      Dave Chinner 提交于
      When we have marked the filesystem for shutdown, we want to prevent
      any further buffer IO from being submitted. However, we currently
      force the log after marking the filesystem as shut down, hence
      allowing IO to the log *after* we have marked both the filesystem
      and the log as in an error state.
      
      Clean this up by forcing the log before we mark the filesytem with
      an error. This replaces the pure CIL flush that we currently have
      which works around this same issue (i.e the CIL can't be flushed
      once the shutdown flags are set) and hence enables us to clean up
      the logic substantially.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      a870fe6d
  2. 25 8月, 2014 1 次提交
    • B
      aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise 提交于
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generate, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
      
      A test program from Dan and modified by me for verifying this bug is available
      at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
      Reported-by: NDan Aloni <dan@kernelim.com>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      Acked-by: NDan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d856f32a
  3. 23 8月, 2014 8 次提交
  4. 20 8月, 2014 3 次提交
  5. 18 8月, 2014 3 次提交
  6. 17 8月, 2014 3 次提交
  7. 16 8月, 2014 3 次提交
  8. 15 8月, 2014 9 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
    • F
      Btrfs: fix csum tree corruption, duplicate and outdated checksums · 27b9a812
      Filipe Manana 提交于
      Under rare circumstances we can end up leaving 2 versions of a checksum
      for the same file extent range.
      
      The reason for this is that after calling btrfs_next_leaf we process
      slot 0 of the leaf it returns, instead of processing the slot set in
      path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
      btrfs_next_leaf() releases the path and before it searches for the next
      leaf, another task might cause a split of the next leaf, which migrates
      some of its keys to the leaf we were processing before calling
      btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
      same leaf but with path->slots[0] having a slot number corresponding
      to the first new key it got, that is, a slot number that didn't exist
      before calling btrfs_next_leaf(), as the leaf now has more keys than
      it had before. So we must really process the returned leaf starting at
      path->slots[0] always, as it isn't always 0, and the key at slot 0 can
      have an offset much lower than our search offset/bytenr.
      
      For example, consider the following scenario, where we have:
      
      sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
      four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472
      
        Leaf N:
      
          slot = 0                           slot = btrfs_header_nritems() - 1
        |-------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
        |-------------------------------------------------------------------|
      
        Leaf N + 1:
      
            slot = 0                          slot = btrfs_header_nritems() - 1
        |--------------------------------------------------------------------|
        | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
        |--------------------------------------------------------------------|
      
      Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
      find the next highest key, which releases the current path and then searches
      for that next key. However after releasing the path and before finding that
      next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
      to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
      btrfs_next_leaf() will returns us a path again with leaf N but with the slot
      pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
      is then:
      
          slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
        |----------------------------------------------------------------------------------------------------|
        | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
        |----------------------------------------------------------------------------------------------------|
      
      And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
      into the "insert:" label, which will set tmp to:
      
          tmp = min((sums->len - total_bytes) >> blocksize_bits,
              (next_offset - file_key.offset) >> blocksize_bits) =
          min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
          min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4
      
      and
      
         ins_size = csum_size * tmp = 4 * 4 = 16 bytes.
      
      In other words, we insert a new csum item in the tree with key
      (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
      for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
      because the item with key (CSUM CSUM 40161280) (the one that was moved from
      leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
      bytes of our data and won't get those old checksums removed.
      
      So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
      and breaks the logical rule:
      
         Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover
      
      An obvious bad effect of this is that a subsequent csum tree lookup to get
      the checksum of any of the blocks with logical offset of 40161280, 40165376
      or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      27b9a812
    • T
      Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch · 4eb1f66d
      Takashi Iwai 提交于
      We've got bug reports that btrfs crashes when quota is enabled on
      32bit kernel, typically with the Oops like below:
       BUG: unable to handle kernel NULL pointer dereference at 00000004
       IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
       *pde = 00000000
       Oops: 0000 [#1] SMP
       CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
       Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
       task: f1478130 ti: f147c000 task.ti: f147c000
       EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
       EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
       EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
       ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
       CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
       Stack:
        00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
        00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
        00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
       Call Trace:
        [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
        [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
        [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
        [<c025e38b>] process_one_work+0x11b/0x390
        [<c025eea1>] worker_thread+0x101/0x340
        [<c026432b>] kthread+0x9b/0xb0
        [<c0712a71>] ret_from_kernel_thread+0x21/0x30
        [<c0264290>] kthread_create_on_node+0x110/0x110
      
      This indicates a NULL corruption in prefs_delayed list.  The further
      investigation and bisection pointed that the call of ulist_add_merge()
      results in the corruption.
      
      ulist_add_merge() takes u64 as aux and writes a 64bit value into
      old_aux.  The callers of this function in backref.c, however, pass a
      pointer of a pointer to old_aux.  That is, the function overwrites
      64bit value on 32bit pointer.  This caused a NULL in the adjacent
      variable, in this case, prefs_delayed.
      
      Here is a quick attempt to band-aid over this: a new function,
      ulist_add_merge_ptr() is introduced to pass/store properly a pointer
      value instead of u64.  There are still ugly void ** cast remaining
      in the callers because void ** cannot be taken implicitly.  But, it's
      safer than explicit cast to u64, anyway.
      
      Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
      Cc: <stable@vger.kernel.org> [v3.11+]
      Signed-off-by: NTakashi Iwai <tiwai@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      4eb1f66d
    • L
      Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo 提交于
      When failing to allocate space for the whole compressed extent, we'll
      fallback to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, and these 'clean' pages will
      simply skip 'submit' part and go to endio directly, at last we got data
      corruption as we write nothing.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Tested-By: NMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      ce62003f
    • M
      btrfs: correctly handle return from ulist_add · f90e579c
      Mark Fasheh 提交于
      ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
      doesn't take into account. As a result, that value can be bubbled up to
      callers, causing an error to be printed. Fix this by only returning the
      value of ulist_add() when it indicates an error.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      f90e579c
    • M
      btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh 提交于
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on it's
      exclusive counts.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1152651a
    • F
      Btrfs: read lock extent buffer while walking backrefs · 6f7ff6d7
      Filipe Manana 提交于
      Before processing the extent buffer, acquire a read lock on it, so
      that we're safe against concurrent updates on the extent buffer.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6f7ff6d7
    • J
      Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik 提交于
      Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out to be not the
      case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call it's process function with no_quota set
      always.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e339a6b0
    • D
      btrfs: adjust statfs calculations according to raid profiles · ba7b6e62
      David Sterba 提交于
      This has been discussed in thread:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528
      
      and this patch implements this proposal:
      http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536
      
      Works fine for "clean" raid profiles where the raid factor correction
      does the right job. Otherwise it's pessimistic and may show low space
      although there's still some left.
      
      The df nubmers are lightly wrong in case of mixed block groups, but this
      is not a major usecase and can be addressed later.
      
      The RAID56 numbers are wrong almost the same way as before and will be
      addressed separately.
      
      CC: Hugo Mills <hugo@carfax.org.uk>
      CC: cwillu <cwillu@cwillu.com>
      CC: Josef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      ba7b6e62