1. 03 12月, 2014 9 次提交
    • M
      Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 · 4245215d
      Miao Xie 提交于
      The commit c404e0dc (Btrfs: fix use-after-free in the finishing
      procedure of the device replace) fixed a use-after-free problem
      which happened when removing the source device at the end of device
      replace, but at that time, btrfs didn't support device replace
      on raid56, so we didn't fix the problem on the raid56 profile.
      Currently, we implemented device replace for raid56, so we need
      kick that problem out before we enable that function for raid56.
      
      The fix method is very simple, we just increase the bio per-cpu
      counter before we submit a raid56 io, and decrease the counter
      when the raid56 io ends.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      4245215d
    • M
      Btrfs, replace: write raid56 parity into the replace target device · 76035976
      Miao Xie 提交于
      This function reused the code of parity scrub, and we just write
      the right parity or corrected parity into the target device before
      the parity scrub end.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      76035976
    • M
      Btrfs, replace: write dirty pages into the replace target device · 2c8cdd6e
      Miao Xie 提交于
      The implementation is simple:
      - In order to avoid changing the code logic of btrfs_map_bio and
        RAID56, we add the stripes of the replace target devices at the
        end of the stripe array in btrfs bio, and we sort those target
        device stripes in the array. And we keep the number of the target
        device stripes in the btrfs bio.
      - Except write operation on RAID56, all the other operation don't
        take the target device stripes into account.
      - When we do write operation, we read the data from the common devices
        and calculate the parity. Then write the dirty data and new parity
        out, at this time, we will find the relative replace target stripes
        and wirte the relative data into it.
      
      Note: The function that copying old data on the source device to
      the target device was implemented in the past, it is similar to
      the other RAID type.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      2c8cdd6e
    • M
      Btrfs, raid56: support parity scrub on raid56 · 5a6ac9ea
      Miao Xie 提交于
      The implementation is:
      - Read and check all the data with checksum in the same stripe.
        All the data which has checksum is COW data, and we are sure
        that it is not changed though we don't lock the stripe. because
        the space of that data just can be reclaimed after the current
        transction is committed, and then the fs can use it to store the
        other data, but when doing scrub, we hold the current transaction,
        that is that data can not be recovered, it is safe that read and check
        it out of the stripe lock.
      - Lock the stripe
      - Read out all the data without checksum and parity
        The data without checksum and the parity may be changed if we don't
        lock the stripe, so we need read it in the stripe lock context.
      - Check the parity
      - Re-calculate the new parity and write back it if the old parity
        is not right
      - Unlock the stripe
      
      If we can not read out the data or the data we read is corrupted,
      we will try to repair it. If the repair fails. we will mark the
      horizontal sub-stripe(pages on the same horizontal) as corrupted
      sub-stripe, and we will skip the parity check and repair of that
      horizontal sub-stripe.
      
      And in order to skip the horizontal sub-stripe that has no data, we
      introduce a bitmap. If there is some data on the horizontal sub-stripe,
      we will the relative bit to 1, and when we check and repair the
      parity, we will skip those horizontal sub-stripes that the relative
      bits is 0.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      5a6ac9ea
    • M
      Btrfs, raid56: use a variant to record the operation type · 1b94b556
      Miao Xie 提交于
      We will introduce new operation type later, if we still use integer
      variant as bool variant to record the operation type, we would add new
      variant and increase the size of raid bio structure. It is not good,
      by this patch, we define different number for different operation,
      and we can just use a variant to record the operation type.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      1b94b556
    • M
      Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted · af8e2d1d
      Miao Xie 提交于
      This patch implement the RAID5/6 common data repair function, the
      implementation is similar to the scrub on the other RAID such as
      RAID1, the differentia is that we don't read the data from the
      mirror, we use the data repair function of RAID5/6.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      af8e2d1d
    • M
      Btrfs, raid56: don't change bbio and raid_map · b89e1b01
      Miao Xie 提交于
      Because we will reuse bbio and raid_map during the scrub later, it is
      better that we don't change any variant of bbio and don't free it at
      the end of IO request. So we introduced similar variants into the raid
      bio, and don't access those bbio's variants any more.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      b89e1b01
    • Z
      Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block · 6de65650
      Zhao Lei 提交于
      stripe_index's value was set again in latter line:
      stripe_index = 0;
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      6de65650
    • Z
      Btrfs: remove noused bbio_ret in __btrfs_map_block in condition · f90523d1
      Zhao Lei 提交于
      bbio_ret in this condition is always !NULL because previous code
      already have a check-and-skip:
      4908 if (!bbio_ret)
      4909     goto out;
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      f90523d1
  2. 20 11月, 2014 1 次提交
    • C
      btrfs: fix lockups from btrfs_clear_path_blocking · f82c458a
      Chris Mason 提交于
      The fair reader/writer locks mean that btrfs_clear_path_blocking needs
      to strictly follow lock ordering rules even when we already have
      blocking locks on a given path.
      
      Before we can clear a blocking lock on the path, we need to make sure
      all of the locks have been converted to blocking.  This will remove lock
      inversions against anyone spinning in write_lock() against the buffers
      we're trying to get read locks on.  These inversions didn't exist before
      the fair read/writer locks, but now we need to be more careful.
      
      We papered over this deadlock in the past by changing
      btrfs_try_read_lock() to be a true trylock against both the spinlock and
      the blocking lock.  This was slower, and not sufficient to fix all the
      deadlocks.  This patch adds a btrfs_tree_read_lock_atomic(), which
      basically means get the spinlock but trylock on the blocking lock.
      Signed-off-by: NChris Mason <clm@fb.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reported-by: NPatrick Schmid <schmid@phys.ethz.ch>
      cc: stable@vger.kernel.org #v3.15+
      f82c458a
  3. 04 11月, 2014 1 次提交
  4. 29 10月, 2014 1 次提交
    • F
      Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items · d05a2b4c
      Filipe Manana 提交于
      We have a race that can lead us to miss skinny extent items in the function
      btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
      So basically the sequence of steps is:
      
      1) We search in the extent tree for the skinny extent, which returns > 0
         (not found);
      
      2) We check the previous item in the returned leaf for a non-skinny extent,
         and we don't find it;
      
      3) Because we didn't find the non-skinny extent in step 2), we release our
         path to search the extent tree again, but this time for a non-skinny
         extent key;
      
      4) Right after we released our path in step 3), a skinny extent was inserted
         in the extent tree (delayed refs were run) - our second extent tree search
         will miss it, because it's not looking for a skinny extent;
      
      5) After the second search returned (with ret > 0), we look for any delayed
         ref for our extent's bytenr (and we do it while holding a read lock on the
         leaf), but we won't find any, as such delayed ref had just run and completed
         after we released out path in step 3) before doing the second search.
      
      Fix this by removing completely the path release and re-search logic. This is
      safe, because if we seach for a metadata item and we don't find it, we have the
      guarantee that the returned leaf is the one where the item would be inserted,
      and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
      non-skinny extent item is if it exists. The only case where path->slots[0] is
      zero is when there are no smaller keys in the tree (i.e. no left siblings for
      our leaf), in which case the re-search logic isn't needed as well.
      
      This race has been present since the introduction of skinny metadata (change
      3173a18f).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d05a2b4c
  5. 28 10月, 2014 3 次提交
  6. 24 10月, 2014 1 次提交
  7. 17 10月, 2014 1 次提交
    • C
      Revert "Btrfs: race free update of commit root for ro snapshots" · d3797308
      Chris Mason 提交于
      This reverts commit 9c3b306e.
      
      Switching only one commit root during a transaction is wrong because it
      leads the fs into an inconsistent state. All commit roots should be
      switched at once, at transaction commit time, otherwise backref walking
      can often miss important references that were only accessible through
      the old commit root.  Plus, the root item for the snapshot's root wasn't
      getting updated and preventing the next transaction commit to do it.
      
      This made several users get into random corruption issues after creation
      of readonly snapshots.
      
      A regression test for xfstests will follow soon.
      
      Cc: stable@vger.kernel.org # 3.17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d3797308
  8. 14 10月, 2014 1 次提交
  9. 09 10月, 2014 1 次提交
  10. 08 10月, 2014 2 次提交
  11. 06 10月, 2014 1 次提交
    • Q
      btrfs: Make btrfs handle security mount options internally to avoid losing security label. · f667aef6
      Qu Wenruo 提交于
      [BUG]
      Originally when mount btrfs with "-o subvol=" mount option, btrfs will
      lose all security lable.
      And if the btrfs fs is mounted somewhere else, due to the lost of
      security lable, SELinux will refuse to mount since the same super block
      is being mounted using different security lable.
      
      [REPRODUCER]
      With SELinux enabled:
       #mkfs -t btrfs /dev/sda5
       #mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
       #btrfs subvolume create /mnt/btrfs/subvol
       #mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5
        /mnt/test
      
      kernel message:
      SELinux: mount invalid.  Same superblock, different security settings
      for (dev sda5, type btrfs)
      
      [REASON]
      This happens because btrfs will call vfs_kern_mount() and then
      mount_subtree() to handle subvolume name lookup.
      First mount will cut off all the security lables and when it comes to
      the second vfs_kern_mount(), it has no security label now.
      
      [FIX]
      This patch will makes btrfs behavior much more like nfs,
      which has the type flag FS_BINARY_MOUNTDATA,
      making btrfs handles the security label internally.
      So security label will be set in the real mount time and won't lose
      label when use with "subvol=" mount option.
      Reported-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f667aef6
  12. 04 10月, 2014 12 次提交
    • F
      Btrfs: send, don't delay dir move if there's a new parent inode · bf8e8ca6
      Filipe Manana 提交于
      If between two snapshots we rename an existing directory named X to Y and
      make it a child (direct or not) of a new inode named X, we were delaying
      the move/rename of the former directory unnecessarily, which would result
      in attempting to rename the new directory from its orphan name to name X
      prematurely.
      
      Minimal reproducer:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ mkdir -p /mnt/merlin/RC/OSD/Source
      
          $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
      
          $ mkdir /mnt/OSD
          $ mv /mnt/merlin/RC/OSD /mnt/OSD/OSD-Plane_788
          $ mv /mnt/OSD /mnt/merlin/RC
      
          $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2
      
          $ btrfs send /mnt/mysnap1 -f /tmp/1.snap
          $ btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
      
          $ mkfs.btrfs -f /dev/vdc
          $ mount /dev/vdc /mnt2
      
          $ btrfs receive /mnt2 -f /tmp/1.snap
          $ btrfs receive /mnt2 -f /tmp/2.snap
      
      The second receive (from an incremental send) failed with the following
      error message: "rename o261-7-0 -> merlin/RC/OSD failed".
      This is a regression introduced in the 3.16 kernel.
      
      A test case for xfstests follows.
      Reported-by: NMarc Merlin <marc@merlins.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bf8e8ca6
    • D
      btrfs: add more superblock checks · c926093e
      David Sterba 提交于
      Populate btrfs_check_super_valid() with checks that try to verify
      consistency of superblock by additional conditions that may arise from
      corrupted devices or bitflips. Some of tests are only hints and issue
      warnings instead of failing the mount, basically when the checks are
      derived from the data found in the superblock.
      
      Tested on a broken image provided by Qu.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      c926093e
    • S
      Btrfs: fix race in WAIT_SYNC ioctl · 42383020
      Sage Weil 提交于
      We check whether transid is already committed via last_trans_committed and
      then search through trans_list for pending transactions.  If
      last_trans_committed is updated by btrfs_commit_transaction after we check
      it (there is no locking), we will fail to find the committed transaction
      and return EINVAL to the caller.  This has been observed occasionally by
      ceph-osd (which uses this ioctl heavily).
      
      Fix by rechecking whether the provided transid <= last_trans_committed
      after the search fails, and if so return 0.
      Signed-off-by: NSage Weil <sage@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      42383020
    • F
      Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana 提交于
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this later case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the new 3 flags for the btree inode also makes us achieve the
      goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
      writeback for all dirty pages and before filemap_fdatawait_range() is
      called, the writeback for all dirty pages had already finished with
      errors - because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      656f30db
    • F
      Btrfs: remove redundant btrfs_verify_qgroup_counts declaration. · 15b636e1
      Fabian Frederick 提交于
      Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS
      and keep prototype in qgroup.h only
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NChris Mason <clm@fb.com>
      15b636e1
    • F
      btrfs: fix shadow warning on cmp · b99d9a6a
      Fabian Frederick 提交于
      cmp was declared twice in btrfs_compare_trees resulting in a shadow
      warning. This patch renames second internal variable.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NChris Mason <clm@fb.com>
      b99d9a6a
    • F
      Btrfs: fix compilation errors under DEBUG · 1b6e4469
      Fabian Frederick 提交于
      bi_sector and bi_size moved to bi_iter since commit 4f024f37
      ("block: Abstract out bvec iterator")
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NChris Mason <clm@fb.com>
      1b6e4469
    • L
      Btrfs: fix crash of btrfs_release_extent_buffer_page · 81465028
      Liu Bo 提交于
      This is actually inspired by Filipe's patch.  When write_one_eb() fails on
      submit_extent_page(), it'll give up writing this eb and mark it with
      EXTENT_BUFFER_IOERR.  So if it's not the last page that encounter the failure,
      there are some left pages which remain DIRTY, and if a later COW on this eb
      happens, ie. eb is COWed and freed, it'd run into BUG_ON in
      btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));
      
      This adds the missing clear_page_dirty_for_io() for the rest pages of eb.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      81465028
    • F
      Btrfs: add missing end_page_writeback on submit_extent_page failure · 55e3bd2e
      Filipe Manana 提交于
      If submit_extent_page() fails in write_one_eb(), we end up with the current
      page not marked dirty anymore, unlocked and marked for writeback. But we never
      end up calling end_page_writeback() against the page, which will make calls to
      filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
      for the writeback bit to be cleared from the page.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      55e3bd2e
    • Q
      btrfs: Fix the wrong condition judgment about subset extent map · 32be3a1a
      Qu Wenruo 提交于
      Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
      best fitted extent map
      is using wrong condition to judgement whether the range is a subset of a
      existing extent map.
      
      This may cause bug in btrfs no-holes mode.
      
      This patch will correct the judgment and fix the bug.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      32be3a1a
    • J
      Btrfs: fix build_backref_tree issue with multiple shared blocks · bbe90514
      Josef Bacik 提交于
      Marc Merlin sent me a broken fs image months ago where it would blow up in the
      upper->checked BUG_ON() in build_backref_tree.  This is because we had a
      scenario like this
      
      block a -- level 4 (not shared)
         |
      block b -- level 3 (reloc block, shared)
         |
      block c -- level 2 (not shared)
         |
      block d -- level 1 (shared)
         |
      block e -- level 0 (shared)
      
      We go to build a backref tree for block e, we notice block d is shared and add
      it to the list of blocks to lookup it's backrefs for.  Now when we loop around
      we will check edges for the block, so we will see we looked up block c last
      time.  So we lookup block d and then see that the block that points to it is
      block c and we can just skip that edge since we've already been up this path.
      The problem is because we clear need_check when we see block d (as it is shared)
      we never add block b as needing to be checked.  And because block c is in our
      path already we bail out before we walk up to block b and add it to the backref
      check list.
      
      To fix this we need to reset need_check if we trip over a block that doesn't
      need to be checked.  This will make sure that any subsequent blocks in the path
      as we're walking up afterwards are added to the list to be processed.  With this
      patch I can now mount Marc's fs image and it'll complete the balance without
      panicing.  Thanks,
      Reported-by: NMarc MERLIN <marc@merlins.org>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bbe90514
    • J
      Btrfs: cleanup error handling in build_backref_tree · 75bfb9af
      Josef Bacik 提交于
      When balance panics it tends to panic in the
      
      BUG_ON(!upper->checked);
      
      test, because it means it couldn't build the backref tree properly.  This is
      annoying to users and frankly a recoverable error, nothing in this function is
      actually fatal since it is just an in-memory building of the backrefs for a
      given bytenr.  So go through and change all the BUG_ON()'s to ASSERT()'s, and
      fix the BUG_ON(!upper->checked) thing to just return an error.
      
      This patch also fixes the error handling so it tears down the work we've done
      properly.  This code was horribly broken since we always just panic'ed instead
      of actually erroring out, so it needed to be completely re-worked.  With this
      patch my broken image no longer panics when I mount it.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      75bfb9af
  13. 02 10月, 2014 6 次提交