1. 10 6月, 2014 40 次提交
    • J
      Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik 提交于
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if you say are defragging an extent with 5k snapshots pointing
      to it we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to lookup roots for bytenrs, this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs we simply add new ref operations whenever we notice that we need to
      when we've modified the refs themselves.
      
      2) tree mod seq:  we no longer have this separation of major/minor counters.
      this makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have it's own unique sequence
      number, rather whenever we go to lookup backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fcebe456
    • L
      Btrfs: mark mapping with error flag to report errors to userspace · 5dca6eea
      Liu Bo 提交于
      According to commit 865ffef3
      (fs: fix fsync() error reporting),
      it's not stable to just check error pages because pages can be
      truncated or invalidated, we should also mark mapping with error
      flag so that a later fsync can catch the error.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5dca6eea
    • L
      Btrfs: fix NULL pointer crash of deleting a seed device · 29cc83f6
      Liu Bo 提交于
      Same as normal devices, seed devices should be initialized with
      fs_info->dev_root as well, otherwise we'll get a NULL pointer crash.
      
      Cc: Chris Murphy <lists@colorremedies.com>
      Reported-by: NChris Murphy <lists@colorremedies.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      29cc83f6
    • W
      Btrfs: fix joining same transaction handle more than twice · f017f15f
      Wang Shilong 提交于
      We hit something like the following function call flows:
      
      |->run_delalloc_range()
       |->btrfs_join_transaction()
         |->cow_file_range()
           |->btrfs_join_transaction()
             |->find_free_extent()
               |->btrfs_join_transaction()
      
      Trace infomation can be seen as:
      
      [ 7411.127040] ------------[ cut here ]------------
      [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]()
      [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G           O 3.13.0+ #4
      [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
      [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
      [ 7411.127092] Call Trace:
      [ 7411.127097]  [<ffffffff815b87b0>] dump_stack+0x45/0x56
      [ 7411.127101]  [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0
      [ 7411.127102]  [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20
      [ 7411.127109]  [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs]
      [ 7411.127115]  [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs]
      [ 7411.127120]  [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs]
      [ 7411.127126]  [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
      [ 7411.127131]  [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs]
      [ 7411.127137]  [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
      [ 7411.127142]  [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs]
      [ 7411.127146]  [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
      [ 7411.127151]  [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
      [ 7411.127157]  [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
      [ 7411.127163]  [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
      [ 7411.127169]  [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs]
      [ 7411.127171]  [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140
      [ 7411.127176]  [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
      [ 7411.127182]  [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs]
      [ 7411.127187]  [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs]
      [ 7411.127194]  [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs]
      [ 7411.127200]  [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs]
      [ 7411.127207]  [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs]
      [ 7411.127212]  [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs]
      [ 7411.127219]  [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs]
      [ 7411.127222]  [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330
      [ 7411.127228]  [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs]
      
      Here we fix it by avoiding joining transaction again if we have held
      a transaction handle when allocating chunk in find_free_extent().
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f017f15f
    • M
    • F
      Btrfs: check if items are ordered when a leaf is marked dirty · 1f21ef0a
      Filipe Manana 提交于
      To ease finding bugs during development related to modifying btree leaves
      in such a way that it makes its items not sorted by key anymore. Since this
      is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
      is set, which isn't meant to be enabled for regular users.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      1f21ef0a
    • F
      Btrfs: don't access non-existent key when csum tree is empty · 35045bf2
      Filipe Manana 提交于
      When the csum tree is empty, our leaf (path->nodes[0]) has a number
      of items equal to 0 and since btrfs_header_nritems() returns an
      unsigned integer (and so is our local nritems variable) the following
      comparison always evaluates to false:
      
           if (path->slots[0] >= nritems - 1) {
      
      As the casting rules lead to:
      
           if ((u32)0 >= (u32)4294967295) {
      
      This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf
      some lines below:
      
          btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
          if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
              found_key.type != BTRFS_EXTENT_CSUM_KEY) {
      		found_next = 1;
      		goto insert;
          }
      
      So just don't access such non-existent slot and don't set found_next to 1
      when the tree is empty. It's very unlikely we'll get a random key with the
      objectid and type values above, which is where we could go into trouble.
      
      If nritems is 0, just set found_next to 1 anyway as it will make us insert
      a csum item covering our whole extent (or the whole leaf) when the tree is
      empty.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      35045bf2
    • W
      Btrfs: make sure there are not any read requests before stopping workers · de348ee0
      Wang Shilong 提交于
      In close_ctree(), after we have stopped all workers,there maybe still
      some read requests(for example readahead) to submit and this *maybe* trigger
      an oops that user reported before:
      
      kernel BUG at fs/btrfs/async-thread.c:619!
      
      By hacking codes, i can reproduce this problem with one cpu available.
      We fix this potential problem by invalidating all btree inode pages before
      stopping all workers.
      
      Thanks to Miao for pointing out this problem.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      de348ee0
    • T
      Btrfs: fix possible memory leak in btrfs_create_tree() · 59885b39
      Tsutomu Itoh 提交于
      In btrfs_create_tree(), if btrfs_insert_root() fails, we should
      free root->commit_root.
      Reported-by: NAlex Lyakas <alex@zadarastorage.com>
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      59885b39
    • Z
      btrfs: remove useless ACL check · 776e4aae
      ZhangZhen 提交于
      posix_acl_xattr_set() already does the check, and it's the only
      way to feed in an ACL from userspace.
      So the check here is useless, remove it.
      Signed-off-by: Nzhang zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      776e4aae
    • A
      btrfs: btrfs_rm_device() should zero mirror SB as well · 4d90d28b
      Anand Jain 提交于
      This fix will ensure all SB copies on the disk is zeroed
      when the disk is intentionally removed. This helps to
      better manage disks in the user land.
      
      This version of patch also merges the Zach patch as below.
      
       btrfs: don't double brelse on device rm
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4d90d28b
    • M
    • F
      Btrfs: send, fix more issues related to directory renames · f959492f
      Filipe Manana 提交于
      This is a continuation of the previous changes titled:
      
         Btrfs: fix incremental send's decision to delay a dir move/rename
         Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
      
      There's a few more cases where a directory rename/move must be delayed which was
      previously overlooked. If our immediate ancestor has a lower inode number than
      ours and it doesn't have a delayed rename/move operation associated to it, it
      doesn't mean there isn't any non-direct ancestor of our current inode that needs
      to be renamed/moved before our current inode (i.e. with a higher inode number
      than ours).
      
      So we can't stop the search if our immediate ancestor has a lower inode number than
      ours, we need to navigate the directory hierarchy upwards until we hit the root or:
      
      1) find an ancestor with an higher inode number that was renamed/moved in the send
         root too (or already has a pending rename/move registered);
      2) find an ancestor that is a new directory (higher inode number than ours and
         exists only in the send root).
      
      Reproducer for case 1)
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt
      
          $ mkdir -p /mnt/a/b
          $ mkdir -p /mnt/a/c/d
          $ mkdir /mnt/a/b/e
          $ mkdir /mnt/a/c/d/f
          $ mv /mnt/a/b /mnt/a/c/d/2b
          $ mkdir /mnt/a/x
          $ mkdir /mnt/a/y
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap1
          $ btrfs send /mnt/snap1 -f /tmp/base.send
      
          $ mv /mnt/a/x /mnt/a/y
          $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
          $ mv /mnt/a/c/d /mnt/a/h/2d
          $ mv /mnt/a/c /mnt/a/h/2d/2b/2c
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap2
          $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      Simple reproducer for case 2)
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt
      
          $ mkdir -p /mnt/a/b
          $ mkdir /mnt/a/c
          $ mv /mnt/a/b /mnt/a/c/b2
          $ mkdir /mnt/a/e
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap1
          $ btrfs send /mnt/snap1 -f /tmp/base.send
      
          $ mv /mnt/a/c/b2 /mnt/a/e/b3
          $ mkdir /mnt/a/e/b3/f
          $ mkdir /mnt/a/h
          $ mv /mnt/a/c /mnt/a/e/b3/f/c2
          $ mv /mnt/a/e /mnt/a/h/e2
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap2
          $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      Another simple reproducer for case 2)
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt
      
          $ mkdir -p /mnt/a/b
          $ mkdir /mnt/a/c
          $ mkdir /mnt/a/b/d
          $ mkdir /mnt/a/c/e
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap1
          $ btrfs send /mnt/snap1 -f /tmp/base.send
      
          $ mkdir /mnt/a/b/d/f
          $ mkdir /mnt/a/b/g
          $ mv /mnt/a/c/e /mnt/a/b/g/e2
          $ mv /mnt/a/c /mnt/a/b/d/f/c2
          $ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap2
          $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      More complex reproducer for case 2)
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt
      
          $ mkdir -p /mnt/a/b
          $ mkdir -p /mnt/a/c/d
          $ mkdir /mnt/a/b/e
          $ mkdir /mnt/a/c/d/f
          $ mv /mnt/a/b /mnt/a/c/d/2b
          $ mkdir /mnt/a/x
          $ mkdir /mnt/a/y
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap1
          $ btrfs send /mnt/snap1 -f /tmp/base.send
      
          $ mv /mnt/a/x /mnt/a/y
          $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
          $ mv /mnt/a/c/d /mnt/a/h/2d
          $ mv /mnt/a/c /mnt/a/h/2d/2b/2c
      
          $ btrfs subvolume snapshot -r /mnt /mnt/snap2
          $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      For both cases the incremental send would enter an infinite loop when building
      path strings.
      
      While solving these cases, this change also re-implements the code to detect
      when directory moves/renames should be delayed. Instead of dealing with several
      specific cases separately, it's now more generic handling all cases with a simple
      detection algorithm and if when applying a delayed move/rename there's a path loop
      detected, it further delays the move/rename registering a new ancestor inode as
      the dependency inode (so our rename happens after that ancestor is renamed).
      
      Tests for these cases is being added to xfstests too.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f959492f
    • F
    • F
      Btrfs: send, account for orphan directories when building path strings · c992ec94
      Filipe Manana 提交于
      If we have directories with a pending move/rename operation, we must take into
      account any orphan directories that got created before executing the pending
      move/rename. Those orphan directories are directories with an inode number higher
      then the current send progress and that don't exist in the parent snapshot, they
      are created before current progress reaches their inode number, with a generated
      name of the form oN-M-I and at the root of the filesystem tree, and later when
      progress matches their inode number, moved/renamed to their final location.
      
      Reproducer:
      
                $ mkfs.btrfs -f /dev/sdd
                $ mount /dev/sdd /mnt
      
                $ mkdir -p /mnt/a/b/c/d
                $ mkdir /mnt/a/b/e
                $ mv /mnt/a/b/c /mnt/a/b/e/CC
                $ mkdir /mnt/a/b/e/CC/d/f
      	  $ mkdir /mnt/a/g
      
                $ btrfs subvolume snapshot -r /mnt /mnt/snap1
                $ btrfs send /mnt/snap1 -f /tmp/base.send
      
                $ mkdir /mnt/a/g/h
      	  $ mv /mnt/a/b/e /mnt/a/g/h/EE
                $ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD
      
                $ btrfs subvolume snapshot -r /mnt /mnt/snap2
                $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
      
      The second receive command failed with the following error:
      
          ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory
      
      A test case for xfstests follows soon.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c992ec94
    • F
      Btrfs: send, avoid unnecessary inode item lookup in the btree · b46ab97b
      Filipe Manana 提交于
      Regardless of whether the caller is interested or not in knowing the inode's
      generation (dir_gen != NULL), get_first_ref always does a btree lookup to get
      the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which
      is in some cases).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b46ab97b
    • G
      btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel space · 23f8f9b7
      Gui Hecheng 提交于
      For RAID0,5,6,10,
      For system chunk, there shouldn't be too many stripes to
      make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE
      For data/meta chunk, there shouldn't be too many stripes to
      make a btrfs_chunk that exceeds a leaf.
      Signed-off-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      23f8f9b7
    • G
      btrfs: fix wrong max system array size check in kernel space · 5f43f86e
      Gui Hecheng 提交于
      For system chunk array,
      We copy a "disk_key" and an chunk item each time,
      so there should be enough space to hold both of them,
      not only the chunk item.
      Signed-off-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5f43f86e
    • Q
      btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots. · 65d33fd7
      Qu Wenruo 提交于
      Current btrfs_orphan_cleanup will also cleanup roots which is already in
      fs_info->dead_roots without protection.
      This will have conditional race with fs_info->cleaner_kthread.
      
      This patch will use refs in root->root_item to detect roots in
      dead_roots and avoid conflicts.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      65d33fd7
    • M
      Btrfs: reclaim the reserved metadata space at background · 21c7e756
      Miao Xie 提交于
      Before applying this patch, the task had to reclaim the metadata space
      by itself if the metadata space was not enough. And When the task started
      the space reclamation, all the other tasks which wanted to reserve the
      metadata space were blocked. At some cases, they would be blocked for
      a long time, it made the performance fluctuate wildly.
      
      So we introduce the background metadata space reclamation, when the space
      is about to be exhausted, we insert a reclaim work into the workqueue, the
      worker of the workqueue helps us to reclaim the reserved space at the
      background. By this way, the tasks needn't reclaim the space by themselves at
      most cases, and even if the tasks have to reclaim the space or are blocked
      for the space reclamation, they will get enough space more quickly.
      
      Here is my test result(Tested by compilebench):
       Memory:	2GB
       CPU:		2Cores * 1CPU
       Partition:	40GB(SSD)
      
      Test command:
       # compilebench -D <mnt> -m
      
      Without this patch:
       intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
       compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
       read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
       delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
      
      With this patch:
       intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
       compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
       read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
       delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      21c7e756
    • M
      Btrfs: output warning instead of error when loading free space cache failed · 32d6b47f
      Miao Xie 提交于
      If we fail to load a free space cache, we can rebuild it from the extent tree,
      so it is not a serious error, we should not output a error message that
      would make the users uncomfortable. This patch uses warning message instead
      of it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      32d6b47f
    • Q
      btrfs: Add ctime/mtime update for btrfs device add/remove. · 5a1972bd
      Qu Wenruo 提交于
      Btrfs will send uevent to udev inform the device change,
      but ctime/mtime for the block device inode is not udpated, which cause
      libblkid used by btrfs-progs unable to detect device change and use old
      cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an
      error message.
      Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Cc: Karel Zak <kzak@redhat.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5a1972bd
    • D
      btrfs: assert that send is not in progres before root deletion · 61155aa0
      David Sterba 提交于
      CC: Miao Xie <miaox@cn.fujitsu.com>
      CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      61155aa0
    • D
      btrfs: protect snapshots from deleting during send · 521e0546
      David Sterba 提交于
      The patch "Btrfs: fix protection between send and root deletion"
      (18f687d5) does not actually prevent to delete the snapshot
      and just takes care during background cleaning, but this seems rather
      user unfriendly, this patch implements the idea presented in
      
      http://www.spinics.net/lists/linux-btrfs/msg30813.html
      
      - add an internal root_item flag to denote a dead root
      - check if the send_in_progress is set and refuse to delete, otherwise
        set the flag and proceed
      - check the flag in send similar to the btrfs_root_readonly checks, for
        all involved roots
      
      The root lookup in send via btrfs_read_fs_root_no_name will check if the
      root is really dead or not. If it is, ENOENT, aborted send. If it's
      alive, it's protected by send_in_progress, send can continue.
      
      CC: Miao Xie <miaox@cn.fujitsu.com>
      CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      521e0546
    • D
      btrfs: remove redundant null check in btrfs_dentry_release() · 944a4515
      Daeseok Youn 提交于
      It doesn't need to check NULL for kfree()
      Signed-off-by: NDaeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      944a4515
    • F
      Btrfs: implement inode_operations callback tmpfile · ef3b9af5
      Filipe Manana 提交于
      This implements the tmpfile callback of struct inode_operations, introduced
      in the linux kernel 3.11, and implemented already by some filesystems. This
      callback is invoked by the VFS when the flag O_TMPFILE is passed to the open
      system call.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      ef3b9af5
    • D
      btrfs: make FS_INFO ioctl available to anyone · e4ef90ff
      David Sterba 提交于
      This ioctl provides basic info about the filesystem that can be obtained
      in other ways (eg. sysfs), there's no reason to restrict it to
      CAP_SYSADMIN.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      e4ef90ff
    • D
      btrfs: make DEV_INFO ioctl available to anyone · 7d6213c5
      David Sterba 提交于
      This ioctl provides basic info about the devices that can be obtained in
      other ways (eg. sysfs), there's no reason to restrict it to
      CAP_SYSADMIN.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      7d6213c5
    • D
      btrfs: export more from FS_INFO to sysfs · df93589a
      David Sterba 提交于
      Similar to the FS_INFO updates, export the basic filesystem info through
      sysfs: node size, sector size and clone alignment.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      df93589a
    • D
      btrfs: retrieve more info from FS_INFO ioctl · 80a773fb
      David Sterba 提交于
      Provide the basic information about filesystem through the ioctl:
      * b-tree node size (same as leaf size)
      * sector size
      * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments
      
      Backward compatibility: if the values are 0, kernel does not provide
      this information, the applications should ignore them.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      80a773fb
    • D
      btrfs: balance filter: add limit of processed chunks · 7d824b6f
      David Sterba 提交于
      This started as debugging helper, to watch the effects of converting
      between raid levels on multiple devices, but could be useful standalone.
      
      In my case the usage filter was not finegrained enough and led to
      converting too many chunks at once. Another example use is in connection
      with drange+devid or vrange filters that allow to work with a specific
      chunk or even with a chunk on a given device.
      
      The limit filter applies last, the value of 0 means no limiting.
      
      CC: Ilya Dryomov <idryomov@gmail.com>
      CC: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      7d824b6f
    • F
      Btrfs: fix leaf corruption caused by ENOSPC while hole punching · fc19c5e7
      Filipe Manana 提交于
      While running a stress test with multiple threads writing to the same btrfs
      file system, I ended up with a situation where a leaf was corrupted in that
      it had 2 file extent item keys that had the same exact key. I was able to
      detect this quickly thanks to the following patch which triggers an assertion
      as soon as a leaf is marked dirty if there are duplicated keys or out of order
      keys:
      
          Btrfs: check if items are ordered when a leaf is marked dirty
          (https://patchwork.kernel.org/patch/3955431/)
      
      Basically while running the test, I got the following in dmesg:
      
          [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
          (...)
          [28877.415917] Call Trace:
          [28877.415922]  [<ffffffff816f1189>] dump_stack+0x4e/0x68
          [28877.415926]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
          [28877.415929]  [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
          [28877.415944]  [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
          [28877.415949]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
          [28877.415962]  [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
          [28877.415972]  [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
          [28877.415984]  [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
          (...)
          [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
          [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
          (...)
          [29854.132637] 	item 23 key (3486 108 667648) itemoff 2694 itemsize 53
          [29854.132638] 		extent data disk bytenr 14574411776 nr 286720
          [29854.132639] 		extent data offset 0 nr 286720 ram 286720
          [29854.132640] 	item 24 key (3486 108 954368) itemoff 2641 itemsize 53
          [29854.132641] 		extent data disk bytenr 0 nr 0
          [29854.132643] 		extent data offset 0 nr 0 ram 0
          [29854.132644] 	item 25 key (3486 108 954368) itemoff 2588 itemsize 53
          [29854.132645] 		extent data disk bytenr 8699670528 nr 77824
          [29854.132646] 		extent data offset 0 nr 77824 ram 77824
          [29854.132647] 	item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
          [29854.132648] 		extent data disk bytenr 8699670528 nr 77824
          [29854.132649] 		extent data offset 0 nr 77824 ram 77824
          (...)
          [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
          (...)
          [29854.132771] Call Trace:
          [29854.132779]  [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
          [29854.132791]  [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
          [29854.132794]  [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
          [29854.132797]  [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
          [29854.132800]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
          [29854.132810]  [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
          [29854.132820]  [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
          [29854.132830]  [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
          (...)
      
      So this is caused by getting an -ENOSPC error while punching a file hole, more
      specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop
      of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
      or more file extent items due to lack of enough free space. When this happens,
      in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
      block reservation object to root->fs_info->trans_block_rsv, end our transaction and
      start a new transaction basically - and, we keep increasing our current offset
      (cur_offset) as long as it's smaller than the end of the target range (lockend) -
      this makes use leave the loop with cur_offset == drop_end which in turn makes us
      call fill_holes() for inserting a file extent item that represents a 0 bytes range
      hole (and this insertion succeeds, as in the meanwhile more space became available).
      
      This 0 bytes file hole extent item is a problem because any subsequent caller of
      __btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a
      start file offset that is equal to the offset of the hole, will not remove this
      extent item due to the following conditional in the while loop of
      __btrfs_drop_extents:
      
          if (extent_end <= search_start) {
                  path->slots[0]++;
                  goto next_slot;
          }
      
      This later makes the call to setup_items_for_insert() (at the very end of
      __btrfs_drop_extents), insert a new file extent item with the same offset as
      the 0 bytes file hole extent item that follows it. Needless is to say that this
      causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
      where we perform leaf sanity checks or in subsequent operations that manipulate
      file extent items, as in the fallocate call as shown by the dmesg trace above.
      
      Without my other patch to perform the leaf sanity checks once a leaf is marked
      as dirty (if the integrity checker is enabled), it would have been much harder
      to debug this issue.
      
      This change might fix a few similar issues reported by users in the mailing
      list regarding assertion failures in btrfs_set_item_key_safe calls performed
      by __btrfs_drop_extents, such as the following report:
      
          http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938
      
      Asking fill_holes() to create a 0 bytes wide file hole item also produced the
      first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
      that has an end smaller (by -1) than its start.
      
      On 3.14 kernels this issue manifests itself through leaf corruption, as we get
      duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
      but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
      instead we have callers of __btrfs_drop_extents(), namely the functions
      inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
      btrfs_insert_empty_item() to insert the new file extent item, which would fail with
      error -EEXIST, instead of inserting a duplicated key - which is still a serious
      issue as it would make all similar file extent item replace operations keep
      failing if they target the same file range.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fc19c5e7
    • L
      Btrfs: do not increment on bio_index one by one · d2cbf2a2
      Liu Bo 提交于
      'bio_index' is just a index, it's really not necessary to do increment
      one by one.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      d2cbf2a2
    • F
      Btrfs: read inode size after acquiring the mutex when punching a hole · a1a50f60
      Filipe Manana 提交于
      In a previous change, commit 12870f1c,
      I accidentally moved the roundup of inode->i_size to outside of the
      critical section delimited by the inode mutex, which is not atomic and
      not correct since the size can be changed by other task before we acquire
      the mutex. Therefore fix it.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a1a50f60
    • T
      btrfs: Remove unnecessary check for NULL · 7fb18a06
      Tobias Klauser 提交于
      iput() already checks for the inode being NULL, thus it's unnecessary to
      check before calling.
      Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: NChris Mason <clm@fb.com>
      7fb18a06
    • Z
      btrfs: fix inline compressed read err corruption · 166ae5a4
      Zach Brown 提交于
      uncompress_inline() is dropping the error from btrfs_decompress() after
      testing it and zeroing the page that was supposed to hold decompressed
      data.  This can silently turn compressed inline data in to zeros if
      decompression fails due to corrupt compressed data or memory allocation
      failure.
      
      I verified this by manually forcing the error from btrfs_decompress()
      for a silly named copy of od:
      
      	if (!strcmp(current->comm, "failod"))
      		ret = -ENOMEM;
      
        # od -x /mnt/btrfs/dir/80 | head -1
        0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031
        # echo 3 > /proc/sys/vm/drop_caches
        # cp $(which od) /tmp/failod
        # /tmp/failod -x /mnt/btrfs/dir/80 | head -1
        0000000 0000 0000 0000 0000 0000 0000 0000 0000
      
      The fix is to pass the error to its caller.  Which still has a BUG_ON().
      So we fix that too.
      
      There seems to be no reason for the zeroing of the page on the error
      from btrfs_decompress() but not from the allocation error a few lines
      above.  So the page zeroing is removed.
      Signed-off-by: NZach Brown <zab@redhat.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      166ae5a4
    • Z
      btrfs: return ptr error from compression workspace · 774bcb35
      Zach Brown 提交于
      The btrfs compression wrappers translated errors from workspace
      allocation to either -ENOMEM or -1.  The compression type workspace
      allocators are already returning a ERR_PTR(-ENOMEM).  Just return that
      and get rid of the magical -1.
      
      This helps a future patch return errors from the compression wrappers.
      Signed-off-by: NZach Brown <zab@redhat.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      774bcb35
    • Z
      btrfs: return errno instead of -1 from compression · 60e1975a
      Zach Brown 提交于
      The compression layer seems to have been built to return -1 and have
      callers make up errors that make sense.  This isn't great because there
      are different errors that originate down in the compression layer.
      
      Let's return real negative errnos from the compression layer so that
      callers can pass on the error without having to guess what happened.
      ENOMEM for allocation failure, E2BIG when compression exceeds the
      uncompressed input, and EIO for everything else.
      
      This helps a future path return errors from btrfs_decompress().
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      60e1975a
    • S
      btrfs: check_int: propagate out-of-memory error upwards · 98806b44
      Stefan Behrens 提交于
      This issue was not causing any harm but IMO (and in the opinion of the
      static code checker) it is better to propagate this error status upwards.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      98806b44
    • F
      Btrfs: fix hang on error (such as ENOSPC) when writing extent pages · 61391d56
      Filipe Manana 提交于
      When running low on available disk space and having several processes
      doing buffered file IO, I got the following trace in dmesg:
      
      [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
      [ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
      [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 0x00000000
      [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
      [ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
      [ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
      [ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
      [ 4202.720922] Call Trace:
      [ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
      [ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
      [ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
      [ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
      [ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
      [ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
      (...)
      [ 4202.721027] 2 locks held by kworker/u8:1/5450:
      [ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
      [ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
      [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
      [ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
      [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 0x00000001
      [ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
      [ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
      [ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
      [ 4202.721718] Call Trace:
      [ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
      [ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
      [ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
      [ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
      [ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
      [ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
      [ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
      [ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
      [ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
      [ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
      [ 4202.721799]  [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
      [ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
      (...)
      
      It turns out that extent_io.c:__extent_writepage(), which ends up being called
      through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
      -ENOSPC when calling the fill_delalloc callback. In this situation, it returned
      without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
      ever being called for the respective page, which prevents the ordered extent's
      bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
      is never queued into the endio_write_workers queue. This makes the task that
      called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
      extent.
      
      This is fairly easy to reproduce using a small filesystem and fsstress on
      a quad core vm:
      
          mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
          mount /dev/sdd /mnt
      
          fsstress -p 6 -d /mnt -n 100000 -x \
              "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
      	    -f allocsp=0 \
      	    -f bulkstat=0 \
      	    -f bulkstat1=0 \
      	    -f chown=0 \
      	    -f creat=1 \
      	    -f dread=0 \
      	    -f dwrite=0 \
      	    -f fallocate=1 \
      	    -f fdatasync=0 \
      	    -f fiemap=0 \
      	    -f freesp=0 \
      	    -f fsync=0 \
      	    -f getattr=0 \
      	    -f getdents=0 \
      	    -f link=0 \
      	    -f mkdir=0 \
      	    -f mknod=0 \
      	    -f punch=1 \
      	    -f read=0 \
      	    -f readlink=0 \
      	    -f rename=0 \
      	    -f resvsp=0 \
      	    -f rmdir=0 \
      	    -f setxattr=0 \
      	    -f stat=0 \
      	    -f symlink=0 \
      	    -f sync=0 \
      	    -f truncate=1 \
      	    -f unlink=0 \
      	    -f unresvsp=0 \
      	    -f write=4
      
      So just ensure that if an error happens while writing the extent page
      we call the writepage_end_io_hook callback. Also make it return the
      error code and ensure the caller (extent_write_cache_pages) processes
      all pages in the page vector even if an error happens only for some
      of them, so that ordered extents end up released.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      61391d56