1. 11 Mar, 2014 33 commits
    • Btrfs: remove unnecessary memory barrier in btrfs_sync_log() · 7483e1a4
      Miao Xie committed
      A mutex unlock implies memory barriers that make sure all memory
      operations complete before the unlock, and the next mutex lock implies memory
      barriers that make sure all memory operations happen after the lock. Together
      they act as a full memory barrier (smp_mb), so we needn't add memory barriers.
      Remove them.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: don't start the log transaction if the log tree init fails · e87ac136
      Miao Xie committed
      The old code would start the log transaction even if the log tree init
      failed, which was unnecessary. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix the skipped transaction commit during the file sync · 48cab2e0
      Miao Xie committed
      We may abort the wait early if ->last_trans_log_full_commit was set to
      the current transaction id. In this case, we need to commit the current
      transaction instead of the log sub-transaction. But the current code
      didn't tell the caller to do so (it returned 0, not -EAGAIN). Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit · 5c902ba6
      Miao Xie committed
      ->last_trans_log_full_commit may be changed by other tasks without a lock,
      so we need to prevent the compiler from optimizing accesses such as
      	tmp = fs_info->last_trans_log_full_commit
      	if (tmp == ...)
      		...
      
      	<do something>
      
      	if (tmp == ...)
      		...
      
      In fact, we need to get the new value of ->last_trans_log_full_commit during
      the second access. Fix it with ACCESS_ONCE().
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
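The effect described in the ACCESS_ONCE commit above can be sketched in userspace C. This is a minimal illustration, not the patch itself: `ACCESS_ONCE_SKETCH`, `struct fs_info_sketch` and `need_full_commit()` are hypothetical names, and the macro mirrors the classic volatile-cast form of the kernel macro, relying on the GCC/Clang `__typeof__` extension.

```c
#include <assert.h>

/*
 * Userspace stand-in for ACCESS_ONCE(): the volatile cast forces the
 * compiler to emit a real load on every use instead of reusing a value
 * it read earlier.
 */
#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, cut-down stand-in for the filesystem info structure. */
struct fs_info_sketch {
    unsigned long long last_trans_log_full_commit;
};

/*
 * Returns 1 if a full commit was requested for transid, 0 otherwise.
 * The field is re-read from memory on every call.
 */
static int need_full_commit(struct fs_info_sketch *info,
                            unsigned long long transid)
{
    return ACCESS_ONCE_SKETCH(info->last_trans_log_full_commit) == transid;
}
```

Calling `need_full_commit()` before and after another task updates the field gives two independent loads, which is the behaviour the commit message asks for at the second check.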
    • Btrfs: avoid warning bomb of btrfs_invalidate_inodes · 7813b3db
      Liu Bo committed
      After a transaction is aborted, we need to clean up inode resources by
      calling btrfs_invalidate_inodes(). btrfs_invalidate_inodes() used to
      expect the roots' refs to be zero and sets a WARN_ON() for that. However,
      this is not always true while cleaning up a transaction, so detect the
      transaction abort and do not warn at all.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix possible deadlock in btrfs_cleanup_transaction · 2a85d9ca
      Liu Bo committed
      [13654.480669] ======================================================
      [13654.480905] [ INFO: possible circular locking dependency detected ]
      [13654.481003] 3.12.0+ #4 Tainted: G        W  O
      [13654.481060] -------------------------------------------------------
      [13654.481060] btrfs-transacti/9347 is trying to acquire lock:
      [13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060] but task is already holding lock:
      [13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
      [13654.481060] which lock already depends on the new lock.
      
      [13654.481060] the existing dependency chain (in reverse order) is:
      [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
      [13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
      [13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
      [13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
      [13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
      [13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
      [13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
      [13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
      [13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
      [13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
      [13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
      [13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
      [13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
      [13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
      [13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
      [13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
      [13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
      [13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
      [13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
      [13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
      [13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
      [13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
      [13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
      [13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
      [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
      [13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
      [13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
      [13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
      [13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
      [13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
      [13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
      [13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [13654.481060] other info that might help us debug this:
      
      [13654.481060]  Possible unsafe locking scenario:
      
      [13654.481060]        CPU0                    CPU1
      [13654.481060]        ----                    ----
      [13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
      [13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
      [13654.481060]
       *** DEADLOCK ***
      [...]
      
      ======================================================
      
      btrfs_destroy_all_ordered_extents()
      gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
      while btrfs_[add,remove]_ordered_extent()
      acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.
      
      This patch fixes the above problem.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
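The lock inversion reported in the deadlock commit above can be modelled with a toy order tracker. This is only a sketch of the idea behind the warning for a direct two-lock inversion, not lockdep's real algorithm and not the fix in the patch; `record_and_check()` and the enum names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for the two spinlocks named in the report. */
enum lock_id { ORDERED_ROOT_LOCK, ORDERED_EXTENT_LOCK, NR_LOCKS };

/* seen[a][b] is true once some path took lock b while holding lock a. */
static bool seen[NR_LOCKS][NR_LOCKS];

/*
 * Record that `acquired` was taken while `held` was held.
 * Returns true if the opposite order was recorded earlier, which is the
 * AB/BA pattern that can deadlock two CPUs.
 */
static bool record_and_check(enum lock_id held, enum lock_id acquired)
{
    seen[held][acquired] = true;
    return seen[acquired][held];
}
```

Recording the add/remove path first (extent lock held, root lock acquired) and then the cleanup path (root lock held, extent lock acquired) makes the second call report the inversion.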
    • Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana committed
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item only if we deleted existing
      file extent items covering our target file range, also insert the new
      file extent item if we didn't find any existing items to delete and
      replace_extent != 0. In this case our caller would do another tree
      search to insert the new file extent item anyway, so combine the two
      tree searches into a single one, saving CPU time, reducing lock
      contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: always choose work from prio_head first · 51b98eff
      Stanislaw Gruszka committed
      If we do not refill, we can overwrite the cur pointer taken from prio_head
      with one from the non-prioritized head, which does not look intended.

      This change makes us always take work from prio_head first, as long as
      it is not empty.
      Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Revert "Btrfs: remove transaction from btrfs send" · dcfd5ad2
      Wang Shilong committed
      This reverts commit 41ce9970.
      Previously I thought we could use a readonly root's commit root
      safely, but that is not true: a readonly root may be COWed in the
      following cases.

      1. Snapshotting the send root will COW the source root.
      2. Balance and device operations will also COW the readonly send root
         to relocate it.

      So I have two ideas to make it safe to use the commit root.

      --> approach 1:
      Protect it with a transaction, end the transaction properly, and search
      again for the next item from the root node (see btrfs_search_slot_for_read()).

      --> approach 2:
      Add another counter to the local root structure to sync snapshot with send,
      and add a global counter to sync send with exclusive device operations.

      With approach 2, send can use the commit root safely, because we make sure
      the send root cannot be COWed during send. Unfortunately, it makes the code
      *ugly* and more complex to maintain.

      Making snapshot and send mutually exclusive, and device operations and send
      mutually exclusive, is a little confusing for common users.

      So why not go back to the previous way.

      Cc: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: skip readonly root for snapshot-aware defragment · bcbba5e6
      Wang Shilong committed
      Btrfs send assumes a readonly root won't change, so skip readonly roots.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: switch to btrfs_previous_extent_item() · 850a8cdf
      Wang Shilong committed
      Since we have introduced btrfs_previous_extent_item() to search for the
      previous extent item, just switch to it.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: skip submitting barrier for missing device · f88ba6a2
      Hidetoshi Seto committed
      I got an error on v3.13:
       BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)
      
      how to reproduce:
        > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
        > wipefs -a /dev/sdf2
        > mount -o degraded /dev/sdf1 /mnt
        > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt
      
      The reason for the error is that barrier_all_devices() failed to submit
      a barrier to the missing device.  However, it is clear that we cannot do
      anything on a missing device, and it is also not necessary to care about
      chunks on the missing device.

      This patch stops sending/waiting for the barrier if the device is missing.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: unlock extent and pages on error in cow_file_range · 29bce2f3
      Josef Bacik committed
      When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
      made it so we just return an error instead of unlocking all of our various
      stuff.  This is a mistake and causes us to hang when we run into this.  This
      patch fixes this problem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: balance delayed inode updates · c581afc8
      Josef Bacik committed
      While trying to reproduce a delayed ref problem I noticed the box kept falling
      over, using all 80GB of my RAM with btrfs_inode's and btrfs_delayed_node's.
      Turns out this is because we only throttle delayed inode updates in
      btrfs_dirty_inode, which doesn't actually get called that often, especially when
      all you are doing is creating a bunch of files.  So balance delayed inode
      updates every time we create a new inode.  With this patch we no longer use up
      all of our RAM with delayed inode updates.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: add simple debugfs interface · 1bae3098
      David Sterba committed
      Help debugging by exporting various interesting information and
      tunables without the need for extra mount options or ioctls.

      Usage:
      * declare your variable in sysfs.h, and include it where you need it
      * define the variable in sysfs.c and make it visible via
        debugfs_create_TYPE

      Depends on CONFIG_DEBUG_FS.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: lower memory requirements in common case · ace01050
      David Sterba committed
      The fs_path structure uses an inline buffer and falls back to a chain of
      allocations, but vmalloc is not necessary because PATH_MAX fits into
      PAGE_SIZE.
      
      The size of fs_path has been reduced to 256 bytes from PAGE_SIZE,
      usually 4k. Experimental measurements show that most paths on a single
      filesystem do not exceed 200 bytes, and these get stored into the inline
      buffer directly, which is now 230 bytes. Longer paths are kmalloced when
      needed.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
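The inline-buffer idea from the fs_path commit above can be sketched as a small-buffer structure with a heap fallback. This is an illustration under assumptions, not the layout used in send.c: `struct path_sketch` and its helpers are hypothetical, userspace `malloc` stands in for `kmalloc`, and only the 230-byte inline capacity is taken from the commit message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_LEN 230 /* inline capacity quoted in the commit message */

/* Hypothetical path holder: short paths live in inline_buf, long ones on the heap. */
struct path_sketch {
    char *buf;   /* points at inline_buf or at a heap block */
    size_t cap;  /* capacity of buf in bytes */
    char inline_buf[INLINE_LEN];
};

static void path_init(struct path_sketch *p)
{
    p->buf = p->inline_buf;
    p->cap = sizeof(p->inline_buf);
    p->buf[0] = '\0';
}

static int path_is_inline(const struct path_sketch *p)
{
    return p->buf == p->inline_buf;
}

/* Store name; allocate only when it does not fit. Returns 0, or -1 on allocation failure. */
static int path_set(struct path_sketch *p, const char *name)
{
    size_t need = strlen(name) + 1;

    if (need > p->cap) {
        char *heap = malloc(need);

        if (!heap)
            return -1;
        if (!path_is_inline(p))
            free(p->buf);
        p->buf = heap;
        p->cap = need;
    }
    memcpy(p->buf, name, need);
    return 0;
}

/* Release any heap block and go back to the empty inline state. */
static void path_free(struct path_sketch *p)
{
    if (!path_is_inline(p))
        free(p->buf);
    path_init(p);
}
```

With this shape, the common case of paths under the inline capacity never touches the allocator, which is the memory saving the commit message describes.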
    • Btrfs: make some tree searches in send.c more efficient · dff6d0ad
      Filipe David Borba Manana committed
      We have this pattern where we search for a contiguous group of
      items in a tree and every time we find an item, we process it, then
      we release our path, increment the offset of the search key, do
      another full tree search and repeat these steps until a tree search
      can't find more items we're interested in.

      Instead of doing these full tree searches after processing each item,
      just process the next item/slot in our leaf and don't release the path.
      Since all these trees are read only and we always use the commit root
      for a search and skip node/leaf locks, we're not affecting concurrency
      on the trees.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: use right extent item position in send when finding extent clones · a0859c09
      Filipe David Borba Manana committed
      This was a leftover from the commit:

         74dd17fb
         (Btrfs: fix btrfs send for inline items and compression)
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove BUG_ON from name_cache_delete · 57fb8910
      David Sterba committed
      If cleaning the name cache fails, we could try to proceed at the cost of
      some memory leak. This is not expected to happen often.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove BUG from process_all_refs · 4d1a63b2
      David Sterba committed
      There are only 2 static callers, so the BUG would normally never be
      reached, but let's be nice.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: squeeze bitfilelds in fs_path · 1f5a7ff9
      David Sterba committed
      We know that buf_len is at most PATH_MAX, 4k, and can merge it with the
      reversed member. This saves 3 bytes in favor of inline_buf.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove virtual_mem member from fs_path · e25a8122
      David Sterba committed
      We don't need to keep track of that, it's available via is_vmalloc_addr.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: remove prepared member from fs_path · b23ab57d
      David Sterba committed
      The member is used only to return a value back from
      fs_path_prepare_for_add, we can do it locally and save 8 bytes for the
      inline_buf path.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: send: replace check with an assert in gen_unique_name · 64792f25
      David Sterba committed
      The buffer passed to snprintf can hold the fully expanded format string,
      64 = 3x largest ULL + 3x char + trailing null.  I don't think that removing the
      check entirely is a good idea, hence the ASSERT.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
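The buffer-size arithmetic in the gen_unique_name commit above (3 x 20 digits + 3 characters + 1 trailing NUL = 64) can be checked with a short C sketch. The format string here, with one leading letter and two dashes, is an assumption chosen to give three extra characters; `worst_case_len()` is a hypothetical helper, and the result assumes a 64-bit `unsigned long long`.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Bytes needed, including the trailing NUL, to format three
 * maximum-valued unsigned long longs plus three separator characters.
 * The "o%llu-%llu-%llu" format is an assumed example, not quoted from
 * the patch.
 */
static int worst_case_len(void)
{
    /* With a NULL buffer and size 0, snprintf only reports the length. */
    int n = snprintf(NULL, 0, "o%llu-%llu-%llu",
                     ULLONG_MAX, ULLONG_MAX, ULLONG_MAX);

    return n + 1;
}
```

ULLONG_MAX is 18446744073709551615, which has 20 decimal digits, so the helper returns 3 * 20 + 3 + 1 = 64, matching the figure in the commit message.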
    • Btrfs: more send support for parent/child dir relationship inversion · 5ed7f9ff
      Filipe David Borba Manana committed
      The commit titled "Btrfs: fix infinite path build loops in incremental send"
      didn't cover a particular case where the parent-child relationship inversion
      of directories doesn't imply a rename of the new parent directory. This was
      due to a simple logic mistake, a logical and instead of a logical or.
      
      Steps to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb3
        $ mount /dev/sdb3 /mnt/btrfs
        $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4
        $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
        $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44
        $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44
        $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3
        $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11
        $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
        $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
      
      A patch to update the test btrfs/030 from xfstests, so that it covers
      this case, will be submitted soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix send dealing with file renames and directory moves · 03cb4fb9
      Filipe David Borba Manana committed
      This fixes a case that the commit titled:
      
         Btrfs: fix infinite path build loops in incremental send
      
      didn't cover. If the parent-child relationship between 2 directories
      is inverted, both get renamed, and the former parent has a file that
      got renamed too (but remains a child of that directory), the incremental
      send operation would use the file's old path after sending an unlink
      operation for that old path, causing receive to fail on future operations
      like changing owner, permissions or utimes of the corresponding inode.
      
      This is not a regression from the commit mentioned before, as without
      that commit we would fall into the issues that commit fixed, so it's
      just one case that wasn't covered before.
      
      Simple steps to reproduce this issue are:
      
            $ mkfs.btrfs -f /dev/sdb3
            $ mount /dev/sdb3 /mnt/btrfs
            $ mkdir -p /mnt/btrfs/a/b/c/d
            $ touch /mnt/btrfs/a/b/c/d/file
            $ mkdir -p /mnt/btrfs/a/b/x
            $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1
            $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2
            $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2
            $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2
            $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2
            $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send
      
      A patch to update the test btrfs/030 from xfstests, so that it covers
      this case, will be submitted soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: only add roots if necessary in find_parent_nodes() · 98cfee21
      Wang Shilong committed
      find_all_leafs() doesn't actually need to add all roots; add roots only
      if needed, which avoids an unnecessary ulist dance.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl · abccd00f
      Hugo Mills committed
      The structure for the BTRFS_SET_RECEIVED_SUBVOL ioctl packs differently
      on 32-bit and 64-bit systems. This means that it is impossible to use btrfs
      receive on a system with a 64-bit kernel and 32-bit userspace, because
      the structure size (and hence the ioctl number) is different.

      This patch adds a compatibility structure and ioctl to deal with the
      above case.
      Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
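The packing difference behind the 32/64-bit ioctl commit above can be illustrated with a hypothetical pair of fields. The structs below are not the real ioctl argument structure; they only show the mechanism, using the GCC/Clang `packed` attribute to imitate a layout with no tail padding. On 32-bit x86 a 64-bit integer inside a struct is aligned to 4 bytes, so the natural layout there already has the 12-byte size that the packed struct has everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: padded to the alignment of the 64-bit member. */
struct ts_natural {
    uint64_t sec;
    uint32_t nsec;
};

/* Packed layout: no padding, the same size on every ABI. */
struct ts_packed {
    uint64_t sec;
    uint32_t nsec;
} __attribute__((packed));

static size_t natural_size(void)
{
    return sizeof(struct ts_natural);
}

static size_t packed_size(void)
{
    return sizeof(struct ts_packed);
}
```

Because the ioctl number encodes the size of its argument structure, a size mismatch like this between a 32-bit caller and a 64-bit kernel yields a different ioctl number, which is why a separate compatibility structure is needed.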
    • Btrfs: add missing error check in incremental send · d86477b3
      Filipe David Borba Manana committed
      Function wait_for_parent_move() returns a negative value if an error
      happened, 0 if we don't need to wait for the parent's move, and
      1 if the wait is needed.
      Before this change an error return value was being treated like the
      return value 1, which was not correct.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
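The tri-state return convention described in the commit above can be handled as in the following sketch. The caller `handle_wait_result()` and its `deferred` flag are hypothetical; only the meaning of the three return values comes from the commit message.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical caller-side handling of a tri-state result:
 *   ret < 0  -> error, propagate it unchanged
 *   ret == 0 -> no wait needed, continue
 *   ret == 1 -> the wait is needed, mark the work as deferred
 * Returns 0 on success or the negative error code.
 */
static int handle_wait_result(int ret, int *deferred)
{
    *deferred = 0;
    if (ret < 0)
        return ret; /* the error must not be mistaken for "wait" */
    if (ret == 1)
        *deferred = 1;
    return 0;
}
```

Testing `ret < 0` before testing for the "wait" case keeps the error path separate, which is the check the commit says was missing.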
    • Btrfs: fix use-after-free in the finishing procedure of the device replace · c404e0dc
      Miao Xie committed
      During a device replace test, we hit a null pointer dereference (it was very
      easy to reproduce by running xfstests' btrfs/011 on devices with the virtio
      scsi driver). There were two bugs that caused this problem:
      - We might allocate new chunks on the replaced device after we updated
        the mapping tree, and we forgot to replace the source device in the
        mappings of those new chunks.
      - We might get mapping information that includes the source device
        before the mapping information update, and then submit a bio based
        on that mapping information after we freed the source device.

      For the first bug, we can fix it by doing the mapping tree update and the
      source device removal in the same context of the chunk mutex. The chunk
      mutex is used to protect the allocable device list, so this method prevents
      new chunk allocation, and after we remove the source device, all
      new chunks will be allocated on the new device. So it fixes
      the first bug.

      For the second bug, we need to make sure all in-flight bios are finished and
      no new bios are produced while we are removing the source device. To fix
      this problem, we introduced a global @bio_counter. We not only inc/dec
      @bio_counter outside of map_blocks, but also inc it before submitting a bio
      and dec @bio_counter when ending bios.

      Since Raid56 is a little different and device replace doesn't support raid56
      yet, it is not addressed in this patch, and I added comments to make sure we
      will fix it in the future.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace · 391cd9df
      Miao Xie committed
      The alloc list of the filesystem is protected by ->chunk_mutex, so we need
      to take that mutex when we insert the new device into the list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Return EXDEV for cross file system snapshot · 23ad5b17
      Kusanagi Kouichi committed
      EXDEV seems an appropriate error if an operation fails because it
      crosses file system boundaries.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: don't mix the ordered extents of all files together during logging the inodes · 827463c4
      Miao Xie committed
      There was a problem in the old code:
      If we failed to log the csum, we would free all the ordered extents in the
      log list, including those that were logged successfully, which would make
      the log committer not wait for the completion of the ordered extents.

      This patch doesn't insert the ordered extents that are about to be logged
      into a global list; instead, we insert them into a local list. If we log the
      ordered extents successfully, we splice them onto the global list; otherwise
      we throw them away and then do a full sync. This can also reduce lock
      contention and the list traversal time.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
  2. 16 Feb, 2014 2 commits
    • Btrfs: use right clone root offset for compressed extents · 93de4ba8
      Filipe David Borba Manana committed
      For non compressed extents, iterate_extent_inodes() gives us offsets
      that take into account the data offset from the file extent items, while
      for compressed extents it doesn't. Therefore we have to adjust them before
      placing them in a send clone instruction. Not doing this adjustment leads to
      the receiving end requesting a wrong file range from the clone ioctl,
      which results in file content that differs from the content in the original
      send root.
      
      Issue reproducible with the following excerpt from the test I made for
      xfstests:
      
        _scratch_mkfs
        _scratch_mount "-o compress-force=lzo"
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1
      
        $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo
        $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT
        $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo
      
        # will be used for incremental send to be able to issue clone operations
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap
      
        $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2
      
        $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1
        $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \
            -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2
        $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \
            -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2
      
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap
        $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap
        $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \
            -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap
      
        _scratch_unmount
        _scratch_mkfs
        _scratch_mount
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap
        $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap
        $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full
      
        $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap
        $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 · f085381e
      Anand Jain committed
      bdev is null when a disk has disappeared and the filesystem is mounted with
      the degraded option
      
      stack trace
      ---------
      btrfs_sysfs_add_one+0x105/0x1c0 [btrfs]
      open_ctree+0x15f3/0x1fe0 [btrfs]
      btrfs_mount+0x5db/0x790 [btrfs]
      ? alloc_pages_current+0xa4/0x160
      mount_fs+0x34/0x1b0
      vfs_kern_mount+0x62/0xf0
      do_mount+0x22e/0xa80
      ? __get_free_pages+0x9/0x40
      ? copy_mount_options+0x31/0x170
      SyS_mount+0x7e/0xc0
      system_call_fastpath+0x16/0x1b
      ---------
      
      reproducer:
      -------
      mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd
      (detach a disk)
      devmgt detach /dev/sdc [1]
      mount -o degraded /dev/sdd /btrfs
      -------
      
      [1] github.com/anajain/devmgt.git
      Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
      Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 15 Feb, 2014 4 commits
    • Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol · 3a0dfa6a
      Josef Bacik committed
      A user was running into errors from an NFS export of a subvolume that had a
      default subvol set.  When we mount a default subvol we will use d_obtain_alias()
      to find an existing dentry for the subvolume in the case that the root subvol
      has already been mounted, or a dummy one is allocated in the case that the root
      subvol has not already been mounted.  This allows us to connect the dentry later
      on if we wander into the path.  However if we don't ever wander into the path we
      will keep DCACHE_DISCONNECTED set for a long time, which angers NFS.  It doesn't
      appear to cause any problems but it is annoying nonetheless, so simply unset
      DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
      use d_materialise_unique() instead which will make everything play nicely
      together and reconnect stuff if we wander into the default subvol path from a
      different way.  With this patch I'm no longer getting the NFS errors when
      exporting a volume that has been mounted with a default subvol set.  Thanks,
      
      cc: bfields@fieldses.org
      cc: ebiederm@xmission.com
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3a0dfa6a
    • M
      Btrfs: fix max_inline mount option · feb5f965
      Mitch Harder committed
      Currently, the only mount option for max_inline that has any effect is
      max_inline=0.  Any other value that is supplied to max_inline will be
      adjusted to a minimum of 4k.  Since max_inline has an effective maximum
      of ~3900 bytes due to page size limitations, the current behaviour
      only has meaning for max_inline=0.
      
      This patch will allow the max_inline mount option to accept non-zero
      values as indicated in the documentation.
      Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Chris Mason <clm@fb.com>
      feb5f965
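
The max_inline behaviour described in feb5f965 above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: the function names are hypothetical, the "old" function models the behaviour the commit message describes (non-zero values raised to at least one sector), and the cap at one sector in the "new" function is an assumption suggested by the stated effective maximum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the behaviour before the patch: zero is kept,
 * any other value is raised to at least one sector (4k in the message). */
uint64_t max_inline_old(uint64_t requested, uint64_t sectorsize)
{
    assert(sectorsize > 0);
    if (requested == 0)
        return 0;
    return requested < sectorsize ? sectorsize : requested;
}

/* Hypothetical model of the behaviour after the patch: non-zero values
 * are honoured. Capping at one sector is an assumption, not a quote
 * from the patch. */
uint64_t max_inline_new(uint64_t requested, uint64_t sectorsize)
{
    assert(sectorsize > 0);
    return requested > sectorsize ? sectorsize : requested;
}
```

Under this model, a request of max_inline=2048 with a 4096-byte sector becomes 4096 with the old logic and stays 2048 with the new logic, while max_inline=0 is 0 in both.
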
    • L
      Btrfs: fix a lockdep warning when cleaning up aborted transaction · a9d2d4ad
      Liu Bo committed
      Given that we now have 2 spinlocks for the management of delayed refs,
      CONFIG_DEBUG_SPINLOCK=y helped me find this:
      
      [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258
      [ 4723.414882]  lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2
      [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G        W  O 3.12.0+ #4
      [ 4723.421321] Call Trace:
      [ 4723.421872]  [<ffffffff81680fe7>] dump_stack+0x54/0x74
      [ 4723.422753]  [<ffffffff81681093>] spin_dump+0x8c/0x91
      [ 4723.424979]  [<ffffffff816810b9>] spin_bug+0x21/0x26
      [ 4723.425846]  [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90
      [ 4723.434424]  [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40
      [ 4723.438747]  [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs]
      [ 4723.443321]  [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs]
      [ 4723.444692]  [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
      [ 4723.450336]  [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10
      [ 4723.451332]  [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs]
      [ 4723.452543]  [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs]
      [ 4723.457833]  [<ffffffff81079efa>] kthread+0xea/0xf0
      [ 4723.458990]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.460133]  [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
      [ 4723.460865]  [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140
      [ 4723.496521] ------------[ cut here ]------------
      
      ----------------------------------------------------------------------
      
      The reason is that we get to call cond_resched_lock(&head_ref->lock) while
      still holding @delayed_refs->lock.
      
      This differs from __btrfs_run_delayed_refs(), where we do a drop-acquire
      dance before and after actually processing delayed refs.
      
      Here we don't drop the lock, so others are not able to add new delayed refs
      to head_ref, and cond_resched_lock(&head_ref->lock) is not necessary.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a9d2d4ad
    • C
      Revert "btrfs: add ioctl to export size of global metadata reservation" · 11bcac89
      Chris Mason committed
      This reverts commit 01e219e8.
      
      David Sterba found a different way to provide these features without adding a new
      ioctl.  We haven't released any progs with this ioctl yet, so I'm taking this out
      for now until we finalize things.
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      CC: Jeff Mahoney <jeffm@suse.com>
      11bcac89
  4. 09 Feb, 2014 1 commit
    • F
      Btrfs: fix data corruption when reading/updating compressed extents · a2aa75e1
      Filipe David Borba Manana committed
      When using a mix of compressed file extents and prealloc extents, it
      is possible to fill a page of a file with random, garbage data from
      some unrelated previous use of the page, instead of a sequence of zeroes.
      
      A simple sequence of steps to get into such a case, taken from the test
      case I made for xfstests, is:
      
         _scratch_mkfs
         _scratch_mount "-o compress-force=lzo"
         $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar
         $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
      This results in the following file items in the fs tree:
      
         item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
             inode generation 6 transid 6 size 542872 block group 0 mode 100600
         item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
             inode ref index 2 namelen 6 name: foobar
         item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
             extent data disk byte 0 nr 0 gen 6
             extent data offset 0 nr 24576 ram 266240
             extent compression 0
         item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
             prealloc data disk byte 12849152 nr 241664 gen 6
             prealloc data offset 0 nr 241664
         item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
             extent data disk byte 12845056 nr 4096 gen 6
             extent data offset 0 nr 20480 ram 20480
             extent compression 2
         item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
             prealloc data disk byte 13090816 nr 405504 gen 6
             prealloc data offset 0 nr 258048
      
      The on disk extent at offset 266240 (which corresponds to 1 single disk block),
      contains 5 compressed chunks of file data. Each of the first 4 compress 4096
      bytes of file data, while the last one only compresses 3024 bytes of file data.
      Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
      1072 bytes) should always return zeroes (our next extent is a prealloc one).
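
The range arithmetic above can be checked with a short userspace C sketch. The struct and function names are hypothetical and exist only for this illustration; the inputs are the numbers quoted in the message (extent start 266240, 4096-byte pages, 4 full chunks, 3024 bytes in the last chunk).

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end) within the file. */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical helper: returns the part of the last page that the
 * decompressed data does not cover, and which must therefore read
 * back as zeroes. */
struct byte_range compressed_tail_gap(uint64_t extent_start,
                                      uint64_t page_size,
                                      uint64_t full_chunks,
                                      uint64_t last_chunk_bytes)
{
    struct byte_range r;

    assert(last_chunk_bytes <= page_size);
    /* First byte after the data produced by the last chunk. */
    r.start = extent_start + full_chunks * page_size + last_chunk_bytes;
    /* End of the page that holds the last chunk. */
    r.end = extent_start + (full_chunks + 1) * page_size;
    return r;
}
```

Calling compressed_tail_gap(266240, 4096, 4, 3024) gives start 285648 and end 286720, a gap of 1072 bytes, matching the region named above.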
      
      The solution here is for the compression code path to zero the remaining (untouched)
      bytes of the last page it uncompressed data into, as the information about how
      much space the file data consumes in the last page is not known in the upper layer
      fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
      the remainder of the page but only if it corresponds to the last page of the inode
      and if the inode's size is not a multiple of the page size.
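
The idea of zeroing the untouched remainder of the last page can be sketched as follows. This is a userspace illustration only, not the kernel change itself: zero_page_tail() and PAGE_BYTES are hypothetical names, and the page is modelled as a plain byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096 /* assumed page size for this sketch */

/* Hypothetical helper: after decompression has written bytes_filled
 * bytes into the start of a page, clear the rest so that stale data
 * from a previous use of the page cannot leak into the file. */
void zero_page_tail(unsigned char *page, size_t bytes_filled)
{
    assert(bytes_filled <= PAGE_BYTES);
    memset(page + bytes_filled, 0, PAGE_BYTES - bytes_filled);
}
```

With the example above, bytes_filled would be 3024, so the final 1072 bytes of the page are cleared while the decompressed data is left untouched.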
      
      This would not only return random data on reads, but also permanently
      store random data when updating parts of the region that should be zeroed.
      For the example above, it means updating a single byte in the region [285648 ; 286720[
      would store that byte correctly but also store random data on disk.
      
      A test case for xfstests follows soon.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      a2aa75e1