1. 23 2月, 2016 2 次提交
  2. 20 1月, 2016 2 次提交
    • Z
      btrfs: merge functions for wait snapshot creation · 0bc19f90
      Zhao Lei 提交于
      wait_for_snapshot_creation() is in same group with oher two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()
      
      Rename wait_for_snapshot_creation() and move it into same place
      with other two.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0bc19f90
    • F
      Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Filipe Manana 提交于
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because if a task acquires multiple times a
      semaphore it should invoke down_read_nested() with a different lockdep
      class for each level of recursion.
      
      Fix this by simplifying the implementation and use a mutex instead that
      is acquired by the cleaner kthread before it runs the delayed iputs
      instead of always acquiring a semaphore before delayed references are
      run from anywhere.
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c2d6cb16
  3. 07 1月, 2016 7 次提交
  4. 01 1月, 2016 1 次提交
  5. 30 12月, 2015 1 次提交
  6. 18 12月, 2015 5 次提交
    • O
      Btrfs: add free space tree mount option · 70f6d82e
      Omar Sandoval 提交于
      Now we can finally hook up everything so we can actually use free space
      tree. The free space tree is enabled by passing the space_cache=v2 mount
      option. On the first mount with the this option set, the free space tree
      will be created and the FREE_SPACE_TREE read-only compat bit will be
      set. Any time the filesystem is mounted from then on, we must use the
      free space tree. The clear_cache option will also clear the free space
      tree.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      70f6d82e
    • O
      Btrfs: implement the free space B-tree · a5ed9182
      Omar Sandoval 提交于
      The free space cache has turned out to be a scalability bottleneck on
      large, busy filesystems. When the cache for a lot of block groups needs
      to be written out, we can get extremely long commit times; if this
      happens in the critical section, things are especially bad because we
      block new transactions from happening.
      
      The main problem with the free space cache is that it has to be written
      out in its entirety and is managed in an ad hoc fashion. Using a B-tree
      to store free space fixes this: updates can be done as needed and we get
      all of the benefits of using a B-tree: checksumming, RAID handling,
      well-understood behavior.
      
      With the free space tree, we get commit times that are about the same as
      the no cache case with load times slower than the free space cache case
      but still much faster than the no cache case. Free space is represented
      with extents until it becomes more space-efficient to use bitmaps,
      giving us similar space overhead to the free space cache.
      
      The operations on the free space tree are: adding and removing free
      space, handling the creation and deletion of block groups, and loading
      the free space for a block group. We can also create the free space tree
      by walking the extent tree and clear the free space tree.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a5ed9182
    • O
      Btrfs: introduce the free space B-tree on-disk format · 208acb8c
      Omar Sandoval 提交于
      The on-disk format for the free space tree is straightforward. Each
      block group is represented in the free space tree by a free space info
      item that stores accounting information: whether the free space for this
      block group is stored as bitmaps or extents and how many extents of free
      space exist for this block group (regardless of which format is being
      used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
      with no corresponding item, and bitmaps instead have the
      FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
      array of bytes.
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      208acb8c
    • O
      Btrfs: refactor caching_thread() · 73fa48b6
      Omar Sandoval 提交于
      We're also going to load the free space tree from caching_thread(), so
      we should refactor some of the common code.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      73fa48b6
    • O
      Btrfs: add helpers for read-only compat bits · 1abfbcdf
      Omar Sandoval 提交于
      We're finally going to add one of these for the free space tree, so
      let's add the same nice helpers that we have for the incompat bits.
      While we're add it, also add helpers to clear the bits.
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1abfbcdf
  7. 08 12月, 2015 1 次提交
    • C
      vfs: pull btrfs clone API to vfs layer · 04b38d60
      Christoph Hellwig 提交于
      The btrfs clone ioctls are now adopted by other file systems, with NFS
      and CIFS already having support for them, and XFS being under active
      development.  To avoid growth of various slightly incompatible
      implementations, add one to the VFS.  Note that clones are different from
      file copies in several ways:
      
       - they are atomic vs other writers
       - they support whole file clones
       - they support 64-bit legth clones
       - they do not allow partial success (aka short writes)
       - clones are expected to be a fast metadata operation
      
      Because of that it would be rather cumbersome to try to piggyback them on
      top of the recent clone_file_range infrastructure.  The converse isn't
      true and the clone_file_range system call could try clone file range as
      a first attempt to copy, something that further patches will enable.
      
      Based on earlier work from Peng Tao.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      04b38d60
  8. 03 12月, 2015 2 次提交
  9. 02 12月, 2015 1 次提交
  10. 25 11月, 2015 3 次提交
    • F
      Btrfs: fix scrub preventing unused block groups from being deleted · 758f2dfc
      Filipe Manana 提交于
      Currently scrub can race with the cleaner kthread when the later attempts
      to delete an unused block group, and the result is preventing the cleaner
      kthread from ever deleting later the block group - unless the block group
      becomes used and unused again. The following diagram illustrates that
      race:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs and
           removes it from that list
      
                                                   scrub_enumerate_chunks()
      
                                                     searches device tree using
                                                     its commit root
      
                                                     finds device extent for
                                                     block group X
      
                                                     gets block group X from the tree
                                                     fs_info->block_group_cache_tree
                                                     (via btrfs_lookup_block_group())
      
                                                     sets bg X to RO
      
           sees the block group is
           already RO and therefore
           doesn't delete it nor adds
           it back to unused list
      
      So fix this by making scrub add the block group again to the list of
      unused block groups if the block group is still unused when it finished
      scrubbing it and it hasn't been removed already.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      758f2dfc
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana 提交于
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible heigth (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However after adding the orphan item, writepages() can by called by the VM
      subsystem against the btree inode when we are under memory pressure, which
      causes writeback to start for the nodes we COWed before, this forces the
      operation to remove the free space item to COW again some (or all of) the
      same nodes (in the tree of tree roots). Even without writepages() being
      called, we could fail with ENOSPC because these items are located in
      multiple trees and one of them might have a higher heigth and require
      node/leaf splits at many levels, exhausting all the reserved space before
      removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7fd01182
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  11. 07 11月, 2015 1 次提交
  12. 27 10月, 2015 4 次提交
  13. 26 10月, 2015 1 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
  14. 22 10月, 2015 8 次提交
  15. 08 10月, 2015 1 次提交