1. 30 Dec, 2015 1 commit
  2. 24 Dec, 2015 1 commit
  3. 22 Dec, 2015 1 commit
    • Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups · e44081ef
      Committed by Filipe Manana
      We call btrfs_write_dirty_block_groups() in the critical section of a
      transaction's commit, when no other tasks can join the transaction and
      add more block groups to the transaction's list of dirty block groups,
      so we are not taking the dirty block groups spinlock when checking for the
      list's emptiness, grabbing its first element or deleting elements from
      it.
      
      However there's a special and rare case where we can have a concurrent
      task adding elements to this list. We trigger writeback for space
      caches earlier, at btrfs_start_dirty_block_groups() and in past iterations
      of the loop at btrfs_write_dirty_block_groups(). This means that when
      the writeback finishes (which happens asynchronously) it creates a
      task for the endio free space work queue that executes
      btrfs_finish_ordered_io() - this function is able to join the transaction,
      through btrfs_join_transaction_nolock(), and update the free space cache's
      inode item in the root tree, which can result in COWing nodes of this tree
      and therefore allocation of a new block group can happen, which gets added
      to the transaction's list of dirty block groups while the transaction
      commit task is operating on it concurrently.
      
      So fix this by taking the dirty block groups spinlock before doing
      operations on the dirty block groups list at
      btrfs_write_dirty_block_groups().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
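The fix for btrfs_write_dirty_block_groups() described above comes down to doing the emptiness check, the fetch of the first element and the deletion under the same lock that concurrent adders take. The following is only a user-space sketch of that pattern, not the kernel code: the spinlock is a C11 atomic_flag stand-in, and every name ending in _sketch, as well as the dirty_list_* helpers, is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a kernel spinlock. */
typedef struct { atomic_flag locked; } spinlock_sketch;

void sketch_lock(spinlock_sketch *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

void sketch_unlock(spinlock_sketch *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical block group node and dirty list (not the btrfs structures). */
struct bg_sketch { int id; struct bg_sketch *next; };

struct dirty_list_sketch {
    spinlock_sketch lock;
    struct bg_sketch *head, *tail;
};

void dirty_list_init(struct dirty_list_sketch *d)
{
    atomic_flag_clear(&d->lock.locked);
    d->head = d->tail = NULL;
}

/* What a concurrent task (e.g. the endio worker) would do: append under the lock. */
void dirty_list_add(struct dirty_list_sketch *d, struct bg_sketch *bg)
{
    sketch_lock(&d->lock);
    bg->next = NULL;
    if (d->tail)
        d->tail->next = bg;
    else
        d->head = bg;
    d->tail = bg;
    sketch_unlock(&d->lock);
}

/* Emptiness check, first-element fetch and deletion all happen under the lock. */
struct bg_sketch *dirty_list_pop_first(struct dirty_list_sketch *d)
{
    struct bg_sketch *bg;

    sketch_lock(&d->lock);
    bg = d->head;
    if (bg) {
        d->head = bg->next;
        if (!d->head)
            d->tail = NULL;
        bg->next = NULL;
    }
    sketch_unlock(&d->lock);
    return bg; /* NULL means the list was empty */
}
```

A commit-side loop would then call dirty_list_pop_first() until it returns NULL, and do the per-block-group work outside the lock.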
  4. 18 Dec, 2015 10 commits
    • Btrfs: fix locking bugs when defragging leaves · 0376374a
      Committed by Filipe Manana
      When running fstests btrfs/070 with a higher number of fsstress
      operations, I frequently ran into two different locking bugs when
      defragging directories.
      
      The first bug produced the following traces:
      
      [133860.229792] ------------[ cut here ]------------
      [133860.251062] WARNING: CPU: 2 PID: 26057 at fs/btrfs/locking.c:46 btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]()
      [133860.253576] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport
      [133860.282566] CPU: 2 PID: 26057 Comm: btrfs Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [133860.284393] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [133860.286827]  0000000000000000 ffff880207697b78 ffffffff812566f4 0000000000000000
      [133860.288341]  ffff880207697bb0 ffffffff8104d0a6 ffffffffa052d4c1 ffff880178f60e00
      [133860.294219]  ffff880178f60e00 0000000000000000 00000000000000f6 ffff880207697bc0
      [133860.295831] Call Trace:
      [133860.306518]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [133860.307473]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [133860.308619]  [<ffffffffa052d4c1>] ? btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]
      [133860.310068]  [<ffffffff8104d172>] warn_slowpath_null+0x1a/0x1c
      [133860.312552]  [<ffffffffa052d4c1>] btrfs_set_lock_blocking_rw+0x57/0xbd [btrfs]
      [133860.314630]  [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs]
      [133860.323596]  [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs]
      [133860.325233]  [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs]
      [133860.332427]  [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs]
      [133860.337259]  [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs]
      [133860.340147]  [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs]
      [133860.344833]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [133860.346343]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [133860.353248]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [133860.354242]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
      [133860.355232]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
      [133860.356237]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
      [133860.358587]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [133860.360195]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [133860.361380]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [133860.363578]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [133860.366217] ---[ end trace 2cadb2f653437e49 ]---
      [133860.367399] ------------[ cut here ]------------
      [133860.368162] kernel BUG at fs/btrfs/locking.c:307!
      [133860.369430] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [133860.370205] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 psmouse parport
      [133860.370205] CPU: 2 PID: 26057 Comm: btrfs Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [133860.370205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [133860.370205] task: ffff8800aec6db40 ti: ffff880207694000 task.ti: ffff880207694000
      [133860.370205] RIP: 0010:[<ffffffffa052d466>]  [<ffffffffa052d466>] btrfs_assert_tree_locked+0x10/0x14 [btrfs]
      [133860.370205] RSP: 0018:ffff880207697bc0  EFLAGS: 00010246
      [133860.370205] RAX: 0000000000000000 RBX: ffff880178f60e00 RCX: 0000000000000000
      [133860.370205] RDX: ffff88023ec4fb50 RSI: 00000000ffffffff RDI: ffff880178f60e00
      [133860.370205] RBP: ffff880207697bc0 R08: 0000000000000001 R09: 0000000000000000
      [133860.370205] R10: 0000160000000000 R11: ffffffff81651000 R12: ffff880178f60e00
      [133860.370205] R13: 0000000000000000 R14: 00000000000000f6 R15: ffff8801ff409000
      [133860.370205] FS:  00007f763efd48c0(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [133860.370205] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [133860.370205] CR2: 0000000002158048 CR3: 000000003fd6c000 CR4: 00000000000006e0
      [133860.370205] Stack:
      [133860.370205]  ffff880207697bd8 ffffffffa052d4d0 0000000000000000 ffff880207697be8
      [133860.370205]  ffffffffa04d5787 ffff880207697c80 ffffffffa04d99cb ffff8801ff409590
      [133860.370205]  ffff880207697ca8 000000f507697c80 ffff880183c11bb8 0000000000000000
      [133860.370205] Call Trace:
      [133860.370205]  [<ffffffffa052d4d0>] btrfs_set_lock_blocking_rw+0x66/0xbd [btrfs]
      [133860.370205]  [<ffffffffa04d5787>] btrfs_set_lock_blocking+0xe/0x10 [btrfs]
      [133860.370205]  [<ffffffffa04d99cb>] btrfs_realloc_node+0xb3/0x341 [btrfs]
      [133860.370205]  [<ffffffffa050e396>] btrfs_defrag_leaves+0x239/0x2fa [btrfs]
      [133860.370205]  [<ffffffffa04fc2ce>] btrfs_defrag_root+0x63/0xca [btrfs]
      [133860.370205]  [<ffffffffa052a34e>] btrfs_ioctl_defrag+0x78/0x14e [btrfs]
      [133860.370205]  [<ffffffffa052b00b>] btrfs_ioctl+0x746/0x24c6 [btrfs]
      [133860.370205]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [133860.370205]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [133860.370205]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [133860.370205]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
      [133860.370205]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
      [133860.370205]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
      [133860.370205]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [133860.370205]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [133860.370205]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [133860.370205]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This bug happened because we assumed that by setting keep_locks to 1 in
      our search path, our path after a call to btrfs_search_slot() would have
      all nodes locked, which is not always true because unlock_up() (called by
      btrfs_search_slot()) will unlock a node in a path if the slot of the node
      below it doesn't point to the last item or beyond the last item. For
      example, when the tree has a height of 2 and path->slots[0] has a value
      smaller than btrfs_header_nritems(path->nodes[0]) - 1, the node at level 2
      will be unlocked (also because lowest_unlock is set to 1 due to the fact
      that the value passed as ins_len to btrfs_search_slot is 0).
      This resulted in btrfs_find_next_key(), called before btrfs_realloc_node(),
      releasing our path and calling btrfs_search_slot() again, but this time with
      the cow parameter set to 0, meaning the resulting path got only read locks.
      Therefore when we called btrfs_realloc_node(), with path->nodes[1] having
      a read lock, it resulted in the warning and BUG_ON when calling
      btrfs_set_lock_blocking() against the node, as that function expects the
      node to have a write lock.
      
      The second bug happened often when the first bug didn't happen, and made
      us hang and hit the following warning at fs/btrfs/locking.c:
      
         251  void btrfs_tree_lock(struct extent_buffer *eb)
         252  {
         253          WARN_ON(eb->lock_owner == current->pid);
      
      This happened because the tree search we made at btrfs_defrag_leaves()
      before calling btrfs_find_next_key() locked a leaf and all the other
      nodes in the path, so btrfs_find_next_key() had no need to release the
      path and make a new search (with path->lowest_level set to 1). This
      made btrfs_realloc_node() attempt to write lock the same leaf again,
      resulting in a hang/deadlock.
      
      So fix these issues by calling btrfs_find_next_key() after calling
      btrfs_realloc_node() and setting the search path's lowest_level to 1
      to avoid the hang/deadlock when attempting to write lock the leaves
      at btrfs_realloc_node().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
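The second bug above is a task trying to write lock a leaf it already holds, which is the condition that WARN_ON(eb->lock_owner == current->pid) reports. The sketch below is a deliberately simplified model that returns a result code instead of blocking. The structure, the enum and the function names are all hypothetical and are not the btrfs locking API.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for an extent buffer's write lock. */
struct eb_lock_sketch {
    int held;       /* 1 while write locked */
    int lock_owner; /* id of the task holding the lock */
};

enum lock_result_sketch {
    LOCK_ACQUIRED = 0,
    LOCK_WOULD_BLOCK = 1,   /* held by another task: a real lock would wait */
    LOCK_SELF_DEADLOCK = 2  /* held by the caller: waiting would never end */
};

enum lock_result_sketch eb_write_lock_sketch(struct eb_lock_sketch *eb, int task_id)
{
    if (eb->held && eb->lock_owner == task_id)
        return LOCK_SELF_DEADLOCK; /* the case the WARN_ON flags */
    if (eb->held)
        return LOCK_WOULD_BLOCK;
    eb->held = 1;
    eb->lock_owner = task_id;
    return LOCK_ACQUIRED;
}

void eb_write_unlock_sketch(struct eb_lock_sketch *eb)
{
    eb->held = 0;
    eb->lock_owner = 0;
}
```

In this model, a search that already locked the leaf followed by a second lock attempt from the same task yields LOCK_SELF_DEADLOCK, which is why the fix keeps the search at level 1 so the leaves are not locked before btrfs_realloc_node() locks them.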
    • Btrfs: add free space tree mount option · 70f6d82e
      Committed by Omar Sandoval
      Now we can finally hook up everything so we can actually use free space
      tree. The free space tree is enabled by passing the space_cache=v2 mount
      option. On the first mount with this option set, the free space tree
      will be created and the FREE_SPACE_TREE read-only compat bit will be
      set. Any time the filesystem is mounted from then on, we must use the
      free space tree. The clear_cache option will also clear the free space
      tree.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: wire up the free space tree to the extent tree · 1e144fb8
      Committed by Omar Sandoval
      The free space tree is updated in tandem with the extent tree. There are
      only a handful of places where we need to hook in:
      
      1. Block group creation
      2. Block group deletion
      3. Delayed refs (extent creation and deletion)
      4. Block group caching
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add free space tree sanity tests · 7c55ee0c
      Committed by Omar Sandoval
      This tests the operations on the free space tree, trying to exercise all
      of the main cases for both formats. Between this and xfstests, the free
      space tree should have pretty good coverage.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: implement the free space B-tree · a5ed9182
      Committed by Omar Sandoval
      The free space cache has turned out to be a scalability bottleneck on
      large, busy filesystems. When the cache for a lot of block groups needs
      to be written out, we can get extremely long commit times; if this
      happens in the critical section, things are especially bad because we
      block new transactions from happening.
      
      The main problem with the free space cache is that it has to be written
      out in its entirety and is managed in an ad hoc fashion. Using a B-tree
      to store free space fixes this: updates can be done as needed and we get
      all of the benefits of using a B-tree: checksumming, RAID handling,
      well-understood behavior.
      
      With the free space tree, we get commit times that are about the same as
      the no cache case with load times slower than the free space cache case
      but still much faster than the no cache case. Free space is represented
      with extents until it becomes more space-efficient to use bitmaps,
      giving us similar space overhead to the free space cache.
      
      The operations on the free space tree are: adding and removing free
      space, handling the creation and deletion of block groups, and loading
      the free space for a block group. We can also create the free space tree
      by walking the extent tree and clear the free space tree.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: introduce the free space B-tree on-disk format · 208acb8c
      Committed by Omar Sandoval
      The on-disk format for the free space tree is straightforward. Each
      block group is represented in the free space tree by a free space info
      item that stores accounting information: whether the free space for this
      block group is stored as bitmaps or extents and how many extents of free
      space exist for this block group (regardless of which format is being
      used in the tree). Extents are (start, FREE_SPACE_EXTENT, length) keys
      with no corresponding item, and bitmaps instead have the
      FREE_SPACE_BITMAP type and have a bitmap item attached, which is just an
      array of bytes.
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
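The on-disk format described above can be pictured as a three-field key in which a free extent is fully described by the key itself: (start, FREE_SPACE_EXTENT, length), with no item payload. The sketch below assumes a simplified in-memory key. The struct layout, the helper names and the numeric key type values are illustrative assumptions and should not be read as the authoritative btrfs definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Key types for the free space tree; the numeric values are illustrative. */
enum fst_key_type_sketch {
    FST_INFO_SKETCH   = 198, /* per-block-group accounting item */
    FST_EXTENT_SKETCH = 199, /* free extent: all data lives in the key */
    FST_BITMAP_SKETCH = 200  /* bitmap: key plus an array of bytes */
};

/* Hypothetical simplified key: (objectid, type, offset). */
struct fst_key_sketch {
    uint64_t objectid; /* start of the range */
    uint8_t  type;
    uint64_t offset;   /* length of the range */
};

/* Accounting stored in the free space info item, per the commit message. */
struct fst_info_sketch {
    uint32_t extent_count; /* number of free extents, in either format */
    int      uses_bitmaps; /* nonzero if stored as bitmaps */
};

/* Build the key for a free extent; no item body is needed. */
struct fst_key_sketch fst_extent_key(uint64_t start, uint64_t length)
{
    struct fst_key_sketch k = { start, FST_EXTENT_SKETCH, length };
    return k;
}

/* Returns 1 if bytenr falls inside the free extent described by the key. */
int fst_extent_contains(const struct fst_key_sketch *k, uint64_t bytenr)
{
    return k->type == FST_EXTENT_SKETCH &&
           bytenr >= k->objectid &&
           bytenr - k->objectid < k->offset;
}
```

A bitmap entry would use the same key shape with the bitmap type and carry its bits in the attached item.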
    • Btrfs: refactor caching_thread() · 73fa48b6
      Committed by Omar Sandoval
      We're also going to load the free space tree from caching_thread(), so
      we should refactor some of the common code.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add helpers for read-only compat bits · 1abfbcdf
      Committed by Omar Sandoval
      We're finally going to add one of these for the free space tree, so
      let's add the same nice helpers that we have for the incompat bits.
      While we're at it, also add helpers to clear the bits.
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add extent buffer bitmap sanity tests · 0f331229
      Committed by Omar Sandoval
      Sanity test the extent buffer bitmap operations (test, set, and clear)
      against the equivalent standard kernel operations.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add extent buffer bitmap operations · 3e1e8bb7
      Committed by Omar Sandoval
      These are going to be used for the free space tree bitmap items.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
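The two bitmap commits above concern test, set and clear operations on a bitmap stored as an array of bytes. As a rough illustration of what such operations look like on a plain byte array, here is a minimal sketch. It does not touch extent buffers at all, the function names are hypothetical, and the bit ordering (bit 0 is the least significant bit of byte 0) is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if bit nr is set, 0 otherwise. */
int bitmap_test_sketch(const uint8_t *map, size_t nr)
{
    return (map[nr / 8] >> (nr % 8)) & 1;
}

/* Sets len bits starting at bit start. */
void bitmap_set_sketch(uint8_t *map, size_t start, size_t len)
{
    for (size_t i = start; i < start + len; i++)
        map[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Clears len bits starting at bit start. */
void bitmap_clear_sketch(uint8_t *map, size_t start, size_t len)
{
    for (size_t i = start; i < start + len; i++)
        map[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```

A sanity test in the spirit of the commit above would run the same sequence of set and clear calls through two implementations and compare the resulting bytes.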
  5. 17 Dec, 2015 5 commits
    • Btrfs: fix leaking of ordered extents after direct IO write error · f28a4928
      Committed by Filipe Manana
      When doing a direct IO write, __blockdev_direct_IO() can call the
      btrfs_get_blocks_direct() callback one or more times before it calls the
      btrfs_submit_direct() callback. However it can fail after calling the
      first callback and before calling the second callback, which is a problem
      because the first one creates ordered extents and the second one is the
      one that submits bios that cover the ordered extents created by the first
      one. That means the ordered extents will never complete nor have any of
      the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set. As a result,
      subsequent operations (such as other direct IO writes, buffered writes or
      hole punching) that lock the same IO range and look up ordered extents
      in the range hang forever waiting for those ordered extents, which can
      never complete since no bio was submitted.
      
      Fix this by tracking the range of created ordered extents that do not
      yet have corresponding bios submitted, and completing the ordered extents
      in that range if __blockdev_direct_IO() fails with an error.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
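The fix above tracks the range of ordered extents that were created but not yet covered by a submitted bio. One way to picture the bookkeeping is a pair of offsets: the "create" step pushes the end forward, the "submit" step pushes the start forward, and whatever remains between them on failure must be completed by hand. The sketch below assumes that in-order model. The struct and function names are hypothetical and are not the kernel's data structures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tracker: [unsubmitted_start, unsubmitted_end) has ordered
 * extents created but no bio submitted yet. */
struct dio_track_sketch {
    uint64_t unsubmitted_start;
    uint64_t unsubmitted_end;
};

void dio_track_init(struct dio_track_sketch *t, uint64_t file_offset)
{
    t->unsubmitted_start = file_offset;
    t->unsubmitted_end = file_offset;
}

/* Called by the "get blocks" step after creating an ordered extent. */
void dio_track_created(struct dio_track_sketch *t, uint64_t len)
{
    t->unsubmitted_end += len;
}

/* Called by the "submit" step once bios cover len more bytes. */
void dio_track_submitted(struct dio_track_sketch *t, uint64_t len)
{
    t->unsubmitted_start += len;
}

/* On failure: returns how many bytes of ordered extents still need to be
 * completed manually, and stores where that range begins in *start. */
uint64_t dio_track_pending(const struct dio_track_sketch *t, uint64_t *start)
{
    *start = t->unsubmitted_start;
    return t->unsubmitted_end - t->unsubmitted_start;
}
```

If dio_track_pending() returns a nonzero length after the direct IO call fails, the error path would complete the ordered extents in that range so later lockers of the range do not wait forever.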
    • Btrfs: fix deadlock between direct IO write and defrag/readpages · b850ae14
      Committed by Filipe Manana
      If readpages() (triggered by defrag or buffered reads) is called while a
      direct IO write is in progress, we have a small time window where we can
      deadlock, resulting in traces like the following being generated:
      
      [84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
      [84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
      [84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
      [84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
      [84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
      [84723.232085] Call Trace:
      [84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
      [84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
      [84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
      [84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
      [84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
      [84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
      [84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
      [84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
      [84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
      [84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
      [84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
      [84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
      [84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
      [84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
      [84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
      [84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
      [84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
      [84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
      [84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
      [84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
      [84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
      [84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
      [84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
      [84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
      [84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
      [84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
      [84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
      [84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
      [84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [84723.279595] INFO: lockdep is turned off.
      [84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
      [84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
      [84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
      [84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
      [84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
      [84723.388620] Call Trace:
      [84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
      [84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
      [84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
      [84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
      [84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
      [84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
      [84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
      [84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
      [84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
      [84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
      [84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
      [84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
      [84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
      [84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
      [84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
      [84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
      [84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
      [84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
      [84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      Consider the following logical and physical file layout:
      
      logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                      4K                    8K                    16K
      
      physical:   ... 12853248              12857344              1103101952   ...
                                            (= 12853248 + 4K)
      
      Extents A and B are physically adjacent. The following diagram shows a
      sequence of events that leads to the deadlock when we attempt to do a
      direct IO write against the file range [4K, 16K[ and a defrag is triggered
      simultaneously.
      
                 CPU 1                                               CPU 2
      
       btrfs_direct_IO()
      
         btrfs_get_blocks_direct()
           creates ordered extent A, covering
           the 4k prealloc extent A (range [4K, 8K[)
      
                                                          btrfs_defrag_file()
                                                            page_cache_sync_readahead([0K, 1M[)
                                                              btrfs_readpages()
                                                                extent_readpages()
      
                                                                  locks all pages in the file
                                                                  range [0K, 128K[ through calls
                                                                  to add_to_page_cache_lru()
      
                                                                  __do_contiguous_readpages()
      
                                                                     finds ordered extent A
      
                                                                     waits for it to complete
      
         btrfs_get_blocks_direct() called again
      
           lock_extent_direct(range [8K, 16K[)
      
             finds a page in range [8K, 16K[ through
             btrfs_page_exists_in_range()
      
             invalidate_inode_pages2_range([8K, 16K[)
      
               --> tries to lock pages that are already
                   locked by the task at CPU 2
      
               --> our task, running __blockdev_direct_IO(),
                   hangs waiting to lock the pages and the
                   submit bio callback, btrfs_submit_direct(),
                   ends up never being called, resulting in the
                   ordered extent A never completing (because a
                   corresponding bio is never submitted) and
                   CPU 2 will wait for it forever while holding
                   the pages locked
                    ---> deadlock!
      
      Fix this by removing the page invalidation approach when attempting to
      lock the range for IO from the callback btrfs_get_blocks_direct() and
      falling back to buffered IO instead. This was a rare case anyway, and
      well-behaved applications do not mix concurrent direct IO writes with
      buffered reads; a concurrent defrag is the only normal case that could
      lead to the deadlock.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix error path when failing to submit bio for direct IO write · 14543774
      Committed by Filipe Manana
      Commit 61de718f ("Btrfs: fix memory corruption on failure to submit
      bio for direct IO") fixed problems with the error handling code after we
      fail to submit a bio for direct IO. However there were 2 problems that it
      did not address when the failure is due to memory allocation failures for
      direct IO writes:
      
      1) We considered that there could be only one ordered extent for the whole
         IO range, which is not always true, as we can have multiple;
      
      2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
         which can make other tasks running btrfs_wait_logged_extents() hang
         forever, since they wait for that bit to be set. The general assumption
         is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
         and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
      
      Fix these issues by moving part of the btrfs_endio_direct_write() handler
      into a new helper function and having that new helper function called when
      we fail to allocate memory to submit the bio (and its private object) for
      a direct IO write.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
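The two problems listed above suggest an error path that walks every ordered extent overlapping the failed IO range and always sets the IO-done bit before the complete bit, with the error bit added on failure. The sketch below models only that ordering and the "more than one ordered extent" case. The flag values, the struct and the helper name are hypothetical and simplified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for the BTRFS_ORDERED_* bits. */
enum {
    ORDERED_IO_DONE_SKETCH  = 1u << 0,
    ORDERED_IOERR_SKETCH    = 1u << 1,
    ORDERED_COMPLETE_SKETCH = 1u << 2
};

struct ordered_extent_sketch {
    uint64_t start;
    uint64_t len;
    unsigned flags;
};

/* Finish every ordered extent overlapping [start, start + len).
 * Returns how many ordered extents were finished. */
size_t finish_ordered_range_sketch(struct ordered_extent_sketch *oe, size_t n,
                                   uint64_t start, uint64_t len, int uptodate)
{
    size_t finished = 0;

    for (size_t i = 0; i < n; i++) {
        if (oe[i].start >= start + len || oe[i].start + oe[i].len <= start)
            continue; /* no overlap with the failed range */
        if (!uptodate)
            oe[i].flags |= ORDERED_IOERR_SKETCH;
        oe[i].flags |= ORDERED_IO_DONE_SKETCH;  /* always set, even on error */
        oe[i].flags |= ORDERED_COMPLETE_SKETCH; /* only after IO_DONE */
        finished++;
    }
    return finished;
}
```

Sharing one helper between the normal end-io handler and the allocation-failure path is what keeps the two paths from diverging on which bits get set.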
    • Btrfs: fix memory leaks after transaction is aborted · 7785a663
      Committed by Filipe Manana
      When a transaction is aborted, or its commit fails before writing the new
      superblock and calling btrfs_finish_extent_commit(), we leak reference
      counts on the block groups attached to the transaction's delete_bgs list,
      because btrfs_finish_extent_commit() is never called for those two cases.
      Fix this by dropping their references at btrfs_put_transaction(), which
      is called when transactions are aborted (by making the transaction kthread
      commit the transaction) or if their commits fail.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
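The leak fix above moves the reference drop for block groups on the delete_bgs list to the point where the transaction itself is released, so it happens on every path. The sketch below models that idea with plain counters. It assumes the drop happens when the last user of the transaction goes away, and all names are hypothetical rather than the btrfs structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block group with a reference count, linked on delete_bgs. */
struct bg_ref_sketch {
    int refs;
    struct bg_ref_sketch *next_deleted;
};

/* Hypothetical transaction handle. */
struct trans_sketch {
    int use_count;
    struct bg_ref_sketch *delete_bgs;
};

void bg_put_sketch(struct bg_ref_sketch *bg)
{
    bg->refs--;
}

/* Drops one use of the transaction. When the last use goes away, any block
 * groups still on delete_bgs have their list reference dropped here, so an
 * aborted or failed commit cannot leak them. Returns 1 if the transaction
 * was released, 0 otherwise. */
int put_transaction_sketch(struct trans_sketch *t)
{
    if (--t->use_count > 0)
        return 0;
    while (t->delete_bgs) {
        struct bg_ref_sketch *bg = t->delete_bgs;

        t->delete_bgs = bg->next_deleted;
        bg->next_deleted = NULL;
        bg_put_sketch(bg);
    }
    return 1;
}
```

On the successful commit path the list would already be empty by the time the last reference is dropped, so the loop does nothing there.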
    • Btrfs: fix race when finishing dev replace leading to transaction abort · 50460e37
      Committed by Filipe Manana
      During the final phase of a device replace operation, I ran into a
      transaction abort that resulted in the following trace:
      
      [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
      [23919.664742] BTRFS: Transaction aborted (error -2)
      [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
      [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
      [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
      [23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
      [23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
      [23919.696716] Call Trace:
      [23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
      [23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
      [23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
      [23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
      [23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
      [23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
      [23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
      [23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
      [23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
      [23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [23919.733330] ---[ end trace 166ef301a335832a ]---
      
      This is due to a race between device replace and chunk allocation, which
      the following diagram illustrates:
      
               CPU 1                                    CPU 2
      
       btrfs_dev_replace_finishing()
      
         at this point
          dev_replace->tgtdev->devid ==
          BTRFS_DEV_REPLACE_DEVID (0ULL)
      
         ...
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
                                                     btrfs_fallocate()
                                                       btrfs_alloc_data_chunk_ondemand()
                                                         btrfs_join_transaction()
                                                           --> starts a new transaction
                                                         do_chunk_alloc()
                                                           lock fs_info->chunk_mutex
                                                             btrfs_alloc_chunk()
                                                               --> creates extent map for
                                                                   the new chunk with
                                                                   em->bdev->map->stripes[i]->dev->devid
                                                                   == X (X > 0)
                                                               --> extent map is added to
                                                                   fs_info->mapping_tree
                                                               --> initial phase of bg A
                                                                   allocation completes
                                                           unlock fs_info->chunk_mutex
      
         lock fs_info->chunk_mutex
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates fs_info->mapping_tree and
               replaces the device in every extent
               map's map->stripes[] with
               dev_replace->tgtdev, which still has
               an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
      
                                                         btrfs_end_transaction()
                                                           btrfs_create_pending_block_groups()
                                                             --> starts final phase of
                                                                 bg A creation (update device,
                                                                 extent, and chunk trees, etc)
                                                             btrfs_finish_chunk_alloc()
      
                                                               btrfs_update_device()
                                                                 --> attempts to update a device
                                                                     item with ID == 0ULL
                                                                     (BTRFS_DEV_REPLACE_DEVID)
                                                                     which is the current ID of
                                                                     bg A's
                                                                     em->bdev->map->stripes[i]->dev->devid
                                                                 --> doesn't find such item
                                                                     returns -ENOENT
                                                                 --> the device id should have been X
                                                                     and not 0ULL
      
                                                             got -ENOENT from
                                                             btrfs_finish_chunk_alloc()
                                                             and aborts current transaction
      
         finishes setting up the target device,
         namely it sets tgtdev->devid to the value
         of srcdev->devid, which is X (and X > 0)
      
         frees the srcdev
      
         unlock fs_info->chunk_mutex
      
      So fix this by taking the device list mutex when processing the chunk's
      extent map stripes to update the device items. This avoids getting the
      wrong device id and use-after-free problems if the task finishing a
      chunk allocation grabs the replaced device, which is freed while the
      dev replace task is holding the device list mutex.
      
      This happened while running fstest btrfs/071.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      50460e37
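The fix described in commit 50460e37 can be sketched in userspace as follows. This is a simplified illustration, not the kernel code: `struct device`, `struct chunk_map`, `finish_replace()` and `collect_stripe_devids()` are hypothetical names, and a C11 `atomic_flag` spin lock stands in for the device list mutex. The point is only that the replace side changes the stripes and the target's id inside one critical section, and the chunk allocation side reads the ids under the same lock, so it never observes the placeholder id 0.

```c
#include <assert.h>
#include <stdatomic.h>

#define REPLACE_DEVID 0ULL /* mirrors BTRFS_DEV_REPLACE_DEVID from the text */
#define NUM_STRIPES 2

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct device { unsigned long long devid; };
struct chunk_map { struct device *stripes[NUM_STRIPES]; };

/* Assumption: a spin lock standing in for the device list mutex. */
static atomic_flag device_list_lock = ATOMIC_FLAG_INIT;

static void lock_devices(void)
{
    while (atomic_flag_test_and_set_explicit(&device_list_lock,
                                             memory_order_acquire))
        ; /* spin until the lock is free */
}

static void unlock_devices(void)
{
    atomic_flag_clear_explicit(&device_list_lock, memory_order_release);
}

/*
 * Replace side: point the stripes at the target device and give the
 * target its final id, all inside one critical section.
 */
static void finish_replace(struct chunk_map *map, struct device *src,
                           struct device *tgt)
{
    lock_devices();
    for (int i = 0; i < NUM_STRIPES; i++)
        if (map->stripes[i] == src)
            map->stripes[i] = tgt;
    tgt->devid = src->devid;
    unlock_devices();
}

/*
 * Chunk allocation side: copy the stripe device ids under the same lock.
 * Returns 0 on success, -1 if any stripe still carries the placeholder id.
 */
static int collect_stripe_devids(struct chunk_map *map,
                                 unsigned long long *ids)
{
    int ret = 0;

    lock_devices();
    for (int i = 0; i < NUM_STRIPES; i++) {
        ids[i] = map->stripes[i]->devid;
        if (ids[i] == REPLACE_DEVID)
            ret = -1;
    }
    unlock_devices();
    return ret;
}
```

Because both functions take the same lock, a reader runs either entirely before or entirely after the replace step and sees a valid device id in both cases.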
  6. 16 Dec, 2015 2 commits
    • C
      Btrfs: check prepare_uptodate_page() error code earlier · bb1591b4
      Chris Mason committed
      prepare_pages() may end up calling prepare_uptodate_page() twice if our
      write only spans a single page.  But if the first call returns an error,
      our page will be unlocked and it's not safe to call it again.
      
      This bug goes all the way back to 2011, and it's not something commonly
      hit.
      
      While we're here, add a more explicit check for the page being truncated
      away.  The bare lock_page() alone is protected only by good thoughts and
      i_mutex, which we're sure to regret eventually.
      Reported-by: Dave Jones <dsj@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bb1591b4
    • C
      Btrfs: check for empty bitmap list in setup_cluster_bitmaps · 1b9b922a
      Chris Mason committed
      Dave Jones found a warning from kasan in setup_cluster_bitmaps()
      
      ==================================================================
      BUG: KASAN: stack-out-of-bounds in setup_cluster_bitmap+0xc4/0x5a0 at
      addr ffff88039bef6828
      Read of size 8 by task nfsd/1009
      page:ffffea000e6fbd80 count:0 mapcount:0 mapping:          (null)
      index:0x0
      flags: 0x8000000000000000()
      page dumped because: kasan: bad access detected
      CPU: 1 PID: 1009 Comm: nfsd Tainted: G        W
      4.4.0-rc3-backup-debug+ #1
       ffff880065647b50 000000006bb712c2 ffff88039bef6640 ffffffffa680a43e
       0000004559c00000 ffff88039bef66c8 ffffffffa62638d1 ffffffffa61121c0
       ffff8803a5769de8 0000000000000296 ffff8803a5769df0 0000000000046280
      Call Trace:
       [<ffffffffa680a43e>] dump_stack+0x4b/0x6d
       [<ffffffffa62638d1>] kasan_report_error+0x501/0x520
       [<ffffffffa61121c0>] ? debug_show_all_locks+0x1e0/0x1e0
       [<ffffffffa6263948>] kasan_report+0x58/0x60
       [<ffffffffa6814b00>] ? rb_last+0x10/0x40
       [<ffffffffa66f8af4>] ? setup_cluster_bitmap+0xc4/0x5a0
       [<ffffffffa6262ead>] __asan_load8+0x5d/0x70
       [<ffffffffa66f8af4>] setup_cluster_bitmap+0xc4/0x5a0
       [<ffffffffa66f675a>] ? setup_cluster_no_bitmap+0x6a/0x400
       [<ffffffffa66fcd16>] btrfs_find_space_cluster+0x4b6/0x640
       [<ffffffffa66fc860>] ? btrfs_alloc_from_cluster+0x4e0/0x4e0
       [<ffffffffa66fc36e>] ? btrfs_return_cluster_to_free_space+0x9e/0xb0
       [<ffffffffa702dc37>] ? _raw_spin_unlock+0x27/0x40
       [<ffffffffa666a1a1>] find_free_extent+0xba1/0x1520
      
      Andrey noticed this was because we were doing list_first_entry on a list
      that might be empty.  Rework the tests a bit so we don't do that.
      Signed-off-by: Chris Mason <clm@fb.com>
      Reported-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Reported-by: Dave Jones <dsj@fb.com>
      1b9b922a
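The pitfall behind commit 1b9b922a can be illustrated with a minimal userspace sketch. The list helpers below are simplified stand-ins for the kernel's `list_head` API, and `struct bitmap_entry` and `first_bitmap_or_null()` are hypothetical names, not the code from the actual patch. In an empty circular list, `head->next` points back at the head itself, so converting it to the containing structure yields a pointer to memory that is not an entry at all; checking for emptiness first avoids that.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

static int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Recover the containing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for a free space bitmap entry. */
struct bitmap_entry {
    unsigned long offset;
    struct list_head list;
};

/*
 * Return the first entry, or NULL when the list is empty. Without the
 * list_empty() check, container_of() would be applied to the list head
 * itself and produce a pointer that does not refer to a bitmap_entry.
 */
static struct bitmap_entry *first_bitmap_or_null(struct list_head *bitmaps)
{
    if (list_empty(bitmaps))
        return NULL;
    return container_of(bitmaps->next, struct bitmap_entry, list);
}
```

A caller then treats a NULL return as "no bitmaps to consider" instead of dereferencing a bogus entry.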
  7. 10 Dec, 2015 3 commits
    • H
      btrfs: fix misleading warning when space cache failed to load · 94356889
      Holger Hoffstätte committed
      When an inconsistent space cache is detected during loading we log a
      warning that users frequently mistake for an instruction to invalidate the
      cache manually, even though this is not required. Fix the message to
      indicate that the cache will be rebuilt automatically.
      Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Acked-by: Filipe Manana <fdmanana@suse.com>
      94356889
    • F
      Btrfs: fix transaction handle leak in balance · 8a7d656f
      Filipe Manana committed
      If we fail to allocate a new data chunk, we were jumping to the error path
      without releasing the transaction handle we got before. Fix this by always
      releasing it before doing the jump.
      
      Fixes: 2c9fe835 ("btrfs: Fix lost-data-profile caused by balance bg")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      8a7d656f
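The error-path pattern from commit 8a7d656f can be sketched as below. All names here (`start_transaction()`, `end_transaction()`, `alloc_data_chunk()`, `relocate_step()`) are hypothetical userspace stand-ins rather than the btrfs functions, and the `open_handles` counter exists only to make a leak observable. The handle is released before the jump to the error label, so the failure path cannot leak it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a transaction handle. */
struct trans_handle { int unused; };

/* Number of handles currently held; a leak leaves this above zero. */
static int open_handles;

static struct trans_handle *start_transaction(void)
{
    struct trans_handle *trans = malloc(sizeof(*trans));

    if (trans)
        open_handles++;
    return trans;
}

static void end_transaction(struct trans_handle *trans)
{
    free(trans);
    open_handles--;
}

/* Hypothetical allocation step; 'fail' forces the error case. */
static int alloc_data_chunk(struct trans_handle *trans, int fail)
{
    (void)trans;
    return fail ? -ENOSPC : 0;
}

static int relocate_step(int fail_alloc)
{
    struct trans_handle *trans = start_transaction();
    int ret;

    if (!trans)
        return -ENOMEM;

    ret = alloc_data_chunk(trans, fail_alloc);
    /* Always release the handle before any jump to the error path. */
    end_transaction(trans);
    if (ret < 0)
        goto out;

    /* ... further work that does not need the handle ... */
out:
    return ret;
}
```

Releasing the handle unconditionally right after the call, instead of only on the success path, keeps both outcomes balanced.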
    • F
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana committed
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
      filesystem with "-o discard":
      
       void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       {
       (...)
                       if (trimming) {
                               WARN_ON(!list_empty(&block_group->bg_list));
                               spin_lock(&trans->transaction->deleted_bgs_lock);
                               list_move(&block_group->bg_list,
                                         &trans->transaction->deleted_bgs);
                               spin_unlock(&trans->transaction->deleted_bgs_lock);
                               btrfs_get_block_group(block_group);
                       }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
                the transaction's deleted_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better as a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      348a0013
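The locking scheme chosen in commit 348a0013 can be sketched in userspace like this. It is an illustration under stated assumptions: the list helpers are simplified stand-ins for the kernel's `list_head` API, the spin lock is built on a C11 `atomic_flag`, and `struct fs_state`, `mark_unused()` and `mark_deleted()` are hypothetical names. What it shows is that one lock guards both lists, so moving a block group from the unused list to the deleted list is a single critical section and no other task can be touching the source list during the move.

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static int list_empty(const struct list_head *head)
{
    return head->next == head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Unlink the item from whatever list it is on and reinitialize it. */
static void list_del_init(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    list_init(item);
}

/* Assumption: a simple spin lock standing in for the kernel spinlock. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->locked,
                                             memory_order_acquire))
        ; /* spin until the lock is free */
}

static void spin_unlock(spinlock_t *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

struct fs_state {
    spinlock_t unused_bgs_lock;   /* protects BOTH lists below */
    struct list_head unused_bgs;
    struct list_head deleted_bgs;
};

struct block_group {
    int id;
    struct list_head bg_list;
};

static void fs_state_init(struct fs_state *fs)
{
    atomic_flag_clear(&fs->unused_bgs_lock.locked);
    list_init(&fs->unused_bgs);
    list_init(&fs->deleted_bgs);
}

/* Add the block group to the unused list unless it is already on a list. */
static void mark_unused(struct fs_state *fs, struct block_group *bg)
{
    spin_lock(&fs->unused_bgs_lock);
    if (list_empty(&bg->bg_list))
        list_add_tail(&bg->bg_list, &fs->unused_bgs);
    spin_unlock(&fs->unused_bgs_lock);
}

/* Move the block group to the deleted list under the shared lock. */
static void mark_deleted(struct fs_state *fs, struct block_group *bg)
{
    spin_lock(&fs->unused_bgs_lock);
    list_del_init(&bg->bg_list);
    list_add_tail(&bg->bg_list, &fs->deleted_bgs);
    spin_unlock(&fs->unused_bgs_lock);
}
```

With two separate locks, the move would have to take both in a fixed order; sharing one lock avoids that, at the cost of slightly coarser locking, which the commit message argues is acceptable because the lock is seldom used.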
  8. 07 Dec, 2015 8 commits
  9. 03 Dec, 2015 9 commits