1. 07 1月, 2016 3 次提交
    • D
      btrfs: cleanup, use enum values for btrfs_path reada · e4058b54
      David Sterba 提交于
      Replace the integers by enums for better readability. The value 2 does
      not have any meaning since a7175319
      "Btrfs: do less aggressive btree readahead" (2009-01-22).
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e4058b54
    • B
      Btrfs: use linux/sizes.h to represent constants · ee22184b
      Byongho Lee 提交于
      We use many constants to represent size and offset value.  And to make
      code readable we use '256 * 1024 * 1024' instead of '268435456' to
      represent '256MB'.  However we can make far more readable with 'SZ_256MB'
      which is defined in the 'linux/sizes.h'.
      
      So this patch replaces 'xxx * 1024 * 1024' kind of expression with
      single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
      not a power of 2. And I haven't touched to '4096' & '8192' because it's
      more intuitive than 'SZ_4KB' & 'SZ_8KB'.
      Signed-off-by: NByongho Lee <bhlee.kernel@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee22184b
    • D
      btrfs: better packing of btrfs_delayed_extent_op · 35b3ad50
      David Sterba 提交于
      btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
      and has 8 unused bytes. Reducing the level type to u8 makes it possible
      to squeeze it to the padding byte after key. The bitfields were switched
      to bool as there's space to store the full byte without increasing the
      whole structure, besides that the generated assembly is smaller.
      
      struct btrfs_delayed_extent_op {
      	struct btrfs_disk_key      key;                  /*     0    17 */
      	u8                         level;                /*    17     1 */
      	bool                       update_key;           /*    18     1 */
      	bool                       update_flags;         /*    19     1 */
      	bool                       is_data;              /*    20     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	u64                        flags_to_set;         /*    24     8 */
      
      	/* size: 32, cachelines: 1, members: 6 */
      	/* sum members: 29, holes: 1, sum holes: 3 */
      	/* last cacheline: 32 bytes */
      };
      
      The final size is 32 bytes which gives +26 object per slab page.
      
         text	   data	    bss	    dec	    hex	filename
       938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
       938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      35b3ad50
  2. 31 12月, 2015 1 次提交
    • F
      Btrfs: fix race between free space endio workers and space cache writeout · 2bc0bb5f
      Filipe Manana 提交于
      While running a stress test I ran into the following trace/transaction
      abort:
      
      [471626.672243] ------------[ cut here ]------------
      [471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
      [471626.675492] BTRFS: Transaction aborted (error -2)
      [471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
      [471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [471626.691901]  0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
      [471626.695009]  ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
      [471626.697490]  ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
      [471626.699201] Call Trace:
      [471626.699804]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [471626.701049]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [471626.702542]  [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
      [471626.704326]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
      [471626.705636]  [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
      [471626.707048]  [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
      [471626.708616]  [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
      [471626.709950]  [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
      [471626.711286]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
      [471626.712611]  [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
      [471626.715610]  [<ffffffff811962a2>] ? SyS_tee+0x226/0x226
      [471626.716718]  [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
      [471626.717672]  [<ffffffff8116fc01>] iterate_supers+0x75/0xc2
      [471626.718800]  [<ffffffff8119669a>] sys_sync+0x52/0x80
      [471626.719990]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [471626.721835] ---[ end trace baf57f43d76693f4 ]---
      [471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry
      
      This is a very rare situation and it happened due to a race between a free
      space endio worker and writing the space caches for dirty block groups at
      a transaction's commit critical section. The steps leading to this are:
      
      1) A task calls btrfs_commit_transaction() and starts the writeout of the
         space caches for all currently dirty block groups (i.e. it calls
         btrfs_start_dirty_block_groups());
      
      2) The previous step starts writeback for space caches;
      
      3) When the writeback finishes it queues jobs for free space endio work
         queue (fs_info->endio_freespace_worker) that execute
         btrfs_finish_ordered_io();
      
      4) The task committing the transaction sets the transaction's state
         to TRANS_STATE_COMMIT_DOING and shortly after calls
         btrfs_write_dirty_block_groups();
      
      5) A free space endio job joins the transaction, through
         btrfs_join_transaction_nolock(), and updates a free space inode item
         in the root tree through btrfs_update_inode_fallback();
      
      6) Updating the free space inode item resulted in COWing one or more
         nodes/leaves of the root tree, and that resulted in creating a new
         metadata block group, which gets added to the transaction's list
         of dirty block groups (this is a very rare case);
      
      7) The free space endio job has not released yet its transaction handle
         at this point, so the new metadata block group was not yet fully
         created (didn't go through btrfs_create_pending_block_groups() yet);
      
      8) The transaction commit task sees the new metadata block group in
         the transaction's list of dirty block groups and processes it.
         When it attempts to update the block group's block group item in
         the extent tree, through write_one_cache_group(), it isn't able
         to find it and aborts the transaction with error -ENOENT - this
         is because the free space endio job hasn't yet released its
         transaction handle (which calls btrfs_create_pending_block_groups())
         and therefore the block group item was not yet added to the extent
         tree.
      
      Fix this waiting for free space endio jobs if we fail to find a block
      group item in the extent tree and then retry once updating the block
      group item.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      2bc0bb5f
  3. 30 12月, 2015 2 次提交
  4. 22 12月, 2015 1 次提交
    • F
      Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups · e44081ef
      Filipe Manana 提交于
      We call btrfs_write_dirty_block_groups() in the critical section of a
      transaction's commit, when no other tasks can join the transaction and
      add more block groups to the transaction's list of dirty block groups,
      so we not taking the dirty block groups spinlock when checking for the
      list's emptyness, grabbing its first element or deleting elements from
      it.
      
      However there's a special and rare case where we can have a concurrent
      task adding elements to this list. We trigger writeback for space
      caches before at btrfs_start_dirty_block_groups() and in past iterations
      of the loop at btrfs_write_dirty_block_groups(), this means that when
      the writeback finishes (which happens asynchronously) it creates a
      task for the endio free space work queue that executes
      btrfs_finish_ordered_io() - this function is able to join the transaction,
      through btrfs_join_transaction_nolock(), and update the free space cache's
      inode item in the root tree, which can result in COWing nodes of this tree
      and therefore allocation of a new block group can happen, which gets added
      to the transaction's list of dirty block groups while the transaction
      commit task is operating on it concurrently.
      
      So fix this by taking the dirty block groups spinlock before doing
      operations on the dirty block groups list at
      btrfs_write_dirty_block_groups().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      e44081ef
  5. 18 12月, 2015 3 次提交
    • O
      Btrfs: wire up the free space tree to the extent tree · 1e144fb8
      Omar Sandoval 提交于
      The free space tree is updated in tandem with the extent tree. There are
      only a handful of places where we need to hook in:
      
      1. Block group creation
      2. Block group deletion
      3. Delayed refs (extent creation and deletion)
      4. Block group caching
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1e144fb8
    • O
      Btrfs: implement the free space B-tree · a5ed9182
      Omar Sandoval 提交于
      The free space cache has turned out to be a scalability bottleneck on
      large, busy filesystems. When the cache for a lot of block groups needs
      to be written out, we can get extremely long commit times; if this
      happens in the critical section, things are especially bad because we
      block new transactions from happening.
      
      The main problem with the free space cache is that it has to be written
      out in its entirety and is managed in an ad hoc fashion. Using a B-tree
      to store free space fixes this: updates can be done as needed and we get
      all of the benefits of using a B-tree: checksumming, RAID handling,
      well-understood behavior.
      
      With the free space tree, we get commit times that are about the same as
      the no cache case with load times slower than the free space cache case
      but still much faster than the no cache case. Free space is represented
      with extents until it becomes more space-efficient to use bitmaps,
      giving us similar space overhead to the free space cache.
      
      The operations on the free space tree are: adding and removing free
      space, handling the creation and deletion of block groups, and loading
      the free space for a block group. We can also create the free space tree
      by walking the extent tree and clear the free space tree.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a5ed9182
    • O
      Btrfs: refactor caching_thread() · 73fa48b6
      Omar Sandoval 提交于
      We're also going to load the free space tree from caching_thread(), so
      we should refactor some of the common code.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      73fa48b6
  6. 10 12月, 2015 1 次提交
    • F
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana 提交于
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount the a
      filesysten with "-o discard":
      
       10263  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       10264  {
       (...)
       10405                  if (trimming) {
       10406                          WARN_ON(!list_empty(&block_group->bg_list));
       10407                          spin_lock(&trans->transaction->deleted_bgs_lock);
       10408                          list_move(&block_group->bg_list,
       10409                                    &trans->transaction->deleted_bgs);
       10410                          spin_unlock(&trans->transaction->deleted_bgs_lock);
       10411                          btrfs_get_block_group(block_group);
       10412                  }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
               the transaction's delete_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      348a0013
  7. 07 12月, 2015 1 次提交
  8. 25 11月, 2015 5 次提交
    • M
      btrfs: qgroup: account shared subtree during snapshot delete · 82bd101b
      Mark Fasheh 提交于
      Commit 0ed4792a ('btrfs: qgroup: Switch to new extent-oriented qgroup
      mechanism.') removed our qgroup accounting during
      btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
      going bad shortly after a snapshot is removed.
      
      Fix this by adding a dirty extent record when we encounter extents during
      our shared subtree walk. This effectively restores the functionality we had
      with the original shared subtree walking code in 1152651a (btrfs: qgroup:
      account shared subtrees during snapshot delete).
      
      The idea with the original patch (and this one) is that shared subtrees can
      get skipped during drop_snapshot. The shared subtree walk then allows us a
      chance to visit those extents and add them to the qgroup work for later
      processing. This ultimately makes the accounting for drop snapshot work.
      
      The new qgroup code nicely handles all the other extents during the tree
      walk via the ref dec/inc functions so we don't have to add actions beyond
      what we had originally.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      82bd101b
    • F
      Btrfs: fix race between cleaner kthread and space cache writeout · 036a9348
      Filipe Manana 提交于
      When a block group becomes unused and the cleaner kthread is currently
      running, we can end up getting the current transaction aborted with error
      -ENOENT when we try to commit the transaction, leading to the following
      trace:
      
        [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
        [59779.272594] BTRFS: Transaction aborted (error -2)
        (...)
        [59779.291137] Call Trace:
        [59779.291621]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
        [59779.292543]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
        [59779.293435]  [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
        [59779.295000]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
        [59779.296138]  [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
        [59779.297663]  [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
        [59779.299141]  [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs]
        [59779.300359]  [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs]
        [59779.301805]  [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
        [59779.302893]  [<ffffffff81196634>] sync_filesystem+0x7f/0x93
        (...)
        [59779.318186] ---[ end trace 577e2daff90da33a ]---
      
      The following diagram illustrates a sequence of steps leading to this
      problem:
      
             CPU 1                                             CPU 2
      
                                 <at transaction N>
      
                                                              adds bg A to list
                                                              fs_info->unused_bgs
      
                                                              adds bg B to list
                                                              fs_info->unused_bgs
      
                                 <transaction kthread
                                  commits transaction N
                                  and wakes up the
                                  cleaner kthread>
      
        cleaner kthread
          delete_unused_bgs()
      
            sees bg A in list
            fs_info->unused_bgs
      
            btrfs_start_transaction()
      
                                 <transaction N + 1 starts>
      
            deletes bg A
      
                                                              update_block_group(bg C)
      
                                                                --> adds bg C to list
                                                                    fs_info->unused_bgs
      
            deletes bg B
      
            sees bg C in the list
            fs_info->unused_bgs
      
            btrfs_remove_chunk(bg C)
              btrfs_remove_block_group(bg C)
      
                --> checks if the block group
                    is in a dirty list, and
                    because it isn't now, it
                    does nothing
      
                --> the block group item
                    is deleted from the
                    extent tree
      
                                                                --> adds bg C to list
                                                                    transaction->dirty_bgs
      
                                                               some task calls
                                                               btrfs_commit_transaction(t N + 1)
                                                                 commit_cowonly_roots()
                                                                   btrfs_write_dirty_block_groups()
                                                                     --> sees bg C in cur_trans->dirty_bgs
                                                                     --> calls write_one_cache_group()
                                                                         which returns -ENOENT because
                                                                         it did not find the block group
                                                                         item in the extent tree
                                                                     --> transaction aborte with -ENOENT
                                                                         because write_one_cache_group()
                                                                         returned that error
      
      So fix this by adding a block group to the list of dirty block groups
      before adding it to the list of unused block groups.
      
      This happened on a stress test using fsstress plus concurrent calls to
      fallocate 20G and truncate (releasing part of the space allocated with
      fallocate).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      036a9348
    • F
      Btrfs: fix scrub preventing unused block groups from being deleted · 758f2dfc
      Filipe Manana 提交于
      Currently scrub can race with the cleaner kthread when the later attempts
      to delete an unused block group, and the result is preventing the cleaner
      kthread from ever deleting later the block group - unless the block group
      becomes used and unused again. The following diagram illustrates that
      race:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs and
           removes it from that list
      
                                                   scrub_enumerate_chunks()
      
                                                     searches device tree using
                                                     its commit root
      
                                                     finds device extent for
                                                     block group X
      
                                                     gets block group X from the tree
                                                     fs_info->block_group_cache_tree
                                                     (via btrfs_lookup_block_group())
      
                                                     sets bg X to RO
      
           sees the block group is
           already RO and therefore
           doesn't delete it nor adds
           it back to unused list
      
      So fix this by making scrub add the block group again to the list of
      unused block groups if the block group is still unused when it finished
      scrubbing it and it hasn't been removed already.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      758f2dfc
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana 提交于
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible heigth (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However after adding the orphan item, writepages() can by called by the VM
      subsystem against the btree inode when we are under memory pressure, which
      causes writeback to start for the nodes we COWed before, this forces the
      operation to remove the free space item to COW again some (or all of) the
      same nodes (in the tree of tree roots). Even without writepages() being
      called, we could fail with ENOSPC because these items are located in
      multiple trees and one of them might have a higher heigth and require
      node/leaf splits at many levels, exhausting all the reserved space before
      removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7fd01182
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  9. 11 11月, 2015 2 次提交
    • Z
      btrfs: Use fs_info directly in btrfs_delete_unused_bgs · d5f2e33b
      Zhao Lei 提交于
      No need to use root->fs_info in btrfs_delete_unused_bgs(),
      use fs_info directly instead.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d5f2e33b
    • Z
      btrfs: Fix lost-data-profile caused by auto removing bg · aefbe9a6
      Zhao Lei 提交于
      Reproduce:
       (In integration-4.3 branch)
      
       TEST_DEV=(/dev/vdg /dev/vdh)
       TEST_DIR=/mnt/tmp
      
       umount "$TEST_DEV" >/dev/null
       mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
      
       mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
       umount "$TEST_DEV"
      
       mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
       btrfs filesystem usage $TEST_DIR
      
      We can see the data chunk changed from raid1 to single:
       # btrfs filesystem usage $TEST_DIR
       Data,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       #
      
      Reason:
       When a empty filesystem mount with -o nospace_cache, the last
       data blockgroup will be auto-removed in umount.
      
       Then if we mount it again, there is no data chunk in the
       filesystem, so the only available data profile is 0x0, result
       is all new chunks are created as single type.
      
      Fix:
       Don't auto-delete last blockgroup for a raid type.
      
      Test:
       Test by above script, and confirmed the logic by debug output.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      aefbe9a6
  10. 03 11月, 2015 1 次提交
    • C
      Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state · 13a0db5a
      chandan 提交于
      When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
      and nodesize set to 64k), the following call trace is observed,
      
      WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
      Modules linked in:
      CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d9 #54
      task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
      NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
      REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d9)
      MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
      CFAR: c0000000004a3b30 SOFTE: 1
      GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
      GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
      GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
      GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
      GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
      GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
      GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
      GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
      NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
      LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      Call Trace:
      [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
      [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
      [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
      [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
      [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
      [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
      [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
      [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
      [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
      [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
      [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
      [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
      [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
      [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
      [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
      [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
      [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
      [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
      [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
      [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
      [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
      [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
      [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
      [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
      Instruction dump:
      fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
      812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
      
      The above call trace is seen even on x86_64; albeit very rarely and that too
      with nodesize set to 64k and with nospace_cache mount option being used.
      
      The reason for the above call trace is,
      btrfs_remove_chunk
        check_system_chunk
          Allocate chunk if required
        For each physical stripe on underlying device,
          btrfs_free_dev_extent
            ...
            Take lock on Device tree's root node
            btrfs_cow_block("dev tree's root node");
              btrfs_reserve_extent
                find_free_extent
      	    index = BTRFS_RAID_DUP;
      	    have_caching_bg = false;
      
                  When in LOOP_CACHING_NOWAIT state, Assume we find a block group
      	    which is being cached; Hence have_caching_bg is set to true
      
                  When repeating the search for the next RAID index, we set
      	    have_caching_bg to false.
      
      Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
      skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
      allocate a chunk and try to add entries corresponding to the chunk's physical
      stripe into the device tree. When doing so the task deadlocks itself waiting
      for the blocking lock on the root node of the device tree.
      
      This commit fixes the issue by introducing a new local variable to help
      indicate as to whether a block group of any RAID type is being cached.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      13a0db5a
  11. 27 10月, 2015 1 次提交
  12. 26 10月, 2015 2 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
    • F
      Btrfs: fix regression when running delayed references · 2c3cf7d5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      2c3cf7d5
  13. 22 10月, 2015 17 次提交