1. 29 5月, 2018 4 次提交
  2. 28 5月, 2018 1 次提交
  3. 21 4月, 2018 1 次提交
    • N
      btrfs: Fix race condition between delayed refs and blockgroup removal · 5e388e95
      Nikolay Borisov 提交于
      When the delayed refs for a head are all run, eventually
      cleanup_ref_head is called which (in case of deletion) obtains a
      reference for the relevant btrfs_space_info struct by querying the bg
      for the range. This is problematic because when the last extent of a
      bg is deleted a race window emerges between removal of that bg and the
      subsequent invocation of cleanup_ref_head. This can result in cache being null
      and either a null pointer dereference or assertion failure.
      
      	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
      	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
      	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
      	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
      	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
      	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
      	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
      	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
      	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
      	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
      	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
      	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
      	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
      	 evict+0xc6/0x190
      	 do_unlinkat+0x19c/0x300
      	 do_syscall_64+0x74/0x140
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7fbf589c57a7
      
      To fix this, introduce a new flag "is_system" to head_ref structs,
      which is populated at insertion time. This allows to decouple the
      querying for the spaceinfo from querying the possibly deleted bg.
      
      Fixes: d7eae340 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
      CC: stable@vger.kernel.org # 4.14+
      Suggested-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5e388e95
  4. 12 4月, 2018 1 次提交
  5. 31 3月, 2018 1 次提交
  6. 26 3月, 2018 1 次提交
    • D
      btrfs: add more __cold annotations · e67c718b
      David Sterba 提交于
      The __cold functions are placed to a special section, as they're
      expected to be called rarely. This could help i-cache prefetches or help
      compiler to decide which branches are more/less likely to be taken
      without any other annotations needed.
      
      Though we can't add more __exit annotations, it's still possible to add
      __cold (that's also added with __exit). That way the following function
      categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e67c718b
  7. 02 2月, 2018 1 次提交
    • N
      btrfs: Ignore errors from btrfs_qgroup_trace_extent_post · 952bd3db
      Nikolay Borisov 提交于
      Running generic/019 with qgroups on the scratch device enabled is almost
      guaranteed to trigger the BUG_ON in btrfs_free_tree_block. It's supposed
      to trigger only on -ENOMEM, in reality, however, it's possible to get
      -EIO from btrfs_qgroup_trace_extent_post. This function just finds the
      roots of the extent being tracked and sets the qrecord->old_roots list.
      If this operation fails nothing critical happens except the quota
      accounting can be considered wrong. In such case just set the
      INCONSISTENT flag for the quota and print a warning, rather than killing
      off the system. Additionally, it's possible to trigger a BUG_ON in
      btrfs_truncate_inode_items as well.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      [ error message adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      952bd3db
  8. 22 1月, 2018 1 次提交
  9. 02 11月, 2017 3 次提交
    • J
      btrfs: track refs in a rb_tree instead of a list · 0e0adbcf
      Josef Bacik 提交于
      If we get a significant amount of delayed refs for a single block (think
      modifying multiple snapshots) we can end up spending an ungodly amount
      of time looping through all of the entries trying to see if they can be
      merged.  This is because we only add them to a list, so we have O(2n)
      for every ref head.  This doesn't make any sense as we likely have refs
      for different roots, and so they cannot be merged.  Tracking in a tree
      will allow us to break as soon as we hit an entry that doesn't match,
      making our worst case O(n).
      
      With this we can also merge entries more easily.  Before we had to hope
      that matching refs were on the ends of our list, but with the tree we
      can search down to exact matches and merge them at insert time.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0e0adbcf
    • J
      btrfs: add a comp_refs() helper · 1d148e59
      Josef Bacik 提交于
      Instead of open-coding the delayed ref comparisons, add a helper to do
      the comparisons generically and use that everywhere.  We compare
      sequence numbers last for following patches.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1d148e59
    • J
      btrfs: switch args for comp_*_refs · c7ad7c84
      Josef Bacik 提交于
      Make it more consistent, we want the inserted ref to be compared against
      what's already in there.  This will make the order go from lowest seq ->
      highest seq, which will make us more likely to make forward progress if
      there's a seqlock currently held.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7ad7c84
  10. 30 10月, 2017 2 次提交
  11. 30 6月, 2017 1 次提交
  12. 18 4月, 2017 1 次提交
  13. 17 2月, 2017 1 次提交
    • Q
      btrfs: qgroup: Move half of the qgroup accounting time out of commit trans · fb235dc0
      Qu Wenruo 提交于
      Just as Filipe pointed out, the most time consuming parts of qgroup are
      btrfs_qgroup_account_extents() and
      btrfs_qgroup_prepare_account_extents().
      Which both call btrfs_find_all_roots() to get old_roots and new_roots
      ulist.
      
      What makes things worse is, we're calling that expensive
      btrfs_find_all_roots() at transaction committing time with
      TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.
      
      Such behavior is necessary for @new_roots search as current
      btrfs_find_all_roots() can't do it correctly so we do call it just
      before switch commit roots.
      
      However for @old_roots search, it's not necessary as such search is
      based on commit_root, so it will always be correct and we can move it
      out of transaction committing.
      
      This patch moves the @old_roots search part out of
      commit_transaction(), so in theory we can half the time qgroup time
      consumption at commit_transaction().
      
      But please note that, this won't speedup qgroup overall, the total time
      consumption is still the same, just reduce the performance stall.
      
      Cc: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fb235dc0
  14. 14 2月, 2017 2 次提交
  15. 30 11月, 2016 2 次提交
    • W
      btrfs: improve delayed refs iterations · 1d57ee94
      Wang Xiaoguang 提交于
      This issue was found when I tried to delete a heavily reflinked file,
      when deleting such files, other transaction operation will not have a
      chance to make progress, for example, start_transaction() will blocked
      in wait_current_trans(root) for long time, sometimes it even triggers
      soft lockups, and the time taken to delete such heavily reflinked file
      is also very large, often hundreds of seconds. Using perf top, it reports
      that:
      
      PerfTop:    7416 irqs/sec  kernel:99.8%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
      ---------------------------------------------------------------------------------------
          84.37%  [btrfs]             [k] __btrfs_run_delayed_refs.constprop.80
          11.02%  [kernel]            [k] delay_tsc
           0.79%  [kernel]            [k] _raw_spin_unlock_irq
           0.78%  [kernel]            [k] _raw_spin_unlock_irqrestore
           0.45%  [kernel]            [k] do_raw_spin_lock
           0.18%  [kernel]            [k] __slab_alloc
      It seems __btrfs_run_delayed_refs() took most cpu time, after some debug
      work, I found it's select_delayed_ref() causing this issue, for a delayed
      head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but
      select_delayed_ref() will firstly try to iterate node list to find
      BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and
      waste much time.
      
      To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head,
      then in select_delayed_ref(), if this list is not empty, we can directly use
      nodes in this list. With this patch, it just took about 10~15 seconds to
      delte the same file. Now using perf top, it reports that:
      
      PerfTop:    2734 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
      ----------------------------------------------------------------------------------------
      
          20.74%  [kernel]          [k] _raw_spin_unlock_irqrestore
          16.33%  [kernel]          [k] __slab_alloc
           5.41%  [kernel]          [k] lock_acquired
           4.42%  [kernel]          [k] lock_acquire
           4.05%  [kernel]          [k] lock_release
           3.37%  [kernel]          [k] _raw_spin_unlock_irq
      
      For normal files, this patch also gives help, at least we do not need to
      iterate whole list to found BTRFS_ADD_DELAYED_REF nodes.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1d57ee94
    • Q
      btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps · 50b3e040
      Qu Wenruo 提交于
      Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
      btrfs_qgroup_trace_extent(_nolock), according to the new
      reserve/trace/account naming schema.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      50b3e040
  16. 27 9月, 2016 1 次提交
  17. 26 9月, 2016 1 次提交
    • J
      Btrfs: add a flags field to btrfs_fs_info · afcdd129
      Josef Bacik 提交于
      We have a lot of random ints in btrfs_fs_info that can be put into flags.  This
      is mostly equivalent with the exception of how we deal with quota going on or
      off, now instead we set a flag when we are turning it on or off and deal with
      that appropriately, rather than just having a pending state that the current
      quota_enabled gets set to.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      afcdd129
  18. 25 8月, 2016 1 次提交
    • Q
      btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() · cb93b52c
      Qu Wenruo 提交于
      Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
      1. btrfs_qgroup_insert_dirty_extent_nolock()
         Almost the same with original code.
         For delayed_ref usage, which has delayed refs locked.
      
         Change the return value type to int, since caller never needs the
         pointer, but only needs to know if they need to free the allocated
         memory.
      
      2. btrfs_qgroup_insert_dirty_extent()
         The more encapsulated version.
      
         Will do the delayed_refs lock, memory allocation, quota enabled check
         and other things.
      
      The original design is to keep exported functions to minimal, but since
      more btrfs hacks exposed, like replacing path in balance, we need to
      record dirty extents manually, so we have to add such functions.
      
      Also, add comment for both functions, to info developers how to keep
      qgroup correct when doing hacks.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      cb93b52c
  19. 03 8月, 2016 1 次提交
  20. 26 7月, 2016 2 次提交
    • J
      btrfs: prefix fsid to all trace events · bc074524
      Jeff Mahoney 提交于
      When using trace events to debug a problem, it's impossible to determine
      which file system generated a particular event.  This patch adds a
      macro to prefix standard information to the head of a trace event.
      
      The extent_state alloc/free events are all that's left without an
      fs_info available.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc074524
    • N
      btrfs: Fix slab accounting flags · fba4b697
      Nikolay Borisov 提交于
      BTRFS is using a variety of slab caches to satisfy internal needs.
      Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
      meaning allocations from the caches are going to be accounted as
      SReclaimable. At the same time btrfs is not registering any shrinkers
      whatsoever, thus preventing memory from the slabs to be shrunk. This
      means those caches are not in fact reclaimable.
      
      To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
      inode cache, since this one is being freed by the generic VFS super_block
      shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
      to better document the lifetime of the objects (it just translates
      to SLAB_RECLAIM_ACCOUNT).
      Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fba4b697
  21. 18 2月, 2016 1 次提交
  22. 07 1月, 2016 1 次提交
    • D
      btrfs: better packing of btrfs_delayed_extent_op · 35b3ad50
      David Sterba 提交于
      btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
      and has 8 unused bytes. Reducing the level type to u8 makes it possible
      to squeeze it to the padding byte after key. The bitfields were switched
      to bool as there's space to store the full byte without increasing the
      whole structure, besides that the generated assembly is smaller.
      
      struct btrfs_delayed_extent_op {
      	struct btrfs_disk_key      key;                  /*     0    17 */
      	u8                         level;                /*    17     1 */
      	bool                       update_key;           /*    18     1 */
      	bool                       update_flags;         /*    19     1 */
      	bool                       is_data;              /*    20     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	u64                        flags_to_set;         /*    24     8 */
      
      	/* size: 32, cachelines: 1, members: 6 */
      	/* sum members: 29, holes: 1, sum holes: 3 */
      	/* last cacheline: 32 bytes */
      };
      
      The final size is 32 bytes which gives +26 object per slab page.
      
         text	   data	    bss	    dec	    hex	filename
       938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
       938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      35b3ad50
  23. 27 10月, 2015 1 次提交
  24. 26 10月, 2015 2 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
    • F
      Btrfs: fix regression when running delayed references · 2c3cf7d5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      2c3cf7d5
  25. 22 10月, 2015 1 次提交
  26. 25 6月, 2015 1 次提交
  27. 11 6月, 2015 3 次提交
  28. 11 4月, 2015 1 次提交
    • J
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik 提交于
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: NChris Mason <clm@fb.com>
      1262133b