1. 09 Sep 2019 (4 commits)
  2. 30 Apr 2019 (2 commits)
  3. 15 Oct 2018 (4 commits)
    • Btrfs: kill btrfs_clear_path_blocking · 52398340
      Committed by Liu Bo
      Btrfs's btree locking has two modes, spinning mode and blocking mode.
      While searching the btree, locks are always acquired in spinning mode
      and then converted to blocking mode if necessary, and in some hot
      paths we may switch the locking back to spinning mode via
      btrfs_clear_path_blocking().
      
      When acquiring a lock, both readers and writers need to wait for any
      blocking readers and writers to complete before doing
      read_lock()/write_lock().
      
      The problem is that btrfs_clear_path_blocking() first needs to switch
      the nodes in the path to blocking mode (via btrfs_set_path_blocking())
      to keep lockdep happy, before doing its actual job of clearing the
      blocking state.
      
      Switching from spinning mode to blocking mode consists of

      step 1) bumping up the blocking readers counter, and
      step 2) read_unlock()/write_unlock().

      This causes a serious ping-pong effect when there is a large number of
      concurrent readers/writers, as waiters are woken up only to go back to
      sleep immediately.
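
      To make the two steps above concrete, here is a minimal userspace model
      of the conversion and of the resulting ping-pong (illustrative names,
      pthread/stdatomic primitives, and busy-waiting instead of a waitqueue;
      this is a sketch, not the btrfs locking code itself):

      #include <pthread.h>
      #include <sched.h>
      #include <stdatomic.h>

      struct model_eb {
          pthread_rwlock_t lock;        /* the "spinning" lock        */
          atomic_int blocking_readers;  /* readers gone blocking      */
          atomic_int blocking_writers;  /* writers gone blocking      */
      };

      /* A reader holding the spinning lock converts to blocking mode:
       * step 1) bump the blocking readers counter,
       * step 2) drop the spinning lock. */
      static void model_set_read_blocking(struct model_eb *eb)
      {
          atomic_fetch_add(&eb->blocking_readers, 1);
          pthread_rwlock_unlock(&eb->lock);
      }

      /* A new writer must wait for all blocking readers/writers to finish
       * before taking the spinning lock; in the real code the waiter sleeps
       * on a waitqueue, so repeated spin->block->spin conversions wake it
       * up only for it to go back to sleep immediately. */
      static void model_tree_write_lock(struct model_eb *eb)
      {
          while (atomic_load(&eb->blocking_readers) ||
                 atomic_load(&eb->blocking_writers))
              sched_yield();
          pthread_rwlock_wrlock(&eb->lock);
      }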
      
      1) Killing this kind of ping-pong results in a big improvement in my
      1600k file creation script:
      
      MNT=/mnt/btrfs
      mkfs.btrfs -f /dev/sdf
      mount /dev/sdf $MNT
      time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
              -d  $MNT/0  -d  $MNT/1 \
              -d  $MNT/2  -d  $MNT/3 \
              -d  $MNT/4  -d  $MNT/5 \
              -d  $MNT/6  -d  $MNT/7 \
              -d  $MNT/8  -d  $MNT/9 \
              -d  $MNT/10  -d  $MNT/11 \
              -d  $MNT/12  -d  $MNT/13 \
              -d  $MNT/14  -d  $MNT/15
      
      w/o patch:
      real    2m27.307s
      user    0m12.839s
      sys     13m42.831s
      
      w/ patch:
      real    1m2.273s
      user    0m15.802s
      sys     8m16.495s
      
      1.1) latency histogram from funclatency[1]
      
      Overall, with the patch there are ~50% fewer write lock acquisitions,
      and the 95% max latency of the write lock also drops to ~100ms from
      >500ms.
      
      --------------------------------------------
      w/o patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 2385222  |****************************************|
               2 -> 3          : 37147    |                                        |
               4 -> 7          : 20452    |                                        |
               8 -> 15         : 13131    |                                        |
              16 -> 31         : 3877     |                                        |
              32 -> 63         : 3900     |                                        |
              64 -> 127        : 2612     |                                        |
             128 -> 255        : 974      |                                        |
             256 -> 511        : 165      |                                        |
             512 -> 1023       : 13       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6743860  |****************************************|
               2 -> 3          : 2146     |                                        |
               4 -> 7          : 190      |                                        |
               8 -> 15         : 38       |                                        |
              16 -> 31         : 4        |                                        |
      
      --------------------------------------------
      w/ patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 1318454  |****************************************|
               2 -> 3          : 6800     |                                        |
               4 -> 7          : 3664     |                                        |
               8 -> 15         : 2145     |                                        |
              16 -> 31         : 809      |                                        |
              32 -> 63         : 219      |                                        |
              64 -> 127        : 10       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6854317  |****************************************|
               2 -> 3          : 2383     |                                        |
               4 -> 7          : 601      |                                        |
               8 -> 15         : 92       |                                        |
      
      2) dbench also proves the improvement,
      dbench -t 120 -D /mnt/btrfs 16
      
      w/o patch:
      Throughput 158.363 MB/sec
      
      w/ patch:
      Throughput 449.52 MB/sec
      
      3) xfstests didn't show any additional failures.
      
      One thing to note is that callers may set path->leave_spinning to keep
      all nodes in the path in spinning mode, which means those callers
      intend not to sleep before releasing the path; this change won't cause
      problems for callers that don't want to sleep in blocking mode.
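
      For reference, such a caller typically looks roughly like this (a
      sketch using the existing path helpers; error handling and the actual
      item update are omitted):

      struct btrfs_path *path;
      int ret;

      path = btrfs_alloc_path();
      if (!path)
          return -ENOMEM;
      /* keep every node in the path in spinning mode; the caller will not
       * sleep before releasing the path */
      path->leave_spinning = 1;
      ret = btrfs_search_slot(trans, root, &key, path, 0, 1);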
      
      [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root · 03a1d4c8
      Committed by Liu Bo
      rb_first_cached() trades an extra "leftmost" pointer for doing the
      same job as rb_first(), but in O(1).
      
      Functions manipulating delayed_item need to get the first entry; this
      converts them to use rb_first_cached() (see the sketch below).
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
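
      As a reminder of the API this conversion relies on, a minimal
      kernel-style sketch of the cached-rbtree pattern (illustrative names
      and a simplified insert helper, not the delayed-inode code itself):

      #include <linux/rbtree.h>
      #include <linux/types.h>

      static struct rb_root_cached ins_root = RB_ROOT_CACHED;

      /* insertion passes a "leftmost" hint so the cached pointer stays valid */
      static void item_insert(struct rb_node *node, struct rb_node *parent,
                              struct rb_node **link, bool leftmost)
      {
          rb_link_node(node, parent, link);
          rb_insert_color_cached(node, &ins_root, leftmost);
      }

      /* getting the first (leftmost) entry is now O(1) */
      static struct rb_node *item_first(void)
      {
          return rb_first_cached(&ins_root);
      }

      static void item_remove(struct rb_node *node)
      {
          rb_erase_cached(node, &ins_root);
      }
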
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Committed by Misono Tomohiro
      There are two members in struct btrfs_root which hold the root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objectid;
          ...
        }
      
      and not changed to other value after initialization.
      
      A grep in the btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names, and there seems to
      be no rule about which one should be used in a given place.

      Since ->root_key itself is needed for the tree reloc tree, remove the
      'objectid' member and unify the code to use ->root_key.objectid
      everywhere.
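
      An illustrative before/after of the conversion (not a hunk quoted from
      the actual patch):

      /* before */
      if (root->objectid == BTRFS_TREE_RELOC_OBJECTID)
          return 1;

      /* after */
      if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID)
          return 1;
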
      Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes · 3a584174
      Committed by Lu Fengqi
      Using true and false here is closer to the expected semantics than
      using 0 and 1.  No functional change.
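
      An illustrative call site after the change (argument names are a
      sketch from memory, not quoted from the patch):

      /* the boolean now documents the intent directly at the call site */
      ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes,
                                    true /* update_size */);
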
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 06 Aug 2018 (3 commits)
  5. 29 May 2018 (1 commit)
  6. 18 Apr 2018 (1 commit)
    • btrfs: delayed-inode: Remove wrong qgroup meta reservation calls · f218ea6c
      Committed by Qu Wenruo
      Commit 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for
      delayed inode and item") as merged into mainline was not the latest
      version submitted to the mailing list in Dec 2017.

      It lacks the following fixes:
      
      1) Remove btrfs_qgroup_convert_reserved_meta() call in
         btrfs_delayed_item_release_metadata()
      2) Remove btrfs_qgroup_reserve_meta_prealloc() call in
         btrfs_delayed_inode_reserve_metadata()
      
      Those fixes will resolve unexpected EDQUOT problems.
      
      Fixes: 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 12 Apr 2018 (1 commit)
  8. 31 Mar 2018 (1 commit)
  9. 26 Mar 2018 (3 commits)
  10. 29 Jan 2018 (1 commit)
  11. 25 Jan 2018 (1 commit)
    • Btrfs: fix stale entries in readdir · e4fd493c
      Committed by Josef Bacik
      In fixing the readdir+pagefault deadlock I accidentally introduced a
      stale entry regression in readdir.  If we get close to filling the
      temporary buffer, then skip a few delayed deletions, and then try to
      add another entry that won't fit, we will emit the entries we found
      and retry.  Unfortunately we delete entries from our del_list as we
      find them, assuming we won't need them.  However our pos will be that
      of whatever our last entry was, which could be before the delayed
      deletions we skipped, so the next search will add the deleted entries
      back into our readdir buffer.  So instead don't delete entries we find
      in our del_list, so that we always find our delayed deletions.  This
      is a slight perf hit for readdir with lots of pending deletions, but
      hopefully this isn't a common occurrence.  If it is, we can revisit
      this and optimize it.
      
      cc: stable@vger.kernel.org
      Fixes: 23b5ec74 ("btrfs: fix readdir deadlock with pagefault")
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 22 Jan 2018 (2 commits)
  13. 03 Jan 2018 (1 commit)
    • btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes · ec35e48b
      Committed by Chris Mason
      Refcounts have a generic implementation and an asm-optimized one.  The
      generic version has extra debugging to make sure that once a refcount
      goes to zero, refcount_inc() won't increase it.
      
      The btrfs delayed inode code wasn't expecting this, and we're tripping
      over the warnings when the generic refcounts are used.  We ended up with
      this race:
      
      Process A                                   Process B
                                                  btrfs_get_delayed_node()
                                                  spin_lock(root->inode_lock)
                                                  radix_tree_lookup()
      __btrfs_release_delayed_node()
      refcount_dec_and_test(&delayed_node->refs)
      our refcount is now zero
                                                  refcount_add(2) <---
                                                  warning here, refcount
                                                  unchanged

      spin_lock(root->inode_lock)
      radix_tree_delete()
      
      With the generic refcounts, we actually warn again when process B
      above tries to release its refcount, because refcount_add() turned
      into a no-op.
      
      We saw this in production on older kernels without the asm optimized
      refcounts.
      
      The fix used here is to use refcount_inc_not_zero() to detect when the
      object is in the middle of being freed and return NULL.  This is almost
      always the right answer anyway, since we usually end up pitching the
      delayed_node if it didn't have fresh data in it.
      
      This also changes __btrfs_release_delayed_node() to remove the extra
      check for zero refcounts before radix tree deletion.
      btrfs_get_delayed_node() was the only path that was allowing refcounts
      to go from zero to one.
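
      A sketch of the lookup pattern the fix moves to (simplified; variable
      names are illustrative and the surrounding error handling is omitted):

      spin_lock(&root->inode_lock);
      node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
      if (node && !refcount_inc_not_zero(&node->refs))
          node = NULL;   /* being freed by the final put, pretend it's gone */
      spin_unlock(&root->inode_lock);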
      
      Fixes: 6de5f18e ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
      CC: <stable@vger.kernel.org> # 4.12+
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  14. 02 Nov 2017 (1 commit)
    • btrfs: make the delalloc block rsv per inode · 69fe2d75
      Committed by Josef Bacik
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
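
      Conceptually, the per-inode scheme amounts to the following sketch
      (calc_inode_rsv_size(), reserve_bytes() and release_bytes() are
      hypothetical stand-ins, not the btrfs implementation):

      /* recompute what this inode should have reserved right now ... */
      u64 new_size = calc_inode_rsv_size(inode);

      /* ... and only reserve or release the delta against what it holds */
      if (new_size > inode->block_rsv.reserved)
          ret = reserve_bytes(new_size - inode->block_rsv.reserved);
      else
          release_bytes(inode->block_rsv.reserved - new_size);
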
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 18 Aug 2017 (1 commit)
  16. 18 Apr 2017 (2 commits)
  17. 28 Feb 2017 (1 commit)
  18. 14 Feb 2017 (10 commits)