1. 15 10月, 2018 11 次提交
    • L
      Btrfs: kill btrfs_clear_path_blocking · 52398340
      Liu Bo 提交于
      Btrfs's btree locking has two modes, spinning mode and blocking mode,
      while searching btree, locking is always acquired in spinning mode and
      then converted to blocking mode if necessary, and in some hot paths we may
      switch the locking back to spinning mode by btrfs_clear_path_blocking().
      
      When acquiring locks, both of reader and writer need to wait for blocking
      readers and writers to complete before doing read_lock()/write_lock().
      
      The problem is that btrfs_clear_path_blocking() needs to switch nodes
      in the path to blocking mode at first (by btrfs_set_path_blocking) to
      make lockdep happy before doing its actual clearing blocking job.
      
      When switching to blocking mode from spinning mode, it consists of
      
      step 1) bumping up blocking readers counter and
      step 2) read_unlock()/write_unlock(),
      
      this has caused serious ping-pong effect if there're a great amount of
      concurrent readers/writers, as waiters will be woken up and go to
      sleep immediately.
      
      1) Killing this kind of ping-pong results in a big improvement in my 1600k
      files creation script,
      
      MNT=/mnt/btrfs
      mkfs.btrfs -f /dev/sdf
      mount /dev/def $MNT
      time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
              -d  $MNT/0  -d  $MNT/1 \
              -d  $MNT/2  -d  $MNT/3 \
              -d  $MNT/4  -d  $MNT/5 \
              -d  $MNT/6  -d  $MNT/7 \
              -d  $MNT/8  -d  $MNT/9 \
              -d  $MNT/10  -d  $MNT/11 \
              -d  $MNT/12  -d  $MNT/13 \
              -d  $MNT/14  -d  $MNT/15
      
      w/o patch:
      real    2m27.307s
      user    0m12.839s
      sys     13m42.831s
      
      w/ patch:
      real    1m2.273s
      user    0m15.802s
      sys     8m16.495s
      
      1.1) latency histogram from funclatency[1]
      
      Overall with the patch, there're ~50% less write lock acquisition and
      the 95% max latency that write lock takes also reduces to ~100ms from
      >500ms.
      
      --------------------------------------------
      w/o patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 2385222  |****************************************|
               2 -> 3          : 37147    |                                        |
               4 -> 7          : 20452    |                                        |
               8 -> 15         : 13131    |                                        |
              16 -> 31         : 3877     |                                        |
              32 -> 63         : 3900     |                                        |
              64 -> 127        : 2612     |                                        |
             128 -> 255        : 974      |                                        |
             256 -> 511        : 165      |                                        |
             512 -> 1023       : 13       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6743860  |****************************************|
               2 -> 3          : 2146     |                                        |
               4 -> 7          : 190      |                                        |
               8 -> 15         : 38       |                                        |
              16 -> 31         : 4        |                                        |
      
      --------------------------------------------
      w/ patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 1318454  |****************************************|
               2 -> 3          : 6800     |                                        |
               4 -> 7          : 3664     |                                        |
               8 -> 15         : 2145     |                                        |
              16 -> 31         : 809      |                                        |
              32 -> 63         : 219      |                                        |
              64 -> 127        : 10       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6854317  |****************************************|
               2 -> 3          : 2383     |                                        |
               4 -> 7          : 601      |                                        |
               8 -> 15         : 92       |                                        |
      
      2) dbench also proves the improvement,
      dbench -t 120 -D /mnt/btrfs 16
      
      w/o patch:
      Throughput 158.363 MB/sec
      
      w/ patch:
      Throughput 449.52 MB/sec
      
      3) xfstests didn't show any additional failures.
      
      One thing to note is that callers may set path->leave_spinning to have
      all nodes in the path stay in spinning mode, which means callers are
      ready to not sleep before releasing the path, but it won't cause
      problems if they don't want to sleep in blocking mode.
      
      [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.pySigned-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      52398340
    • D
      btrfs: dev-replace: move replace members out of fs_info · 7f8d236a
      David Sterba 提交于
      The replace_wait and bio_counter were mistakenly added to fs_info in
      commit c404e0dc ("Btrfs: fix use-after-free in the finishing
      procedure of the device replace"), but they logically belong to
      fs_info::dev_replace. Besides, bio_counter is a very generic name and is
      confusing in bare fs_info context.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f8d236a
    • D
      btrfs: remove btrfs_dev_replace::read_locks · 3280f874
      David Sterba 提交于
      This member seems to be copied from the extent_buffer locking scheme and
      is at least used to assert that the read lock/unlock is properly nested.
      In some way. While the _inc/_dec are called inside the read lock
      section, the asserts are both inside and outside, so the ordering is not
      guaranteed and we can see read/inc/dec ordered in any way
      (theoretically).
      
      A missing call of btrfs_dev_replace_clear_lock_blocking could cause
      unexpected read_locks count, so this at least looks like a valid
      assertion, but this will become unnecessary with later updates.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3280f874
    • D
      btrfs: tests: polish ifdefs around testing helper · b2fa1154
      David Sterba 提交于
      Avoid the inline ifdefs and use two sections for self-tests enabled and
      disabled.
      
      Though there could be no ifdef and unconditional test_bit of
      BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
      any code that would depend on conditions using btrfs_is_testing.
      
      As this is only for the testing code, drop unlikely().
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b2fa1154
    • D
      a654666a
    • D
      btrfs: tests: move testing members of struct btrfs_root to the end · 57ec5fb4
      David Sterba 提交于
      The data used only for tests are better placed at the end of the
      structure so that they don't change the structure layout. All new
      members of btrfs_root should be placed before.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      57ec5fb4
    • D
      btrfs: tests: add separate stub for find_lock_delalloc_range · 9c36396c
      David Sterba 提交于
      The helper find_lock_delalloc_range is now conditionally built static,
      dpending on whether the self-tests are enabled or not. There's a macro
      that is supposed to hide the export, used only once. To discourage
      further use, drop it an add a public wrapper for the helper needed by
      tests.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c36396c
    • L
      Btrfs: fix alignment in declaration and prototype of btrfs_get_extent · de2c6615
      Liu Bo 提交于
      This fixes btrfs_get_extent to be consistent with our existing
      declaration style.
      
      Note: For the record, indentation styles that are accepted are both,
      aligning under the opening ( and tab or double tab indentation on the
      next line.  Preferrably not spliting the type or long expressions in the
      argument lists.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      [ add note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de2c6615
    • M
      btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro 提交于
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objecitd;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objecitd' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fd786e6
    • L
      btrfs: Remove root parameter from btrfs_insert_dir_item · 684572df
      Lu Fengqi 提交于
      All callers pass the root tree of dir, we can push that down to the
      function itself.
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      684572df
    • L
      btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes · 3a584174
      Lu Fengqi 提交于
      Using true and false here is closer to the expected semantic than using
      0 and 1.  No functional change.
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3a584174
  2. 24 8月, 2018 1 次提交
  3. 18 8月, 2018 1 次提交
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  4. 06 8月, 2018 20 次提交
  5. 07 7月, 2018 2 次提交
  6. 07 6月, 2018 1 次提交
  7. 30 5月, 2018 3 次提交
  8. 29 5月, 2018 1 次提交