1. 30 May 2018, 4 commits
    • btrfs: drop unused parameter qgroup_reserved · c4c129db
      Committed by Gu JinXiang
      Since commit 7775c818 ("btrfs: remove unused parameter from
      btrfs_subvolume_release_metadata"), the parameter qgroup_reserved is no
      longer used by any caller of btrfs_subvolume_reserve_metadata, so
      remove it.
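
      A minimal sketch of the resulting signature change (illustrative,
      reconstructed from the description above rather than copied from the
      upstream diff):

          /* before: callers ignored the qgroup_reserved out-parameter */
          int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
                                               struct btrfs_block_rsv *rsv,
                                               int items, u64 *qgroup_reserved,
                                               bool use_global_rsv);

          /* after: the unused parameter is dropped */
          int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
                                               struct btrfs_block_rsv *rsv,
                                               int items, bool use_global_rsv);
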
      Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: balance dirty metadata pages in btrfs_finish_ordered_io · e73e81b6
      Committed by Ethan Lien
      [Problem description and how we fix it]
      We should balance dirty metadata pages at the end of
      btrfs_finish_ordered_io, since a small, unmergeable random write can
      potentially produce dirty metadata which is multiple times larger than
      the data itself. For example, a small, unmergeable 4KiB write may
      produce:
      
          16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
      
      Although we do balance dirty pages on the write side, in the buffered
      write path most metadata is dirtied only after we reach the dirty
      background limit (which so far only counts dirty data pages) and wake
      up the flusher thread. If there are many small, unmergeable random
      writes spread across a large btree, we'll see a burst of dirty pages
      exceeding the dirty_bytes limit right after the flusher thread wakes
      up - which is not what we expect. On our machine, it caused an
      out-of-memory problem, since a page cannot be dropped while it is
      marked dirty.
      
      One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay,
      but since btrfs_finish_ordered_io runs in a separate worker, it will
      not stop the flusher from consuming dirty pages. Also, because we use a
      different worker for metadata writeback endio, sleeping in
      btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
      pages.
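
      The gist of the fix, as a minimal sketch (the exact context in the
      upstream patch may differ slightly):

          static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
          {
                  struct btrfs_fs_info *fs_info = ...; /* from the inode */
                  int ret = 0;

                  /* ... process the ordered extent, dirtying btree pages ... */

                  /*
                   * Throttle dirty metadata before returning, without
                   * blocking the flusher, so a burst of small random writes
                   * cannot pile up dirty btree pages unchecked.
                   */
                  btrfs_btree_balance_dirty_nodelay(fs_info);
                  return ret;
          }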
      
      [Reproduce steps]
      To reproduce the problem, we need to do 4KiB writes randomly spread
      across a large btree. On our 2GiB RAM machine:
      
      1) Create 4 subvolumes.
      2) Run fio on each subvolume:
      
         [global]
         direct=0
         rw=randwrite
         ioengine=libaio
         bs=4k
         iodepth=16
         numjobs=1
         group_reporting
         size=128G
         runtime=1800
         norandommap
         time_based
         randrepeat=0
      
      3) Take a snapshot of each subvolume and repeat fio on the existing
         files.
      4) Repeat step (3) until we get large btrees.
         In our case, by observing btrfs_root_item->bytes_used, we have 2GiB
         of metadata in each subvolume tree and 12GiB of metadata in the
         extent tree.
      5) Stop all fio jobs, take snapshots again, and wait until all delayed
         work has completed.
      6) Start all fio jobs. A few seconds later we hit OOM when the flusher
         starts to work.
      
      The problem can be reproduced even with nocow writes.
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: lift some btrfs_cross_ref_exist checks in nocow path · 78d4295b
      Committed by Ethan Lien
      In the nocow path, we check whether the extent is snapshotted in
      btrfs_cross_ref_exist(). We can do a similar check earlier and avoid an
      unnecessary search into the extent tree.
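
      The lifted check, roughly (a sketch assuming the same generation
      comparison that btrfs_cross_ref_exist() performs internally):

          /*
           * If the extent was created before the last snapshot of the
           * root, it must be shared, so nocow is impossible. Checking
           * this here avoids a btrfs_search_slot() into the extent tree.
           */
          if (btrfs_file_extent_generation(leaf, fi) <=
              btrfs_root_last_snapshot(&root->root_item))
                  goto out_check;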
      
      A fio test on an Intel D-1531 machine with 16GB RAM and SSD RAID-5, as
      follows:
      
      [global]
      group_reporting
      time_based
      thread=1
      ioengine=libaio
      bs=4k
      iodepth=32
      size=64G
      runtime=180
      numjobs=8
      rw=randwrite
      
      [file1]
      filename=/mnt/nocow/testfile
      
      IOPS results:      unpatched    patched

      1st fio round:       46670       46958
      (snapshot taken)
      2nd fio round:       51826       54498
      3rd fio round:       59767       61289
      
      After the snapshot, the first fio run gets about a 5% performance gain.
      As we keep writing to the same file, all writes return to nocow mode
      and the performance gain eventually disappears.
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Remove fs_info argument from btrfs_uuid_tree_rem · d1957791
      Committed by Lu Fengqi
      This function always takes a transaction handle which contains a
      reference to the fs_info. Use that and remove the extra argument.
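
      The shape of the change, sketched (the parameter list is abbreviated
      and not guaranteed to match the upstream signature exactly):

          /* before */
          int btrfs_uuid_tree_rem(struct btrfs_trans_handle *trans,
                                  struct btrfs_fs_info *fs_info,
                                  u8 *uuid, u8 type, u64 subid);

          /* after: derive fs_info from the transaction handle */
          int btrfs_uuid_tree_rem(struct btrfs_trans_handle *trans,
                                  u8 *uuid, u8 type, u64 subid)
          {
                  struct btrfs_fs_info *fs_info = trans->fs_info;
                  /* ... */
          }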
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      [ rename the function ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 29 May 2018, 31 commits
  3. 24 May 2018, 1 commit
    • Btrfs: fix error handling in btrfs_truncate() · d5014738
      Committed by Omar Sandoval
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
      only set err if ret is an error (i.e., negative).
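
      In other words, a sketch of the intent (not the literal diff):

          ret = btrfs_truncate_inode_items(trans, root, inode, inode->i_size,
                                           BTRFS_EXTENT_DATA_KEY);
          /*
           * NEED_TRUNCATE_BLOCK (positive) only means there is more work
           * to do; propagate to err only on real (negative) errors.
           */
          if (ret < 0)
                  err = ret;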
      
      To reproduce the issue, mount a filesystem with -o compress-force=zstd;
      the following program will then get a return value of 1 from
      ftruncate():
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: Jun Wu <quark@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 17 May 2018, 1 commit
  5. 12 May 2018, 1 commit
    • do d_instantiate/unlock_new_inode combinations safely · 1e2e547a
      Committed by Al Viro
      For anything NFS-exported we do _not_ want to unlock new inode
      before it has grown an alias; original set of fixes got the
      ordering right, but missed the nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window when
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
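
      A sketch of the combined primitive, following the ordering described
      above (the upstream helper may carry additional sanity checks):

          void d_instantiate_new(struct dentry *entry, struct inode *inode)
          {
                  /* 1) annotate before anyone can touch ->i_mutex */
                  lockdep_annotate_inode_mutex_key(inode);
                  spin_lock(&inode->i_lock);
                  /* 2) grow the alias while the inode is still I_NEW */
                  __d_instantiate(entry, inode);
                  WARN_ON(!(inode->i_state & I_NEW));
                  /* 3) the rest of unlock_new_inode(): clear I_NEW, wake waiters */
                  inode->i_state &= ~I_NEW;
                  smp_mb();
                  wake_up_bit(&inode->i_state, __I_NEW);
                  spin_unlock(&inode->i_lock);
          }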
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: Mike Marshall <hubcap@omnibond.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 19 April 2018, 1 commit
  7. 12 April 2018, 1 commit