1. 31 May 2018, 2 commits
    • btrfs: Factor out write portion of btrfs_get_blocks_direct · c5794e51
      Committed by Nikolay Borisov
      Now that the read side is extracted into its own function, do the
      same to the write side. This leaves btrfs_get_blocks_direct_write
      with the sole purpose of handling the common locking required. Also
      flip the condition in btrfs_get_blocks_direct_write so that the
      write case comes first and we check for if (create) rather than
      if (!create). This is purely subjective, but I believe it makes the
      code read a bit more "linearly". No functional changes. (A sketch
      of the resulting shape follows this entry.)
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
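      A minimal user-space sketch (not the kernel source; all names here
      are illustrative stand-ins) of the shape the two refactoring
      commits converge on: the common entry point owns the locking and
      dispatches to a read or write subvariant, testing 'create' first:

      #include <stdio.h>

      /* Hypothetical stand-ins for the kernel helpers. */
      static void lock_extent_range(long start, long len)   { (void)start; (void)len; }
      static void unlock_extent_range(long start, long len) { (void)start; (void)len; }
      static int get_blocks_direct_read(long start, long len)  { (void)start; (void)len; return 0; }
      static int get_blocks_direct_write(long start, long len) { (void)start; (void)len; return 0; }

      static int get_blocks_direct(long start, long len, int create)
      {
              int ret;

              lock_extent_range(start, len);
              if (create)                     /* write case first */
                      ret = get_blocks_direct_write(start, len);
              else
                      ret = get_blocks_direct_read(start, len);
              unlock_extent_range(start, len);
              return ret;
      }

      int main(void)
      {
              printf("write: %d, read: %d\n",
                     get_blocks_direct(0, 4096, 1),
                     get_blocks_direct(0, 4096, 0));
              return 0;
      }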
    • btrfs: Factor out read portion of btrfs_get_blocks_direct · 1c8d0175
      Committed by Nikolay Borisov
      Currently this function handles both the READ and WRITE dio cases.
      This is facilitated by a bunch of 'if' statements, a short-circuiting
      goto, and a rather perverse aliasing of the "!create" (READ) case:
      lockstart is set equal to lockend, and lockstart < lockend is then
      used to detect the write case. Let's simplify this mess by extracting
      the READ-only code into a separate __btrfs_get_block_direct_read
      function. This is only the first step; the next one will be to factor
      out the write side as well. The end goal is to have the common
      locking/unlocking code in btrfs_get_blocks_direct, which will then
      call either the read or write subvariant. No functional changes. (A
      sketch of the aliasing trick being removed follows this entry.)
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
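      For illustration, a self-contained sketch of the aliasing trick this
      commit removes; the variable names are modeled on the description
      above, not copied from the kernel source:

      #include <stdio.h>

      int main(void)
      {
              unsigned long long start = 4096, len = 8192;
              unsigned long long lockstart, lockend;
              int create = 0;   /* READ */

              /* The old code aliased the READ case instead of simply
               * testing 'create': */
              if (create) {
                      lockstart = start;
                      lockend = start + len - 1;
              } else {
                      lockstart = lockend = start;   /* empty range */
              }

              /* ...and later detected the write side like this: */
              if (lockstart < lockend)
                      printf("write path: unlock [%llu, %llu]\n",
                             lockstart, lockend);
              else
                      printf("read path: nothing extra to unlock\n");
              return 0;
      }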
  2. 30 May 2018, 5 commits
    • btrfs: return error value if create_io_em failed in cow_file_range · 090a127a
      Committed by Su Yue
      In cow_file_range(), create_io_em() may fail, but its return value
      is not recorded, so cow_file_range() may return 0 even though
      create_io_em() failed, which is wrong.
      
      Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.
      (A small user-space model of the pattern follows this entry.)
      
      Fixes: 6f9994db ("Btrfs: create a helper to create em for IO")
      CC: stable@vger.kernel.org # 4.11+
      Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
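      A small self-contained model of the fix's pattern, with user-space
      stand-ins for the kernel's ERR_PTR machinery (create_io_em_stub is
      hypothetical; only the IS_ERR()/PTR_ERR() shape mirrors the change):

      #include <errno.h>
      #include <stdio.h>

      #define MAX_ERRNO 4095
      #define ERR_PTR(err)  ((void *)(long)(err))
      #define PTR_ERR(ptr)  ((long)(ptr))
      #define IS_ERR(ptr)   ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)

      static void *create_io_em_stub(int fail)
      {
              return fail ? ERR_PTR(-ENOMEM) : (void *)0x1000;
      }

      int main(void)
      {
              int ret = 0;
              void *em = create_io_em_stub(1);

              /* The fix: record and propagate the error instead of
               * silently returning 0. */
              if (IS_ERR(em))
                      ret = PTR_ERR(em);

              printf("ret = %d\n", ret);   /* -ENOMEM here, not 0 */
              return ret ? 1 : 0;
      }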
    • btrfs: drop unused parameter qgroup_reserved · c4c129db
      Committed by Gu JinXiang
      Since commit 7775c818 ("btrfs: remove unused parameter from
      btrfs_subvolume_release_metadata"), the qgroup_reserved parameter
      has not been used by the caller of btrfs_subvolume_reserve_metadata,
      so remove it. (A prototype-level sketch follows this entry.)
      Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
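      A prototype-level sketch of the interface change; the exact
      parameter list is reconstructed from memory and should be treated
      as an assumption, not the verbatim kernel signature:

      #include <stdbool.h>

      typedef unsigned long long u64;   /* stand-in for the kernel typedef */

      struct btrfs_root;
      struct btrfs_block_rsv;

      /*
       * Before (assumed): the write-only out-parameter was still there.
       *
       * int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
       *                                      struct btrfs_block_rsv *rsv,
       *                                      int items, u64 *qgroup_reserved,
       *                                      bool use_global_rsv);
       */

      /* After: qgroup_reserved is gone. */
      int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
                                           struct btrfs_block_rsv *rsv,
                                           int items, bool use_global_rsv);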
    • btrfs: balance dirty metadata pages in btrfs_finish_ordered_io · e73e81b6
      Committed by Ethan Lien
      [Problem description and how we fix it]
      We should balance dirty metadata pages at the end of
      btrfs_finish_ordered_io, since a small, unmergeable random write can
      potentially produce dirty metadata which is multiple times larger than
      the data itself. For example, a small, unmergeable 4KiB write may
      produce:
      
          16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
      
      Although we do balance dirty pages on the write side, in the
      buffered write path most metadata is dirtied only after we reach
      the dirty background limit (which so far only counts dirty data
      pages) and wake up the flusher thread. If there are many small,
      unmergeable random writes spread across a large btree, we'll see a
      burst of dirty pages exceeding the dirty_bytes limit after the
      flusher thread wakes up, which is not what we expect. On our
      machine this caused an out-of-memory problem, since a page cannot
      be dropped while it is marked dirty.
      
      One might worry that we could sleep in
      btrfs_btree_balance_dirty_nodelay, but since btrfs_finish_ordered_io
      runs in a separate worker, sleeping there will not stop the flusher
      from consuming dirty pages. Also, because metadata writeback endio
      uses a different worker, sleeping in btrfs_finish_ordered_io helps
      us throttle the amount of dirty metadata pages. (A sketch of where
      the call lands follows this entry.)
      
      [Reproduction steps]
      To reproduce the problem, we need to do 4KiB writes randomly spread
      across a large btree. On our 2GiB RAM machine:
      
      1) Create 4 subvolumes.
      2) Run fio on each subvolume:
      
         [global]
         direct=0
         rw=randwrite
         ioengine=libaio
         bs=4k
         iodepth=16
         numjobs=1
         group_reporting
         size=128G
         runtime=1800
         norandommap
         time_based
         randrepeat=0
      
      3) Take a snapshot of each subvolume and repeat fio on the existing
         files.
      4) Repeat step (3) until we get large btrees.
         In our case, observing btrfs_root_item->bytes_used, we had 2GiB
         of metadata in each subvolume tree and 12GiB of metadata in the
         extent tree.
      5) Stop all fio jobs, take a snapshot again, and wait until all
         delayed work has completed.
      6) Start all fio jobs. A few seconds later we hit OOM when the
         flusher starts to work.
      
      It can be reproduced even with nocow writes.
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
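      A minimal sketch of where the throttling call lands, using
      hypothetical stand-ins for the kernel functions (the real
      btrfs_btree_balance_dirty_nodelay takes an fs_info argument):

      #include <stdio.h>

      static void btree_balance_dirty_nodelay(void)
      {
              /* In the kernel this throttles the caller when dirty
               * metadata pages pile up; here it is just a placeholder. */
              puts("balance dirty metadata pages");
      }

      static void finish_ordered_io(void)
      {
              /* ... complete the ordered extent: update the subvolume,
               * checksum and extent trees, dirtying metadata pages ... */

              /* The fix: throttle here, at the end of completion. This
               * runs in its own worker, so sleeping does not stall the
               * flusher. */
              btree_balance_dirty_nodelay();
      }

      int main(void)
      {
              finish_ordered_io();
              return 0;
      }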
    • btrfs: lift some btrfs_cross_ref_exist checks in nocow path · 78d4295b
      Committed by Ethan Lien
      In the nocow path, we check whether the extent is snapshotted in
      btrfs_cross_ref_exist(). We can do a similar check earlier and
      avoid an unnecessary search of the extent tree. (An illustrative
      model follows this entry.)
      
      A fio test on an Intel D-1531 machine with 16GB RAM and an SSD
      RAID-5 array, configured as follows:
      
      [global]
      group_reporting
      time_based
      thread=1
      ioengine=libaio
      bs=4k
      iodepth=32
      size=64G
      runtime=180
      numjobs=8
      rw=randwrite
      
      [file1]
      filename=/mnt/nocow/testfile
      
      IOPS result:   unpatched     patched
      
      1 fio round:     46670        46958
      snapshot
      2 fio round:     51826        54498
      3 fio round:     59767        61289
      
      After the snapshot, the first fio round gets about a 5% performance
      gain. As we keep writing to the same file, all writes resume nocow
      mode and eventually there is no performance gain.
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
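      An illustrative user-space model of the kind of early check the
      commit describes. The specific rule shown (comparing the extent's
      generation against the root's last snapshot generation) is an
      assumption for illustration, not a quote of the kernel code:

      #include <stdio.h>

      typedef unsigned long long u64;

      static int may_nocow(u64 extent_gen, u64 last_snapshot_gen)
      {
              if (extent_gen <= last_snapshot_gen)
                      return 0;   /* snapshotted: cow, skip the search */
              /* ... otherwise fall through to the full
               * btrfs_cross_ref_exist() check, which searches the
               * extent tree ... */
              return 1;
      }

      int main(void)
      {
              printf("%d\n", may_nocow(100, 120));  /* 0: must cow */
              printf("%d\n", may_nocow(130, 120));  /* 1: nocow candidate */
              return 0;
      }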
    • btrfs: Remove fs_info argument from btrfs_uuid_tree_rem · d1957791
      Committed by Lu Fengqi
      This function always takes a transaction handle, which contains a
      reference to the fs_info. Use that and remove the extra argument.
      (A minimal sketch of the pattern follows this entry.)
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      [ rename the function ]
      Signed-off-by: David Sterba <dsterba@suse.com>
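      A minimal sketch of the pattern, with a simplified model of the
      transaction handle (the real struct has many more fields):

      struct btrfs_fs_info;

      struct btrfs_trans_handle {
              struct btrfs_fs_info *fs_info;
      };

      /* Callees take only the handle and derive fs_info from it instead
       * of receiving both as arguments. */
      static struct btrfs_fs_info *handle_fs_info(struct btrfs_trans_handle *trans)
      {
              return trans->fs_info;
      }

      int main(void)
      {
              struct btrfs_trans_handle trans = { 0 };
              return handle_fs_info(&trans) ? 1 : 0;
      }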
  3. 29 May 2018, 31 commits
  4. 24 May 2018, 1 commit
    • Btrfs: fix error handling in btrfs_truncate() · d5014738
      Committed by Omar Sandoval
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make
      sure we only set err if ret is an error (i.e., negative). (A
      minimal model of the fix follows this entry.)
      
      To reproduce the issue, mount a filesystem with -o compress-force=zstd;
      the following program will then see ftruncate() return 1:
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: Jun Wu <quark@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
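      A minimal self-contained model of the fix; NEED_TRUNCATE_BLOCK's
      value and the helper are stand-ins, and only the sign check mirrors
      the change:

      #include <stdio.h>

      #define NEED_TRUNCATE_BLOCK 1   /* positive: a hint, not an error */

      /* Stand-in for btrfs_truncate_inode_items() returning the hint. */
      static int truncate_inode_items(void)
      {
              return NEED_TRUNCATE_BLOCK;
      }

      int main(void)
      {
              int err = 0;
              int ret = truncate_inode_items();

              /* Buggy version: 'err = ret' on any non-zero return, so
               * the positive hint leaked out and ftruncate() returned 1. */

              /* Fixed version: only negative returns are errors. */
              if (ret < 0)
                      err = ret;

              printf("ret=%d err=%d\n", ret, err);   /* ret=1 err=0 */
              return 0;
      }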
  5. 17 May 2018, 1 commit