1. 30 Nov 2016, 2 commits
    • btrfs: remove unused headers, statfs.h · 926b9233
      Authored by David Sterba
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: deal with existing encompassing extent map in btrfs_get_extent() · 8e2bd3b7
      Authored by Omar Sandoval
      My QEMU VM was seeing inexplicable I/O errors that I tracked down to
      errors coming from the qcow2 virtual drive in the host system. The qcow2
      file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT.
      Every once in a while, pread() or pwrite() would return EEXIST, which
      makes no sense. This turned out to be a bug in btrfs_get_extent().
      
      Commit 8dff9c85 ("Btrfs: deal with duplciates during extent_map
      insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where
      two threads race on adding the same extent map to an inode's extent map
      tree. However, if the added em is merged with an adjacent em in the
      extent tree, then we'll end up with an existing extent that is not
      identical to but instead encompasses the extent we tried to add. When we
      call merge_extent_mapping() to find the nonoverlapping part of the new
      em, the arithmetic overflows because there is no such thing. We then end
      up trying to add a bogus em to the em_tree, which results in an EEXIST
      that can bubble all the way up to userspace.
      
      Fix it by extending the identical extent map special case.
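
      A rough sketch of the shape of that special case (illustrative only, not
      the exact code in btrfs_get_extent(); extent_map_end() is the existing
      helper returning em->start + em->len):

          /* add_extent_mapping() returned -EEXIST; look up what is there */
          existing = search_extent_mapping(em_tree, start, len);
          if (existing->start <= em->start &&
              extent_map_end(existing) >= extent_map_end(em)) {
                  /*
                   * The existing extent map fully encompasses the one we
                   * tried to insert (e.g. it was merged with a neighbour),
                   * so use it instead of computing a bogus non-overlapping
                   * range.
                   */
                  free_extent_map(em);
                  em = existing;
                  err = 0;
          } else {
                  /* genuine partial overlap: trim the new em as before */
                  err = merge_extent_mapping(em_tree, existing, em, start);
          }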
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 29 Nov 2016, 1 commit
  3. 25 Oct 2016, 2 commits
    • btrfs: pass correct args to btrfs_async_run_delayed_refs() · dd4b857a
      Authored by Wang Xiaoguang
      In the call to btrfs_async_run_delayed_refs() from
      btrfs_truncate_inode_items(), arg2 and arg3 are passed in the wrong
      order; fix this.
      
      This bug only affects asynchronous delayed ref handling when we truncate
      inodes. delayed_ref_async_start() contains the following code:
      
          trans = btrfs_join_transaction(async->root);
          if (trans->transid > async->transid)
              goto end;
          ret = btrfs_run_delayed_refs(trans, async->root, async->count);
      
      From this code we can see that the bug only influences whether we run
      delayed refs here and how many of them we run. This may impact
      performance, but it will not result in missing delayed refs; all delayed
      refs will still be handled in btrfs_commit_transaction().
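
      Sketched against the signature btrfs_async_run_delayed_refs(root, count,
      transid, wait) assumed here (with updates standing for the pending
      delayed ref updates), the fix boils down to swapping the two middle
      arguments:

          /* before: the transid was passed as the count and vice versa */
          btrfs_async_run_delayed_refs(root, trans->transid, updates * 2, 0);

          /* after: arg2 is the number of refs to run, arg3 is the transid */
          btrfs_async_run_delayed_refs(root, updates * 2, trans->transid, 0);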
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Prevent qgroup->reserved from going subzero · 0b34c261
      Authored by Goldwyn Rodrigues
      While freeing qgroup->reserved resources, we must check whether the page
      has been invalidated by a truncate operation, by checking if the page is
      still dirty before reducing the qgroup resources. Resources in such a
      case are freed when the entire extent is released by the delayed ref
      (a sketch of this check follows the testcase below).

      This fixes double accounting when releasing resources in the case of
      truncating a file, reproduced by the following testcase.
      
      SCRATCH_DEV=/dev/vdb
      SCRATCH_MNT=/mnt
      mkfs.btrfs -f $SCRATCH_DEV
      mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
      cd $SCRATCH_MNT
      btrfs quota enable $SCRATCH_MNT
      btrfs subvolume create a
      btrfs qgroup limit 500m a $SCRATCH_MNT
      sync
      for c in {1..15}; do
          dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file;
      done
      
      sleep 10
      sync
      sleep 5
      
      touch $SCRATCH_MNT/a/newfile
      
      echo "Removing file"
      rm $SCRATCH_MNT/a/file
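
      A minimal sketch of the dirty-page check described above, assuming it
      sits in the page invalidation path and uses the existing
      btrfs_qgroup_free_data() helper (the actual patch may be arranged
      differently):

          /*
           * Only return the qgroup data reservation for pages that are still
           * dirty: a clean page's extent has already been written, and its
           * reservation is freed when the delayed ref releases the extent.
           */
          if (PageDirty(page))
                  btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);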
      
      Fixes: b9d0b389 ("btrfs: Add handler for invalidate page")
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 11 Oct 2016, 1 commit
    • [btrfs] fix check_direct_IO() for non-iovec iterators · cd27e455
      Authored by Al Viro
      Looking for duplicate ->iov_base makes sense only for iovec-backed
      iterators. For kvec-backed ones it is pointless; for bvec-backed ones
      it is pointless and broken on 32bit (we walk through an array of
      struct bio_vec, accessing the entries as if they were struct iovec;
      that works by accident on 64bit, but on 32bit it will blow up); and
      for pipe-backed ones it is pointless and ends up oopsing.
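
      A sketch of the resulting guard (hedged; iter_is_iovec() is the existing
      predicate, and the rest of check_direct_IO() is elided):

          /* only iovec-backed iterators can meaningfully repeat ->iov_base */
          if (!iter_is_iovec(iter))
                  goto out;

          /* ... the duplicate ->iov_base scan now runs only for iovecs ... */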
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 08 Oct 2016, 1 commit
  6. 28 Sep 2016, 1 commit
  7. 27 Sep 2016, 3 commits
  8. 26 Sep 2016, 2 commits
  9. 22 Sep 2016, 1 commit
  10. 16 Sep 2016, 1 commit
  11. 14 Sep 2016, 1 commit
  12. 25 Aug 2016, 1 commit
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Authored by Wang Xiaoguang
      This patch fixes some false ENOSPC errors; the test script below can
      reproduce one of them:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The above script fails with ENOSPC, but the fs actually still has free
      space to satisfy the request. See the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      On CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
      is delayed until extent_clear_unlock_delalloc() runs.
      Assume that in this case btrfs_reserve_extent() reserved 128MB of data
      and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data
      space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk,
      call btrfs_start_delalloc_roots(), or commit the current transaction in
      order to reserve some free space, which is obviously a lot of work. But
      none of that is necessary if bytes_may_use is decreased in a timely
      manner: once the 128M is subtracted from bytes_may_use, there is still
      enough free space.
      
      To fix this issue, this patch updates bytes_may_use for both data and
      metadata in btrfs_add_reserved_bytes(). For the compression path, the
      real extent length may not be equal to the file content length, so
      introduce a ram_bytes argument for btrfs_reserve_extent(),
      find_free_extent() and btrfs_add_reserved_bytes(); this is needed
      because bytes_may_use is increased by the file content length, and with
      it the compression path can update bytes_may_use correctly. With this in
      place, RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE can also
      be dropped.
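
      A simplified sketch of what btrfs_add_reserved_bytes() then does with
      the new ram_bytes argument (error handling and the read-only block group
      case are trimmed, so this is illustrative rather than the exact patch):

          static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache,
                                              u64 ram_bytes, u64 num_bytes,
                                              int delalloc)
          {
                  struct btrfs_space_info *space_info = cache->space_info;

                  /* delalloc accounting omitted in this sketch */
                  spin_lock(&space_info->lock);
                  spin_lock(&cache->lock);
                  cache->reserved += num_bytes;
                  space_info->bytes_reserved += num_bytes;
                  /*
                   * bytes_may_use was charged with the uncompressed length
                   * (ram_bytes) when the data space was reserved, so release
                   * that amount here instead of at the end of fallocate or
                   * writeback.
                   */
                  space_info->bytes_may_use -= ram_bytes;
                  spin_unlock(&cache->lock);
                  spin_unlock(&space_info->lock);
                  return 0;
          }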
      
      EXTENT_DO_ACCOUNTING is normally used for the error path. In
      run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
      PREALLOC, we also need to update bytes_may_use, but we cannot pass
      EXTENT_DO_ACCOUNTING because it also clears the metadata reservation.
      So introduce an EXTENT_CLEAR_DATA_RESV flag that tells
      btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.
      
      Meanwhile, __btrfs_prealloc_file_range() now calls
      btrfs_free_reserved_data_space() internally on both the successful and
      the failed paths, so btrfs_prealloc_file_range()'s callers no longer
      need to call btrfs_free_reserved_data_space() themselves.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 08 Aug 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Authored by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portion. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
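
      For illustration (this snippet is not from the commit itself), a typical
      user before and after the rename:

          /* before: request flags were OR'd into bi_rw */
          bio->bi_rw |= REQ_SYNC;

          /* after: the combined op + flags field is bi_opf, so code like the
           * line above now fails to compile instead of silently misbehaving */
          bio->bi_opf |= REQ_SYNC;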
      Signed-off-by: Jens Axboe <axboe@fb.com>
  14. 01 Aug 2016, 2 commits
    • Btrfs: improve performance on fsync against new inode after rename/unlink · 44f714da
      Authored by Filipe Manana
      With commit 56f23fdb ("Btrfs: fix file/data loss caused by fsync after
      rename and new inode") we got a simple fix for a functional issue when
      the following sequence of actions is done:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        move/rename existing file A from directory D to directory E
        create a new file named A at directory D
        fsync the new file
        power fail
      
      The solution was to simply detect such a scenario and fall back to a
      full transaction commit when we detect it. However, this turned out to
      have a significant impact on throughput (and a bit on latency too) for
      benchmarks using the dbench tool, which simulates real workloads from
      smbd (Samba) servers. For example on a test vm (with a debug kernel):
      
      Unpatched:
      Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms
      
      Patched:
      Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms
      
      The patched results (this patch is applied) are similar to the results of
      a kernel with the commit 56f23fdb ("Btrfs: fix file/data loss caused
      by fsync after rename and new inode") reverted.
      
      This change avoids the fallback to a transaction commit and instead
      makes sure all the names of the conflicting inode (the one that had a
      name in a past transaction that matches the name of the new file in the
      same parent directory) are logged, so that at log replay time we lose
      neither the new file nor the old file, and the old file gets the name it
      was renamed to.
      
      This also ends up avoiding a full transaction commit for a similar case
      that involves an unlink instead of a rename of the old file:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        remove file A
        create a new file named A at directory D
        fsync the new file
        power fail
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: be more precise on errors when getting an inode from disk · 67710892
      Authored by Filipe Manana
      When we attempt to read an inode from disk, we end up always returning
      an -ESTALE error to the caller regardless of the actual failure reason,
      which can be an out of memory problem (when allocating a path), some
      error found when reading from the fs/subvolume btree (like a genuine IO
      error) or the inode not existing. So let's start returning the real
      error code to the callers so that they don't treat all -ESTALE errors as
      meaning that the inode does not exist (such as during orphan cleanup).
      This will also be needed for a subsequent patch in the same series
      dealing with a special fsync case.
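
      To illustrate the effect on a caller (a hedged sketch, not code from
      this patch): with the real error propagated, callers can tell a missing
      inode (assumed here to surface as -ENOENT) from a genuine failure:

          inode = btrfs_iget(sb, &location, root, NULL);
          if (IS_ERR(inode)) {
                  ret = PTR_ERR(inode);
                  /* only -ENOENT means the inode is really gone; anything
                   * else (-ENOMEM, -EIO, ...) is a genuine failure to read it */
                  if (ret != -ENOENT)
                          return ret;
          }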
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  15. 26 Jul 2016, 8 commits
  16. 08 Jul 2016, 1 commit
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Authored by Josef Bacik
      So btrfs_block_rsv_migrate just unconditionally calls
      block_rsv_migrate_bytes. Not only that, but it unconditionally changes
      the size of the block_rsv. This isn't a bug strictly speaking, but it
      makes the truncate block rsv look funny, because every time we migrate
      bytes over, its size grows, even though we only want it to be a specific
      size. So collapse this into one function that takes an update_size
      argument, and make truncate and evict not update the size, for
      consistency's sake. Thanks,
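
      A hedged sketch of the collapsed interface and its callers (the names
      follow the message; the exact prototypes and call sites in the patch may
      differ slightly):

          int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
                                      struct btrfs_block_rsv *dst,
                                      u64 num_bytes, int update_size);

          /* truncate/evict: move bytes but keep the rsv at its configured size */
          ret = btrfs_block_rsv_migrate(trans_rsv, rsv, min_size, 0);

          /* other callers keep the old behaviour and let the destination grow */
          ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, 1);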
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  17. 25 Jun 2016, 1 commit
    • Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Authored by Omar Sandoval
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
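
      Roughly, the upgrade looks like this (a sketch, assuming it happens in
      the readdir delayed-items helper; inode_unlock_shared()/inode_lock() are
      the standard i_rwsem helpers):

          /*
           * Delayed items exist for this inode and only one readdir can walk
           * them at a time, so trade the shared i_rwsem for the exclusive one.
           */
          inode_unlock_shared(inode);
          inode_lock(inode);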
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean -dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  18. 23 Jun 2016, 1 commit
    • Btrfs: track transid for delayed ref flushing · 31b9655f
      Authored by Josef Bacik
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different, then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  19. 18 Jun 2016, 1 commit
  20. 08 Jun 2016, 5 commits
  21. 04 Jun 2016, 1 commit
    • Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Authored by Chris Mason
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      
      Josef made a script to trigger with xfs_io:
      
      	#!/bin/bash
      
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST.
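
      The added special case amounts to something like this (a hedged sketch;
      commit 8e2bd3b7 above later broadens this exact-match test to also cover
      an existing extent map that encompasses the new one):

          /* add_extent_mapping() hit -EEXIST; check for an identical em */
          existing = search_extent_mapping(em_tree, start, len);
          if (existing->start == em->start && existing->len == em->len) {
                  /* the same mapping is already inserted, just reuse it */
                  free_extent_map(em);
                  em = existing;
                  err = 0;
          }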
      Signed-off-by: Chris Mason <clm@fb.com>
  22. 26 May 2016, 1 commit
  23. 19 May 2016, 1 commit