1. 08 7月, 2016 11 次提交
    • J
      Btrfs: fix release reserved extents trace points · 31bada7c
      Josef Bacik 提交于
      We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
      isn't quite right because we will go through and free that extent later when we
      unpin, so it messes up apps that are accounting for the reservation space.  We
      were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
      only actually free the reservation instead of pinning the extent.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      31bada7c
    • J
      Btrfs: add tracepoints for flush events · f376df2b
      Josef Bacik 提交于
      We want to track when we're triggering flushing from our reservation code and
      what flushing is being done when we start flushing.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f376df2b
    • J
      Btrfs: fix delalloc reservation amount tracepoint · f485c9ee
      Josef Bacik 提交于
      We can sometimes drop the reservation we had for our inode, so we need to remove
      that amount from to_reserve so that our tracepoint reports a valid amount of
      space.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f485c9ee
    • J
      Btrfs: trace pinned extents · c51e7bb1
      Josef Bacik 提交于
      Pinned extents are an important metric to keep track of for enospc.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c51e7bb1
    • J
      Btrfs: introduce ticketed enospc infrastructure · 957780eb
      Josef Bacik 提交于
      Our enospc flushing sucks.  It is born from a time where we were early
      enospc'ing constantly because multiple threads would race in for the same
      reservation and randomly starve other ones out.  So I came up with this solution
      to block any other reservations from happening while one guy tried to flush
      stuff to satisfy his reservation.  This gives us pretty good correctness, but
      completely crap latency.
      
      The solution I've come up with is ticketed reservations.  Basically we try to
      make our reservation, and if we can't we put a ticket on a list in order and
      kick off an async flusher thread.  This async flusher thread does the same old
      flushing we always did, just asynchronously.  As space is freed and added back
      to the space_info it checks and sees if we have any tickets that need
      satisfying, and adds space to the tickets and wakes up anything we've satisfied.
      
      Once the flusher thread stops making progress it wakes up all the current
      tickets and tells them to take a hike.
      
      There is a priority list for things that can't flush, since the async flusher
      could do anything we need to avoid deadlocks.  These guys get priority for
      having their reservation made, and will still do manual flushing themselves in
      case the async flusher isn't running.
      
      This patch gives us significantly better latencies.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      957780eb
    • J
      Btrfs: add tracepoint for adding block groups · c83f8eff
      Josef Bacik 提交于
      I'm writing a tool to visualize the enospc system inside btrfs, I need this
      tracepoint in order to keep track of the block groups in the system.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c83f8eff
    • J
      Btrfs: warn_on for unaccounted spaces · d555b6c3
      Josef Bacik 提交于
      These were hidden behind enospc_debug, which isn't helpful as they indicate
      actual bugs, unlike the rest of the enospc_debug stuff which is really debug
      information.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d555b6c3
    • J
      Btrfs: change delayed reservation fallback behavior · c48f49d6
      Josef Bacik 提交于
      We reserve space for the inode update when we first reserve space for writing to
      a file.  However there are lots of ways that we can use this reservation and not
      have it for subsequent ordered extents.  Previously we'd fall through and try to
      reserve metadata bytes for this, then we'd just steal the full reservation from
      the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
      reservation from the global reserve.  The problem with this is we can easily
      just return ENOSPC and fallback to updating the inode item directly.  In the
      worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
      fall all the way through, however if we just fallback and update the inode
      directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.
      
      We would have also just added the extent item for the inode so we likely will
      have already cow'ed down most of the way to the leaf containing the inode item,
      so we are more often than not only need one or two nodesize's worth of
      reservations.  Given the reservation for the extent itself is also a worst case
      we will likely already have space to cover the inode update.
      
      This change will make us behave better in the theoretical worst case, and much
      better in the case that we don't have our reservation and cannot reserve more
      metadata.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c48f49d6
    • J
      Btrfs: always reserve metadata for delalloc extents · 48c3d480
      Josef Bacik 提交于
      There are a few races in the metadata reservation stuff.  First we add the bytes
      to the block_rsv well after we've set the bit on the inode saying that we have
      space for it and after we've reserved the bytes.  So use the normal
      btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
      extents when we try to reserve space for our write, which means that we could
      have used up the space for the inode and we wouldn't know because we only check
      before the reservation.  So instead make sure we are always reserving space for
      the inode update, and then if we don't need it release those bytes afterward.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      48c3d480
    • J
      Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik 提交于
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency sake.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      25d609f8
    • J
      Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d
      Josef Bacik 提交于
      For some reason we're adding bytes_readonly to the space info after we update
      the space info with the block group info.  This creates a tiny race where we
      could over-reserve space because we haven't yet taken out the bytes_readonly
      bit.  Since we already know this information at the time we call
      update_space_info, just pass it along so it can be updated all at once.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e40edf2d
  2. 25 6月, 2016 1 次提交
    • O
      Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval 提交于
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      02dbfc99
  3. 24 6月, 2016 4 次提交
    • C
      Btrfs: Force stripesize to the value of sectorsize · b7f67055
      Chandan Rajendra 提交于
      Btrfs code currently assumes stripesize to be same as
      sectorsize. However Btrfs-progs (until commit
      df05c7ed455f519e6e15e46196392e4757257305) has been setting
      btrfs_super_block->stripesize to a value of 4096.
      
      This commit makes sure that the value of btrfs_super_block->stripesize
      is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
      to sectorsize.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b7f67055
    • W
      btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610
      Wang Xiaoguang 提交于
      When doing truncate operation, btrfs_setsize() will first call
      truncate_setsize() to set new inode->i_size, but if later
      btrfs_truncate() fails, btrfs_setsize() will call
      "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
      inmemory inode size, now bug occurs. It's because for truncate
      case btrfs_ordered_update_i_size() directly uses inode->i_size
      to update BTRFS_I(inode)->disk_i_size, indeed we should use the
      "offset" argument to update disk_i_size. Here is the call graph:
      ==>btrfs_truncate()
      ====>btrfs_truncate_inode_items()
      ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
      Here btrfs_ordered_update_i_size()'s offset argument is last_size.
      
      And below test case can reveal this bug:
      
      dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
      dev=$(losetup --show -f fs.img)
      mkdir -p /mnt/mntpoint
      mkfs.btrfs  -f $dev
      mount $dev /mnt/mntpoint
      cd /mnt/mntpoint
      
      echo "workdir is: /mnt/mntpoint"
      blocksize=$((128 * 1024))
      dd if=/dev/zero of=testfile bs=$blocksize count=1
      sync
      count=$((17*1024*1024*1024/blocksize))
      echo "file size is:" $((count*blocksize))
      for ((i = 1; i <= $count; i++)); do
      	i=$((i + 1))
      	dst_offset=$((blocksize * i))
      	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
      		testfile > /dev/null
      done
      sync
      
      truncate --size 0 testfile
      ls -l testfile
      du -sh testfile
      exit
      
      In this case, truncate operation will fail for enospc reason and
      "du -sh testfile" returns value greater than 0, but testfile's
      size is 0, we need to reflect correct inode->i_size.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c0d2f610
    • L
      Btrfs: fix error handling in map_private_extent_buffer · 415b35a5
      Liu Bo 提交于
      map_private_extent_buffer() can return -EINVAL in two different cases,
      1. when the requested contents span two pages if nodesize is larger
         than pagesize,
      2. when it detects something insane.
      
      The 2nd one used to be only a WARN_ON(1), and we decided to return a error
      to callers, but we didn't fix up all its callers, which will be
      addressed by this patch.
      
      Without this, btrfs may end up with 'general protection', ie.
      reading invalid memory.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      415b35a5
    • W
      Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a
      Wei Yongjun 提交于
      Fix to return a negative error code from the kern_mount() error handling
      case instead of 0(ret is set to 0 by register_filesystem), as done
      elsewhere in this function.
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      04e1b65a
  4. 23 6月, 2016 3 次提交
    • J
      Btrfs: don't do nocow check unless we have to · c6887cd1
      Josef Bacik 提交于
      Before we write into prealloc/nocow space we have to make sure that there are no
      references to the extents we are writing into, which means checking the extent
      tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
      unless we can't reserve data space, since it's a serious drag on performance.
      With the following sequence
      
      fallocate -l10737418240 /mnt/btrfs-test/file
      cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
      fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
      	--end_fsync=1
      
      we get the worst case scenario where we have to fall back on to doing the check
      anyway.
      
      Without this patch
      lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
      write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
      
      With this patch
      lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
      write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
      
      We get twice the throughput, half of the runtime, and half of the average
      latency.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      [ PAGE_CACHE_ removal related fixups ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c6887cd1
    • C
      btrfs: fix deadlock in delayed_ref_async_start · 0f873eca
      Chris Mason 提交于
      "Btrfs: track transid for delayed ref flushing" was deadlocking on
      btrfs_attach_transaction because its not safe to call from the async
      delayed ref start code.  This commit brings back btrfs_join_transaction
      instead and checks for a blocked commit.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0f873eca
    • J
      Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik 提交于
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31b9655f
  5. 18 6月, 2016 8 次提交
  6. 06 6月, 2016 9 次提交
  7. 04 6月, 2016 1 次提交
    • C
      Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Chris Mason 提交于
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      
      Josef made a script to trigger with xfs_io:
      
      	#!/bin/bash
      
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST
      Signed-off-by: NChris Mason <clm@fb.com>
      8dff9c85
  8. 03 6月, 2016 3 次提交