1. 06 Sep, 2016 1 commit
    • btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress · ce129655
      Wang Xiaoguang committed
      In btrfs_async_reclaim_metadata_space(), we use ticket's address to
      determine whether asynchronous metadata reclaim work is making progress.
      
      	ticket = list_first_entry(&space_info->tickets,
      				  struct reserve_ticket, list);
      	if (last_ticket == ticket) {
      		flush_state++;
      	} else {
      		last_ticket = ticket;
      		flush_state = FLUSH_DELAYED_ITEMS_NR;
      		if (commit_cycles)
      			commit_cycles--;
      	}
      
      But this is wrong: we should not rely on a local variable's address for
      this check, because the addresses may be the same. In my test
      environment, I used dd to write one 168MB file to a 256MB fs and found
      that, for this file, every time wait_reserve_ticket() was called, the
      local variable ticket's address was the same.

      For the code above, assume a previous ticket's address is addrA, so
      last_ticket is addrA. btrfs_async_reclaim_metadata_space() finishes this
      ticket and wakes it up, then another ticket is added with the same address
      addrA. Now last_ticket equals the current ticket, so the current ticket's
      flush work starts from the current flush_state rather than the initial
      FLUSH_DELAYED_ITEMS_NR, which may result in some ENOSPC issues (I have
      seen this on my test machine).
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
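The failure mode described in this commit can be sketched in userspace C. This is a minimal illustration, not the btrfs code: `ticket_sketch`, `space_info_sketch` and the helper functions are hypothetical names, and the counter is assumed to be bumped each time a ticket completes, which is what the `tickets_id` in the commit title suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures, for illustration only. */
struct ticket_sketch {
	uint64_t bytes;
};

struct space_info_sketch {
	struct ticket_sketch *first_ticket; /* head of the pending list, or NULL */
	uint64_t tickets_id;                /* assumed: bumped when a ticket completes */
};

/* Complete the head ticket, then queue 'next' as the new head. */
static void complete_and_requeue(struct space_info_sketch *si,
				 struct ticket_sketch *next)
{
	assert(si != NULL);
	si->tickets_id++;
	si->first_ticket = next;
}

/* Address-based check: an unchanged head address is read as "no progress". */
static bool stalled_by_address(const struct space_info_sketch *si,
			       const struct ticket_sketch *last_ticket)
{
	return si->first_ticket == last_ticket;
}

/* Counter-based check: progress is judged by completed tickets instead. */
static bool stalled_by_id(const struct space_info_sketch *si,
			  uint64_t last_tickets_id)
{
	return si->tickets_id == last_tickets_id;
}
```

If a new ticket happens to occupy the same storage as the one that was just completed, `stalled_by_address()` reports a stall even though work was done, while `stalled_by_id()` does not.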
  2. 05 Sep, 2016 1 commit
  3. 01 Sep, 2016 1 commit
  4. 25 Aug, 2016 7 commits
    • Btrfs: fix em leak in find_first_block_group · 187ee58c
      Josef Bacik committed
      We need to call free_extent_map() on the em we look up.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: clarify do_chunk_alloc()'s return value · 28b737f6
      Liu Bo committed
      Function start_transaction() can return ERR_PTR(1) when flush is
      BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is
      
      start_transaction (return ERR_PTR(1))
        -> btrfs_block_rsv_add (return 1)
           -> reserve_metadata_bytes (return 1)
              -> flush_space (return 1)
                 -> do_chunk_alloc  (return 1)
      
      With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
      flush_state of ALLOC_CHUNK and it successfully allocates a new
      chunk, then instead of trying to reserve space again,
      reserve_metadata_bytes returns 1 immediately.
      
      Callers of start_transaction() usually do only the IS_ERR() check,
      which ERR_PTR(1) passes, so they panic when dereferencing a pointer
      that is actually ERR_PTR(1).
      
      The following patch fixes the above problem.
      "btrfs: flush_space: treat return value of do_chunk_alloc properly"
      https://patchwork.kernel.org/patch/7778651/
      
      This adds comments to clarify do_chunk_alloc()'s return value.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
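Why ERR_PTR(1) slips past an IS_ERR() check can be shown with a userspace re-creation of the error-pointer convention. This is a sketch under assumptions: `err_ptr_sketch()` and `is_err_sketch()` are hypothetical stand-ins that mimic the kernel helpers, using the usual convention that only the top 4095 address values encode an errno.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's MAX_ERRNO convention. */
#define MAX_ERRNO_SKETCH 4095

/* Encode an error code in a pointer, as ERR_PTR() does. */
static void *err_ptr_sketch(intptr_t error)
{
	return (void *)error;
}

/* Only values in [-MAX_ERRNO_SKETCH, -1] are recognised as errors. */
static bool is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}
```

A negative errno such as -28 lands in the recognised range, while a positive 1 becomes a small, invalid address that the check treats as a valid pointer. That is why a caller that only tests for an error pointer goes on to dereference it.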
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang committed
      This patch fixes some false ENOSPC errors. The test script below
      reproduces one of them:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The script above fails with ENOSPC, but the fs still has enough free
      space to satisfy this request. Please see the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      On CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
      is delayed until extent_clear_unlock_delalloc().
      Assume that in this case btrfs_reserve_extent() reserved 128MB of data and
      CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk, call
      btrfs_start_delalloc_roots(), or commit the current transaction in order to
      reserve some free space, which is obviously a lot of work. But it is not
      necessary: if bytes_may_use were decreased in time (by the 128MB already
      reserved), we would still have free space.
      
      To fix this issue, this patch updates bytes_may_use for both data and
      metadata in btrfs_add_reserved_bytes(). On the compress path, the real
      extent length may differ from the file content length, and bytes_may_use
      is increased by the file content length, so introduce a ram_bytes
      argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(). The compress path can then update
      bytes_may_use correctly. We can now also discard
      RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.

      As we know, EXTENT_DO_ACCOUNTING is usually used for error paths. In
      run_delalloc_nocow(), for an inode marked NODATACOW or an extent marked
      PREALLOC, we also need to update bytes_may_use but cannot pass
      EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation.
      So introduce the EXTENT_CLEAR_DATA_RESV flag to tell
      btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.

      Meanwhile, __btrfs_prealloc_file_range() now calls
      btrfs_free_reserved_data_space() internally on both the successful and
      the failed path, so btrfs_prealloc_file_range()'s callers no longer need
      to call btrfs_free_reserved_data_space().
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
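The double counting behind the false ENOSPC can be sketched with a toy accounting model. This is only an illustration: `data_sinfo_sketch` reuses the counter names from the message, but the struct, the helper names, and the 100 MiB total are assumptions, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Hypothetical model of the counters named in the commit message. */
struct data_sinfo_sketch {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_reserved;
	uint64_t bytes_pinned;
	uint64_t bytes_readonly;
	uint64_t bytes_may_use;
};

/* Everything that counts against total_bytes in the free-space check. */
static uint64_t claimed_bytes(const struct data_sinfo_sketch *s)
{
	return s->bytes_used + s->bytes_reserved + s->bytes_pinned +
	       s->bytes_readonly + s->bytes_may_use;
}

static bool looks_full(const struct data_sinfo_sketch *s)
{
	return claimed_bytes(s) > s->total_bytes;
}

/* Behaviour described as the problem: bytes_may_use is left untouched. */
static void add_reserved_old(struct data_sinfo_sketch *s, uint64_t num_bytes)
{
	s->bytes_reserved += num_bytes;
}

/* Behaviour described as the fix: move ram_bytes out of bytes_may_use. */
static void add_reserved_new(struct data_sinfo_sketch *s,
			     uint64_t ram_bytes, uint64_t num_bytes)
{
	assert(s->bytes_may_use >= ram_bytes);
	s->bytes_reserved += num_bytes;
	s->bytes_may_use -= ram_bytes;
}
```

With 64 MiB already in `bytes_may_use` and a 100 MiB space, `add_reserved_old()` leaves 128 MiB claimed and the space looks full, whereas `add_reserved_new()` leaves 64 MiB claimed.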
    • btrfs: divide btrfs_update_reserved_bytes() into two functions · 4824f1f4
      Wang Xiaoguang committed
      This patch divides btrfs_update_reserved_bytes() into
      btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(). The
      next patch will extend btrfs_add_reserved_bytes() to fix some
      false ENOSPC errors; please see the later patch for details.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() · cb93b52c
      Qu Wenruo committed
      Refactor btrfs_qgroup_insert_dirty_extent() into two functions:
      1. btrfs_qgroup_insert_dirty_extent_nolock()
         Almost the same as the original code.
         For delayed_ref usage, which has delayed refs locked.
      
         Change the return value type to int, since caller never needs the
         pointer, but only needs to know if they need to free the allocated
         memory.
      
      2. btrfs_qgroup_insert_dirty_extent()
         The more encapsulated version.
      
         Will do the delayed_refs lock, memory allocation, quota enabled check
         and other things.
      
      The original design was to keep exported functions to a minimum, but
      as more btrfs hacks are exposed, like replacing a path in balance, we
      need to record dirty extents manually, so we have to add such functions.

      Also, add comments for both functions to inform developers how to keep
      qgroup correct when doing hacks.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: flush_space: treat return value of do_chunk_alloc properly · eecba891
      Alex Lyakas committed
      do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
      But flush_space does not convert this to 0, and also returns 1.
      As a result, reserve_metadata_bytes thinks that flush_space failed,
      and may return this value "1" to the caller (depending on how
      reserve_metadata_bytes was called). The caller also treats this as an error.
      For example, btrfs_block_rsv_refill does:
      
      int ret = -ENOSPC;
      ...
      ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
      if (!ret) {
              block_rsv_add_bytes(block_rsv, num_bytes, 0);
              return 0;
      }
      
      return ret;
      
      So it will return -ENOSPC.
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
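The mismatch between "1 means a chunk was allocated" and callers that treat any non-zero value as failure can be sketched as follows. The two helpers are hypothetical names for illustration; the normalisation shown (mapping a positive return to 0) is an assumption about the shape of the fix, based only on the description in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: fold the positive "chunk allocated" result into
 * plain success, and pass negative errno values through unchanged.
 */
static int normalize_chunk_alloc_ret(int ret)
{
	return ret > 0 ? 0 : ret;
}

/* Caller-side convention from the message: only 0 counts as success. */
static bool reservation_succeeded(int ret)
{
	return !ret;
}
```

Without the normalisation, a raw return of 1 fails the `!ret` test even though the allocation worked. With it, success is reported as 0 and real errors are still propagated.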
    • Btrfs: add ASSERT for block group's memory leak · f3bca802
      Liu Bo committed
      This adds several ASSERT()s to report memory leaks of the block group cache.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  5. 26 Jul, 2016 9 commits
  6. 21 Jul, 2016 1 commit
  7. 08 Jul, 2016 15 commits
    • Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes · 8ca17f0f
      Josef Bacik committed
      We used to allow you to set FLUSH_ALL and then just wouldn't do things like
      commit transactions or wait on ordered extents if we noticed you were in a
      transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
      we've lost the ability to tell, and we could end up deadlocking.  So instead use
      FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
      we error out to preserve the previous behavior.  I've also added an ASSERT() to
      catch anybody else who tries to do this.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: always use trans->block_rsv for orphans · 40acc3ee
      Josef Bacik committed
      This is the case all the time anyway except for relocation which could be doing
      a reloc root for a non ref counted root, in which case we'd end up with some
      random block rsv rather than the one we have our reservation in.  If there isn't
      enough space in the block rsv we are trying to steal from we'll BUG() because we
      expect there to be space for the orphan to make its reservation.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: change how we calculate the global block rsv · ae2e4728
      Josef Bacik committed
      Traditionally we've calculated the global block rsv by guessing how much of the
      metadata used amount was the extent tree, and then taking the data size and
      figuring out how large the csum tree would have to be to hold that much data.
      
      This is imprecise and falls down on MIXED file systems as we can't trust the
      data used amount.  This resulted in failures for xfstests generic/333 because it
      creates lots of clones, which explodes out the extent tree.  Our global reserve
      calculations were woefully inaccurate in this case which meant we got into a
      situation where we did not have enough reserved to do our work.
      
      We know we only use the global block rsv for the extent, csum, and root trees,
      so just get the bytes used for these trees and use that as the basis of our
      global reserve.  Since these are not reference counted trees the bytes_used
      value will be accurate.  This fixed the transaction aborts seen with
      generic/333.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: use root when checking need_async_flush · 87241c2e
      Josef Bacik committed
      Instead of doing fs_info->fs_root in need_async_flush, which may not be set
      during recovery when mounting, just pass the root itself in, which makes more
      sense as that's what btrfs_calc_reclaim_metadata_size takes.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reported-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: don't bother kicking async if there's nothing to reclaim · d38b349c
      Josef Bacik committed
      We do this check when we start the async reclaimer thread, might as well check
      before we kick it off to save us some cycles.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix release reserved extents trace points · 31bada7c
      Josef Bacik committed
      We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
      isn't quite right because we will go through and free that extent later when we
      unpin, so it messes up apps that are accounting for the reservation space.  We
      were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
      only actually free the reservation instead of pinning the extent.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: add tracepoints for flush events · f376df2b
      Josef Bacik committed
      We want to track when we're triggering flushing from our reservation code and
      what flushing is being done when we start flushing.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix delalloc reservation amount tracepoint · f485c9ee
      Josef Bacik committed
      We can sometimes drop the reservation we had for our inode, so we need to remove
      that amount from to_reserve so that our tracepoint reports a valid amount of
      space.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: trace pinned extents · c51e7bb1
      Josef Bacik committed
      Pinned extents are an important metric to keep track of for enospc.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: introduce ticketed enospc infrastructure · 957780eb
      Josef Bacik committed
      Our enospc flushing sucks.  It is born from a time where we were early
      enospc'ing constantly because multiple threads would race in for the same
      reservation and randomly starve other ones out.  So I came up with this solution
      to block any other reservations from happening while one guy tried to flush
      stuff to satisfy his reservation.  This gives us pretty good correctness, but
      completely crap latency.
      
      The solution I've come up with is ticketed reservations.  Basically we try to
      make our reservation, and if we can't we put a ticket on a list in order and
      kick off an async flusher thread.  This async flusher thread does the same old
      flushing we always did, just asynchronously.  As space is freed and added back
      to the space_info it checks and sees if we have any tickets that need
      satisfying, and adds space to the tickets and wakes up anything we've satisfied.
      
      Once the flusher thread stops making progress it wakes up all the current
      tickets and tells them to take a hike.
      
      There is a priority list for things that can't flush, since the async flusher
      could do anything we need to avoid deadlocks.  These guys get priority for
      having their reservation made, and will still do manual flushing themselves in
      case the async flusher isn't running.
      
      This patch gives us significantly better latencies.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
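The ticket idea described in this commit (queue a ticket in order, hand freed space to tickets, wake the satisfied ones) can be sketched as a small single-threaded model. Everything here is hypothetical: the struct names, the fixed-size array, and the `RESERVED_NOW` / `QUEUE_FULL` return codes are illustration choices, and there is no locking, priority list, or flusher thread as in the real infrastructure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TICKETS  8
#define RESERVED_NOW (-1) /* reservation satisfied immediately */
#define QUEUE_FULL   (-2) /* sketch limitation: no room for another ticket */

struct reserve_ticket_sketch {
	uint64_t bytes; /* bytes this ticket is still missing */
	bool woken;     /* set once the ticket is fully satisfied */
};

struct ticket_queue_sketch {
	struct reserve_ticket_sketch tickets[MAX_TICKETS];
	size_t head;         /* oldest unsatisfied ticket */
	size_t tail;         /* next free slot */
	uint64_t free_bytes; /* space not yet handed to anyone */
};

/* Try to reserve; if that fails, queue a ticket and return its index. */
static int reserve_or_queue(struct ticket_queue_sketch *q, uint64_t bytes)
{
	/* Take the fast path only when nobody is waiting, to keep FIFO order. */
	if (q->head == q->tail && q->free_bytes >= bytes) {
		q->free_bytes -= bytes;
		return RESERVED_NOW;
	}
	if (q->tail == MAX_TICKETS)
		return QUEUE_FULL;
	q->tickets[q->tail].bytes = bytes;
	q->tickets[q->tail].woken = false;
	return (int)q->tail++;
}

/* Called as space is freed: feed tickets in order and wake satisfied ones. */
static void space_freed(struct ticket_queue_sketch *q, uint64_t bytes)
{
	q->free_bytes += bytes;
	while (q->head < q->tail) {
		struct reserve_ticket_sketch *t = &q->tickets[q->head];
		uint64_t give = q->free_bytes < t->bytes ? q->free_bytes : t->bytes;

		t->bytes -= give;
		q->free_bytes -= give;
		if (t->bytes)
			break; /* head is still short; later tickets keep waiting */
		t->woken = true;
		q->head++;
	}
}
```

Serving strictly from the head is what keeps a large reservation from being starved by smaller ones that arrive later, which is the starvation problem the message attributes to the older scheme.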
    • Btrfs: add tracepoint for adding block groups · c83f8eff
      Josef Bacik committed
      I'm writing a tool to visualize the enospc system inside btrfs, I need this
      tracepoint in order to keep track of the block groups in the system.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: warn_on for unaccounted spaces · d555b6c3
      Josef Bacik committed
      These were hidden behind enospc_debug, which isn't helpful as they indicate
      actual bugs, unlike the rest of the enospc_debug stuff which is really debug
      information.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: always reserve metadata for delalloc extents · 48c3d480
      Josef Bacik committed
      There are a few races in the metadata reservation stuff.  First we add the bytes
      to the block_rsv well after we've set the bit on the inode saying that we have
      space for it and after we've reserved the bytes.  So use the normal
      btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
      extents when we try to reserve space for our write, which means that we could
      have used up the space for the inode and we wouldn't know because we only check
      before the reservation.  So instead make sure we are always reserving space for
      the inode update, and then if we don't need it release those bytes afterward.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik committed
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency's sake.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
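The effect of an update_size argument can be sketched with a toy reservation pair. This is a hedged illustration: `block_rsv_sketch` and `block_rsv_migrate_sketch()` are hypothetical, and the exact bookkeeping of the real function is an assumption based on the description in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in: 'size' is the target, 'reserved' is what is held. */
struct block_rsv_sketch {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Move num_bytes of reservation from src to dst. The destination's size
 * grows only when update_size is set, so a fixed-size reservation (such as
 * the truncate one mentioned in the message) keeps its intended size.
 */
static int block_rsv_migrate_sketch(struct block_rsv_sketch *src,
				    struct block_rsv_sketch *dst,
				    uint64_t num_bytes, bool update_size)
{
	if (src->reserved < num_bytes)
		return -ENOSPC;
	src->reserved -= num_bytes;
	dst->reserved += num_bytes;
	if (update_size)
		dst->size += num_bytes;
	return 0;
}
```

Passing `false` for `update_size` models the truncate and evict callers, whose destination size stays fixed no matter how many times bytes are migrated in.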
    • Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d
      Josef Bacik committed
      For some reason we're adding bytes_readonly to the space info after we update
      the space info with the block group info.  This creates a tiny race where we
      could over-reserve space because we haven't yet taken out the bytes_readonly
      bit.  Since we already know this information at the time we call
      update_space_info, just pass it along so it can be updated all at once.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 23 Jun, 2016 2 commits
    • btrfs: fix deadlock in delayed_ref_async_start · 0f873eca
      Chris Mason committed
      "Btrfs: track transid for delayed ref flushing" was deadlocking on
      btrfs_attach_transaction because it's not safe to call from the async
      delayed ref start code.  This commit brings back btrfs_join_transaction
      instead and checks for a blocked commit.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik committed
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
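The transid check described in this commit amounts to remembering which transaction the throttling was started in and comparing it later. A minimal sketch, with hypothetical names (`async_throttle_sketch`, `throttle_is_stale()`) and a plain inequality test taken from the wording of the message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the transaction in which throttling began. */
struct async_throttle_sketch {
	uint64_t transid;
};

/*
 * If the transaction we get now differs from the one recorded when the
 * throttling was initiated, the old transaction has moved on and the
 * throttling work no longer matters, so the caller can bail out.
 */
static bool throttle_is_stale(const struct async_throttle_sketch *work,
			      uint64_t current_transid)
{
	return work->transid != current_transid;
}
```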
  9. 18 Jun, 2016 2 commits
    • btrfs: account for non-CoW'd blocks in btrfs_abort_transaction · 64c12921
      Jeff Mahoney committed
      The test for !trans->blocks_used in btrfs_abort_transaction is
      insufficient to determine whether it's safe to drop the transaction
      handle on the floor.  btrfs_cow_block, informed by should_cow_block,
      can return blocks that have already been CoW'd in the current
      transaction.  trans->blocks_used is only incremented for new block
      allocations. If an operation overlaps the blocks in the current
      transaction entirely and must abort the transaction, we'll happily
      let it clean up the trans handle even though it may have modified
      the blocks and will commit an incomplete operation.
      
      In the long-term, I'd like to do closer tracking of when the fs
      is actually modified so we can still recover as gracefully as possible,
      but that approach will need some discussion.  In the short term,
      since this is the only code using trans->blocks_used, let's just
      switch it to a bool indicating whether any blocks were used and set
      it when should_cow_block returns false.
      
      Cc: stable@vger.kernel.org # 3.4+
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: check if extent buffer is aligned to sectorsize · c871b0f2
      Liu Bo committed
      Thanks to fuzz testing, we found that an invalid bytenr can be passed to
      an extent buffer via alloc_extent_buffer().  An unaligned eb can have more
      pages than it should have, which ends up leaking the extent buffer or
      corrupting its content.

      This adds a warning to let us quickly know what was happening.

      Now that alloc_extent_buffer() no longer returns NULL, this changes its
      caller and the callers of its caller to match the new error
      handling.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
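The alignment test this commit refers to can be written as a simple mask check. The sketch below is a userspace illustration with hypothetical helper names, and it assumes the sector size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool is_power_of_two(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* True when bytenr is a multiple of sectorsize (assumed power of two). */
static bool bytenr_is_aligned(uint64_t bytenr, uint32_t sectorsize)
{
	assert(is_power_of_two(sectorsize));
	return (bytenr & ((uint64_t)sectorsize - 1)) == 0;
}
```

With a 4096-byte sector size, 8192 passes the check and 8193 does not, which is the kind of fuzzed value the warning is meant to surface.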
  10. 08 Jun, 2016 1 commit