1. 07 May 2013, 14 commits
    • Btrfs: allocate new chunks if the space is not enough for global rsv · 3c76cd84
      Authored by Miao Xie
      When running xfstests case 208, the fs returned ENOSPC
      even though there was lots of free space on the disk.
      
      By bisecting, we found it was introduced by commit 96f1bb57.
      That commit made the space check for the global reservation in
      can_overcommit() inconsistent with should_alloc_chunk():
      can_overcommit() requires that the free space be at least twice the size
      of the global reservation, or we can't do overcommit; instead we need
      to reclaim some reserved space, and if we still don't have enough free
      space, we need to allocate a new chunk. Unfortunately,
      should_alloc_chunk() only requires that the free space be one times
      the size of the global reservation, so if the free space falls between
      these two thresholds we never try to allocate a new chunk and just
      return ENOSPC. Fix it.
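      
      A minimal sketch of the intended consistency, using simplified, assumed
      names rather than the exact btrfs fields; the point is only that
      should_alloc_chunk() applies the same two-times-global-rsv threshold
      that can_overcommit() checks:
      ------
              /* hypothetical, simplified form of the fixed check */
              static int should_alloc_chunk(u64 free_space, u64 global_rsv_size,
                                            u64 bytes_needed)
              {
                      /* can_overcommit() refuses overcommit unless free space is
                       * at least 2x the global reservation, so use the same bound
                       * here; otherwise requests falling between 1x and 2x the
                       * global rsv never trigger a chunk allocation. */
                      u64 thresh = 2 * global_rsv_size;
      
                      return free_space < thresh + bytes_needed;
              }
      ------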
      
      Cc: Jim Schutt <jaschut@sandia.gov>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      3c76cd84
    • Btrfs: separate sequence numbers for delayed ref tracking and tree mod log · fc36ed7e
      Authored by Jan Schmidt
      Sequence numbers for delayed refs were introduced in the first version
      of the qgroup patch set. To solve the problem of find_all_roots on a busy
      file system, the tree mod log was introduced, and the sequence numbers
      were simply shared between those two users.
      
      However, at one point in qgroup's quota accounting there is a statement
      that accesses the previous sequence number by simply computing (seq - 1),
      just as it had to in the very first version.
      
      To satisfy that requirement, this patch makes the sequence number counter 64
      bit and splits it into a major part (used for qgroup sequence number
      counting) and a minor part (incremented for each tree modification in the
      log). This enables us to go exactly one major step backwards, as required
      for qgroups, while still incrementing the sequence counter for tree mod log
      insertions to keep track of their order. Keeping them in a single variable
      means there's no need to change all the code dealing with comparisons of two
      sequence numbers.
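      
      A standalone sketch of the split counter; the names below are
      illustrative, not the exact btrfs identifiers:
      ------
              typedef unsigned long long u64;
      
              #define SEQ_MINOR_BITS 32
      
              static inline u64 seq_major(u64 seq) { return seq >> SEQ_MINOR_BITS; }
              static inline u64 seq_minor(u64 seq) { return seq & 0xffffffffULL; }
      
              /* qgroup accounting bumps the major half (minor resets to 0)... */
              static inline u64 seq_inc_major(u64 seq)
              {
                      return (seq_major(seq) + 1) << SEQ_MINOR_BITS;
              }
      
              /* ...while each tree mod log insertion bumps only the minor half,
               * so plain u64 comparisons still order all events, and "one major
               * step back" is simply seq - (1ULL << SEQ_MINOR_BITS). */
              static inline u64 seq_inc_minor(u64 seq)
              {
                      return seq + 1;
              }
      ------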
      
      The sequence number is reset to 0 on commit (not new in this patch), which
      ensures we won't overflow the two 32 bit counters.
      
      Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
      from the tree mod log code may happen.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fc36ed7e
    • Btrfs: don't panic if we're trying to drop too many refs · 32b02538
      Authored by Josef Bacik
      This is just obnoxious.  Just print a message, abort the transaction, and return
      an error.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      32b02538
    • Btrfs: fix all callers of read_tree_block · 416bc658
      Authored by Josef Bacik
      We kept leaking extent buffers when mounting a broken file system, and it turns
      out it's because not everybody uses read_tree_block properly: you need to check
      that the extent_buffer is uptodate before you use it.  This patch fixes every
      direct caller of read_tree_block to check that the buffer is uptodate, and to
      free it and return an error if it is not.  With this we no longer leak extent
      buffers when things go horribly wrong.  Thanks,
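      
      A hedged sketch of the caller pattern this enforces; the four-argument
      read_tree_block() signature shown here is an assumption based on the
      kernel code of that era:
      ------
              struct extent_buffer *eb;
      
              eb = read_tree_block(root, bytenr, blocksize, generation);
              if (!eb || !extent_buffer_uptodate(eb)) {
                      /* read failed or the block did not verify: drop the
                       * reference instead of leaking it, and bail out */
                      if (eb)
                              free_extent_buffer(eb);
                      return -EIO;
              }
      ------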
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      416bc658
    • Btrfs: only exclude supers in the range of our block group · 51bf5f0b
      Authored by Josef Bacik
      If we fail to load block groups halfway through we can leave extent_state's on
      the excluded tree.  This is because we just lookup the supers and add them to
      the excluded tree regardless of which block group we are looking at currently.
      This is a problem because we remove the excluded extents for the range of the
      block group only, so if we don't ever load a block group for one of the excluded
      extents we won't ever free it.  This fixes the problem by only adding excluded
      extents if they fall within the block group range we care about.  With this patch
      we're no longer leaking space when we fail to read all of the block groups.
      Thanks,
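      
      A rough sketch of the range check, with assumed variable names for the
      block group start and length; treat it as an illustration rather than
      the patch itself:
      ------
              u64 bg_start = block_group_start;
              u64 bg_end   = block_group_start + block_group_len;
              u64 start    = max(super_bytenr, bg_start);
              u64 end      = min(super_bytenr + super_len, bg_end);
      
              if (start >= end)
                      return 0;       /* stripe is outside this block group */
      
              /* only the overlapping part gets marked excluded, so it is
               * guaranteed to be cleared when this block group is loaded */
              ret = add_excluded_extent(root, start, end - start);
      ------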
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      51bf5f0b
    • Btrfs: fix possible infinite loop in slow caching · 0a3896d0
      Authored by Josef Bacik
      So I noticed there is an infinite loop in the slow caching code.  We return 1
      when we hit the end of the tree, so if we end up caching the last block group
      the slow way we can suddenly loop forever because we just keep
      re-searching and trying again.  Fix this by only doing btrfs_next_leaf() if we
      don't need_resched().  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      0a3896d0
    • Btrfs: cleanup of function where btrfs_extend_item() is called · fd279fae
      Authored by Tsutomu Itoh
      The 'trans' argument became unnecessary in setup_inline_extent_backref(),
      which called btrfs_extend_item(), so remove it.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fd279fae
    • Btrfs: remove unused argument of btrfs_extend_item() · 4b90c680
      Authored by Tsutomu Itoh
      Argument 'trans' is not used in btrfs_extend_item().
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      4b90c680
    • Btrfs: cleanup of function where fixup_low_keys() is called · afe5fea7
      Authored by Tsutomu Itoh
      Where the 'trans' argument is unnecessary in a function that calls
      fixup_low_keys(), remove it.
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      afe5fea7
    • Btrfs: don't wait on ordered extents if we have a trans open · 98ad69cf
      Authored by Josef Bacik
      Dave was hitting a lockdep warning because we're now properly taking the ordered
      operations mutex in the ordered wait stuff.  This is because in some cases we will
      have a trans handle when we are flushing delalloc space, but we can't wait on
      ordered extents because we could potentially deadlock, so fix this by not doing
      the wait if we have a trans handle.  Thanks
      Reported-and-tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      98ad69cf
    • Btrfs: fix error handling in make/read block group · 8c579fe7
      Authored by Josef Bacik
      I noticed that we will add a block group to the space info before we add it to
      the block group cache rb tree, so we could potentially allocate from the block
      group before it's able to be searched for.  I don't think this is too much of
      a problem, the race window is microscopic, but just in case move the tree
      insertion to above the space info linking.  This makes it easier to adjust the
      error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real
      error handling set up for these scenarios.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      8c579fe7
    • Btrfs: Include the device in most error printk()s · c2cf52eb
      Authored by Simon Kirby
      With more than one btrfs volume mounted, it can be very difficult to find
      out which volume is hitting an error. btrfs_error() will print this, but
      it is currently rigged as more of a fatal error handler, while many of
      the printk()s are currently for debugging and yet-unhandled cases.
      
      This patch just changes the functions where the device information is
      already available. Some cases remain where the root or fs_info is not
      passed to the function emitting the error.
      
      This may introduce some confusion with volumes backed by multiple devices
      emitting errors referring to the primary device in the set instead of the
      one on which the error occurred.
      
      Use btrfs_printk(fs_info, format, ...) rather than writing the device
      string every time, and introduce macro wrappers ala XFS for brevity.
      Since the function already cannot be used for continuations, print a
      newline as part of the btrfs_printk() message rather than at each caller.
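      
      A hedged sketch of what such XFS-style wrappers can look like; the
      wrapper names (btrfs_err/btrfs_warn/btrfs_info) are shown for
      illustration rather than quoted from this patch:
      ------
              /* btrfs_printk() appends the newline itself, so callers never do */
              #define btrfs_err(fs_info, fmt, args...) \
                      btrfs_printk(fs_info, KERN_ERR fmt, ##args)
              #define btrfs_warn(fs_info, fmt, args...) \
                      btrfs_printk(fs_info, KERN_WARNING fmt, ##args)
              #define btrfs_info(fs_info, fmt, args...) \
                      btrfs_printk(fs_info, KERN_INFO fmt, ##args)
      
              /* usage: btrfs_err(root->fs_info, "failed to read chunk root"); */
      ------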
      Signed-off-by: Simon Kirby <sim@hostway.ca>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      c2cf52eb
    • btrfs: clean snapshots one by one · 9d1a2a3a
      Authored by David Sterba
      Each time, pick one dead root from the list and let the caller know whether
      it needs to continue. This should improve responsiveness during
      umount and balance, which at some point wait for all currently
      queued dead roots to be cleaned.
      
      A new dead root is added to the end of the list, so the snapshots
      disappear in the order of deletion.
      
      The snapshot cleaning work is now done only from the cleaner thread and the
      others wake it if needed.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      9d1a2a3a
    • Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Authored by Josef Bacik
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely remove
      the btrfs_tree_block_info stored inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory, which makes our heavy
      metadata operations go much faster.  This is not an automatic format change; you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
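      
      The byte counts break down roughly as follows (on-disk struct sizes per
      the 3.x headers; the breakdown is a reader's reconstruction, not part of
      the patch):
      ------
              old inline metadata backref, per tree block:
                  btrfs_extent_item        24  (refs, generation, flags)
                + btrfs_tree_block_info    18  (first key 17 + level 1)
                + btrfs_extent_inline_ref   9  (type 1 + offset 8)
                = 51 bytes
      
              new BTRFS_METADATA_ITEM_KEY format (level kept in the key offset):
                  btrfs_extent_item        24
                + btrfs_extent_inline_ref   9
                = 33 bytes
      ------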
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      3173a18f
  2. 28 March 2013, 2 commits
    • Btrfs: limit the global reserve to 512mb · fdf30d1c
      Authored by Josef Bacik
      A user reported a problem where he was getting early ENOSPC with hundreds of
      gigs of free data space and 6 gigs of free metadata space.  This is because the
      global block reserve was taking up the entire free metadata space.  This is
      ridiculous; we have infrastructure in place to throttle if we start using too
      much of the global reserve, so instead of letting it get this huge, just limit it
      to 512mb so that users can still get work done.  This allowed the user to
      complete his rsync without issues.  Thanks
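      
      A minimal sketch of the cap, with assumed names for the reserve and the
      computed size; the real update happens wherever the global reserve is
      resized:
      ------
              #define BTRFS_GLOBAL_RSV_CAP    (512ULL * 1024 * 1024)  /* 512mb */
      
              /* whatever the sizing heuristics computed, never let the global
               * reserve consume more than 512mb of the metadata space */
              num_bytes = min_t(u64, num_bytes, BTRFS_GLOBAL_RSV_CAP);
              block_rsv->size = num_bytes;
      ------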
      
      Cc: stable@vger.kernel.org
      Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fdf30d1c
    • Btrfs: fix space leak when we fail to reserve metadata space · f4881bc7
      Authored by Josef Bacik
      Dave reported a warning when running xfstest 275.  We have been leaking delalloc
      metadata space when our reservations fail.  This is because we were improperly
      calculating how much space to free for our checksum reservations.  The problem
      is we would sometimes free up space that had already been freed in another
      thread and we would end up with negative usage for the delalloc space.  This
      patch fixes the problem by calculating how much space the other threads have
      already freed, then calculating how much space we would need to free had we not
      done the reservation at all, and then freeing any excess.  With this,
      xfstest 275 no longer leaks space.  Thanks
      
      Cc: stable@vger.kernel.org
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      f4881bc7
  3. 22 March 2013, 1 commit
    • Btrfs: handle a bogus chunk tree nicely · 835d974f
      Authored by Josef Bacik
      If you restore a btrfs-image file system and try to mount that file system we'll
      panic.  That's because a btrfs-image restore just makes one big chunk to
      envelop the whole disk, since such images are really only meant to be messed with
      by our btrfs-progs.  So fix up btrfs_rmap_block and its callers for mount so
      that we no longer panic but instead just return an error and fail to mount.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      835d974f
  4. 15 March 2013, 1 commit
  5. 01 March 2013, 3 commits
  6. 27 February 2013, 2 commits
    • btrfs: cleanup for open-coded alignment · fda2832f
      Authored by Qu Wenruo
      Though most of the btrfs code uses the ALIGN macro for page alignment,
      some places still use open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or even hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      Sometimes this open-coded alignment is not so easy to understand for
      a newbie like me.
      
      This commit changes the open-coded alignment to the ALIGN macro for
      better readability.
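      
      For reference, the two open-coded forms above map onto the macro like
      this (ALIGN(x, a) expands to ((x) + (a) - 1) & ~((a) - 1) with the mask
      cast to x's type):
      ------
              /* explicit mask variant */
              u64 ret = ALIGN(val, root->stripesize);
      
              /* hidden variant: (end - start + blocksize) & ~(blocksize - 1)
               * rounds (end - start + 1) up to a blocksize boundary */
              num_bytes = ALIGN(end - start + 1, blocksize);
      ------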
      
      There is also a previous patch from David Sterba with similar changes,
      but it targets the 3.2 kernel and does not appear to have been merged:
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fda2832f
    • clear chunk_alloc flag on retryable failure · a81cb9a2
      Authored by Alexandre Oliva
      I've experienced filesystem freezes with permanent spikes in the active
      process count for quite a while, particularly on filesystems whose
      available raw space has already been fully allocated to chunks.
      
      While looking into this, I found a pretty obvious error in
      do_chunk_alloc: it sets space_info->chunk_alloc, but if
      btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
      that flag set, which causes any other threads waiting for
      space_info->chunk_alloc to become zero to spin indefinitely.
      
      I haven't double-checked that this patch fixes the failure I've observed
      fully (it's not exactly trivial to trigger), but it surely is a bug and
      the fix is trivial, so...  Please put it in :-)
      
      What I saw in that function also happens to explain why in some cases I
      see filesystems allocate a huge number of chunks that remain unused
      (leading to the scenario above, of not having more chunks to allocate).
      It happens for data and metadata, but not necessarily both.  I'm
      guessing some thread sets the force_alloc flag on the corresponding
      space_info, and then several threads trying to get disk space end up
      attempting to allocate a new chunk concurrently.  All of them will see
      the force_alloc flag and bump their local copy of force up to the level
      they see first, and they won't clear it even if another thread succeeds
      in allocating a chunk, thus clearing the force flag.  Then each thread
      that observed the force flag will, on its turn, force the allocation of
      a new chunk.  And any threads that come in while it does that will see
      the force flag still set and pick it up, and so on.  This sounds like a
      problem to me, but...  what should the correct behavior be?  Clear
      force_flag once we copy it to a local force?  Reset force to the
      incoming value on every loop?  Set the flag to our incoming force if we
      have it at first, clear our local flag, and move it from the space_info
      when we determined that we are the thread that's going to perform the
      allocation?
      
      btrfs: clear chunk_alloc flag on retryable failure
      
      From: Alexandre Oliva <oliva@gnu.org>
      
      If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
      without clearing chunk_alloc in space_info.  As a result, any further
      calls to do_chunk_alloc on that filesystem will start busy-waiting for
      chunk_alloc to be cleared, but it never will be.  This patch adjusts
      do_chunk_alloc so that it clears this flag in case of an error.
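      
      A hedged sketch of the shape of the fix in do_chunk_alloc(); the field
      names follow the upstream code of that era but the error path is
      simplified here:
      ------
              ret = btrfs_alloc_chunk(trans, extent_root, flags);
      
              spin_lock(&space_info->lock);
              if (ret < 0 && ret != -ENOSPC)
                      goto out;               /* retryable failure, e.g. -ENOMEM */
              if (ret)
                      space_info->full = 1;   /* -ENOSPC: no room for more chunks */
              else
                      ret = 1;
              space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
      out:
              space_info->chunk_alloc = 0;    /* always clear so waiters can proceed */
              spin_unlock(&space_info->lock);
      ------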
      Signed-off-by: Alexandre Oliva <oliva@gnu.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      a81cb9a2
  7. 21 February 2013, 9 commits
    • btrfs: limit fallocate extent reservation to 256MB · 24542bf7
      Authored by Zach Brown
      Very large fallocate requests are cpu bound and result in extents with a
      repeating pattern of ever decreasing size:
      
      $ time fallocate -l 1T file
      real	0m13.039s
      
      ( an excerpt of the extents from btrfs-debug-tree: )
        prealloc data disk byte 1536292564992 nr 397312
        prealloc data disk byte 1536292962304 nr 196608
        prealloc data disk byte 1536293158912 nr 98304
        prealloc data disk byte 1536293257216 nr 49152
        prealloc data disk byte 1536293306368 nr 24576
        prealloc data disk byte 1536293330944 nr 12288
        prealloc data disk byte 1536293343232 nr 8192
        prealloc data disk byte 1536293351424 nr 4096
        prealloc data disk byte 1536293355520 nr 4096
        prealloc data disk byte 1536293359616 nr 4096
      
      The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
      allocate the entire remaining size after each extent is allocated.
      btrfs_reserve_extent() repeatedly cuts this requested size in half until
      it gets down to the size that the allocators can return.  We limit the
      problem for now by capping each reservation at 256 meg.
      
      The small extents come from a masking bug when decreasing the requested
      reservation size.  The high 32 bits are cleared and the remaining low
      bits might happen to reserve only a small size.  Fix this by using
      round_down(), which properly casts the mask.
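      
      A standalone illustration of the masking pitfall and the round_down()
      fix; the variable names and values are made up for the example, and the
      round_down() definition mirrors the kernel's (the mask takes x's type):
      ------
              #include <stdio.h>
      
              typedef unsigned long long u64;
              typedef unsigned int u32;
      
              #define round_down(x, y) ((x) & ~((__typeof__(x))(y) - 1))
      
              int main(void)
              {
                      u32 sectorsize = 4096;
                      u64 num_bytes  = 0x10000000000ULL;  /* ~1TiB left to reserve */
      
                      /* buggy: ~(sectorsize - 1) is only 32 bits wide, so the AND
                       * zero-extends it and clears the high 32 bits of num_bytes */
                      u64 buggy = num_bytes & ~(sectorsize - 1);
      
                      /* fixed: the mask is widened to u64 before being inverted */
                      u64 fixed = round_down(num_bytes, sectorsize);
      
                      printf("buggy=%llu fixed=%llu\n", buggy, fixed);  /* 0 vs 1TiB */
                      return 0;
              }
      ------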
      
      After these fixes huge fallocate requests are fast and result in nice
      large extents:
      
      $ time fallocate -l 1T file
      real	0m0.082s
      
        prealloc data disk byte 1112425889792 nr 268435456
        prealloc data disk byte 1112694325248 nr 268435456
        prealloc data disk byte 1112962760704 nr 268435456
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      24542bf7
    • btrfs: put some enospc messages under enospc_debug · b069e0c3
      Authored by David Sterba
      The warning in use_block_rsv is not useful for users and may fill
      the logs unnecessarily.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b069e0c3
    • Btrfs: fix deadlock due to unsubmitted · 0934856d
      Authored by Miao Xie
      The deadlock problem happened when running fsstress (a test program in LTP).
      
      Steps to reproduce:
       # mkfs.btrfs -b 100M <partition>
       # mount <partition> <mnt>
       # <Path>/fsstress -p 3 -n 10000000 -d <mnt>
      
      The reason is:
      btrfs_direct_IO()
       |->do_direct_IO()
           |->get_page()
           |->get_blocks()
           |	 |->btrfs_delalloc_resereve_space()
           |	 |->btrfs_add_ordered_extent() -------	Add a new ordered extent
           |->dio_send_cur_page(page0) --------------	We didn't submit bio here
           |->get_page()
           |->get_blocks()
      	 |->btrfs_delalloc_resereve_space()
      	     |->flush_space()
      		 |->btrfs_start_ordered_extent()
      		     |->wait_event() ----------	Wait the completion of
      						the ordered extent that is
      						mentioned above
      
      But because we didn't submit the bio that is mentioned above, the ordered
      extent can not complete, we would wait for its completion forever.
      
      There are two methods which can fix this deadlock problem:
      1. submit the bio before we invoke get_blocks()
      2. reserve the space before we do dio
      
      Though the 1st is the simplest way, it requires modifying the VFS code, is
      likely to break contiguous requests, and would introduce a performance
      regression for the other filesystems.
      
      So we have to choose the 2nd way.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      0934856d
    • Btrfs: steal from global reserve if we are cleaning up orphans · 5d80366e
      Authored by Josef Bacik
      Sometimes xfstest 83 will fail to remount the scratch device because we've
      gotten ourselves so full that we cannot clean up the orphan items.  In this
      case, check whether we are doing the orphan cleanup and, if we are, allow the
      reservation to be stolen from the global block rsv.  With this patch I've
      not been able to reproduce the failed mount problem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      5d80366e
    • Btrfs: rework the overcommit logic to be based on the total size · 70afa399
      Authored by Josef Bacik
      People have been complaining about random ENOSPC errors that will clear up
      after a umount or just a given amount of time.  Chris was able to reproduce
      this with stress.sh and lots of processes and so was I.  Basically the
      overcommit stuff would really let us get out of hand, in my tests I saw up
      to 30 gigs of outstanding reservations with only 2 gigs total of metadata
      space.  This usually worked out fine but with so much outstanding
      reservation the flushing stuff short circuits to make sure we don't hang
      forever flushing when we really need ENOSPC.  Plus we allocate chunks in
      order to alleviate the pressure, but this doesn't actually help us since we
      only use the non-allocated area in our overcommit logic.
      
      So instead of basing overcommit on the amount of non-allocated space,
      instead just do it based on how much total space we have, and then limit it
      to the non-allocated space in case we are short on space to spill over into.
      This allows us to have the same performance as well as no longer giving
      random ENOSPC.  Thanks,
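      
      A rough, hedged sketch of the reworked check; the field names mirror
      btrfs_space_info but the exact scaling factors in the real patch may
      differ:
      ------
              u64 used = sinfo->bytes_used + sinfo->bytes_reserved +
                         sinfo->bytes_pinned + sinfo->bytes_readonly +
                         sinfo->bytes_may_use;
              u64 avail = unallocated_disk_space;   /* space left for new chunks */
              u64 to_add = sinfo->total_bytes;      /* base overcommit on total size */
      
              if (flush == BTRFS_RESERVE_FLUSH_ALL)
                      to_add >>= 1;                 /* allow up to 50% overcommit */
              else
                      to_add >>= 3;                 /* be more conservative otherwise */
      
              /* ...but never promise more than we could still carve out of
               * unallocated space */
              to_add = min(to_add, avail);
      
              if (used + bytes < sinfo->total_bytes + to_add)
                      return 1;                     /* the overcommit is allowed */
              return 0;
      ------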
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      70afa399
    • btrfs: remove unnecessary DEFINE_WAIT() declarations · 1971e917
      Authored by Eric Sandeen
      No point in DEFINE_WAIT(wait) if it's not used!
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      1971e917
    • Btrfs: do not overcommit if we don't have enough space for global rsv · 96f1bb57
      Authored by Josef Bacik
      Because of how rarely we allocate chunks now, we can get really tight on
      metadata space before we will allocate a new chunk.  This resulted in being
      unable to add device extents when allocating a new metadata chunk as we did
      not have enough space.  This is because we were allowed to overcommit too
      much metadata without actually making sure we had enough space to make
      allocations.  The idea behind overcommit is that we are allowed to say "sure
      you can have that reservation" when most of the free space is occupied by
      reservations, not actual allocations.  But in this case where a majority of
      the total space is in use by actual allocations we can screw ourselves by
      not being able to make real allocations when it matters.  So make sure we
      have enough real space for our global reserve, and if not then don't allow
      overcommitting.  Thanks,
      Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      96f1bb57
    • Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits · de98ced9
      Authored by Miao Xie
      There is no lock protecting
        fs_info->avail_{data, metadata, system}_alloc_bits,
      which may introduce problems such as reporting the wrong profile
      information, so add a seqlock to protect them.
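      
      A hedged sketch of the usual seqlock pattern for fields like these; the
      lock name and call sites are illustrative, not quoted from the patch:
      ------
              /* writer side: updating the avail_*_alloc_bits */
              write_seqlock(&fs_info->profiles_lock);
              fs_info->avail_data_alloc_bits |= extra_flags;
              write_sequnlock(&fs_info->profiles_lock);
      
              /* reader side: retry until a consistent snapshot is observed */
              do {
                      seq = read_seqbegin(&fs_info->profiles_lock);
                      flags = fs_info->avail_data_alloc_bits;
              } while (read_seqretry(&fs_info->profiles_lock, seq));
      ------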
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      de98ced9
    • Btrfs: use percpu counter for fs_info->delalloc_bytes · 963d678b
      Authored by Miao Xie
      fs_info->delalloc_bytes is accessed very frequently, so use a percpu
      counter instead of a plain u64 variable for it to reduce the lock
      contention.
      
      This patch also fixes the problem that we accessed the variable
      without lock protection.  At worst, we would not flush the
      delalloc inodes and would just return an ENOSPC error while we still
      had some free space in the fs.
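      
      A hedged sketch of the percpu counter usage; the two-argument init is
      the 3.x-era signature and the call sites shown are assumptions:
      ------
              /* setup, e.g. while filling the superblock */
              ret = percpu_counter_init(&fs_info->delalloc_bytes, 0);
              if (ret)
                      return ret;
      
              /* hot path: cheap per-cpu add/sub, no global lock taken */
              __percpu_counter_add(&fs_info->delalloc_bytes, len, batch);
      
              /* slow path (flush decisions): fold the per-cpu parts together */
              if (percpu_counter_sum_positive(&fs_info->delalloc_bytes) == 0)
                      return;
      
              /* teardown */
              percpu_counter_destroy(&fs_info->delalloc_bytes);
      ------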
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      963d678b
  8. 20 February 2013, 8 commits