1. 02 10月, 2012 9 次提交
    • J
      Btrfs: remove bytes argument from do_chunk_alloc · 698d0082
      Josef Bacik 提交于
      Everybody is just making stuff up, and it's just used to see if we really do
      need to alloc a chunk, and since we do this when we already know we really
      do it's just a waste of space.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      698d0082
    • J
      Btrfs: delay block group item insertion · ea658bad
      Josef Bacik 提交于
      So we have lots of places where we try to preallocate chunks in order to
      make sure we have enough space as we make our allocations.  This has
      historically meant that we're constantly tweaking when we should allocate a
      new chunk, and historically we have gotten this horribly wrong so we way
      over allocate either metadata or data.  To try and keep this from happening
      we are going to make it so that the block group item insertion is done out
      of band at the end of a transaction.  This will allow us to create chunks
      even if we are trying to make an allocation for the extent tree.  With this
      patch my enospc tests run faster (didn't expect this) and more efficiently
      use the disk space (this is what I wanted).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ea658bad
    • J
      Btrfs: fix our overcommit math · a80c8dcf
      Josef Bacik 提交于
      I noticed I was seeing large lags when running my torrent test in a vm on my
      laptop.  While trying to make it lag less I noticed that our overcommit math
      was taking into account the number of bytes we wanted to reclaim, not the
      number of bytes we actually wanted to allocate, which means we wouldn't
      overcommit as often.  This patch fixes the overcommit math and makes
      shrink_delalloc() use that logic so that it will stop looping faster.  We
      still have pretty high spikes of latency, but the test now takes 3 minutes
      less time (about 5% faster).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      a80c8dcf
    • J
      Btrfs: wait on async pages when shrinking delalloc · dea31f52
      Josef Bacik 提交于
      Mitch reported a problem where you could get an ENOSPC error when untarring
      a kernel git tree onto a 16gb file system with compress-force=zlib.  This is
      because compression is a huge pain, it will return from ->writepages()
      without having actually created any ordered extents.  To get around this we
      check to see if the async submit counter is up, and if it is wait until it
      drops to 0 before doing our normal ordered wait dance.  With this patch I
      can now untar a kernel git tree onto a 16gb file system without getting
      ENOSPC errors.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      dea31f52
    • M
      Btrfs: fix wrong size for the reservation of the, snapshot creation · 48c03c4b
      Miao Xie 提交于
      We should insert/update 6 items(root ref, root backref, dir item, dir index,
      root item and parent inode) when creating a snapshot, not 5 items, fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      48c03c4b
    • M
      Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie 提交于
      Sometimes we need choose the method of the reservation according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Now we identify the type just by comparing the address of the reservation
      variants, it is very ugly if it is a temporary one because we need compare it
      with all the common reservation variants. So we add a new "type" field to keep
      the type the reservation variants.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      66d8f3dd
    • J
      Btrfs: add hole punching · 2aaa6655
      Josef Bacik 提交于
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2aaa6655
    • J
      Btrfs: do not needlessly restart the transaction for enospc · ca7e70f5
      Josef Bacik 提交于
      We will stop and restart a transaction every time we move to a different leaf
      when truncating a file.  This is for enospc reasons, but really we could
      probably get away with doing this a little better by actually working until we
      hit an ENOSPC.  So add a ->failfast flag to the block_rsv and set it when we do
      truncates which will fail as soon as the block rsv runs out of space, and then
      at that point we can stop and restart the transaction and refill the block rsv
      and carry on.  This will make rm'ing of a file with lots of extents a bit
      faster.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ca7e70f5
    • J
      Btrfs: do not allocate chunks as agressively · 54338b5c
      Josef Bacik 提交于
      Swinging this pendulum back the other way.  We've been allocating chunks up
      to 2% of the disk no matter how much we actually have allocated.  So instead
      fix this calculation to only allocate chunks if we have more than 80% of the
      space available allocated.  Please test this as it will likely cause all
      sorts of ENOSPC problems to pop up suddenly.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      54338b5c
  2. 29 8月, 2012 5 次提交
    • J
      Btrfs: allow delayed refs to be merged · ae1e206b
      Josef Bacik 提交于
      Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
      Basically what happens is the balance relocates a tree block which will drop
      the implicit refs for all of its children and adds a full backref.  Once the
      block is relocated we have to add the implicit refs back, so when we cow the
      block again we add the implicit refs for its children back.  The problem
      comes when the original drop ref doesn't get run before we add the implicit
      refs back.  The delayed ref stuff will specifically prefer ADD operations
      over DROP to keep us from freeing up an extent that will have references to
      it, so we try to add the implicit ref before it is actually removed and we
      panic.  This worked fine before because the add would have just canceled the
      drop out and we would have been fine.  But the backref walking work needs to
      be able to freeze the delayed ref stuff in time so we have this ever
      increasing sequence number that gets attached to all new delayed ref updates
      which makes us not merge refs and we run into this issue.
      
      So to fix this we need to merge delayed refs.  So everytime we run a
      clustered ref we need to try and merge all of its delayed refs.  The backref
      walking stuff locks the delayed ref head before processing, so if we have it
      locked we are safe to merge any refs inside of the sequence number.  If
      there is no sequence number we can merge all refs.  Doing this not only
      fixes our bug but keeps the delayed ref code from adding and removing
      useless refs and batching together multiple refs into one search instead of
      one search per delayed ref, which will really help our commit times.  I ran
      this with Daniels test and 276 and I haven't seen any problems.  Thanks,
      Reported-by: NDaniel J Blueman <daniel@quora.org>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ae1e206b
    • A
      Btrfs: fix race in run_clustered_refs · 22cd2e7d
      Arne Jansen 提交于
      With commit
      
      commit d1270cd9
      Author: Arne Jansen <sensille@gmx.net>
      Date:   Tue Sep 13 15:16:43 2011 +0200
      
           Btrfs: put back delayed refs that are too new
      
      I added a window where the delayed_ref's head->ref_mod code can diverge
      from the sum of the remaining refs, because we release the head->mutex
      in the middle. This leads to btrfs_lookup_extent_info returning wrong
      numbers. This patch fixes this by adjusting the head's ref_mod with each
      delayed ref we run.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      22cd2e7d
    • J
      Btrfs: increase the size of the free space cache · 6fc823b1
      Josef Bacik 提交于
      Arne was complaining about the space cache having mismatching generation
      numbers when debugging a deadlock.  This is because we can run out of space
      in our preallocated range for our space cache if you have a pretty
      fragmented amount of space in your pinned space.  So just increase the
      amount of space we preallocate for space cache so we can be sure to have
      enough space.  This will only really affect data ranges since their the only
      chunks that end up larger than 256MB.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      6fc823b1
    • A
      Btrfs: fix deadlock in wait_for_more_refs · 1fa11e26
      Arne Jansen 提交于
      Commit a168650c introduced a waiting mechanism to prevent busy waiting in
      btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
      where a tree_mod_seq is held while waiting for the io to complete, while
      the end_io calls btrfs_run_delayed_refs.
      This whole mechanism is unnecessary. If not enough runnable refs are
      available to satisfy count, just return as count is more like a guideline
      than a strict requirement.
      In case we have to run all refs, commit transaction makes sure that no
      other threads are working in the transaction anymore, so we just assert
      here that no refs are blocked.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      1fa11e26
    • D
      Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() · 55e591ff
      Dan Carpenter 提交于
      We should release this mutex before returning the error code.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      55e591ff
  3. 26 7月, 2012 1 次提交
  4. 24 7月, 2012 9 次提交
    • L
      Btrfs: make btrfs's allocation smoothly with preallocation · df57dbe6
      Liu Bo 提交于
      For backref walking, we've introduce delayed ref's sequence.  However,
      it changes our preallocation behavior.
      
      The story is that when we preallocate an extent and then mark it written
      piece by piece, the ideal case should be that we don't need to COW the
      extent, which is why we use 'preallocate'.
      
      But we may not make use of preallocation, since when we check for cross refs on
      the extent, we may have two ref entries which have the same content except
      the sequence value, and we recognize them as cross refs and do COW to allocate
      another extent.
      
      So we end up with several pieces of space instead of an whole extent.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      df57dbe6
    • L
      Btrfs: kill free_space pointer from inode structure · b4d7c3c9
      Li Zefan 提交于
      Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
      which means every inode has the same BTRFS_I(inode)->free_space pointer.
      
      This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      b4d7c3c9
    • L
      Btrfs: add ro notification to dump_space_info · 799ffc3c
      Liu Bo 提交于
      Block group has ro attributes, make dump_space_info show it.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      799ffc3c
    • L
      Btrfs: fix a bug of writting free space cache during balance · cf7c1ef6
      Liu Bo 提交于
      Here is the whole story:
      1)
      A free space cache consists of two parts:
      o  free space cache inode, which is special becase it's stored in root tree.
      o  free space info, which is stored as the above inode's file data.
      
      But we only build up another new inode and does not flush its free space info
      onto disk when we _clear and setup_ free space cache, and this ends up with
      that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
      
      And holding DC_SETUP means that we will not truncate this free space cache inode,
      which means the disk offset of its file extent will remain _unchanged_ at least
      until next transaction finishes committing itself.
      
      2)
      We can set a block group readonly when we relocate the block group.
      
      However,
      if the readonly block group covers the disk offset where our free space cache
      inode is going to write, it will force the free space cache inode into
      cow_file_range() and it'll end up hitting a BUG_ON.
      
      3)
      Due to the above analysis, we fix this bug by adding the missing dirty flag.
      
      4)
      However, it's not over, there is still another case, nospace_cache.
      
      With nospace_cache, we do not want to set dirty flag, instead we just truncate
      free space cache inode and bail out with setting cache state DC_WRITTEN.
      
      We can benifit from it since it saves us another 'pre-allocation' part which
      usually costs a lot.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      cf7c1ef6
    • L
      Btrfs: do not abort transaction in prealloc case · 06789384
      Liu Bo 提交于
      During disk balance, we prealloc new file extent for file data relocation,
      but we may fail in 'no available space' case, and it leads to flipping btrfs
      into readonly.
      
      It is not necessary to bail out and abort transaction since we do have several
      ways to rescue ourselves from ENOSPC case.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      06789384
    • L
      Btrfs: kill root from btrfs_is_free_space_inode · 83eea1f1
      Liu Bo 提交于
      Since root can be fetched via BTRFS_I macro directly, we can save an args
      for btrfs_is_free_space_inode().
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      83eea1f1
    • J
      Btrfs: rework shrink_delalloc · f4c738c2
      Josef Bacik 提交于
      So shrink_delalloc has grown all sorts of cruft over the years thanks to
      many reworkings of how we track enospc.  What happens now as we fill up the
      disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
      of space of metadata, this was from when everybody flushed at the same time.
      Now we only have people flushing one at a time.  So instead of trying to
      reclaim a huge amount of space, just try to flush a decent chunk of space,
      and stop looping as soon as we have enough free space to satisfy our
      reservation.  This makes xfstests 224 go much faster.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f4c738c2
    • J
      Btrfs: change how we indicate we're adding csums · 0e721106
      Josef Bacik 提交于
      There is weird logic I had to put in place to make sure that when we were
      adding csums that we'd used the delalloc block rsv instead of the global
      block rsv.  Part of this meant that we had to free up our transaction
      reservation before we ran the delayed refs since csum deletion happens
      during the delayed ref work.  The problem with this is that when we release
      a reservation we will add it to the global reserve if it is not full in
      order to keep us going along longer before we have to force a transaction
      commit.  By releasing our reservation before we run delayed refs we don't
      get the opportunity to drain down the global reserve for the work we did, so
      we won't refill it as often.  This isn't a problem per-se, it just results
      in us possibly committing transactions more and more often, and in rare
      cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
      out of space in our block rsv.
      
      This also helps us by holding onto space while the delayed refs run so we
      don't end up with as many people trying to do things at the same time, which
      again will help us not force commits or hit the use_block_rsv warnings.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0e721106
    • J
      Btrfs: flush delayed inodes if we're short on space · 96c3f433
      Josef Bacik 提交于
      Those crazy gentoo guys have been complaining about ENOSPC errors on their
      portage volumes.  This is because doing things like untar tends to create
      lots of new files which will soak up all the reservation space in the
      delayed inodes.  Usually this gets papered over by the fact that we will try
      and commit the transaction, however if this happens in the wrong spot or we
      choose not to commit the transaction you will be screwed.  So add the
      ability to expclitly flush delayed inodes to free up space.  Please test
      this out guys to make sure it works since as usual I cannot reproduce.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      96c3f433
  5. 12 7月, 2012 3 次提交
  6. 10 7月, 2012 2 次提交
  7. 27 6月, 2012 1 次提交
    • J
      Btrfs: avoid waiting for delayed refs when we must not · 8ca78f3e
      Jan Schmidt 提交于
      We track two conditions to decide if we should sleep while waiting for more
      delayed refs, the number of delayed refs (num_refs) and the first entry in
      the list of blockers (first_seq).
      
      When we suspect staleness, we save num_refs and do one more cycle. If
      nothing changes, we then save first_seq for later comparison and do
      wait_event. We ought to save first_seq the very same moment we're saving
      num_refs. Otherwise we cannot be sure that nothing has changed and we might
      start waiting when we shouldn't, which could lead to starvation.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      8ca78f3e
  8. 30 5月, 2012 1 次提交
    • J
      Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Josef Bacik 提交于
      Miao pointed this out while I was working on an orphan problem that messing
      with a bitfield where different ranges are protected by different locks
      doesn't work out right.  Turns out we've been doing this forever where we
      have different parts of the bit field protected by either no lock at all or
      different locks which could cause all sorts of weird problems including the
      issue I was hitting.  So instead make a runtime_flags thing that we use the
      normal bit operations on that are all atomic so we can keep having our
      no/different locking for the different flags and then make force_compress
      it's own thing so it can be treated normally.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      72ac3c0d
  9. 26 5月, 2012 1 次提交
  10. 11 5月, 2012 1 次提交
  11. 06 5月, 2012 1 次提交
    • C
      Btrfs: avoid sleeping in verify_parent_transid while atomic · b9fab919
      Chris Mason 提交于
      verify_parent_transid needs to lock the extent range to make
      sure no IO is underway, and so it can safely clear the
      uptodate bits if our checks fail.
      
      But, a few callers are using it with spinlocks held.  Most
      of the time, the generation numbers are going to match, and
      we don't want to switch to a blocking lock just for the error
      case.  This adds an atomic flag to verify_parent_transid,
      and changes it to return EAGAIN if it needs to block to
      properly verifiy things.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9fab919
  12. 28 4月, 2012 1 次提交
  13. 19 4月, 2012 2 次提交
    • A
      btrfs: don't return EINTR · b9688bb8
      Arne Jansen 提交于
      It is basically a good thing if we are interruptible when waiting for
      free space, but the generality in which it is implemented currently
      leads to system calls being interruptible that are not documented this
      way. For example git can't handle interrupted unlink(), leading to
      corrupt repos under space pressure.
      Instead we raise the bar to only be interruptible by SIGKILL.
      Thanks to David Sterba for suggesting this.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      b9688bb8
    • D
      Btrfs: double unlock bug in error handling · 253beebd
      Dan Carpenter 提交于
      The caller expects this function to return with the lock held and
      releases it immediately on error.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      253beebd
  14. 13 4月, 2012 3 次提交