1. 27 Mar, 2012 3 commits
    • J
      Btrfs: introduce free_extent_buffer_stale · 3083ee2e
      Josef Bacik committed
      Because btrfs COWs, we can end up with extent buffers that are no longer
      necessary just sitting around in memory.  So instead of evicting these pages, we
      could end up evicting things we actually care about.  Thus we have
      free_extent_buffer_stale for use when we are freeing tree blocks.  This will
      make it so that the ref for the eb being in the radix tree is dropped as soon as
      possible and then is freed when the refcount hits 0 instead of waiting to be
      released by releasepage.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • J
      Btrfs: remove search_start and search_end from find_free_extent and callers · 81c9ad23
      Josef Bacik committed
      We have been passing nothing but (u64)-1 to find_free_extent for search_end in
      all of the callers, so it's completely useless, and we've always been passing 0
      in as search_start, so just remove them as function arguments and move
      search_start into find_free_extent.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • J
      Btrfs: remove the ideal caching code · 285ff5af
      Josef Bacik committed
      This is a relic from before we had the disk space cache and it was to make
      bootup times when you had btrfs as root not be so damned slow.  Now that we have
      the disk space cache this isn't a problem anymore and really having this code
      causes unneeded fragmentation and complexity, so just remove it.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  2. 24 Feb, 2012 1 commit
  3. 23 Feb, 2012 1 commit
  4. 17 Feb, 2012 1 commit
  5. 15 Feb, 2012 1 commit
    • L
      Btrfs: fix trim 0 bytes after a device delete · 2cac13e4
      Liu Bo committed
      A user reported a bug in btrfs's trim: we trim 0 bytes after a device
      delete.
      
      The reproducer:
      
      $ mkfs.btrfs disk1
      $ mkfs.btrfs disk2
      $ mount disk1 /mnt
      $ fstrim -v /mnt
      $ btrfs device add disk2 /mnt
      $ btrfs device del disk1 /mnt
      $ fstrim -v /mnt
      
      This is because after we delete the device, the block group may start from
      a non-zero offset, which confuses trim into discarding nothing.
      Reported-by: Lutz Euler <lutz.euler@freenet.de>
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
  6. 27 Jan, 2012 1 commit
    • M
      Btrfs: fix enospc error caused by wrong checks of the chunk · 9e622d6b
      Miao Xie committed
      When we ran a sysbench test on inline files, ENOSPC errors happened easily even
      though there was plenty of free disk space that could be allocated for new chunks.
      
      Steps to reproduce:
       # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
       # mount <test partition> /mnt
       # ulimit -n 102400
       # cd /mnt
       # sysbench --num-threads=1 --test=fileio --file-num=81920 \
       > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
       > --file-test-mode=seqwr prepare
       # sysbench --num-threads=1 --test=fileio --file-num=81920 \
       > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
       > --file-test-mode=seqwr run
       <soon afterwards, BUG_ON() was triggered by the ENOSPC error>
      
      The reason for this bug is:
      We can now reserve more space than is free in the existing chunks, provided
      there is enough free disk space that can be used for new chunks. In that case,
      the space allocator should force the allocation of a new chunk if there is no
      free space in the free space cache. But two wrong checks break this
      operation.
      
      One is
      	if (ret == -ENOSPC && num_bytes > min_alloc_size)
      in btrfs_reserve_extent(). It is wrong: we should try to allocate a new chunk
      even if we fail to allocate free space at the minimum allocatable size.
      
      The other is
      	if (space_info->force_alloc)
      		force = space_info->force_alloc;
      in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE if someone
      sets ->force_alloc to CHUNK_ALLOC_LIMITED, which causes the ENOSPC error.
      
      Fix these two wrong checks. For the second one, we change the values of
      CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE so that CHUNK_ALLOC_FORCE is greater
      than CHUNK_ALLOC_LIMITED, since CHUNK_ALLOC_FORCE has higher priority. And if
      the value passed in by the caller is greater than ->force_alloc, use the
      passed value.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
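The priority rule described in commit 9e622d6b can be sketched in C as follows. This is a minimal illustration, not the kernel code: the numeric enum values and the helper names `old_effective_force` and `effective_force` are assumptions made for the example. Only the ordering (CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED) and the "use the greater value" rule come from the commit message.

```c
#include <assert.h>

/* Assumed numeric values; the message only requires FORCE > LIMITED. */
enum chunk_alloc_level {
    CHUNK_ALLOC_NO_FORCE = 0,
    CHUNK_ALLOC_LIMITED  = 1,
    CHUNK_ALLOC_FORCE    = 2
};

/* Old check as quoted in the message: any pending ->force_alloc
 * overrides what the caller asked for, even a stronger request. */
int old_effective_force(int caller_force, int pending_force_alloc)
{
    if (pending_force_alloc)
        return pending_force_alloc;
    return caller_force;
}

/* Fixed rule (hypothetical helper): keep whichever level is greater,
 * so a caller's FORCE is never downgraded to LIMITED. */
int effective_force(int caller_force, int pending_force_alloc)
{
    assert(CHUNK_ALLOC_FORCE > CHUNK_ALLOC_LIMITED);
    return caller_force > pending_force_alloc ? caller_force
                                              : pending_force_alloc;
}
```

With these assumed values, a caller passing CHUNK_ALLOC_FORCE while ->force_alloc holds CHUNK_ALLOC_LIMITED ends up with LIMITED under the old check and with FORCE under the fixed rule.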
  7. 17 Jan, 2012 10 commits
  8. 11 Jan, 2012 2 commits
    • L
      Btrfs: update global block_rsv when creating a new block group · c7c144db
      Li Zefan committed
      A bug was triggered while using a seed device:
      
          # mkfs.btrfs /dev/loop1
          # btrfstune -S 1 /dev/loop1
          # mount -o /dev/loop1 /mnt
          # btrfs dev add /dev/loop2 /mnt
      
      btrfs: block rsv returned -28
      ------------[ cut here ]------------
      WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]()
      ...
      Call Trace:
      ...
      [<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs]
      [<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs]
      [<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs]
      [<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs]
      [<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs]
      [<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs]
      [<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs]
      [<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs]
      [<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs]
      [<c04f55f7>] do_vfs_ioctl+0x496/0x4cc
      [<c04f5660>] sys_ioctl+0x33/0x4f
      [<c07b9edf>] sysenter_do_call+0x12/0x38
      ---[ end trace 906adac595facc7d ]---
      
      Since the seed device is read-only, there's no usable space in the filesystem.
      Afterwards we add a sprout device to it, and the kernel creates a METADATA
      block group and a SYSTEM block group, which provide free space we can reserve,
      but we still get a reservation failure because the global block_rsv hasn't
      been updated accordingly.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    • L
      Btrfs: don't pass a trans handle unnecessarily in volumes.c · 125ccb0a
      Li Zefan committed
      Some functions never use the transaction handle passed to them.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
  9. 08 Jan, 2012 1 commit
  10. 07 Jan, 2012 3 commits
    • A
      Btrfs: test free space only for unclustered allocation · a5f6f719
      Alexandre Oliva committed
      Since the clustered allocation may be taking extents from a different
      block group, there's no point in spin-locking and testing the current
      block group free space before attempting to allocate space from a
      cluster, even more so when we might refrain from even trying the
      cluster in the current block group because, after the cluster was set
      up, not enough free space remained.  Furthermore, cluster creation
      attempts fail fast when the block group doesn't have enough free
      space, so the test was completely superfluous.
      
      I've moved the free space test past the cluster allocation attempt,
      where it is more useful, and arranged for a cluster in the current
      block group to be released before trying an unclustered allocation,
      when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
      the cluster stands a chance of being combined with additional free
      space in the block group so as to succeed in the allocation attempt.
      Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: lower the bar for chunk allocation · cf1d72c9
      Chris Mason committed
      The chunk allocation code has tried to keep a pretty tight lid on creating new
      metadata chunks.  This is partially because in the past the reservation
      code didn't give us an accurate idea of how much space was being used.
      
      The new code is much more accurate, so we're able to get rid of some of these
      checks.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • C
      Btrfs: run chunk allocations while we do delayed refs · 203bf287
      Chris Mason committed
      Btrfs tries to batch extent allocation tree changes to improve performance
      and reduce metadata thrashing.  But it doesn't allocate new metadata chunks
      while it is doing allocations for the extent allocation tree.
      
      This commit changes the delayed reference code to do chunk allocations if we're
      getting low on room.  It prevents crashes and improves performance.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 04 Jan, 2012 2 commits
  12. 22 Dec, 2011 1 commit
    • A
      Btrfs: mark delayed refs as for cow · 66d7e7f0
      Arne Jansen committed
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
  13. 16 Dec, 2011 2 commits
    • J
      Btrfs: only set cache_generation if we setup the block group · e65cbb94
      Josef Bacik committed
      A user reported a problem booting into a new kernel with the old format inodes.
      He was panicking in cow_file_range while writing out the inode cache.  This is
      because if the block group is not cached we'll just skip writing out the cache,
      however if it gets dirtied again in the same transaction and it finished caching
      we'd go ahead and write it out, but since we set cache_generation to the transid
      we think we've already truncated it and will just carry on, running into
      cow_file_range and blowing up.  We need to make sure we only set
      cache_generation if we've done the truncate.  The user tested this patch and
      verified that the panic no longer occurred.  Thanks,
      Reported-and-Tested-by: Klaus Bitto <klaus.bitto@gmail.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • J
      Btrfs: fix how we do delalloc reservations and how we free reservations on error · 660d3f6c
      Josef Bacik committed
      Running xfstests 269 with some tracing, my scripts kept spitting out errors about
      releasing bytes that we didn't actually have reserved.  This took me down a huge
      rabbit hole and it turns out the way we deal with reserved_extents is wrong,
      we need to only be setting it if the reservation succeeds, otherwise the free()
      method will come in and unreserve space that isn't actually reserved yet, which
      can lead to other warnings and such.  The math was all working out right in the
      end, but it caused all sorts of other issues in addition to making my scripts
      yell and scream and generally make it impossible for me to track down the
      original issue I was looking for.  The other problem is with our error handling
      in the reservation code.  There are two cases that we need to deal with:
      
      1) We raced with free.  In this case free won't free anything because csum_bytes
      is modified before we drop the lock in our reservation path, so free rightly
      doesn't release any space because the reservation code may be depending on that
      reservation.  However if we fail, we need the reservation side to do the free at
      that point since that space is no longer in use.  So as it stands the code was
      doing this fine and it worked out, except in case #2
      
      2) We don't race with free.  Nobody comes in and changes anything, and our
      reservation fails.  In this case we didn't reserve anything anyway and we just
      need to clean up csum_bytes but not free anything.  So we keep track of
      csum_bytes before we drop the lock and if it hasn't changed we know we can just
      decrement csum_bytes and carry on.
      
      Because of the case where we can race with free()'s since we have to drop our
      spin_lock to do the reservation, I'm going to serialize all reservations with
      the i_mutex.  We already get this for free in the heavy use paths, truncate and
      file write all hold the i_mutex, just needed to add it to page_mkwrite and
      various ioctl/balance things.  With this patch my space leak scripts no longer
      scream bloody murder.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
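The two failure cases in commit 660d3f6c hinge on comparing csum_bytes against a snapshot taken before the lock is dropped. The C sketch below illustrates only that comparison. The struct `inode_acct` and the helpers `begin_reservation` and `fail_reservation` are hypothetical names, locking is left out, and the real kernel code computes how much to release differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode accounting. */
struct inode_acct {
    uint64_t csum_bytes;
};

/* Would run with the lock held, just before dropping it to reserve.
 * Accounts num_bytes and returns the snapshot of csum_bytes. */
uint64_t begin_reservation(struct inode_acct *acct, uint64_t num_bytes)
{
    acct->csum_bytes += num_bytes;
    return acct->csum_bytes;
}

/* Would run with the lock re-taken after the reservation failed.
 * Returns true when csum_bytes changed in the meantime (case 1: we raced
 * with free, so the reservation side must do the free itself), and false
 * when nothing changed (case 2: only csum_bytes needs cleaning up). */
bool fail_reservation(struct inode_acct *acct, uint64_t snapshot,
                      uint64_t num_bytes)
{
    bool raced_with_free = (acct->csum_bytes != snapshot);

    assert(acct->csum_bytes >= num_bytes);
    acct->csum_bytes -= num_bytes;
    return raced_with_free;
}
```

In both cases csum_bytes is decremented; the return value only tells the caller whether it also has to release space that a concurrent free deliberately left alone.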
  14. 08 Dec, 2011 2 commits
  15. 01 Dec, 2011 4 commits
  16. 20 Nov, 2011 1 commit
    • J
      Btrfs: wait on caching if we're loading the free space cache · 291c7d2f
      Josef Bacik committed
      We've been hitting panics when running xfstest 13 in a loop for long periods of
      time.  And actually this problem has always existed so we've been hitting these
      things randomly for a while.  Basically what happens is we get a thread coming
      into the allocator and reading the space cache off of disk and adding the
      entries to the free space cache as we go.  Then we get another thread that comes
      in and tries to allocate from that block group.  Since block_group->cached !=
      BTRFS_CACHE_NO it goes ahead and tries to do the allocation.  We do this because
      if we're doing the old slow way of caching we don't want to hold people up and
      wait for everything to finish.  The problem with this is we could end up
      discarding the space cache at some arbitrary point in the future, which means we
      could very well end up allocating space that is either bad, or when the real
      caching happens it could end up thinking the space isn't in use when it really
      is and cause all sorts of other problems.
      
      The solution is to add a new flag to indicate we are loading the free space
      cache from disk, and always try to cache the block group if cache->cached !=
      BTRFS_CACHE_FINISHED.  That way if we are loading the space cache anybody else
      who tries to allocate from the block group will have to wait until it's finished
      to make sure it completes successfully.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
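The change in commit 291c7d2f amounts to a stricter gate in front of the allocator. The C sketch below models it under stated assumptions: the enum values, the BTRFS_CACHE_STARTED state, the struct `bg_state`, its `loading_from_disk` flag, and both helper names are illustrative and not taken from the commit. Only BTRFS_CACHE_NO, BTRFS_CACHE_FINISHED, and the idea of a "loading from disk" flag appear in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering and values of the caching states. */
enum cache_state {
    BTRFS_CACHE_NO = 0,
    BTRFS_CACHE_STARTED = 1,
    BTRFS_CACHE_FINISHED = 2
};

/* Hypothetical view of one block group. */
struct bg_state {
    enum cache_state cached;
    bool loading_from_disk; /* stands in for the new flag */
};

/* Old gate per the message: anything other than CACHE_NO may allocate. */
bool old_may_allocate(const struct bg_state *bg)
{
    assert(bg);
    return bg->cached != BTRFS_CACHE_NO;
}

/* New gate: a fully cached group may allocate; a group whose space cache
 * is still being loaded from disk makes the caller wait; a group in the
 * old slow caching mode does not hold the caller up. */
bool may_allocate_now(const struct bg_state *bg)
{
    assert(bg);
    if (bg->cached == BTRFS_CACHE_FINISHED)
        return true;
    if (bg->loading_from_disk)
        return false;
    return bg->cached == BTRFS_CACHE_STARTED;
}
```

The difference shows up for a group that is mid-load from disk: the old gate lets the allocation proceed against a cache that may still be discarded, while the new gate makes the caller wait.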
  17. 11 Nov, 2011 1 commit
  18. 09 Nov, 2011 1 commit
    • J
      Btrfs: fix our reservations for updating an inode when completing io · 7fd2ae21
      Josef Bacik committed
      People have been reporting ENOSPC crashes in finish_ordered_io.  This is because
      we try to steal from the delalloc block rsv to satisfy a reservation to update
      the inode.  The problem with this is we don't explicitly save space for updating
      the inode when doing delalloc.  This is kind of a problem and we've gotten away
      with this because way back when we just stole from the delalloc reserve without
      any questions, and this worked out fine because generally speaking the leaf had
      been modified either by the mtime update when we did the original write or
      because we just updated the leaf when we inserted the file extent item, only on
      rare occasions had the leaf not actually been modified, and that was still ok
      because we'd just use a block or two out of the over-reservation that is
      delalloc.
      
      Then came the delayed inode stuff.  This is amazing, except it wants a full
      reservation for updating the inode since it may do it at some point down the
      road after we've written the blocks and we have to recow everything again.  This
      worked out because the delayed inode stuff just stole from the global reserve,
      that is until recently when I changed that because it caused other problems.
      
      So here we are, we're doing everything right and being screwed for it.  So take
      an extra reservation for the inode at delalloc reservation time and carry it
      through the life of the delalloc reservation.  If we need it we can steal it in
      the delayed inode stuff.  If we have already stolen it try and do a normal
      metadata reservation.  If that fails try to steal from the delalloc reservation.
      If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
      solve this and in the meantime we'll steal from the global reserve.
      
      With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
      any problems.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
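The fallback order that commit 7fd2ae21 describes for the inode update reservation can be written down as a small decision function. This C sketch is only a model of that order: the enum `rsv_source`, the struct `rsv_view`, and the function `pick_inode_update_rsv` are hypothetical names, and the boolean inputs stand in for reservation attempts that the kernel performs for real.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the space for the inode update ends up coming from. */
enum rsv_source {
    RSV_EXTRA_INODE,      /* the extra reservation taken at delalloc time */
    RSV_METADATA,         /* a normal metadata reservation */
    RSV_DELALLOC,         /* stolen from the delalloc reservation */
    RSV_GLOBAL_WITH_WARN  /* WARN_ON(), then steal from the global reserve */
};

/* Hypothetical outcome of each attempt, supplied by the caller. */
struct rsv_view {
    bool extra_inode_rsv_available; /* false once it has been stolen */
    bool metadata_rsv_ok;
    bool delalloc_steal_ok;
};

/* Try each source in the order given in the commit message. */
enum rsv_source pick_inode_update_rsv(const struct rsv_view *v)
{
    assert(v);
    if (v->extra_inode_rsv_available)
        return RSV_EXTRA_INODE;
    if (v->metadata_rsv_ok)
        return RSV_METADATA;
    if (v->delalloc_steal_ok)
        return RSV_DELALLOC;
    return RSV_GLOBAL_WITH_WARN;
}
```

The last branch corresponds to the WARN_ON() path, which the message treats as a signal that a better solution is still needed.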
  19. 06 Nov, 2011 2 commits
    • J
      Btrfs: fix delayed insertion reservation · c06a0e12
      Josef Bacik committed
      We all keep getting those stupid warnings from use_block_rsv when running
      stress.sh, and it's because the delayed insertion stuff is being stupid.  It's
      not the delayed insertion stuff's fault, it's all just stupid.  When marking an
      inode dirty for oh say updating the time on it, we just do a
      btrfs_join_transaction, which doesn't reserve any space.  This is stupid because
      we're going to have to have space reserved to make this change, but we do it
      because it's fast because chances are we're going to call it over and over again
      and it doesn't matter.  Well thanks to the delayed insertion stuff this is
      mostly the case, so we do actually need to make this reservation.  So if
      trans->bytes_reserved is 0 then try to do a normal reservation.  If not return
      ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
      will let it do the whole ENOSPC dance and reserve enough space for the delayed
      insertion to steal the reservation from the transaction.
      
      The other stupid thing we do is not reserve space for the inode when writing to
      the thing.  Usually this is ok since we have to update the time so we'd have
      already done all this work before we get to the endio stuff, so it doesn't
      matter.  But this is stupid because we could write the data after the
      transaction commits where we changed the mtime of the inode so we have to cow
      all the way down to the inode anyway.  This used to be masked by the delalloc
      reservation stuff, but because we delay the update it doesn't get masked in this
      case.  So again the delayed insertion stuff bites us in the ass.  So if our
      trans->block_rsv is delalloc, just steal the reservation from the delalloc
      reserve.  Hopefully this won't bite us in the ass, but I've said that before.
      
      With this patch stress.sh no longer spits out those stupid warnings (famous last
      words).  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • J
      Btrfs: be smarter about committing the transaction in reserve_metadata_bytes · 663350ac
      Josef Bacik committed
      Because of the overcommit stuff I had to make it so that we committed the
      transaction all the time in reserve_metadata_bytes in case we had overcommitted
      because of delayed items.  This was because previously we had no way of knowing
      how much space was reserved for delayed items.  Now that we have the
      delayed_block_rsv we can check it to see if committing the transaction would get
      us anywhere.  This patch breaks out the committing logic into a helper function
      that will check to see if committing the transaction would free enough space for
      us to get anything done.  With this patch xfstests 83 goes from taking 445
      seconds to taking 28 seconds on my box.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
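The helper described in commit 663350ac decides whether a transaction commit could free enough space to be worth doing. A minimal C sketch of such a check follows. It assumes that a commit can release pinned bytes plus whatever is held in the delayed_block_rsv; the struct `space_view`, its fields, and the function `commit_would_help` are hypothetical, and the kernel's actual accounting may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what a commit could give back. */
struct space_view {
    uint64_t bytes_pinned;     /* assumed: space released at commit time */
    uint64_t delayed_rsv_size; /* assumed: space held for delayed items */
};

/* Return true only if committing could free at least bytes_needed.
 * The subtraction happens after the first comparison, so it cannot
 * underflow. */
bool commit_would_help(const struct space_view *s, uint64_t bytes_needed)
{
    assert(s);
    if (s->bytes_pinned >= bytes_needed)
        return true;
    return s->delayed_rsv_size >= bytes_needed - s->bytes_pinned;
}
```

Skipping the commit when this returns false is the kind of shortcut the message credits for the xfstests 83 speedup.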