1. 20 10月, 2011 19 次提交
    • J
      Btrfs: take overflow into account in reserving space · 9a82ca65
      Josef Bacik 提交于
      My overcommit stuff can be a little racy when we're filling up the disk with
      fs_mark and we overcommit into things that quickly get used up for data.  So use
      num_bytes to see if we have enough available space so we're less likely to
      overcommit ourselves out of the ability to make reservations.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      9a82ca65
    • J
      Btrfs: introduce mount option no_space_cache · 73bc1876
      Josef Bacik 提交于
      Some users have requested this and I've found I needed a way to disable cache
      loading without actually clearing the cache, so introduce the no_space_cache
      option.  Before we check the super blocks cache generation field and if it was
      populated we always turned space caching on.  Now we check this and set the
      space cache option on, and then parse the mount options so that if we want it
      off it get's turned off.  Then we check the mount option all the places we do
      the caching work instead of checking the super's cache generation.  This makes
      things more consistent and lets us turn space caching off.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      73bc1876
    • J
      Btrfs: allow us to overcommit our enospc reservations · 2bf64758
      Josef Bacik 提交于
      One of the things that kills us is the fact that our ENOSPC reservations are
      horribly over the top in most normal cases.  There isn't too much that can be
      done about this because when we are completely full we really need them to work
      like this so we don't under reserve.  However if there is plenty of unallocated
      chunks on the disk we can use that to gauge how much we can overcommit.  So this
      patch adds chunk free space accounting so we always know how much unallocated
      space we have.  Then if we fail to make a reservation within our allocated
      space, check to see if we can overcommit.  In the normal flushing case (like
      with delalloc metadata reservations) we'll take the free space and divide it by
      2 if our metadata profile is setup for DUP or any of those, and then divide it
      by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
      case (we really need this reservation now!) we only limit ourselves to half of
      the free space.  This makes this fio test
      
      [torrent]
      filename=torrent-test
      rw=randwrite
      size=4g
      ioengine=sync
      directory=/mnt/btrfs-test
      
      go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
      file system.  This doesn't seem to break my other enospc tests, but could really
      use some more testing as this is a super scary change.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      2bf64758
    • J
      Btrfs: check unused against how much space we actually want · ef3be457
      Josef Bacik 提交于
      There is a bug that may lead to early ENOSPC in our reservation code.  We've
      been checking against num_bytes which may be above and beyond what we want to
      actually reserve, which could give us a false ENOSPC.  Fix this by making sure
      the unused space is above how much we want to reserve and not how much we're
      trying to flush.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      ef3be457
    • J
      Btrfs: delay iput when deleting a block group · 455757c3
      Josef Bacik 提交于
      I kept getting warnings from evict because we were calling
      btrfs_start_transaction() with a transaction already started when doing a
      balance.  This is because we remove a block group which requires a transaction,
      and the put the last reference on the cache inode.  Instead of doing this we
      need to delay the iput so it is done not within a transaction having started.
      This gets rid of our warnings.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      455757c3
    • J
      Btrfs: stop passing a trans handle all around the reservation code · 4a92b1b8
      Josef Bacik 提交于
      The only thing that we need to have a trans handle for is in
      reserve_metadata_bytes and thats to know how much flushing we can do.  So
      instead of passing it around, just check current->journal_info for a
      trans_handle so we know if we can commit a transaction to try and free up space
      or not.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4a92b1b8
    • J
      Btrfs: don't get the block_rsv in btrfs_free_tree_block · d02c9955
      Josef Bacik 提交于
      Since the durable block rsv stuff has been killed there is no need to get the
      block_rsv in btrfs_free_tree_block anymore.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      d02c9955
    • J
      Btrfs: use the transactions block_rsv for the csum root · 4c13d758
      Josef Bacik 提交于
      The alloc warnings everybody has been seeing is because we have been reserving
      space for csums, but we weren't actually using that space.  So make
      get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
      Also set the trans->block_rsv to NULL so that if we modify the csum root when
      running delayed ref's that comes out of the global reserve like it's supposed
      to.  With this patch I'm not seeing those alloc warnings anymore.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4c13d758
    • J
      Btrfs: handle enospc accounting for free space inodes · c09544e0
      Josef Bacik 提交于
      Since free space inodes now use normal checksumming we need to make sure to
      account for their metadata use.  So reserve metadata space, and then if we fail
      to write out the metadata we can just release it, otherwise it will be freed up
      when the io completes.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      c09544e0
    • J
      Btrfs: don't increase the block_rsv's size when emergency allocating space · 7f701508
      Josef Bacik 提交于
      If we have to emergency reserve space we need to not increase the block_rsv
      size, otherwise we'll leak space.  Take for instance delalloc, say we reserve
      4k, and we use that 4k, and then we have to emergency allocate another 4k, we
      bump the size up to 8k, however we've only accounted for 4k in reservations in
      all of our supporting logic, so we'll go to free the 4k and end up having a size
      of 4k, which will cause us to later not free as much space.  I saw this doing
      testing where I wasn't reserving enough space for something but was still
      leaking space, very frustrating.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7f701508
    • J
      Btrfs: fix space leak when we fail to make an allocation · 7ed49f18
      Josef Bacik 提交于
      When changing back to using a spin_lock to protect the extent counters I decided
      that since we would only be dropping our original extent, it was ok to just drop
      the extent and return.  However since somebody else could have come in and done
      a reservation, we need to do the normal song and dance to clear the reservation
      out properly.  So calculate how much space we need to free, and then subtract
      what we just attempted to reserve.  If it's more then we know we need to drop
      those bytes from the delalloc block rsv.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7ed49f18
    • J
      Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check · 482e6dc5
      Josef Bacik 提交于
      If you run xfstest 224 it you will get lots of messages about not being able to
      delete inodes and that they will be cleaned up next mount.  This is because
      btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
      flush, so if there was not enough space, it simply failed.  But in truncate and
      evict case we could easily flush space to try and get enough space to do our
      work, so make btrfs_block_rsv_check take a flush argument to pass down to
      reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
      complaints.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      482e6dc5
    • J
      Btrfs: kill btrfs_truncate_reserve_metadata · 5e962c78
      Josef Bacik 提交于
      Since we've optimized the truncate path, we no longer require this function.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5e962c78
    • J
      Btrfs: don't try to commit in btrfs_block_rsv_check · 13553e52
      Josef Bacik 提交于
      We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
      because we have a transaction open it will return EAGAIN, so we do not need to
      try and commit the transaction again.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      13553e52
    • J
      Btrfs: kill unused parts of block_rsv · dabdb640
      Josef Bacik 提交于
      The priority and refill_used flags are not used anymore, and neither is the
      usage counter, so just remove them from btrfs_block_rsv.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      dabdb640
    • J
      Btrfs: kill the durable block rsv stuff · 37be25bc
      Josef Bacik 提交于
      This is confusing code and isn't used by anything anymore, so delete it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      37be25bc
    • J
      Btrfs: calculate checksum space correctly · 7709cde3
      Josef Bacik 提交于
      We have not been reserving enough space for checksums.  We were just reserving
      bytes for the checksum items themselves, we were not taking into account having
      to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
      keeping track of the number of bytes outstanding we have for checksums.  Then we
      calculate how many leaves would be required for the checksums we are given and
      use that to reserve space.  This adds a significant amount of bytes to our
      reservations, but we will handle this later.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7709cde3
    • J
      Btrfs: use bytes_may_use for all ENOSPC reservations · fb25e914
      Josef Bacik 提交于
      We have been using bytes_reserved for metadata reservations, which is wrong
      since we use that to keep track of outstanding reservations from the allocator.
      This resulted in us doing a lot of silly things to make sure we don't allocate a
      bunch of metadata chunks since we never had a real view of how much space was
      actually in use by metadata.
      
      This passes Arne's enospc test and xfstests as well as my own enospc tests.
      Hopefully this will get us moving in the right direction.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      fb25e914
    • J
      Btrfs: kill reserved_bytes in inode · 0cbbdf7c
      Josef Bacik 提交于
      reserved_bytes is not used for anything in the inode, remove it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0cbbdf7c
  2. 21 8月, 2011 1 次提交
  3. 17 8月, 2011 3 次提交
  4. 02 8月, 2011 4 次提交
  5. 28 7月, 2011 7 次提交
    • C
      Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors · 75c195a2
      Chris Mason 提交于
      The btrfs transaction code will return any errors that come from
      reserve_metadata_bytes.  We need to make sure we don't return funny
      things like 1 or EAGAIN.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      75c195a2
    • C
      Btrfs: make a lockdep class for each root · 85d4e461
      Chris Mason 提交于
      This patch was originally from Tejun Heo.  lockdep complains about the btrfs
      locking because we sometimes take btree locks from two different trees at the
      same time.  The current classes are based only on level in the btree, which
      isn't enough information for lockdep to figure out if the lock is safe.
      
      This patch makes a class for each type of tree, and lumps all the FS trees that
      actually have files and directories into the same class.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      85d4e461
    • C
      Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Chris Mason 提交于
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      bd681513
    • M
      Btrfs: fix BUG_ON() caused by ENOSPC when relocating space · 199c36ea
      Miao Xie 提交于
      When we balanced the chunks across the devices, BUG_ON() in
      __finish_chunk_alloc() was triggered.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:2568!
      [SNIP]
      Call Trace:
       [<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
       [<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
       [<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
       [<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
       [<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
       [<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
       [<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
       [<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
       [<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
       [<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
       [<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
       [<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
       [<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
       [<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
       [<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
       [<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
       [<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
       [<ffffffff81159c63>] ? generic_permission+0x23/0xb0
       [<ffffffff8115b415>] vfs_create+0xa5/0xc0
       [<ffffffff8115ce6e>] do_last+0x5fe/0x880
       [<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
       [<ffffffff8115e029>] do_filp_open+0x49/0xa0
       [<ffffffff8116a965>] ? alloc_fd+0x95/0x160
       [<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
       [<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
       [<ffffffff8114f1e0>] sys_open+0x20/0x30
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
      
      The reason is:
      Task1					Space balance task
      do_chunk_alloc()
        __finish_chunk_alloc()
          update device info
          in the chunk tree
            alloc system metadata block
      					relocate system metadata block group
      					  set system metadata block group
      					  readonly, This block group is the
      					  only one that can allocate space. So
      					  there is no free space that can be
      					  allocated now.
              find no space and don't try
              to alloc new chunk, and then
              return ENOSPC
        BUG_ON() in __finish_chunk_alloc()
        was triggered.
      
      Fix this bug by allocating a new system metadata chunk before relocating the
      old one if we find there is no free space which can be allocated after setting
      the old block group to be read-only.
      Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      199c36ea
    • J
      Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik 提交于
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do it's allocation and you
      get those lovely warnings.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9e0baf60
    • J
      Btrfs: don't flush delalloc arbitrarily · a5991428
      Josef Bacik 提交于
      Kill the check to see if we have 512mb of reserved space in delalloc and
      shrink_delalloc if we do.  This causes unexpected latencies and we have other
      logic to see if we need to throttle.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a5991428
    • J
      Btrfs: use a worker thread to do caching · bab39bf9
      Josef Bacik 提交于
      A user reported a deadlock when copying a bunch of files.  This is because they
      were low on memory and kthreadd got hung up trying to migrate pages for an
      allocation when starting the caching kthread.  The page was locked by the person
      starting the caching kthread.  To fix this we just need to use the async thread
      stuff so that the threads are already created and we don't have to worry about
      deadlocks.  Thanks,
      Reported-by: NRoman Mamedov <rm@romanrm.ru>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      bab39bf9
  6. 26 7月, 2011 2 次提交
    • M
      btrfs: don't BUG_ON allocation errors in btrfs_drop_snapshot · 38a1a919
      Mark Fasheh 提交于
      In addition to properly handling allocation failure from btrfs_alloc_path, I
      also fixed up the kzalloc error handling code immediately below it.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      38a1a919
    • M
      btrfs: Don't BUG_ON alloc_path errors in find_next_chunk · 92b8e897
      Mark Fasheh 提交于
      I also removed the BUG_ON from error return of find_next_chunk in
      init_first_rw_device(). It turns out that the only caller of
      init_first_rw_device() also BUGS on any nonzero return so no actual behavior
      change has occurred here.
      
      do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
      which can now return -ENOMEM. Instead of setting space_info->full on any
      error from btrfs_alloc_chunk() I catch and return every error value _except_
      -ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      92b8e897
  7. 15 7月, 2011 1 次提交
    • M
      btrfs: don't BUG_ON btrfs_alloc_path() errors · d8926bb3
      Mark Fasheh 提交于
      This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
      failure. All the sites that are fixed in this patch were checked by me to
      be fairly trivial to fix because of at least one of two criteria:
      
       - Callers of the function catch errors from it already so bubbling the
         error up will be handled.
       - Callers of the function might BUG_ON any nonzero return code in which
         case there is no behavior changed (but we still got to remove a BUG_ON)
      
      The following functions were updated:
      
      btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
      btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
      btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
      insert_reserved_file_extent, and run_delalloc_nocow
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      d8926bb3
  8. 11 7月, 2011 2 次提交
    • J
      Btrfs: serialize flushers in reserve_metadata_bytes · fdb5effd
      Josef Bacik 提交于
      We keep having problems with early enospc, and that's because our method of
      making space is inherently racy.  The problem is we can have one guy trying to
      make space for himself, and in the meantime people come in and steal his
      reservation.  In order to stop this we make a waitqueue and put anybody who
      comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
      make more space.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      fdb5effd
    • J
      Btrfs: do transaction space reservation before joining the transaction · b5009945
      Josef Bacik 提交于
      We have to do weird things when handling enospc in the transaction joining code.
      Because we've already joined the transaction we cannot commit the transaction
      within the reservation code since it will deadlock, so we have to return EAGAIN
      and then make sure we don't retry too many times.  Instead of doing this, just
      do the reservation the normal way before we join the transaction, that way we
      can do whatever we want to try and reclaim space, and then if it fails we know
      for sure we are out of space and we can return ENOSPC.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      b5009945
  9. 25 6月, 2011 1 次提交