1. 26 Mar 2018, 2 commits
  2. 29 Jan 2018, 1 commit
  3. 25 Jan 2018, 1 commit
    • Btrfs: fix stale entries in readdir · e4fd493c
      Authored by Josef Bacik
      In fixing the readdir+pagefault deadlock I accidentally introduced a
      stale entry regression in readdir.  If we get close to filling the
      temporary buffer, then skip a few delayed deletions, and then try to
      add another entry that won't fit, we emit the entries we found and
      retry.  Unfortunately we were deleting entries from our del_list as
      we found them, assuming we wouldn't need them again.  However our pos
      will still be wherever our last entry was, which could be before the
      delayed deletions we skipped, so the next search will add the deleted
      entries back into our readdir buffer.  So instead don't delete
      entries we find in our del_list, so that we always find our delayed
      deletions.  This is a slight perf hit for readdir with lots of
      pending deletions, but hopefully this isn't a common occurrence.  If
      it is we can revisit this and optimize it.
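      
      A minimal sketch of the changed del_list check, assuming a simplified
      entry type (the names below are hypothetical; the real code lives in
      fs/btrfs/delayed-inode.c):
      
      /*
       * Hedged sketch, not the actual patch: consult the delayed-deletion
       * list without removing matches, so a retry after a partial emit
       * still sees every pending deletion.
       */
      struct del_entry {                      /* hypothetical type */
              struct list_head list;
              u64 index;
      };
      
      static bool should_skip_dirent(struct list_head *del_list, u64 index)
      {
              struct del_entry *entry;
      
              list_for_each_entry(entry, del_list, list) {
                      if (entry->index == index)
                              return true;    /* skip, but keep it listed */
              }
              return false;
      }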
      
      cc: stable@vger.kernel.org
      Fixes: 23b5ec74 ("btrfs: fix readdir deadlock with pagefault")
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 22 Jan 2018, 2 commits
  5. 03 Jan 2018, 1 commit
    • btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes · ec35e48b
      Authored by Chris Mason
      Refcounts have a generic implementation and an asm-optimized one.  The
      generic version has extra debugging to make sure that once a refcount
      goes to zero, refcount_inc won't increase it.
      
      The btrfs delayed inode code wasn't expecting this, and we're tripping
      over the warnings when the generic refcounts are used.  We ended up with
      this race:
      
      Process A                                    Process B
                                                   btrfs_get_delayed_node()
                                                     spin_lock(root->inode_lock)
                                                     radix_tree_lookup()
      __btrfs_release_delayed_node()
        refcount_dec_and_test(&delayed_node->refs)
        our refcount is now zero
                                                   refcount_add(2) <---
                                                     warning here, refcount
                                                     unchanged
      
      spin_lock(root->inode_lock)
      radix_tree_delete()
      
      With the generic refcounts, we actually warn again when process B
      above tries to release its refcount, because refcount_add() turned
      into a no-op.
      
      We saw this in production on older kernels without the asm optimized
      refcounts.
      
      The fix used here is to use refcount_inc_not_zero() to detect when the
      object is in the middle of being freed and return NULL.  This is almost
      always the right answer anyway, since we usually end up pitching the
      delayed_node if it didn't have fresh data in it.
      
      This also changes __btrfs_release_delayed_node() to remove the extra
      check for zero refcounts before radix tree deletion.
      btrfs_get_delayed_node() was the only path that was allowing refcounts
      to go from zero to one.
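      
      A minimal sketch of the lookup-side change, assuming simplified
      locking (the field names follow struct btrfs_root, but this is not
      the verbatim patch, which also takes a second reference for the
      caller's inode):
      
      static struct btrfs_delayed_node *get_delayed_node(
                      struct btrfs_root *root, u64 ino)
      {
              struct btrfs_delayed_node *node;
      
              spin_lock(&root->inode_lock);
              node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
              /* a refcount of zero means the node is mid-free: back off */
              if (node && !refcount_inc_not_zero(&node->refs))
                      node = NULL;
              spin_unlock(&root->inode_lock);
              return node;
      }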
      
      Fixes: 6de5f18e ("btrfs: convert btrfs_delayed_node.refs from atomic_t to refcount_t")
      CC: <stable@vger.kernel.org> # 4.12+
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 02 Nov 2017, 1 commit
    • btrfs: make the delalloc block rsv per inode · 69fe2d75
      Authored by Josef Bacik
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
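      
      A toy model of the per-inode idea; every name below is hypothetical
      and the size formula is deliberately simplified:
      
      /* Hedged sketch: recompute the target on every change, adjust by delta. */
      struct toy_inode {
              u64 outstanding_extents;        /* extents not yet on disk */
              u64 csum_bytes;                 /* pending checksum metadata */
              u64 reserved;                   /* what this inode holds now */
      };
      
      static void recalc_inode_rsv(struct toy_inode *inode, u64 nodesize)
      {
              u64 needed = inode->outstanding_extents * nodesize +
                           inode->csum_bytes;
      
              if (needed > inode->reserved)
                      reserve_metadata(needed - inode->reserved);  /* hypothetical */
              else
                      release_metadata(inode->reserved - needed);  /* hypothetical */
              inode->reserved = needed;
      }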
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 18 Aug 2017, 1 commit
  8. 18 Apr 2017, 2 commits
  9. 28 Feb 2017, 1 commit
  10. 14 Feb 2017, 14 commits
  11. 14 Dec 2016, 1 commit
    • btrfs: limit async_work allocation and worker func duration · 2939e1a8
      Authored by Maxim Patlasov
      Problem statement: an unprivileged user who has read-write access to
      more than one btrfs subvolume may easily consume all kernel memory
      (eventually triggering the oom-killer).
      
      Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):
      
      [root@kteam1 ~]# cat prep.sh
      
      DEV=/dev/sdb
      mkfs.btrfs -f $DEV
      mount $DEV /mnt
      for i in `seq 1 16`
      do
      	mkdir /mnt/$i
      	btrfs subvolume create /mnt/SV_$i
      	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
      	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
      	chmod a+rwx /mnt/$i
      done
      
      [root@kteam1 ~]# sh prep.sh
      
      [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done
      
      [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
      kmalloc-128        10144  10144    128   32    1 : tunables    0    0    0 : slabdata    317    317      0
      kmalloc-128       9992352 9992352    128   32    1 : tunables    0    0    0 : slabdata 312261 312261      0
      kmalloc-128       24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata 757086 757086      0
      kmalloc-128       42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0
      
      The huge numbers above come from the insane number of async_work-s
      allocated and queued by btrfs_wq_run_delayed_node.
      
      The problem is caused by btrfs_wq_run_delayed_node() queuing more and
      more works if the number of delayed items is above
      BTRFS_DELAYED_BACKGROUND.  The worker func
      (btrfs_async_run_delayed_root) processes at least BTRFS_DELAYED_BATCH
      items (if they are present in the list).  So the machinery works as
      expected while the list is almost empty.  As soon as it gets bigger,
      the worker func starts to process more than one item at a time, each
      run takes longer, and the chance of having more async_works queued
      than needed gets higher.
      
      The problem above is worsened by another flaw of the delayed-inode
      implementation: if an async_work was queued in the throttling branch
      (number of items >= BTRFS_DELAYED_WRITEBACK), the corresponding
      worker func won't quit until the number of items drops below
      BTRFS_DELAYED_BACKGROUND / 2.  So it is possible for the func to
      occupy a CPU indefinitely (up to 30 sec in my experiments): while the
      func is trying to drain the list, user activity may keep adding more
      items to it.
      
      The patch fixes both problems in a straightforward way: refuse to
      queue too many works in btrfs_wq_run_delayed_node, and bail out of
      the worker func once at least BTRFS_DELAYED_WRITEBACK items have been
      processed, as in the sketch below.
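      
      A hedged sketch of the two guards (queued_works, the cap value, and
      the helpers are hypothetical, not the actual patch):
      
      #define MAX_QUEUED_WORKS 16             /* hypothetical cap */
      
      static void maybe_queue_delayed_work(struct delayed_root *root)
      {
              /* Guard 1: refuse to queue yet another async_work. */
              if (atomic_read(&root->queued_works) >= MAX_QUEUED_WORKS)
                      return;
              atomic_inc(&root->queued_works);
              queue_async_work(root);         /* hypothetical helper */
      }
      
      static void delayed_worker_fn(struct delayed_root *root)
      {
              int done = 0;
      
              while (!list_empty(&root->items)) {
                      done += process_one_batch(root); /* hypothetical helper */
                      /* Guard 2: bound the work done per invocation. */
                      if (done >= BTRFS_DELAYED_WRITEBACK)
                              break;
              }
              atomic_dec(&root->queued_works);
      }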
      
      Changed in v2: removed support for thresh == NO_THRESHOLD.
      Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v3.15+
  12. 06 Dec 2016, 5 commits
  13. 30 Nov 2016, 1 commit
    • btrfs: increment ctx->pos for every emitted or skipped dirent in readdir · d2fbb2b5
      Authored by Jeff Mahoney
      If we process the last item in the leaf and hit an I/O error while
      reading the next leaf, we return -EIO without having adjusted the
      position.  Since we have emitted dirents, getdents() will return
      the byte count to the user instead of the error.  Subsequent callers
      will emit the last successful dirent again, and return -EIO again,
      with the same result.  Callers loop forever.
      
      Instead, if we always increment ctx->pos after emitting or skipping
      the dirent, we'll be sure that we won't hit the same one again.  When
      we go to process the next leaf, we won't have emitted any dirents
      and the -EIO will be returned to the user properly.  We also don't
      need to track if we've emitted a dirent already or if we've changed
      the position yet.
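      
      A hedged sketch of the resulting loop invariant (the iterator, the
      entry type, and should_skip_dirent are hypothetical; dir_emit is the
      VFS helper):
      
      /* ctx->pos moves past every entry we handled, emitted or skipped. */
      while ((entry = next_dir_entry(path))) {        /* hypothetical */
              if (should_skip_dirent(del_list, entry->index)) {
                      ctx->pos = entry->index + 1;    /* skipped: still advance */
                      continue;
              }
              if (!dir_emit(ctx, entry->name, entry->name_len,
                            entry->ino, entry->type))
                      return 0;       /* buffer full; pos already correct */
              ctx->pos = entry->index + 1;            /* emitted: advance */
      }
      return ret;     /* a later -EIO can no longer re-emit old dirents */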
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  14. 27 Sep 2016, 2 commits
  15. 26 Sep 2016, 1 commit
    • Btrfs: add a flags field to btrfs_fs_info · afcdd129
      Authored by Josef Bacik
      We have a lot of random ints in btrfs_fs_info that can be put into
      flags.  This is mostly equivalent, with the exception of how we deal
      with quota going on or off: instead of just having a pending state
      that the current quota_enabled gets set to, we now set a flag when we
      are turning it on or off and deal with that appropriately.  Thanks,
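      
      A minimal sketch of the pattern using the kernel's atomic bitops (the
      bit names are illustrative, not the flags the patch defines):
      
      #define EX_FS_QUOTA_ENABLING    0       /* illustrative bit numbers */
      #define EX_FS_QUOTA_ENABLED     1
      
      static void example_quota_on(struct btrfs_fs_info *fs_info)
      {
              set_bit(EX_FS_QUOTA_ENABLING, &fs_info->flags);
              /* ... commit the quota root ... */
              set_bit(EX_FS_QUOTA_ENABLED, &fs_info->flags);
              clear_bit(EX_FS_QUOTA_ENABLING, &fs_info->flags);
      }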
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  16. 26 Jul 2016, 2 commits
    • btrfs: btrfs_abort_transaction, drop root parameter · 66642832
      Authored by Jeff Mahoney
      __btrfs_abort_transaction doesn't use its root parameter except to
      obtain an fs_info pointer.  We can obtain that from trans->root->fs_info
      for now and from trans->fs_info in a later patch.
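      
      A hedged sketch of the transitional form described above (body
      elided; not the verbatim patch):
      
      void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
                                     const char *function,
                                     unsigned int line, int errno)
      {
              /* root parameter dropped; trans->fs_info in a later patch */
              struct btrfs_fs_info *fs_info = trans->root->fs_info;
      
              /* ... flag the filesystem as errored, log function:line ... */
      }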
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix slab accounting flags · fba4b697
      Authored by Nikolay Borisov
      BTRFS uses a variety of slab caches to satisfy internal needs.  Those
      slab caches are always allocated with SLAB_RECLAIM_ACCOUNT, meaning
      allocations from the caches are accounted as SReclaimable.  At the
      same time btrfs does not register any shrinkers whatsoever, which
      prevents memory from those slabs from ever being shrunk.  This means
      the caches are not in fact reclaimable.
      
      To fix this, remove SLAB_RECLAIM_ACCOUNT from all caches apart from
      the inode cache, since that one is freed by the generic VFS
      super_block shrinker.  Also mark the transaction-related caches
      SLAB_TEMPORARY, to better document the lifetime of the objects (it
      just translates to SLAB_RECLAIM_ACCOUNT).
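      
      A hedged sketch of the flag change (the constructor arguments are
      illustrative, not the exact kernel call sites):
      
      /* inode cache: freed via the VFS super_block shrinker, keep the flag */
      btrfs_inode_cachep = kmem_cache_create("btrfs_inode",
                      sizeof(struct btrfs_inode), 0,
                      SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, init_once);
      
      /* transaction cache: short-lived objects, SLAB_TEMPORARY instead */
      btrfs_trans_handle_cachep = kmem_cache_create("btrfs_trans_handle",
                      sizeof(struct btrfs_trans_handle), 0,
                      SLAB_TEMPORARY, NULL);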
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  17. 08 Jul 2016, 2 commits
    • Btrfs: change delayed reservation fallback behavior · c48f49d6
      Authored by Josef Bacik
      We reserve space for the inode update when we first reserve space for writing to
      a file.  However there are lots of ways that we can use this reservation and not
      have it for subsequent ordered extents.  Previously we'd fall through and try to
      reserve metadata bytes for this, then we'd just steal the full reservation from
      the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
      reservation from the global reserve.  The problem with this is that
      we can easily just return ENOSPC and fall back to updating the inode
      item directly.  In the worst case (assuming 4k nodesize) we'd steal
      64KiB from the global reserve if we fall all the way through;
      however, if we just fall back and update the inode directly, we'd
      only steal 4k * BTRFS_PATH_MAX in the worst case, which is 32KiB.
      
      We would have also just added the extent item for the inode, so we
      will likely have already cow'ed down most of the way to the leaf
      containing the inode item, and more often than not we only need one
      or two nodesizes' worth of reservations.  Given that the reservation
      for the extent itself is also a worst case, we will likely already
      have space to cover the inode update.
      
      This change will make us behave better in the theoretical worst case, and much
      better in the case that we don't have our reservation and cannot reserve more
      metadata.  Thanks,
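      
      A hedged sketch of the fallback order (the helper names only
      approximate the kernel's, and the byte counts assume 4k nodes and a
      path of at most 8 levels):
      
      ret = btrfs_delayed_update_inode(trans, root, inode);
      if (ret == -ENOSPC) {
              /* a direct update cows one path: <= 8 * 4KiB = 32KiB,
               * instead of stealing up to 64KiB from the global reserve */
              ret = btrfs_update_inode_item(trans, root, inode);
      }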
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Authored by Josef Bacik
      So btrfs_block_rsv_migrate just unconditionally calls
      block_rsv_migrate_bytes.  Not only that, but it unconditionally
      changes the size of the block_rsv.  This isn't a bug strictly
      speaking, but it makes truncate block rsv's look funny, because every
      time we migrate bytes over, its size grows, even though we only want
      it to be a specific size.  So collapse this into one function that
      takes an update_size argument, and make truncate and evict not update
      the size, for consistency's sake.  Thanks,
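      
      A hedged sketch of the collapsed helper; the kernel's version differs
      in detail, but the shape is a single entry point with an update_size
      flag:
      
      int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
                                  struct btrfs_block_rsv *dst,
                                  u64 num_bytes, int update_size)
      {
              int ret = block_rsv_use_bytes(src, num_bytes);  /* take from src */
      
              if (!ret)
                      block_rsv_add_bytes(dst, num_bytes, update_size);
              return ret;
      }
      
      /* truncate and evict callers then simply pass update_size == 0. */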
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>