1. 26 Jul 2016, 2 commits
    • btrfs: btrfs_abort_transaction, drop root parameter · 66642832
      Authored by Jeff Mahoney
      __btrfs_abort_transaction doesn't use its root parameter except to
      obtain an fs_info pointer.  We can obtain that from trans->root->fs_info
      for now and from trans->fs_info in a later patch.
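      
      A minimal sketch of the resulting shape (prototype abbreviated; the
      message formatting and abort bookkeeping are omitted):
      
      	void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
      				       const char *function,
      				       unsigned int line, int errno)
      	{
      		/* the root parameter is gone; reach fs_info via the handle */
      		struct btrfs_fs_info *fs_info = trans->root->fs_info;
      
      		/* ... report the error and mark the filesystem aborted ... */
      	}
      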
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix slab accounting flags · fba4b697
      Authored by Nikolay Borisov
      BTRFS uses a variety of slab caches to satisfy internal needs.
      Those slab caches are always created with the SLAB_RECLAIM_ACCOUNT
      flag, meaning allocations from the caches are accounted as
      SReclaimable. At the same time btrfs is not registering any shrinkers
      whatsoever, so memory from those slabs can never actually be shrunk.
      This means those caches are not in fact reclaimable.
      
      To fix this, remove SLAB_RECLAIM_ACCOUNT from all caches apart from
      the inode cache, since that one is freed by the generic VFS
      super_block shrinker. Also mark the transaction-related caches as
      SLAB_TEMPORARY, to better document the lifetime of the objects (it
      simply translates to SLAB_RECLAIM_ACCOUNT).
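      
      A hedged sketch of the pattern, using the transaction handle cache as
      an example (the exact cache list and flag combinations follow the
      final diff, not this sketch):
      
      	/* Before: accounted as reclaimable although no shrinker exists. */
      	btrfs_trans_handle_cachep = kmem_cache_create("btrfs_trans_handle",
      			sizeof(struct btrfs_trans_handle), 0,
      			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
      
      	/* After: SLAB_TEMPORARY documents the short object lifetime
      	 * (today it simply translates to SLAB_RECLAIM_ACCOUNT). */
      	btrfs_trans_handle_cachep = kmem_cache_create("btrfs_trans_handle",
      			sizeof(struct btrfs_trans_handle), 0,
      			SLAB_TEMPORARY | SLAB_MEM_SPREAD, NULL);
      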
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 08 Jul 2016, 2 commits
    • Btrfs: change delayed reservation fallback behavior · c48f49d6
      Authored by Josef Bacik
      We reserve space for the inode update when we first reserve space for writing to
      a file.  However there are lots of ways that we can use this reservation and not
      have it for subsequent ordered extents.  Previously we'd fall through and try to
      reserve metadata bytes for this, then we'd just steal the full reservation from
      the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
      reservation from the global reserve.  The problem with this is that we can easily
      just return ENOSPC and fall back to updating the inode item directly.  In the
      worst case (assuming 4k nodesize) we'd steal 64KiB from the global reserve if we
      fall all the way through, whereas if we just fall back and update the inode
      directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case, which is 32KiB.
      
      We would have also just added the extent item for the inode, so we will likely
      already have cow'ed down most of the way to the leaf containing the inode item;
      more often than not we only need one or two nodesizes' worth of reservations.
      Given that the reservation for the extent itself is also a worst case, we will
      likely already have space to cover the inode update.
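      
      A hedged sketch of the fallback shape described above (the call sites
      are assumptions drawn from this description, not the exact diff):
      
      	ret = btrfs_delayed_update_inode(trans, root, inode);
      	if (ret == -ENOSPC) {
      		/* the delayed path couldn't get its reservation: update
      		 * the inode item in place instead of stealing the full
      		 * reservation from the global reserve */
      		ret = btrfs_update_inode_item(trans, root, inode);
      	}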
      
      This change will make us behave better in the theoretical worst case, and much
      better in the case that we don't have our reservation and cannot reserve more
      metadata.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Authored by Josef Bacik
      btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only that, it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes the truncate block rsv look funny:
      every time we migrate bytes over, its size grows, even though we only want it
      to be a specific size.  So collapse this into one function that takes an
      update_size argument, and make truncate and evict not update the size, for
      consistency's sake.  Thanks,
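      
      A sketch of the collapsed interface (prototype and call site assumed
      from the description):
      
      	int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src,
      				    struct btrfs_block_rsv *dst,
      				    u64 num_bytes, int update_size);
      
      	/* truncate/evict pass update_size == 0 so the rsv keeps its size */
      	ret = btrfs_block_rsv_migrate(&root->fs_info->trans_block_rsv,
      				      rsv, min_size, 0);
      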
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 25 Jun 2016, 1 commit
    • Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Authored by Omar Sandoval
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean -dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
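      
      A hedged sketch of the upgrade path, assuming the renamed helper is
      btrfs_readdir_get_delayed_items() (details are drawn from this
      description, not the exact diff):
      
      	delayed_node = btrfs_get_delayed_node(inode);
      	if (!delayed_node)
      		return false;	/* no delayed items: readdir stays shared */
      
      	/* Only one readdir can walk the delayed items at a time, so
      	 * trade the shared i_rwsem for the exclusive one. */
      	inode_unlock_shared(inode);
      	inode_lock(inode);
      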
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 10 May 2016, 1 commit
  5. 14 Mar 2016, 1 commit
  6. 18 Feb 2016, 1 commit
  7. 11 Feb 2016, 1 commit
    • btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759
      Authored by David Sterba
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX for 32bit compatibility, unless 'pos' is intentionally set to a
      larger value, in which case it's LLONG_MAX.
      
      There's a report from the PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284) on a
      64bit arch, where the value is 0x7fffffffffffffff, i.e. LLONG_MAX,
      before the increment.
      
      We can get into that situation like this:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * next call to readdir will not emit any entries, but will reach the
        bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500), with a
      debugging print added, shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
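      
      A hedged sketch of the guarded bump (the 'emitted' flag and the nopos
      label are assumptions drawn from the description above):
      
      	if (!emitted)
      		goto nopos;	/* nothing new emitted: don't bump again */
      
      	/* saturate instead of incrementing past the terminal value */
      	if (ctx->pos >= INT_MAX)
      		ctx->pos = LLONG_MAX;
      	else
      		ctx->pos = INT_MAX;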
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      Reported-by: Victor <services@swwu.com>
      CC: stable@vger.kernel.org
      Signed-off-by: David Sterba <dsterba@suse.com>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 07 Jan 2016, 1 commit
  9. 11 Oct 2015, 1 commit
  10. 26 Apr 2015, 1 commit
    • Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Authored by Yang Dongsheng
      When we find a node for an inode in delayed_nodes_tree, we need to fill
      in the inode from it. But we currently do not fill in ->last_trans,
      which causes xfstests generic/311 to fail. The scenario in 311 is shown
      below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we find a node in
      delayed_nodes_tree and fill the inode we are looking up with its
      information. But ->last_trans is not filled in, so fsync() checks
      ->last_trans, finds it is 0, and concludes this inode is already in
      a committed tree, so it does not record the extents for it.
      
      Fix:
      	This patch fills in ->last_trans properly and sets the
      runtime_flags if needed in this situation. Then we get the log
      entries we expect after (6), and generic/311 passes.
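      
      A hedged sketch of the fill-in (the full-sync runtime flag is an
      assumption based on "sets the runtime_flags if needed"):
      
      	BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
      	if (BTRFS_I(inode)->last_trans == root->fs_info->generation)
      		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
      			&BTRFS_I(inode)->runtime_flags);
      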
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  11. 17 Feb 2015, 1 commit
    • Btrfs: delayed-inode: replace root args iff only fs_info used · a585e948
      Authored by Daniel Dressler
      This is the second independent patch of a larger project to cleanup
      btrfs's internal usage of btrfs_root. Many functions take btrfs_root
      only to grab the fs_info struct.
      
      By requiring a root these functions cause programmer overhead. That
      these functions can accept any valid root is not obvious until
      inspection.
      
      This patch reduces the specificity of such functions to accept the
      fs_info directly.
      
      These patches can be applied independently and thus are not being
      submitted as a patch series. There should be about 26 patches by the
      project's completion. Each patch will clean up between 1 and 34
      functions. Each patch covers a single file's functions.
      
      This patch affects the following function(s):
        1) btrfs_wq_run_delayed_node
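      
      A sketch of the before/after shape (prototypes assumed from the
      description, not copied from the diff):
      
      	/* before: a btrfs_root is required only to reach fs_info */
      	static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
      					     struct btrfs_root *root, int nr);
      
      	/* after: take the fs_info directly */
      	static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
      					     struct btrfs_fs_info *fs_info, int nr);
      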
      Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
  12. 03 Feb 2015, 2 commits
  13. 03 Jan 2015, 1 commit
    • Btrfs: don't delay inode ref updates during log replay · 6f896054
      Authored by Chris Mason
      Commit 1d52c78a (Btrfs: try not to ENOSPC on log replay) added a
      check to skip delayed inode updates during log replay because it
      confuses the enospc code.  But the delayed processing will end up
      ignoring delayed refs from log replay because the inode itself wasn't
      put through the delayed code.
      
      This can end up triggering a warning at commit time:
      
      WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()
      
      This warning is repeated at each commit because we never process the
      delayed inode ref update.
      
      The fix used here is to change btrfs_delayed_delete_inode_ref to return
      an error if we're currently in log replay.  The caller will do the ref
      deletion immediately and everything will work properly.
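      
      A hedged sketch of the early bail-out (the recovery flag and error
      value are assumptions based on this description):
      
      	/* in btrfs_delayed_delete_inode_ref(): refuse the delayed path
      	 * during log replay; the caller deletes the ref immediately */
      	if (BTRFS_I(inode)->root->fs_info->log_root_recovering)
      		return -EAGAIN;
      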
      Signed-off-by: Chris Mason <clm@fb.com>
      cc: stable@vger.kernel.org # v3.18 and any stable series that picked 1d52c78a
  14. 18 Sep 2014, 1 commit
  15. 24 Aug 2014, 1 commit
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Authored by Liu Bo
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has now migrated to the kernel workqueue, but the migration
      introduced this hang problem.
      
      Btrfs has a kind of work that is queued in an ordered way, meaning its
      ordered_func() must be processed FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, when we look for free space, we get an
      uncached block group; we then go to read its free space cache inode for
      the free space information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct
      happen to share the same address.
      
      A bit more explanation:
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      Thus, if work A sits early in wq->ordered_list and there are more
      ordered works queued after it, such as B->ordered_func(), its memory
      could have been freed before normal_work_helper() returns, which means
      the kernel workqueue code in worker_thread() still has
      worker->current_work pointing at work A->normal_work, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so
      work C->normal_work and work A->normal_work are likely to share the
      same address (I confirmed this with ftrace output, so I'm not just
      guessing; it's rare though).
      
      When another kthread picks up work C->normal_work to process and finds
      our kthread apparently processing it (see find_worker_executing_work()),
      it treats work C as a collision and skips it, which ends up with nobody
      processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func,
      normal_work_helper.  So this patch makes each workqueue have its own
      helper function, each only a wrapper of normal_work_helper.
      
      With this patch, I no longer hit the above hang.
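      
      A hedged sketch of the approach (the macro shape is an assumption):
      give every workqueue its own work->func symbol, each a thin wrapper
      around normal_work_helper, so find_worker_executing_work() no longer
      mistakes a recycled work_struct address for a collision.
      
      	#define BTRFS_WORK_HELPER(name)					\
      	void btrfs_##name(struct work_struct *arg)			\
      	{								\
      		struct btrfs_work *work = container_of(arg,		\
      				struct btrfs_work, normal_work);	\
      		normal_work_helper(work);				\
      	}
      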
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  16. 10 Jun 2014, 1 commit
  17. 11 Mar 2014, 2 commits
  18. 29 Jan 2014, 7 commits
  19. 12 Nov 2013, 4 commits
  20. 01 Sep 2013, 3 commits
  21. 29 Jun 2013, 1 commit
  22. 14 Jun 2013, 1 commit
  23. 07 May 2013, 2 commits
  24. 07 Mar 2013, 1 commit
    • Btrfs: improve the delayed inode throttling · de3cb945
      Authored by Chris Mason
      The delayed inode code batches up changes to the btree in hopes of doing
      them in bulk.  As the changes build up, processes kick off worker
      threads and wait for them to make progress.
      
      The current code kicks off an async work queue item for each delayed
      node, which creates a lot of churn.  It also uses a fixed 1 HZ waiting
      period for the throttle, which allows us to build a lot of pending
      work and can slow down the commit.
      
      This changes us to watch a sequence counter as it is bumped during the
      operations.  We kick off fewer work items and have each work item do
      more work.
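      
      A hedged sketch of the new throttle (the names, the writeback
      threshold, and could_end_wait() are assumptions based on this
      description):
      
      	int seq = atomic_read(&delayed_root->items_seq);
      
      	if (atomic_read(&delayed_root->items) >= BTRFS_DELAYED_WRITEBACK) {
      		/* kick one worker and wait for the sequence counter to
      		 * advance instead of sleeping a fixed 1 HZ */
      		btrfs_wq_run_delayed_node(delayed_root, root, 0);
      		wait_event_interruptible(delayed_root->wait,
      					 could_end_wait(delayed_root, seq));
      	}
      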
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>