1. 18 Sep 2014, 4 commits
  2. 24 Aug 2014, 1 commit
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo authored
      This has been reported and discussed for a long time; the hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has now migrated to the kernel workqueue, but the migration
      introduced this hang.
      
      Btrfs has a kind of work that is queued in an ordered way, which means
      that its ordered_func() must be processed FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, while finding free space we get an
      uncached block group, so we go to read its free space cache inode for
      free space information, which will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                 queue a work (named work Y) for the 2nd
                                 endio(), the real one
      
      So the hang occurs when work Y's work_struct and work X's work_struct
      happen to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
			      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      Since work A is early in wq->ordered_list and there are more ordered
      works queued after it, such as B, A's memory can be freed before
      normal_work_helper() returns, which means the kernel workqueue code in
      worker_thread() still has worker->current_work pointing at work
      A->normal_work, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so work C->normal_work
      and work A->normal_work are likely to share the same address (I confirmed
      this with ftrace output, so I'm not just guessing; it's rare though).
      
      When another kthread picks up work C->normal_work to process and finds
      our kthread apparently processing it (see find_worker_executing_work()),
      it treats work C as a collision and skips it, so nobody ends up
      processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func,
      normal_work_helper. So this patch gives each workqueue its own helper
      function, which is just a wrapper around normal_work_helper.
      
      With this patch, I no longer hit the above hang.
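      
      To make that concrete, here is a hedged sketch of the approach (the
      macro shape and the helper names are approximations, not the verbatim
      patch):
      
          /* generate one trivial helper per workqueue so work->func differs */
          #define BTRFS_WORK_HELPER(name)                                 \
          void btrfs_##name(struct work_struct *arg)                      \
          {                                                               \
                  struct btrfs_work *work = container_of(arg,             \
                                  struct btrfs_work, normal_work);        \
                  normal_work_helper(work);                               \
          }
      
          BTRFS_WORK_HELPER(endio_helper);
          BTRFS_WORK_HELPER(delalloc_helper);
      
      Since find_worker_executing_work() matches on the callback as well as
      the work address, a recycled address queued on a different btrfs
      workqueue no longer looks like a collision.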
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 19 Aug 2014, 1 commit
    • Btrfs: don't consider the missing device when allocating new chunks · 95669976
      Miao Xie authored
      The original code counted both the writable devices and the missing
      devices when allocating new chunks, to make sure that any RAID level on
      a degraded FS continued to be honored, but this introduced a problem: it
      could stop us from allocating new chunks entirely. The steps to
      reproduce are as follows:
      
       # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
       # mkfs.btrfs -f <dev1>			//Removing <dev1> from the original fs
       # mount -o degraded <dev0> <mnt>
       # dd if=/dev/null of=<mnt>/tmpfile bs=1M
      
      This is because we allocate new chunks only on the writable devices; if
      we take the number of missing devices into account and want to allocate
      new chunks with a higher RAID level, we fail because we don't have
      enough writable devices. Fix it by ignoring the number of missing
      devices when allocating new chunks.
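      
      Conceptually the fix is a one-liner in the chunk allocator (a hedged
      sketch, not the verbatim diff; field names follow struct btrfs_fs_devices):
      
          /* size new chunks by what we can actually write to */
          num_devices = fs_devices->rw_devices;
          /* previously, in effect:
           * num_devices = fs_devices->rw_devices + fs_devices->missing_devices;
           */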
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 15 Aug 2014, 2 commits
    • btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh authored
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent, and in the case that we find
      only one reference left, we go ahead and do the math on its
      exclusive counts.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik authored
      Earlier I extended btrfs_dec/inc_ref with the no_quota arg because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out not to be the
      case: we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call its process function with no_quota set
      always.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  5. 03 Jul 2014, 1 commit
    • Btrfs: fix race of using total_bytes_pinned · d288db5d
      Liu Bo authored
      The percpu counter @total_bytes_pinned was introduced to skip unnecessary
      'commit transaction' operations; it accounts for the space we may free
      but that is stuck in delayed refs.
      
      We zero out @space_info->total_bytes_pinned every transaction period so
      we have a better idea of how much space we'll actually free up by committing
      this transaction.  However, we do the 'zero out' part a little too early,
      before we actually unpin the space, so we end up returning ENOSPC when we
      actually have free space that was just unpinned by the transaction commit.
      
      xfstests/generic/074 complained then.
      
      Fix it by doing the percpu pinned accounting at 'unpin' time; since that
      is protected by space_info->lock, the race is now gone.
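      
      A hedged sketch of the unpin-side accounting (illustrative, not the
      verbatim patch; names follow the 3.16-era struct btrfs_space_info):
      
          spin_lock(&space_info->lock);
          space_info->bytes_pinned -= len;
          /* the space is really reusable now, so drop the percpu count too */
          percpu_counter_add(&space_info->total_bytes_pinned, -len);
          spin_unlock(&space_info->lock);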
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 20 Jun 2014, 1 commit
    • Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie authored
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      This is because we didn't update the metadata of the allocated space (in
      the extent tree) until the file data was written to disk. During this time
      there was no information about the allocated space in either the extent
      tree or the free space cache, so when we wrote out the free space cache at
      that point (at commit transaction time), those spaces were lost. In fact,
      only the free space used to store file data had this problem; the rest
      didn't, because their metadata is updated in the same transaction context.
      
      There are several ways to fix the above problem:
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data; if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may degrade performance.
      This patch chose the second method: we use a per-block-group variable to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache writeout.
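      
      Roughly, the shape of the fix looks like this (a hedged sketch with
      hypothetical field names, not the verbatim patch):
      
          struct btrfs_block_group_cache {
                  /* ... existing fields ... */
                  u64 outstanding_data_alloc;     /* hypothetical: space handed out
                                                     for file data whose extent tree
                                                     metadata isn't updated yet */
                  struct rw_semaphore data_rwsem; /* allocation vs. cache writeout */
          };
      
          /* at free space cache writeout time */
          down_write(&bg->data_rwsem);
          if (bg->outstanding_data_alloc) {
                  up_write(&bg->data_rwsem);
                  return 0;       /* skip writing this block group's cache */
          }
          /* ... write out the cache, then up_write(&bg->data_rwsem) ... */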
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 10 Jun 2014, 9 commits
    • btrfs: allocate raid type kobjects dynamically · c1895442
      Jeff Mahoney authored
      We currently allocate the raid-type kobjects in an array embedded in
      space_info when we allocate the space_info. When a user does something
      like:
      
      # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
      # btrfs balance start -mconvert=single -dconvert=single /mnt -f
      # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
      
      We can end up with memory corruption since the kobject hasn't
      been reinitialized properly and the name pointer was left set.
      
      The rationale behind allocating them statically was to avoid creating a
      separate kobject container that just contained the raid type; the
      position in the array determined the raid type.
      
      Ultimately, though, this wastes more memory than it saves in all
      but the most complex scenarios and introduces kobject lifetime
      questions.
      
      This patch allocates the kobjects dynamically instead. Note that
      we also remove the kobject_get/put of the parent kobject since
      kobject_add and kobject_del do that internally.
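      
      A hedged sketch of the dynamic allocation (the container struct and
      helper names are approximations of the patch):
      
          struct raid_kobject {
                  u64 flags;
                  struct kobject kobj;
          };
      
          struct raid_kobject *rkobj = kzalloc(sizeof(*rkobj), GFP_NOFS);
          if (!rkobj)
                  return;         /* or propagate -ENOMEM */
          rkobj->flags = cache->flags;
          if (kobject_init_and_add(&rkobj->kobj, &btrfs_raid_ktype,
                                   &space_info->kobj, "%s",
                                   get_raid_name(index)))
                  kobject_put(&rkobj->kobj);  /* release() frees the container */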
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: async delayed refs · a79b7d4b
      Chris Mason authored
      Delayed extent operations are triggered during transaction commits.
      The goal is to queue up a healthy batch of changes to the extent
      allocation tree and run through them in bulk.
      
      This farms them off to async helper threads.  The goal is to have the
      bulk of the delayed operations being done in the background, but this is
      also important to limit our stack footprint.
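      
      Sketched (hedged; the container and workqueue names approximate the
      patch):
      
          struct async_delayed_refs {
                  struct btrfs_root *root;
                  unsigned long count;
                  struct btrfs_work work;     /* worker runs the delayed refs */
          };
      
          /* hand off a batch instead of running it inline at commit time */
          btrfs_init_work(&async->work, delayed_ref_async_start, NULL, NULL);
          btrfs_queue_work(root->fs_info->extent_workers, &async->work);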
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: remove stale newlines from log messages · 351fd353
      David Sterba authored
      I've noticed an extra line after "use no compression", but a search
      revealed many more in messages of more critical levels and in rare errors.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add sanity tests for new qgroup accounting code · faa2dbf0
      Josef Bacik authored
      This exercises the various parts of the new qgroup accounting code.  We do
      some basic stuff and do some things with the shared refs to make sure all
      that code works.  I had to add a bunch of infrastructure because I needed
      to be able to insert items into a fake tree without having to do all the
      hard work myself; hopefully this will be useful in the future.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik authored
      Currently qgroups account for space by intercepting delayed ref updates to
      fs trees.  This is done by adding sequence numbers to delayed ref updates
      so that we can figure out how the tree looked before the update and adjust
      the counters properly.  The problem with this is that it does not allow
      delayed refs to be merged, so if you are, say, defragging an extent with
      5k snapshots pointing to it, we will thrash the delayed ref lock because
      we need to go back and manually merge these things together.  Instead we
      want to process quota changes when we know they are going to happen, like
      when we first allocate an extent, free a reference for an extent, add new
      references, etc.  This patch accomplishes this by only adding qgroup
      operations for real ref changes.  We only modify the sequence number when
      we need to look up roots for bytenrs; this reduces the amount of churn on
      the sequence number and allows us to merge delayed refs as we add them
      most of the time.  This patch encompasses a bunch of architectural changes:
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs, we simply add new ref operations whenever we notice that we
      need to, i.e. when we've modified the refs themselves.
      
      2) tree mod seq: we no longer have the separation of major/minor counters.
      This makes the sequence number handling much saner and lets us remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have its own unique
      sequence number; rather, whenever we go to look up backrefs we increment
      the sequence number so we can make sure to keep any new operations from
      screwing up our world view at that given point.  This allows us to merge
      delayed refs during runtime.
      
      With all of these changes the delayed ref code is a little saner and the
      qgroup accounting no longer goes negative in some cases as it did before.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix joining same transaction handle more than twice · f017f15f
      Wang Shilong authored
      We hit something like the following function call flow:
      
      |->run_delalloc_range()
       |->btrfs_join_transaction()
         |->cow_file_range()
           |->btrfs_join_transaction()
             |->find_free_extent()
               |->btrfs_join_transaction()
      
      The trace information can be seen below:
      
      [ 7411.127040] ------------[ cut here ]------------
      [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]()
      [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G           O 3.13.0+ #4
      [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
      [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
      [ 7411.127092] Call Trace:
      [ 7411.127097]  [<ffffffff815b87b0>] dump_stack+0x45/0x56
      [ 7411.127101]  [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0
      [ 7411.127102]  [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20
      [ 7411.127109]  [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs]
      [ 7411.127115]  [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs]
      [ 7411.127120]  [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs]
      [ 7411.127126]  [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
      [ 7411.127131]  [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs]
      [ 7411.127137]  [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
      [ 7411.127142]  [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs]
      [ 7411.127146]  [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
      [ 7411.127151]  [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
      [ 7411.127157]  [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
      [ 7411.127163]  [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
      [ 7411.127169]  [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs]
      [ 7411.127171]  [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140
      [ 7411.127176]  [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
      [ 7411.127182]  [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs]
      [ 7411.127187]  [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs]
      [ 7411.127194]  [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs]
      [ 7411.127200]  [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs]
      [ 7411.127207]  [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs]
      [ 7411.127212]  [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs]
      [ 7411.127219]  [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs]
      [ 7411.127222]  [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330
      [ 7411.127228]  [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs]
      
      Here we fix it by avoiding joining the transaction again if we already
      hold a transaction handle when allocating a chunk in find_free_extent().
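      
      In other words (a hedged sketch; btrfs stashes the running handle in
      current->journal_info):
      
          struct btrfs_trans_handle *trans;
      
          if (current->journal_info) {
                  /* already inside a transaction: reuse it, don't join again */
                  trans = current->journal_info;
          } else {
                  trans = btrfs_join_transaction(root);
                  if (IS_ERR(trans))
                          return PTR_ERR(trans);
          }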
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: reclaim the reserved metadata space at background · 21c7e756
      Miao Xie authored
      Before this patch, a task had to reclaim metadata space by itself if the
      metadata space was not enough.  And when a task started the space
      reclamation, all the other tasks which wanted to reserve metadata space
      were blocked.  In some cases they would be blocked for a long time, which
      made performance fluctuate wildly.
      
      So we introduce background metadata space reclamation: when the space is
      about to be exhausted, we insert a reclaim work into the workqueue, and
      the worker of the workqueue helps us reclaim the reserved space in the
      background.  This way, tasks needn't reclaim the space by themselves in
      most cases, and even if they do have to reclaim the space or are blocked
      waiting for it, they will get enough space more quickly.
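      
      The shape of the mechanism, as a hedged sketch (helper names approximate
      the patch):
      
          static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
          {
                  /* flush delalloc and/or commit until enough is reclaimed */
          }
      
          /* at reserve time, once usage crosses a watermark */
          if (need_async_reclaim(space_info) &&
              !work_busy(&fs_info->async_reclaim_work))
                  queue_work(system_unbound_wq, &fs_info->async_reclaim_work);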
      
      Here is my test result (tested with compilebench):
       Memory:	2GB
       CPU:		2Cores * 1CPU
       Partition:	40GB(SSD)
      
      Test command:
       # compilebench -D <mnt> -m
      
      Without this patch:
       intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
       compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
       read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
       delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
      
      With this patch:
       intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
       compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
       read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
       delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 25 Apr 2014, 2 commits
  9. 08 Apr 2014, 2 commits
    • Btrfs: abort the transaction when we don't find our extent ref · c4a050bb
      Josef Bacik authored
      I'm not sure why we weren't aborting here in the first place; it is
      obviously a bad situation, given that we print the leaf and yell loudly
      about it.  Fix this up, since otherwise we panic because our path could be
      pointing into oblivion.  Thanks,
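      
      The pattern, sketched (hedged; the 3.x-era btrfs_abort_transaction took
      the root as its second argument):
      
          if (ret) {
                  btrfs_print_leaf(extent_root, path->nodes[0]);
                  btrfs_abort_transaction(trans, extent_root, ret);
                  /* abort and unwind instead of panicking later on a
                     path that points into oblivion */
          }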
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix lockdep warning with reclaim lock inversion · ed55b6ac
      Jeff Mahoney authored
      When encountering memory pressure, testers have run into the following
      lockdep warning. It was caused by __link_block_group calling kobject_add
      with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
      which gets us into reclaim context. The kobject doesn't actually need
      to be added under the lock -- it just needs to ensure that it's only
      added for the first block group to be linked.
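      
      The fix, sketched (hedged; this mirrors __link_block_group only loosely):
      
          bool first;
      
          down_write(&space_info->groups_sem);
          first = list_empty(&space_info->block_groups[index]);
          list_add_tail(&cache->list, &space_info->block_groups[index]);
          up_write(&space_info->groups_sem);
      
          if (first)
                  /* GFP_KERNEL allocations are safe out here */
                  kobject_add(&space_info->block_group_kobjs[index],
                              &space_info->kobj, "%s", get_raid_name(index));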
      
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      3.14.0-rc8-default #1 Not tainted
      ---------------------------------------------------------
      kswapd0/169 just changed the state of lock:
       (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
      but this lock took another, RECLAIM_FS-unsafe lock in the past:
       (&found->groups_sem){+++++.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&found->groups_sem);
                                     local_irq_disable();
                                     lock(&delayed_node->mutex);
                                     lock(&found->groups_sem);
        <Interrupt>
          lock(&delayed_node->mutex);
      
       *** DEADLOCK ***
      2 locks held by kswapd0/169:
       #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
       #1:  (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 07 Apr 2014, 2 commits
    • Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik authored
      Let's try this again.  We can deadlock the box if we send on a box and try
      to write onto the same fs with the app that is trying to listen to the
      send pipe.  This is because the writer could get stuck waiting for a
      transaction commit which is being blocked by the send.  So fix this by
      making sure looking at the commit roots is always going to be consistent.
      We do this by keeping track of which roots need to have their commit roots
      swapped during commit, and then taking the commit_root_sem and swapping
      them all at once.  Then make sure we take a read lock on the
      commit_root_sem in cases where we search the commit root, so that we're
      always looking at a consistent view of the commit roots.  Previously we
      had problems with this because we would swap a fs tree commit root and
      then swap the extent tree commit root independently, which would cause
      the backref walking code to screw up sometimes.  With this patch we no
      longer deadlock and pass all the weird send/receive corner cases.  Thanks,
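      
      The reader side then looks roughly like this (a hedged sketch):
      
          /* searching a commit root: take the sem shared for a stable view */
          down_read(&fs_info->commit_root_sem);
          path->search_commit_root = 1;
          path->skip_locking = 1;
          ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
          up_read(&fs_info->commit_root_sem);
      
      while the commit path swaps every tracked root's commit root in one
      critical section under down_write(&fs_info->commit_root_sem).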
      Reported-by: Hugo Mills <hugo@carfax.org.uk>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: check for an extent_op on the locked ref · 573a0755
      Josef Bacik authored
      We could have possibly added an extent_op to the locked_ref while we
      dropped locked_ref->lock, so check for this case as well and loop around.
      Otherwise we could lose flag updates, which would lead to extent tree
      corruption.  Thanks,
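      
      Roughly (a hedged sketch of the re-check, not the verbatim patch):
      
          spin_lock(&locked_ref->lock);
          if (locked_ref->extent_op) {
                  /* an extent_op snuck in while the lock was dropped;
                     loop around and run it before finishing this head */
                  spin_unlock(&locked_ref->lock);
                  continue;
          }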
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  11. 11 Mar 2014, 6 commits
  12. 09 Feb 2014, 1 commit
  13. 29 Jan 2014, 8 commits
    • Btrfs: fix spin_unlock in check_ref_cleanup · cf93da7b
      Chris Mason authored
      Our goto out should have gone a little farther.
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix wrong block group in trace during the free space allocation · 89d4346a
      Miao Xie authored
      We allocate the free space from the former block group, not the current
      one, so we should use the former one to output the trace information.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: cleanup the code of used_block_group in find_free_extent() · 215a63d1
      Miao Xie authored
      used_block_group is only used for the space cluster case, where the
      cluster doesn't belong to the current block group; no other place needs
      it, and using it elsewhere makes the logic of the code unclear.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • 920e4a58
    • Btrfs: fix btrfs boot when compiled as built-in · 14a958e6
      Filipe David Borba Manana authored
      After the change titled "Btrfs: add support for inode properties", if
      btrfs was built into the kernel (i.e. not as a module), it would cause a
      kernel panic, as reported recently by Fengguang:
      
      [    2.024722] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [    2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] PGD 0
      [    2.028684] Oops: 0000 [#1] SMP
      [    2.028684] Modules linked in:
      [    2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
      [    2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [    2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
      [    2.028684] RIP: 0010:[<ffffffff81501594>]  [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] RSP: 0000:ffff88000edd7e58  EFLAGS: 00010246
      [    2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
      [    2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
      [    2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
      [    2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
      [    2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
      [    2.028684] FS:  0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
      [    2.028684] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [    2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
      [    2.028684] Stack:
      [    2.028684]  ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
      [    2.028684]  0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
      [    2.028684]  ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
      [    2.028684] Call Trace:
      [    2.028684]  [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
      [    2.028684]  [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
      [    2.028684]  [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
      [    2.028684]  [<ffffffff8234c785>] ? do_early_param+0x88/0x88
      [    2.028684]  [<ffffffff819f61b5>] ? rest_init+0x89/0x89
      [    2.028684]  [<ffffffff819f61c3>] kernel_init+0xe/0x109
      
      The issue here is that btrfs' initialization function
      (super.c:init_btrfs_fs) started using crc32c (from lib/libcrc32c.c).
      But when it needs to call crc32c (as part of the properties
      initialization routine), libcrc32c is not yet initialized, so crc32c
      dereferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
      panic on boot.
      
      The approach to fix this is to use the crypto component directly for its
      crc32c (which is basically what lib/libcrc32c.c is: a wrapper around
      crypto).  This is what ext4 does as well; it uses crypto directly to get
      crc32c functionality.
      
      Verified this works both when btrfs is built in and when it's a loadable
      kernel module.
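      
      A hedged sketch of the direct-crypto setup (close in spirit to what the
      patch adds in fs/btrfs/hash.c, but illustrative; the init function name
      is an approximation):
      
          #include <crypto/hash.h>
      
          static struct crypto_shash *tfm;
      
          int __init btrfs_hash_init(void)
          {
                  tfm = crypto_alloc_shash("crc32c", 0, 0);
                  return PTR_ERR_OR_ZERO(tfm);
          }
      
      Because init_btrfs_fs() calls this explicitly, the crc32c transform
      exists before the properties code hashes anything, regardless of
      initcall ordering.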
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: throttle delayed refs better · 0a2b2a84
      Josef Bacik authored
      On one of our gluster clusters we noticed some pretty big lag spikes.
      This turned out to be because our transaction commit was taking around 3
      minutes to complete.  This is because we have something like 30 gigs of
      metadata, so our global reserve would end up being the max, which is
      about 512 MB.  So our throttling code would allow a ridiculous amount of
      delayed refs to build up, and then they'd all get run at transaction
      commit time; for a cold-mounted file system that could take up to 3
      minutes.  So fix the throttling to be based on both the size of the
      global reserve and how long it takes us to run delayed refs.  This patch
      tracks the time it takes to run delayed refs and then only allows 1
      second's worth of outstanding delayed refs at a time.  This way it will
      auto-tune itself from cold cache up to when everything is in memory, and
      it no longer has to go to disk.  This makes our transaction commits take
      much less time to run.  Thanks,
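      
      The core of the heuristic, sketched (hedged; field names approximate the
      patch):
      
          u64 avg_runtime = fs_info->avg_delayed_ref_runtime; /* smoothed ns/ref */
      
          /* throttle once roughly one second's worth of refs is outstanding */
          if (num_entries * avg_runtime >= NSEC_PER_SEC)
                  return 1;       /* caller should run/flush delayed refs now */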
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik authored
      Currently we have two rb-trees, one for delayed ref heads and one for all
      of the delayed refs, including the delayed ref heads.  When we process the
      delayed refs we have to hold onto the delayed ref lock for all of the
      selecting and merging and such, which results in quite a bit of lock
      contention.  This was solved by having a waitqueue and only one flusher at
      a time; however, that hurts if we get a lot of delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then attach the
      delayed ref updates to an rb tree that is per delayed ref head.  Then we only
      need to take the delayed ref lock when adding new delayed refs and when
      selecting a delayed ref head to process, all the rest of the time we deal with a
      per delayed ref head lock which will be much less contentious.
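      
      A hedged sketch of the reworked layout (abridged; field names approximate
      the patch):
      
          struct btrfs_delayed_ref_head {
                  struct rb_node href_node;  /* links into delayed_refs->href_root */
                  spinlock_t lock;           /* protects this head's ref_root */
                  struct rb_root ref_root;   /* the updates attached to this head */
          };
      
          struct btrfs_delayed_ref_root {
                  struct rb_root href_root;  /* heads only, no individual refs */
                  spinlock_t lock;           /* taken to add refs or pick a head */
          };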
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() · 90515e7f
      Wang Shilong authored
      We may return early from btrfs_drop_snapshot(); we shouldn't call
      btrfs_std_error() in that case.  Fix it.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>