1. 10 6月, 2014 1 次提交
    • M
      Btrfs: reclaim the reserved metadata space at background · 21c7e756
      Miao Xie 提交于
      Before applying this patch, the task had to reclaim the metadata space
      by itself if the metadata space was not enough. And When the task started
      the space reclamation, all the other tasks which wanted to reserve the
      metadata space were blocked. At some cases, they would be blocked for
      a long time, it made the performance fluctuate wildly.
      
      So we introduce the background metadata space reclamation, when the space
      is about to be exhausted, we insert a reclaim work into the workqueue, the
      worker of the workqueue helps us to reclaim the reserved space at the
      background. By this way, the tasks needn't reclaim the space by themselves at
      most cases, and even if the tasks have to reclaim the space or are blocked
      for the space reclamation, they will get enough space more quickly.
      
      Here is my test result(Tested by compilebench):
       Memory:	2GB
       CPU:		2Cores * 1CPU
       Partition:	40GB(SSD)
      
      Test command:
       # compilebench -D <mnt> -m
      
      Without this patch:
       intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
       compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
       read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
       delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
      
      With this patch:
       intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
       compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
       read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
       delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      21c7e756
  2. 25 4月, 2014 2 次提交
  3. 08 4月, 2014 2 次提交
    • J
      Btrfs: abort the transaction when we don't find our extent ref · c4a050bb
      Josef Bacik 提交于
      I'm not sure why we weren't aborting here in the first place, it is obviously a
      bad time from the fact that we print the leaf and yell loudly about it.  Fix
      this up, otherwise we panic because our path could be pointing into oblivion.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c4a050bb
    • J
      btrfs: fix lockdep warning with reclaim lock inversion · ed55b6ac
      Jeff Mahoney 提交于
      When encountering memory pressure, testers have run into the following
      lockdep warning. It was caused by __link_block_group calling kobject_add
      with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
      which gets us into reclaim context. The kobject doesn't actually need
      to be added under the lock -- it just needs to ensure that it's only
      added for the first block group to be linked.
      
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      3.14.0-rc8-default #1 Not tainted
      ---------------------------------------------------------
      kswapd0/169 just changed the state of lock:
       (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
      but this lock took another, RECLAIM_FS-unsafe lock in the past:
       (&found->groups_sem){+++++.}
      
      and interrupts could create inverse lock ordering between them.
      
      other info that might help us debug this:
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&found->groups_sem);
                                     local_irq_disable();
                                     lock(&delayed_node->mutex);
                                     lock(&found->groups_sem);
        <Interrupt>
          lock(&delayed_node->mutex);
      
       *** DEADLOCK ***
      2 locks held by kswapd0/169:
       #0:  (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160
       #1:  (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ed55b6ac
  4. 07 4月, 2014 2 次提交
    • J
      Btrfs: remove transaction from send · 9e351cc8
      Josef Bacik 提交于
      Lets try this again.  We can deadlock the box if we send on a box and try to
      write onto the same fs with the app that is trying to listen to the send pipe.
      This is because the writer could get stuck waiting for a transaction commit
      which is being blocked by the send.  So fix this by making sure looking at the
      commit roots is always going to be consistent.  We do this by keeping track of
      which roots need to have their commit roots swapped during commit, and then
      taking the commit_root_sem and swapping them all at once.  Then make sure we
      take a read lock on the commit_root_sem in cases where we search the commit root
      to make sure we're always looking at a consistent view of the commit roots.
      Previously we had problems with this because we would swap a fs tree commit root
      and then swap the extent tree commit root independently which would cause the
      backref walking code to screw up sometimes.  With this patch we no longer
      deadlock and pass all the weird send/receive corner cases.  Thanks,
      Reportedy-by: NHugo Mills <hugo@carfax.org.uk>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9e351cc8
    • J
      Btrfs: check for an extent_op on the locked ref · 573a0755
      Josef Bacik 提交于
      We could have possibly added an extent_op to the locked_ref while we dropped
      locked_ref->lock, so check for this case as well and loop around.  Otherwise we
      could lose flag updates which would lead to extent tree corruption.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      573a0755
  5. 11 3月, 2014 6 次提交
  6. 09 2月, 2014 1 次提交
  7. 29 1月, 2014 16 次提交
    • C
      Btrfs: fix spin_unlock in check_ref_cleanup · cf93da7b
      Chris Mason 提交于
      Our goto out should have gone a little farther.
      Signed-off-by: NChris Mason <clm@fb.com>
      cf93da7b
    • M
      Btrfs: fix wrong block group in trace during the free space allocation · 89d4346a
      Miao Xie 提交于
      We allocate the free space from the former block group, not the current
      one, so should use the former one to output the trace information.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      89d4346a
    • M
      Btrfs: cleanup the code of used_block_group in find_free_extent() · 215a63d1
      Miao Xie 提交于
      used_block_group is just used for the space cluster which doesn't
      belong to the current block group, the other place needn't use it.
      Or the logic of code seems unclear.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      215a63d1
    • M
      920e4a58
    • F
      Btrfs: fix btrfs boot when compiled as built-in · 14a958e6
      Filipe David Borba Manana 提交于
      After the change titled "Btrfs: add support for inode properties", if
      btrfs was built-in the kernel (i.e. not as a module), it would cause a
      kernel panic, as reported recently by Fengguang:
      
      [    2.024722] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [    2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] PGD 0
      [    2.028684] Oops: 0000 [#1] SMP
      [    2.028684] Modules linked in:
      [    2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1
      [    2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [    2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000
      [    2.028684] RIP: 0010:[<ffffffff81501594>]  [<ffffffff81501594>] crc32c+0xc/0x6b
      [    2.028684] RSP: 0000:ffff88000edd7e58  EFLAGS: 00010246
      [    2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000
      [    2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe
      [    2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20
      [    2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff
      [    2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000
      [    2.028684] FS:  0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000
      [    2.028684] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [    2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0
      [    2.028684] Stack:
      [    2.028684]  ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05
      [    2.028684]  0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05
      [    2.028684]  ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404
      [    2.028684] Call Trace:
      [    2.028684]  [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0
      [    2.028684]  [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145
      [    2.028684]  [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a
      [    2.028684]  [<ffffffff810e2404>] ? parse_args+0x25f/0x33d
      [    2.028684]  [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230
      [    2.028684]  [<ffffffff8234c785>] ? do_early_param+0x88/0x88
      [    2.028684]  [<ffffffff819f61b5>] ? rest_init+0x89/0x89
      [    2.028684]  [<ffffffff819f61c3>] kernel_init+0xe/0x109
      
      The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
      started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
      part of the properties initialization routine), the libcrc32c is not yet initialized,
      so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
      panic on boot.
      
      The approach to fix this is to use crypto component directly to use its crc32c (which
      is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
      doing as well, it uses crypto directly to get crc32c functionality.
      
      Verified this works both when btrfs is built-in and when it's loadable kernel module.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      14a958e6
    • J
      Btrfs: throttle delayed refs better · 0a2b2a84
      Josef Bacik 提交于
      On one of our gluster clusters we noticed some pretty big lag spikes.  This
      turned out to be because our transaction commit was taking like 3 minutes to
      complete.  This is because we have like 30 gigs of metadata, so our global
      reserve would end up being the max which is like 512 mb.  So our throttling code
      would allow a ridiculous amount of delayed refs to build up and then they'd all
      get run at transaction commit time, and for a cold mounted file system that
      could take up to 3 minutes to run.  So fix the throttling to be based on both
      the size of the global reserve and how long it takes us to run delayed refs.
      This patch tracks the time it takes to run delayed refs and then only allows 1
      seconds worth of outstanding delayed refs at a time.  This way it will auto-tune
      itself from cold cache up to when everything is in memory and it no longer has
      to go to disk.  This makes our transaction commits take much less time to run.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0a2b2a84
    • J
      Btrfs: attach delayed ref updates to delayed ref heads · d7df2c79
      Josef Bacik 提交于
      Currently we have two rb-trees, one for delayed ref heads and one for all of the
      delayed refs, including the delayed ref heads.  When we process the delayed refs
      we have to hold onto the delayed ref lock for all of the selecting and merging
      and such, which results in quite a bit of lock contention.  This was solved by
      having a waitqueue and only one flusher at a time, however this hurts if we get
      a lot of delayed refs queued up.
      
      So instead just have an rb tree for the delayed ref heads, and then attach the
      delayed ref updates to an rb tree that is per delayed ref head.  Then we only
      need to take the delayed ref lock when adding new delayed refs and when
      selecting a delayed ref head to process, all the rest of the time we deal with a
      per delayed ref head lock which will be much less contentious.
      
      The locking rules for this get a little more complicated since we have to lock
      up to 3 things to properly process delayed refs, but I will address that problem
      later.  For now this passes all of xfstests and my overnight stress tests.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7df2c79
    • W
      Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() · 90515e7f
      Wang Shilong 提交于
      We may return early in btrfs_drop_snapshot(), we shouldn't
      call btrfs_std_err() for this case, fix it.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      90515e7f
    • L
      Btrfs: return free space to global_rsv as much as possible · 17504584
      Liu Bo 提交于
      @full is not protected within global_rsv.lock, so we may think global_rsv
      is already full but in fact it's not, so we miss the opportunity to return
      free space to global_rsv directly when we release other block_rsvs.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      17504584
    • J
      Btrfs: stop caching thread if extent_commit_sem is contended · c9ea7b24
      Josef Bacik 提交于
      We can starve out the transaction commit with a bunch of caching threads all
      running at the same time.  This is because we will only drop the
      extent_commit_sem if we need_resched(), which isn't likely to happen since we
      will be reading a lot from the disk so have already schedule()'ed plenty.  Alex
      observed that he could starve out a transaction commit for up to a minute with
      32 caching threads all running at once.  This will allow us to drop the
      extent_commit_sem to allow the transaction commit to swap the commit_root out
      and then all the cachers will start back up. Here is an explanation provided by
      Igno
      
      So, just to fill in what happens in this loop:
      
                                      mutex_unlock(&caching_ctl->mutex);
                                      cond_resched();
                                      goto again;
      
      where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
      again:
      
              again:
                      mutex_lock(&caching_ctl->mutex);
                      /* need to make sure the commit_root doesn't disappear */
                      down_read(&fs_info->extent_commit_sem);
      
      So, if I'm reading the code correct, there can be a fair amount of
      concurrency here: there may be multiple 'caching kthreads' per filesystem
      active, while there's one fs_info->extent_commit_sem per filesystem
      AFAICS.
      
      So, what happens if there are a lot of CPUs all busy holding the
      ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
      rush to try to release the fs_info->extent_commit_sem, and they'd block in
      the down_read() because there's a writer waiting.
      
      So there's a guarantee of forward progress. This should answer akpm's
      concern I think.
      
      Thanks,
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c9ea7b24
    • F
      Btrfs: convert printk to btrfs_ and fix BTRFS prefix · efe120a0
      Frank Holton 提交于
      Convert all applicable cases of printk and pr_* to the btrfs_* macros.
      
      Fix all uses of the BTRFS prefix.
      Signed-off-by: NFrank Holton <fholton@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      efe120a0
    • M
      Btrfs: fix double initialization of the raid kobject · 536cd964
      Miao Xie 提交于
      We met the following oops when doing space balance:
       kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong.
       ...
       Call Trace:
        [<ffffffff81937262>] dump_stack+0x49/0x5f
        [<ffffffff8137d259>] kobject_init+0x89/0xa0
        [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70
        [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs]
        [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs]
        [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs]
        [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs]
        [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs]
        [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs]
        [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs]
        [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs]
        [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs]
        ...
      
      Steps to reproduce:
       # mkfs.btrfs -f <dev>
       # mount -o nospace_cache <dev> <mnt>
       # btrfs balance start <mnt>
       # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
      
      The reason of this problem is that we initialized the raid kobject when we added
      a block group into a empty raid list. As we know, when we mounted a btrfs filesystem,
      the raid list was empty, we would initialize the raid kobject when we added the first
      block group. But if there was not data stored in the block group, the block group
      would be freed when doing balance, and the raid list would be empty. And then if we
      allocated a new block group and added it into the raid list, we would initialize
      the raid kobject again, the oops happened.
      
      Fix this problem by initializing the raid kobject just when mounting the fs.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reported-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      536cd964
    • J
      btrfs: fix static checker warnings · 1b8e5df6
      Jeff Mahoney 提交于
      This patch fixes the following warnings:
      fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static?
      fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index));
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1b8e5df6
    • V
      btrfs: remove unused variable from find_free_extent · 4b447bfa
      Valentina Giusti 提交于
      The variable found_uncached_bg in find_free_extent is not used since commit
      285ff5af
      (Btrfs: remove the ideal caching code)
      Signed-off-by: NValentina Giusti <valentina.giusti@microon.de>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4b447bfa
    • J
      btrfs: publish allocation data in sysfs · 6ab0a202
      Jeff Mahoney 提交于
      While trying to debug ENOSPC issues, it's helpful to understand what the
      kernel's view of the available space is. We export this information
      via ioctl, but sysfs files are more easily used.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6ab0a202
    • L
      Btrfs: introduce a head ref rbtree · c46effa6
      Liu Bo 提交于
      The way how we process delayed refs is
      1) get a bunch of head refs,
      2) pick up one head ref,
      3) go one node back for any delayed ref updates.
      
      The head ref is also linked in the same rbtree as the delayed ref is,
      so in 1) stage, we have to walk one by one including not only head refs, but
      delayed refs.
      
      When we have a great number of delayed refs pending to process,
      this'll cost time a lot.
      
      Here we introduce a head ref specific rbtree, it only has head refs, so troubles
      go away.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c46effa6
  8. 12 12月, 2013 1 次提交
    • F
      Btrfs: don't miss skinny extent items on delayed ref head contention · 639eefc8
      Filipe David Borba Manana 提交于
      Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup
      of skinny extent items. This can happen when the execution flow is the
      following:
      
      * We do an extent tree lookup and fail to find a skinny extent item;
      
      * As a result, we attempt to see if a non-skinny extent item exists,
        either by looking at previous item in the leaf or by doing another
        full extent tree search;
      
      * We have a transaction and then we check for a matching delayed ref
        head in the transaction's delayed refs rbtree;
      
      * We find such delayed ref head and then we try to lock it with a
        call to mutex_trylock();
      
      * The lock was contended so we jump to the label "again", which repeats
        the extent tree search but for a non-skinny extent item, because we set
        previously metadata variable to 0 and the search key to look for a
        non-skinny extent-item;
      
      * After the jump (and after releasing the transaction's delayed refs
        lock), a skinny extent item might have been added to the extent tree
        but we will miss it because metadata is set to 0 and the search key
        is set for a non-skinny extent-item.
      
      The fix here is to not reset metadata to 0 and to jump to the initial search
      key setup if the delayed ref head is contended, instead of jumping directly
      to the extent tree search label ("again").
      
      This issue was found while investigating the issue reported at Bugzilla 64961.
      
      David Sterba suspected this function was missing extent items, and that
      this could be caused by the last change to this function, which was made
      in the following patch:
      
          [PATCH] Btrfs: optimize btrfs_lookup_extent_info()
          (commit 74be9510)
      
      But in fact this issue already existed before, because after failing to find
      a skinny extent item, the code set the search key for a non-skinny extent
      item, and on contention of a matching delayed ref head it would not search
      the extent tree for a skinny extent item anymore.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      639eefc8
  9. 12 11月, 2013 9 次提交