1. 11 4月, 2015 5 次提交
    • C
      Btrfs: two stage dirty block group writeout · c9dc4c65
      Chris Mason 提交于
      Block group cache writeout is currently waiting on the pages for each
      block group cache before moving on to writing the next one.  This commit
      switches things around to send down all the caches and then wait on them
      in batches.
      
      The end result is much faster, since we're keeping the disk pipeline
      full.
      Signed-off-by: NChris Mason <clm@fb.com>
      c9dc4c65
    • J
      Btrfs: don't commit the transaction in the async space flushing · 365c5313
      Josef Bacik 提交于
      We're triggering a huge number of commits from
      btrfs_async_reclaim_metadata_space.  These aren't really requried,
      because everyone calling the async reclaim code is going to end up
      triggering a commit on their own.
      Signed-off-by: NChris Mason <clm@fb.com>
      365c5313
    • J
      Btrfs: reserve space for block groups · cb723e49
      Josef Bacik 提交于
      This changes our delayed refs calculations to include the space needed
      to write back dirty block groups.
      Signed-off-by: NChris Mason <clm@fb.com>
      cb723e49
    • C
      Btrfs: refill block reserves during truncate · 28f75a0e
      Chris Mason 提交于
      When truncate starts, it allocates some space in the block reserves so
      that we'll have enough to update metadata along the way.
      
      For very large files, we can easily go through all of that space as we
      loop through the extents.  This changes truncate to refill the space
      reservation as it progresses through the file.
      Signed-off-by: NChris Mason <clm@fb.com>
      28f75a0e
    • J
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik 提交于
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: NChris Mason <clm@fb.com>
      1262133b
  2. 01 4月, 2015 1 次提交
    • C
      Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent() · dd825259
      Chris Mason 提交于
      The error handling path for alloc_reserved_tree_block is calling
      btrfs_free_and_pin_reserved_extent with a spinning tree lock held.  This
      might sleep as we allocate extent_state objects:
      
       BUG: sleeping function called from invalid context at mm/slub.c:1268
       in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
       5 locks held by kworker/u4:7/11093:
        #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
        #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
        #2:  (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
        #3:  (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
        #4:  (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
       CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G        W 4.0.0-rc6-default+ #246
       Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
       Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
        00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
        ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
        0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
       Call Trace:
        [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
        [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
        [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
        [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
        [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
        [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
        [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
        [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
        [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
        [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
        [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
        [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
        [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
        [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
        [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
        [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
        [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
        [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
        [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
        [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
        [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
        [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
        [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
        [<ffffffff81091d51>] ? process_one_work+0x151/0x520
        [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
        [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
        [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
        [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
        [<ffffffff81098686>] kthread+0xf6/0x120
        [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
        [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
        [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
       ------------[ cut here ]------------
      
      This changes things to free the path first, which will also unlock the
      extent buffer.
      Signed-off-by: NChris Mason <clm@fb.com>
      Reported-by: NDave Sterba <dsterba@suse.cz>
      Tested-by: NDave Sterba <dsterba@suse.cz>
      dd825259
  3. 27 3月, 2015 1 次提交
    • F
      Btrfs: fix log tree corruption when fs mounted with -o discard · dcc82f47
      Filipe Manana 提交于
      While committing a transaction we free the log roots before we write the
      new super block. Freeing the log roots implies marking the disk location
      of every node/leaf (metadata extent) as pinned before the new super block
      is written. This is to prevent the disk location of log metadata extents
      from being reused before the new super block is written, otherwise we
      would have a corrupted log tree if before the new super block is written
      a crash/reboot happens and the location of any log tree metadata extent
      ended up being reused and rewritten.
      
      Even though we pinned the log tree's metadata extents, we were issuing a
      discard against them if the fs was mounted with the -o discard option,
      resulting in corruption of the log tree if a crash/reboot happened before
      writing the new super block - the next time the fs was mounted, during
      the log replay process we would find nodes/leafs of the log btree with
      a content full of zeroes, causing the process to fail and require the
      use of the tool btrfs-zero-log to wipeout the log tree (and all data
      previously fsynced becoming lost forever).
      
      Fix this by not doing a discard when pinning an extent. The discard will
      be done later when it's safe (after the new super block is committed) at
      extent-tree.c:btrfs_finish_extent_commit().
      
      Fixes: e688b725 (Btrfs: fix extent pinning bugs in the tree log)
      CC: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dcc82f47
  4. 18 3月, 2015 1 次提交
    • J
      Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5
      Josef Bacik 提交于
      I introduced a regression wrt outstanding_extents accounting.  These are tricky
      areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
      at any time.  So add sanity tests to cover the various conditions that are
      tricky in order to make sure we don't introduce regressions in the future.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6a3891c5
  5. 17 3月, 2015 1 次提交
    • J
      Btrfs: prepare block group cache before writing · dcdf7f6d
      Josef Bacik 提交于
      Writing the block group cache will modify the extent tree quite a bit because it
      truncates the old space cache and pre-allocates new stuff.  To try and cut down
      on the churn lets do the setup dance first, then later on hopefully we can avoid
      looping with newly dirtied roots.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      dcdf7f6d
  6. 14 3月, 2015 1 次提交
  7. 04 3月, 2015 5 次提交
  8. 03 3月, 2015 1 次提交
  9. 21 2月, 2015 1 次提交
  10. 17 2月, 2015 2 次提交
  11. 15 2月, 2015 2 次提交
    • F
      Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group · 3d84be79
      Forrest Liu 提交于
      Removing large amount of block group in a transaction may encounters
      BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata()
      will grab metadata reservation from transaction handle, and
      btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when
      delete unused block group.
      
      The problem can be reproduce by following script
      
          mntpath=/btrfs
          loopdev=/dev/loop0
          filepath=/home/forrest/image
      
          umount $mntpath
          losetup -d $loopdev
          truncate --size 1000g $filepath
          losetup $loopdev $filepath
          mkfs.btrfs -f $loopdev
          mount $loopdev $mntpath
      
          for j in `seq 1 1 1000`; do
              fallocate -l 1g $mntpath/$j
          done
          # wait cleaner thread remove unused block group
          sleep 300
      
      The call trace that results from the BUG_ON() is:
      
      [  613.093084] ------------[ cut here ]------------
      [  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
      [  613.105855] invalid opcode: 0000 [#1] SMP
      [  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
      [  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G            E  3.19.0-rc7-custom #2
      [  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      [  613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000
      [  613.154969] RIP: 0010:[<ffffffffa01441c2>]  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
      [  613.157780] RSP: 0018:ffff880039cf7c48  EFLAGS: 00010286
      [  613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000
      [  613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138
      [  613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000
      [  613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000
      [  613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001
      [  613.170932] FS:  0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
      [  613.173316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0
      [  613.177554] Stack:
      [  613.178712]  ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800
      [  613.181297]  ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60
      [  613.183782]  ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0
      [  613.186171] Call Trace:
      [  613.187493]  [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs]
      [  613.189801]  [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs]
      [  613.192126]  [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs]
      [  613.194267]  [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
      [  613.196567]  [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs]
      [  613.198687]  [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs]
      [  613.200758]  [<ffffffff8108f232>] kthread+0xd2/0xf0
      [  613.202616]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
      [  613.204738]  [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0
      [  613.206652]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
      [  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
      [  613.216562] RIP  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
      [  613.218828]  RSP <ffff880039cf7c48>
      [  613.220382] ---[ end trace 71073106deb8a457 ]---
      
      This patch replace btrfs_join_transaction() with btrfs_start_transaction() in
      btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add()
      Signed-off-by: NForrest Liu <forrestl@synology.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3d84be79
    • J
      Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik 提交于
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents so our enospc accounting is
      quite a bit lower than what we need.  To keep track of this make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dcab6a3b
  12. 03 2月, 2015 2 次提交
    • S
      btrfs: delete chunk allocation attemp when setting block group ro · 2f081088
      Shaohua Li 提交于
      Below test will fail currently:
            mkfs.ext4 -F /dev/sda
            btrfs-convert /dev/sda
            mount /dev/sda /mnt
            btrfs device add -f /dev/sdb /mnt
            btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt
      
      The reason is there are some block groups with usage 0, but the whole
      disk hasn't free space to allocate new chunk, so we even can't set such
      block group readonly. This patch deletes the chunk allocation when
      setting block group ro. For META, we already have reserve. But for
      SYSTEM, we don't have, so the check_system_chunk is still required.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2f081088
    • F
      Btrfs: fix race between transaction commit and empty block group removal · d4b450cd
      Filipe Manana 提交于
      Committing a transaction can race with automatic removal of empty block
      groups (cleaner kthread), leading to a BUG_ON() in the transaction
      commit code while running btrfs_finish_extent_commit(). The following
      sequence diagram shows how it can happen:
      
                 CPU 1                                       CPU 2
      
      btrfs_commit_transaction()
        fs_info->running_transaction = NULL
        btrfs_finish_extent_commit()
          find_first_extent_bit()
            -> found range for block group X
               in fs_info->freed_extents[]
      
                                                     btrfs_delete_unused_bgs()
                                                       -> found block group X
      
                                                       Removed block group X's range
                                                       from fs_info->freed_extents[]
      
                                                       btrfs_remove_chunk()
                                                          btrfs_remove_block_group(bg X)
      
          unpin_extent_range(bg X range)
             btrfs_lookup_block_group(bg X)
                -> returns NULL
                  -> BUG_ON()
      
      The trace that results from the BUG_ON() is:
      
      [48665.187808] ------------[ cut here ]------------
      [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
      [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
      [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
      [48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
      [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
      [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
      [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
      [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
      [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
      [48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
      [48665.197388] Stack:
      [48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
      [48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
      [48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
      [48665.197388] Call Trace:
      [48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
      [48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
      [48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
      [48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
      [48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
      [48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
      [48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
      [48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
      [48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
      [48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
      [48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
      [48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
      [48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388]  RSP <ffff8801b56a7b88>
      [48665.272246] ---[ end trace b9c6ab9957521376 ]---
      
      Fix this by ensuring that unpining the block group's range in
      btrfs_finish_extent_commit() is done in a synchronized fashion
      with removing the block group's range from freed_extents[]
      in btrfs_delete_unused_bgs()
      
      This race got introduced with the change:
      
          Btrfs: remove empty block groups automatically
          commit 47ab2a6cSigned-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d4b450cd
  13. 22 1月, 2015 4 次提交
    • L
      Btrfs: cleanup unused run_most · 26455d33
      Liu Bo 提交于
      "run_most" is not used anymore.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      26455d33
    • Z
      Btrfs: add ref_count and free function for btrfs_bio · 6e9606d2
      Zhao Lei 提交于
      1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
         to keep btrfs_bio's memory in raid56 recovery implement.
      2: free function for bbio will make code clean and flexible, plus
         forced data type checking in compile.
      
      Changelog v1->v2:
       Rename following by David Sterba's suggestion:
       put_btrfs_bio() -> btrfs_put_bio()
       get_btrfs_bio() -> btrfs_get_bio()
       bbio->ref_count -> bbio->refs
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6e9606d2
    • F
      Btrfs: lookup for block group only if needed when freeing a tree block · 6219872d
      Filipe Manana 提交于
      Very often our extent buffer's header generation doesn't match the current
      transaction's id or it is also referenced by other trees (snapshots), so
      we don't need the corresponding block group cache object. Therefore only
      search for it if we are going to use it, so we avoid an unnecessary search
      in the block groups rbtree (and acquiring and releasing its spinlock).
      
      Freeing a tree block is performed when COWing or deleting a node/leaf,
      which implies we are holding the node/leaf's parent node lock, therefore
      reducing the amount of time spent when freeing a tree block helps reducing
      the amount of time we are holding the parent node's lock.
      
      For example, for a run of xfstests/generic/083, the block group cache
      object was needed only 682 times for a total of 226691 calls to free
      a tree block.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6219872d
    • J
      Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik 提交于
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list for the block groups that we get added to
      when we dirty the block group for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ce93ec54
  14. 20 1月, 2015 1 次提交
    • F
      Btrfs: fix race deleting block group from space_info->ro_bgs list · 75c68e9f
      Filipe Manana 提交于
      When removing a block group we were deleting it from its space_info's
      ro_bgs list without the correct protection - the space info's spinlock.
      Fix this by doing the list delete while holding the spinlock of the
      corresponding space info, which is the correct lock for any operation
      on that list.
      
      This issue was introduced in the 3.19 kernel by the following change:
      
          Btrfs: move read only block groups onto their own list V2
          commit 633c0aad
      
      I ran into a kernel crash while a task was running statfs, which iterates
      the space_info->ro_bgs list while holding the space info's spinlock,
      and another task was deleting it from the same list, without holding that
      spinlock, as part of the block group remove operation (while running the
      function btrfs_remove_block_group). This happened often when running the
      stress test xfstests/generic/038 I recently made.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      75c68e9f
  15. 03 1月, 2015 1 次提交
  16. 13 12月, 2014 3 次提交
  17. 11 12月, 2014 3 次提交
    • F
      Btrfs: remove non-sense btrfs_error_discard_extent() function · 1edb647b
      Filipe Manana 提交于
      It doesn't do anything special, it just calls btrfs_discard_extent(),
      so just remove it.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1edb647b
    • F
      Btrfs: fix fs corruption on transaction abort if device supports discard · 678886bd
      Filipe Manana 提交于
      When we abort a transaction we iterate over all the ranges marked as dirty
      in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
      from those trees, add them back (unpin) to the free space caches and, if
      the fs was mounted with "-o discard", perform a discard on those regions.
      Also, after adding the regions to the free space caches, a fitrim ioctl call
      can see those ranges in a block group's free space cache and perform a discard
      on the ranges, so the same issue can happen without "-o discard" as well.
      
      This causes corruption, affecting one or multiple btree nodes (in the worst
      case leaving the fs unmountable) because some of those ranges (the ones in
      the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
      referred by the last committed super block - breaking the rule that anything
      that was committed by a transaction is untouched until the next transaction
      commits successfully.
      
      I ran into this while running in a loop (for several hours) the fstest that
      I recently submitted:
      
        [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
      
      The corruption always happened when a transaction aborted and then fsck complained
      like this:
      
         _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
         *** fsck.btrfs output ***
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         read block failed check_tree_block
         Couldn't open file system
      
      In this case 94945280 corresponded to the root of a tree.
      Using frace what I observed was the following sequence of steps happened:
      
         1) transaction N started, fs_info->pinned_extents pointed to
            fs_info->freed_extents[0];
      
         2) node/eb 94945280 is created;
      
         3) eb is persisted to disk;
      
         4) transaction N commit starts, fs_info->pinned_extents now points to
            fs_info->freed_extents[1], and transaction N completes;
      
         5) transaction N + 1 starts;
      
         6) eb is COWed, and btrfs_free_tree_block() called for this eb;
      
         7) eb range (94945280 to 94945280 + 16Kb) is added to
            fs_info->pinned_extents (fs_info->freed_extents[1]);
      
         8) Something goes wrong in transaction N + 1, like hitting ENOSPC
            for example, and the transaction is aborted, turning the fs into
            readonly mode. The stack trace I got for example:
      
            [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
            [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
            [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
            [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
            [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
            [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
            [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
            (...)
            [112065.264493] ---[ end trace dd7903a975a31a08 ]---
            [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
            [112065.264997] BTRFS info (device sdc): forced readonly
      
         9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
            fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
            turn calls btrfs_destroy_pinned_extent();
      
         10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
             marked as dirty in fs_info->freed_extents[], and for each one
             it calls discard, if the fs was mounted with "-o discard", and
             adds the range to the free space cache of the respective block
             group;
      
         11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
             sees the free space entries and performs a discard;
      
         12) After an umount and mount (or fsck), our eb's location on disk was full
             of zeroes, and it should have been untouched, because it was marked as
             dirty in the fs_info->pinned_extents tree, and therefore used by the
             trees that the last committed superblock points to.
      
      Fix this by not performing a discard and not adding the ranges to the free space
      caches - it's useless from this point since the fs is now in readonly mode and
      we won't write free space caches to disk anymore (otherwise we would leak space)
      nor any new superblock. By not adding the ranges to the free space caches, it
      prevents other code paths from allocating that space and write to it as well,
      therefore being safer and simpler.
      
      This isn't a new problem, as it's been present since 2011 (git commit
      acce952b).
      
      Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      678886bd
    • F
      Btrfs: always clear a block group node when removing it from the tree · 01eacb27
      Filipe Manana 提交于
      Always clear a block group's rbnode after removing it from the rbtree to
      ensure that any tasks that might be holding a reference on the block group
      don't end up accessing stale rbnode left and right child pointers through
      next_block_group().
      
      This is a leftover from the change titled:
      "Btrfs: fix invalid block group rbtree access after bg is removed"
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      01eacb27
  18. 03 12月, 2014 5 次提交
    • J
      Btrfs: make get_caching_control unconditionally return the ctl · cb83b7b8
      Josef Bacik 提交于
      This was written when we didn't do a caching control for the fast free space
      cache loading.  However we started doing that a long time ago, and there is
      still a small window of time that we could be caching the block group the fast
      way, so if there is a caching_ctl at all on the block group just return it, the
      callers all wait properly for what they want.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      cb83b7b8
    • F
      Btrfs: fix unprotected deletion from pending_chunks list · 8dbcd10f
      Filipe Manana 提交于
      On block group remove if the corresponding extent map was on the
      transaction->pending_chunks list, we were deleting the extent map
      from that list, through remove_extent_mapping(), without any
      synchronization with chunk allocation (which iterates that list
      and adds new elements to it). Fix this by ensure that this is done
      while the chunk mutex is held, since that's the mutex that protects
      the list in the chunk allocation code path.
      
      This applies on top (depends on) of my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      But the issue in fact was already present before that change, it only
      became easier to hit after Josef's 3.18 patch that added automatic
      removal of empty block groups.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8dbcd10f
    • F
      Btrfs: fix fs mapping extent map leak · 495e64f4
      Filipe Manana 提交于
      On chunk allocation error (label "error_del_extent"), after adding the
      extent map to the tree and to the pending chunks list, we would leave
      decrementing the extent map's refcount by 2 instead of 3 (our allocation
      + tree reference + list reference).
      
      Also, on chunk/block group removal, if the block group was on the list
      pending_chunks we weren't decrementing the respective list reference.
      
      Detected by 'rmmod btrfs':
      
      [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
      [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
      [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
      [20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
      [20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
      [20770.106136] Call Trace:
      [20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
      [20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
      [20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
      [20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
      [20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
      [20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
      [20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
      
      This applies on top (depends on) of my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      But the issue in fact was already present before that change, it only
      became easier to hit after Josef's 3.18 patch that added automatic
      removal of empty block groups.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      495e64f4
    • F
      Btrfs: make btrfs_abort_transaction consider existence of new block groups · c92f6be3
      Filipe Manana 提交于
      If the transaction handle doesn't have used blocks but has created new block
      groups make sure we turn the fs into readonly mode too. This is because the
      new block groups didn't get all their metadata persisted into the chunk and
      device trees, and therefore if a subsequent transaction starts, allocates
      space from the new block groups, writes data or metadata into that space,
      commits successfully and then after we unmount and mount the filesystem
      again, the same space can be allocated again for a new block group,
      resulting in file data or metadata corruption.
      
      Example where we don't abort the transaction when we fail to finish the
      chunk allocation (add items to the chunk and device trees) and later a
      future transaction where the block group is removed fails because it can't
      find the chunk item in the chunk tree:
      
      [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
      [25230.404301] BTRFS: Transaction aborted (error -28)
      [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
      [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
      [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
      [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
      [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
      [25230.404334] Call Trace:
      [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
      [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
      [25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
      [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
      [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
      [25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
      [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
      [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
      [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
      [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
      [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
      [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
      [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
      [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
      [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
      [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
      [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
      [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
      [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
      [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
      [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c92f6be3
    • F
      Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana 提交于
      Our fs trim operation, which is completely transactionless (doesn't start
      or joins an existing transaction) consists of visiting all block groups
      and then for each one to iterate its free space entries and perform a
      discard operation against the space range represented by the free space
      entries. However before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, remove the extent map that maps
      logical address to physical addresses and the corresponding chunk metadata
      from the the chunk and device trees. After that and before the discard
      completes, the current running transaction can finish and a new one start,
      allowing for new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      04216820