  1. 15 Feb, 2015 2 commits
    • F
      Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group · 3d84be79
      Forrest Liu committed
      Removing a large number of block groups in a transaction may hit the
      BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata()
      grabs its metadata reservation from the transaction handle, and
      btrfs_delete_unused_bgs() did not reserve metadata for the transaction handle
      when deleting an unused block group.
      
      The problem can be reproduced with the following script:
      
          mntpath=/btrfs
          loopdev=/dev/loop0
          filepath=/home/forrest/image
      
          umount $mntpath
          losetup -d $loopdev
          truncate --size 1000g $filepath
          losetup $loopdev $filepath
          mkfs.btrfs -f $loopdev
          mount $loopdev $mntpath
      
          for j in `seq 1 1 1000`; do
              fallocate -l 1g $mntpath/$j
          done
          # wait for the cleaner thread to remove the unused block groups
          sleep 300
      
      The call trace that results from the BUG_ON() is:
      
      [  613.093084] ------------[ cut here ]------------
      [  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
      [  613.105855] invalid opcode: 0000 [#1] SMP
      [  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
      [  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G            E  3.19.0-rc7-custom #2
      [  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
      [  613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000
      [  613.154969] RIP: 0010:[<ffffffffa01441c2>]  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
      [  613.157780] RSP: 0018:ffff880039cf7c48  EFLAGS: 00010286
      [  613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000
      [  613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138
      [  613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000
      [  613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000
      [  613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001
      [  613.170932] FS:  0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
      [  613.173316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0
      [  613.177554] Stack:
      [  613.178712]  ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800
      [  613.181297]  ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60
      [  613.183782]  ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0
      [  613.186171] Call Trace:
      [  613.187493]  [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs]
      [  613.189801]  [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs]
      [  613.192126]  [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs]
      [  613.194267]  [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
      [  613.196567]  [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs]
      [  613.198687]  [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs]
      [  613.200758]  [<ffffffff8108f232>] kthread+0xd2/0xf0
      [  613.202616]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
      [  613.204738]  [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0
      [  613.206652]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
      [  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
      [  613.216562] RIP  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
      [  613.218828]  RSP <ffff880039cf7c48>
      [  613.220382] ---[ end trace 71073106deb8a457 ]---
      
      This patch replaces btrfs_join_transaction() with btrfs_start_transaction() in
      btrfs_delete_unused_bgs() to prevent the BUG_ON() in btrfs_orphan_add().
      Signed-off-by: Forrest Liu <forrestl@synology.com>
      Signed-off-by: Chris Mason <clm@fb.com>
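      The difference between joining and starting a transaction can be sketched
      with a toy Python model. This is not kernel code: the class and function
      names are hypothetical stand-ins, and the model fails on the first call,
      whereas the report above only hit the BUG_ON() when many block groups
      were removed.

```python
# Toy model only: names are hypothetical, not the kernel's APIs.
class TransHandle:
    def __init__(self, reserved_items):
        # Metadata items reserved up front for this handle.
        self.reserved_items = reserved_items


def join_transaction():
    # Joining attaches to the running transaction without reserving anything.
    return TransHandle(reserved_items=0)


def start_transaction(num_items):
    # Starting reserves metadata space for num_items before returning.
    return TransHandle(reserved_items=num_items)


def orphan_add(trans):
    # Takes one item's worth of reservation from the handle and fails
    # when the handle has nothing reserved (the BUG_ON() case above).
    if trans.reserved_items < 1:
        raise RuntimeError("no metadata reserved in transaction handle")
    trans.reserved_items -= 1
```

      In this model, orphan_add(start_transaction(1)) succeeds, while
      orphan_add(join_transaction()) raises.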
    • J
      Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik committed
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents so our enospc accounting is
      quite a bit lower than what we need.  To keep track of this make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
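      The accounting rule above amounts to a ceiling division. A minimal Python
      sketch, assuming a 128 MiB limit as stated in the message; the function
      name is hypothetical and this is not the kernel implementation:

```python
# Illustrative arithmetic only; names are assumptions, not kernel symbols.
MAX_EXTENT_SIZE = 128 * 1024 * 1024  # 128 MiB extent size limit


def outstanding_extents(num_bytes, max_extent=MAX_EXTENT_SIZE):
    # One extent for every started multiple of max_extent (ceiling division).
    return (num_bytes + max_extent - 1) // max_extent
```

      For example, a contiguous dirty range of 100 GiB counts as 800 extents
      under this rule, instead of 1.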
  2. 03 Feb, 2015 2 commits
    • S
      btrfs: delete chunk allocation attemp when setting block group ro · 2f081088
      Shaohua Li committed
      The test below currently fails:
            mkfs.ext4 -F /dev/sda
            btrfs-convert /dev/sda
            mount /dev/sda /mnt
            btrfs device add -f /dev/sdb /mnt
            btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt
      
      The reason is that some block groups have usage 0, but the disk as a
      whole has no free space left to allocate a new chunk, so we cannot even
      set such a block group read-only. This patch removes the chunk allocation
      when setting a block group ro. For META, we already have a reserve. But
      for SYSTEM, we don't, so check_system_chunk is still required.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix race between transaction commit and empty block group removal · d4b450cd
      Filipe Manana committed
      Committing a transaction can race with automatic removal of empty block
      groups (cleaner kthread), leading to a BUG_ON() in the transaction
      commit code while running btrfs_finish_extent_commit(). The following
      sequence diagram shows how it can happen:
      
                 CPU 1                                       CPU 2
      
      btrfs_commit_transaction()
        fs_info->running_transaction = NULL
        btrfs_finish_extent_commit()
          find_first_extent_bit()
            -> found range for block group X
               in fs_info->freed_extents[]
      
                                                     btrfs_delete_unused_bgs()
                                                       -> found block group X
      
                                                       Removed block group X's range
                                                       from fs_info->freed_extents[]
      
                                                       btrfs_remove_chunk()
                                                          btrfs_remove_block_group(bg X)
      
          unpin_extent_range(bg X range)
             btrfs_lookup_block_group(bg X)
                -> returns NULL
                  -> BUG_ON()
      
      The trace that results from the BUG_ON() is:
      
      [48665.187808] ------------[ cut here ]------------
      [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
      [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
      [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
      [48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
      [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
      [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
      [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
      [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
      [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
      [48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
      [48665.197388] Stack:
      [48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
      [48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
      [48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
      [48665.197388] Call Trace:
      [48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
      [48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
      [48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
      [48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
      [48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
      [48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
      [48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
      [48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
      [48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
      [48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
      [48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
      [48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
      [48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388]  RSP <ffff8801b56a7b88>
      [48665.272246] ---[ end trace b9c6ab9957521376 ]---
      
      Fix this by ensuring that unpinning the block group's range in
      btrfs_finish_extent_commit() is done in a synchronized fashion
      with removing the block group's range from freed_extents[]
      in btrfs_delete_unused_bgs().
      
      This race got introduced with the change:
      
          Btrfs: remove empty block groups automatically
          commit 47ab2a6c
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
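      The synchronization described above can be sketched as a toy Python model.
      It is not the kernel code: FsInfo, unpin_lock and both functions are
      hypothetical names, and dictionaries stand in for the extent trees. The
      point is only that the commit side looks up and unpins a range under the
      same lock the cleaner side uses to drop that range.

```python
import threading


# Toy model only: all names are hypothetical stand-ins.
class FsInfo:
    def __init__(self):
        self.block_groups = {"X": (0, 100)}    # name -> (start, end)
        self.freed_extents = {"X": (0, 100)}   # pinned ranges awaiting unpin
        self.unpin_lock = threading.Lock()


def finish_extent_commit(fs):
    # Commit side: find and unpin each pinned range while holding the lock.
    unpinned = []
    with fs.unpin_lock:
        for name in list(fs.freed_extents):
            # Stand-in for the lookup that returned NULL and hit BUG_ON().
            if name not in fs.block_groups:
                raise RuntimeError("block group %s vanished before unpin" % name)
            del fs.freed_extents[name]
            unpinned.append(name)
    return unpinned


def delete_unused_bg(fs, name):
    # Cleaner side: drop the pinned range under the same lock, then remove
    # the block group itself.
    with fs.unpin_lock:
        fs.freed_extents.pop(name, None)
    del fs.block_groups[name]
```

      Whichever side takes the lock first, the commit side either unpins the
      range while the block group still exists or no longer sees the range.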
  3. 22 Jan, 2015 4 commits
    • L
      Btrfs: cleanup unused run_most · 26455d33
      Liu Bo committed
      "run_most" is not used anymore.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Z
      Btrfs: add ref_count and free function for btrfs_bio · 6e9606d2
      Zhao Lei committed
      1: ref_count is simpler than the current RBIO_HOLD_BBIO_MAP_BIT flag
         for keeping btrfs_bio's memory alive in the raid56 recovery implementation.
      2: a free function for bbio makes the code cleaner and more flexible, and
         enforces data type checking at compile time.
      
      Changelog v1->v2:
       Renamed the following at David Sterba's suggestion:
       put_btrfs_bio() -> btrfs_put_bio()
       get_btrfs_bio() -> btrfs_get_bio()
       bbio->ref_count -> bbio->refs
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
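      The get/put pattern this commit introduces can be illustrated with a small
      Python model. It is a sketch of reference counting in general, not the
      btrfs_bio code; Bbio, get_bio and put_bio are hypothetical names chosen
      to echo the message above.

```python
# Toy reference-counting model; names are hypothetical.
class Bbio:
    def __init__(self):
        self.refs = 1       # the allocator holds the first reference
        self.freed = False


def get_bio(bbio):
    # Take an extra reference so the object outlives the caller's use.
    assert bbio.refs > 0, "cannot take a reference on a freed object"
    bbio.refs += 1


def put_bio(bbio):
    # Drop one reference and free the object when the last one goes away.
    if bbio is None:
        return
    bbio.refs -= 1
    if bbio.refs == 0:
        bbio.freed = True
```

      A holder that needs the object to stay alive calls get_bio() and later
      put_bio(); the memory is released only by the final put.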
    • F
      Btrfs: lookup for block group only if needed when freeing a tree block · 6219872d
      Filipe Manana committed
      Very often our extent buffer's header generation doesn't match the current
      transaction's id or it is also referenced by other trees (snapshots), so
      we don't need the corresponding block group cache object. Therefore only
      search for it if we are going to use it, so we avoid an unnecessary search
      in the block groups rbtree (and acquiring and releasing its spinlock).
      
      Freeing a tree block is performed when COWing or deleting a node/leaf,
      which implies we are holding the node/leaf's parent node lock. Reducing
      the time spent freeing a tree block therefore helps reduce the time we
      hold the parent node's lock.
      
      For example, for a run of xfstests/generic/083, the block group cache
      object was needed only 682 times for a total of 226691 calls to free
      a tree block.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • J
      Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik committed
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list for the block groups that we get added to
      when we dirty the block group for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
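      The dirty-list idea can be sketched in Python as follows. This is a toy
      model under assumed names (BlockGroup, Transaction, mark_dirty), not the
      kernel data structures: a block group is queued the first time it is
      dirtied, and the write-out pass visits only queued groups.

```python
# Toy model of a per-transaction dirty list; names are hypothetical.
class BlockGroup:
    def __init__(self, name):
        self.name = name
        self.on_dirty_list = False


class Transaction:
    def __init__(self):
        self.dirty_bgs = []

    def mark_dirty(self, bg):
        # Queue the block group only the first time it is dirtied.
        if not bg.on_dirty_list:
            bg.on_dirty_list = True
            self.dirty_bgs.append(bg)

    def write_dirty_block_groups(self):
        # Visit only the queued groups, in the order they were dirtied.
        written = []
        while self.dirty_bgs:
            bg = self.dirty_bgs.pop(0)
            bg.on_dirty_list = False
            written.append(bg.name)
        return written
```

      The cost of each write-out pass then depends on how many block groups
      were dirtied since the previous pass, not on how many exist.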
  4. 03 Jan, 2015 1 commit
  5. 13 Dec, 2014 3 commits
  6. 11 Dec, 2014 3 commits
    • F
      Btrfs: remove non-sense btrfs_error_discard_extent() function · 1edb647b
      Filipe Manana committed
      It doesn't do anything special, it just calls btrfs_discard_extent(),
      so just remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix fs corruption on transaction abort if device supports discard · 678886bd
      Filipe Manana committed
      When we abort a transaction we iterate over all the ranges marked as dirty
      in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
      from those trees, add them back (unpin) to the free space caches and, if
      the fs was mounted with "-o discard", perform a discard on those regions.
      Also, after adding the regions to the free space caches, a fitrim ioctl call
      can see those ranges in a block group's free space cache and perform a discard
      on the ranges, so the same issue can happen without "-o discard" as well.
      
      This causes corruption, affecting one or multiple btree nodes (in the worst
      case leaving the fs unmountable) because some of those ranges (the ones in
      the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
      referred by the last committed super block - breaking the rule that anything
      that was committed by a transaction is untouched until the next transaction
      commits successfully.
      
      I ran into this while running in a loop (for several hours) the fstest that
      I recently submitted:
      
        [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
      
      The corruption always happened when a transaction aborted and then fsck complained
      like this:
      
         _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
         *** fsck.btrfs output ***
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         read block failed check_tree_block
         Couldn't open file system
      
      In this case 94945280 corresponded to the root of a tree.
      Using ftrace, I observed the following sequence of steps:
      
         1) transaction N started, fs_info->pinned_extents pointed to
            fs_info->freed_extents[0];
      
         2) node/eb 94945280 is created;
      
         3) eb is persisted to disk;
      
         4) transaction N commit starts, fs_info->pinned_extents now points to
            fs_info->freed_extents[1], and transaction N completes;
      
         5) transaction N + 1 starts;
      
         6) eb is COWed, and btrfs_free_tree_block() called for this eb;
      
         7) eb range (94945280 to 94945280 + 16Kb) is added to
            fs_info->pinned_extents (fs_info->freed_extents[1]);
      
         8) Something goes wrong in transaction N + 1, like hitting ENOSPC
            for example, and the transaction is aborted, turning the fs into
            readonly mode. The stack trace I got for example:
      
            [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
            [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
            [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
            [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
            [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
            [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
            [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
            (...)
            [112065.264493] ---[ end trace dd7903a975a31a08 ]---
            [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
            [112065.264997] BTRFS info (device sdc): forced readonly
      
         9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
            fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
            turn calls btrfs_destroy_pinned_extent();
      
         10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
             marked as dirty in fs_info->freed_extents[], and for each one
             it calls discard, if the fs was mounted with "-o discard", and
             adds the range to the free space cache of the respective block
             group;
      
         11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
             sees the free space entries and performs a discard;
      
         12) After an umount and mount (or fsck), our eb's location on disk was full
             of zeroes, and it should have been untouched, because it was marked as
             dirty in the fs_info->pinned_extents tree, and therefore used by the
             trees that the last committed superblock points to.
      
      Fix this by not performing a discard and not adding the ranges to the free space
      caches - it's useless from this point since the fs is now in readonly mode and
      we won't write free space caches to disk anymore (otherwise we would leak space)
      nor any new superblock. By not adding the ranges to the free space caches, it
      prevents other code paths from allocating that space and writing to it as well,
      therefore being safer and simpler.
      
      This isn't a new problem, as it's been present since 2011 (git commit
      acce952b).
      
      Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: always clear a block group node when removing it from the tree · 01eacb27
      Filipe Manana committed
      Always clear a block group's rbnode after removing it from the rbtree to
      ensure that any tasks that might be holding a reference on the block group
      don't end up accessing stale rbnode left and right child pointers through
      next_block_group().
      
      This is a leftover from the change titled:
      "Btrfs: fix invalid block group rbtree access after bg is removed"
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 03 Dec, 2014 8 commits
    • J
      Btrfs: make get_caching_control unconditionally return the ctl · cb83b7b8
      Josef Bacik committed
      This was written when we didn't do a caching control for the fast free space
      cache loading.  However we started doing that a long time ago, and there is
      still a small window of time that we could be caching the block group the fast
      way, so if there is a caching_ctl at all on the block group just return it, the
      callers all wait properly for what they want.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix unprotected deletion from pending_chunks list · 8dbcd10f
      Filipe Manana committed
      On block group removal, if the corresponding extent map was on the
      transaction->pending_chunks list, we were deleting the extent map
      from that list, through remove_extent_mapping(), without any
      synchronization with chunk allocation (which iterates that list
      and adds new elements to it). Fix this by ensuring that this is done
      while the chunk mutex is held, since that's the mutex that protects
      the list in the chunk allocation code path.
      
      This applies on top of (depends on) my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      But the issue in fact was already present before that change, it only
      became easier to hit after Josef's 3.18 patch that added automatic
      removal of empty block groups.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix fs mapping extent map leak · 495e64f4
      Filipe Manana committed
      On chunk allocation error (label "error_del_extent"), after adding the
      extent map to the tree and to the pending chunks list, we would end up
      decrementing the extent map's refcount by 2 instead of 3 (our allocation
      + tree reference + list reference).
      
      Also, on chunk/block group removal, if the block group was on the list
      pending_chunks we weren't decrementing the respective list reference.
      
      Detected by 'rmmod btrfs':
      
      [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
      [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
      [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
      [20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
      [20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
      [20770.106136] Call Trace:
      [20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
      [20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
      [20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
      [20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
      [20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
      [20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
      [20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
      
      This applies on top of (depends on) my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      
      But the issue in fact was already present before that change, it only
      became easier to hit after Josef's 3.18 patch that added automatic
      removal of empty block groups.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
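      The leak on the error path is a matter of reference arithmetic, which the
      following toy Python model illustrates. The names are hypothetical and
      only the counting mirrors the description above (allocation + tree
      reference + list reference).

```python
# Toy reference arithmetic; names are hypothetical.
class ExtentMap:
    def __init__(self):
        self.refs = 1  # reference held by the allocating code


def refs_left_after_error_path(drops):
    em = ExtentMap()
    em.refs += 1      # reference taken when added to the extent map tree
    em.refs += 1      # reference taken when added to the pending chunks list
    em.refs -= drops  # the error path releases this many references
    return em.refs
```

      Dropping 2 references leaves 1 behind, so the object is never freed;
      dropping all 3 brings the count to 0.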
    • F
      Btrfs: make btrfs_abort_transaction consider existence of new block groups · c92f6be3
      Filipe Manana committed
      If the transaction handle doesn't have used blocks but has created new block
      groups make sure we turn the fs into readonly mode too. This is because the
      new block groups didn't get all their metadata persisted into the chunk and
      device trees, and therefore if a subsequent transaction starts, allocates
      space from the new block groups, writes data or metadata into that space,
      commits successfully and then after we unmount and mount the filesystem
      again, the same space can be allocated again for a new block group,
      resulting in file data or metadata corruption.
      
      Example where we don't abort the transaction when we fail to finish the
      chunk allocation (add items to the chunk and device trees) and later a
      future transaction where the block group is removed fails because it can't
      find the chunk item in the chunk tree:
      
      [25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
      [25230.404301] BTRFS: Transaction aborted (error -28)
      [25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
      [25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
      [25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
      [25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
      [25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
      [25230.404334] Call Trace:
      [25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
      [25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
      [25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
      [25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
      [25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
      [25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
      [25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
      [25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
      [25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
      [25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
      [25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
      [25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
      [25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
      [25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
      [25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
      [25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
      [25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
      [25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
      [25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
      [25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
      [25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana committed
      Our fs trim operation, which is completely transactionless (it doesn't start
      or join an existing transaction), consists of visiting all block groups
      and, for each one, iterating its free space entries and performing a
      discard operation against the space range represented by the free space
      entries. However before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, remove the extent map that maps
      logical address to physical addresses and the corresponding chunk metadata
      from the chunk and device trees. After that and before the discard
      completes, the current running transaction can finish and a new one start,
      allowing for new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      04216820
    • F
      Btrfs: fix freeing used extents after removing empty block group · ae0ab003
      Filipe Manana committed
      There's a race between adding a block group to the list of the unused
      block groups and removing an unused block group (cleaner kthread) that
      leads to freeing extents that are in use or a crash during transaction
      commit. Basically the cleaner kthread, when executing
      btrfs_delete_unused_bgs(), might catch the newly added block group to
      the list fs_info->unused_bgs and clear the range representing the whole
      group from fs_info->freed_extents[] before the task that added the block
      group to the list (running update_block_group()) marked the last freed
      extent as dirty in fs_info->freed_extents (pinned_extents).
      
      That is:
      
           CPU 1                                CPU 2
      
                                        btrfs_delete_unused_bgs()
      update_block_group()
         add block group to
         fs_info->unused_bgs
                                          got block group from the list
                                          clear_extent_bits for the whole
                                          block group range in freed_extents[]
         set_extent_dirty for the
         range covering the freed
         extent in freed_extents[]
         (fs_info->pinned_extents)
      
                                        block group deleted, and a new block
                                        group with the same logical address is
                                        created
      
                                        reserve space from the new block group
                                        for new data or metadata - the reserved
                                        space overlaps the range specified by
                                        CPU 1 for set_extent_dirty()
      
                                        commit transaction
                                          find all ranges marked as dirty in
                                          fs_info->pinned_extents, clear them
                                          and add them to the free space cache
      
      Alternatively, if CPU 2 doesn't create a new block group with the same
      logical address, we get a crash/BUG_ON at transaction commit when unpinning
      extent ranges because we can't find a block group for the range marked as
      dirty by CPU 1. Sample trace:
      
      [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
      [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
      [ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
      [ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
      [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
      [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
      [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
      [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
      [ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
      [ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
      [ 2163.430042] Stack:
      [ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
      [ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
      [ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
      [ 2163.430042] Call Trace:
      [ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
      [ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
      [ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      
      So fix this by making update_block_group() first set the range as dirty
      in pinned_extents before adding the block group to the unused_bgs list.
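
      Why the order of the two steps matters can be shown with a toy model. It
      replays only the interleaving from the diagram above, single-threaded;
      struct bg_state and the function names are hypothetical and none of this
      is the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct bg_state {
    bool dirty;    /* freed extent marked dirty in pinned_extents */
    bool on_list;  /* block group is on the unused_bgs list */
    bool deleted;  /* the cleaner deleted the block group */
};

/* Cleaner: acts only on a block group it can see on the list. */
static void cleaner_run(struct bg_state *s)
{
    assert(s != 0);
    if (!s->on_list || s->deleted)
        return;
    s->dirty = false;    /* clear the whole block group range */
    s->on_list = false;
    s->deleted = true;
}

/*
 * Perform the two update_block_group() steps in the chosen order, letting
 * the cleaner run between them and once more afterwards. Returns true if
 * a dirty range is left behind for a block group that was deleted.
 */
static bool stale_dirty_after(bool dirty_first)
{
    struct bg_state s = { false, false, false };

    if (dirty_first)
        s.dirty = true;
    else
        s.on_list = true;
    cleaner_run(&s);

    if (dirty_first)
        s.on_list = true;
    else
        s.dirty = true;
    cleaner_run(&s);

    return s.deleted && s.dirty;
}
```

      In this model stale_dirty_after(false), the old order, leaves a dirty
      range behind, while stale_dirty_after(true), the fixed order, does not,
      because the cleaner cannot see the block group until the range is
      already marked.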
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ae0ab003
    • F
      Btrfs: fix crash caused by block group removal · 4f69cb98
      Filipe Manana committed
      If we remove a block group (because it became empty), we might have left
      a caching_ctl structure in fs_info->caching_block_groups that points to
      the block group and is accessed at transaction commit time. This results
      in accessing an invalid or incorrect block group. This issue became visible
      after Josef's patch "Btrfs: remove empty block groups automatically".
      
      So if the block group is removed make sure we don't leave a dangling
      caching_ctl in caching_block_groups.
      
      Sample crash trace:
      
      [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
      [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
      [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
      [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
      [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
      [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
      [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
      [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
      [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
      [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
      [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
      [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
      [58380.443238] Stack:
      [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
      [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
      [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
      [58380.443238] Call Trace:
      [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
      [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
      [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      4f69cb98
    • F
      Btrfs: fix invalid block group rbtree access after bg is removed · 292cbd51
      Filipe Manana committed
      If we grab a block group, for example in btrfs_trim_fs(), we will be holding
      a reference on it but the block group can be removed after we got it (via
      btrfs_remove_block_group), which means it will no longer be part of the
      rbtree.
      
      However, btrfs_remove_block_group() was only calling rb_erase() which leaves
      the block group's rb_node left and right child pointers with the same content
      they had before calling rb_erase. This was dangerous because a call to
      next_block_group() would access the node's left and right child pointers (via
      rb_next), which can be no longer valid.
      
      Fix this by clearing a block group's node after removing it from the tree,
      and have next_block_group() do a tree search to get the next block group
      instead of using rb_next() if our block group was removed.
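
      The pattern can be illustrated with a much simpler structure than an
      rbtree. The sketch below uses a sorted singly linked list as a stand-in;
      struct group, struct group_index and all function names are hypothetical.
      Removal clears the node's link, and the "next" helper falls back to a
      fresh search by key when the node is no longer linked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct group {
    unsigned long start;   /* sort key, e.g. the logical start offset */
    struct group *next;
    bool linked;           /* false once removed from the index */
};

struct group_index {
    struct group *head;
};

/* Insert keeping the list sorted by start. */
static void index_insert(struct group_index *ix, struct group *g)
{
    struct group **pp = &ix->head;

    assert(!g->linked);
    while (*pp && (*pp)->start < g->start)
        pp = &(*pp)->next;
    g->next = *pp;
    g->linked = true;
    *pp = g;
}

/* Unlink and clear the node so that no stale pointer survives. */
static void index_remove(struct group_index *ix, struct group *g)
{
    struct group **pp = &ix->head;

    while (*pp && *pp != g)
        pp = &(*pp)->next;
    if (*pp)
        *pp = g->next;
    g->next = NULL;
    g->linked = false;
}

/* Follow the link if still indexed, otherwise search again by key. */
static struct group *next_group(struct group_index *ix, struct group *g)
{
    struct group *it;

    if (g->linked)
        return g->next;
    for (it = ix->head; it; it = it->next)
        if (it->start > g->start)
            return it;
    return NULL;
}
```

      A caller that still holds a reference to a removed group can keep
      iterating safely, since next_group() never dereferences the links the
      group had before removal.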
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      292cbd51
  8. 25 Nov, 2014 2 commits
    • F
      Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana committed
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was called just before the second write, we can
      often get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possible to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
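
      The bounds of the uncovered range can be checked with a little
      arithmetic. The sketch below assumes the 4096 byte block size used in
      the ALIGN() expression above; align_up() and missing_range() are
      illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Half-open byte range [start, end). */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Range left without a file extent item: it begins where the first
 * extent ends and runs to the block-aligned end of the missed write.
 */
static struct byte_range missing_range(uint64_t covered_end,
                                       uint64_t write_off,
                                       uint64_t write_len,
                                       uint64_t blocksize)
{
    struct byte_range r = {
        covered_end,
        align_up(write_off + write_len, blocksize),
    };
    return r;
}
```

      For the commands above, missing_range(32768, 16384, 32770, 4096) gives
      [32768, 53248), a 20480 byte gap. Its end matches the key offset of
      item 123, and that hole item's 40960 bytes extend to 94208, which is
      the truncate size 90123 rounded up to 4096.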
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      From a user's perspective, the expectation when a snapshot is created while
      those file operations are being performed is that the snapshot will have a
      file that either:
      
      1) is empty
      2) reflects only the first write
      3) reflects only the 2 writes
      4) reflects both writes and the truncation
      
      But it should never capture a state where only the first write and the
      truncation were captured (since the second write was performed before the
      truncation).
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9ea24bbe
    • F
      Btrfs: fix freeing used extent after removing empty block group · 758eb51e
      Filipe Manana committed
      Due to ignoring errors returned by clear_extent_bits (at the moment only
      -ENOMEM is possible), we can end up freeing an extent that is actually in
      use (i.e. return the extent to the free space cache).
      
      The sequence of steps that leads to this:
      
      1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
         the goal of freeing empty block groups;
      
      2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
         transaction (or starts a new one if none is running) and attempts to
         clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
         and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
      
      3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
         -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
         to delete the block group and the respective chunk, while pinned_extents
         remains with that bit set for the whole (or a part of the) range covered
         by the block group;
      
      4) Later while the transaction is still running, the chunk ends up being reused
         for a new block group (maybe for different purpose, data or metadata), and
         extents belonging to the new block group are allocated for file data or btree
         nodes/leafs;
      
      5) The current transaction is committed, meaning that we unpin one or more
         extents from the new block group (through btrfs_finish_extent_commit() and
         unpin_extent_range()) which are now being used for new file data or new
         metadata. Unpinning means we return the extents to the free space cache
         of the new block group, which implies those extents can be used for future
         allocations while they're still in use.
      
      Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
      object in unpin_extent_range() if a new block group didn't end up being allocated for
      the same chunk (step 4 above).
      
      Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
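
      The shape of the fix, checking the result of the clear operation before
      going on to delete anything, might look like the sketch below. It is a
      toy model: struct unused_bg, clear_dirty() and delete_unused() are
      hypothetical names, and the simulate_enomem parameter exists only to
      force the failure path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct unused_bg {
    bool pinned_dirty;  /* range still marked dirty in pinned extents */
    bool deleted;       /* block group and chunk were removed */
};

/* Hypothetical stand-in for clearing the dirty bit; may fail. */
static int clear_dirty(struct unused_bg *bg, bool simulate_enomem)
{
    assert(bg != NULL);
    if (simulate_enomem)
        return -ENOMEM;
    bg->pinned_dirty = false;
    return 0;
}

/* Delete the block group only if the dirty bit was really cleared. */
static int delete_unused(struct unused_bg *bg, bool simulate_enomem)
{
    int ret = clear_dirty(bg, simulate_enomem);

    if (ret)
        return ret;  /* leave the block group and chunk in place */
    bg->deleted = true;
    return 0;
}
```

      On failure the block group simply stays around with its dirty range
      intact, so a later attempt can retry without any stale range being
      attached to a deleted block group.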
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      758eb51e
  9. 21 Nov, 2014 1 commit
  10. 29 Oct, 2014 1 commit
    • F
      Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items · d05a2b4c
      Filipe Manana committed
      We have a race that can lead us to miss skinny extent items in the function
      btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
      So basically the sequence of steps is:
      
      1) We search in the extent tree for the skinny extent, which returns > 0
         (not found);
      
      2) We check the previous item in the returned leaf for a non-skinny extent,
         and we don't find it;
      
      3) Because we didn't find the non-skinny extent in step 2), we release our
         path to search the extent tree again, but this time for a non-skinny
         extent key;
      
      4) Right after we released our path in step 3), a skinny extent was inserted
         in the extent tree (delayed refs were run) - our second extent tree search
         will miss it, because it's not looking for a skinny extent;
      
      5) After the second search returned (with ret > 0), we look for any delayed
         ref for our extent's bytenr (and we do it while holding a read lock on the
         leaf), but we won't find any, as such delayed ref had just run and completed
         after we released our path in step 3) before doing the second search.
      
      Fix this by completely removing the path release and re-search logic. This is
      safe, because if we search for a metadata item and we don't find it, we have the
      guarantee that the returned leaf is the one where the item would be inserted,
      and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
      non-skinny extent item is if it exists. The only case where path->slots[0] is
      zero is when there are no smaller keys in the tree (i.e. no left siblings for
      our leaf), in which case the re-search logic isn't needed as well.
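
      The "look at the previous slot" reasoning can be sketched over a plain
      sorted array instead of a btree leaf. Everything below is illustrative:
      the struct and function names are made up, and the two key type values
      are assumptions whose only relevant property is that the non-skinny
      extent item type sorts just before the skinny metadata item type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed key types; only EXTENT_ITEM < METADATA_ITEM matters here. */
enum { EXTENT_ITEM = 168, METADATA_ITEM = 169 };

struct key {
    uint64_t objectid;  /* extent bytenr */
    uint8_t  type;
    uint64_t offset;    /* level (skinny) or length (non-skinny) */
};

static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Returns 0 on an exact match, 1 otherwise with *slot = insert position. */
static int search_slot(const struct key *items, size_t n,
                       const struct key *k, size_t *slot)
{
    size_t lo = 0, hi = n;

    assert(items != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int c = key_cmp(&items[mid], k);

        if (c == 0) {
            *slot = mid;
            return 0;
        }
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    *slot = lo;
    return 1;
}

/* One search: try the skinny key, then inspect the slot right before it. */
static long lookup_extent(const struct key *items, size_t n,
                          uint64_t bytenr, uint64_t level, uint64_t nodesize)
{
    struct key k = { bytenr, METADATA_ITEM, level };
    size_t slot;

    if (search_slot(items, n, &k, &slot) == 0)
        return (long)slot;
    if (slot > 0) {
        const struct key *prev = &items[slot - 1];

        if (prev->objectid == bytenr && prev->type == EXTENT_ITEM &&
            prev->offset == nodesize)
            return (long)(slot - 1);
    }
    return -1;
}
```

      Because the structure is never released between the search and the
      check of slot - 1, there is no window in which an item could be
      inserted and then missed by a second search.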
      
      This race has been present since the introduction of skinny metadata (change
      3173a18f).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d05a2b4c
  11. 28 Oct, 2014 1 commit
    • F
      Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd
      Filipe Manana committed
      If we couldn't find our extent item, we accessed the current slot
      (path->slots[0]) to check if it corresponds to an equivalent skinny
      metadata item. However this slot could be beyond our last item in the
      leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
      we shouldn't process it.
      
      Since btrfs_lookup_extent() is only used to find extent items for data
      extents, fix this by completely removing the logic that looks up an
      equivalent skinny metadata item, since it cannot exist.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1a4ed8fd
  12. 04 Oct, 2014 1 commit
    • F
      Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana committed
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this latter case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the 3 new flags for the btree inode also achieves what
      AS_EIO/AS_ENOSPC would in the following case: writepages() returns
      success after starting writeback for all dirty pages, and the writeback
      for all of them finishes with errors before filemap_fdatawait_range() is
      called. Because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
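
      A rough sketch of sticky error flags of this kind follows. The flag
      names, the split into one flag for non-log extents and two for log tree
      extents, and all function names are assumptions made for illustration;
      the message above only says that there are 3 flags.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical sticky flags: one for non-log extents, two for log trees. */
enum {
    ERR_BTREE = 1 << 0,
    ERR_LOG0  = 1 << 1,
    ERR_LOG1  = 1 << 2,
};

struct btree_errors {
    unsigned int flags;
};

/* Called when writing a btree page fails; the flag stays until checked. */
static void record_write_error(struct btree_errors *e, bool is_log_extent,
                               int log_index)
{
    assert(log_index == 0 || log_index == 1);
    if (!is_log_extent)
        e->flags |= ERR_BTREE;
    else
        e->flags |= (log_index == 0) ? ERR_LOG0 : ERR_LOG1;
}

/* Report -EIO once if the flag is set, and clear it. */
static int test_and_clear(struct btree_errors *e, unsigned int flag)
{
    if (!(e->flags & flag))
        return 0;
    e->flags &= ~flag;
    return -EIO;
}

/* A log sync consumes only the flag of its own log tree. */
static int check_log_sync(struct btree_errors *e, int log_index)
{
    return test_and_clear(e, log_index == 0 ? ERR_LOG0 : ERR_LOG1);
}

/* A transaction commit looks only at the non-log (fatal) flag. */
static int check_transaction_commit(struct btree_errors *e)
{
    return test_and_clear(e, ERR_BTREE);
}
```

      Unlike a single shared error bit in the mapping, a log sync in this
      model cannot consume an error that belongs to the transaction commit,
      and the flags do not depend on pages still being tagged for writeback.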
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      656f30db
  13. 02 Oct, 2014 7 commits
  14. 23 Sep, 2014 2 commits
    • J
      Btrfs: don't do async reclaim during log replay · f6acfd50
      Josef Bacik committed
      Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
      during log replay.  This is because we use fs_info->fs_root as our root for
      shrinking and such.  Technically we can use whatever root we want, but let's
      just not allow async reclaim while we're doing log replay.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      f6acfd50
    • J
      Btrfs: remove empty block groups automatically · 47ab2a6c
      Josef Bacik committed
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting or on mount notice that a block
      group has a used count of 0 then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
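
      The queue-then-recheck flow described above can be modelled in a few
      lines. This is a single-threaded toy with hypothetical names (struct
      block_group here is not the kernel structure) and without the locking
      and transaction handling a real implementation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct block_group {
    uint64_t used;  /* bytes currently allocated from this group */
    bool queued;    /* waiting on the unused list for the cleaner */
    bool deleted;
};

/* Allocate bytes from the group. */
static void bg_alloc(struct block_group *bg, uint64_t bytes)
{
    assert(!bg->deleted);
    bg->used += bytes;
}

/* Free bytes; queue the group for deletion when it becomes empty. */
static void bg_free(struct block_group *bg, uint64_t bytes)
{
    assert(bytes <= bg->used);
    bg->used -= bytes;
    if (bg->used == 0)
        bg->queued = true;
}

/* Cleaner: recheck that the group is still empty before deleting it. */
static void cleaner_visit(struct block_group *bg)
{
    if (!bg->queued)
        return;
    bg->queued = false;
    if (bg->used == 0)
        bg->deleted = true;
}
```

      The recheck in cleaner_visit() is what keeps a group alive when it was
      queued while empty but received a new allocation before the cleaner got
      to it.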
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      47ab2a6c
  15. 18 Sep, 2014 2 commits
    • M
      Btrfs: Fix misuse of chunk mutex · 2196d6e8
      Miao Xie committed
      There were several problems with chunk mutex usage:
      - The chunk mutex was locked while updating metadata. This could cause a
        nested deadlock, because updating metadata might need to allocate new
        chunks, which requires acquiring the chunk mutex. We no longer take the
        chunk mutex in this case, because the b-tree locks and other locking
        mechanisms already protect us.
      - An ABBA deadlock occurred between device_list_mutex and chunk_mutex.
        When we update the device status, we must acquire device_list_mutex at
        the beginning, and we might then take chunk_mutex during the device
        status update because we need to allocate new chunks for metadata COW.
        But in most places we acquired chunk_mutex first and then
        device_list_mutex. We need to change the lock order.
      - In some places we don't need to acquire chunk_mutex at all. For example,
        we don't need chunk_mutex when we free an empty seed fs_devices
        structure.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      2196d6e8
    • L
      Btrfs: fix loop writing of async reclaim · 25ce459c
      Liu Bo committed
      One of my tests shows that when we really don't have space to reclaim via
      flush_space and have also run out of space, this async reclaim work loops,
      re-adding itself to the workqueue, and keeps writing to disk according to
      iostat's results. These writes mainly come from commit_transaction, which
      writes the super_block. This is unacceptable as it can be bad for disks,
      especially memory storage.
      
      This adds a check to avoid the above situation.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      25ce459c