1. 03 Dec 2014 (3 commits)
    • Btrfs: fix freeing used extents after removing empty block group · ae0ab003
      Authored by Filipe Manana
      There's a race between adding a block group to the list of the unused
      block groups and removing an unused block group (cleaner kthread) that
      leads to freeing extents that are in use or a crash during transaction
      commit. Basically the cleaner kthread, when executing
      btrfs_delete_unused_bgs(), might catch the newly added block group to
      the list fs_info->unused_bgs and clear the range representing the whole
      group from fs_info->freed_extents[] before the task that added the block
      group to the list (running update_block_group()) marked the last freed
      extent as dirty in fs_info->freed_extents (pinned_extents).
      
      That is:
      
           CPU 1                                CPU 2
      
                                        btrfs_delete_unused_bgs()
      update_block_group()
         add block group to
         fs_info->unused_bgs
                                          got block group from the list
                                          clear_extent_bits for the whole
                                          block group range in freed_extents[]
         set_extent_dirty for the
         range covering the freed
         extent in freed_extents[]
         (fs_info->pinned_extents)
      
                                        block group deleted, and a new block
                                        group with the same logical address is
                                        created
      
                                        reserve space from the new block group
                                        for new data or metadata - the reserved
                                        space overlaps the range specified by
                                        CPU 1 for set_extent_dirty()
      
                                        commit transaction
                                          find all ranges marked as dirty in
                                          fs_info->pinned_extents, clear them
                                          and add them to the free space cache
      
      Alternatively, if CPU 2 doesn't create a new block group with the same
      logical address, we get a crash/BUG_ON at transaction commit when unpinning
      extent ranges because we can't find a block group for the range marked as
      dirty by CPU 1. Sample trace:
      
      [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
      gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
       sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
      [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
      [ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
      [ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
      [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
      [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
      [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
      [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
      [ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
      [ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
      [ 2163.430042] Stack:
      [ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
      [ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
      [ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
      [ 2163.430042] Call Trace:
      [ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
      [ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
      [ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      
      So fix this by making update_block_group() first set the range as dirty
      in pinned_extents before adding the block group to the unused_bgs list.
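
      A minimal sketch of the reordering in update_block_group(); the names
      (info->pinned_extents, cache->bg_list, info->unused_bgs) follow this era's
      code, but treat it as illustrative rather than the exact diff:

          /* mark the freed range as pinned/dirty FIRST ... */
          set_extent_dirty(info->pinned_extents,
                           bytenr, bytenr + num_bytes - 1, GFP_NOFS);

          /* ... and only THEN expose the block group to the cleaner, so
           * btrfs_delete_unused_bgs() can never clear a freed_extents[]
           * range that we haven't finished marking yet */
          if (old_val == 0) {
                  spin_lock(&info->unused_bgs_lock);
                  list_add_tail(&cache->bg_list, &info->unused_bgs);
                  spin_unlock(&info->unused_bgs_lock);
          }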
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix crash caused by block group removal · 4f69cb98
      Authored by Filipe Manana
      If we remove a block group (because it became empty), we might have left
      a caching_ctl structure in fs_info->caching_block_groups that points to
      the block group and is accessed at transaction commit time. This results
      in accessing an invalid or incorrect block group. This issue became visible
      after Josef's patch "Btrfs: remove empty block groups automatically".
      
      So, if the block group is removed, make sure we don't leave a dangling
      caching_ctl in caching_block_groups.
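
      A sketch of what the cleanup in btrfs_remove_block_group() might look
      like; the lock shown (commit_root_sem) and the put_caching_control()
      helper are assumptions based on the description above:

          struct btrfs_caching_control *caching_ctl, *tmp;

          down_write(&fs_info->commit_root_sem);  /* assumed locking */
          list_for_each_entry_safe(caching_ctl, tmp,
                                   &fs_info->caching_block_groups, list) {
                  if (caching_ctl->block_group != block_group)
                          continue;
                  /* unlink it so btrfs_prepare_extent_commit() can't
                   * dereference the removed block group anymore */
                  list_del_init(&caching_ctl->list);
                  put_caching_control(caching_ctl);
                  break;
          }
          up_write(&fs_info->commit_root_sem);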
      
      Sample crash trace:
      
      [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
      [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
      [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
      [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
      [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
      [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
      [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
      [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
      [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
      [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
      [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
      [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
      [58380.443238] Stack:
      [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
      [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
      [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
      [58380.443238] Call Trace:
      [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
      [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
      [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix invalid block group rbtree access after bg is removed · 292cbd51
      Authored by Filipe Manana
      If we grab a block group, for example in btrfs_trim_fs(), we will be holding
      a reference on it but the block group can be removed after we got it (via
      btrfs_remove_block_group), which means it will no longer be part of the
      rbtree.
      
      However, btrfs_remove_block_group() was only calling rb_erase() which leaves
      the block group's rb_node left and right child pointers with the same content
      they had before calling rb_erase. This was dangerous because a call to
      next_block_group() would access the node's left and right child pointers (via
      rb_next), which may no longer be valid.
      
      Fix this by clearing a block group's node after removing it from the tree,
      and have next_block_group() do a tree search to get the next block group
      instead of using rb_next() if our block group was removed.
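
      Sketched with this era's field names (cache_node, block_group_cache_tree);
      the RB_EMPTY_NODE() check is the assumed way next_block_group() detects
      that its group was removed:

          /* btrfs_remove_block_group(): erase AND poison the node */
          spin_lock(&fs_info->block_group_cache_lock);
          rb_erase(&block_group->cache_node,
                   &fs_info->block_group_cache_tree);
          RB_CLEAR_NODE(&block_group->cache_node);
          spin_unlock(&fs_info->block_group_cache_lock);

          /* next_block_group(): don't rb_next() a stale node */
          spin_lock(&fs_info->block_group_cache_lock);
          if (RB_EMPTY_NODE(&cache->cache_node)) {
                  u64 next_bytenr = cache->key.objectid + cache->key.offset;

                  spin_unlock(&fs_info->block_group_cache_lock);
                  btrfs_put_block_group(cache);
                  /* search for the first group at/after the end of the
                   * removed one, instead of following stale pointers */
                  cache = btrfs_lookup_first_block_group(fs_info, next_bytenr);
          }
          /* else: rb_next() on the node, as before */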
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 25 Nov 2014 (2 commits)
    • Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Authored by Filipe Manana
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was just called before the second write, we often
      can get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K, ALIGN(16K + 32770, 4096)),
      for which there's no file extent item covering it. This is because both the
      file write and the file truncate operations happened right after the snapshot
      creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't
      start and wait for the ordered extent that matches the write and, in
      btrfs_setsize(), we were able to call btrfs_cont_expand() before being able
      to commit the current transaction in the snapshot creation ioctl. So this
      made it possible to insert the hole file extent item in the source subvolume
      (which represents the region added by the truncate) right before the
      transaction commit from the snapshot creation ioctl.
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      either:
      
      1) is empty;
      2) reflects only the first write;
      3) reflects both writes;
      4) reflects both writes and the truncation.

      But it should never capture a state where only the first write and the
      truncation are present (since the second write was performed before the
      truncation).
      
      A test case for xfstests follows.
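
      The description above suggests a fix along these lines: make an expanding
      truncate wait until any in-flight snapshot creation of the root has
      committed. This sketch is an assumption; the counter and helper names are
      illustrative, not confirmed by the text:

          /* snapshot creation side (sketch): announce intent before
           * flushing delalloc, drop it after the transaction commits */
          atomic_inc(&root->will_be_snapshoted);
          /* ... btrfs_start_delalloc_inodes(), create pending snapshot,
           * commit transaction ... */
          atomic_dec(&root->will_be_snapshoted);

          /* btrfs_setsize() side (sketch): don't insert the hole extent
           * item ahead of the pending snapshot's commit */
          if (newsize > oldsize) {
                  btrfs_wait_for_snapshot_creation(root); /* assumed helper */
                  ret = btrfs_cont_expand(inode, oldsize, newsize);
          }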
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix freeing used extent after removing empty block group · 758eb51e
      Authored by Filipe Manana
      Due to ignoring errors returned by clear_extent_bits (at the moment only
      -ENOMEM is possible), we can end up freeing an extent that is actually in
      use (i.e. return the extent to the free space cache).
      
      The sequence of steps that lead to this:
      
      1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
         the goal of freeing empty block groups;
      
      2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
         transaction (or starts a new one if none is running) and attempts to
         clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
         and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
      
      3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
         -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
         to delete the block group and the respective chunk, while pinned_extents
         remains with that bit set for the whole (or a part of the) range covered
         by the block group;
      
      4) Later while the transaction is still running, the chunk ends up being reused
         for a new block group (maybe for different purpose, data or metadata), and
         extents belonging to the new block group are allocated for file data or btree
         nodes/leafs;
      
      5) The current transaction is committed, meaning that we unpinned one or more
         extents from the new block group (through btrfs_finish_extent_commit() and
         unpin_extent_range()) which are now being used for new file data or new
         metadata. And unpinning means we returned the extents to the free space
         cache of the new block group, which implies those extents can be used for
         future allocations while they're still in use.
      
      Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
      object in unpin_extent_range() if a new block group didn't end up being allocated for
      the same chunk (step 4 above).
      
      Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
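
      In btrfs_delete_unused_bgs() terms, the fix boils down to checking the
      return value before proceeding; a sketch (the end_trans label is assumed):

          ret = clear_extent_bits(&fs_info->freed_extents[0], start, end,
                                  EXTENT_DIRTY, GFP_NOFS);
          if (ret)
                  goto end_trans;    /* was silently ignored before */

          ret = clear_extent_bits(&fs_info->freed_extents[1], start, end,
                                  EXTENT_DIRTY, GFP_NOFS);
          if (ret)
                  goto end_trans;    /* keep the block group and chunk */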
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 21 Nov 2014 (1 commit)
  4. 29 Oct 2014 (1 commit)
    • Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items · d05a2b4c
      Authored by Filipe Manana
      We have a race that can lead us to miss skinny extent items in the function
      btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
      So basically the sequence of steps is:
      
      1) We search in the extent tree for the skinny extent, which returns > 0
         (not found);
      
      2) We check the previous item in the returned leaf for a non-skinny extent,
         and we don't find it;
      
      3) Because we didn't find the non-skinny extent in step 2), we release our
         path to search the extent tree again, but this time for a non-skinny
         extent key;
      
      4) Right after we released our path in step 3), a skinny extent was inserted
         in the extent tree (delayed refs were run) - our second extent tree search
         will miss it, because it's not looking for a skinny extent;
      
      5) After the second search returned (with ret > 0), we look for any delayed
         ref for our extent's bytenr (and we do it while holding a read lock on the
         leaf), but we won't find any, as such delayed ref had just run and completed
      after we released our path in step 3) and before we did the second search.
      
      Fix this by completely removing the path release and re-search logic. This is
      safe because, if we search for a metadata item and we don't find it, we have
      the guarantee that the returned leaf is the one where the item would be
      inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot
      where the non-skinny extent item is if it exists. The only case where
      path->slots[0] is zero is when there are no smaller keys in the tree (i.e.
      no left siblings for our leaf), in which case the re-search logic isn't
      needed either.
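
      A sketch of the resulting lookup in btrfs_lookup_extent_info(), per the
      reasoning above (the exact shape of the checks is illustrative):

          key.objectid = bytenr;
          key.type = BTRFS_METADATA_ITEM_KEY;
          key.offset = level;                /* skinny: offset is the level */
          ret = btrfs_search_slot(trans, extent_root, &key, path, 0, 0);
          if (ret > 0 && path->slots[0] > 0) {
                  /* the returned leaf is where the skinny item would be
                   * inserted, so slot - 1 must hold the non-skinny item
                   * if it exists: no release, no second search */
                  btrfs_item_key_to_cpu(path->nodes[0], &key,
                                        path->slots[0] - 1);
                  if (key.objectid == bytenr &&
                      key.type == BTRFS_EXTENT_ITEM_KEY &&
                      key.offset == num_bytes)
                          ret = 0;           /* found the old-style item */
          }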
      
      This race has been present since the introduction of skinny metadata (change
      3173a18f).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  5. 28 Oct 2014 (1 commit)
    • Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd
      Authored by Filipe Manana
      If we couldn't find our extent item, we accessed the current slot
      (path->slots[0]) to check if it corresponds to an equivalent skinny
      metadata item. However this slot could be beyond our last item in the
      leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
      we shouldn't process it.
      
      Since btrfs_lookup_extent() is only used to find extent items for data
      extents, fix this by completely removing the logic that looks up an
      equivalent skinny metadata item, since it cannot exist.
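
      After the change, the lookup reduces to roughly this (sketch):

          key.objectid = start;
          key.offset = len;
          key.type = BTRFS_EXTENT_ITEM_KEY;  /* data extents only */
          ret = btrfs_search_slot(trans, root->fs_info->extent_root,
                                  &key, path, 0, 0);
          /* no skinny-metadata fallback: data extents never use
           * BTRFS_METADATA_ITEM_KEY, so there is no adjacent slot
           * worth peeking at when the search misses */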
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 04 Oct 2014 (1 commit)
    • Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Authored by Filipe Manana
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this latter case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the new 3 flags for the btree inode also covers the case where
      writepages() returns success, writeback is started for all dirty pages,
      but before filemap_fdatawait_range() is called the writeback for all those
      pages has already finished with errors - because we were not using
      AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it
      could not know that writeback errors happened (the pages were no longer
      tagged for writeback).
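
      A sketch of the three-flag scheme described above; the flag names and the
      log_index discriminator are assumptions based on this description:

          /* write endio (sketch): record the failure persistently on
           * the btree inode, keyed by which tree the buffer belongs to */
          switch (eb->log_index) {
          case -1:   /* regular btree block: fatal for the commit */
                  set_bit(BTRFS_INODE_BTREE_ERR, &btree_ino->runtime_flags);
                  break;
          case 0:
                  set_bit(BTRFS_INODE_BTREE_LOG1_ERR, &btree_ino->runtime_flags);
                  break;
          case 1:
                  set_bit(BTRFS_INODE_BTREE_LOG2_ERR, &btree_ino->runtime_flags);
                  break;
          }

          /* transaction commit (sketch): abort rather than write a super
           * block pointing to invalid nodes/leafs */
          if (test_and_clear_bit(BTRFS_INODE_BTREE_ERR,
                                 &btree_ino->runtime_flags))
                  btrfs_abort_transaction(trans, root, -EIO);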
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 02 Oct 2014 (7 commits)
  8. 23 Sep 2014 (2 commits)
    • Btrfs: don't do async reclaim during log replay · f6acfd50
      Authored by Josef Bacik
      Trying to reproduce a log ENOSPC bug, I hit a panic in the async reclaim code
      during log replay.  This is because we use fs_info->fs_root as our root for
      shrinking and such.  Technically we can use whatever root we want, but let's
      just not allow async reclaim while we're doing log replay.  Thanks,
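
      The guard might look like this (a sketch: log_root_recovering was the
      era's "log replay in progress" marker; the reclaim-need helper name is an
      assumption):

          if (!root->fs_info->log_root_recovering &&
              need_do_async_reclaim(space_info, root->fs_info, used) &&
              !work_busy(&root->fs_info->async_reclaim_work))
                  queue_work(system_unbound_wq,
                             &root->fs_info->async_reclaim_work);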
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove empty block groups automatically · 47ab2a6c
      Authored by Josef Bacik
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting, or notice on mount that a
      block group has a used count of 0, then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
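
      The cleaner-side half, sketched from the description (the re-check before
      deletion is the important part; names follow this era's code, list locking
      elided):

          while (!list_empty(&fs_info->unused_bgs)) {
                  block_group = list_first_entry(&fs_info->unused_bgs,
                                          struct btrfs_block_group_cache,
                                          bg_list);
                  list_del_init(&block_group->bg_list);

                  /* double check: an allocation may have raced in since
                   * the group was queued */
                  spin_lock(&block_group->lock);
                  if (block_group->ro ||
                      btrfs_block_group_used(&block_group->item) > 0) {
                          spin_unlock(&block_group->lock);
                          continue;
                  }
                  spin_unlock(&block_group->lock);
                  /* join/start a transaction, then remove the block
                   * group and its chunk */
          }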
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 18 Sep 2014 (5 commits)
    • Btrfs: Fix misuse of chunk mutex · 2196d6e8
      Authored by Miao Xie
      There were several problems with chunk mutex usage:
      - Taking the chunk mutex when updating metadata. This could cause a nested
        deadlock, because updating metadata might require allocating new chunks,
        which needs to acquire the chunk mutex again. We drop the chunk mutex in
        this case, because the b-tree locks and other locking mechanisms protect
        us.
      - An ABBA deadlock occurred between device_list_mutex and chunk_mutex.
        When we update device status, we must acquire device_list_mutex at the
        beginning, and then we might take chunk_mutex during the device status
        update because we need to allocate new chunks for metadata COW. But in
        most places we acquire chunk_mutex first and then the device list mutex,
        so the lock order needed to change.
      - Some places needn't acquire chunk_mutex at all. For example, we needn't
        take it when we free an empty seed fs_devices structure.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix loop writing of async reclaim · 25ce459c
      Authored by Liu Bo
      One of my tests shows that when we really don't have space to reclaim via
      flush_space and have also run out of space, this async reclaim work loops on
      re-adding itself into the workqueue and keeps writing something to disk
      according to iostat's results, and these writes mainly come from
      commit_transaction, which writes the super_block.  This is unacceptable as
      it can be bad for disks, especially memory-based storage.
      
      This adds a check to avoid the above situation.
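
      The added check, sketched (the helper and state names are assumptions):
      stop rearming the work once flushing can make no further progress.

          do {
                  flush_space(fs_info->fs_root, space_info, to_reclaim,
                              to_reclaim, flush_state);
                  flush_state++;
                  /* if there is nothing left that flushing could free,
                   * stop instead of requeueing ourselves and writing
                   * the super block over and over via commits */
                  if (!btrfs_need_do_async_reclaim(space_info, fs_info,
                                                   flush_state))
                          return;
          } while (flush_state <= COMMIT_TRANS);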
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: clean away stripe_align helper · 4e54b17a
      Authored by David Sterba
      Only wraps the ALIGN macro.
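
      The removed helper amounted to this (sketched; the exact signature may
      have differed):

          static u64 stripe_align(struct btrfs_root *root, u64 val)
          {
                  return ALIGN(val, root->stripesize);
          }

          /* callers now just write: */
          search_start = ALIGN(offset, root->stripesize);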
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use nodesize everywhere, kill leafsize · 707e8a07
      Authored by David Sterba
      The nodesize and leafsize never had different values. Unify the
      usage and make nodesize the only one. Clean up the redundant checks and
      helpers.
      
      Shaves a few bytes from .text:
      
        text    data     bss     dec     hex filename
      852418   24560   23112  900090   dbbfa btrfs.ko.before
      851074   24584   23112  898770   db6d2 btrfs.ko.after
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: kill the key type accessor helpers · 962a298f
      Authored by David Sterba
      btrfs_set_key_type and btrfs_key_type are used inconsistently along with
      open coded variants. Other members of btrfs_key are accessed directly
      without any helpers anyway.
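
      Before/after, for illustration:

          /* before: accessor helpers for just this one member */
          btrfs_set_key_type(&key, BTRFS_EXTENT_DATA_KEY);
          type = btrfs_key_type(&key);

          /* after: direct access, consistent with objectid and offset */
          key.type = BTRFS_EXTENT_DATA_KEY;
          type = key.type;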
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 08 Sep 2014 (1 commit)
    • percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Authored by Tejun Heo
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
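
      For callers, the conversion looks like this (using one of btrfs's
      counters as the example):

          /* before */
          ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0);

          /* after: the allocation mask is explicit, so contexts that
           * can't sleep may pass something other than GFP_KERNEL */
          ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0,
                                    GFP_KERNEL);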
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
  11. 24 Aug 2014 (1 commit)
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Authored by Liu Bo
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has now migrated to the kernel workqueue, but this introduced a hang
      problem.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that
      its ordered_func() must be processed FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, when we find free space, we get an uncached
      block group; then we go to read its free space cache inode for free space
      information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happen
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      So if work A sits at the head of wq->ordered_list and there are more ordered
      works queued after it, such as B, work A's memory can be freed before
      normal_work_helper() returns, which means the kernel workqueue code in
      worker_thread() still has worker->current_work pointing to work A's
      normal_work, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so work C->normal_work
      and work A->normal_work are likely to share the same address (I confirmed
      this with ftrace output, so I'm not just guessing; it's rare though).
      
      When another kthread picks up work C->normal_work to process and finds our
      kthread is processing it (see find_worker_executing_work()), it'll treat
      work C as a collision and skip it, which ends up with nobody processing
      work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func, namely
      normal_work_helper. So this patch makes each workqueue have its own helper
      function, which is only a wrapper of normal_work_helper.

      With this patch, I no longer hit the above hang.
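
      The shape of the fix, sketched (the macro and helper names are
      illustrative): give every btrfs workqueue its own work->func that merely
      wraps normal_work_helper, so two works that reuse the same address can
      still be told apart by the workqueue's collision check.

          #define BTRFS_WORK_HELPER(name)                                 \
          void btrfs_##name(struct work_struct *arg)                      \
          {                                                               \
                  struct btrfs_work *work =                               \
                      container_of(arg, struct btrfs_work, normal_work);  \
                  normal_work_helper(work);                               \
          }

          BTRFS_WORK_HELPER(endio_write_helper);
          BTRFS_WORK_HELPER(delalloc_helper);
          /* btrfs_init_work() then takes the helper, so each queue's
           * works get a distinct current_func */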
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  12. 19 Aug 2014 (1 commit)
    • Btrfs: don't consider the missing device when allocating new chunks · 95669976
      Authored by Miao Xie
      The original code allocated new chunks based on the number of writable
      devices plus missing devices, to make sure that any RAID levels on a
      degraded FS continue to be honored, but it introduced a problem: it could
      stop us from allocating new chunks. The steps to reproduce are as follows:
      
       # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
       # mkfs.btrfs -f <dev1>			//Removing <dev1> from the original fs
       # mount -o degraded <dev0> <mnt>
       # dd if=/dev/zero of=<mnt>/tmpfile bs=1M
      
      This is because we allocate new chunks only on writable devices: if we take
      the number of missing devices into account and want to allocate new chunks
      with a higher RAID level, we will fail because we don't have enough writable
      devices. Fix it by ignoring the number of missing devices when allocating
      new chunks.
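
      In effect, the chunk allocator's device count changes like this (a sketch;
      field names per that era's fs_devices):

          /* before: missing devices inflated the count, so a degraded
           * RAID1 fs passed the RAID-level check but then failed to
           * place stripes on actually-writable devices */
          num_devices = fs_devices->rw_devices + fs_devices->missing_devices;

          /* after: count only devices we can really write to */
          num_devices = fs_devices->rw_devices;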
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 15 Aug 2014 (2 commits)
    • btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Authored by Mark Fasheh
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on its
      exclusive counts.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Authored by Josef Bacik
      Previously I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out not to be the
      case: we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop
      the argument and make __btrfs_mod_ref call its process function with
      no_quota always set.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  14. 03 Jul 2014 (1 commit)
    • Btrfs: fix race of using total_bytes_pinned · d288db5d
      Authored by Liu Bo
      The percpu counter @total_bytes_pinned was introduced to skip unnecessary
      'commit transaction' operations; it accounts for the space we may free
      but that is stuck in delayed refs.
      
      And we zero out @space_info->total_bytes_pinned every transaction period so
      we have a better idea of how much space we'll actually free up by committing
      this transaction.  However, we do the 'zero out' part a little earlier, before
      we actually unpin space, so we end up returning ENOSPC when we actually have
      free space that's just unpinned from committing transaction.
      
      xfstests/generic/074 complained then.
      
      This fixes it by actually adjusting the percpu pinned number at 'unpin'
      time, and since that is protected by space_info->lock, the race is gone now.
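
      Sketch of the accounting move: the counter is updated at unpin time under
      space_info->lock instead of being zeroed wholesale at commit.

          /* in unpin_extent_range(), per unpinned range (sketch) */
          spin_lock(&space_info->lock);
          space_info->bytes_pinned -= len;
          percpu_counter_add(&space_info->total_bytes_pinned, -(s64)len);
          spin_unlock(&space_info->lock);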
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  15. 20 Jun 2014 (1 commit)
    • Btrfs: fix broken free space cache after the system crashed · e570fd27
      Authored by Miao Xie
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in the
      extent tree) until the file data was written to disk. During this time, there
      was no information about the allocated space in either the extent tree or the
      free space cache. When we wrote out the free space cache at this time (commit
      transaction), that space was lost. In fact, only the free space used to store
      file data had this problem; the rest didn't, because their metadata is
      updated in the same transaction context.
      
      There are several methods that could fix the above problem:
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store file
        data; if the size is not zero, don't write out the free space cache.

      The first one is complex and may hurt performance.
      This patch chose the second method: we use a per-block-group variable to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      allocation and the free space cache write out.
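
      A sketch of the chosen method when writing out a block group's cache;
      delalloc_bytes and data_rwsem are the assumed names of the new
      per-block-group fields:

          spin_lock(&block_group->lock);
          if (block_group->delalloc_bytes) {
                  /* file data was allocated here but not yet written;
                   * writing the cache now would record those ranges as
                   * free and corrupt the cache after a crash */
                  spin_unlock(&block_group->lock);
                  return 0;     /* skip this block group's cache */
          }
          spin_unlock(&block_group->lock);

          /* allocators take down_read(&block_group->data_rwsem) while
           * the cache writeout takes down_write(), closing the race */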
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  16. 10 Jun 2014 (9 commits)
    • btrfs: allocate raid type kobjects dynamically · c1895442
      Authored by Jeff Mahoney
      We are currently allocating the per-raid-type kobjects in an array inside
      the space_info when we allocate the space_info. When a user does something
      like:
      
      # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
      # btrfs balance start -mconvert=single -dconvert=single /mnt -f
      # btrfs balance start -mconvert=raid1 -dconvert=raid1 /
      
      We can end up with memory corruption since the kobject hasn't
      been reinitialized properly and the name pointer was left set.
      
      The rationale behind allocating them statically was to avoid creating a
      separate kobject container that just contained the raid type; the index
      in the array was used to determine the raid type.
      
      Ultimately, though, this wastes more memory than it saves in all
      but the most complex scenarios and introduces kobject lifetime
      questions.
      
      This patch allocates the kobjects dynamically instead. Note that
      we also remove the kobject_get/put of the parent kobject since
      kobject_add and kobject_del do that internally.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: async delayed refs · a79b7d4b
      Authored by Chris Mason
      Delayed extent operations are triggered during transaction commits.
      The goal is to queue up a healthy batch of changes to the extent
      allocation tree and run through them in bulk.
      
      This farms them off to async helper threads.  The goal is to have the
      bulk of the delayed operations being done in the background, but this is
      also important to limit our stack footprint.
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: remove stale newlines from log messages · 351fd353
      Authored by David Sterba
      I've noticed an extra line after "use no compression", but search
      revealed much more in messages of more critical levels and rare errors.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add sanity tests for new qgroup accounting code · faa2dbf0
      Authored by Josef Bacik
      This exercises the various parts of the new qgroup accounting code.  We do some
      basic stuff and do some things with the shared refs to make sure all that code
      works.  I had to add a bunch of infrastructure because I needed to be able to
      insert items into a fake tree without having to do all the hard work
      myself; hopefully this will be useful in the future.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: rework qgroup accounting · fcebe456
      Authored by Josef Bacik
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if, say, you are defragging an extent with 5k snapshots
      pointing to it, we will thrash the delayed ref lock because we need to go
      back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to look up roots for bytenrs, which
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs, we simply add new ref operations whenever we notice that we
      need to, i.e. when we've modified the refs themselves.
      
      2) tree mod seq: we no longer have this separation of major/minor counters.
      This makes the sequence number stuff much more sane, and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have its own unique
      sequence number; rather, whenever we go to look up backrefs we increment the
      sequence number so we can make sure to keep any new operations from screwing
      up our world view at that given point.  This allows us to merge delayed refs
      during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix joining same transaction handle more than twice · f017f15f
      Authored by Wang Shilong
      We hit something like the following function call flow:
      
      |->run_delalloc_range()
       |->btrfs_join_transaction()
         |->cow_file_range()
           |->btrfs_join_transaction()
             |->find_free_extent()
               |->btrfs_join_transaction()
      
      Trace information can be seen as:
      
      [ 7411.127040] ------------[ cut here ]------------
      [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]()
      [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G           O 3.13.0+ #4
      [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
      [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
      [ 7411.127092] Call Trace:
      [ 7411.127097]  [<ffffffff815b87b0>] dump_stack+0x45/0x56
      [ 7411.127101]  [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0
      [ 7411.127102]  [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20
      [ 7411.127109]  [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs]
      [ 7411.127115]  [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs]
      [ 7411.127120]  [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs]
      [ 7411.127126]  [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
      [ 7411.127131]  [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs]
      [ 7411.127137]  [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
      [ 7411.127142]  [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs]
      [ 7411.127146]  [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
      [ 7411.127151]  [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
      [ 7411.127157]  [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
      [ 7411.127163]  [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
      [ 7411.127169]  [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs]
      [ 7411.127171]  [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140
      [ 7411.127176]  [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
      [ 7411.127182]  [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs]
      [ 7411.127187]  [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs]
      [ 7411.127194]  [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs]
      [ 7411.127200]  [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs]
      [ 7411.127207]  [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs]
      [ 7411.127212]  [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs]
      [ 7411.127219]  [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs]
      [ 7411.127222]  [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330
      [ 7411.127228]  [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs]
      
      Here we fix it by avoiding joining the transaction again if we already hold
      a transaction handle when allocating a chunk in find_free_extent().
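
      Sketch of the idea (current->journal_info is how btrfs tracks the handle a
      task is already running under; the exact placement in find_free_extent()
      may differ):

          struct btrfs_trans_handle *trans;

          if (current->journal_info) {
                  /* already inside a transaction: reuse the handle
                   * instead of joining a third time */
                  trans = current->journal_info;
          } else {
                  trans = btrfs_join_transaction(root);
                  if (IS_ERR(trans))
                          return PTR_ERR(trans);
          }
          /* ... do_chunk_alloc(trans, ...) ... */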
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: reclaim the reserved metadata space at background · 21c7e756
      Authored by Miao Xie
      Before applying this patch, a task had to reclaim the metadata space
      by itself if the metadata space was not enough. And when the task started
      the space reclamation, all the other tasks which wanted to reserve the
      metadata space were blocked. In some cases, they would be blocked for
      a long time, which made performance fluctuate wildly.
      
      So we introduce background metadata space reclamation: when the space
      is about to be exhausted, we insert a reclaim work into the workqueue,
      and the workqueue's worker helps us reclaim the reserved space in the
      background. This way, tasks needn't reclaim the space by themselves in
      most cases, and even if a task has to reclaim the space or is blocked
      waiting for the reclamation, it will get enough space more quickly (a
      sketch of the handoff follows the test results below).
      
      Here is my test result (tested with compilebench):
       Memory:	2GB
       CPU:		2Cores * 1CPU
       Partition:	40GB(SSD)
      
      Test command:
       # compilebench -D <mnt> -m
      
      Without this patch:
       intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
       compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
       read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
       delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)
      
      With this patch:
       intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
       compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
       read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
       delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)
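
      A sketch of how the reservation path might hand the work off; the work
      item and threshold helper names are assumptions:

          /* setup (sketch) */
          INIT_WORK(&fs_info->async_reclaim_work,
                    btrfs_async_reclaim_metadata_space);

          /* reserve path (sketch): hand off instead of blocking every
           * task behind a synchronous flush */
          if (need_do_async_reclaim(space_info, fs_info, used) &&
              !work_busy(&fs_info->async_reclaim_work))
                  queue_work(system_unbound_wq,
                             &fs_info->async_reclaim_work);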
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  17. 25 Apr 2014 (1 commit)