1. 02 7月, 2019 2 次提交
    • F
      Btrfs: prevent send failures and crashes due to concurrent relocation · 9e967495
      Filipe Manana 提交于
      Send always operates on read-only trees and always expected that while it
      is in progress, nothing changes in those trees. Due to that expectation
      and the fact that send is a read-only operation, it operates on commit
      roots and does not hold transaction handles. However relocation can COW
      nodes and leafs from read-only trees, which can cause unexpected failures
      and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
      COWed, the transaction used to COW it is committed, a new transaction
      starts, the extent previously used for that node/leaf gets allocated,
      possibly for another tree, and the respective extent buffer' content
      changes while send is still using it. When this happens send normally
      fails with EIO being returned to user space and messages like the
      following are found in dmesg/syslog:
      
        [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on 58703872 wanted 250 found 253
        [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root. inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
      
      Other times, less often, we hit a BUG_ON() because an extent buffer that
      send is using used to be a node, and while send is still using it, it
      got COWed and got reused as a leaf while send is still using, producing
      the following trace:
      
       [ 3478.466280] ------------[ cut here ]------------
       [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
       [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
       [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46 #1
       [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
       [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX: 0000000000000002
       [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI: ffff961599b30000
       [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09: ffffa437826bfb8e
       [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12: ffff9614385c8708
       [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
       [ 3478.475840] FS:  00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000) knlGS:0000000000000000
       [ 3478.476489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4: 00000000003606f0
       [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [ 3478.479003] Call Trace:
       [ 3478.479600]  ? do_raw_spin_unlock+0x49/0xc0
       [ 3478.480202]  tree_advance+0x173/0x1d0 [btrfs]
       [ 3478.480810]  btrfs_compare_trees+0x30c/0x690 [btrfs]
       [ 3478.481388]  ? process_extent+0x1280/0x1280 [btrfs]
       [ 3478.481954]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [ 3478.482510]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [ 3478.483062]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [ 3478.483581]  ? rq_clock_task+0x2e/0x60
       [ 3478.484086]  ? wake_up_new_task+0x1f3/0x370
       [ 3478.484582]  ? do_vfs_ioctl+0xa2/0x6f0
       [ 3478.485075]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [ 3478.485552]  do_vfs_ioctl+0xa2/0x6f0
       [ 3478.486016]  ? __fget+0x113/0x200
       [ 3478.486467]  ksys_ioctl+0x70/0x80
       [ 3478.486911]  __x64_sys_ioctl+0x16/0x20
       [ 3478.487337]  do_syscall_64+0x60/0x1b0
       [ 3478.487751]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
       (...)
       [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f8e0d7d4dd7
       [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI: 0000000000000005
       [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09: 00007f8e0d6f3700
       [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12: 0000000000000005
       [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
      
      Another possibility, much less likely to happen, is that send will not
      fail but the contents of the stream it produces may not be correct.
      
      To avoid this, do not allow send and relocation (balance) to run in
      parallel. In the long term the goal is to allow for both to be able to
      run concurrently without any problems, but that will take a significant
      effort in development and testing.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9e967495
    • D
      btrfs: document BTRFS_MAX_MIRRORS · 71a9c488
      David Sterba 提交于
      The real meaning of that constant is not clear from the context due to
      the target device inclusion.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      71a9c488
  2. 01 7月, 2019 6 次提交
  3. 02 5月, 2019 1 次提交
  4. 30 4月, 2019 19 次提交
    • J
      btrfs: track DIO bytes in flight · 4297ff84
      Josef Bacik 提交于
      When diagnosing a slowdown of generic/224 I noticed we were not doing
      anything when calling into shrink_delalloc().  This is because all
      writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes
      counter is 0, which short circuits most of the work inside of
      shrink_delalloc().  However O_DIRECT writes still consume metadata
      resources and generate ordered extents, which we can still wait on.
      
      Fix this by tracking outstanding DIO write bytes, and use this as well
      as the delalloc bytes counter to decide if we need to lookup and wait on
      any ordered extents.  If we have more DIO writes than delalloc bytes
      we'll go ahead and wait on any ordered extents regardless of our flush
      state as flushing delalloc is likely to not gain us anything.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ use dio instead of odirect in identifiers ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4297ff84
    • F
      Btrfs: fix race between send and deduplication that lead to failures and crashes · 62d54f3a
      Filipe Manana 提交于
      Send operates on read only trees and expects them to never change while it
      is using them. This is part of its initial design, and this expection is
      due to two different reasons:
      
      1) When it was introduced, no operations were allowed to modifiy read-only
         subvolumes/snapshots (including defrag for example).
      
      2) It keeps send from having an impact on other filesystem operations.
         Namely send does not need to keep locks on the trees nor needs to hold on
         to transaction handles and delay transaction commits. This ends up being
         a consequence of the former reason.
      
      However the deduplication feature was introduced later (on September 2013,
      while send was introduced in July 2012) and it allowed for deduplication
      with destination files that belong to read-only trees (subvolumes and
      snapshots).
      
      That means that having a send operation (either full or incremental) running
      in parallel with a deduplication that has the destination inode in one of
      the trees used by the send operation, can result in tree nodes and leaves
      getting freed and reused while send is using them. This problem is similar
      to the problem solved for the root nodes getting freed and reused when a
      snapshot is made against one tree that is currenly being used by a send
      operation, fixed in commits [1] and [2]. These commits explain in detail
      how the problem happens and the explanation is valid for any node or leaf
      that is not the root of a tree as well. This problem was also discussed
      and explained recently in a thread [3].
      
      The problem is very easy to reproduce when using send with large trees
      (snapshots) and just a few concurrent deduplication operations that target
      files in the trees used by send. A stress test case is being sent for
      fstests that triggers the issue easily. The most common error to hit is
      the send ioctl return -EIO with the following messages in dmesg/syslog:
      
       [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400
       [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27
      
      The first one is very easy to hit while the second one happens much less
      frequently, except for very large trees (in that test case, snapshots
      with 100000 files having large xattrs to get deep and wide trees).
      Less frequently, at least one BUG_ON can be hit:
      
       [1631742.130080] ------------[ cut here ]------------
       [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806!
       [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI
       [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G    B D W         5.0.0-rc8-btrfs-next-45 #1
       [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
       (...)
       [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246
       [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002
       [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000
       [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d
       [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708
       [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       [1631742.138455] FS:  00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000
       [1631742.139010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0
       [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [1631742.141272] Call Trace:
       [1631742.141826]  ? do_raw_spin_unlock+0x49/0xc0
       [1631742.142390]  tree_advance+0x173/0x1d0 [btrfs]
       [1631742.142948]  btrfs_compare_trees+0x268/0x690 [btrfs]
       [1631742.143533]  ? process_extent+0x1070/0x1070 [btrfs]
       [1631742.144088]  btrfs_ioctl_send+0x1037/0x1270 [btrfs]
       [1631742.144645]  _btrfs_ioctl_send+0x80/0x110 [btrfs]
       [1631742.145161]  ? trace_sched_stick_numa+0xe0/0xe0
       [1631742.145685]  btrfs_ioctl+0x13fe/0x3120 [btrfs]
       [1631742.146179]  ? account_entity_enqueue+0xd3/0x100
       [1631742.146662]  ? reweight_entity+0x154/0x1a0
       [1631742.147135]  ? update_curr+0x20/0x2a0
       [1631742.147593]  ? check_preempt_wakeup+0x103/0x250
       [1631742.148053]  ? do_vfs_ioctl+0xa2/0x6f0
       [1631742.148510]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
       [1631742.148942]  do_vfs_ioctl+0xa2/0x6f0
       [1631742.149361]  ? __fget+0x113/0x200
       [1631742.149767]  ksys_ioctl+0x70/0x80
       [1631742.150159]  __x64_sys_ioctl+0x16/0x20
       [1631742.150543]  do_syscall_64+0x60/0x1b0
       [1631742.150931]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [1631742.151326] RIP: 0033:0x7f4cd9f5add7
       (...)
       [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7
       [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007
       [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700
       [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003
       [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001
       (...)
       [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]---
      
      That BUG_ON happens because while send is using a node, that node is COWed
      by a concurrent deduplication, gets freed and gets reused as a leaf (because
      a transaction commit happened in between), so when it attempts to read a
      slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer
      contents were wiped out and it now matches a leaf (which can even belong to
      some other tree now), hitting the BUG_ON(level == 0).
      
      Fix this concurrency issue by not allowing send and deduplication to run
      in parallel if both operate on the same readonly trees, returning EAGAIN
      to user space and logging an exlicit warning in dmesg/syslog.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea
      [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2
      [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62d54f3a
    • D
    • D
      c71dd880
    • D
      78ac4f9e
    • Q
      btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent() · ffd4bb2a
      Qu Wenruo 提交于
      Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long
      parameter list and the confusing @owner parameter.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ffd4bb2a
    • Q
      btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref() · 82fa113f
      Qu Wenruo 提交于
      Use the new btrfs_ref structure and replace parameter list to clean up
      the usage of owner and level to distinguish the extent types.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      82fa113f
    • D
      btrfs: get fs_info from device in btrfs_scrub_cancel_dev · 163e97ee
      David Sterba 提交于
      We can read fs_info from the device and can drop it from the parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      163e97ee
    • F
      Btrfs: remove no longer used function to run delayed refs asynchronously · 32b593bf
      Filipe Manana 提交于
      It used to be called from only two places (truncate path and releasing a
      transaction handle), but commits 28bad212 ("btrfs: fix truncate
      throttling") and db2462a6 ("btrfs: don't run delayed refs in the end
      transaction logic") removed their calls to this function, so it's not used
      anymore. Just remove it and all its helpers.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      32b593bf
    • D
      btrfs: get fs_info from trans in btrfs_write_dirty_block_groups · 5742d15f
      David Sterba 提交于
      We can read fs_info from the transaction and can drop it from the
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5742d15f
    • D
      btrfs: get fs_info from trans in btrfs_setup_space_cache · bbebb3e0
      David Sterba 提交于
      We can read fs_info from the transaction and can drop it from the
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bbebb3e0
    • N
      btrfs: Factor out in_range macro · e74e3993
      Nikolay Borisov 提交于
      This is used in more than one places so let's factor it out in ctree.h.
      No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e74e3993
    • J
      btrfs: replace pending/pinned chunks lists with io tree · 1c11b63e
      Jeff Mahoney 提交于
      The pending chunks list contains chunks that are allocated in the
      current transaction but haven't been created yet. The pinned chunks
      list contains chunks that are being released in the current transaction.
      Both describe chunks that are not reflected on disk as in use but are
      unavailable just the same.
      
      The pending chunks list is anchored by the transaction handle, which
      means that we need to hold a reference to a transaction when working
      with the list.
      
      The way we use them is by iterating over both lists to perform
      comparisons on the stripes they describe for each device. This is
      backwards and requires that we keep a transaction handle open while
      we're trimming.
      
      This patchset adds an extent_io_tree to btrfs_device that maintains
      the allocation state of the device.  Extents are set dirty when
      chunks are first allocated -- when the extent maps are added to the
      mapping tree. They're cleared when last removed -- when the extent
      maps are removed from the mapping tree. This matches the lifespan
      of the pending and pinned chunks list and allows us to do trims
      on unallocated space safely without pinning the transaction for what
      may be a lengthy operation. We can also use this io tree to mark
      which chunks have already been trimmed so we don't repeat the operation.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1c11b63e
    • Q
      btrfs: tree-checker: Verify inode item · 496245ca
      Qu Wenruo 提交于
      There is a report in kernel bugzilla about mismatch file type in dir
      item and inode item.
      
      This inspires us to check inode mode in inode item.
      
      This patch will check the following members:
      
      - inode key objectid
        Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.
      
      - inode key offset
        Should be 0
      
      - inode item generation
      - inode item transid
        No newer than sb generation + 1.
        The +1 is for log tree.
      
      - inode item mode
        No unknown bits.
        No invalid S_IF* bit.
        NOTE: S_IFMT check is not enough, need to check every know type.
      
      - inode item nlink
        Dir should have no more link than 1.
      
      - inode item flags
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      496245ca
    • D
      btrfs: qgroup: remove obsolete fs_info members · 90b1377d
      David Sterba 提交于
      The commit fcebe456 ("Btrfs: rework qgroup accounting") reworked
      qgroups and added some new structures. Another rework of qgroup
      mechanics e69bcee3 ("btrfs: qgroup: Cleanup the old
      ref_node-oriented mechanism.") stopped using them and left uncleaned.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      90b1377d
    • D
      btrfs: get fs_info from eb in btrfs_leaf_free_space · e902baac
      David Sterba 提交于
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e902baac
    • D
      btrfs: get fs_info from eb in btrfs_exclude_logged_extents · bcdc428c
      David Sterba 提交于
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bcdc428c
    • D
      btrfs: get fs_info from eb in leaf_data_end · 8f881e8c
      David Sterba 提交于
      We can read fs_info from extent buffer and can drop it from the
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8f881e8c
    • Q
      btrfs: Make btrfs_(set|clear)_header_flag return void · 80fbc341
      Qu Wenruo 提交于
      From the introduction of btrfs_(set|clear)_header_flag, there is no
      usage of its return value.  So just make it return void.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      80fbc341
  5. 08 3月, 2019 1 次提交
  6. 27 2月, 2019 1 次提交
    • J
      btrfs: check for refs on snapshot delete resume · 78c52d9e
      Josef Bacik 提交于
      There's a bug in snapshot deletion where we won't update the
      drop_progress key if we're in the UPDATE_BACKREF stage.  This is a
      problem because we could drop refs for blocks we know don't belong to
      ours.  If we crash or umount at the right time we could experience
      messages such as the following when snapshot deletion resumes
      
       BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258  owner 1 offset 0
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       CPU: 3 PID: 16052 Comm: umount Tainted: G        W  OE     5.0.0-rc4+ #147
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
       RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
       RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
       RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
       R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
       FS:  00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
       Call Trace:
       ? _raw_spin_unlock+0x27/0x40
       ? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
       __btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
       ? join_transaction+0x2b/0x460 [btrfs]
       btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
       btrfs_commit_transaction+0x52/0xa50 [btrfs]
       ? start_transaction+0xa6/0x510 [btrfs]
       btrfs_sync_fs+0x79/0x1c0 [btrfs]
       sync_filesystem+0x70/0x90
       generic_shutdown_super+0x27/0x120
       kill_anon_super+0x12/0x30
       btrfs_kill_super+0x16/0xa0 [btrfs]
       deactivate_locked_super+0x43/0x70
       deactivate_super+0x40/0x60
       cleanup_mnt+0x3f/0x80
       __cleanup_mnt+0x12/0x20
       task_work_run+0x8b/0xc0
       exit_to_usermode_loop+0xce/0xd0
       do_syscall_64+0x20b/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To fix this simply mark dead roots we read from disk as DEAD and then
      set the walk_control->restarted flag so we know we have a restarted
      deletion.  From here whenever we try to drop refs for blocks we check to
      verify our ref is set on them, and if it is not we skip it.  Once we
      find a ref that is set we unset walk_control->restarted since the tree
      should be in a normal state from then on, and any problems we run into
      from there are different issues.  I tested this with an existing broken
      fs and my reproducer that creates a broken fs and it fixed both file
      systems.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78c52d9e
  7. 25 2月, 2019 10 次提交
    • D
      btrfs: scrub: remove unused nocow worker pointer · 0ea82076
      David Sterba 提交于
      The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
      optimization was removed from scrub in 9bebe665 ("btrfs: scrub:
      Remove unused copy_nocow_pages and its callchain").
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0ea82076
    • D
      btrfs: scrub: add assertions for worker pointers · c8352942
      David Sterba 提交于
      The scrub worker pointers are not NULL iff the scrub is running, so
      reset them back once the last reference is dropped. Add assertions to
      the initial phase of scrub to verify that.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c8352942
    • A
      btrfs: scrub: convert scrub_workers_refcnt to refcount_t · ff09c4ca
      Anand Jain 提交于
      Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
      we get the extra checks. All reference changes are still done under
      scrub_lock.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff09c4ca
    • J
      btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik 提交于
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      450114fc
    • J
      btrfs: replace cleaner_delayed_iput_mutex with a waitqueue · 034f784d
      Josef Bacik 提交于
      The throttle path doesn't take cleaner_delayed_iput_mutex, which means
      we could think we're done flushing iputs in the data space reservation
      path when we could have a throttler doing an iput.  There's no real
      reason to serialize the delayed iput flushing, so instead of taking the
      cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
      replace it with an atomic counter and a waitqueue.  This removes the
      short (or long depending on how big the inode is) window where we think
      there are no more pending iputs when there really are some.
      
      The waiting is killable as it could be indirectly called from user
      operations like fallocate or zero-range. Such call sites should handle
      the error but otherwise it's not necessary. Eg. flush_space just needs
      to attempt to make space by waiting on iputs.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add killable comment and changelog parts ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      034f784d
    • A
      btrfs: let the assertion expression compile in all configs · 2eec5f00
      Anders Roxell 提交于
      A compiler warning (in a patch in development) pointed to a variable
      that was used only inside and ASSERT:
      
        u64 root_objectid = root->root_key.objectid;
        ASSERT(root_objectid == ...);
      
        fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
        fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
          u64 root_objectid = root->root_key.objectid;
      	^~~~~~~~~~~~~
      
      When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.
      
      Rework the assertion helper by adding a runtime check instead of the
      '#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
      condition being passed into an inline function after preprocessing.
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2eec5f00
    • Q
      btrfs: qgroup: Introduce per-root swapped blocks infrastructure · 370a11b8
      Qu Wenruo 提交于
      To allow delayed subtree swap rescan, btrfs needs to record per-root
      information about which tree blocks get swapped.  This patch introduces
      the required infrastructure.
      
      The designed workflow will be:
      
      1) Record the subtree root block that gets swapped.
      
         During subtree swap:
         O = Old tree blocks
         N = New tree blocks
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               NA     OB                          OA      OB
             /  |     |  \                      /  |      |  \
           NC  ND     OE  OF                   OC  OD     OE  OF
      
        In this case, NA and OA are going to be swapped, record (NA, OA) into
        subvolume tree X.
      
      2) After subtree swap.
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               OA     OB                          NA      OB
             /  |     |  \                      /  |      |  \
           OC  OD     OE  OF                   NC  ND     OE  OF
      
      3a) COW happens for OB
          If we are going to COW tree block OB, we check OB's bytenr against
          tree X's swapped_blocks structure.
          If it doesn't fit any, nothing will happen.
      
      3b) COW happens for NA
          Check NA's bytenr against tree X's swapped_blocks, and get a hit.
          Then we do subtree scan on both subtrees OA and NA.
          Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
      
          Then no matter what we do to subvolume tree X, qgroup numbers will
          still be correct.
          Then NA's record gets removed from X's swapped_blocks.
      
      4)  Transaction commit
          Any record in X's swapped_blocks gets removed, since there is no
          modification to swapped subtrees, no need to trigger heavy qgroup
          subtree rescan for them.
      
      This will introduce 128 bytes overhead for each btrfs_root even qgroup
      is not enabled. This is to reduce memory allocations and potential
      failures.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      370a11b8
    • Q
      btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots · d2311e69
      Qu Wenruo 提交于
      Relocation code will drop btrfs_root::reloc_root as soon as
      merge_reloc_root() finishes.
      
      However later qgroup code will need to access btrfs_root::reloc_root
      after merge_reloc_root() for delayed subtree rescan.
      
      So alter the timming of resetting btrfs_root:::reloc_root, make it
      happens after transaction commit.
      
      With this patch, we will introduce a new btrfs_root::state,
      BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user
      that although btrfs_root::reloc_tree is still non-NULL, but still it's
      not used any more.
      
      The lifespan of btrfs_root::reloc tree will become:
                Old behavior            |              New
      ------------------------------------------------------------------------
      btrfs_init_reloc_root()      ---  | btrfs_init_reloc_root()      ---
        set reloc_root              |   |   set reloc_root              |
                                    |   |                               |
                                    |   |                               |
      merge_reloc_root()            |   | merge_reloc_root()            |
      |- btrfs_update_reloc_root() ---  | |- btrfs_update_reloc_root() -+-
           clear btrfs_root::reloc_root |      set ROOT_DEAD_RELOC_TREE |
                                        |      record root into dirty   |
                                        |      roots rbtree             |
                                        |                               |
                                        | reloc_block_group() Or        |
                                        | btrfs_recover_relocation()    |
                                        | | After transaction commit    |
                                        | |- clean_dirty_subvols()     ---
                                        |     clear btrfs_root::reloc_root
      
      During ROOT_DEAD_RELOC_TREE set lifespan, the only user of
      btrfs_root::reloc_tree should be qgroup.
      
      Since reloc root needs a longer life-span, this patch will also delay
      btrfs_drop_snapshot() call.
      Now btrfs_drop_snapshot() is called in clean_dirty_subvols().
      
      This patch will increase the size of btrfs_root by 16 bytes.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d2311e69
    • N
      btrfs: Remove unused arguments from btrfs_get_extent_fiemap · 4ab47a8d
      Nikolay Borisov 提交于
      This function is a simple wrapper over btrfs_get_extent that returns
      either:
      
      a) A real extent in the passed range or
      b) Adjusted extent based on whether delalloc bytes are found backing up
         a hole.
      
      To support these semantics it doesn't need the page/pg_offset/create
      arguments which are passed to btrfs_get_extent in case an extent is to
      be created. So simplify the function by removing the unused arguments.
      No functional changes.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4ab47a8d
    • N
      btrfs: Make first argument of btrfs_run_delalloc_range directly an inode · bc9a8bf7
      Nikolay Borisov 提交于
      Since this function is no longer a callback there is no need to have
      its first argument obfuscated with a void *. Change it directly to a
      pointer to an inode. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc9a8bf7