1. 19 Jan, 2019 5 commits
    • btrfs: wakeup cleaner thread when adding delayed iput · fd340d0f
      Josef Bacik committed
      The cleaner thread usually takes care of delayed iputs, with the
      exception of the btrfs_end_transaction_throttle path.  Delaying iputs
      means we are potentially delaying the eviction of an inode and its
      respective space.  The cleaner thread only gets woken up every 30
      seconds, or when we require space.  If there are a lot of inodes that
      need to be deleted we could induce a serious amount of latency while we
      wait for these inodes to be evicted.  So instead wake up the cleaner if
      it's not already awake to process any new delayed iputs we add to the
      list.  If we suddenly need space we are less likely to be backed up
      behind a bunch of inodes that are waiting to be deleted, and we could
      possibly free space before we need to get into the flushing logic, which
      will save us some latency.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
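The wake-on-add behaviour described in this commit message can be sketched as a small user-space model. This is only an illustration under stated assumptions: `cleaner_model`, `model_add_delayed_iput` and `model_cleaner_run` are hypothetical names, not kernel symbols, and the real code wakes a kernel thread rather than flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are the kernel's. */
struct cleaner_model {
    size_t pending_iputs;   /* delayed iputs waiting to be run */
    bool awake;             /* is the cleaner currently running? */
    unsigned wakeups;       /* how many times it had to be woken */
};

/* Queue one delayed iput and wake the cleaner only if it is asleep. */
static void model_add_delayed_iput(struct cleaner_model *c)
{
    c->pending_iputs++;
    if (!c->awake) {
        c->awake = true;
        c->wakeups++;
    }
}

/* One cleaner pass: drain the queue, then go back to sleep. */
static size_t model_cleaner_run(struct cleaner_model *c)
{
    size_t done = c->pending_iputs;

    c->pending_iputs = 0;
    c->awake = false;
    return done;
}
```

In this model, adding several iputs while the cleaner is already awake costs a single wakeup, which is the "if it's not already awake" condition from the message.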
    • btrfs: run delayed iputs before committing · 3ec9a4c8
      Josef Bacik committed
      Delayed iputs mean we can have final iputs of deleted inodes in the
      queue, which could potentially generate a lot of pinned space that could
      be freed.  So before we decide to commit the transaction for ENOSPC
      reasons, run the delayed iputs so that any potential space is freed up.
      If there is any and we freed enough, we can then commit the transaction
      and potentially be able to make our reservation.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
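The ordering this commit message asks for (run delayed iputs first, then decide whether a commit would help with ENOSPC) can be illustrated with a toy accounting model. The struct, its fields and the `model_*` functions are assumptions made for the sketch, not the kernel's space-reservation code.

```c
#include <assert.h>

/* Hypothetical toy accounting; field and function names are not the kernel's. */
struct model_space {
    unsigned long pinned;        /* bytes a commit would release */
    unsigned long delayed_iput;  /* bytes held by queued final iputs */
};

/* Running the delayed iputs turns their space into pinned space. */
static void model_run_delayed_iputs(struct model_space *s)
{
    s->pinned += s->delayed_iput;
    s->delayed_iput = 0;
}

/* Would committing now free at least 'needed' bytes? */
static int model_commit_would_help(const struct model_space *s,
                                   unsigned long needed)
{
    return s->pinned >= needed;
}

/* The order from the commit message: run iputs first, then decide. */
static int model_may_commit(struct model_space *s, unsigned long needed)
{
    model_run_delayed_iputs(s);
    return model_commit_would_help(s, needed);
}
```

With 100 bytes pinned and 400 bytes behind delayed iputs, a 300-byte reservation looks hopeless until the iputs are run, after which the commit is worthwhile.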
    • btrfs: wait on ordered extents on abort cleanup · 74d5d229
      Josef Bacik committed
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX: 0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI: 0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09: 0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12: ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15: 0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000) knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4: 00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX: 00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09: 00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12: 00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15: 0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle delayed ref head accounting cleanup in abort · 31890da0
      Josef Bacik committed
      We weren't doing any of the accounting cleanup when we aborted
      transactions.  Fix this by making cleanup_ref_head_accounting global and
      calling it from the abort code.  This fixes the issue where our
      accounting was all wrong after the fs aborts.
      
      The test generic/475 on a 2G VM can trigger the problem, e.g.:
      
        [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_groups+0x3dc/0x410 [btrfs]
        [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394
        [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
        [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs]
        [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206
        [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000
        [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400
        [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000
        [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108
        [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100
        [ 8502.170296] FS:  00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000
        [ 8502.172439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0
        [ 8502.175094] Call Trace:
        [ 8502.175759]  close_ctree+0x17f/0x350 [btrfs]
        [ 8502.176721]  generic_shutdown_super+0x64/0x100
        [ 8502.177702]  kill_anon_super+0x14/0x30
        [ 8502.178607]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [ 8502.179602]  deactivate_locked_super+0x29/0x60
        [ 8502.180595]  cleanup_mnt+0x3b/0x70
        [ 8502.181406]  task_work_run+0x98/0xc0
        [ 8502.182255]  exit_to_usermode_loop+0x83/0x90
        [ 8502.183113]  do_syscall_64+0x15b/0x180
        [ 8502.183919]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Corresponding to
      
        release_global_block_rsv() {
        ...
        WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
      
      CC: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      [ add log dump ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Revert "btrfs: balance dirty metadata pages in btrfs_finish_ordered_io" · 77b7aad1
      David Sterba committed
      This reverts commit e73e81b6.
      
      This patch causes a few problems:
      
      - adds latency to btrfs_finish_ordered_io
      - as btrfs_finish_ordered_io is used for free space cache, generating
        more work from btrfs_btree_balance_dirty_nodelay could end up in the
        same workqueue, effectively deadlocking
      
      12260 kworker/u96:16+btrfs-freespace-write D
      [<0>] balance_dirty_pages+0x6e6/0x7ad
      [<0>] balance_dirty_pages_ratelimited+0x6bb/0xa90
      [<0>] btrfs_finish_ordered_io+0x3da/0x770
      [<0>] normal_work_helper+0x1c5/0x5a0
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x46/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Transaction commit will wait on the freespace cache:
      
      838 btrfs-transacti D
      [<0>] btrfs_start_ordered_extent+0x154/0x1e0
      [<0>] btrfs_wait_ordered_range+0xbd/0x110
      [<0>] __btrfs_wait_cache_io+0x49/0x1a0
      [<0>] btrfs_write_dirty_block_groups+0x10b/0x3b0
      [<0>] commit_cowonly_roots+0x215/0x2b0
      [<0>] btrfs_commit_transaction+0x37e/0x910
      [<0>] transaction_kthread+0x14d/0x180
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      And then writepages ends up waiting on transaction commit:
      
      9520 kworker/u96:13+flush-btrfs-1 D
      [<0>] wait_current_trans+0xac/0xe0
      [<0>] start_transaction+0x21b/0x4b0
      [<0>] cow_file_range_inline+0x10b/0x6b0
      [<0>] cow_file_range.isra.69+0x329/0x4a0
      [<0>] run_delalloc_range+0x105/0x3c0
      [<0>] writepage_delalloc+0x119/0x180
      [<0>] __extent_writepage+0x10c/0x390
      [<0>] extent_write_cache_pages+0x26f/0x3d0
      [<0>] extent_writepages+0x4f/0x80
      [<0>] do_writepages+0x17/0x60
      [<0>] __writeback_single_inode+0x59/0x690
      [<0>] writeback_sb_inodes+0x291/0x4e0
      [<0>] __writeback_inodes_wb+0x87/0xb0
      [<0>] wb_writeback+0x3bb/0x500
      [<0>] wb_workfn+0x40d/0x610
      [<0>] process_one_work+0x1ee/0x5a0
      [<0>] worker_thread+0x1e0/0x3d0
      [<0>] kthread+0xf5/0x130
      [<0>] ret_from_fork+0x24/0x30
      [<0>] 0xffffffffffffffff
      
      Eventually, we have every process in the system waiting on
      balance_dirty_pages(), and nobody is able to make progress on page
      writeback.
      
      The original patch tried to fix an OOM condition that happened on 4.4, but
      attempts to reproduce it on later kernels (4.19 and 4.20) were unsuccessful.
      This is more likely a problem in OOM itself.
      
      Link: https://lore.kernel.org/linux-btrfs/20180528054821.9092-1-ethanlien@synology.com/
      Reported-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org # 4.18+
      CC: ethanlien <ethanlien@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 11 Jan, 2019 1 commit
    • btrfs: Use real device structure to verify dev extent · 1b3922a8
      Qu Wenruo committed
      [BUG]
      Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel
      message:
      
        BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0
        BTRFS error (device dm-6): failed to verify dev extents against chunks: -117
        BTRFS error (device dm-6): open_ctree failed
      
      [CAUSE]
      Commit cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent
      mapping check") introduced a strict check on dev extents.
      
      We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and
      depend only on @devid to find the real device.
      
      For seed devices, we call clone_fs_devices() in open_seed_devices() to
      allow us to search seed devices directly.
      
      However, clone_fs_devices() just populates devices with devid and dev
      uuid, without populating other essential members, like disk_total_bytes.
      
      This means any device returned by btrfs_find_device(fs_info, devid,
      NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev
      extents on the seed device will not pass the device boundary check.
      
      [FIX]
      This patch will try to verify the device returned by btrfs_find_device()
      and if it's a dummy then re-search in seed devices.
      
      Fixes: cf90d884 ("btrfs: Introduce mount time chunk <-> dev extent mapping check")
      CC: stable@vger.kernel.org # 4.19+
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
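The [CAUSE] and [FIX] sections can be illustrated with a simplified lookup that skips dummy devices (those with 0 disk_total_bytes) and keeps searching the seed devices. The `model_*` structures and `find_real_device()` below are hypothetical stand-ins, assumed only for this sketch; the actual patch works on the kernel's own device structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_device {
    uint64_t devid;
    uint64_t disk_total_bytes;            /* 0 marks a dummy clone */
};

struct model_fs_devices {
    const struct model_device *devs;
    size_t ndevs;
    const struct model_fs_devices *seed;  /* next seed filesystem, or NULL */
};

/* Plain lookup by devid in one device list; may return a dummy. */
static const struct model_device *
find_in_list(const struct model_fs_devices *fs, uint64_t devid)
{
    for (size_t i = 0; i < fs->ndevs; i++)
        if (fs->devs[i].devid == devid)
            return &fs->devs[i];
    return NULL;
}

/* Return the first fully populated device with this devid, walking seeds. */
static const struct model_device *
find_real_device(const struct model_fs_devices *fs, uint64_t devid)
{
    for (; fs; fs = fs->seed) {
        const struct model_device *dev = find_in_list(fs, devid);

        if (dev && dev->disk_total_bytes != 0)
            return dev;
    }
    return NULL;
}

/* Boundary check in the spirit of the error message quoted above. */
static int dev_extent_in_bounds(const struct model_device *dev,
                                uint64_t physical, uint64_t len)
{
    return dev && physical + len <= dev->disk_total_bytes;
}
```

Checking the dev extent from the log (physical offset 13631488, len 8388608) against the dummy fails because its boundary is 0, while the same check against the device found through the seed list passes.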
  3. 09 Jan, 2019 3 commits
    • Btrfs: fix deadlock when using free space tree due to block group creation · a6d8654d
      Filipe Manana committed
      When modifying the free space tree we can end up COWing one of its extent
      buffers which in turn might result in allocating a new chunk, which in
      turn can result in flushing (finish creation) of pending block groups. If
      that happens we can deadlock because creating a pending block group needs
      to update the free space tree, and if any of the updates tries to modify
      the same extent buffer that we are COWing, we end up in a deadlock since
      we try to write lock the same extent buffer twice.
      
      So fix this by skipping pending block group creation if we are COWing an
      extent buffer from the free space tree. This is a case missed by commit
      5ce55557 ("Btrfs: fix deadlock when writing out free space caches").
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173
      Fixes: 5ce55557 ("Btrfs: fix deadlock when writing out free space caches")
      CC: stable@vger.kernel.org # 4.18+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix race between reflink/dedupe and relocation · d8b55242
      Filipe Manana committed
      The recent rework that makes btrfs' remap_file_range operation use the
      generic helper generic_remap_file_range_prep() introduced a race between
      relocation and reflinking (for both cloning and deduplication) the file
      extents between the source and destination inodes.
      
      This happens because we no longer lock the source range. We stopped
      locking it because we wait for direct IO writes and writeback to
      complete early in the code path, right after locking the inodes, which
      guarantees no other file operations interfere with the reflinking. However,
      there is one exception, which is relocation, since it replaces the byte
      number of file extent items in the fs tree after locking the range the
      file extent items represent. This is a problem because after finding each
      file extent to clone in the fs tree, the reflink process copies the file
      extent item into a local buffer, releases the search path, inserts new
      file extent items in the destination range and then increments the
      reference count for the extent mentioned in the file extent item that it
      previously copied to the buffer. If right after copying the file extent
      item into the buffer and releasing the path the relocation process
      updates the file extent item to point to the new extent, the reflink
      process ends up creating a delayed reference to increment the reference
      count of the old extent, for which the relocation process already created
      a delayed reference to drop it. This results in failure to run delayed
      references because we will attempt to increment the count of a reference
      that was already dropped. This is illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
                                              relocation is running
      
        btrfs_clone_files()
      
          btrfs_clone()
            --> finds extent item
                in source range,
                points to extent
                at bytenr X
            --> copies it into a
                local buffer
            --> releases path
      
                                              replace_file_extents()
                                                --> successfully locks the
                                                    range represented by
                                                    the file extent item
                                                --> replaces disk_bytenr
                                                    field in the file
                                                    extent item with some
                                                    other value Y
                                                --> creates delayed reference
                                                    to increment reference
                                                    count for extent at
                                                    bytenr Y
                                                --> creates delayed reference
                                                    to drop the extent at
                                                    bytenr X
      
            --> starts transaction
            --> creates delayed
                reference to
                increment extent
                at bytenr X
      
                          <delayed references are run, due to a transaction
                           commit for example, and the transaction is aborted
                           with -EIO because we attempt to increment reference
                           count for the extent at bytenr X after we freed it>
      
      When this race is hit the running transaction ends up getting aborted with
      an -EIO error and a trace like the following is produced:
      
      [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G        W         4.20.0-rc6-btrfs-next-41 #1
      [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
      [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
      [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
      [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
      [ 4382.556315] FS:  00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
      [ 4382.563130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
      [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4382.564887] Call Trace:
      [ 4382.565343]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
      [ 4382.565796]  __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
      [ 4382.566249]  ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
      [ 4382.566702]  __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
      [ 4382.567162]  btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
      [ 4382.567623]  btrfs_commit_transaction+0x50/0x9c0 [btrfs]
      [ 4382.568112]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.568557]  ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
      [ 4382.569006]  create_subvol+0x3c8/0x830 [btrfs]
      [ 4382.569461]  ? btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.569906]  btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.570383]  ? rcu_sync_lockdep_assert+0xe/0x60
      [ 4382.570822]  ? __sb_start_write+0xd4/0x1c0
      [ 4382.571262]  ? mnt_want_write_file+0x24/0x50
      [ 4382.571712]  btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
      [ 4382.572155]  ? _copy_from_user+0x66/0x90
      [ 4382.572602]  btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
      [ 4382.573052]  btrfs_ioctl+0x7c1/0x30e0 [btrfs]
      [ 4382.573502]  ? mem_cgroup_commit_charge+0x8b/0x570
      [ 4382.573946]  ? do_raw_spin_unlock+0x49/0xc0
      [ 4382.574379]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.574803]  ? __handle_mm_fault+0xf29/0x12d0
      [ 4382.575215]  ? do_vfs_ioctl+0xa2/0x6f0
      [ 4382.575622]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [ 4382.576020]  do_vfs_ioctl+0xa2/0x6f0
      [ 4382.576405]  ksys_ioctl+0x70/0x80
      [ 4382.576776]  __x64_sys_ioctl+0x16/0x20
      [ 4382.577137]  do_syscall_64+0x60/0x1b0
      [ 4382.577488]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      (...)
      [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
      [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
      [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
      [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
      [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
      [ 4382.580792] irq event stamp: 0
      [ 4382.581106] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.581772] softirqs last  enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
      [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
      [ 4382.624295] BTRFS info (device sdc): forced readonly
      
      Fix this by locking the source range before searching for the file extent
      items in the fs tree, since the relocation process will try to lock the
      range a file extent item represents before updating it with the new extent
      location.
      
      Fixes: 34a28e3d ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
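The interleaving in the diagram of this commit message can be replayed deterministically with a toy reference-count model. Everything here is an assumption made for illustration: the `model_fs` struct, the two-extent table, and the idea that references are applied immediately in queue order (the kernel batches them as delayed references). The sketch only shows why incrementing a stale bytenr after the drop fails with -EIO, and why holding the range lock avoids it.

```c
#include <assert.h>
#include <errno.h>

enum { EXTENT_X, EXTENT_Y, NR_EXTENTS };

/* Hypothetical toy model: per-extent reference counts, applied in queue order. */
struct model_fs {
    int refs[NR_EXTENTS];
    int item_extent;      /* extent the source file extent item points to */
};

static int ref_add(struct model_fs *fs, int extent)
{
    if (fs->refs[extent] == 0)
        return -EIO;      /* extent already freed: the transaction aborts */
    fs->refs[extent]++;
    return 0;
}

static void ref_drop(struct model_fs *fs, int extent)
{
    fs->refs[extent]--;
}

/* Relocation: repoint the item from X to Y and drop the reference on X. */
static void relocate(struct model_fs *fs)
{
    fs->refs[EXTENT_Y] = 1;
    fs->item_extent = EXTENT_Y;
    ref_drop(fs, EXTENT_X);
}

/* Clone without the range lock: relocation slips in after the item is copied. */
static int clone_unlocked(struct model_fs *fs)
{
    int copied = fs->item_extent;   /* copy item, release path */

    relocate(fs);                   /* racing relocation runs here */
    return ref_add(fs, copied);     /* increments the stale bytenr X */
}

/* Clone with the range locked: relocation has to wait until we are done. */
static int clone_locked(struct model_fs *fs)
{
    int copied = fs->item_extent;
    int ret = ref_add(fs, copied);

    relocate(fs);                   /* runs only after the lock is released */
    return ret;
}
```

In the unlocked ordering the add on X arrives after X's only reference was dropped, so it fails; in the locked ordering the add is queued first and both extents end with one reference each.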
    • Btrfs: fix race between cloning range ending at eof and writeback · f7fa1107
      Filipe Manana committed
      The recent rework that makes btrfs' remap_file_range operation use the
      generic helper generic_remap_file_range_prep() introduced a race between
      writeback and cloning a range that covers the eof extent of the source
      file into a destination offset that is greater than the same file's size.
      
      This happens because we now wait for writeback to complete before doing
      the truncation of the eof block, while previously we did the truncation
      and then waited for writeback to complete. This leads to a race between
      writeback of the truncated block and cloning the file extents in the
      source range, because we copy each file extent item we find in the fs
      root into a buffer, then release the path and then increment the reference
      count for the extent referred to in the file extent item we copied, which
      may no longer exist if writeback of the truncated eof block completes
      after we copied the file extent item into the buffer and before we
      incremented the reference count. This is illustrated by the following
      diagram:
      
              CPU 1                                       CPU 2
      
        btrfs_clone_files()
          btrfs_cont_expand()
            btrfs_truncate_block()
               --> zeroes part of the
                   page containing eof,
                   marking it for
                   delalloc
      
          btrfs_clone()
            --> finds extent item
                covering eof,
                points to extent
                at bytenr X
            --> copies it into a
                local buffer
            --> releases path
      
                                              writeback starts
      
                                              btrfs_finish_ordered_io()
                                                insert_reserved_file_extent()
                                                  __btrfs_drop_extents()
                                                    --> creates delayed
                                                        reference to drop
                                                        the extent at
                                                        bytenr X
      
            --> starts transaction
            --> creates delayed
                reference to
                increment extent
                at bytenr X
      
                          <delayed references are run, due to a transaction
                           commit for example, and the transaction is aborted
                           with -EIO because we attempt to increment reference
                           count for the extent at bytenr X after we freed it>
      
      When this race is hit the running transaction ends up getting aborted with
      an -EIO error and a trace like the following is produced:
      
      [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G        W         4.20.0-rc6-btrfs-next-41 #1
      [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
      [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs]
      (...)
      [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202
      [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001
      [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
      [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000
      [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548
      [ 4382.556315] FS:  00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000
      [ 4382.563130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0
      [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 4382.564887] Call Trace:
      [ 4382.565343]  insert_inline_extent_backref+0x55/0xe0 [btrfs]
      [ 4382.565796]  __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs]
      [ 4382.566249]  ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs]
      [ 4382.566702]  __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs]
      [ 4382.567162]  btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs]
      [ 4382.567623]  btrfs_commit_transaction+0x50/0x9c0 [btrfs]
      [ 4382.568112]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.568557]  ? block_rsv_release_bytes+0x14e/0x410 [btrfs]
      [ 4382.569006]  create_subvol+0x3c8/0x830 [btrfs]
      [ 4382.569461]  ? btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.569906]  btrfs_mksubvol+0x317/0x600 [btrfs]
      [ 4382.570383]  ? rcu_sync_lockdep_assert+0xe/0x60
      [ 4382.570822]  ? __sb_start_write+0xd4/0x1c0
      [ 4382.571262]  ? mnt_want_write_file+0x24/0x50
      [ 4382.571712]  btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs]
      [ 4382.572155]  ? _copy_from_user+0x66/0x90
      [ 4382.572602]  btrfs_ioctl_snap_create+0x66/0x80 [btrfs]
      [ 4382.573052]  btrfs_ioctl+0x7c1/0x30e0 [btrfs]
      [ 4382.573502]  ? mem_cgroup_commit_charge+0x8b/0x570
      [ 4382.573946]  ? do_raw_spin_unlock+0x49/0xc0
      [ 4382.574379]  ? _raw_spin_unlock+0x24/0x30
      [ 4382.574803]  ? __handle_mm_fault+0xf29/0x12d0
      [ 4382.575215]  ? do_vfs_ioctl+0xa2/0x6f0
      [ 4382.575622]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [ 4382.576020]  do_vfs_ioctl+0xa2/0x6f0
      [ 4382.576405]  ksys_ioctl+0x70/0x80
      [ 4382.576776]  __x64_sys_ioctl+0x16/0x20
      [ 4382.577137]  do_syscall_64+0x60/0x1b0
      [ 4382.577488]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      (...)
      [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7
      [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003
      [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044
      [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010
      [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050
      [ 4382.580792] irq event stamp: 0
      [ 4382.581106] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.581772] softirqs last  enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320
      [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 4382.582413] ---[ end trace d3c188e3e9367382 ]---
      [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure
      [ 4382.624295] BTRFS info (device sdc): forced readonly
      
      Fix this by waiting for writeback to complete after truncating the eof
      block.
      
      Fixes: 34a28e3d ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 05 Jan, 2019 1 commit
  5. 04 Jan, 2019 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds committed
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96d4f267
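The call-site change described above can be sketched in userspace C. This is only an illustration under assumptions: `SKETCH_USER_LIMIT`, `sketch_range_ok()` and the `sketch_access_ok*` macros are hypothetical names, not the kernel's definitions. The point is that the old form accepted a 'type' argument and never looked at it, so dropping it cannot change the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of the user address range (assumption). */
#define SKETCH_USER_LIMIT 0x0000800000000000ULL

/* Range check written so that addr + size cannot overflow. */
bool sketch_range_ok(uint64_t addr, uint64_t size)
{
    return size <= SKETCH_USER_LIMIT && addr <= SKETCH_USER_LIMIT - size;
}

#define SKETCH_VERIFY_READ  0
#define SKETCH_VERIFY_WRITE 1

/* Old shape: the 'type' argument is evaluated and thrown away. */
#define sketch_access_ok_old(type, addr, size) \
    ((void)(type), sketch_range_ok((uint64_t)(addr), (uint64_t)(size)))

/* New shape: same check, no 'type' argument. */
#define sketch_access_ok(addr, size) \
    sketch_range_ok((uint64_t)(addr), (uint64_t)(size))

/* Both shapes must agree for any range, whatever 'type' was passed. */
void sketch_same_answer(uint64_t addr, uint64_t size)
{
    assert(sketch_access_ok_old(SKETCH_VERIFY_READ, addr, size) ==
           sketch_access_ok(addr, size));
    assert(sketch_access_ok_old(SKETCH_VERIFY_WRITE, addr, size) ==
           sketch_access_ok(addr, size));
}
```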
  6. 22 Dec, 2018 3 commits
    • A
      LSM: hide struct security_mnt_opts from any generic code · 204cc0cc
      Al Viro authored
      Keep void * instead, allocate on demand (in parse_str_opts, at the
      moment).  Eventually both selinux and smack will be better off
      with private structures with several strings in those, rather than
      this "counter and two pointers to dynamically allocated arrays"
      ugliness.  This commit allows doing that at leisure, without
      disrupting anything outside of given module.
      
      Changes:
      	* instead of struct security_mnt_opts use an opaque pointer
      initialized to NULL.
      	* security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and
      security_free_mnt_opts() take it as var argument (i.e. as void **);
      call sites are unchanged.
      	* security_sb_set_mnt_opts() and security_sb_remount() take
      it by value (i.e. as void *).
      	* new method: ->sb_free_mnt_opts().  Takes void *, does
      whatever freeing that needs to be done.
      	* ->sb_set_mnt_opts() and ->sb_remount() might get NULL as
      mnt_opts argument, meaning "empty".
      Reviewed-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      204cc0cc
    • A
      btrfs: sanitize security_mnt_opts use · a65001e8
      Al Viro authored
      1) keeping a copy in btrfs_fs_info is completely pointless - we never
      use it for anything.  Getting rid of that allows for simpler calling
      conventions for setup_security_options() (caller is responsible for
      freeing mnt_opts in all cases).
      
      2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(),
      same as we would if not for FS_BINARY_MOUNTDATA.  Behaviours *are*
      close (in fact, selinux sb_set_mnt_opts() ought to punt to
      sb_remount() in "already initialized" case), but let's handle
      that uniformly.  And the only reason why the original btrfs changes
      didn't go for security_sb_remount() in btrfs_remount() case is that
      it hadn't been exported.  Let's export it for a while - it'll be
      going away soon anyway.
      Reviewed-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a65001e8
    • A
      new helper: security_sb_eat_lsm_opts() · f5c0c26d
      Al Viro authored
      A combination of alloc_secdata(), security_sb_copy_data(),
      security_sb_parse_opts_str() and free_secdata().
      Reviewed-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f5c0c26d
  7. 17 Dec, 2018 26 commits
    • A
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini authored
      The typos accumulate over time so once in a while they get fixed in
      a large patch.
      Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      52042d8e
    • J
      btrfs: improve error handling of btrfs_add_link · 1690dd41
      Johannes Thumshirn authored
      In the error handling block, err holds the return value of either
      btrfs_del_root_ref() or btrfs_del_inode_ref() but it hasn't been checked
      since its introduction with commit fe66a05a (Btrfs: improve error
      handling for btrfs_insert_dir_item callers) in 2012.
      
      If the cleanup inside the error handling block itself fails, there's not much left
      to do and the abort either happened earlier in the callees or is
      necessary here.
      
      So if one of btrfs_del_root_ref() or btrfs_del_inode_ref() failed, abort
      the transaction, but still return the original code of the failure
      stored in 'ret' as this will be reported to the user.
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1690dd41
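A minimal sketch of the pattern this commit message describes: when the undo step in the failure path fails as well, abort the transaction with that second error but still hand the original error back to the caller. `struct sketch_trans`, `sketch_abort()` and `sketch_fail_path()` are hypothetical stand-ins, not btrfs functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transaction handle. */
struct sketch_trans {
    bool aborted;
    int abort_code;
};

/* Hypothetical stand-in for aborting the transaction. */
void sketch_abort(struct sketch_trans *trans, int err)
{
    trans->aborted = true;
    trans->abort_code = err;
}

/*
 * ret:      the original failure that sent us into the error path
 * undo_err: result of undoing the partial work (0 on success)
 *
 * A failed undo aborts the transaction, yet the caller still sees 'ret',
 * because that is the error that gets reported to the user.
 */
int sketch_fail_path(struct sketch_trans *trans, int ret, int undo_err)
{
    assert(ret != 0); /* only reached after a real failure */

    if (undo_err)
        sketch_abort(trans, undo_err);

    return ret;
}
```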
    • F
      Btrfs: use generic_remap_file_range_prep() for cloning and deduplication · 34a28e3d
      Filipe Manana authored
      Since cloning and deduplication are no longer Btrfs specific operations, we
      now have generic code to handle parameter validation, compare file ranges
      used for deduplication, clear capabilities when cloning, etc. This change
      makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
      few bugs, such as:
      
      1) When cloning, the destination file's capabilities were not dropped
         (the fstest generic/513 tests this);
      
      2) We were not checking if the destination file is immutable;
      
      3) Not checking if either the source or destination files are swap
         files (swap file support is coming soon for Btrfs);
      
      4) System limits were not checked (resource limits and O_LARGEFILE).
      
      Note that the generic helper generic_remap_file_range_prep() does start
      and wait for writeback by calling filemap_write_and_wait_range(), however
      that is not enough for Btrfs for two reasons:
      
      1) With compression, we need to start writeback twice in order to get the
         pages marked for writeback and ordered extents created;
      
      2) filemap_write_and_wait_range() (and all its other variants) only waits
         for the IO to complete, but we need to wait for the ordered extents to
         finish, so that when we do the actual reflinking operations the file
         extent items are in the fs tree. This is also important due to the fact
         that the generic helper, for the deduplication case, compares the
         contents of the pages in the requested range, which might require
         reading extents from disk in the very unlikely case that pages get
         invalidated after writeback finishes (so the file extent items must be
         up to date in the fs tree).
      
      Since these reasons are specific to Btrfs we have to do it in the Btrfs
      code before calling generic_remap_file_range_prep(). This also results
      in a simpler way of dealing with existing delalloc in the source/target
      ranges, especially for the deduplication case where we used to lock all
      the pages first and then if we found any delalloc for the range, or
      ordered extent, we would unlock the pages, trigger writeback and wait for
      ordered extents to complete, then lock all the pages again and check if
      deduplication can be done. So now we get a simpler approach: lock the
      inodes, then trigger writeback and then wait for ordered extents to
      complete.
      
      So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
      to eliminate duplicated code, fix a few bugs and benefit from future bug
      fixes done there - for example the recent clone and dedupe bugs involving
      reflinking a partial EOF block got a counterpart fix in the generic
      helper, since it affected all filesystems supporting these operations,
      so we no longer need special checks in Btrfs for them.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      34a28e3d
    • N
      btrfs: Refactor main loop in extent_readpages · 61ed3a14
      Nikolay Borisov authored
      extent_readpages processes all pages in the readlist in batches of 16,
      this is implemented by a single for loop but thanks to an if condition
      the loop does 2 things based on whether we've filled the batch or not.
      Additionally due to the structure of the code there is an additional
      check which deals with partial batches.
      
      Streamline all of this by explicitly using two loops. The outer one is
      used to process all pages while the inner one just fills in the batch
      of 16 (currently). Due to this new structure the code guarantees that
      all pages are processed in the loop hence the code to deal with any
      leftovers is eliminated.
      
      This also enables the compiler to inline __extent_readpages:
      
      	./scripts/bloat-o-meter fs/btrfs/extent_io.o extent_io.for
      
      	add/remove: 0/1 grow/shrink: 1/0 up/down: 660/-820 (-160)
      	Function                                     old     new   delta
      	extent_readpages                             476    1136    +660
      	__extent_readpages                           820       -    -820
      	Total: Before=44315, After=44155, chg -0.36%
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      61ed3a14
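The two-loop structure described above can be sketched in plain C. This is an illustration only, not the btrfs code: pages are modelled as `int` values, and `batch_fn`, `sketch_count_pages()` and `sketch_process_in_batches()` are hypothetical names. The outer loop runs until every page is consumed and the inner loop fills one batch of up to 16, so a final partial batch goes through the same path and no leftover handling is needed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH_SIZE 16 /* batch size named in the commit message */

/* Hypothetical callback that stands in for submitting one batch. */
typedef void (*batch_fn)(const int *batch, size_t nr, void *ctx);

/* Example callback: adds the batch length to a running total in ctx. */
void sketch_count_pages(const int *batch, size_t nr, void *ctx)
{
    (void)batch;
    *(size_t *)ctx += nr;
}

/* Returns the number of batches handed to 'submit'. */
size_t sketch_process_in_batches(const int *pages, size_t nr_pages,
                                 batch_fn submit, void *ctx)
{
    int batch[SKETCH_BATCH_SIZE];
    size_t next = 0;
    size_t batches = 0;

    assert(submit != NULL);

    /* Outer loop: keep going while any page is left. */
    while (next < nr_pages) {
        size_t nr = 0;

        /* Inner loop: fill the batch, stopping early on the last pages. */
        while (nr < SKETCH_BATCH_SIZE && next < nr_pages)
            batch[nr++] = pages[next++];

        submit(batch, nr, ctx);
        batches++;
    }

    return batches;
}
```

With 40 pages this submits batches of 16, 16 and 8 pages.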
    • N
      btrfs: Remove 1st shrink/grow phase from balance · 15c82763
      Nikolay Borisov authored
      The first step of the rebalance process ensures there is 1MiB free on
      each device. This number seems rather small. And in fact when talking to
      the original authors their opinions were:
      
      "man that's a little bonkers"
      "i don't think we even need that code anymore"
      "I think it was there to make sure we had room for the blank 1M at the
      beginning. I bet it goes all the way back to v0"
      "we just don't need any of that tho, i say we just delete it"
      
      Clearly, this piece of code has lost its original intent throughout the
      years. It doesn't really bring any real practical benefits to the
      relocation process.
      
      Additionally, this patch makes the balance process more lightweight by
      removing a pair of shrink/grow operations which are rather expensive for
      heavily populated filesystems. This is mainly due to shrink requiring
      relocating block groups, involving heavy use of the btree.
      
      The intermediate shrink/grow can fail and leave the filesystem in a
      middle state that would need to be changed back by the user.
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      15c82763
    • F
      Btrfs: send, fix race with transaction commits that create snapshots · be6821f8
      Filipe Manana authored
      If we create a snapshot of a snapshot currently being used by a send
      operation, we can end up with send failing unexpectedly (returning
      -ENOENT error to user space for example). The following diagram shows
      how this happens.
      
                  CPU 1                                   CPU2                                CPU3
      
       btrfs_ioctl_send()
        (...)
                                           create_snapshot()
                                            -> creates snapshot of a
                                               root used by the send
                                               task
                                            btrfs_commit_transaction()
                                             create_pending_snapshot()
        __get_inode_info()
         btrfs_search_slot()
          btrfs_search_slot_get_root()
           down_read commit_root_sem
      
           get reference on eb of the
           commit root
            -> eb with bytenr == X
      
           up_read commit_root_sem
      
                                              btrfs_cow_block(root node)
                                               btrfs_free_tree_block()
                                                -> creates delayed ref to
                                                   free the extent
      
                                             btrfs_run_delayed_refs()
                                              -> runs the delayed ref,
                                                 adds extent to
                                                 fs_info->pinned_extents
      
                                             btrfs_finish_extent_commit()
                                              unpin_extent_range()
                                               -> marks extent as free
                                                  in the free space cache
      
                                            transaction commit finishes
      
                                                                             btrfs_start_transaction()
                                                                              (...)
                                                                              btrfs_cow_block()
                                                                               btrfs_alloc_tree_block()
                                                                                btrfs_reserve_extent()
                                                                                 -> allocates extent at
                                                                                    bytenr == X
                                                                                btrfs_init_new_buffer(bytenr X)
                                                                                 btrfs_find_create_tree_block()
                                                                                  alloc_extent_buffer(bytenr X)
                                                                                   find_extent_buffer(bytenr X)
                                                                                    -> returns existing eb,
                                                                                       which the send task got
      
                                                                              (...)
                                                                               -> modifies content of the
                                                                                  eb with bytenr == X
      
          -> uses an eb that now
             belongs to some other
             tree and no longer matches
             the commit root of the
             snapshot, results will be
             unpredictable
      
      The consequences of this race can be various, and can lead to searches in
      the commit root performed by the send task failing unexpectedly (unable to
      find inode items, returning -ENOENT to user space, for example) or not
      failing because an inode item with the same number was added to the tree
      that reused the metadata extent, in which case send can behave incorrectly
      in the worst case or just fail later for some reason.
      
      Fix this by performing a copy of the commit root's extent buffer when doing
      a search in the context of a send operation.
      
      CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e: Btrfs: move get root out of btrfs_search_slot to a helper
      CC: stable@vger.kernel.org # 4.4.x: f9ddfd05: Btrfs: remove unused check of skip_locking
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      be6821f8
    • F
      Btrfs: use nofs context when initializing security xattrs to avoid deadlock · 827aa18e
      Filipe Manana authored
      When initializing the security xattrs, we are holding a transaction handle
      therefore we need to use a GFP_NOFS context in order to avoid a deadlock
      with reclaim in case it's triggered.
      
      Fixes: 39a27ec1 ("btrfs: use GFP_KERNEL for xattr and acl allocations")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      827aa18e
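The kernel's scoped interface for a NOFS context is memalloc_nofs_save() / memalloc_nofs_restore(). The userspace sketch below only models the save/restore pattern those calls follow, so that nested scopes unwind correctly. The flag bit, the global `sketch_task_flags` and all `sketch_*` functions are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NOFS 0x1u /* hypothetical "do not recurse into the fs" bit */

/* Stand-in for per-task flags; a real kernel keeps these per task. */
static unsigned int sketch_task_flags;

/* Enter a NOFS scope and return the previous state of the bit. */
unsigned int sketch_nofs_save(void)
{
    unsigned int old = sketch_task_flags & SKETCH_NOFS;

    sketch_task_flags |= SKETCH_NOFS;
    return old;
}

/* Leave a NOFS scope by putting back whatever the matching save returned. */
void sketch_nofs_restore(unsigned int old)
{
    assert((old & ~SKETCH_NOFS) == 0);
    sketch_task_flags = (sketch_task_flags & ~SKETCH_NOFS) | old;
}

/* An allocation may trigger filesystem reclaim only outside NOFS scopes. */
bool sketch_alloc_may_recurse_into_fs(void)
{
    return !(sketch_task_flags & SKETCH_NOFS);
}
```

Because the caller restores the saved value instead of clearing the bit unconditionally, an inner scope ending does not cancel an outer one that is still active.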
    • J
      btrfs: run delayed items before dropping the snapshot · 0568e82d
      Josef Bacik authored
      With my delayed refs patches in place we started seeing a large amount
      of aborts in __btrfs_free_extent:
      
       BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
       Call Trace:
        ? btrfs_merge_delayed_refs+0xaf/0x340
        __btrfs_run_delayed_refs+0x6ea/0xfc0
        ? btrfs_set_path_blocking+0x31/0x60
        btrfs_run_delayed_refs+0xeb/0x180
        btrfs_commit_transaction+0x179/0x7f0
        ? btrfs_check_space_for_delayed_refs+0x30/0x50
        ? should_end_transaction.isra.19+0xe/0x40
        btrfs_drop_snapshot+0x41c/0x7c0
        btrfs_clean_one_deleted_snapshot+0xb5/0xd0
        cleaner_kthread+0xf6/0x120
        kthread+0xf8/0x130
        ? btree_invalidatepage+0x90/0x90
        ? kthread_bind+0x10/0x10
        ret_from_fork+0x35/0x40
      
      This was because btrfs_drop_snapshot depends on the root not being
      modified while it's dropping the snapshot.  It will unlock the root node
      (and really every node) as it walks down the tree, only to re-lock it
      when it needs to do something.  This is a problem because if we modify
      the tree we could cow a block in our path, which frees our reference to
      that block.  Then once we get back to that shared block we'll free our
      reference to it again, and get ENOENT when trying to lookup our extent
      reference to that block in __btrfs_free_extent.
      
      This is ultimately happening because we have delayed items left to be
      processed for our deleted snapshot _after_ all of the inodes are closed
      for the snapshot.  We only run the delayed inode item if we're deleting
      the inode, and even then we do not run the delayed insertions or delayed
      removals.  These can be run at any point after our final inode does its
      last iput, which is what triggers the snapshot deletion.  We can end up
      with the snapshot deletion happening and then have the delayed items run
      on that file system, resulting in the above problem.
      
      This problem has existed forever, however my patches made it much easier
      to hit as I wake up the cleaner much more often to deal with delayed
      iputs, which made us more likely to start the snapshot dropping work
      before the transaction commits, which is when the delayed items would
      generally be run.  Before, generally speaking, we would run the delayed
      items, commit the transaction, and wakeup the cleaner thread to start
      deleting snapshots, which means we were less likely to hit this problem.
      You could still hit it if you had multiple snapshots to be deleted and
      ended up with lots of delayed items, but it was definitely harder.
      
      Fix for now by simply running all the delayed items before starting to
      drop the snapshot.  We could make this smarter in the future by making
      the delayed items per-root, and then simply drop any delayed items for
      roots that we are going to delete.  But for now just a quick and easy
      solution is the safest.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0568e82d
    • J
      btrfs: catch cow on deleting snapshots · 83354f07
      Josef Bacik authored
      When debugging some weird extent reference bug I suspected that we were
      changing a snapshot while we were deleting it, which could explain my
      bug.  This was indeed what was happening, and this patch helped me
      verify my theory.  It is never correct to modify the snapshot once it's
      being deleted, so mark the root when we are deleting it and make sure we
      complain about it when it happens.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      83354f07
    • Q
      btrfs: extent-tree: cleanup one-shot usage of @blocksize in do_walk_down · 01e0da48
      Qu Wenruo authored
      @blocksize variable in do_walk_down() is only used once, really no need
      to declare it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      01e0da48
    • F
      Btrfs: scrub, move setup of nofs contexts higher in the stack · 7c3c7cb9
      Filipe Manana authored
      Since scrub workers only do memory allocation with GFP_KERNEL when they
      need to perform repair, we can move the recent setup of the nofs context
      up to scrub_handle_errored_block() instead of setting it up down the call
      chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
      removing some duplicate code and comment. So the only paths for which a
      scrub worker can do memory allocations using GFP_KERNEL are the following:
      
       scrub_bio_end_io_worker()
         scrub_block_complete()
           scrub_handle_errored_block()
             lock_full_stripe()
               insert_full_stripe_lock()
                 -> kmalloc with GFP_KERNEL
      
        scrub_bio_end_io_worker()
          scrub_block_complete()
            scrub_handle_errored_block()
              scrub_write_page_to_dev_replace()
                scrub_add_page_to_wr_bio()
                  -> kzalloc with GFP_KERNEL
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7c3c7cb9
    • D
      btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex · 0e94c4f4
      David Sterba authored
      The scrub context is allocated with GFP_KERNEL and called from
      btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
      regarding reclaim that could try to flush filesystem data in order to
      get the memory. And the device_list_mutex is held during superblock
      commit, so this would cause a lockup.
      
      Move the allocation and initialization before any changes that require
      the mutex.
      Signed-off-by: David Sterba <dsterba@suse.com>
      0e94c4f4
    • D
      btrfs: scrub: pass fs_info to scrub_setup_ctx · 92f7ba43
      David Sterba authored
      We can pass fs_info directly as this is the only member of btrfs_device
      that's being used inside scrub_setup_ctx.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      92f7ba43
    • J
      btrfs: fix truncate throttling · 28bad212
      Josef Bacik authored
      We have a bunch of magic to make sure we're throttling delayed refs when
      truncating a file.  Now that we have a delayed refs rsv and a mechanism
      for refilling that reserve simply use that instead of all of this magic.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28bad212
    • J
      btrfs: don't run delayed refs in the end transaction logic · db2462a6
      Josef Bacik authored
      Over the years we have built up a lot of infrastructure to keep delayed
      refs in check, mostly by running them at btrfs_end_transaction() time.
      We have a lot of different maths we do to figure out how much, if we
      should do it inline or async, etc.  This existed because we had no
      feedback mechanism to force the flushing of delayed refs when they
      became a problem.  However with the enospc flushing infrastructure in
      place for flushing delayed refs when they put too much pressure on the
      enospc system we have this problem solved.  Rip out all of this code as
      it is no longer needed.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      db2462a6
    • J
      btrfs: rework btrfs_check_space_for_delayed_refs · 64403612
      Josef Bacik authored
      With the delayed_refs_rsv we now know exactly how much pending
      delayed refs space we need.  This means we can drastically simplify
      btrfs_check_space_for_delayed_refs by simply checking how much space we
      have reserved for the global rsv (which acts as a spill over buffer) and
      the delayed refs rsv.  If our total size is beyond that amount then we
      know it's time to commit the transaction and stop any more delayed refs
      from being generated.
      
      With the introduction of delayed_refs_rsv infrastructure, namely
      btrfs_update_delayed_refs_rsv we now know exactly how much pending
      delayed refs space is required.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      64403612
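One way to read the check described in this commit message, as a userspace sketch: add up what is actually reserved in the global rsv and in the delayed refs rsv, and signal that it is time to commit once the delayed refs rsv wants at least that much. `struct sketch_rsv` and `sketch_should_commit()` are hypothetical, locking is left out, and treating "equal" as "time to commit" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a block reservation. */
struct sketch_rsv {
    uint64_t size;     /* bytes the reservation wants to hold */
    uint64_t reserved; /* bytes actually set aside so far */
};

/*
 * Returns true when pending delayed refs need at least as much space as
 * the delayed refs rsv plus the global rsv (the spill-over buffer) hold.
 */
bool sketch_should_commit(const struct sketch_rsv *global_rsv,
                          const struct sketch_rsv *delayed_refs_rsv)
{
    uint64_t available;

    assert(global_rsv != NULL && delayed_refs_rsv != NULL);

    available = global_rsv->reserved + delayed_refs_rsv->reserved;
    return delayed_refs_rsv->size >= available;
}
```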
    • J
      btrfs: add new flushing states for the delayed refs rsv · 413df725
      Josef Bacik authored
      A nice thing we gain with the delayed refs rsv is the ability to flush
      the delayed refs on demand to deal with enospc pressure.  Add states to
      flush delayed refs on demand, and this will allow us to remove a lot of
      ad-hoc work around checking to see if we should commit the transaction
      to run our delayed refs.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      413df725
    • J
      btrfs: update may_commit_transaction to use the delayed refs rsv · 4c8edbc7
      Josef Bacik authored
      Any space used in the delayed_refs_rsv will be freed up by a transaction
      commit, so instead of just counting the pinned space we also need to
      account for any space in the delayed_refs_rsv when deciding if it will
      make a difference to commit the transaction to satisfy our space
      reservation.  If we have enough bytes to satisfy our reservation ticket
      then we are good to go, otherwise subtract out what space we would gain
      back by committing the transaction and compare that against the pinned
      space to make our decision.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c8edbc7
    • J
      btrfs: introduce delayed_refs_rsv · ba2c4d4e
      Josef Bacik authored
      Traditionally we've had voodoo in btrfs to account for the space that
      delayed refs may take up by having a global_block_rsv.  This works most
      of the time, except when it doesn't.  We've had issues reported and seen
      in production where sometimes the global reserve is exhausted during
      transaction commit before we can run all of our delayed refs, resulting
      in an aborted transaction.  Because of this voodoo we have equally
      dubious flushing semantics around throttling delayed refs which we often
      get wrong.
      
      So instead give them their own block_rsv.  This way we can always know
      exactly how much outstanding space we need for delayed refs.  This
      allows us to make sure we are constantly filling that reservation up
      with space, and allows us to put more precise pressure on the enospc
      system.  Instead of doing math to see if it's a good time to throttle,
      the normal enospc code will be invoked if we have a lot of delayed refs
      pending, and they will be run via the normal flushing mechanism.
      
      For now the delayed_refs_rsv will hold the reservations for the delayed
      refs, the block group updates, and deleting csums.  We could have a
      separate rsv for the block group updates, but the csum deletion stuff is
      still handled via the delayed_refs so that will stay there.
      
      Historical background:
      
      The global reserve has grown to cover everything we don't reserve space
      explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
      if we're running short on space and when it's time to force a commit.  A
      failure rate of 20-40 file systems when we run hundreds of thousands of
      them isn't super high, but cleaning up this code will make things less
      ugly and more predictable.
      
      Thus the delayed refs rsv.  We always know how many delayed refs we have
      outstanding, and although running them generates more we can use the
      global reserve for that spill over, which fits better into its desired
      use than a full blown reservation.  This first approach is to simply
      take how many times we're reserving space for and multiply that by 2 in
      order to save enough space for the delayed refs that could be generated.
      This is a naive approach and will probably evolve, but for now it works.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
      [ added background notes from the cover letter ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      ba2c4d4e
    • J
      btrfs: only track ref_heads in delayed_ref_updates · 158ffa36
      Josef Bacik authored
      We use this number to figure out how many delayed refs to run, but
      __btrfs_run_delayed_refs really only checks every time we need a new
      delayed ref head, so we always run at least one ref head completely no
      matter what the number of items on it.  Fix the accounting to only be
      adjusted when we add/remove a ref head.
      
      In addition to using this number to limit the number of delayed refs
      run, a future patch is also going to use it to calculate the amount of
      space required for delayed refs space reservation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      158ffa36
    • J
      btrfs: cleanup extent_op handling · bedc6617
      Josef Bacik authored
      The cleanup_extent_op function actually would run the extent_op if it
      needed running, which made the name sort of a misnomer.  Change it to
      run_and_cleanup_extent_op, and move the actual cleanup work to
      cleanup_extent_op so it can be used by check_ref_cleanup() in order to
      unify the extent op handling.
      Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bedc6617
    • J
      btrfs: add cleanup_ref_head_accounting helper · 07c47775
      Josef Bacik authored
      We were missing some quota cleanups in check_ref_cleanup, so break the
      ref head accounting cleanup into a helper and call that from both
      check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
      we don't screw up accounting in the future for other things that we add.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      07c47775
    • J
      btrfs: add btrfs_delete_ref_head helper · d7baffda
      Josef Bacik authored
      We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
      into a helper and cleanup the calling functions.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d7baffda
    • J
      btrfs: use PAGE_ALIGNED instead of open-coding it · fdb1e121
      Johannes Thumshirn authored
      When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
      alignment and thus should use the PAGE_ALIGNED() macro instead of
      open-coding it.
      
      Convert all open-coded occurrences of PAGE_ALIGNED().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fdb1e121
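A userspace sketch of the alignment check discussed above, assuming 4 KiB pages. `SKETCH_PAGE_SIZE` and `SKETCH_PAGE_ALIGNED()` are stand-ins written for this example; the kernel's own PAGE_ALIGNED() macro is not reproduced here. The open-coded form and the named form give the same answer, but the named form states the intent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumption: 4 KiB pages */

/* Stand-in for a PAGE_ALIGNED()-style macro: true when the low bits are 0. */
#define SKETCH_PAGE_ALIGNED(var) \
    ((((uint64_t)(var)) & (SKETCH_PAGE_SIZE - 1)) == 0)

/* The open-coded form that the commit message refers to. */
bool sketch_aligned_open_coded(uint64_t var)
{
    /* The mask trick is only valid for a power-of-two page size. */
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);

    return !(var & (SKETCH_PAGE_SIZE - 1));
}
```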
    • J
      btrfs: use offset_in_page instead of open-coding it · 7073017a
      Johannes Thumshirn authored
      Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
      offset into a page.
      
      So replace them by the offset_in_page() macro instead of open-coding it if
      they're not used as an alignment check.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7073017a
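The two open-coded spellings named above compute the same thing, which a small userspace sketch can show. It assumes 4 KiB pages; `SKETCH_PAGE_SIZE`, `SKETCH_PAGE_MASK` and `sketch_offset_in_page()` are illustrative stand-ins rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL                    /* assumption: 4 KiB pages */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1)) /* clears the offset bits */

/* Stand-in for an offset_in_page()-style helper: bytes past the page start. */
uint64_t sketch_offset_in_page(uint64_t pos)
{
    /* The mask trick is only valid for a power-of-two page size. */
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);

    return pos & ~SKETCH_PAGE_MASK;
}
```

For a position of 4096 + 123 the helper yields 123, the same value as `pos & (SKETCH_PAGE_SIZE - 1)`.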
    • D
      btrfs: dev-replace: open code trivial locking helpers · cb5583dd
      David Sterba authored
      The dev-replace locking functions are now trivial wrappers around rw
      semaphore that can be used directly everywhere. No functional change.
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb5583dd