1. 27 12月, 2019 1 次提交
    • F
      Btrfs: send, fix race with transaction commits that create snapshots · 44a9ba6b
      Filipe Manana 提交于
      commit be6821f82c3cc36e026f5afd10249988852b35ea upstream.
      
      If we create a snapshot of a snapshot currently being used by a send
      operation, we can end up with send failing unexpectedly (returning
      -ENOENT error to user space for example). The following diagram shows
      how this happens.
      
                  CPU 1                                   CPU2                                CPU3
      
       btrfs_ioctl_send()
        (...)
                                           create_snapshot()
                                            -> creates snapshot of a
                                               root used by the send
                                               task
                                            btrfs_commit_transaction()
                                             create_pending_snapshot()
        __get_inode_info()
         btrfs_search_slot()
          btrfs_search_slot_get_root()
           down_read commit_root_sem
      
           get reference on eb of the
           commit root
            -> eb with bytenr == X
      
           up_read commit_root_sem
      
                                              btrfs_cow_block(root node)
                                               btrfs_free_tree_block()
                                                -> creates delayed ref to
                                                   free the extent
      
                                             btrfs_run_delayed_refs()
                                              -> runs the delayed ref,
                                                 adds extent to
                                                 fs_info->pinned_extents
      
                                             btrfs_finish_extent_commit()
                                              unpin_extent_range()
                                               -> marks extent as free
                                                  in the free space cache
      
                                            transaction commit finishes
      
                                                                             btrfs_start_transaction()
                                                                              (...)
                                                                              btrfs_cow_block()
                                                                               btrfs_alloc_tree_block()
                                                                                btrfs_reserve_extent()
                                                                                 -> allocates extent at
                                                                                    bytenr == X
                                                                                btrfs_init_new_buffer(bytenr X)
                                                                                 btrfs_find_create_tree_block()
                                                                                  alloc_extent_buffer(bytenr X)
                                                                                   find_extent_buffer(bytenr X)
                                                                                    -> returns existing eb,
                                                                                       which the send task got
      
                                                                              (...)
                                                                               -> modifies content of the
                                                                                  eb with bytenr == X
      
          -> uses an eb that now
             belongs to some other
             tree and no more matches
             the commit root of the
             snapshot, resuts will be
             unpredictable
      
      The consequences of this race can be various, and can lead to searches in
      the commit root performed by the send task failing unexpectedly (unable to
      find inode items, returning -ENOENT to user space, for example) or not
      failing because an inode item with the same number was added to the tree
      that reused the metadata extent, in which case send can behave incorrectly
      in the worst case or just fail later for some reason.
      
      Fix this by performing a copy of the commit root's extent buffer when doing
      a search in the context of a send operation.
      
      CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e: Btrfs: move get root out of btrfs_search_slot to a helper
      CC: stable@vger.kernel.org # 4.4.x: f9ddfd05: Btrfs: remove unused check of skip_locking
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      44a9ba6b
  2. 14 11月, 2018 1 次提交
    • F
      Btrfs: fix deadlock when writing out free space caches · ea9c846f
      Filipe Manana 提交于
      commit 5ce555578e0919237fa4bda92b4670e2dd176f85 upstream.
      
      When writing out a block group free space cache we can end deadlocking
      with ourselves on an extent buffer lock resulting in a warning like the
      following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
          W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallowing the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block creation
      modifies. Once the new extent buffer is allocated, it allows creation of
      pending block groups to happen again.
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/Reported-by: NE V <eliventer@gmail.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea9c846f
  3. 06 8月, 2018 3 次提交
  4. 30 5月, 2018 6 次提交
  5. 17 5月, 2018 1 次提交
    • L
      btrfs: fix reading stale metadata blocks after degraded raid1 mounts · 02a3307a
      Liu Bo 提交于
      If a btree block, aka. extent buffer, is not available in the extent
      buffer cache, it'll be read out from the disk instead, i.e.
      
      btrfs_search_slot()
        read_block_for_search()  # hold parent and its lock, go to read child
          btrfs_release_path()
          read_tree_block()  # read child
      
      Unfortunately, the parent lock got released before reading child, so
      commit 5bdd3536 ("Btrfs: Fix block generation verification race") had
      used 0 as parent transid to read the child block.  It forces
      read_tree_block() not to check if parent transid is different with the
      generation id of the child that it reads out from disk.
      
      A simple PoC is included in btrfs/124,
      
      0. A two-disk raid1 btrfs,
      
      1. Right after mkfs.btrfs, block A is allocated to be device tree's root.
      
      2. Mount this filesystem and put it in use, after a while, device tree's
         root got COW but block A hasn't been allocated/overwritten yet.
      
      3. Umount it and reload the btrfs module to remove both disks from the
         global @fs_devices list.
      
      4. mount -odegraded dev1 and write some data, so now block A is allocated
         to be a leaf in checksum tree.  Note that only dev1 has the latest
         metadata of this filesystem.
      
      5. Umount it and mount it again normally (with both disks), since raid1
         can pick up one disk by the writer task's pid, if btrfs_search_slot()
         needs to read block A, dev2 which does NOT have the latest metadata
         might be read for block A, then we got a stale block A.
      
      6. As parent transid is not checked, block A is marked as uptodate and
         put into the extent buffer cache, so the future search won't bother
         to read disk again, which means it'll make changes on this stale
         one and make it dirty and flush it onto disk.
      
      To avoid the problem, parent transid needs to be passed to
      read_tree_block().
      
      In order to get a valid parent transid, we need to hold the parent's
      lock until finishing reading child.
      
      This patch needs to be slightly adapted for stable kernels, the
      &first_key parameter added to read_tree_block() is from 4.16+
      (581c1760). The fix is to replace 0 by 'gen'.
      
      Fixes: 5bdd3536 ("Btrfs: Fix block generation verification race")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      02a3307a
  6. 14 5月, 2018 1 次提交
    • R
      Btrfs: send, fix invalid access to commit roots due to concurrent snapshotting · 6f2f0b39
      Robbie Ko 提交于
      [BUG]
      btrfs incremental send BUG happens when creating a snapshot of snapshot
      that is being used by send.
      
      [REASON]
      The problem can happen if while we are doing a send one of the snapshots
      used (parent or send) is snapshotted, because snapshoting implies COWing
      the root of the source subvolume/snapshot.
      
      1. When doing an incremental send, the send process will get the commit
         roots from the parent and send snapshots, and add references to them
         through extent_buffer_get().
      
      2. When a snapshot/subvolume is snapshotted, its root node is COWed
         (transaction.c:create_pending_snapshot()).
      
      3. COWing releases the space used by the node immediately, through:
      
         __btrfs_cow_block()
         --btrfs_free_tree_block()
         ----btrfs_add_free_space(bytenr of node)
      
      4. Because send doesn't hold a transaction open, it's possible that
         the transaction used to create the snapshot commits, switches the
         commit root and the old space used by the previous root node gets
         assigned to some other node allocation. Allocation of a new node will
         use the existing extent buffer found in memory, which we previously
         got a reference through extent_buffer_get(), and allow the extent
         buffer's content (pages) to be modified:
      
         btrfs_alloc_tree_block
         --btrfs_reserve_extent
         ----find_free_extent (get bytenr of old node)
         --btrfs_init_new_buffer (use bytenr of old node)
         ----btrfs_find_create_tree_block
         ------alloc_extent_buffer
         --------find_extent_buffer (get old node)
      
      5. So send can access invalid memory content and have unpredictable
         behaviour.
      
      [FIX]
      So we fix the problem by copying the commit roots of the send and
      parent snapshots and use those copies.
      
      CallTrace looks like this:
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ctree.c:1861!
       invalid opcode: 0000 [#1] SMP
       CPU: 6 PID: 24235 Comm: btrfs Tainted: P           O 3.10.105 #23721
       ffff88046652d680 ti: ffff88041b720000 task.ti: ffff88041b720000
       RIP: 0010:[<ffffffffa08dd0e8>] read_node_slot+0x108/0x110 [btrfs]
       RSP: 0018:ffff88041b723b68  EFLAGS: 00010246
       RAX: ffff88043ca6b000 RBX: ffff88041b723c50 RCX: ffff880000000000
       RDX: 000000000000004c RSI: ffff880314b133f8 RDI: ffff880458b24000
       RBP: 0000000000000000 R08: 0000000000000001 R09: ffff88041b723c66
       R10: 0000000000000001 R11: 0000000000001000 R12: ffff8803f3e48890
       R13: ffff8803f3e48880 R14: ffff880466351800 R15: 0000000000000001
       FS:  00007f8c321dc8c0(0000) GS:ffff88047fcc0000(0000)
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       R2: 00007efd1006d000 CR3: 0000000213a24000 CR4: 00000000003407e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
       ffff88041b723c50 ffff8803f3e48880 ffff8803f3e48890 ffff8803f3e48880
       ffff880466351800 0000000000000001 ffffffffa08dd9d7 ffff88041b723c50
       ffff8803f3e48880 ffff88041b723c66 ffffffffa08dde85 a9ff88042d2c4400
       Call Trace:
       [<ffffffffa08dd9d7>] ? tree_move_down.isra.33+0x27/0x50 [btrfs]
       [<ffffffffa08dde85>] ? tree_advance+0xb5/0xc0 [btrfs]
       [<ffffffffa08e83d4>] ? btrfs_compare_trees+0x2d4/0x760 [btrfs]
       [<ffffffffa0982050>] ? finish_inode_if_needed+0x870/0x870 [btrfs]
       [<ffffffffa09841ea>] ? btrfs_ioctl_send+0xeda/0x1050 [btrfs]
       [<ffffffffa094bd3d>] ? btrfs_ioctl+0x1e3d/0x33f0 [btrfs]
       [<ffffffff81111133>] ? handle_pte_fault+0x373/0x990
       [<ffffffff8153a096>] ? atomic_notifier_call_chain+0x16/0x20
       [<ffffffff81063256>] ? set_task_cpu+0xb6/0x1d0
       [<ffffffff811122c3>] ? handle_mm_fault+0x143/0x2a0
       [<ffffffff81539cc0>] ? __do_page_fault+0x1d0/0x500
       [<ffffffff81062f07>] ? check_preempt_curr+0x57/0x90
       [<ffffffff8115075a>] ? do_vfs_ioctl+0x4aa/0x990
       [<ffffffff81034f83>] ? do_fork+0x113/0x3b0
       [<ffffffff812dd7d7>] ? trace_hardirqs_off_thunk+0x3a/0x6c
       [<ffffffff81150cc8>] ? SyS_ioctl+0x88/0xa0
       [<ffffffff8153e422>] ? system_call_fastpath+0x16/0x1b
       ---[ end trace 29576629ee80b2e1 ]---
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 3.6+
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6f2f0b39
  7. 12 4月, 2018 1 次提交
  8. 31 3月, 2018 14 次提交
  9. 26 3月, 2018 1 次提交
  10. 22 1月, 2018 3 次提交
  11. 07 12月, 2017 1 次提交
  12. 30 10月, 2017 2 次提交
  13. 16 8月, 2017 1 次提交
  14. 20 6月, 2017 2 次提交
  15. 09 5月, 2017 1 次提交
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  16. 18 4月, 2017 1 次提交