1. 30 4月, 2019 18 次提交
  2. 25 2月, 2019 6 次提交
    • F
      Btrfs: remove assertion when searching for a key in a node/leaf · 253002f2
      Filipe Manana 提交于
      At ctree.c:key_search(), the assertion that verifies the first key on a
      child extent buffer corresponds to the key at a specific slot in the
      parent has a disadvantage: we effectively hit a BUG_ON() which requires
      rebooting the machine later. It also does not tell any information about
      which extent buffer is affected, from which root, the expected and found
      keys, etc.
      
      However as of commit 581c1760 ("btrfs: Validate child tree block's
      level and first key"), that assertion is not needed since at the time we
      read an extent buffer from disk we validate that its first key matches the
      key, at the respective slot, in the parent extent buffer. Therefore just
      remove the assertion at key_search().
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      253002f2
    • F
      Btrfs: add missing error handling after doing leaf/node binary search · cbca7d59
      Filipe Manana 提交于
      The function map_private_extent_buffer() can return an -EINVAL error, and
      it is called by generic_bin_search() which will return back the error. The
      btrfs_bin_search() function in turn calls generic_bin_search() and the
      key_search() function calls btrfs_bin_search(), so both can return the
      -EINVAL error coming from the map_private_extent_buffer() function. Some
      callers of these functions were ignoring that these functions can return
      an error, so fix them to deal with error return values.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cbca7d59
    • D
      btrfs: merge btrfs_set_lock_blocking_rw with it's caller · 766ece54
      David Sterba 提交于
      The last caller that does not have a fixed value of lock is
      btrfs_set_path_blocking, that actually does the same conditional swtich
      by the lock type so we can merge the branches together and remove the
      helper.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      766ece54
    • D
      btrfs: open code now trivial btrfs_set_lock_blocking · 8bead258
      David Sterba 提交于
      btrfs_set_lock_blocking is now only a simple wrapper around
      btrfs_set_lock_blocking_write. The name does not bring any semantic
      value that could not be inferred from the new function so there's no
      point keeping it.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8bead258
    • D
      btrfs: replace btrfs_set_lock_blocking_rw with appropriate helpers · 300aa896
      David Sterba 提交于
      We can use the right helper where the lock type is a fixed parameter.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      300aa896
    • Q
      btrfs: qgroup: Use delayed subtree rescan for balance · f616f5cd
      Qu Wenruo 提交于
      Before this patch, qgroup code traces the whole subtree of subvolume and
      reloc trees unconditionally.
      
      This makes qgroup numbers consistent, but it could cause tons of
      unnecessary extent tracing, which causes a lot of overhead.
      
      However for subtree swap of balance, just swap both subtrees because
      they contain the same contents and tree structure, so qgroup numbers
      won't change.
      
      It's the race window between subtree swap and transaction commit could
      cause qgroup number change.
      
      This patch will delay the qgroup subtree scan until COW happens for the
      subtree root.
      
      So if there is no other operations for the fs, balance won't cause extra
      qgroup overhead. (best case scenario)
      Depending on the workload, most of the subtree scan can still be
      avoided.
      
      Only for worst case scenario, it will fall back to old subtree swap
      overhead. (scan all swapped subtrees)
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
      And after file system population, there is no other activity, so it
      should be the best case scenario.
      
                           | v4.20-rc1            | w/ patchset    | diff
      -----------------------------------------------------------------------
      relocated extents    | 22615                | 22457          | -0.1%
      qgroup dirty extents | 163457               | 121606         | -25.6%
      time (sys)           | 22.884s              | 18.842s        | -17.6%
      time (real)          | 27.724s              | 22.884s        | -17.5%
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f616f5cd
  3. 28 1月, 2019 1 次提交
    • F
      Btrfs: fix deadlock when allocating tree block during leaf/node split · a6279470
      Filipe Manana 提交于
      When splitting a leaf or node from one of the trees that are modified when
      flushing pending block groups (extent, chunk, device and free space trees),
      we need to allocate a new tree block, which in turn can result in the need
      to allocate a new block group. After allocating the new block group we may
      need to flush new block groups that were previously allocated during the
      course of the current transaction, which is what may cause a deadlock due
      to attempts to write lock twice the same leaf or node, as when splitting
      a leaf or node we are holding a write lock on it and its parent node.
      
      The same type of deadlock can also happen when increasing the tree's
      height, since we are holding a lock on the existing root while allocating
      the tree block to use as the new root node.
      
      An example trace when the deadlock happens during the leaf split path is:
      
        [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G        W         4.19.16 #1
        [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018
        [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
        (...)
        [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246
        [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001
        [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48
        [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000
        [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000
        [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18
        [27175.303601] FS:  0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000
        [27175.304468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0
        [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [27175.307940] Call Trace:
        [27175.308802]  btrfs_search_slot+0x779/0x9a0 [btrfs]
        [27175.309669]  ? update_space_info+0xba/0xe0 [btrfs]
        [27175.310534]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
        [27175.311397]  btrfs_insert_item+0x60/0xd0 [btrfs]
        [27175.312253]  btrfs_create_pending_block_groups+0xee/0x210 [btrfs]
        [27175.313116]  do_chunk_alloc+0x25f/0x300 [btrfs]
        [27175.313984]  find_free_extent+0x706/0x10d0 [btrfs]
        [27175.314855]  btrfs_reserve_extent+0x9b/0x1d0 [btrfs]
        [27175.315707]  btrfs_alloc_tree_block+0x100/0x5b0 [btrfs]
        [27175.316548]  split_leaf+0x130/0x610 [btrfs]
        [27175.317390]  btrfs_search_slot+0x94d/0x9a0 [btrfs]
        [27175.318235]  btrfs_insert_empty_items+0x67/0xc0 [btrfs]
        [27175.319087]  alloc_reserved_file_extent+0x84/0x2c0 [btrfs]
        [27175.319938]  __btrfs_run_delayed_refs+0x596/0x1150 [btrfs]
        [27175.320792]  btrfs_run_delayed_refs+0xed/0x1b0 [btrfs]
        [27175.321643]  delayed_ref_async_start+0x81/0x90 [btrfs]
        [27175.322491]  normal_work_helper+0xd0/0x320 [btrfs]
        [27175.323328]  ? move_linked_works+0x6e/0xa0
        [27175.324160]  process_one_work+0x191/0x370
        [27175.324976]  worker_thread+0x4f/0x3b0
        [27175.325763]  kthread+0xf8/0x130
        [27175.326531]  ? rescuer_thread+0x320/0x320
        [27175.327284]  ? kthread_create_worker_on_cpu+0x50/0x50
        [27175.328027]  ret_from_fork+0x35/0x40
        [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]---
      
      Fix this by preventing the flushing of new blocks groups when splitting a
      leaf/node and when inserting a new root node for one of the trees modified
      by the flushing operation, similar to what is done when COWing a node/leaf
      from on of these trees.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383Reported-by: NEli V <eliventer@gmail.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a6279470
  4. 09 1月, 2019 1 次提交
  5. 17 12月, 2018 8 次提交
    • A
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini 提交于
      The typos accumulate over time so once in a while time they get fixed in
      a large patch.
      Signed-off-by: NAndrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      52042d8e
    • F
      Btrfs: send, fix race with transaction commits that create snapshots · be6821f8
      Filipe Manana 提交于
      If we create a snapshot of a snapshot currently being used by a send
      operation, we can end up with send failing unexpectedly (returning
      -ENOENT error to user space for example). The following diagram shows
      how this happens.
      
                  CPU 1                                   CPU2                                CPU3
      
       btrfs_ioctl_send()
        (...)
                                           create_snapshot()
                                            -> creates snapshot of a
                                               root used by the send
                                               task
                                            btrfs_commit_transaction()
                                             create_pending_snapshot()
        __get_inode_info()
         btrfs_search_slot()
          btrfs_search_slot_get_root()
           down_read commit_root_sem
      
           get reference on eb of the
           commit root
            -> eb with bytenr == X
      
           up_read commit_root_sem
      
                                              btrfs_cow_block(root node)
                                               btrfs_free_tree_block()
                                                -> creates delayed ref to
                                                   free the extent
      
                                             btrfs_run_delayed_refs()
                                              -> runs the delayed ref,
                                                 adds extent to
                                                 fs_info->pinned_extents
      
                                             btrfs_finish_extent_commit()
                                              unpin_extent_range()
                                               -> marks extent as free
                                                  in the free space cache
      
                                            transaction commit finishes
      
                                                                             btrfs_start_transaction()
                                                                              (...)
                                                                              btrfs_cow_block()
                                                                               btrfs_alloc_tree_block()
                                                                                btrfs_reserve_extent()
                                                                                 -> allocates extent at
                                                                                    bytenr == X
                                                                                btrfs_init_new_buffer(bytenr X)
                                                                                 btrfs_find_create_tree_block()
                                                                                  alloc_extent_buffer(bytenr X)
                                                                                   find_extent_buffer(bytenr X)
                                                                                    -> returns existing eb,
                                                                                       which the send task got
      
                                                                              (...)
                                                                               -> modifies content of the
                                                                                  eb with bytenr == X
      
          -> uses an eb that now
             belongs to some other
             tree and no more matches
             the commit root of the
             snapshot, resuts will be
             unpredictable
      
      The consequences of this race can be various, and can lead to searches in
      the commit root performed by the send task failing unexpectedly (unable to
      find inode items, returning -ENOENT to user space, for example) or not
      failing because an inode item with the same number was added to the tree
      that reused the metadata extent, in which case send can behave incorrectly
      in the worst case or just fail later for some reason.
      
      Fix this by performing a copy of the commit root's extent buffer when doing
      a search in the context of a send operation.
      
      CC: stable@vger.kernel.org # 4.4.x: 1fc28d8e: Btrfs: move get root out of btrfs_search_slot to a helper
      CC: stable@vger.kernel.org # 4.4.x: f9ddfd05: Btrfs: remove unused check of skip_locking
      CC: stable@vger.kernel.org # 4.4.x
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      be6821f8
    • J
      btrfs: catch cow on deleting snapshots · 83354f07
      Josef Bacik 提交于
      When debugging some weird extent reference bug I suspected that we were
      changing a snapshot while we were deleting it, which could explain my
      bug.  This was indeed what was happening, and this patch helped me
      verify my theory.  It is never correct to modify the snapshot once it's
      being deleted, so mark the root when we are deleting it and make sure we
      complain about it when it happens.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      83354f07
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov 提交于
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de37aa51
    • N
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov 提交于
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      as well as a new incompat flag is set METADATA_UUID. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. Is the UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metdata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7239ff4b
    • N
      btrfs: Remove extra reference count bumps in btrfs_compare_trees · 8c7eeb65
      Nikolay Borisov 提交于
      When the 2 comparison trees roots are initialised they are private to
      the function and already have reference counts of 1 each. There is no
      need to further increment the reference count since the cloned buffers
      are already accessed via struct btrfs_path. Eventually the 2 paths used
      for comparison are going to be released, effectively disposing of the
      cloned buffers.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c7eeb65
    • N
      btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind · 24cee18a
      Nikolay Borisov 提交于
      When a rewound buffer is created it already has a ref count of 1 and the
      dummy flag set. Then another ref is taken bumping the count to 2.
      Finally when this buffer is released from btrfs_release_path the extra
      reference is decremented by the special handling code in
      free_extent_buffer.
      
      However, this special code is in fact redundant sinca ref count of 1 is
      still correct since the buffer is only accessed via btrfs_path struct.
      This paves the way forward of removing the special handling in
      free_extent_buffer.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      24cee18a
    • N
      btrfs: Remove redundant extent_buffer_get in get_old_root · 6c122e2a
      Nikolay Borisov 提交于
      get_old_root used used only by btrfs_search_old_slot to initialise the
      path structure. The old root is always a cloned buffer (either via alloc
      dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1
      from allocation, 1 from extent_buffer_get call in get_old_root.
      
      This latter explicit ref count acquire operation is in fact unnecessary
      since the semantic is such that the newly allocated buffer is handed
      over to the btrfs_path for lifetime management. Considering this just
      remove the extra extent_buffer_get in get_old_root.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6c122e2a
  6. 17 10月, 2018 1 次提交
    • F
      Btrfs: fix deadlock when writing out free space caches · 5ce55557
      Filipe Manana 提交于
      When writing out a block group free space cache we can end deadlocking
      with ourselves on an extent buffer lock resulting in a warning like the
      following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
          W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallowing the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block creation
      modifies. Once the new extent buffer is allocated, it allows creation of
      pending block groups to happen again.
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/Reported-by: NE V <eliventer@gmail.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5ce55557
  7. 15 10月, 2018 5 次提交
    • L
      Btrfs: kill btrfs_clear_path_blocking · 52398340
      Liu Bo 提交于
      Btrfs's btree locking has two modes, spinning mode and blocking mode,
      while searching btree, locking is always acquired in spinning mode and
      then converted to blocking mode if necessary, and in some hot paths we may
      switch the locking back to spinning mode by btrfs_clear_path_blocking().
      
      When acquiring locks, both of reader and writer need to wait for blocking
      readers and writers to complete before doing read_lock()/write_lock().
      
      The problem is that btrfs_clear_path_blocking() needs to switch nodes
      in the path to blocking mode at first (by btrfs_set_path_blocking) to
      make lockdep happy before doing its actual clearing blocking job.
      
      When switching to blocking mode from spinning mode, it consists of
      
      step 1) bumping up blocking readers counter and
      step 2) read_unlock()/write_unlock(),
      
      this has caused serious ping-pong effect if there're a great amount of
      concurrent readers/writers, as waiters will be woken up and go to
      sleep immediately.
      
      1) Killing this kind of ping-pong results in a big improvement in my 1600k
      files creation script,
      
      MNT=/mnt/btrfs
      mkfs.btrfs -f /dev/sdf
      mount /dev/def $MNT
      time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
              -d  $MNT/0  -d  $MNT/1 \
              -d  $MNT/2  -d  $MNT/3 \
              -d  $MNT/4  -d  $MNT/5 \
              -d  $MNT/6  -d  $MNT/7 \
              -d  $MNT/8  -d  $MNT/9 \
              -d  $MNT/10  -d  $MNT/11 \
              -d  $MNT/12  -d  $MNT/13 \
              -d  $MNT/14  -d  $MNT/15
      
      w/o patch:
      real    2m27.307s
      user    0m12.839s
      sys     13m42.831s
      
      w/ patch:
      real    1m2.273s
      user    0m15.802s
      sys     8m16.495s
      
      1.1) latency histogram from funclatency[1]
      
      Overall with the patch, there're ~50% less write lock acquisition and
      the 95% max latency that write lock takes also reduces to ~100ms from
      >500ms.
      
      --------------------------------------------
      w/o patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 2385222  |****************************************|
               2 -> 3          : 37147    |                                        |
               4 -> 7          : 20452    |                                        |
               8 -> 15         : 13131    |                                        |
              16 -> 31         : 3877     |                                        |
              32 -> 63         : 3900     |                                        |
              64 -> 127        : 2612     |                                        |
             128 -> 255        : 974      |                                        |
             256 -> 511        : 165      |                                        |
             512 -> 1023       : 13       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6743860  |****************************************|
               2 -> 3          : 2146     |                                        |
               4 -> 7          : 190      |                                        |
               8 -> 15         : 38       |                                        |
              16 -> 31         : 4        |                                        |
      
      --------------------------------------------
      w/ patch:
      --------------------------------------------
      Function = btrfs_tree_lock
           msecs               : count     distribution
               0 -> 1          : 1318454  |****************************************|
               2 -> 3          : 6800     |                                        |
               4 -> 7          : 3664     |                                        |
               8 -> 15         : 2145     |                                        |
              16 -> 31         : 809      |                                        |
              32 -> 63         : 219      |                                        |
              64 -> 127        : 10       |                                        |
      
      Function = btrfs_tree_read_lock
           msecs               : count     distribution
               0 -> 1          : 6854317  |****************************************|
               2 -> 3          : 2383     |                                        |
               4 -> 7          : 601      |                                        |
               8 -> 15         : 92       |                                        |
      
      2) dbench also proves the improvement,
      dbench -t 120 -D /mnt/btrfs 16
      
      w/o patch:
      Throughput 158.363 MB/sec
      
      w/ patch:
      Throughput 449.52 MB/sec
      
      3) xfstests didn't show any additional failures.
      
      One thing to note is that callers may set path->leave_spinning to have
      all nodes in the path stay in spinning mode, which means callers are
      ready to not sleep before releasing the path, but it won't cause
      problems if they don't want to sleep in blocking mode.
      
      [1]: https://github.com/iovisor/bcc/blob/master/tools/funclatency.pySigned-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      52398340
    • N
      btrfs: handle error of get_old_root · 315bed43
      Nikolay Borisov 提交于
      In btrfs_search_old_slot get_old_root is always used with the assumption
      it cannot fail. However, this is not true in rare circumstance it can
      fail and return null. This will lead to null point dereference when the
      header is read. Fix this by checking the return value and properly
      handling NULL by setting ret to -EIO and returning gracefully.
      
      Coverity-id: 1087503
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      315bed43
    • L
      Btrfs: remove unnecessary level check in balance_level · 98e6b1eb
      Liu Bo 提交于
      In the callchain:
      
      btrfs_search_slot()
         if (level != 0)
            setup_nodes_for_search()
                balance_level()
      
      It is just impossible to have level=0 in balance_level, we can drop the
      check.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98e6b1eb
    • L
      Btrfs: do not unnecessarily pass write_lock_level when processing leaf · 4b6f8e96
      Liu Bo 提交于
      As we're going to return right after the call, it's not necessary to get
      update the new write_lock_level from unlock_up.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4b6f8e96
    • M
      btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro 提交于
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objecitd;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objecitd' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fd786e6