1. 30 4月, 2019 6 次提交
  2. 08 3月, 2019 1 次提交
  3. 27 2月, 2019 1 次提交
    • J
      btrfs: check for refs on snapshot delete resume · 78c52d9e
      Josef Bacik 提交于
      There's a bug in snapshot deletion where we won't update the
      drop_progress key if we're in the UPDATE_BACKREF stage.  This is a
      problem because we could drop refs for blocks we know don't belong to
      ours.  If we crash or umount at the right time we could experience
      messages such as the following when snapshot deletion resumes
      
       BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258  owner 1 offset 0
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       CPU: 3 PID: 16052 Comm: umount Tainted: G        W  OE     5.0.0-rc4+ #147
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
       RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
       RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
       RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
       R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
       FS:  00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
       Call Trace:
       ? _raw_spin_unlock+0x27/0x40
       ? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
       __btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
       ? join_transaction+0x2b/0x460 [btrfs]
       btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
       btrfs_commit_transaction+0x52/0xa50 [btrfs]
       ? start_transaction+0xa6/0x510 [btrfs]
       btrfs_sync_fs+0x79/0x1c0 [btrfs]
       sync_filesystem+0x70/0x90
       generic_shutdown_super+0x27/0x120
       kill_anon_super+0x12/0x30
       btrfs_kill_super+0x16/0xa0 [btrfs]
       deactivate_locked_super+0x43/0x70
       deactivate_super+0x40/0x60
       cleanup_mnt+0x3f/0x80
       __cleanup_mnt+0x12/0x20
       task_work_run+0x8b/0xc0
       exit_to_usermode_loop+0xce/0xd0
       do_syscall_64+0x20b/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To fix this simply mark dead roots we read from disk as DEAD and then
      set the walk_control->restarted flag so we know we have a restarted
      deletion.  From here whenever we try to drop refs for blocks we check to
      verify our ref is set on them, and if it is not we skip it.  Once we
      find a ref that is set we unset walk_control->restarted since the tree
      should be in a normal state from then on, and any problems we run into
      from there are different issues.  I tested this with an existing broken
      fs and my reproducer that creates a broken fs and it fixed both file
      systems.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78c52d9e
  4. 25 2月, 2019 10 次提交
    • D
      btrfs: scrub: remove unused nocow worker pointer · 0ea82076
      David Sterba 提交于
      The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
      optimization was removed from scrub in 9bebe665 ("btrfs: scrub:
      Remove unused copy_nocow_pages and its callchain").
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0ea82076
    • D
      btrfs: scrub: add assertions for worker pointers · c8352942
      David Sterba 提交于
      The scrub worker pointers are not NULL iff the scrub is running, so
      reset them back once the last reference is dropped. Add assertions to
      the initial phase of scrub to verify that.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c8352942
    • A
      btrfs: scrub: convert scrub_workers_refcnt to refcount_t · ff09c4ca
      Anand Jain 提交于
      Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
      we get the extra checks. All reference changes are still done under
      scrub_lock.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff09c4ca
    • J
      btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik 提交于
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      450114fc
    • J
      btrfs: replace cleaner_delayed_iput_mutex with a waitqueue · 034f784d
      Josef Bacik 提交于
      The throttle path doesn't take cleaner_delayed_iput_mutex, which means
      we could think we're done flushing iputs in the data space reservation
      path when we could have a throttler doing an iput.  There's no real
      reason to serialize the delayed iput flushing, so instead of taking the
      cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
      replace it with an atomic counter and a waitqueue.  This removes the
      short (or long depending on how big the inode is) window where we think
      there are no more pending iputs when there really are some.
      
      The waiting is killable as it could be indirectly called from user
      operations like fallocate or zero-range. Such call sites should handle
      the error but otherwise it's not necessary. Eg. flush_space just needs
      to attempt to make space by waiting on iputs.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add killable comment and changelog parts ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      034f784d
    • A
      btrfs: let the assertion expression compile in all configs · 2eec5f00
      Anders Roxell 提交于
      A compiler warning (in a patch in development) pointed to a variable
      that was used only inside and ASSERT:
      
        u64 root_objectid = root->root_key.objectid;
        ASSERT(root_objectid == ...);
      
        fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
        fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
          u64 root_objectid = root->root_key.objectid;
      	^~~~~~~~~~~~~
      
      When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.
      
      Rework the assertion helper by adding a runtime check instead of the
      '#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
      condition being passed into an inline function after preprocessing.
      Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2eec5f00
    • Q
      btrfs: qgroup: Introduce per-root swapped blocks infrastructure · 370a11b8
      Qu Wenruo 提交于
      To allow delayed subtree swap rescan, btrfs needs to record per-root
      information about which tree blocks get swapped.  This patch introduces
      the required infrastructure.
      
      The designed workflow will be:
      
      1) Record the subtree root block that gets swapped.
      
         During subtree swap:
         O = Old tree blocks
         N = New tree blocks
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               NA     OB                          OA      OB
             /  |     |  \                      /  |      |  \
           NC  ND     OE  OF                   OC  OD     OE  OF
      
        In this case, NA and OA are going to be swapped, record (NA, OA) into
        subvolume tree X.
      
      2) After subtree swap.
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               OA     OB                          NA      OB
             /  |     |  \                      /  |      |  \
           OC  OD     OE  OF                   NC  ND     OE  OF
      
      3a) COW happens for OB
          If we are going to COW tree block OB, we check OB's bytenr against
          tree X's swapped_blocks structure.
          If it doesn't fit any, nothing will happen.
      
      3b) COW happens for NA
          Check NA's bytenr against tree X's swapped_blocks, and get a hit.
          Then we do subtree scan on both subtrees OA and NA.
          Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
      
          Then no matter what we do to subvolume tree X, qgroup numbers will
          still be correct.
          Then NA's record gets removed from X's swapped_blocks.
      
      4)  Transaction commit
          Any record in X's swapped_blocks gets removed, since there is no
          modification to swapped subtrees, no need to trigger heavy qgroup
          subtree rescan for them.
      
      This will introduce 128 bytes overhead for each btrfs_root even qgroup
      is not enabled. This is to reduce memory allocations and potential
      failures.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      370a11b8
    • Q
      btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots · d2311e69
      Qu Wenruo 提交于
      Relocation code will drop btrfs_root::reloc_root as soon as
      merge_reloc_root() finishes.
      
      However later qgroup code will need to access btrfs_root::reloc_root
      after merge_reloc_root() for delayed subtree rescan.
      
      So alter the timming of resetting btrfs_root:::reloc_root, make it
      happens after transaction commit.
      
      With this patch, we will introduce a new btrfs_root::state,
      BTRFS_ROOT_DEAD_RELOC_TREE, to info part of btrfs_root::reloc_tree user
      that although btrfs_root::reloc_tree is still non-NULL, but still it's
      not used any more.
      
      The lifespan of btrfs_root::reloc tree will become:
                Old behavior            |              New
      ------------------------------------------------------------------------
      btrfs_init_reloc_root()      ---  | btrfs_init_reloc_root()      ---
        set reloc_root              |   |   set reloc_root              |
                                    |   |                               |
                                    |   |                               |
      merge_reloc_root()            |   | merge_reloc_root()            |
      |- btrfs_update_reloc_root() ---  | |- btrfs_update_reloc_root() -+-
           clear btrfs_root::reloc_root |      set ROOT_DEAD_RELOC_TREE |
                                        |      record root into dirty   |
                                        |      roots rbtree             |
                                        |                               |
                                        | reloc_block_group() Or        |
                                        | btrfs_recover_relocation()    |
                                        | | After transaction commit    |
                                        | |- clean_dirty_subvols()     ---
                                        |     clear btrfs_root::reloc_root
      
      During ROOT_DEAD_RELOC_TREE set lifespan, the only user of
      btrfs_root::reloc_tree should be qgroup.
      
      Since reloc root needs a longer life-span, this patch will also delay
      btrfs_drop_snapshot() call.
      Now btrfs_drop_snapshot() is called in clean_dirty_subvols().
      
      This patch will increase the size of btrfs_root by 16 bytes.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d2311e69
    • N
      btrfs: Remove unused arguments from btrfs_get_extent_fiemap · 4ab47a8d
      Nikolay Borisov 提交于
      This function is a simple wrapper over btrfs_get_extent that returns
      either:
      
      a) A real extent in the passed range or
      b) Adjusted extent based on whether delalloc bytes are found backing up
         a hole.
      
      To support these semantics it doesn't need the page/pg_offset/create
      arguments which are passed to btrfs_get_extent in case an extent is to
      be created. So simplify the function by removing the unused arguments.
      No functional changes.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4ab47a8d
    • N
      btrfs: Make first argument of btrfs_run_delalloc_range directly an inode · bc9a8bf7
      Nikolay Borisov 提交于
      Since this function is no longer a callback there is no need to have
      its first argument obfuscated with a void *. Change it directly to a
      pointer to an inode. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc9a8bf7
  5. 19 1月, 2019 2 次提交
    • J
      btrfs: wakeup cleaner thread when adding delayed iput · fd340d0f
      Josef Bacik 提交于
      The cleaner thread usually takes care of delayed iputs, with the
      exception of the btrfs_end_transaction_throttle path.  Delaying iputs
      means we are potentially delaying the eviction of an inode and it's
      respective space.  The cleaner thread only gets woken up every 30
      seconds, or when we require space.  If there are a lot of inodes that
      need to be deleted we could induce a serious amount of latency while we
      wait for these inodes to be evicted.  So instead wakeup the cleaner if
      it's not already awake to process any new delayed iputs we add to the
      list.  If we suddenly need space we will less likely be backed up
      behind a bunch of inodes that are waiting to be deleted, and we could
      possibly free space before we need to get into the flushing logic which
      will save us some latency.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd340d0f
    • J
      btrfs: handle delayed ref head accounting cleanup in abort · 31890da0
      Josef Bacik 提交于
      We weren't doing any of the accounting cleanup when we aborted
      transactions.  Fix this by making cleanup_ref_head_accounting global and
      calling it from the abort code, this fixes the issue where our
      accounting was all wrong after the fs aborts.
      
      The test generic/475 on a 2G VM can trigger the problems eg.:
      
        [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs]
        [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394
        [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014
        [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs]
        [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206
        [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000
        [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400
        [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000
        [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108
        [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100
        [ 8502.170296] FS:  00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000
        [ 8502.172439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0
        [ 8502.175094] Call Trace:
        [ 8502.175759]  close_ctree+0x17f/0x350 [btrfs]
        [ 8502.176721]  generic_shutdown_super+0x64/0x100
        [ 8502.177702]  kill_anon_super+0x14/0x30
        [ 8502.178607]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [ 8502.179602]  deactivate_locked_super+0x29/0x60
        [ 8502.180595]  cleanup_mnt+0x3b/0x70
        [ 8502.181406]  task_work_run+0x98/0xc0
        [ 8502.182255]  exit_to_usermode_loop+0x83/0x90
        [ 8502.183113]  do_syscall_64+0x15b/0x180
        [ 8502.183919]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Corresponding to
      
        release_global_block_rsv() {
        ...
        WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
      
      CC: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add log dump ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      31890da0
  6. 22 12月, 2018 1 次提交
    • A
      btrfs: sanitize security_mnt_opts use · a65001e8
      Al Viro 提交于
      1) keeping a copy in btrfs_fs_info is completely pointless - we never
      use it for anything.  Getting rid of that allows for simpler calling
      conventions for setup_security_options() (caller is responsible for
      freeing mnt_opts in all cases).
      
      2) on remount we want to use ->sb_remount(), not ->sb_set_mnt_opts(),
      same as we would if not for FS_BINARY_MOUNTDATA.  Behaviours *are*
      close (in fact, selinux sb_set_mnt_opts() ought to punt to
      sb_remount() in "already initialized" case), but let's handle
      that uniformly.  And the only reason why the original btrfs changes
      didn't go for security_sb_remount() in btrfs_remount() case is that
      it hadn't been exported.  Let's export it for a while - it'll be
      going away soon anyway.
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a65001e8
  7. 17 12月, 2018 19 次提交
    • J
      btrfs: catch cow on deleting snapshots · 83354f07
      Josef Bacik 提交于
      When debugging some weird extent reference bug I suspected that we were
      changing a snapshot while we were deleting it, which could explain my
      bug.  This was indeed what was happening, and this patch helped me
      verify my theory.  It is never correct to modify the snapshot once it's
      being deleted, so mark the root when we are deleting it and make sure we
      complain about it when it happens.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      83354f07
    • J
      btrfs: rework btrfs_check_space_for_delayed_refs · 64403612
      Josef Bacik 提交于
      Now with the delayed_refs_rsv we can now know exactly how much pending
      delayed refs space we need.  This means we can drastically simplify
      btrfs_check_space_for_delayed_refs by simply checking how much space we
      have reserved for the global rsv (which acts as a spill over buffer) and
      the delayed refs rsv.  If our total size is beyond that amount then we
      know it's time to commit the transaction and stop any more delayed refs
      from being generated.
      
      With the introduction of dealyed_refs_rsv infrastructure, namely
      btrfs_update_delayed_refs_rsv we now know exactly how much pending
      delayed refs space is required.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      64403612
    • J
      btrfs: add new flushing states for the delayed refs rsv · 413df725
      Josef Bacik 提交于
      A nice thing we gain with the delayed refs rsv is the ability to flush
      the delayed refs on demand to deal with enospc pressure.  Add states to
      flush delayed refs on demand, and this will allow us to remove a lot of
      ad-hoc work around checking to see if we should commit the transaction
      to run our delayed refs.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      413df725
    • J
      btrfs: introduce delayed_refs_rsv · ba2c4d4e
      Josef Bacik 提交于
      Traditionally we've had voodoo in btrfs to account for the space that
      delayed refs may take up by having a global_block_rsv.  This works most
      of the time, except when it doesn't.  We've had issues reported and seen
      in production where sometimes the global reserve is exhausted during
      transaction commit before we can run all of our delayed refs, resulting
      in an aborted transaction.  Because of this voodoo we have equally
      dubious flushing semantics around throttling delayed refs which we often
      get wrong.
      
      So instead give them their own block_rsv.  This way we can always know
      exactly how much outstanding space we need for delayed refs.  This
      allows us to make sure we are constantly filling that reservation up
      with space, and allows us to put more precise pressure on the enospc
      system.  Instead of doing math to see if its a good time to throttle,
      the normal enospc code will be invoked if we have a lot of delayed refs
      pending, and they will be run via the normal flushing mechanism.
      
      For now the delayed_refs_rsv will hold the reservations for the delayed
      refs, the block group updates, and deleting csums.  We could have a
      separate rsv for the block group updates, but the csum deletion stuff is
      still handled via the delayed_refs so that will stay there.
      
      Historical background:
      
      The global reserve has grown to cover everything we don't reserve space
      explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
      if we're running short on space and when it's time to force a commit.  A
      failure rate of 20-40 file systems when we run hundreds of thousands of
      them isn't super high, but cleaning up this code will make things less
      ugly and more predictible.
      
      Thus the delayed refs rsv.  We always know how many delayed refs we have
      outstanding, and although running them generates more we can use the
      global reserve for that spill over, which fits better into it's desired
      use than a full blown reservation.  This first approach is to simply
      take how many times we're reserving space for and multiply that by 2 in
      order to save enough space for the delayed refs that could be generated.
      This is a niave approach and will probably evolve, but for now it works.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
      [ added background notes from the cover letter ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ba2c4d4e
    • D
      btrfs: dev-replace: remove custom read/write blocking scheme · 53176dde
      David Sterba 提交于
      After the rw semaphore has been added, the custom blocking using
      ::blocking_readers and ::read_lock_wq is redundant.
      
      The blocking logic in __btrfs_map_block is replaced by extending the
      time the semaphore is held, that has the same blocking effect on writes
      as the previous custom scheme that waited until ::blocking_readers was
      zero.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53176dde
    • D
      btrfs: dev-replace: swich locking to rw semaphore · 129827e3
      David Sterba 提交于
      This is the first part of removing the custom locking and waiting scheme
      used for device replace. It was probably copied from extent buffer
      locking, but there's nothing that would require more than is provided by
      the common locking primitives.
      
      The rw spinlock protects waiting tasks counter in case of incompatible
      locks and the waitqueue. Same as rw semaphore.
      
      This patch only switches the locking primitive, for better
      bisectability.  There should be no functional change other than the
      overhead of the locking and potential sleeping instead of spinning when
      the lock is contended.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      129827e3
    • D
      btrfs: drop extra enum initialization where using defaults · bbe339cc
      David Sterba 提交于
      The first auto-assigned value to enum is 0, we can use that and not
      initialize all members where the auto-increment does the same. This is
      used for values that are not part of on-disk format.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bbe339cc
    • D
      btrfs: switch BTRFS_ROOT_* to enums · 61fa90c1
      David Sterba 提交于
      We can use simple enum for values that are not part of on-disk format:
      root tree flags.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      61fa90c1
    • D
      btrfs: switch BTRFS_FS_* to enums · eb1a524c
      David Sterba 提交于
      We can use simple enum for values that are not part of on-disk format:
      internal filesystem states.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eb1a524c
    • D
      btrfs: switch BTRFS_BLOCK_RSV_* to enums · 688a75b9
      David Sterba 提交于
      We can use simple enum for values that are not part of on-disk format:
      block reserve types.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      688a75b9
    • D
      btrfs: switch BTRFS_FS_STATE_* to enums · b00146b5
      David Sterba 提交于
      We can use simple enum for values that are not part of on-disk format:
      global filesystem states.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b00146b5
    • N
      btrfs: Refactor btrfs_merge_bio_hook · da12fe54
      Nikolay Borisov 提交于
      This function really checks whether adding more data to the bio will
      straddle a stripe/chunk. So first let's give it a more appropraite name
      - btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
      used to just remove it. Thirdly, pages are submitted to either btree or
      data inodes so it's guaranteed that tree->ops is set so replace the
      check with an ASSERT. Finally, document the parameters of the function.
      No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      da12fe54
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov 提交于
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de37aa51
    • N
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov 提交于
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      as well as a new incompat flag is set METADATA_UUID. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. Is the UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metdata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7239ff4b
    • J
      btrfs: introduce EXPORT_FOR_TESTS macro · f8f591df
      Johannes Thumshirn 提交于
      Depending on whether CONFIG_BTRFS_FS_RUN_SANITY_TESTS is set, some BTRFS
      functions are either local to the file they are implemented in and thus
      should be declared static or are called from within the test
      implementation defined in a different file.
      
      Introduce an EXPORT_FOR_TESTS macro which depending on
      CONFIG_BTRFS_FS_RUN_SANITY_TESTS either adds the 'static' keyword to a
      function or not.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f8f591df
    • E
      btrfs: use tagged writepage to mitigate livelock of snapshot · 3cd24c69
      Ethan Lien 提交于
      Snapshot is expected to be fast. But if there are writers steadily
      creating dirty pages in our subvolume, the snapshot may take a very long
      time to complete. To fix the problem, we use tagged writepage for
      snapshot flusher as we do in the generic write_cache_pages(), so we can
      omit pages dirtied after the snapshot command.
      
      This does not change the semantics regarding which data get to the
      snapshot, if there are pages being dirtied during the snapshotting
      operation.  There's a sync called before snapshot is taken in old/new
      case, any IO in flight just after that may be in the snapshot but this
      depends on other system effects that might still sync the IO.
      
      We do a simple snapshot speed test on a Intel D-1531 box:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=write --size=64G
      --direct=0 --thread=1 --numjobs=1 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 1m58sec
      patched:  6.54sec
      
      This is the best case for this patch since for a sequential write case,
      we omit nearly all pages dirtied after the snapshot command.
      
      For a multi writers, random write test:
      
      fio --ioengine=libaio --iodepth=32 --bs=4k --rw=randwrite --size=64G
      --direct=0 --thread=1 --numjobs=4 --time_based --runtime=120
      --filename=/mnt/sub/testfile --name=job1 --group_reporting & sleep 5;
      time btrfs sub snap -r /mnt/sub /mnt/snap; killall fio
      
      original: 15.83sec
      patched:  10.35sec
      
      The improvement is smaller compared to the sequential write case,
      since we omit only half of the pages dirtied after snapshot command.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3cd24c69
    • N
      btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered · c629732d
      Nikolay Borisov 提交于
      This parameter was never used, yet was part of the interface of the
      function ever since its introduction as extent_io_ops::writepage_end_io_hook
      in e6dcd2dc ("Btrfs: New data=ordered implementation"). Now that
      NULL is passed everywhere as a value for this parameter let's remove it
      for good. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c629732d
    • O
      Btrfs: prevent ioctls from interfering with a swap file · eede2bf3
      Omar Sandoval 提交于
      A later patch will implement swap file support for Btrfs, but before we
      do that, we need to make sure that the various Btrfs ioctls cannot
      change a swap file.
      
      When a swap file is active, we must make sure that the extents of the
      file are not moved and that they don't become shared. That means that
      the following are not safe:
      
      - chattr +c (enable compression)
      - reflink
      - dedupe
      - snapshot
      - defrag
      
      Don't allow those to happen on an active swap file.
      
      Additionally, balance, resize, device remove, and device replace are
      also unsafe if they affect an active swapfile. Add a red-black tree of
      block groups and devices which contain an active swapfile. Relocation
      checks each block group against this tree and skips it or errors out for
      balance or resize, respectively. Device remove and device replace check
      the tree for the device they will operate on.
      
      Note that we don't have to worry about chattr -C (disable nocow), which
      we ignore for non-empty files, because an active swapfile must be
      non-empty and can't be truncated. We also don't have to worry about
      autodefrag because it's only done on COW files. Truncate and fallocate
      are already taken care of by the generic code. Device add doesn't do
      relocation so it's not an issue, either.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eede2bf3
    • N
      btrfs: Remove extent_io_ops::split_extent_hook callback · abbb55f4
      Nikolay Borisov 提交于
      This is the counterpart to merge_extent_hook, similarly, it's used only
      for data/freespace inodes so let's remove it, rename it and call it
      directly where necessary. No functional changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      abbb55f4