1. 12 Sep, 2019 2 commits
    • Btrfs: fix unwritten extent buffers and hangs on future writeback attempts · 18dfa711
      Committed by Filipe Manana
      lock_extent_buffer_for_io() returns 1 to tell the caller that everything
      went fine and that it needs to start writeback for the extent buffer
      (submit a bio, etc), 0 to tell the caller that everything went fine but it
      does not need to start writeback for the extent buffer, and a negative
      value if some error happened.
      
      When it's about to return 1 it tries to lock all pages, and if a try lock
      on a page fails, and we didn't flush any existing bio in our "epd", it
      calls flush_write_bio(epd) and overwrites the return value of 1 with 0 or
      an error. The page might have been locked elsewhere, not with the goal
      of starting writeback of the extent buffer, and even by some code other
      than btrfs, like page migration for example, so a failed try lock does not
      mean that writeback of the extent buffer was already started by some other
      task. Returning 0 nevertheless tells the caller
      (btree_write_cache_pages()) to not start writeback for the extent buffer.
      Note that epd might currently have either no bio, in which case
      flush_write_bio() returns 0 (success), or a bio for another extent buffer
      with a lower index (logical address).
      
      Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
      extent buffer and writeback is never started for the extent buffer,
      future attempts to write back the extent buffer will hang forever waiting
      on that bit to be cleared, since it can only be cleared after writeback
      completes. Such a hang is reported with a trace like the following:
      
        [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
        [49887.347059]       Not tainted 5.2.13-gentoo #2
        [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [49887.347062] btrfs-transacti D    0  1752      2 0x80004000
        [49887.347064] Call Trace:
        [49887.347069]  ? __schedule+0x265/0x830
        [49887.347071]  ? bit_wait+0x50/0x50
        [49887.347072]  ? bit_wait+0x50/0x50
        [49887.347074]  schedule+0x24/0x90
        [49887.347075]  io_schedule+0x3c/0x60
        [49887.347077]  bit_wait_io+0x8/0x50
        [49887.347079]  __wait_on_bit+0x6c/0x80
        [49887.347081]  ? __lock_release.isra.29+0x155/0x2d0
        [49887.347083]  out_of_line_wait_on_bit+0x7b/0x80
        [49887.347084]  ? var_wake_function+0x20/0x20
        [49887.347087]  lock_extent_buffer_for_io+0x28c/0x390
        [49887.347089]  btree_write_cache_pages+0x18e/0x340
        [49887.347091]  do_writepages+0x29/0xb0
        [49887.347093]  ? kmem_cache_free+0x132/0x160
        [49887.347095]  ? convert_extent_bit+0x544/0x680
        [49887.347097]  filemap_fdatawrite_range+0x70/0x90
        [49887.347099]  btrfs_write_marked_extents+0x53/0x120
        [49887.347100]  btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
        [49887.347102]  btrfs_commit_transaction+0x6bb/0x990
        [49887.347103]  ? start_transaction+0x33e/0x500
        [49887.347105]  transaction_kthread+0x139/0x15c
      
      So fix this by not overwriting the return value (ret) with the result
      from flush_write_bio(). If flush_write_bio() returns an error, we also
      need to clear the EXTENT_BUFFER_WRITEBACK bit and undo all work done
      before (set back EXTENT_BUFFER_DIRTY, etc), otherwise any future attempt
      to write back the extent buffer will hang.
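
The return-value handling described above can be modelled in plain C. This is only a sketch under stated assumptions: `struct eb_model`, `lock_for_io_model()` and the `flush_result` parameter are hypothetical stand-ins rather than the kernel code, and the two booleans merely mirror the EXTENT_BUFFER_WRITEBACK and EXTENT_BUFFER_DIRTY bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the extent buffer state bits. */
struct eb_model {
    bool writeback; /* models EXTENT_BUFFER_WRITEBACK */
    bool dirty;     /* models EXTENT_BUFFER_DIRTY */
};

/*
 * Models the tail of the locking helper after the fix.
 * flush_result is what the bio flush would return: 0 or a negative errno.
 * Returns 1 if the caller must start writeback, or a negative error.
 */
static int lock_for_io_model(struct eb_model *eb, bool trylock_fails,
                             int flush_result)
{
    int ret = 1; /* decided earlier: writeback must be started */

    assert(eb != NULL);
    eb->dirty = false;
    eb->writeback = true;

    if (trylock_fails) {
        int err = flush_result; /* kept separate, never assigned to ret */

        if (err < 0) {
            /* Undo the earlier work so later writeback does not hang. */
            eb->writeback = false;
            eb->dirty = true;
            return err;
        }
        /* Flush succeeded: the real code would now block on the page lock. */
    }
    return ret;
}
```

With a successful flush the model still returns 1, so the caller starts writeback. With a failed flush it returns the error and leaves the buffer dirty with the writeback flag cleared.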
      
      This is a regression introduced in the 5.2 kernel.
      
      Fixes: 2e3c2513 ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      Reported-by: Zdenek Sojka <zsojka@seznam.cz>
      Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
      Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
      Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
      Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix assertion failure during fsync and use of stale transaction · 410f954c
      Committed by Filipe Manana
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If, after the eviction/iput completes, the inode
         logging path hits an error or decides that it must fall back to a
         transaction commit, the btrfs fsync handler, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction is unpredictable. At the very
         least we have a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value. We can also end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meantime was allocated to
         another task for a new transaction, etc.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode. That step is offloaded to a safe context
      (usually the cleaner kthread).
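
The idea of deferring the final reference drop can be illustrated with a small userspace model. It is a sketch only: `struct inode_model`, the fixed-size `delayed` queue and all `*_model()` functions are hypothetical and do not reflect how the kernel implements delayed iputs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inode with a reference count and an "evicted" marker. */
struct inode_model {
    int refs;
    int evicted;
};

#define MAX_DELAYED 8
static struct inode_model *delayed[MAX_DELAYED];
static size_t nr_delayed;

/* Drops one reference; the last drop evicts in the caller's context. */
static void iput_model(struct inode_model *inode)
{
    assert(inode != NULL && inode->refs > 0);
    if (--inode->refs == 0)
        inode->evicted = 1;
}

/* Queues the drop instead of doing it now; returns -1 if the queue is full. */
static int add_delayed_iput_model(struct inode_model *inode)
{
    if (nr_delayed == MAX_DELAYED)
        return -1;
    delayed[nr_delayed++] = inode;
    return 0;
}

/* Runs later from a safe context (the cleaner thread in the description). */
static void run_delayed_iputs_model(void)
{
    for (size_t i = 0; i < nr_delayed; i++)
        iput_model(delayed[i]);
    nr_delayed = 0;
}
```

In this model the logging path would call `add_delayed_iput_model()`, so eviction can only happen inside `run_delayed_iputs_model()`, where no transaction handle is held by the caller.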
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running. This leads to fsync failing with -EIO errors while logging
      inodes and then later trying to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 09 Sep, 2019 38 commits
    • btrfs: Relinquish CPUs in btrfs_compare_trees · 6af112b1
      Committed by Nikolay Borisov
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This can result in long
      loop chains without ever relinquishing the CPU, which causes the soft
      lockup detector to trigger when comparing trees with a lot of items.
      Example report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
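
A userspace analogue of this pattern is sketched below. `compare_items_model()` is a hypothetical function, and `sched_yield()` is only a rough stand-in for cond_resched(): it always yields, whereas cond_resched() reschedules only when needed.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Counts positions where two equally sized arrays differ.
 * The yield at the top of each iteration models the call added
 * at the beginning of the main comparison loop.
 */
static size_t compare_items_model(const int *left, const int *right, size_t n)
{
    size_t changed = 0;

    assert(left != NULL && right != NULL);
    for (size_t i = 0; i < n; i++) {
        sched_yield(); /* give other runnable tasks a chance */
        if (left[i] != right[i])
            changed++;
    }
    return changed;
}
```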
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic · 65e99c43
      Committed by Nikolay Borisov
      Those functions are simple boolean predicates, so there is no need to
      assign their return values to interim variables. Use them directly as
      predicates. No functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: create structure to encode checksum type and length · af024ed2
      Committed by Johannes Thumshirn
      Create a structure to encode the type and length for the known on-disk
      checksums.  This makes it easier to add new checksums later.
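
A table of this kind might look like the following sketch. The names `csum_type_model`, `csum_desc_model` and `csum_size_model()` are hypothetical and not the ones used by the patch. The only fact relied on is that a CRC32C checksum is 4 bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checksum type identifiers; only one type is known here. */
enum csum_type_model {
    CSUM_TYPE_CRC32_MODEL = 0,
};

/* One descriptor per known on-disk checksum type. */
struct csum_desc_model {
    uint16_t size; /* checksum length in bytes */
};

static const struct csum_desc_model csum_table_model[] = {
    [CSUM_TYPE_CRC32_MODEL] = { .size = 4 },
};

/* Returns the checksum length for a type, or 0 for an unknown type. */
static uint16_t csum_size_model(unsigned int type)
{
    size_t n = sizeof(csum_table_model) / sizeof(csum_table_model[0]);

    assert(n > 0);
    if (type >= n)
        return 0;
    return csum_table_model[type].size;
}
```

Adding a new checksum then only needs a new enum value and one more table entry.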
      
      The structure and helpers are moved from ctree.h so they don't occupy
      space in all headers including ctree.h. This saves some space in the
      final object.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add enospc debug messages for ticket failure · 84fe47a4
      Committed by Josef Bacik
      When debugging weird enospc problems it's handy to be able to dump the
      space info when we wake up all tickets, and see what the ticket values
      are.  This helped me figure out cases where we were enospc'ing when we
      shouldn't have been.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not account global reserve in can_overcommit · 0096420a
      Committed by Josef Bacik
      We ran into a problem in production where a box with plenty of space was
      getting wedged doing ENOSPC flushing.  These boxes only had 20% of the
      disk allocated, but their metadata space + global reserve was right at
      the size of their metadata chunk.
      
      In this case can_overcommit should be allowing allocations without
      problem, but there's logic in can_overcommit that doesn't allow us to
      overcommit if there's not enough real space to satisfy the global
      reserve.
      
      This is for historical reasons.  Before there were only certain places
      we could allocate chunks.  We could go to commit the transaction and not
      have enough space for our pending delayed refs and such and be unable to
      allocate a new chunk.  This would result in an abort because of ENOSPC.
      This code was added to solve this problem.
      
      However since then we've gained the ability to always be able to
      allocate a chunk.  So we can easily overcommit in these cases without
      risking a transaction abort because of ENOSPC.
      
      Also prior to now the global reserve really would be used because that's
      the space we relied on for delayed refs.  With delayed refs being
      tracked separately we no longer have to worry about running out of
      delayed refs space while committing.  We are much less likely to
      exhaust our global reserve space during transaction commit.
      
      Fix the can_overcommit code to simply see if our current usage + what we
      want is less than our current free space plus whatever slack space we
      have in the disk.  This solves the problem we were seeing in
      production and keeps us from flushing as aggressively as we approach our
      actual metadata size usage.
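
The simplified check can be written down as a one-line predicate. This sketch assumes the comparison is made against the total size of the space info plus the slack space. `can_overcommit_model()` and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * total_bytes: size of the space info (assumption, see lead-in)
 * used:        bytes currently accounted as used or reserved
 * bytes:       size of the reservation being attempted
 * slack:       unallocated disk space we are willing to count on
 */
static bool can_overcommit_model(uint64_t total_bytes, uint64_t used,
                                 uint64_t bytes, uint64_t slack)
{
    /* Guard against wrap-around in the model's arithmetic. */
    assert(used <= UINT64_MAX - bytes);
    assert(total_bytes <= UINT64_MAX - slack);

    return used + bytes < total_bytes + slack;
}
```

Note that the global reserve does not appear anywhere in the predicate, which is the point of the change.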
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_try_granting_tickets in update_global_rsv · 426551f6
      Committed by Josef Bacik
      We have some annoying xfstests tests that will create a very small fs,
      fill it up, delete it, and repeat to make sure everything works right.
      This trips btrfs up sometimes because we may commit a transaction to
      free space, but most of the free metadata space was being reserved by
      the global reserve.  So we commit and update the global reserve, but the
      space is simply added to bytes_may_use directly, instead of trying to
      add it to existing tickets.  This results in ENOSPC when we really did
      have space.  Fix this by calling btrfs_try_granting_tickets once we add
      back our excess space to wake any pending tickets.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: always reserve our entire size for the global reserve · d792b0f1
      Committed by Josef Bacik
      While messing with the overcommit logic I noticed that sometimes we'd
      ENOSPC out when really we should have run out of space much earlier.  It
      turns out it's because we'll only reserve up to the free amount left in
      the space info for the global reserve, but that doesn't make sense with
      overcommit because we could be well above our actual size.  This results
      in the global reserve not carving out its entire reservation, and thus
      not putting enough pressure on the rest of the infrastructure to do the
      right thing and ENOSPC out at a convenient time.  Fix this by always
      taking our full reservation amount for the global reserve.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: change the minimum global reserve size · 3593ce30
      Committed by Josef Bacik
      It made sense to have the global reserve set at 16M in the past, but
      since it is used less nowadays set the minimum size to the number of
      items we'll need to update the main trees we update during a transaction
      commit, plus some slop area so we can do unlinks if we need to.
      
      In practice this doesn't affect normal file systems, but for xfstests
      where we do things like fill up a fs and then rm * it can fall over in
      weird ways.  This gives us saner behavior at extremely small
      file system sizes.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_space_info_add_old_bytes · d05e4649
      Committed by Josef Bacik
      This name doesn't really fit with how the space reservation stuff works
      now. Rename it to btrfs_space_info_free_bytes_may_use so it's clear what
      the function is doing.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove orig_bytes from reserve_ticket · def936e5
      Committed by Josef Bacik
      Now that we do not do partial filling of tickets, orig_bytes is no longer
      needed, so simply remove it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix may_commit_transaction to deal with no partial filling · 00c0135e
      Committed by Josef Bacik
      Now that we aren't partially filling tickets we may have some slack
      space left in the space_info.  We need to account for this in
      may_commit_transaction, otherwise we may choose to not commit the
      transaction despite it actually having enough space to satisfy our
      ticket.
      
      Calculate the free space we have in the space_info, if any, and subtract
      this from the ticket we have and use that amount to determine if we will
      need to commit to reclaim enough space.
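
The calculation described above amounts to two saturating subtractions. The sketch below uses hypothetical names (`bytes_needed_model()`, `total_bytes`, `used`, `ticket_bytes`) and is not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns how many bytes a commit would still have to reclaim
 * for a ticket, after counting the slack already free in the
 * space info. Never goes below zero.
 */
static uint64_t bytes_needed_model(uint64_t total_bytes, uint64_t used,
                                   uint64_t ticket_bytes)
{
    /* Free space in the space info, if any (used may exceed total). */
    uint64_t free_bytes = used < total_bytes ? total_bytes - used : 0;
    uint64_t needed = ticket_bytes > free_bytes ? ticket_bytes - free_bytes : 0;

    assert(needed <= ticket_bytes);
    return needed;
}
```

For example, with 10 bytes free and a 25 byte ticket, only 15 bytes need to come from the commit.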
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework wake_all_tickets · 2341ccd1
      Committed by Josef Bacik
      Now that we no longer partially fill tickets we need to rework
      wake_all_tickets to call btrfs_try_to_wakeup_tickets() in order to see
      if any subsequent tickets are able to be satisfied.  If our tickets_id
      changes we know something happened and we can keep flushing.
      
      Also if we find a ticket that is smaller than the first ticket in our
      queue then we want to retry the flushing loop again in case
      may_commit_transaction() decides we could satisfy the ticket by
      committing the transaction.
      
      Rename this to maybe_fail_all_tickets() while we're at it, to better
      reflect what the function is actually doing.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor the ticket wakeup code · 18fa2284
      Committed by Josef Bacik
      Now that btrfs_space_info_add_old_bytes simply checks if we can make the
      reservation and updates bytes_may_use, there's no reason to have both
      helpers in place.
      
      Factor out the ticket wakeup logic into its own helper, make
      btrfs_space_info_add_old_bytes() update bytes_may_use and then call the
      wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes()
      with the wakeup helper.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop partially refilling tickets when releasing space · 91182645
      Committed by Josef Bacik
      btrfs_space_info_add_old_bytes is used when adding the extra space from
      an existing reservation back into the space_info to be used by any
      waiting tickets.  In order to keep us from overcommitting we check to
      make sure that we can still use this space for our reserve ticket, and
      if we cannot we'll simply subtract it from space_info->bytes_may_use.
      
      However this is problematic, because it assumes that only changes to
      bytes_may_use would affect our ability to make reservations.  Any
      changes to bytes_reserved would be missed.  If we were previously unable
      to make a reservation because of reserved space, but that reserved space
      was freed due to unlink or truncate and we were allowed to immediately
      reclaim that metadata space, we would still ENOSPC.
      
      Consider the example where we create a file with a bunch of extents,
      using up 2MiB of actual space for the new tree blocks.  Then we try to
      make a reservation of 2MiB but we do not have enough space to make this
      reservation.  The iput() occurs in another thread and we remove this
      space, and since we did not write the blocks we simply do
      space_info->bytes_reserved -= 2MiB.  We would never see this because we
      do not check our space info used, we just try to re-use the freed
      reservations.
      
      To fix this problem, and to greatly simplify the wakeup code, do away
      with this partial refilling nonsense.  Use
      btrfs_space_info_add_old_bytes to subtract the reservation from
      space_info->bytes_may_use, and then check the ticket against the total
      used of the space_info the same way we do with the initial reservation
      attempt.
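
The new wakeup behaviour can be sketched as follows. The model assumes tickets are granted whole, in FIFO order, against the combined usage counters. `struct space_info_model` and `release_and_grant_model()` are hypothetical and track only two counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a space info. */
struct space_info_model {
    uint64_t total_bytes;
    uint64_t bytes_may_use;
    uint64_t bytes_reserved;
};

/*
 * Releases a reservation, then grants waiting tickets in order for as
 * long as each one fits entirely. Returns the number of tickets granted.
 */
static size_t release_and_grant_model(struct space_info_model *si,
                                      uint64_t released,
                                      const uint64_t *tickets, size_t n)
{
    size_t granted = 0;

    assert(si != NULL && released <= si->bytes_may_use);
    si->bytes_may_use -= released;

    while (granted < n) {
        /* Same check as an initial reservation: look at total usage. */
        uint64_t used = si->bytes_may_use + si->bytes_reserved;

        if (used + tickets[granted] > si->total_bytes)
            break; /* no partial fill, later tickets keep waiting */
        si->bytes_may_use += tickets[granted];
        granted++;
    }
    return granted;
}
```

Because the check looks at total usage, a drop in `bytes_reserved` made elsewhere is picked up the next time the function runs, even when `released` is 0.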
      
      This keeps the reservation logic consistent and solves the problem of
      early ENOSPC in the case that we free up space in places other than
      bytes_may_use and bytes_pinned.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add space reservation tracepoint for reserved bytes · a43c3835
      Committed by Josef Bacik
      I noticed when folding the trace_btrfs_space_reservation() tracepoint
      into the btrfs_space_info_update_* helpers that we didn't emit a
      tracepoint when doing btrfs_add_reserved_bytes().  I know this is
      because we were swapping bytes_may_use for bytes_reserved, so in my mind
      there was no reason to have the tracepoint there.  But now there is
      because we always emit the unreserve for the bytes_may_use side, and
      this would have broken if compression was on anyway.  Add a tracepoint
      to cover the bytes_reserved counter so the math still comes out right.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: roll tracepoint into btrfs_space_info_update helper · f3e75e38
      Committed by Josef Bacik
      We duplicate this tracepoint everywhere we call these helpers, so update
      the helper to have the tracepoint as well.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not allow reservations if we have pending tickets · ef1317a1
      Committed by Josef Bacik
      If we already have tickets on the list we don't want to steal their
      reservations.  This is a preparation patch for upcoming changes.
      Technically this shouldn't happen today because of the way we add bytes
      to tickets before adding them to the space_info in most cases.
      
      This does not change the FIFO nature of reserve tickets, it simply
      allows us to enforce it in a different way.  Previously it was enforced
      because any new space would be added to the first ticket on the list,
      which would result in new reservations getting a reserve ticket.  This
      replaces that mechanism by simply checking to see if we have outstanding
      reserve tickets and skipping straight to adding a ticket for our
      reservation.
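
The FIFO rule can be expressed as a guard on the fast path. This is a minimal sketch: `struct space_model`, its `pending_tickets` counter and `try_reserve_model()` are hypothetical simplifications of the ticket list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical space accounting with a count of queued tickets. */
struct space_model {
    uint64_t total;
    uint64_t used;
    unsigned int pending_tickets;
};

/*
 * Reserves immediately only when nobody is waiting and the space fits.
 * Otherwise the request is queued behind the existing tickets.
 */
static bool try_reserve_model(struct space_model *s, uint64_t bytes)
{
    assert(s != NULL);
    if (s->pending_tickets == 0 && s->used + bytes <= s->total) {
        s->used += bytes;
        return true;
    }
    s->pending_tickets++; /* becomes a ticket at the end of the queue */
    return false;
}
```

Even a request that would fit is queued when tickets are pending, so it cannot take space that earlier waiters are entitled to.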
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop clearing EXTENT_DIRTY in inode I/O tree · e182163d
      Committed by Omar Sandoval
      Since commit fee187d9 ("Btrfs: do not set EXTENT_DIRTY along with
      EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
      can simplify and stop trying to clear it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: treat RWF_{,D}SYNC writes as sync for CRCs · f50cb7af
      Committed by Omar Sandoval
      The VFS indicates a synchronous write to ->write_iter() via
      iocb->ki_flags. The IOCB_{,D}SYNC flags may be set based on the file
      (see iocb_flags()) or the RWF_* flags passed to a syscall like
      pwritev2() (see kiocb_set_rw_flags()).
      
      However, in btrfs_file_write_iter(), we're checking if a write is
      synchronous based only on the file; we use this to decide when to bump
      the sync_writers counter and thus do CRCs synchronously. Make sure we do
      this for all synchronous writes as determined by the VFS.
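
The resulting test is a simple flag check on the per-request flags rather than on the file. In the sketch below the `*_MODEL` constants and `write_is_sync_model()` are hypothetical; the bit values are illustrative and are not claimed to match the kernel's IOCB_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits standing in for IOCB_DSYNC and IOCB_SYNC. */
#define IOCB_DSYNC_MODEL (1 << 4)
#define IOCB_SYNC_MODEL  (1 << 5)

/*
 * A write is synchronous if either flag is set on the request,
 * regardless of whether it came from the file or from RWF_* flags.
 */
static bool write_is_sync_model(int ki_flags)
{
    assert(ki_flags >= 0);
    return (ki_flags & (IOCB_DSYNC_MODEL | IOCB_SYNC_MODEL)) != 0;
}
```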
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add const ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use correct count in btrfs_file_write_iter() · c09767a8
      Committed by Omar Sandoval
      generic_write_checks() may modify iov_iter_count(), so we must get the
      count after the call, not before. Using the wrong one has a couple of
      consequences:
      
      1. We check a longer range in check_can_nocow() for nowait than we're
         actually writing.
      2. We create extra hole extent maps in btrfs_cont_expand(). As far as I
         can tell, this is harmless, but I might be missing something.
      
      These issues are pretty minor, but let's fix it before something more
      important trips on it.
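
The ordering issue can be reproduced with a toy iterator. `struct iter_model`, `write_checks_model()` and `bytes_to_write_model()` are hypothetical. The clamp against `limit` only imitates the fact that the checks may shrink the byte count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator that only tracks the remaining byte count. */
struct iter_model {
    size_t count;
};

/* Clamps the count so that pos + count never exceeds limit. */
static size_t write_checks_model(struct iter_model *it, size_t pos,
                                 size_t limit)
{
    assert(it != NULL);
    if (pos >= limit)
        it->count = 0;
    else if (it->count > limit - pos)
        it->count = limit - pos;
    return it->count;
}

/* Reads the count only after the checks have had a chance to modify it. */
static size_t bytes_to_write_model(struct iter_model *it, size_t pos,
                                   size_t limit)
{
    if (write_checks_model(it, pos, limit) == 0)
        return 0;
    return it->count;
}
```

Had the count been sampled before the call, a 100 byte request at offset 10 with a limit of 50 would be treated as 100 bytes instead of 40.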
      
      Fixes: edf064e7 ("btrfs: nowait aio support")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tie extent buffer and it's token together · c82f823c
      Committed by David Sterba
      Further simplification of the get/set helpers is possible when the token
      is uniquely tied to an extent buffer. A condition and an assignment can
      be avoided.
      
      The initializations are moved closer to the first use when the extent
      buffer is valid. There's one exception in __push_leaf_left where the
      token is reused.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: assume valid token for btrfs_set/get_token helpers · 48bc3950
      Committed by David Sterba
      Now that we can safely assume that the token is always a valid pointer,
      remove the branches that check that.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: define separate btrfs_set/get_XX helpers · cb495113
      Committed by David Sterba
      There are helpers for all type widths defined via macro and optionally
      can use a token which is a cached pointer to avoid repeated mapping of
      the extent buffer.
      
      The token value is known at compile time: when it's valid it's always
      the address of a local variable, otherwise it's NULL passed by the
      token-less helpers.

      This can be utilized to remove some branching as the helpers are used
      frequently.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref · 6ff49c6a
      Committed by Nikolay Borisov
      btrfs_find_name_in_ext_backref returns either 0 or 1 depending on whether
      it found a backref for the given name. If it returns true then the actual
      inode_ref struct is returned in one of its parameters. That's pointless,
      so refactor the function such that it returns either a pointer
      to the btrfs_inode_extref or NULL if it didn't find anything. This
      streamlines the function calling convention.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref struct · 9bb8407f
      Nikolay Borisov committed
      btrfs_find_name_in_backref returns 0 or 1 depending on whether it
      found a backref for the given name. If it returns true, the actual
      inode_ref struct is returned in one of its parameters. That's pointless;
      instead, refactor the function so that it returns either a pointer
      to the btrfs_inode_ref or NULL if it didn't find anything. This
      streamlines the function calling convention.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9bb8407f
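
The calling-convention change in 6ff49c6a and 9bb8407f can be illustrated outside the kernel. The sketch below uses a hypothetical `sketch_ref` type and lookup functions, not the real btrfs structures or signatures; it only contrasts "return 0/1 plus an out-parameter" with "return a pointer or NULL".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an inode ref entry. */
struct sketch_ref {
	const char *name;
	size_t name_len;
};

/* Old shape: 1 if found, 0 otherwise; the match comes back through *ret. */
static int sketch_find_old(struct sketch_ref *refs, size_t n,
			   const char *name, size_t name_len,
			   struct sketch_ref **ret)
{
	for (size_t i = 0; i < n; i++) {
		if (refs[i].name_len == name_len &&
		    memcmp(refs[i].name, name, name_len) == 0) {
			*ret = &refs[i];
			return 1;
		}
	}
	return 0;
}

/* New shape: the match itself is the return value, NULL means not found. */
static struct sketch_ref *sketch_find_new(struct sketch_ref *refs, size_t n,
					  const char *name, size_t name_len)
{
	for (size_t i = 0; i < n; i++) {
		if (refs[i].name_len == name_len &&
		    memcmp(refs[i].name, name, name_len) == 0)
			return &refs[i];
	}
	return NULL;
}
```

With the second form a caller writes `ref = sketch_find_new(...); if (!ref) ...` and no longer needs a separate variable that is only meaningful when the return value is non-zero.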
    • D
      btrfs: move dev_stats helpers to volumes.c · 1dc990df
      David Sterba committed
      The other dev stats functions are already there and the helpers are not
      used by anything else.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1dc990df
    • D
      btrfs: move struct io_ctl to free-space-cache.h · 67b61aef
      David Sterba committed
      The io_ctl structure is used for free space management, and used only by
      the v1 space cache code, but unfortunately the full definition is
      required by block-group.h so it can't be moved to free-space-cache.c
      without additional changes.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      67b61aef
    • D
      btrfs: move functions for tree compare to send.c · 18d0f5c6
      David Sterba committed
      Send is the only user of tree_compare, so we can move it there along
      with the other helpers and definitions.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18d0f5c6
    • D
      btrfs: rename and export read_node_slot · 4b231ae4
      David Sterba committed
      Preparatory work for code that will be moved out of ctree and uses this
      function.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4b231ae4
    • D
      8a953348
    • D
      btrfs: move math functions to misc.h · 784352fe
      David Sterba committed
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      784352fe
    • D
      btrfs: move cond_wake_up functions out of ctree · 602cbe91
      David Sterba committed
      The file ctree.h serves as a header for everything and has become quite
      bloated. Split out some helpers that are generic and create a new file
      that should be the catch-all for code that's not btrfs-specific.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      602cbe91
    • A
      btrfs: use proper error values on allocation failure in clone_fs_devices · d2979aa2
      Anand Jain committed
      Return the actual error from clone_fs_devices() instead of a fake
      ENOMEM error code.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d2979aa2
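
The idea behind d2979aa2, passing a helper's real error up rather than substituting a fixed one, can be sketched generically. Nothing below is the kernel's clone_fs_devices(): `sketch_alloc_one()` is a hypothetical stand-in whose result is supplied by the caller so the failure path can be exercised.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical per-device allocation helper: returns 0 on success or a
 * negative errno. The result is injected so the sketch stays deterministic.
 */
static int sketch_alloc_one(int simulated_result)
{
	return simulated_result;
}

/* Clone every device; on failure report the helper's own error code. */
static int sketch_clone_all(const int *simulated, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = sketch_alloc_one(simulated[i]);

		if (ret < 0)
			return ret; /* propagate as-is, do not rewrite to -ENOMEM */
	}
	return 0;
}
```

A caller that sees `-EINVAL` from `sketch_clone_all()` can then tell it apart from a genuine out-of-memory condition.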
    • A
      btrfs: proper error handling when invalid device is found in find_next_devid · a06dee4d
      Anand Jain committed
      In a corrupted tree, if the search for the next devid finds a device
      with devid = -1, then report the error -EUCLEAN back to the parent
      function to fail gracefully.
      
      The tree checker will not catch this in case the devids are created
      using the following script:
      
        umount /btrfs
        dev1=/dev/sdb
        dev2=/dev/sdc
        mkfs.btrfs -fq -dsingle -msingle $dev1
        mount $dev1 /btrfs
      
        _fail()
        {
      	  echo $1
      	  exit 1
        }
      
        while true; do
      	  btrfs dev add -f $dev2 /btrfs || _fail "add failed"
      	  btrfs dev del $dev1 /btrfs || _fail "del failed"
      	  dev_tmp=$dev1
      	  dev1=$dev2
      	  dev2=$dev_tmp
        done
      
      With output:
      
        BTRFS critical (device sdb): corrupt leaf: root=3 block=313739198464 slot=1 devid=1 invalid devid: has=507 expect=[0, 506]
        BTRFS error (device sdb): block=313739198464 write time tree block corruption detected
        BTRFS: error (device sdb) in btrfs_commit_transaction:2268: errno=-5 IO failure (Error while writing out transaction)
        BTRFS warning (device sdb): Skipping commit of aborted transaction.
        BTRFS: error (device sdb) in cleanup_transaction:1827: errno=-5 IO failure
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      [ add script and messages ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      a06dee4d
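
The check described in a06dee4d amounts to refusing to compute "highest devid + 1" when the highest devid is already the maximum value. A minimal sketch, assuming the highest existing devid has already been read from the tree; `sketch_next_devid()` is a hypothetical name and does not reproduce the real find_next_devid() signature:

```c
#include <errno.h>
#include <stdint.h>

/* EUCLEAN is Linux-specific; 117 is its usual value there (fallback assumption). */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/*
 * Given the highest devid currently present, store the next free devid in
 * *devid_ret. A devid of (u64)-1 can only come from corruption, and adding 1
 * would wrap to 0, so report -EUCLEAN instead.
 */
static int sketch_next_devid(uint64_t highest_devid, uint64_t *devid_ret)
{
	if (highest_devid == UINT64_MAX)
		return -EUCLEAN;

	*devid_ret = highest_devid + 1;
	return 0;
}
```

Returning an error lets the caller abort the operation cleanly rather than hand out a wrapped-around devid.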
    • C
      btrfs: fix allocation of free space cache v1 bitmap pages · 3acd4850
      Christophe Leroy committed
      Various reports of the form "BUG kmalloc-4096 () : Redzone
      overwritten" have been observed recently in various parts of the kernel.
      After some time, they were linked to the use of the BTRFS filesystem
      with SLUB_DEBUG turned on.
      
      [   22.809700] BUG kmalloc-4096 (Tainted: G        W        ): Redzone overwritten
      
      [   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
      [   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
      [   22.811193] 	__slab_alloc.constprop.26+0x44/0x70
      [   22.811345] 	kmem_cache_alloc_trace+0xf0/0x2ec
      [   22.811588] 	__load_free_space_cache+0x588/0x780 [btrfs]
      [   22.811848] 	load_free_space_cache+0xf4/0x1b0 [btrfs]
      [   22.812090] 	cache_block_group+0x1d0/0x3d0 [btrfs]
      [   22.812321] 	find_free_extent+0x680/0x12a4 [btrfs]
      [   22.812549] 	btrfs_reserve_extent+0xec/0x220 [btrfs]
      [   22.812785] 	btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
      [   22.813032] 	__btrfs_cow_block+0x150/0x5d4 [btrfs]
      [   22.813262] 	btrfs_cow_block+0x194/0x298 [btrfs]
      [   22.813484] 	commit_cowonly_roots+0x44/0x294 [btrfs]
      [   22.813718] 	btrfs_commit_transaction+0x63c/0xc0c [btrfs]
      [   22.813973] 	close_ctree+0xf8/0x2a4 [btrfs]
      [   22.814107] 	generic_shutdown_super+0x80/0x110
      [   22.814250] 	kill_anon_super+0x18/0x30
      [   22.814437] 	btrfs_kill_super+0x18/0x90 [btrfs]
      [   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
      [   22.814841] 	proc_cgroup_show+0xc0/0x248
      [   22.814967] 	proc_single_show+0x54/0x98
      [   22.815086] 	seq_read+0x278/0x45c
      [   22.815190] 	__vfs_read+0x28/0x17c
      [   22.815289] 	vfs_read+0xa8/0x14c
      [   22.815381] 	ksys_read+0x50/0x94
      [   22.815475] 	ret_from_syscall+0x0/0x38
      
      Commit 69d24804 ("btrfs: use copy_page for copying pages instead of
      memcpy") changed the way bitmap blocks are copied. But although bitmaps
      have the size of a page, they were allocated with kzalloc().
      
      Most of the time, kzalloc() allocates aligned blocks of memory, so
      copy_page() can be used. But when some debug options like SLAB_DEBUG are
      activated, kzalloc() may return an unaligned pointer.
      
      On powerpc, memcpy(), copy_page() and other copying functions use the
      'dcbz' instruction, which provides an entire zeroed cacheline to avoid
      a memory read when the intention is to overwrite a full line. Functions
      like memcpy() are written to care about partial cachelines at the start
      and end of the destination, but copy_page() assumes it gets pages. As
      pages are naturally cache aligned, copy_page() doesn't care about
      partial lines. This means that when copy_page() is called with a
      misaligned pointer, a few leading bytes are zeroed.
      
      To fix it, allocate bitmaps through kmem_cache instead of using kzalloc().
      The cache pool is created with a PAGE_SIZE alignment constraint.
      Reported-by: Erhard F. <erhard_f@mailbox.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
      Fixes: 69d24804 ("btrfs: use copy_page for copying pages instead of memcpy")
      Cc: stable@vger.kernel.org # 4.19+
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ rename to btrfs_free_space_bitmap ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      3acd4850
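
The alignment requirement behind 3acd4850 has a simple userspace analogue. The sketch below is not the kernel fix, which per the message uses a kmem_cache created with PAGE_SIZE alignment; it uses C11 `aligned_alloc()` instead, and `SKETCH_PAGE_SIZE`, `sketch_is_page_aligned()`, and `sketch_alloc_bitmap()` are hypothetical names. It shows the property a page-copy routine silently relies on: the buffer address must be a multiple of the page size.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* True when the pointer sits exactly on a page boundary. */
static int sketch_is_page_aligned(const void *p)
{
	return ((uintptr_t)p % SKETCH_PAGE_SIZE) == 0;
}

/*
 * Allocate one zeroed, page-sized bitmap with an explicit alignment request,
 * rather than hoping a general-purpose allocator happens to align it.
 */
static void *sketch_alloc_bitmap(void)
{
	void *p = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);

	if (p)
		memset(p, 0, SKETCH_PAGE_SIZE);
	return p;
}
```

A plain `malloc(4096)` carries no such guarantee, which mirrors the kzalloc() situation in the message: the allocation is often aligned in practice, but debug options can shift the returned pointer.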
    • Q
      btrfs: Detect unbalanced tree with empty leaf before crashing btree operations · 62fdaa52
      Qu Wenruo committed
      [BUG]
      With a crafted image, btrfs will panic in btree operations:
      
        kernel BUG at fs/btrfs/ctree.c:3894!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1138 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:__push_leaf_left+0x6b6/0x6e0
        RSP: 0018:ffffc0bd4128b990 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffffa0a4ab8f0e38 RCX: 0000000000000000
        RDX: ffffa0a280000000 RSI: 0000000000000000 RDI: ffffa0a4b3814000
        RBP: ffffc0bd4128ba38 R08: 0000000000001000 R09: ffffc0bd4128b948
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000240
        R13: ffffa0a4b556fb60 R14: ffffa0a4ab8f0af0 R15: ffffa0a4ab8f0af0
        FS: 0000000000000000(0000) GS:ffffa0a4b7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f2461c80020 CR3: 000000022b32a006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        push_leaf_left+0x179/0x190
        btrfs_del_items+0x316/0x470
        btrfs_del_csums+0x215/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace c2425e6e89b5558f ]---
      
      [CAUSE]
      The offending csum tree looks like this:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
      	  ...
      	  key (EXTENT_CSUM EXTENT_CSUM 85975040) block 29630464 gen 17
      	  key (EXTENT_CSUM EXTENT_CSUM 89911296) block 29642752 gen 17 <<<
      	  key (EXTENT_CSUM EXTENT_CSUM 92274688) block 29646848 gen 17
      	  ...
      
        leaf 29630464 items 6 free space 1 generation 17 owner CSUM_TREE
      	  item 0 key (EXTENT_CSUM EXTENT_CSUM 85975040) itemoff 3987 itemsize 8
      		  range start 85975040 end 85983232 length 8192
      	  ...
        leaf 29642752 items 0 free space 3995 generation 17 owner 0
      		      ^ empty leaf            invalid owner ^
      
        leaf 29646848 items 1 free space 602 generation 17 owner CSUM_TREE
      	  item 0 key (EXTENT_CSUM EXTENT_CSUM 92274688) itemoff 627 itemsize 3368
      		  range start 92274688 end 95723520 length 3448832
      
      So we have a corrupted csum tree where one tree leaf is completely
      empty, which unbalances the btree and leads to an unexpected btree
      balance error.
      
      [FIX]
      For this particular case, we catch it in two ways:
      - Check if the tree block is empty through btrfs_verify_level_key()
        So that invalid tree blocks won't be read out through
        btrfs_search_slot() and its variants.
      
      - Check 0 tree owner in tree checker
        No tree uses 0 as its tree owner, so detect it and reject the block
        at tree block read time.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202821
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      62fdaa52
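
The two checks listed under [FIX] in 62fdaa52 can be sketched as plain predicates. This is a simplified illustration, not the kernel's btrfs_verify_level_key() or tree checker: `sketch_block_header` and both functions are hypothetical, and the choice of -EUCLEAN as the error code is an assumption made for the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* EUCLEAN is Linux-specific; 117 is its usual value there (fallback assumption). */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/* Hypothetical, reduced view of a tree block header. */
struct sketch_block_header {
	uint64_t owner;   /* id of the tree the block belongs to */
	uint32_t nritems; /* number of items in the block */
};

/*
 * Check 1: a block reached through a parent pointer is expected to start
 * with the key stored in the parent, so it cannot be empty.
 */
static int sketch_verify_child(const struct sketch_block_header *h,
			       bool reached_via_parent_key)
{
	if (reached_via_parent_key && h->nritems == 0)
		return -EUCLEAN;
	return 0;
}

/* Check 2: no tree uses 0 as its owner id, so such a block is rejected. */
static int sketch_check_owner(const struct sketch_block_header *h)
{
	if (h->owner == 0)
		return -EUCLEAN;
	return 0;
}
```

Applied to the dump in the message, leaf 29642752 (items 0, owner 0) fails both predicates, whereas its neighbours with a non-zero owner and at least one item pass.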
    • N
      btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag · ebc87351
      Nikolay Borisov committed
      Support for asynchronous snapshot creation was originally added in
      72fd032e ("Btrfs: add SNAP_CREATE_ASYNC ioctl") to cater for
      Ceph's backend needs. However, since Ceph has deprecated support for
      btrfs, there is no longer a need for that support in btrfs. Additionally,
      this was never supported by btrfs-progs, the official userspace tools.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ebc87351
    • N
      btrfs: improve error handling in run_delalloc_nocow · 762bf098
      Nikolay Borisov committed
      Correctly handle failure cases when adding an ordered extent in case
      of REGULAR or PREALLOC extents. Remove the BUG_ON.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      762bf098