1. 30 4月, 2019 29 次提交
  2. 19 3月, 2019 1 次提交
  3. 27 2月, 2019 2 次提交
    • J
      btrfs: save drop_progress if we drop refs at all · aea6f028
      Josef Bacik 提交于
      Previously we only updated the drop_progress key if we were in the
      DROP_REFERENCE stage of snapshot deletion.  This is because the
      UPDATE_BACKREF stage checks the flags of the blocks it's converting to
      FULL_BACKREF, so if we go over a block we processed before it doesn't
      matter, we just don't do anything.
      
      The problem is in do_walk_down() we will go ahead and drop the roots
      reference to any blocks that we know we won't need to walk into.
      
      Given subvolume A and snapshot B.  The root of B points to all of the
      nodes that belong to A, so all of those nodes have a refcnt > 1.  If B
      did not modify those blocks it'll hit this condition in do_walk_down
      
      if (!wc->update_ref ||
          generation <= root->root_key.offset)
      	goto skip;
      
      and in "goto skip" we simply do a btrfs_free_extent() for that bytenr
      that we point at.
      
      Now assume we modified some data in B, and then took a snapshot of B and
      call it C.  C points to all the nodes in B, making every node the root
      of B points to have a refcnt > 1.  This assumes the root level is 2 or
      higher.
      
      We delete snapshot B, which does the above work in do_walk_down,
      free'ing our ref for nodes we share with A that we didn't modify.  Now
      we hit a node we _did_ modify, thus we own.  We need to walk down into
      this node and we set wc->stage == UPDATE_BACKREF.  We walk down to level
      0 which we also own because we modified data.  We can't walk any further
      down and thus now need to walk up and start the next part of the
      deletion.  Now walk_up_proc is supposed to put us back into
      DROP_REFERENCE, but there's an exception to this
      
      if (level < wc->shared_level)
      	goto out;
      
      we are at level == 0, and our shared_level == 1.  We skip out of this
      one and go up to level 1.  Since path->slots[1] < nritems we
      path->slots[1]++ and break out of walk_up_tree to stop our transaction
      and loop back around.  Now in btrfs_drop_snapshot we have this snippet
      
      if (wc->stage == DROP_REFERENCE) {
      	level = wc->level;
      	btrfs_node_key(path->nodes[level],
      		       &root_item->drop_progress,
      		       path->slots[level]);
      	root_item->drop_level = level;
      }
      
      our stage == UPDATE_BACKREF still, so we don't update the drop_progress
      key.  This is a problem because we would have done btrfs_free_extent()
      for the nodes leading up to our current position.  If we crash or
      unmount here and go to remount we'll start over where we were before and
      try to free our ref for blocks we've already freed, and thus abort()
      out.
      
      Fix this by keeping track of the last place we dropped a reference for
      our block in do_walk_down.  Then if wc->stage == UPDATE_BACKREF we know
      we'll start over from a place we meant to, and otherwise things continue
      to work as they did before.
      
      I have a complicated reproducer for this problem, without this patch
      we'll fail to fsck the fs when replaying the log writes log.  With this
      patch we can replay the whole log without any fsck or mount failures.
      
      The steps to reproduce this easily are sort of tricky, I had to add a
      couple of debug patches to the kernel in order to make it easy,
      basically I just needed to make sure we did actually commit the
      transaction every time we finished a walk_down_tree/walk_up_tree combo.
      
      The reproducer:
      
      1) Creates a base subvolume.
      2) Creates 100k files in the subvolume.
      3) Snapshots the base subvolume (snap1).
      4) Touches files 5000-6000 in snap1.
      5) Snapshots snap1 (snap2).
      6) Deletes snap1.
      
      I do this with dm-log-writes, and then replay to every FUA in the log
      and fsck the fs.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ copy reproducer steps ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      aea6f028
    • J
      btrfs: check for refs on snapshot delete resume · 78c52d9e
      Josef Bacik 提交于
      There's a bug in snapshot deletion where we won't update the
      drop_progress key if we're in the UPDATE_BACKREF stage.  This is a
      problem because we could drop refs for blocks we know don't belong to
      ours.  If we crash or umount at the right time we could experience
      messages such as the following when snapshot deletion resumes
      
       BTRFS error (device dm-3): unable to find ref byte nr 66797568 parent 0 root 258  owner 1 offset 0
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 16052 at fs/btrfs/extent-tree.c:7108 __btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       CPU: 3 PID: 16052 Comm: umount Tainted: G        W  OE     5.0.0-rc4+ #147
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
       RIP: 0010:__btrfs_free_extent.isra.78+0x62c/0xb30 [btrfs]
       RSP: 0018:ffffc90005cd7b18 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
       RDX: ffff88842fade680 RSI: ffff88842fad6b18 RDI: ffff88842fad6b18
       RBP: ffffc90005cd7bc8 R08: 0000000000000000 R09: 0000000000000001
       R10: 0000000000000001 R11: ffffffff822696b8 R12: 0000000003fb4000
       R13: 0000000000000001 R14: 0000000000000102 R15: ffff88819c9d67e0
       FS:  00007f08bb138fc0(0000) GS:ffff88842fac0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8f5d861ea0 CR3: 00000003e99fe000 CR4: 00000000000006e0
       Call Trace:
       ? _raw_spin_unlock+0x27/0x40
       ? btrfs_merge_delayed_refs+0x356/0x3e0 [btrfs]
       __btrfs_run_delayed_refs+0x75a/0x13c0 [btrfs]
       ? join_transaction+0x2b/0x460 [btrfs]
       btrfs_run_delayed_refs+0xf3/0x1c0 [btrfs]
       btrfs_commit_transaction+0x52/0xa50 [btrfs]
       ? start_transaction+0xa6/0x510 [btrfs]
       btrfs_sync_fs+0x79/0x1c0 [btrfs]
       sync_filesystem+0x70/0x90
       generic_shutdown_super+0x27/0x120
       kill_anon_super+0x12/0x30
       btrfs_kill_super+0x16/0xa0 [btrfs]
       deactivate_locked_super+0x43/0x70
       deactivate_super+0x40/0x60
       cleanup_mnt+0x3f/0x80
       __cleanup_mnt+0x12/0x20
       task_work_run+0x8b/0xc0
       exit_to_usermode_loop+0xce/0xd0
       do_syscall_64+0x20b/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      To fix this simply mark dead roots we read from disk as DEAD and then
      set the walk_control->restarted flag so we know we have a restarted
      deletion.  From here whenever we try to drop refs for blocks we check to
      verify our ref is set on them, and if it is not we skip it.  Once we
      find a ref that is set we unset walk_control->restarted since the tree
      should be in a normal state from then on, and any problems we run into
      from there are different issues.  I tested this with an existing broken
      fs and my reproducer that creates a broken fs and it fixed both file
      systems.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78c52d9e
  4. 25 2月, 2019 8 次提交
    • Q
      btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to... · 1418bae1
      Qu Wenruo 提交于
      btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
      
      [BUG]
      Btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.
      
      Resulting the final write success while it should fail due to EDQUOT,
      and the fs will have quota exceeding the limit by 16K.
      
      The simplified reproducer will be: (needs a 2G ram VM)
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite will not trigger EDQUOT and final 'qgroup show' will
      show something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than
        > 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commmit time, reserved
      space needs to be taken special care.
      
      One important part of reserved space is data, and for a new data extent
      written to disk, we still need to take the reserved space until
      rfer/excl numbers get updated.
      
      Originally when an ordered extent finishes, we migrate the reserved
      qgroup data space from extent_io tree to delayed ref head of the data
      extent, expecting delayed ref will only be cleaned up at commit
      transaction time.
      
      However for small RAM machine, due to memory pressure dirty pages can be
      flushed back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed 128M reserved qgroup data
      space for ino 258. Thus later write won't trigger EDQUOT.
      
      This allows us to write more data beyond qgroup limit.
      
      In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving reserved qgroup data space from btrfs_delayed_ref_head to
      btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
      space won't be freed half way before commit transaction, thus fix the
      problem.
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1418bae1
    • N
      btrfs: use WARN_ON in a canonical form btrfs_remove_block_group · 9a0ec83d
      Nikolay Borisov 提交于
      There is no point in using a construct like 'if (!condition)
      WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9a0ec83d
    • J
      btrfs: be more explicit about allowed flush states · 8a1bbe1d
      Josef Bacik 提交于
      For FLUSH_LIMIT flushers we really can only allocate chunks and flush
      delayed inode items, everything else is problematic.  I added a bunch of
      new states and it lead to weirdness in the FLUSH_LIMIT case because I
      forgot about how it worked.  So instead explicitly declare the states
      that are ok for flushing with FLUSH_LIMIT and use that for our state
      machine.  Then as we add new things that are safe we can just add them
      to this list.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8a1bbe1d
    • J
      btrfs: loop in inode_rsv_refill · 5df11363
      Josef Bacik 提交于
      With severe fragmentation we can end up with our inode rsv size being
      huge during writeout, which would cause us to need to make very large
      metadata reservations.
      
      However we may not actually need that much once writeout is complete,
      because of the over-reservation for the worst case.
      
      So instead try to make our reservation, and if we couldn't make it
      re-calculate our new reservation size and try again.  If our reservation
      size doesn't change between tries then we know we are actually out of
      space and can error. Flushing that could have been running in parallel
      did not make any space.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ rename to calc_refill_bytes, update comment and changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5df11363
    • J
      btrfs: don't enospc all tickets on flush failure · f91587e4
      Josef Bacik 提交于
      With the introduction of the per-inode block_rsv it became possible to
      have really really large reservation requests made because of data
      fragmentation.  Since the ticket stuff assumed that we'd always have
      relatively small reservation requests it just killed all tickets if we
      were unable to satisfy the current request.
      
      However, this is generally not the case anymore.  So fix this logic to
      instead see if we had a ticket that we were able to give some
      reservation to, and if we were continue the flushing loop again.
      
      Likewise we make the tickets use the space_info_add_old_bytes() method
      of returning what reservation they did receive in hopes that it could
      satisfy reservations down the line.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f91587e4
    • J
      btrfs: don't use global reserve for chunk allocation · 450114fc
      Josef Bacik 提交于
      We've done this forever because of the voodoo around knowing how much
      space we have.  However, we have better ways of doing this now, and on
      normal file systems we'll easily have a global reserve of 512MiB, and
      since metadata chunks are usually 1GiB that means we'll allocate
      metadata chunks more readily.  Instead use the actual used amount when
      determining if we need to allocate a chunk or not.
      
      This has a side effect for mixed block group fs'es where we are no
      longer allocating enough chunks for the data/metadata requirements.  To
      deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
      machine.  This will only get used if we've already made a full loop
      through the flushing machinery and tried committing the transaction.
      
      If we have then we can try and force a chunk allocation since we likely
      need it to make progress.  This resolves issues I was seeing with
      the mixed bg tests in xfstests without the new flushing state.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      450114fc
    • J
      btrfs: dump block_rsv details when dumping space info · b78e5616
      Josef Bacik 提交于
      For enospc_debug having the block rsvs is super helpful to see if we've
      done something wrong.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b78e5616
    • J
      btrfs: check if there are free block groups for commit · d89dbefb
      Josef Bacik 提交于
      may_commit_transaction will skip committing the transaction if we don't
      have enough pinned space or if we're trying to find space for a SYSTEM
      chunk.  However, if we have pending free block groups in this transaction
      we still want to commit as we may be able to allocate a chunk to make
      our reservation.  So instead of just returning ENOSPC, check if we have
      free block groups pending, and if so commit the transaction to allow us
      to use that free space.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d89dbefb