1. 28 2月, 2017 4 次提交
  2. 17 2月, 2017 2 次提交
  3. 14 2月, 2017 2 次提交
  4. 06 12月, 2016 5 次提交
  5. 30 11月, 2016 4 次提交
  6. 19 11月, 2016 2 次提交
    • F
      Btrfs: remove unused code when creating and merging reloc trees · 001895b3
      Filipe Manana 提交于
      In commit 5bc7247a (Btrfs: fix broken nocow after balance) we started
      abusing the rtransid and otransid fields of root items from relocation
      trees to fix some issues with nodatacow mode. However later in commit
      ba8b0289 (Btrfs: do not reset last_snapshot after relocation) we
      dropped the code that made use of those fields but did not remove
      the code that sets those fields.
      
      So just remove them to avoid confusion.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      001895b3
    • F
      Btrfs: fix relocation incorrectly dropping data references · 054570a1
      Filipe Manana 提交于
      During relocation of a data block group we create a relocation tree
      for each fs/subvol tree by making a snapshot of each tree using
      btrfs_copy_root() and the tree's commit root, and then setting the last
      snapshot field for the fs/subvol tree's root to the value of the current
      transaction id minus 1. However this can lead to relocation later
      dropping references that it did not create if we have qgroups enabled,
      leaving the filesystem in an inconsistent state that keeps aborting
      transactions.
      
      Lets consider the following example to explain the problem, which requires
      qgroups to be enabled.
      
      We are relocating data block group Y, we have a subvolume with id 258 that
      has a root at level 1, that subvolume is used to store directory entries
      for snapshots and we are currently at transaction 3404.
      
      When committing transaction 3404, we have a pending snapshot and therefore
      we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot()
      in order to create its dentry at subvolume 258. This results in COWing
      leaf A from root 258 in order to add the dentry. Note that leaf A
      also contains file extent items referring to extents from some other
      block group X (we are currently relocating block group Y). Later on, still
      at create_pending_snapshot() we call qgroup_account_snapshot(), which
      switches the commit root for root 258 when it calls switch_commit_roots(),
      so now the COWed version of leaf A, lets call it leaf A', is accessible
      from the commit root of tree 258. At the end of qgroup_account_snapshot(),
      we call record_root_in_trans() with 258 as its argument, which results
      in btrfs_init_reloc_root() being called, which in turn calls
      relocation.c:create_reloc_root() in order to create a relocation tree
      associated to root 258, which results in assigning the value of 3403
      (which is the current transaction id minus 1 = 3404 - 1) to the
      last_snapshot field of root 258. When creating the relocation tree root
      at ctree.c:btrfs_copy_root() we add a shared reference for leaf A',
      corresponding to the relocation tree's root, when we call btrfs_inc_ref()
      against the COWed root (a copy of the commit root from tree 258), which
      is at level 1. So at this point leaf A' has 2 references, one normal
      reference corresponding to root 258 and one shared reference corresponding
      to the root of the relocation tree.
      
      Transaction 3404 finishes its commit and transaction 3405 is started by
      relocation when calling merge_reloc_root() for the relocation tree
      associated to root 258. In the meanwhile leaf A' is COWed again, in
      response to some filesystem operation, when we are still at transaction
      3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we
      call btrfs_block_can_be_shared() in order to figure out if other trees
      refer to the leaf and if any such trees exists, add a full back reference
      to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false
      because the following condition is false:
      
        btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item)
      
      which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with
      only one reference, corresponding to the shared reference we created when
      we called btrfs_copy_root() to create the relocation tree's root and
      btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting
      the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not
      adding shared references for the extents from block group X that leaf A'
      refers to with its file extent items.
      
      Later, after merging the relocation root we do a call to to
      btrfs_drop_snapshot() in order to delete the relocation tree. This ends
      up calling do_walk_down() when path->slots[1] points to leaf A', which
      results in calling btrfs_lookup_extent_info() to get the number of
      references for leaf A', which is 1 at this time (only the shared reference
      exists) and this value is stored at wc->refs[0]. After this walk_up_proc()
      is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'.
      Because the current level is 0 and wc->refs[0] is 1, it does call
      btrfs_dec_ref() against leaf A', which results in removing the single
      references that the extents from block group X have which are associated
      to root 258 - the expectation was to have each of these extents with 2
      references - one reference for root 258 and one shared reference related
      to the root of the relocation tree, and so we would drop only the shared
      reference (because leaf A' was supposed to have the flag
      BTRFS_BLOCK_FLAG_FULL_BACKREF set).
      
      This leaves the filesystem in an inconsistent state as we now have file
      extent items in a subvolume tree that point to extents from block group X
      without references in the extent tree. So later on when we try to decrement
      the references for these extents, for example due to a file unlink operation,
      truncate operation or overwriting ranges of a file, we fail because the
      expected references do not exist in the extent tree.
      
      This leads to warnings and transaction aborts like the following:
      
      [  588.965795] ------------[ cut here ]------------
      [  588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
      [  588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
      parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
      sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
      [  588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1
      [  588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [  588.965845]  0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000
      [  588.965847]  0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000
      [  588.965848]  ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48
      [  588.965849] Call Trace:
      [  588.965852]  [<ffffffff813af542>] dump_stack+0x63/0x81
      [  588.965854]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
      [  588.965855]  [<ffffffff81081f7d>] warn_slowpath_null+0x1d/0x20
      [  588.965863]  [<ffffffffa0175042>] lookup_inline_extent_backref+0x432/0x5b0 [btrfs]
      [  588.965865]  [<ffffffff81143220>] ? trace_clock_local+0x10/0x30
      [  588.965867]  [<ffffffff8114c5df>] ? rb_reserve_next_event+0x6f/0x460
      [  588.965875]  [<ffffffffa0175215>] insert_inline_extent_backref+0x55/0xd0 [btrfs]
      [  588.965882]  [<ffffffffa017531f>] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs]
      [  588.965890]  [<ffffffffa017acea>] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs]
      [  588.965892]  [<ffffffff810cb046>] ? cpuacct_charge+0x86/0xa0
      [  588.965900]  [<ffffffffa017e74f>] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs]
      [  588.965908]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
      [  588.965918]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
      [  588.965928]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
      [  588.965930]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
      [  588.965931]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
      [  588.965932]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
      [  588.965934]  [<ffffffff810a1659>] kthread+0xc9/0xe0
      [  588.965936]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
      [  588.965937]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
      [  588.965938] ---[ end trace 34e5232c933a1749 ]---
      [  588.966187] ------------[ cut here ]------------
      [  588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
      [  588.966196] BTRFS: Transaction aborted (error -5)
      [  588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc
      parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea
      sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg
      [  588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G        W       4.7.3-3-default-fdm+ #1
      [  588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [  588.966217]  0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8
      [  588.966219]  0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000
      [  588.966220]  ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000
      [  588.966221] Call Trace:
      [  588.966223]  [<ffffffff813af542>] dump_stack+0x63/0x81
      [  588.966224]  [<ffffffff81081e8b>] __warn+0xcb/0xf0
      [  588.966225]  [<ffffffff81081eff>] warn_slowpath_fmt+0x4f/0x60
      [  588.966233]  [<ffffffffa017e93c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs]
      [  588.966241]  [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs]
      [  588.966250]  [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs]
      [  588.966259]  [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs]
      [  588.966260]  [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0
      [  588.966261]  [<ffffffff8109b658>] worker_thread+0x48/0x4e0
      [  588.966263]  [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0
      [  588.966264]  [<ffffffff810a1659>] kthread+0xc9/0xe0
      [  588.966265]  [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40
      [  588.966267]  [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170
      [  588.966268] ---[ end trace 34e5232c933a174a ]---
      [  588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure
      [  588.966270] BTRFS info (device sda2): forced readonly
      
      This was happening often on openSUSE and SLE systems using btrfs as the
      root filesystem (with its default layout where multiple subvolumes are
      used) where balance happens in the background triggered by a cron job and
      snapshots are automatically created before/after package installations,
      upgrades and removals. The issue could be triggered simply by running the
      following loop on the first system boot post installation:
      
        while true; do
           zypper -n in nfs-kernel-server
           zypper -n rm nfs-kernel-server
        done
      
      (If we were fast enough and made that loop before the cron job triggered
      a balance operation and the balance finished)
      
      So fix by setting the last_snapshot field of the root to the value of the
      generation of its commit root. Like this btrfs_block_can_be_shared()
      behaves correctly for the case where the relocation root is created during
      a transaction commit and for the case where it's created before a
      transaction commit.
      
      Fixes: 6426c7ad (btrfs: qgroup: Fix qgroup accounting when creating snapshot)
      Cc: stable@vger.kernel.org  # 4.7+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      054570a1
  7. 17 10月, 2016 1 次提交
    • L
      Btrfs: kill BUG_ON in do_relocation · 4547f4d8
      Liu Bo 提交于
      While updating btree, we try to push items between sibling
      nodes/leaves in order to keep height as low as possible.
      But we don't memset the original places with zero when
      pushing items so that we could end up leaving stale content
      in nodes/leaves.  One may read the above stale content by
      increasing btree blocks' @nritems.
      
      One case I've come across is that in fs tree, a leaf has two
      parent nodes, hence running balance ends up with processing
      this leaf with two parent nodes, but it can only reach the
      valid parent node through btrfs_search_slot, so it'd be like,
      
      do_relocation
          for P in all parent nodes of block A:
              if !P->eb:
                  btrfs_search_slot(key);   --> get path from P to A.
              if lowest:
                  BUG_ON(A->bytenr != bytenr of A recorded in P);
              btrfs_cow_block(P, A);   --> change A's bytenr in P.
      
      After btrfs_cow_block, P has the new bytenr of A, but with the
      same @key, we get the same path again, and get panic by BUG_ON.
      
      Note that this is only happening in a corrupted fs, for a
      regular fs in which we have correct @nritems so that we won't
      read stale content in any case.
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4547f4d8
  8. 27 9月, 2016 2 次提交
  9. 26 9月, 2016 3 次提交
  10. 01 9月, 2016 1 次提交
  11. 25 8月, 2016 3 次提交
    • W
      btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang 提交于
      This patch can fix some false ENOSPC errors, below test script can
      reproduce one false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      Above script will fail for ENOSPC reason, but indeed fs still has free
      space to satisfy this request. Please see call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      In CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
      operation will be delayed to be done in extent_clear_unlock_delalloc().
      Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
      btrfs_check_data_free_space() tries to reserve 100MB data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allcate new data chunk or call
      btrfs_start_delalloc_roots(), or commit current transaction in order to
      reserve some free space, obviously a lot of work. But indeed it's not
      necessary as long as decreasing bytes_may_use timely, we still have
      free space, decreasing 128M from bytes_may_use.
      
      To fix this issue, this patch chooses to update bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For compress path, real
      extent length may not be equal to file content length, so introduce a
      ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
      file content length. Then compress path can update bytes_may_use
      correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
      and RESERVE_FREE.
      
      As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
      run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
      PREALLOC, we also need to update bytes_may_use, but can not pass
      EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
      here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
      to update btrfs_space_info's bytes_may_use.
      
      Meanwhile __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both sucessful and failed
      path, btrfs_prealloc_file_range()'s callers does not need to call
      btrfs_free_reserved_data_space() any more.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      18513091
    • W
      btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() · dcb40c19
      Wang Xiaoguang 提交于
      In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
      wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
      which indeed are extent's bytenr. The correct value should be
      cluster->[start|end] minus block group's start bytenr.
      
      start bytenr   cluster->start
      |              |     extent      |   extent   | ...| extent |
      |----------------------------------------------------------------|
      |                block group reloc_inode                         |
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dcb40c19
    • Q
      btrfs: relocation: Fix leaking qgroups numbers on data extents · 62b99540
      Qu Wenruo 提交于
      This patch fixes a REGRESSION introduced in 4.2, caused by the big quota
      rework.
      
      When balancing data extents, qgroup will leak all its numbers for
      relocated data extents.
      
      The relocation is done in the following steps for data extents:
      1) Create data reloc tree and inode
      2) Copy all data extents to data reloc tree
         And commit transaction
      3) Create tree reloc tree(special snapshot) for any related subvolumes
      4) Replace file extent in tree reloc tree with new extents in data reloc
         tree
         And commit transaction
      5) Merge tree reloc tree with original fs, by swapping tree blocks
      
      For 1)~4), since tree reloc tree and data reloc tree doesn't count to
      qgroup, everything is OK.
      
      But for 5), the swapping of tree blocks will only info qgroup to track
      metadata extents.
      
      If metadata extents contain file extents, qgroup number for file extents
      will get lost, leading to corrupted qgroup accounting.
      
      The fix is, before commit transaction of step 5), manually info qgroup to
      track all file extents in data reloc tree.
      Since at commit transaction time, the tree swapping is done, and qgroup
      will account these data extents correctly.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Reported-by: NMark Fasheh <mfasheh@suse.de>
      Reported-by: NFilipe Manana <fdmanana@gmail.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      62b99540
  12. 26 7月, 2016 3 次提交
  13. 08 7月, 2016 3 次提交
    • J
      Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes · 8ca17f0f
      Josef Bacik 提交于
      We used to allow you to set FLUSH_ALL and then just wouldn't do things like
      commit transactions or wait on ordered extents if we noticed you were in a
      transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
      we've lost the ability to tell, and we could end up deadlocking.  So instead use
      FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
      we error out to preserve the previous behavior.  I've also added an ASSERT() to
      catch anybody else who tries to do this.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ca17f0f
    • J
      Btrfs: fill relocation block rsv after allocation · ac2fabac
      Josef Bacik 提交于
      Since we set the reloc control before we've reserved our space for relocation we
      could race with a root being dirtied and not actually have space to do our init
      reloc root.  So once we've allocated it and set it up go ahead and make our
      reservation before setting the relocate control, that way anybody who tries to
      do the reloc root init has space to use.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac2fabac
    • J
      Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik 提交于
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency sake.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      25d609f8
  14. 26 5月, 2016 1 次提交
  15. 13 5月, 2016 3 次提交
    • F
      Btrfs: fix race between block group relocation and nocow writes · f78c436c
      Filipe Manana 提交于
      Relocation of a block group waits for all existing tasks flushing
      dellaloc, starting direct IO writes and any ordered extents before
      starting the relocation process. However for direct IO writes that end
      up doing nocow (inode either has the flag nodatacow set or the write is
      against a prealloc extent) we have a short time window that allows for a
      race that makes relocation proceed without waiting for the direct IO
      write to complete first, resulting in data loss after the relocation
      finishes. This is illustrated by the following diagram:
      
                 CPU 1                                     CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                     direct IO write starts against
                                                     an extent in block group X
                                                     using nocow mode (inode has the
                                                     nodatacow flag or the write is
                                                     for a prealloc extent)
      
                                                     btrfs_direct_IO()
                                                       btrfs_get_blocks_direct()
                                                         --> can_nocow_extent() returns 1
      
         btrfs_inc_block_group_ro(bg X)
           --> turns block group into RO mode
      
         btrfs_wait_ordered_roots()
           --> returns and does not know about
               the DIO write happening at CPU 2
               (the task there has not created
                yet an ordered extent)
      
         relocate_block_group(bg X)
           --> rc->stage == MOVE_DATA_EXTENTS
      
           find_next_extent()
             --> returns extent that the DIO
                 write is going to write to
      
           relocate_data_extent()
      
             relocate_file_extent_cluster()
      
               --> reads the extent from disk into
                   pages belonging to the relocation
                   inode and dirties them
      
                                                         --> creates DIO ordered extent
      
                                                       btrfs_submit_direct()
                                                         --> submits bio against a location
                                                             on disk obtained from an extent
                                                             map before the relocation started
      
         btrfs_wait_ordered_range()
           --> writes all the pages read before
               to disk (belonging to the
               relocation inode)
      
         relocation finishes
      
                                                       bio completes and wrote new data
                                                       to the old location of the block
                                                       group
      
      So fix this by tracking the number of nocow writers for a block group and
      make sure relocation waits for that number to go down to 0 before starting
      to move the extents.
      
      The same race can also happen with buffered writes in nocow mode since the
      patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
      when relocating", because we are no longer flushing all delalloc which
      served as a synchonization mechanism (due to page locking) and ensured
      the ordered extents for nocow buffered writes were created before we
      called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
      mode existed before that patch (no pages are locked or used during direct
      IO) and that fixed only races with direct IO writes that do cow.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      f78c436c
    • F
      Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Filipe Manana 提交于
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      ranges).
      
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                 fs_info->ordered_roots
      
                                                 btrfs_direct_IO()
                                                   __blockdev_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       btrfs_lock_extent_direct()
                                                         locks range in the io tree
                                                       btrfs_new_extent_direct()
                                                         btrfs_reserve_extent()
                                                           --> extent allocated
                                                               from bg X
      
         btrfs_inc_block_group_ro(bg X)
      
         btrfs_start_delalloc_roots()
           __start_delalloc_inodes()
             --> does nothing, no dealloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
                 fs_info->delalloc_roots
      
         btrfs_wait_ordered_roots()
           --> does not find the inode's root in the
               list fs_info->ordered_roots
      
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
      
         relocate_block_group(rc->stage ==
           MOVE_DATA_EXTENTS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
      
           iterates the extent tree, using its
           commit root and moves extents into new
           locations
      
                                                         btrfs_add_ordered_extent_dio()
                                                           --> now a ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
      
           btrfs_commit_transaction()
      
                                                         btrfs_finish_ordered_io()
                                                           btrfs_alloc_reserved_file_extent()
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
      
         relocate_block_group(rc->stage ==
           UPDATE_DATA_PTRS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
      
           sees the extent in the extent tree
      
      When this happens the relocation produces the following warning when it
      finishes:
      
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      9cfa3e34
    • F
      Btrfs: don't wait for unrelated IO to finish before relocation · 578def7c
      Filipe Manana 提交于
      Before the relocation process of a block group starts, it sets the block
      group to readonly mode, then flushes all delalloc writes and then finally
      it waits for all ordered extents to complete. This last step includes
      waiting for ordered extents destinated at extents allocated in other block
      groups, making us waste unecessary time.
      
      So improve this by waiting only for ordered extents that fall into the
      block group's range.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      578def7c
  16. 29 4月, 2016 1 次提交