1. 03 Jan 2022, 7 commits
  2. 16 Dec 2021, 1 commit
    • btrfs: fix invalid delayed ref after subvolume creation failure · 7a163608
      Filipe Manana committed
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the new root's root item into the root tree, we are freeing the
      metadata extent we reserved for the new root to prevent a metadata
      extent leak, as we don't abort the transaction at that point (since
      there is nothing at that point that is irreversible).
      
      However we allocated the metadata extent for the new root which we are
      creating for the new subvolume, so its delayed reference refers to the
      ID of this new root. But when we free the metadata extent we pass the
      root of the subvolume where the new subvolume is located to
      btrfs_free_tree_block() - this is incorrect because this will generate
      a delayed reference that refers to the ID of the parent subvolume's root,
      and not to the ID of the new root.
      
      This results in a failure when running delayed references that leads to
      a transaction abort and a trace like the following:
      
      [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
      [3868.739857] Code: 68 0f 85 e6 fb ff (...)
      [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
      [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
      [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
      [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
      [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
      [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
      [3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
      [3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
      [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [3868.757803] Call Trace:
      [3868.758281]  <TASK>
      [3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
      [3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
      [3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
      [3868.762069]  ? lock_acquired+0x19f/0x420
      [3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
      [3868.763860]  ? _raw_spin_unlock+0x29/0x40
      [3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
      [3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
      [3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [3868.767669]  ? preempt_count_add+0x49/0xa0
      [3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [3868.769639]  ? _copy_from_user+0x66/0xa0
      [3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [3868.772364]  ? __slab_free+0x10a/0x360
      [3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.774121]  ? lock_release+0x223/0x4a0
      [3868.774863]  ? lock_acquired+0x19f/0x420
      [3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
      [3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [3868.778280]  ? kmem_cache_free+0x321/0x3c0
      [3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
      [3868.779718]  __x64_sys_ioctl+0x83/0xb0
      [3868.780387]  do_syscall_64+0x3b/0xc0
      [3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [3868.781953] RIP: 0033:0x7f281c59e957
      [3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
      [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
      [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
      [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
      [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
      [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
      [3868.793765]  </TASK>
      [3868.794037] irq event stamp: 0
      [3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [3868.799284] ---[ end trace be24c7002fe27747 ]---
      [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
      [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
      [3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
      [3868.802863]          extent refs 1 gen 1265 flags 2
      [3868.803447]          ref#0: tree block backref root 1610
      (...)
      [3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
      [3869.065421]          extent refs 1 gen 1268 flags 2
      [3869.066115]          ref#0: tree block backref root 1689
      (...)
      [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
      [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
      [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry
      
      Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
      This requires changing the root argument of btrfs_free_tree_block() from
      struct btrfs_root * to a u64, since at this point during the subvolume
      creation we have not yet created the struct btrfs_root for the new
      subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
      else from a struct btrfs_root.
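
      A rough sketch of the shape of that change follows (parameter names are
      illustrative, not necessarily the exact upstream declaration):

        /* The root is now identified by a plain ID instead of a struct: */
        void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
                                   u64 root_id, /* was: struct btrfs_root *root */
                                   struct extent_buffer *buf,
                                   u64 parent, int last_ref);

        /* In the create_subvol() error path the ID of the new subvolume is
         * passed, so the delayed ref matches the extent that was allocated
         * for the new root rather than the parent subvolume's root. */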
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7a163608
  3. 29 Oct 2021, 1 commit
  4. 27 Oct 2021, 10 commits
    • btrfs: make btrfs_super_block size match BTRFS_SUPER_INFO_SIZE · 38732474
      Qu Wenruo committed
      It's a common practice to avoid using sizeof(struct btrfs_super_block)
      (3531) and to use BTRFS_SUPER_INFO_SIZE (4096) instead.
      
      The problem is that, sizeof(struct btrfs_super_block) doesn't match
      BTRFS_SUPER_INFO_SIZE from the very beginning.
      
      Furthermore, all call sites except the selftests already allocate
      BTRFS_SUPER_INFO_SIZE for the super block, so there isn't any real reason
      to use the smaller value, and it doesn't really save any space.
      
      So let's get rid of such confusing behavior, and unify those two values.
      
      This modification also adds a new static_assert() to verify the size,
      and moves the BTRFS_SUPER_INFO_* macros to the definition of
      btrfs_super_block for the static_assert().
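
      A minimal sketch of the kind of assertion added (the exact placement and
      the BTRFS_SUPER_INFO_OFFSET value are assumptions):

        #define BTRFS_SUPER_INFO_OFFSET  SZ_64K
        #define BTRFS_SUPER_INFO_SIZE    4096

        /* Catch any accidental change of the on-disk size at compile time. */
        static_assert(sizeof(struct btrfs_super_block) == BTRFS_SUPER_INFO_SIZE);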
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38732474
    • btrfs: add a BTRFS_FS_ERROR helper · 84961539
      Josef Bacik committed
      We have a few flags that are inconsistently used to describe the fs in
      different states of failure.  As of 5963ffca ("btrfs: always abort
      the transaction if we abort a trans handle") we will always set
      BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED
      and ERROR to see if things have gone wrong.  Add a helper to check
      BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to
      use the helper.
      
      The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up
      resources during umount after trans is aborted") but is not actually
      specific.
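
      Roughly, the helper wraps the existing bit test (a sketch, not
      necessarily the exact upstream definition):

        #define BTRFS_FS_ERROR(fs_info)                                  \
                (unlikely(test_bit(BTRFS_FS_STATE_ERROR,                 \
                                   &(fs_info)->fs_state)))

        /* Call sites then read: */
        if (BTRFS_FS_ERROR(fs_info))
                return -EROFS;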
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      84961539
    • btrfs: remove unused function btrfs_bio_fits_in_stripe() · 6aabd858
      Qu Wenruo committed
      As the last caller in compression.c has been removed, we don't need that
      function anymore.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6aabd858
    • btrfs: unexport setup_items_for_insert() · f0641656
      Filipe Manana committed
      Since setup_items_for_insert() is not used anymore outside of ctree.c,
      make it static and remove its prototype from ctree.h. This also requires
      moving the definition of setup_item_for_insert() from ctree.h to ctree.c
      and moving btrfs_duplicate_item() down so that it's defined after
      setup_items_for_insert().
      
      Further, since setup_item_for_insert() is used outside ctree.c, rename it
      to btrfs_setup_item_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 2/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f0641656
    • btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Filipe Manana committed
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
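
      A sketch of the kind of batch descriptor described above (field names
      are illustrative):

        struct btrfs_item_batch {
                const struct btrfs_key *keys; /* keys for all items in the batch */
                const u32 *data_sizes;        /* data size of each item */
                u32 total_data_size;          /* sum of data_sizes[], precomputed */
                int nr;                       /* number of items */
        };

      The caller fills the arrays and the total in a single pass and hands the
      structure to btrfs_insert_empty_items(), so the sum is never recomputed.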
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b7ef5f3a
    • btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo committed
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs-specific info for a logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3a3b19b
    • btrfs: zoned: add a dedicated data relocation block group · c2707a25
      Johannes Thumshirn committed
      Relocation in a zoned filesystem can fail with a transaction abort with
      error -22 (EINVAL). This happens because the relocation code assumes that
      the extents we relocate the data to have the same size as the source
      extents, and ensures this by preallocating the extents.
      
      But in a zoned filesystem we currently can't preallocate the extents as
      this would break the sequential write required rule. Therefore it can
      happen that the writeback process kicks in while we're still adding pages
      to a delalloc range and starts writing out dirty pages.
      
      This then creates destination extents that are smaller than the source
      extents, triggering the following safety check in get_new_location():
      
       1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
       1035                 ret = -EINVAL;
       1036                 goto out;
       1037         }
      
      Temporarily create a dedicated block group for the relocation process, so
      no non-relocation data writes can interfere with the relocation writes.
      
      This is needed so that we can switch the relocation process on a zoned
      filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
      like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.
      
      Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem")
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c2707a25
    • btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Johannes Thumshirn committed
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
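
      A minimal sketch of such a helper, assuming the usual root objectid
      comparison (the exact upstream form may differ):

        static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
        {
                return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
        }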
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      37f00a6d
    • btrfs: zoned: implement active zone tracking · afba2bc0
      Naohiro Aota committed
      Add a zone_is_active flag to btrfs_block_group. This flag indicates that
      the underlying zones are all active. Such zone-active block groups are
      tracked by fs_info->active_bg_list.
      
      btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
      device part. They set/clear the bitmap to indicate zone activeness and
      count the number of zones left that we can activate.
      
      btrfs_zone_{activate,finish}() take responsibility for the logical part and
      the list management. In addition, btrfs_zone_finish() waits for any writes
      to the zone and sends REQ_OP_ZONE_FINISH to it.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      afba2bc0
    • btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file() · 1ccc2e8a
      Qu Wenruo committed
      Currently btrfs_defrag_file() accepts both "struct inode" and "struct
      file" as parameter.  We can easily grab "struct inode" from "struct
      file" using file_inode() helper.
      
      The reason why we need "struct file" is just to re-use its f_ra.
      
      Change this to pass a "struct file_ra_state" parameter, so that it's
      clearer what we really want.  Since we're here, also add some comments to
      the function btrfs_defrag_file().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1ccc2e8a
  5. 26 Oct 2021, 1 commit
  6. 08 Oct 2021, 1 commit
    • btrfs: unify lookup return value when dir entry is missing · 8dcbc261
      Filipe Manana committed
      btrfs_lookup_dir_index_item() and btrfs_lookup_dir_item() lookup for dir
      entries and both are used during log replay or when updating a log tree
      during an unlink.
      
      However when the dir item does not exist, btrfs_lookup_dir_item() returns
      NULL while btrfs_lookup_dir_index_item() returns PTR_ERR(-ENOENT), and if
      the dir item exists but there is no matching entry for a given name or
      index, both return NULL. This makes the call sites during log replay
      more verbose than necessary and makes it easy to miss this slight
      difference. Since we don't need to distinguish between those two cases,
      make btrfs_lookup_dir_index_item() always return NULL when there is no
      matching directory entry - either because there isn't any dir entry or
      because there is one but it does not match the given name and index.
      
      Also rename the argument 'objectid' of btrfs_lookup_dir_index_item() to
      'index' since it is supposed to match an index number, and the name
      'objectid' is not very good because it can easily be confused with an
      inode number (like the inode number a dir entry points to).
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8dcbc261
  7. 23 Aug 2021, 9 commits
    • btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls · 4d4340c9
      Christian Brauner committed
      Creating subvolumes and snapshots is one of the core features of btrfs
      and is even available to unprivileged users. Make it possible to use
      subvolume and snapshot creation on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d4340c9
    • btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza committed
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want btrfs_search_slot() to
      position us at the last item with the same objectid and type.

      Once we are in this position, it's a matter of searching backwards by
      calling btrfs_previous_item(), which will check whether we need to go to
      a previous leaf, among other necessary checks, to make sure we end up at
      the last offset of the same objectid and type.

      The new btrfs_search_backwards() function does all these steps when
      necessary, and can be used to avoid code duplication.
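
      A rough sketch of the open-coded pattern the helper encapsulates
      (illustrative only, not the exact upstream implementation):

        key.objectid = objectid;
        key.type = type;
        key.offset = (u64)-1;   /* start past the last possible offset */

        ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
        if (ret < 0)
                return ret;
        if (ret > 0) {
                /* Step back to the last item with this objectid/type,
                 * crossing into the previous leaf if necessary. */
                ret = btrfs_previous_item(root, path, objectid, type);
                if (ret)
                        return ret;
        }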
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0ff40a91
    • btrfs: initial fsverity support · 14605409
      Boris Burkov committed
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verity pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then roll back to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      14605409
    • btrfs: add ro compat flags to inodes · 77eea05e
      Boris Burkov committed
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32-bit two's complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
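
      The effect can be reproduced with a few lines of standalone userspace C
      (an illustration, not btrfs code; 1 << 31 is formally undefined in ISO C,
      this simply shows the two's complement behaviour described above):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                int mask = (1 << 31) | 0xabc;   /* negative int: 0x80000abc */
                uint64_t mask64 = mask;         /* sign-extends: 0xffffffff80000abc */
                uint64_t flags = 1ULL << 40;    /* a hypothetical unknown upper flag */

                /* The upper bits of ~mask64 are all zero, so the flag is missed: */
                printf("%#llx\n", (unsigned long long)(flags & ~mask64));  /* 0 */

                /* With an unsigned 32-bit mask the unknown bit is detected: */
                uint64_t umask64 = (uint32_t)mask;
                printf("%#llx\n", (unsigned long long)(flags & ~umask64)); /* 0x10000000000 */
                return 0;
        }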
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum itself being corrupted in just the
      right way).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not sign-extend ones into the
      upper bits (at least for two's complement).
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      77eea05e
    • btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Josef Bacik committed
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_used.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of an 8 way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
      
      Fix this two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes so our math made more sense, now it's
      just completely unrelated to what we're actually doing.
      
      Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      03fe78cc
    • btrfs: make btrfs_next_leaf static inline · 809d6902
      David Sterba committed
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf, so move it
      to the header to avoid the function call overhead.
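
      The wrapper is essentially a one-liner (a sketch; passing 0 as the
      time_seq for "current version" is an assumption based on the description):

        static inline int btrfs_next_leaf(struct btrfs_root *root,
                                          struct btrfs_path *path)
        {
                return btrfs_next_old_leaf(root, path, 0);
        }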
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      809d6902
    • btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered · 25c1252a
      David Sterba committed
      The uptodate parameter should be bool, change the type.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25c1252a
    • btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() · a129ffb8
      Qu Wenruo committed
      Since commit d75855b4 ("btrfs: Remove
      extent_io_ops::writepage_start_hook") removed the writepage_start_hook()
      and added the btrfs_writepage_cow_fixup() function, there is no need to
      keep the old hook parameters.

      Remove the @start and @end parameters, since the fixup check is currently
      a full page check and doesn't need them.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a129ffb8
    • btrfs: zoned: remove max_zone_append_size logic · 5a80d1c6
      Johannes Thumshirn committed
      There used to be a patch in the original series for zoned support which
      limited the extent size to max_zone_append_size, but this patch has been
      dropped somewhere around v9.
      
      We've decided to go the opposite direction, instead of limiting extents
      in the first place we split them before submission to comply with the
      device's limits.
      
      Remove the related code, btrfs_fs_info::max_zone_append_size and
      btrfs_zoned_device_info::max_zone_append_size.
      
      This also removes the workaround for dm-crypt introduced in
      1d68128c ("btrfs: zoned: fail mount if the device does not support
      zone append") because the fix has been merged as f34ee1dc ("dm
      crypt: Fix zoned block device support").
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a80d1c6
  8. 19 Aug 2021, 1 commit
  9. 23 Jun 2021, 1 commit
  10. 22 Jun 2021, 5 commits
    • btrfs: rip out may_commit_transaction · c416a30c
      Josef Bacik committed
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this, if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c416a30c
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana committed
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      This also adds a spinlock used exclusively for the mutual exclusion between
      send and relocation; before, fs_info->balance_mutex was used, which
      would make an attempt to run send block waiting for balance to
      finish, which can take a lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1cea5cf0
    • btrfs: shorten integrity checker extent data mount option · cbeaae4f
      David Sterba committed
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cbeaae4f
    • btrfs: switch mount option bits to enums and use wider type · ccd9395b
      David Sterba committed
      Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
      recorded in the debugging information for convenience).
      
      There are two more things done but separating them would not make much
      sense as it's touching the same lines:
      
      - Renumber shifts 18..31 to 17..30 to get rid of the hole in the
        sequence.
      
      - Use 1UL as the value that gets shifted because we're approaching the
        32bit limit and due to integer promotions the value of (1 << 31)
        becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
        manipulating helpers).
      
        This is not causing any problems yet as the operations are in-memory
        and masking the 31st bit works, we don't have more than 31 bits so the
        ill effects of not masking higher bits don't happen. But once we have
        more, the problems will emerge.
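
      Schematically, the new form looks like this (names abbreviated, values
      illustrative):

        enum {
                BTRFS_MOUNT_NODATASUM = (1UL << 0),
                BTRFS_MOUNT_NODATACOW = (1UL << 1),
                /* ... shifts now run contiguously up to (1UL << 30) ... */
        };

      Because the shifted constant is already unsigned long, it never goes
      through a signed int promotion, so the values stay correct once bit 31
      and beyond are eventually needed.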
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ccd9395b
    • btrfs: fix typos in comments · 1a9fd417
      David Sterba committed
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a9fd417
  11. 21 Jun 2021, 3 commits