1. 05 5月, 2021 1 次提交
  2. 21 4月, 2021 1 次提交
    • J
      btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Johannes Thumshirn 提交于
      When a file gets deleted on a zoned file system, the space freed is not
      returned back into the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then on a transaction commit, the reclaim
      process is triggered but after we deleted unused block groups, which will
      free space for the relocation process.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      18bb8bbf
  3. 04 3月, 2021 1 次提交
  4. 23 2月, 2021 2 次提交
    • J
      btrfs: avoid double put of block group when emptying cluster · 95c85fba
      Josef Bacik 提交于
      It's wrong calling btrfs_put_block_group in
      __btrfs_return_cluster_to_free_space if the block group passed is
      different than the block group the cluster represents. As this means the
      cluster doesn't have a reference to the passed block group. This results
      in double put and a use-after-free bug.
      
      Fix this by simply bailing if the block group we passed in does not
      match the block group on the cluster.
      
      Fixes: fa9c0d79 ("Btrfs: rework allocation clustering")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      95c85fba
    • N
      btrfs: fix race between extent freeing/allocation when using bitmaps · 3c179165
      Nikolay Borisov 提交于
      During allocation the allocator will try to allocate an extent using
      cluster policy. Once the current cluster is exhausted it will remove the
      entry under btrfs_free_cluster::lock and subsequently acquire
      btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
      and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
      because there exists a race condition between removing the entry under
      one lock and doing the necessary accounting holding a different lock
      since extent freeing only uses the 2nd lock. This can result in the
      following situation:
      
      T1:                                    T2:
      btrfs_alloc_from_cluster               insert_into_bitmap <holds tree_lock>
       if (entry->bytes == 0)                   if (block_group && !list_empty(&block_group->cluster_list)) {
          rb_erase(entry)
      
       spin_unlock(&cluster->lock);
         (total_bitmaps is still 4)           spin_lock(&cluster->lock);
                                               <doesn't find entry in cluster->root>
       spin_lock(&ctl->tree_lock);             <goes to new_bitmap label, adds
      <blocked since T2 holds tree_lock>       <a new entry and calls add_new_bitmap>
      					    recalculate_thresholds  <crashes,
                                                    due to total_bitmaps
      					      becoming 5 and triggering
      					      an ASSERT>
      
      To fix this ensure that once depleted, the cluster entry is deleted when
      both cluster lock and tree locks are held in the allocator (T1), this
      ensures that even if there is a race with a concurrent
      insert_into_bitmap call it will correctly find the entry in the cluster
      and add the new space to it.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3c179165
  5. 09 2月, 2021 7 次提交
    • N
      btrfs: zoned: advance allocation pointer after tree log node · 011b41bf
      Naohiro Aota 提交于
      Since the allocation info of a tree log node is not recorded in the extent
      tree, calculate_alloc_pointer() cannot detect this node, so the pointer
      can be over a tree node.
      
      Replaying the log calls btrfs_remove_free_space() for each node in the
      log tree.
      
      So, advance the pointer after the node to not allocate over it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      011b41bf
    • N
      btrfs: zoned: implement sequential extent allocation · 2eda5708
      Naohiro Aota 提交于
      Implement a sequential extent allocator for zoned filesystems. This
      allocator only needs to check if there is enough space in the block group
      after the allocation pointer to satisfy the extent allocation request.
      Therefore the allocator never manages bitmaps or clusters. Also, add
      assertions to the corresponding functions.
      
      As zone append writing is used, it would be unnecessary to track the
      allocation offset, as the allocator only needs to check available space.
      But by tracking and returning the offset as an allocated region, we can
      skip modification of ordered extents and checksum information when there
      is no IO reordering.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2eda5708
    • N
      btrfs: zoned: track unusable bytes for zones · 169e0da9
      Naohiro Aota 提交于
      In a zoned filesystem a once written then freed region is not usable
      until the underlying zone has been reset. So we need to distinguish such
      unusable space from usable free space.
      
      Therefore we need to introduce the "zone_unusable" field to the block
      group structure, and "bytes_zone_unusable" to the space_info structure
      to track the unusable space.
      
      Pinned bytes are always reclaimed to the unusable space. But, when an
      allocated region is returned before using e.g., the block group becomes
      read-only between allocation time and reservation time, we can safely
      return the region to the block group. For the situation, this commit
      introduces "btrfs_add_free_space_unused". This behaves the same as
      btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
      rewinds the allocation offset.
      
      Because the read-only bytes tracks free but unusable bytes when the block
      group is read-only, we need to migrate the zone_unusable bytes to
      read-only bytes when a block group is marked read-only.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      169e0da9
    • Q
      btrfs: introduce btrfs_subpage for data inodes · 32443de3
      Qu Wenruo 提交于
      To support subpage sector size, data also need extra info to make sure
      which sectors in a page are uptodate/dirty/...
      
      This patch will make pages for data inodes get btrfs_subpage structure
      attached, and detached when the page is freed.
      
      This patch also slightly changes the timing when
      set_page_extent_mapped() is called to make sure:
      
      - We have page->mapping set
        page->mapping->host is used to grab btrfs_fs_info, thus we can only
        call this function after page is mapped to an inode.
      
        One call site attaches pages to inode manually, thus we have to modify
        the timing of set_page_extent_mapped() a bit.
      
      - As soon as possible, before other operations
        Since memory allocation can fail, we have to do extra error handling.
        Calling set_page_extent_mapped() as soon as possible can simply the
        error handling for several call sites.
      
      The idea is pretty much the same as iomap_page, but with more bitmaps
      for btrfs specific cases.
      
      Currently the plan is to switch iomap if iomap can provide sector
      aligned write back (only write back dirty sectors, but not the full
      page, data balance require this feature).
      
      So we will stick to btrfs specific bitmap for now.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      32443de3
    • N
      btrfs: improve parameter description for __btrfs_write_out_cache · f092cf3c
      Nikolay Borisov 提交于
      Fixes following W=1 warnings:
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache'
      fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache'
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f092cf3c
    • N
      btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid · 543068a2
      Nikolay Borisov 提交于
      This better reflects the semantics of the function i.e no search is
      performed whatsoever.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      543068a2
    • Z
      btrfs: clarify error returns values in __load_free_space_cache · 3cc64e7e
      Zhihao Cheng 提交于
      Return value in __load_free_space_cache is not properly set after
      (unlikely) memory allocation failures and 0 is returned instead.
      This is not a problem for the caller load_free_space_cache because only
      value 1 is considered as 'cache loaded' but for clarity it's better
      to set the errors accordingly.
      
      Fixes: a67509c3 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
      Reported-by: NHulk Robot <hulkci@huawei.com>
      Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3cc64e7e
  6. 10 12月, 2020 6 次提交
  7. 08 12月, 2020 5 次提交
  8. 07 10月, 2020 2 次提交
  9. 20 8月, 2020 1 次提交
    • F
      btrfs: fix space cache memory leak after transaction abort · bbc37d6e
      Filipe Manana 提交于
      If a transaction aborts it can cause a memory leak of the pages array of
      a block group's io_ctl structure. The following steps explain how that can
      happen:
      
      1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
         and it's about to start writing out dirty extent buffers;
      
      2) Transaction N + 1 already started and another task, task A, just called
         btrfs_commit_transaction() on it;
      
      3) Block group B was dirtied (extents allocated from it) by transaction
         N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
         very beginning of the transaction commit, it starts writeback for the
         block group's space cache by calling btrfs_write_out_cache(), which
         allocates the pages array for the block group's io_ctl with a call to
         io_ctl_init(). Block group A is added to the io_list of transaction
         N + 1 by btrfs_start_dirty_block_groups();
      
      4) While transaction N's commit is writing out the extent buffers, it gets
         an IO error and aborts transaction N, also setting the file system to
         RO mode;
      
      5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
         btrfs_commit_transaction() and has set transaction N + 1 state to
         TRANS_STATE_COMMIT_START. Immediately after that it checks that the
         filesystem was turned to RO mode, due to transaction N's abort, and
         jumps to the "cleanup_transaction" label. After that we end up at
         btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
         That helper finds block group B in the transaction's io_list but it
         never releases the pages array of the block group's io_ctl, resulting in
         a memory leak.
      
      In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
      array points to pages that were already released by us at
      __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
      up freeing the pages array only after waiting for the ordered extent to
      complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
      that. But in the transaction abort case we don't wait for the space cache's
      ordered extent to complete through a call to btrfs_wait_cache_io(), so
      that's why we end up with a memory leak - we wait for the ordered extent
      to complete indirectly by shutting down the work queues and waiting for
      any jobs in them to complete before returning from close_ctree().
      
      We can solve the leak simply by freeing the pages array right after
      releasing the pages (with the call to io_ctl_drop_pages()) at
      __btrfs_write_out_cache(), since we will never use it anymore after that
      and the pages array points to already released pages at that point, which
      is currently not a problem since no one will use it after that, but not a
      good practice anyway since it can easily lead to use-after-free issues.
      
      So fix this by freeing the pages array right after releasing the pages at
      __btrfs_write_out_cache().
      
      This issue can often be reproduced with test case generic/475 from fstests
      and kmemleak can detect it and reports it with the following trace:
      
      unreferenced object 0xffff9bbf009fa600 (size 512):
        comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
        hex dump (first 32 bytes):
          00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
          80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
        backtrace:
          [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
          [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
          [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
          [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
          [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
          [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
          [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
          [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
          [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
          [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
          [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
          [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
          [<0000000043ace2c9>] do_syscall_64+0x33/0x80
          [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bbc37d6e
  10. 11 8月, 2020 1 次提交
    • J
      btrfs: only search for left_info if there is no right_info in try_merge_free_space · bf53d468
      Josef Bacik 提交于
      In try_to_merge_free_space we attempt to find entries to the left and
      right of the entry we are adding to see if they can be merged.  We
      search for an entry past our current info (saved into right_info), and
      then if right_info exists and it has a rb_prev() we save the rb_prev()
      into left_info.
      
      However there's a slight problem in the case that we have a right_info,
      but no entry previous to that entry.  At that point we will search for
      an entry just before the info we're attempting to insert.  This will
      simply find right_info again, and assign it to left_info, making them
      both the same pointer.
      
      Now if right_info _can_ be merged with the range we're inserting, we'll
      add it to the info and free right_info.  However further down we'll
      access left_info, which was right_info, and thus get a use-after-free.
      
      Fix this by only searching for the left entry if we don't find a right
      entry at all.
      
      The CVE referenced had a specially crafted file system that could
      trigger this use-after-free. However with the tree checker improvements
      we no longer trigger the conditions for the UAF.  But the original
      conditions still apply, hence this fix.
      
      Reference: CVE-2019-19448
      Fixes: 96303081 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf53d468
  11. 27 7月, 2020 3 次提交
  12. 25 5月, 2020 5 次提交
    • F
      btrfs: turn space cache writeout failure messages into debug messages · bbcd1f4d
      Filipe Manana 提交于
      Since commit 1afb648e ("btrfs: use standard debug config option to
      enable free-space-cache debug prints"), we started to log error messages
      that were never logged before since there was no DEBUG macro defined
      anywhere. This started to make test case btrfs/187 to fail very often,
      as it greps for any btrfs error messages in dmesg/syslog and fails if
      any is found:
      
      (...)
      btrfs/186 1s ...  2s
      btrfs/187       - output mismatch (see .../results//btrfs/187.out.bad)
          \--- tests/btrfs/187.out     2019-05-17 12:48:32.537340749 +0100
          \+++ /home/fdmanana/git/hub/xfstests/results//btrfs/187.out.bad ...
          \@@ -1,3 +1,8 @@
           QA output created by 187
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap1'
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap2'
          +[268364.139958] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.156503] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.161703] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.253180] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          ...
          (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/187.out ...
      btrfs/188 4s ...  2s
      (...)
      
      The space cache write failures happen due to ENOSPC when attempting to
      update the free space cache items in the root tree. This happens because
      when starting or joining a transaction we don't know how many block
      groups we will end up changing (due to extent allocation or release) and
      therefore never reserve space for updating free space cache items.
      More often than not, the free space cache writeout succeeds since the
      metadata space info is not yet full nor very close to being full, but
      when it is, the space cache writeout fails with ENOSPC.
      
      Occasional failures to write space caches are not considered critical
      since they can be rebuilt when mounting the filesystem or the next
      attempt to write a free space cache in the next transaction commit might
      succeed, so we used to hide those error messages with a preprocessor
      check for the existence of the DEBUG macro that was never enabled
      anywhere.
      
      A few other generic test cases also trigger the error messages due to
      ENOSPC failure when writing free space caches as well, however they don't
      fail since they don't grep dmesg/syslog for any btrfs specific error
      messages.
      
      So change the messages from 'error' level to 'debug' level, as it doesn't
      make much sense to have error messages triggered only if the debug macro
      is enabled plus, more importantly, the error is not serious nor highly
      unexpected.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bbcd1f4d
    • F
      btrfs: include error on messages about failure to write space/inode caches · 2e69a7a6
      Filipe Manana 提交于
      Currently the error messages logged when we fail to write a free space
      cache or an inode cache are not very useful as they don't mention what
      was the error. So include the error number in the messages.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2e69a7a6
    • D
      btrfs: simplify iget helpers · 0202e83f
      David Sterba 提交于
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only inode number, renaming it to 'ino'
      instead of 'objectid'. This allows to remove temporary variables key,
      saving some stack space.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0202e83f
    • F
      btrfs: move the block group freeze/unfreeze helpers into block-group.c · 684b752b
      Filipe Manana 提交于
      The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
      used to be named btrfs_get_block_group_trimming() and
      btrfs_put_block_group_trimming() respectively.
      
      At the time they were added to free-space-cache.c, by commit e33e17ee
      ("btrfs: add missing discards when unpinning extents with -o discard")
      because all the trimming related functions were in free-space-cache.c.
      
      Now that the helpers were renamed and are used in scrub context as well,
      move them to block-group.c, a much more logical location for them.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      684b752b
    • F
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Filipe Manana 提交于
      Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group, so that if another task
      deletes the block group and then another task allocates a new block group
      that gets the same logical address and device extents while the trimming
      task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b7304af
  13. 24 3月, 2020 5 次提交