1. 20 August 2020, 1 commit
• btrfs: fix space cache memory leak after transaction abort · bbc37d6e
  Committed by Filipe Manana
If a transaction aborts, it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
      
      1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
         and it's about to start writing out dirty extent buffers;
      
      2) Transaction N + 1 already started and another task, task A, just called
         btrfs_commit_transaction() on it;
      
      3) Block group B was dirtied (extents allocated from it) by transaction
         N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
         very beginning of the transaction commit, it starts writeback for the
         block group's space cache by calling btrfs_write_out_cache(), which
         allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group B is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
      
      4) While transaction N's commit is writing out the extent buffers, it gets
         an IO error and aborts transaction N, also setting the file system to
         RO mode;
      
      5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
         btrfs_commit_transaction() and has set transaction N + 1 state to
         TRANS_STATE_COMMIT_START. Immediately after that it checks that the
         filesystem was turned to RO mode, due to transaction N's abort, and
         jumps to the "cleanup_transaction" label. After that we end up at
         btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
         That helper finds block group B in the transaction's io_list but it
         never releases the pages array of the block group's io_ctl, resulting in
         a memory leak.
      
      In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
      array points to pages that were already released by us at
      __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
      up freeing the pages array only after waiting for the ordered extent to
      complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
      that. But in the transaction abort case we don't wait for the space cache's
      ordered extent to complete through a call to btrfs_wait_cache_io(), so
      that's why we end up with a memory leak - we wait for the ordered extent
      to complete indirectly by shutting down the work queues and waiting for
      any jobs in them to complete before returning from close_ctree().
      
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we never use the array again after that
point. Keeping it around is also bad practice, since it then points to
already released pages; nothing dereferences it today, but that can
easily lead to use-after-free issues later.

So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
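
A minimal sketch of the fix's shape, using the helpers named above (the
surrounding function body is elided):

    /* in __btrfs_write_out_cache(), once writeback has been started */
    io_ctl_drop_pages(io_ctl);   /* unlock and release the pages */
    io_ctl_free(io_ctl);         /* free the pages array right away: it is
                                    never used again, and freeing it here
                                    means nothing leaks if the transaction
                                    later aborts */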
      
      This issue can often be reproduced with test case generic/475 from fstests
      and kmemleak can detect it and reports it with the following trace:
      
      unreferenced object 0xffff9bbf009fa600 (size 512):
        comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
        hex dump (first 32 bytes):
          00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
          80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
        backtrace:
          [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
          [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
          [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
          [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
          [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
          [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
          [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
          [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
          [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
          [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
          [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
          [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
          [<0000000043ace2c9>] do_syscall_64+0x33/0x80
          [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2. 11 August 2020, 1 commit
• btrfs: only search for left_info if there is no right_info in try_merge_free_space · bf53d468
  Committed by Josef Bacik
In try_merge_free_space() we attempt to find entries to the left and
right of the entry we are adding, to see if they can be merged.  We
search for an entry past our current info (saved into right_info), and
then, if right_info exists and it has a rb_prev(), we save the rb_prev()
into left_info.
      
      However there's a slight problem in the case that we have a right_info,
      but no entry previous to that entry.  At that point we will search for
      an entry just before the info we're attempting to insert.  This will
      simply find right_info again, and assign it to left_info, making them
      both the same pointer.
      
      Now if right_info _can_ be merged with the range we're inserting, we'll
      add it to the info and free right_info.  However further down we'll
      access left_info, which was right_info, and thus get a use-after-free.
      
      Fix this by only searching for the left entry if we don't find a right
      entry at all.
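
A sketch of the guarded lookup matching the description above
(tree_search_offset() and the offset_index rb-tree member are the ones
used by free-space-cache.c):

    right_info = tree_search_offset(ctl, offset + bytes, 0, 0);
    if (right_info && rb_prev(&right_info->offset_index))
            left_info = rb_entry(rb_prev(&right_info->offset_index),
                                 struct btrfs_free_space, offset_index);
    else if (!right_info)
            /* only look left when there is no right neighbor at all */
            left_info = tree_search_offset(ctl, offset - 1, 0, 0);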
      
      The CVE referenced had a specially crafted file system that could
      trigger this use-after-free. However with the tree checker improvements
      we no longer trigger the conditions for the UAF.  But the original
      conditions still apply, hence this fix.
      
      Reference: CVE-2019-19448
      Fixes: 96303081 ("Btrfs: use hybrid extents+bitmap rb tree for free space")
      CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
3. 27 July 2020, 3 commits
4. 25 May 2020, 5 commits
• btrfs: turn space cache writeout failure messages into debug messages · bbcd1f4d
  Committed by Filipe Manana
Since commit 1afb648e ("btrfs: use standard debug config option to
enable free-space-cache debug prints"), we started to log error messages
that were never logged before, since previously there was no DEBUG macro
defined anywhere. This caused test case btrfs/187 to fail very often,
as it greps for any btrfs error messages in dmesg/syslog and fails if
any are found:
      
      (...)
      btrfs/186 1s ...  2s
      btrfs/187       - output mismatch (see .../results//btrfs/187.out.bad)
    --- tests/btrfs/187.out     2019-05-17 12:48:32.537340749 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/187.out.bad ...
    @@ -1,3 +1,8 @@
           QA output created by 187
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap1'
           Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/snap2'
          +[268364.139958] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.156503] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.161703] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          +[268380.253180] BTRFS error (device sdc): failed to write free space cache for block group 30408704
          ...
          (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/187.out ...
      btrfs/188 4s ...  2s
      (...)
      
      The space cache write failures happen due to ENOSPC when attempting to
      update the free space cache items in the root tree. This happens because
      when starting or joining a transaction we don't know how many block
      groups we will end up changing (due to extent allocation or release) and
      therefore never reserve space for updating free space cache items.
      More often than not, the free space cache writeout succeeds since the
      metadata space info is not yet full nor very close to being full, but
      when it is, the space cache writeout fails with ENOSPC.
      
Occasional failures to write space caches are not considered critical,
since the caches can be rebuilt when mounting the filesystem, or the next
attempt to write the free space cache in the next transaction commit
might succeed. That is why we used to hide those error messages behind a
preprocessor check for the existence of the DEBUG macro, which was never
enabled anywhere.
      
      A few other generic test cases also trigger the error messages due to
      ENOSPC failure when writing free space caches as well, however they don't
      fail since they don't grep dmesg/syslog for any btrfs specific error
      messages.
      
So change the messages from 'error' level to 'debug' level, as it doesn't
make much sense to have error messages that are only triggered when the
debug macro is enabled and, more importantly, the error is neither
serious nor highly unexpected.
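
As a sketch, the message quoted in the test output above is now emitted
through btrfs_debug() instead of btrfs_err():

    btrfs_debug(fs_info,
                "failed to write free space cache for block group %llu",
                block_group->start);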
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: include error on messages about failure to write space/inode caches · 2e69a7a6
  Committed by Filipe Manana
Currently the error messages logged when we fail to write a free space
cache or an inode cache are not very useful, as they don't mention what
the error was. So include the error number in the messages.
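
For illustration, a sketch of the resulting message after this and the
previous change (the exact final wording is not quoted here; 'ret' stands
for the captured error):

    btrfs_debug(fs_info,
                "failed to write free space cache for block group %llu error %d",
                block_group->start, ret);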
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: simplify iget helpers · 0202e83f
  Committed by David Sterba
The inode lookup starting at btrfs_iget takes the full location key,
while only the objectid is used to match the inode, because the lookup
happens inside the given root and thus the inode number is unique.
The entire location key is properly set up in btrfs_init_locked_inode.

Simplify the helpers and pass only the inode number, renaming it to 'ino'
instead of 'objectid'. This allows us to remove the temporary key
variables, saving some stack space.
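
A sketch of the simplified entry point (details hedged; the key point is
that only the inode number is passed):

    /* look up by inode number; the full location key is filled in later
       by btrfs_init_locked_inode() */
    struct inode *btrfs_iget(struct super_block *s, u64 ino,
                             struct btrfs_root *root);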
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: move the block group freeze/unfreeze helpers into block-group.c · 684b752b
  Committed by Filipe Manana
The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group()
used to be named btrfs_get_block_group_trimming() and
btrfs_put_block_group_trimming() respectively.

They were added to free-space-cache.c by commit e33e17ee ("btrfs: add
missing discards when unpinning extents with -o discard") because, at the
time, all the trimming related functions lived in free-space-cache.c.

Now that the helpers have been renamed and are used in the scrub context
as well, move them to block-group.c, a much more logical location for
them.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
  Committed by Filipe Manana
Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming
and block group remove/allocation"), I added the 'trimming' member to the
block group structure. Its purpose was to prevent races between trimming
and block group deletion/allocation by pinning the block group in a way
that prevents its logical address and device extents from being reused
while trimming is in progress. Otherwise, another task could delete the
block group and yet another task could allocate a new block group that
gets the same logical address and device extents while the trimming task
is still running.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).
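
A sketch of the renamed member (the comment wording is ours, not
verbatim from the patch):

    struct btrfs_block_group {
            /* ... other members elided ... */

            /*
             * Number of tasks (trimming or scrub) that require the block
             * group to stay pinned: while this is non-zero, the block
             * group's logical address and device extents must not be
             * reused.
             */
            atomic_t frozen;
    };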
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
5. 24 March 2020, 6 commits
6. 20 January 2020, 14 commits
• btrfs: ensure removal of discardable_* in free_bitmap() · 27f0afc7
  Committed by Dennis Zhou
Most callers of free_bitmap() only call it if bitmap_info->bytes is 0.
However, there are certain cases where we may free the free space cache
via __btrfs_remove_free_space_cache(). This exposes a path where
free_bitmap() is called regardless. This may result in bad accounting of
discardable_bytes and discardable_extents. So, drop the bitmap's
contribution from those stats and call btrfs_discard_update_discardable().
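
A sketch of the adjusted helper, using the counter names from this
series (exact code may differ in detail):

    static void free_bitmap(struct btrfs_free_space_ctl *ctl,
                            struct btrfs_free_space *bitmap_info)
    {
            struct btrfs_block_group *block_group = ctl->private;

            /* drop this bitmap's contribution even when bytes != 0 */
            if (bitmap_info->bytes && !btrfs_free_space_trimmed(bitmap_info)) {
                    ctl->discardable_extents[BTRFS_STAT_CURR] -=
                            bitmap_info->bitmap_extents;
                    ctl->discardable_bytes[BTRFS_STAT_CURR] -=
                            bitmap_info->bytes;
            }
            unlink_free_space(ctl, bitmap_info);
            kmem_cache_free(btrfs_free_space_bitmap_cachep,
                            bitmap_info->bitmap);
            kmem_cache_free(btrfs_free_space_cachep, bitmap_info);
            btrfs_discard_update_discardable(block_group, ctl);
    }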
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: make smaller extents more likely to go into bitmaps · f9bb615a
  Committed by Dennis Zhou
      It's less than ideal for small extents to eat into our extent budget, so
      force extents <= 32KB into the bitmaps save for the first handful.
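
A sketch of the bias (the 32KB cutoff is from the text above; the
"handful" condition and helper shape are assumptions, not the exact
patch):

    /* inside the use_bitmap() decision, while under the extent budget */
    if (!forced && ctl->free_extents < ctl->extents_thresh) {
            if (info->bytes > SZ_32K)
                    return false;   /* large: keep as an extent entry */
            if (ctl->free_extents * 3 <= ctl->extents_thresh)
                    return false;   /* first handful of small extents */
            /* otherwise fall through: the small extent goes to a bitmap */
    }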
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: increase the metadata allowance for the free_space_cache · 5d90c5c7
  Committed by Dennis Zhou
      Currently, there is no way for the free space cache to recover from
      being serviced by purely bitmaps because the extent threshold is set to
      0 in recalculate_thresholds() when we surpass the metadata allowance.
      
      This adds a recovery mechanism by keeping large extents out of the
      bitmaps and increases the metadata upper bound to 64KB. The recovery
      mechanism bypasses this upper bound, thus making it a soft upper bound.
      But, with the bypass being 1MB or greater, it shouldn't add unbounded
      overhead.
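
A sketch of the new bound (MAX_CACHE_BYTES_PER_GIG is the existing
constant in free-space-cache.c):

    /* soft upper bound on cache metadata per 1GiB of block group space */
    #define MAX_CACHE_BYTES_PER_GIG  SZ_64K   /* raised from SZ_32K */

Large extents (the 1MB-or-greater bypass mentioned above) never enter
bitmaps, which is what lets extents_thresh recover instead of staying
pinned at 0.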
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: keep track of discard reuse stats · 9ddf648f
  Committed by Dennis Zhou
      Keep track of how much we are discarding and how often we are reusing
      with async discard. The discard_*_bytes values don't need any special
      protection because the work item provides the single threaded access.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: have multiple discard lists · 7fe6d45e
  Committed by Dennis Zhou
Discards not tied to block group destruction currently go through a
single list with no minimum discard length. This can lead to caravaning
more meaningful discards behind a heavily fragmented block group.

This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when they
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size; currently we support
only 2 lists, with filters of 1MB and 32KB respectively.
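
A sketch of the routing between lists (the filter constants are named
above; the index values are illustrative):

    #define BTRFS_ASYNC_DISCARD_MAX_FILTER  SZ_1M
    #define BTRFS_ASYNC_DISCARD_MIN_FILTER  SZ_32K

    /* route a block group by the size of its discardable extents; it is
       promoted back up once it again exceeds the max filter */
    if (bytes >= BTRFS_ASYNC_DISCARD_MAX_FILTER)
            list = &discard_ctl->discard_list[1];   /* served first */
    else if (bytes >= BTRFS_ASYNC_DISCARD_MIN_FILTER)
            list = &discard_ctl->discard_list[2];   /* smaller extents */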
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: make max async discard size tunable · 19b2a2c7
  Committed by Dennis Zhou
      Expose max_discard_size as a tunable via sysfs and switch the current
      fixed maximum to the default value.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: limit max discard size for async discard · 4aa9ad52
  Committed by Dennis Zhou
Throttle the maximum size of a discard so that we can provide an upper
bound for the rate of async discard. While the block layer is able to
split discards into appropriately sized requests, we want to be able
to account more accurately for the rate at which we are consuming NCQ
slots, as well as limit the upper bound of work for a single discard.
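
A sketch of the throttling (max_discard_size is the knob made tunable in
the next patch; the surrounding loop is elided):

    /* trim at most max_discard_size bytes per issued discard; the
       remainder stays queued and is picked up by the next pass */
    u64 len = min(bytes, discard_ctl->max_discard_size);

    ret = btrfs_discard_extent(fs_info, start, len, &trimmed);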
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: keep track of discardable_bytes for async discard · 5dc7c10b
  Committed by Dennis Zhou
Keep track of this metric so that we can understand how far ahead or
behind we are on the discarding rate. This uses the same accounting
method as discardable_extents: deltas between previous/current values,
propagated upwards.
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: track discardable extents for async discard · dfb79ddb
  Committed by Dennis Zhou
      The number of discardable extents will serve as the rate limiting metric
      for how often we should discard. This keeps track of discardable extents
      in the free space caches by maintaining deltas and propagating them to
      the global count.
      
      The deltas are calculated from 2 values stored in PREV and CURR entries,
      then propagated up to the global discard ctl.  The current counter value
      becomes the previous counter value after update.
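
A sketch of the propagation step (the PREV/CURR pair and the global
counter are the ones described above):

    s32 extents_delta = ctl->discardable_extents[BTRFS_STAT_CURR] -
                        ctl->discardable_extents[BTRFS_STAT_PREV];

    if (extents_delta) {
            atomic_add(extents_delta, &discard_ctl->discardable_extents);
            /* the current value becomes the previous one after update */
            ctl->discardable_extents[BTRFS_STAT_PREV] =
                    ctl->discardable_extents[BTRFS_STAT_CURR];
    }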
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: discard one region at a time in async discard · 2bee7eb8
  Committed by Dennis Zhou
The prior two patches added discarding via a background workqueue. That
just piggybacked off of the fstrim code to trim the whole block group at
once. Inevitably, this is worse performance-wise and will aggressively
overtrim, but it was nice to plumb the other infrastructure first to keep
the patches easier to review.

This adds the real goal of this series, which is discarding slowly (i.e. a
slow, long-running fstrim). The discarding is split into two phases,
extents and then bitmaps. The reason for this is twofold. First, the
bitmap regions overlap the extent regions. Second, discarding the
extents first gives the newly trimmed bitmaps the highest chance of
coalescing when being re-added to the free space cache.
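
A sketch of the two phases (enum and helper names follow this series;
the surrounding loop and arguments are elided or abbreviated):

    switch (block_group->discard_state) {
    case BTRFS_DISCARD_EXTENTS:     /* phase 1: extent entries */
            btrfs_trim_block_group_extents(block_group, &trimmed,
                                           start, end, minlen, maxlen, true);
            break;
    case BTRFS_DISCARD_BITMAPS:     /* phase 2: bitmap regions */
            btrfs_trim_block_group_bitmaps(block_group, &trimmed,
                                           start, end, minlen, maxlen, true);
            break;
    }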
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: handle empty block_group removal for async discard · 6e80d4f8
  Committed by Dennis Zhou
      block_group removal is a little tricky. It can race with the extent
      allocator, the cleaner thread, and balancing. The current path is for a
      block_group to be added to the unused_bgs list. Then, when the cleaner
      thread comes around, it starts a transaction and then proceeds with
      removing the block_group. Extents that are pinned are subsequently
      removed from the pinned trees and then eventually a discard is issued
      for the entire block_group.
      
      Async discard introduces another player into the game, the discard
      workqueue. While it has none of the racing issues, the new problem is
      ensuring we don't leave free space untrimmed prior to forgetting the
block_group.  This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order,
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than,
say, the last fully freed block_group, and avoids having to search
through the fully freed block groups at the beginning of a list to find
an insertion point.

The new order of events is that a fully freed block group gets placed on
the unused discard queue first. Once it's processed, it is placed on the
unused_bgs list and then the original sequence of events happens, just
without the final whole block_group discard.
      
      The mount flags can change when processing unused_bgs, so when flipping
      from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
      discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
      free block groups on the discard_list to the unused_bg queue which will
      do the final discard for us.
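
A sketch of that handoff (both helpers are introduced by this series;
the call sites are paraphrased):

    if (btrfs_test_opt(fs_info, DISCARD_ASYNC))
            /* pick up block groups that were parked on unused_bgs */
            btrfs_discard_punt_unused_bgs_list(fs_info);
    else
            /* drain the discard lists back to the unused_bgs path */
            btrfs_discard_purge_list(&fs_info->discard_ctl);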
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: add the beginning of async discard, discard workqueue · b0643e59
  Committed by Dennis Zhou
When discard is enabled, every time a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and to helping
the SSD maintain enough free space to prevent severe garbage collection
situations.
      
This adds the beginning of async discard. Instead of issuing a discard
prior to returning an extent to the free space cache, the extent is
simply marked as untrimmed. The block_group is then added to an LRU
which feeds into a workqueue that issues discards at a much slower rate.
Full discarding of unused block groups is still done and will be
addressed in a future patch of the series.
      
For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one and the
same.
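
A sketch of the flow described above (btrfs_discard_queue_work() is
added here; the trim-state plumbing lands in the rest of the series):

    /* on unpin: return the extent as untrimmed instead of discarding */
    __btrfs_add_free_space(fs_info, ctl, start, len,
                           BTRFS_TRIM_STATE_UNTRIMMED);

    /* park the block group on the discard LRU; a delayed work item
       dequeues one block group at a time and discards it slowly */
    btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);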
      
      On a number of Facebook webservers, I collected data every minute
      accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
      and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
      is where we discard extents synchronously before returning them to the
      free space cache.
      
      discard=sync:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          ---------------------------------------------------------------
           Drive A  |           434           |          1170
           Drive B  |           880           |          2330
           Drive C  |          2943           |          3920
           Drive D  |          4763           |          5701
      
      discard=async:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          --------------------------------------------------------------
           Drive A  |           134           |           956
           Drive B  |            64           |          1972
           Drive C  |            59           |          1032
           Drive D  |            62           |          1200
      
While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and the delta between the two
configurations is substantial. We are spending significantly less time
in btrfs_finish_extent_commit(), which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: keep track of free space bitmap trim status cleanliness · da080fe1
  Committed by Dennis Zhou
There is a cap in btrfs on the number of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of tracking whether a certain bit is trimmed or
not in a second bitmap, keep track of the relative state of the bitmap
as a whole.
      
With async discard, trimming bitmaps becomes a more frequent operation.
As a trade-off for simplicity, we keep track of whether discarding a
bitmap is in progress. If we fully scan a bitmap and trim as necessary,
the bitmap is marked clean. This has some caveats, as the minimum block
size may skip over regions deemed too small. But this should be a
reasonable trade-off rather than keeping a second bitmap and making the
allocation paths more complex. The downside is that we may overtrim, but
ideally the minimum block size should prevent us from doing that too
often and from getting stuck trimming pathological cases.
      
      BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the
      process of being trimmed. If additional free space is added to that
      bitmap, the bit is cleared. A bitmap will be marked
      BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end
      of it and the former is still set.
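
A sketch of the states and the invalidation rule described above:

    enum btrfs_trim_state {
            BTRFS_TRIM_STATE_UNTRIMMED,
            BTRFS_TRIM_STATE_TRIMMED,
            BTRFS_TRIM_STATE_TRIMMING,
    };

    /* adding free space to a bitmap that is mid-trim dirties it again,
       so a completed scan only marks it TRIMMED if this never fired */
    if (btrfs_free_space_trimming_bitmap(info))
            info->trim_state = BTRFS_TRIM_STATE_UNTRIMMED;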
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: keep track of which extents have been discarded · a7ccb255
  Committed by Dennis Zhou
      Async discard will use the free space cache as backing knowledge for
      which extents to discard. This patch plumbs knowledge about which
      extents need to be discarded into the free space cache from
      unpin_extent_range().
      
An untrimmed extent can merge with everything, as this is a new region.
Absorbing trimmed extents is a trade-off for greater coalescing, which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim IO itself.
      
      When reading in the free space cache from disk, if sync is set, mark all
      extents as trimmed. The current code ensures at transaction commit that
      all free space is trimmed when sync is set, so this reflects that.
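
A sketch of the load-time marking ('e' is the free space entry being
read in; DISCARD_SYNC is the existing mount option bit):

    /* sync discard guarantees everything was trimmed by the last
       transaction commit, so reflect that when loading the cache */
    if (btrfs_test_opt(fs_info, DISCARD_SYNC))
            e->trim_state = BTRFS_TRIM_STATE_TRIMMED;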
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
7. 19 November 2019, 3 commits
8. 18 November 2019, 2 commits
• btrfs: check page->mapping when loading free space cache · 3797136b
  Committed by Josef Bacik
While testing 5.2 we ran into the following panic:
      
      [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001
      [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150
      [52238.304051] Call Trace:
      [52238.308958]  try_to_free_buffers+0x15b/0x1b0
      [52238.317503]  shrink_page_list+0x1164/0x1780
      [52238.325877]  shrink_inactive_list+0x18f/0x3b0
      [52238.334596]  shrink_node_memcg+0x23e/0x7d0
      [52238.342790]  ? do_shrink_slab+0x4f/0x290
      [52238.350648]  shrink_node+0xce/0x4a0
      [52238.357628]  balance_pgdat+0x2c7/0x510
      [52238.365135]  kswapd+0x216/0x3e0
      [52238.371425]  ? wait_woken+0x80/0x80
      [52238.378412]  ? balance_pgdat+0x510/0x510
      [52238.386265]  kthread+0x111/0x130
      [52238.392727]  ? kthread_create_on_node+0x60/0x60
      [52238.401782]  ret_from_fork+0x1f/0x30
      
      The page we were trying to drop had a page->private, but had no
      page->mapping and so called drop_buffers, assuming that we had a
      buffer_head on the page, and then panic'ed trying to deref 1, which is
      our page->private for data pages.
      
This is happening because we're truncating the free space cache while
we're trying to load the free space cache.  This isn't supposed to
happen, and I'll fix that in a followup patch.  However we still
shouldn't allow that sort of mistake to result in messing with pages
that do not belong to us.  So add the page->mapping check to verify that
we still own this page after dropping and re-acquiring the page lock.
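
A sketch of the added verification (the error string is illustrative):

    lock_page(page);
    /* the free space inode may have been truncated while we slept;
       make sure the page still belongs to our mapping */
    if (page->mapping != inode->i_mapping) {
            btrfs_err(BTRFS_I(inode)->root->fs_info,
                      "free space cache page truncated");
            io_ctl_drop_pages(io_ctl);
            return -EIO;
    }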
      
The page is unlocked as follows:
      btrfs_readpage
        extent_read_full_page
          __extent_read_full_page
            __do_readpage
              if (!nr)
      	   unlock_page  <-- nr can be 0 only if submit_extent_page
      			    returns an error
      
      CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add callchain ]
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
  Committed by David Sterba
The parameter is now always set to NULL and can be dropped. The last
user was get_default_root, but that got reworked in 05dbe683 ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
9. 09 September 2019, 5 commits
• btrfs: stop clearing EXTENT_DIRTY in inode I/O tree · e182163d
  Committed by Omar Sandoval
      Since commit fee187d9 ("Btrfs: do not set EXTENT_DIRTY along with
      EXTENT_DELALLOC"), we never set EXTENT_DIRTY in inode->io_tree, so we
      can simplify and stop trying to clear it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: fix allocation of free space cache v1 bitmap pages · 3acd4850
  Committed by Christophe Leroy
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of the kernel.
After some time, they were correlated with the use of the BTRFS
filesystem with SLUB_DEBUG turned on.
      
      [   22.809700] BUG kmalloc-4096 (Tainted: G        W        ): Redzone overwritten
      
      [   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
      [   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
      [   22.811193] 	__slab_alloc.constprop.26+0x44/0x70
      [   22.811345] 	kmem_cache_alloc_trace+0xf0/0x2ec
      [   22.811588] 	__load_free_space_cache+0x588/0x780 [btrfs]
      [   22.811848] 	load_free_space_cache+0xf4/0x1b0 [btrfs]
      [   22.812090] 	cache_block_group+0x1d0/0x3d0 [btrfs]
      [   22.812321] 	find_free_extent+0x680/0x12a4 [btrfs]
      [   22.812549] 	btrfs_reserve_extent+0xec/0x220 [btrfs]
      [   22.812785] 	btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
      [   22.813032] 	__btrfs_cow_block+0x150/0x5d4 [btrfs]
      [   22.813262] 	btrfs_cow_block+0x194/0x298 [btrfs]
      [   22.813484] 	commit_cowonly_roots+0x44/0x294 [btrfs]
      [   22.813718] 	btrfs_commit_transaction+0x63c/0xc0c [btrfs]
      [   22.813973] 	close_ctree+0xf8/0x2a4 [btrfs]
      [   22.814107] 	generic_shutdown_super+0x80/0x110
      [   22.814250] 	kill_anon_super+0x18/0x30
      [   22.814437] 	btrfs_kill_super+0x18/0x90 [btrfs]
      [   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
      [   22.814841] 	proc_cgroup_show+0xc0/0x248
      [   22.814967] 	proc_single_show+0x54/0x98
      [   22.815086] 	seq_read+0x278/0x45c
      [   22.815190] 	__vfs_read+0x28/0x17c
      [   22.815289] 	vfs_read+0xa8/0x14c
      [   22.815381] 	ksys_read+0x50/0x94
      [   22.815475] 	ret_from_syscall+0x0/0x38
      
Commit 69d24804 ("btrfs: use copy_page for copying pages instead of
memcpy") changed the way bitmap blocks are copied. But although bitmaps
have the size of a page, they were allocated with kzalloc().

Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG are
activated, kzalloc() may return an unaligned pointer.
      
On powerpc, memcpy(), copy_page() and other copying functions use the
'dcbz' instruction, which provides an entire zeroed cacheline to avoid a
memory read when the intention is to overwrite a full line. Functions
like memcpy() are written to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.
      
To fix it, allocate bitmaps through a kmem_cache instead of using
kzalloc(). The cache pool is created with a PAGE_SIZE alignment
constraint.
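
A sketch of the dedicated cache (the cache name matches the rename noted
below; the flags argument is illustrative):

    /* slab objects of PAGE_SIZE bytes, guaranteed PAGE_SIZE aligned */
    btrfs_free_space_bitmap_cachep =
            kmem_cache_create("btrfs_free_space_bitmap",
                              PAGE_SIZE, PAGE_SIZE, 0, NULL);

    /* replaces kzalloc(PAGE_SIZE, ...) for bitmap blocks */
    bitmap = kmem_cache_zalloc(btrfs_free_space_bitmap_cachep, GFP_NOFS);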
Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d24804 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: rename the btrfs_calc_*_metadata_size helpers · 2bd36e7b
  Committed by Josef Bacik
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that
it doesn't take into account any splitting at the levels, because
truncate will never split nodes.  However, both truncating and changing
items will never split nodes, so rename btrfs_calc_trunc_metadata_size
to btrfs_calc_metadata_size.  Also, btrfs_calc_trans_metadata_size is
purely for inserting items, so rename it to
btrfs_calc_insert_metadata_size.  Making these clearer will help when I
start using them differently in upcoming patches.
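
As a sketch, the renamed helpers and their formulas (BTRFS_MAX_LEVEL
bounds the tree height; the factor of 2 accounts for possible splits at
every level when inserting):

    static inline u64 btrfs_calc_insert_metadata_size(
                    struct btrfs_fs_info *fs_info, unsigned num_items)
    {
            return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
    }

    /* truncating or changing items only COWs the path, no splits */
    static inline u64 btrfs_calc_metadata_size(struct btrfs_fs_info *fs_info,
                                               unsigned num_items)
    {
            return (u64)fs_info->nodesize * BTRFS_MAX_LEVEL * num_items;
    }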
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: move basic block_group definitions to their own header · aac0023c
  Committed by Josef Bacik
      This is prep work for moving all of the block group cache code into its
      own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
• btrfs: move btrfs_add_free_space out of a header file · 478b4d9f
  Committed by Josef Bacik
This is prep work for moving block_group_cache around.  Having this in
the header file forces the header includes to be in a certain order,
which is awkward, so just move it into free-space-cache.c and then we
can re-arrange later.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>