1. 19 2月, 2020 1 次提交
  2. 20 1月, 2020 7 次提交
    • D
      btrfs: calculate discard delay based on number of extents · a2309300
      Dennis Zhou 提交于
      An earlier patch keeps track of discardable_extents. These are
      undiscarded extents managed by the free space cache. Here, we will use
      this to dynamically calculate the discard delay interval.
      
      There are 3 rate to consider. The first is the target convergence rate,
      the rate to discard all discardable_extents over the
      BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
      limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
      limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
      transaction commit.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a2309300
    • D
      btrfs: add the beginning of async discard, discard workqueue · b0643e59
      Dennis Zhou 提交于
      When discard is enabled, everytime a pinned extent is released back to
      the block_group's free space cache, a discard is issued for the extent.
      This is an overeager approach when it comes to discarding and helping
      the SSD maintain enough free space to prevent severe garbage collection
      situations.
      
      This adds the beginning of async discard. Instead of issuing a discard
      prior to returning it to the free space, it is just marked as untrimmed.
      The block_group is then added to a LRU which then feeds into a workqueue
      to issue discards at a much slower rate. Full discarding of unused block
      groups is still done and will be addressed in a future patch of the
      series.
      
      For now, we don't persist the discard state of extents and bitmaps.
      Therefore, our failure recovery mode will be to consider extents
      untrimmed. This lets us handle failure and unmounting as one in the
      same.
      
      On a number of Facebook webservers, I collected data every minute
      accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
      and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
      is where we discard extents synchronously before returning them to the
      free space cache.
      
      discard=sync:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          ---------------------------------------------------------------
           Drive A  |           434           |          1170
           Drive B  |           880           |          2330
           Drive C  |          2943           |          3920
           Drive D  |          4763           |          5701
      
      discard=async:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          --------------------------------------------------------------
           Drive A  |           134           |           956
           Drive B  |            64           |          1972
           Drive C  |            59           |          1032
           Drive D  |            62           |          1200
      
      While it's not great that the stats are cumulative over 1m, all of these
      servers are running the same workload and and the delta between the two
      are substantial. We are spending significantly less time in
      btrfs_finish_extent_commit() which is responsible for discarding.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b0643e59
    • D
      btrfs: rename DISCARD mount option to to DISCARD_SYNC · 46b27f50
      Dennis Zhou 提交于
      This series introduces async discard which will use the flag
      DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
      synchronously done in transaction commit.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      46b27f50
    • O
      btrfs: remove struct find_free_extent.ram_bytes · 95690e58
      Omar Sandoval 提交于
      This hasn't been used since it was first introduced in commit
      b4bd745d ("btrfs: Introduce find_free_extent_ctl structure for later
      rework"). Passing that to btrfs_add_reserved_bytes in find_free_extent
      is not strictly necessary and using the local ram_bytes instead seems
      cleaner.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      95690e58
    • N
      btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extent · a0fbf736
      Nikolay Borisov 提交于
      __btrfs_free_reserved_extent now performs the actions of
      btrfs_free_and_pin_reserved_extent. But this name is a bit of a
      misnomer, since the extent is not really freed but just pinned. Reflect
      this in the new name. No semantics changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a0fbf736
    • N
      btrfs: Open code __btrfs_free_reserved_extent in btrfs_free_reserved_extent · 7ef54d54
      Nikolay Borisov 提交于
      __btrfs_free_reserved_extent performs 2 entirely different operations
      depending on whether its 'pin' argument is true or false. This patch
      lifts the 2nd case (pin is false) into it's sole caller
      btrfs_free_reserved_extent. No semantics changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7ef54d54
    • N
      btrfs: Don't discard unwritten extents · 4eaaec24
      Nikolay Borisov 提交于
      All callers of btrfs_free_reserved_extent (respectively
      __btrfs_free_reserved_extent with in set to 0) pass in extents which
      have only been reserved but not yet written to. Namely,
      
      * in cow_file_range that function is called only if create_io_em fails
        or btrfs_add_ordered_extent fail, both of which happen _before_ any IO
        is submitted to the newly reserved range
      
      * in submit_compressed_extents the code flow is similar -
        out_free_reserve can be called only before
        btrfs_submit_compressed_write which is where any writes to the range
        could occur
      
      * btrfs_new_extent_direct also calls btrfs_free_reserved_extent only
        if extent_map fails, before any IO is issued
      
      * __btrfs_prealloc_file_range also calls btrfs_free_reserved_extent
        in case insertion of the metadata fails
      
      * btrfs_alloc_tree_block again can only be called in case in-memory
        operations fail, before any IO is submitted
      
      * btrfs_finish_ordered_io - this is the only caller where discarding
        the extent could have a material effect, since it can be called for
        an extent which was partially written.
      
      With this change the submission of discards is optimised since discards
      are now not being created for extents which are known to not have been
      touched on disk.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4eaaec24
  3. 13 12月, 2019 2 次提交
    • F
      Btrfs: fix missing data checksums after replaying a log tree · 40e046ac
      Filipe Manana 提交于
      When logging a file that has shared extents (reflinked with other files or
      with itself), we can end up logging multiple checksum items that cover
      overlapping ranges. This confuses the search for checksums at log replay
      time causing some checksums to never be added to the fs/subvolume tree.
      
      Consider the following example of a file that shares the same extent at
      offsets 0 and 256Kb:
      
         [ bytenr 13893632, offset 64Kb, len 64Kb  ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 13893632, offset 0, len 256Kb    ]
         256Kb                                     512Kb
      
      When logging the inode, at tree-log.c:copy_items(), when processing the
      file extent item at offset 0, we log a checksum item covering the range
      13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
      64Kb + 64Kb, respectively.
      
      Later when processing the extent item at offset 256K, we log the checksums
      for the range from 13893632 to 14155776 (which corresponds to 13893632 +
      256Kb). These checksums get merged with the checksum item for the range
      from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
      So after this we get the two following checksum items in the log tree:
      
         (...)
         item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
                 range start 13631488 end 14155776 length 524288
         item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
                 range start 13959168 end 14024704 length 65536
      
      The first one covers the range from the second one, they overlap.
      
      So far this does not cause a problem after replaying the log, because
      when replaying the file extent item for offset 256K, we copy all the
      checksums for the extent 13893632 from the log tree to the fs/subvolume
      tree, since searching for an checksum item for bytenr 13893632 leaves us
      at the first checksum item, which covers the whole range of the extent.
      
      However if we write 64Kb to file offset 256Kb for example, we will
      not be able to find and copy the checksums for the last 128Kb of the
      extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.
      
      After writing 64Kb into file offset 256Kb we get the following extent
      layout for our file:
      
         [ bytenr 13893632, offset 64K, len 64Kb   ]
         0                                         64Kb
      
         [ bytenr 13631488, offset 64Kb, len 192Kb ]
         64Kb                                      256Kb
      
         [ bytenr 14155776, offset 0, len 64Kb     ]
         256Kb                                     320Kb
      
         [ bytenr 13893632, offset 64Kb, len 192Kb ]
         320Kb                                     512Kb
      
      After fsync'ing the file, if we have a power failure and then mount
      the filesystem to replay the log, the following happens:
      
      1) When replaying the file extent item for file offset 320Kb, we
         lookup for the checksums for the extent range from 13959168
         (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
         to btrfs_lookup_csums_range();
      
      2) btrfs_lookup_csums_range() finds the checksum item that starts
         precisely at offset 13959168 (item 7 in the log tree, shown before);
      
      3) However that checksum item only covers 64Kb of data, and not 192Kb
         of data;
      
      4) As a result only the checksums for the first 64Kb of data referenced
         by the file extent item are found and copied to the fs/subvolume tree.
         The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
         the corresponding data checksums found and copied to the fs/subvolume
         tree.
      
      5) After replaying the log userspace will not be able to read the file
         range from 384Kb to 512Kb, because the checksums are missing and
         resulting in an -EIO error.
      
      The following steps reproduce this scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
        $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar
      
        $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
        $ xfs_io -c "fsync" /mnt/sdc/foobar
      
        <power failure>
      
        $ mount /dev/sdc /mnt/sdc
        $ md5sum /mnt/sdc/foobar
        md5sum: /mnt/sdc/foobar: Input/output error
      
        $ dmesg | tail
        [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
        [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
        [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
        [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
        [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
        [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
        [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
        [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
        [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
        [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
      
      Fix this simply by deleting first any checksums, from the log tree, for the
      range of the extent we are logging at copy_items(). This ensures we do not
      get checksum items in the log tree that have overlapping ranges.
      
      This is a long time issue that has been present since we have the clone
      (and deduplication) ioctl, and can happen both when an extent is shared
      between different files and within the same file.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      40e046ac
    • J
      btrfs: handle error in btrfs_cache_block_group · db8fe64f
      Josef Bacik 提交于
      We have a BUG_ON(ret < 0) in find_free_extent from
      btrfs_cache_block_group.  If we fail to allocate our ctl we'll just
      panic, which is not good.  Instead just go on to another block group.
      If we fail to find a block group we don't want to return ENOSPC, because
      really we got a ENOMEM and that's the root of the problem.  Save our
      return from btrfs_cache_block_group(), and then if we still fail to make
      our allocation return that ret so we get the right error back.
      
      Tested with inject-error.py from bcc.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      db8fe64f
  4. 19 11月, 2019 4 次提交
    • D
      btrfs: rename btrfs_block_group_cache · 32da5386
      David Sterba 提交于
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      32da5386
    • Q
      btrfs: Ensure we trim ranges across block group boundary · 6b7faadd
      Qu Wenruo 提交于
      [BUG]
      When deleting large files (which cross block group boundary) with
      discard mount option, we find some btrfs_discard_extent() calls only
      trimmed part of its space, not the whole range:
      
        btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50%
      
      type:		bbio->map_type, in above case, it's SINGLE DATA.
      start:		Logical address of this trim
      len:		Logical length of this trim
      trimmed:	Physically trimmed bytes
      ratio:		trimmed / len
      
      Thus leaving some unused space not discarded.
      
      [CAUSE]
      When discard mount option is specified, after a transaction is fully
      committed (super block written to disk), we begin to cleanup pinned
      extents in the following call chain:
      
      btrfs_commit_transaction()
      |- btrfs_finish_extent_commit()
         |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY);
         |- btrfs_discard_extent()
      
      However, pinned extents are recorded in an extent_io_tree, which can
      merge adjacent extent states.
      
      When a large file gets deleted and it has adjacent file extents across
      block group boundary, we will get a large merged range like this:
      
            |<---    BG1    --->|<---      BG2     --->|
            |//////|<--   Range to discard   --->|/////|
      
      To discard that range, we have the following calls:
      
        btrfs_discard_extent()
        |- btrfs_map_block()
        |  Returned bbio will end at BG1's end. As btrfs_map_block()
        |  never returns result across block group boundary.
        |- btrfs_issuse_discard()
           Issue discard for each stripe.
      
      So we will only discard the range in BG1, not the remaining part in BG2.
      
      Furthermore, this bug is not that reliably observed, for above case, if
      there is no other extent in BG2, BG2 will be empty and btrfs will trim
      all space of BG2, covering up the bug.
      
      [FIX]
      - Allow __btrfs_map_block_for_discard() to modify @length parameter
        btrfs_map_block() uses its @length paramter to notify the caller how
        many bytes are mapped in current call.
        With __btrfs_map_block_for_discard() also modifing the @length,
        btrfs_discard_extent() now understands when to do extra trim.
      
      - Call btrfs_map_block() in a loop until we hit the range end Since we
        now know how many bytes are mapped each time, we can iterate through
        each block group boundary and issue correct trim for each range.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b7faadd
    • D
      btrfs: add dedicated members for start and length of a block group · b3470b5d
      David Sterba 提交于
      The on-disk format of block group item makes use of the key that stores
      the offset and length. This is further used in the code, although this
      makes thing harder to understand. The key is also packed so the
      offset/length is not properly aligned as u64.
      
      Add start (key.objectid) and length (key.offset) members to block group
      and remove the embedded key.  When the item is searched or written, a
      local variable for key is used.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b3470b5d
    • D
      btrfs: move block_group_item::used to block group · bf38be65
      David Sterba 提交于
      For unknown reasons, the member 'used' in the block group struct is
      stored in the b-tree item and accessed everywhere using the special
      accessor helper. Let's unify it and make it a regular member and only
      update the item before writing it to the tree.
      
      The item is still being used for flags and chunk_objectid, there's some
      duplication until the item is removed in following patches.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf38be65
  5. 18 11月, 2019 2 次提交
  6. 09 9月, 2019 24 次提交