1. 09 Feb, 2021 40 commits
    • btrfs: zoned: advance allocation pointer after tree log node · 011b41bf
      Naohiro Aota committed
      Since the allocation info of a tree log node is not recorded in the extent
      tree, calculate_alloc_pointer() cannot detect this node, so the computed
      pointer can end up pointing at space occupied by a tree log node.
      
      Replaying the log calls btrfs_remove_free_space() for each node in the
      log tree.
      
      So, advance the pointer past the node so that we do not allocate over it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: redirty released extent buffers · d3575156
      Naohiro Aota committed
      Tree manipulating operations like merging nodes often release
      once-allocated tree nodes. Such nodes are cleaned so that pages in the
      node are not uselessly written out. On zoned volumes, however, such
      optimization blocks the following IOs as the cancellation of the write
      out of the freed blocks breaks the sequential write sequence expected by
      the device.
      
      Introduce a list of clean and unwritten extent buffers that have been
      released in a transaction. Redirty the buffers so that
      btree_write_cache_pages() can send proper bios to the devices.
      
      In addition, clear the entire content of the extent buffer so as not to
      confuse raw block scanners, e.g. 'btrfs check'. With the content cleared,
      csum_dirty_buffer() would complain about a bytenr mismatch, so skip the
      check and the checksum using the newly introduced buffer flag
      EXTENT_BUFFER_NO_CHECK.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: implement sequential extent allocation · 2eda5708
      Naohiro Aota committed
      Implement a sequential extent allocator for zoned filesystems. This
      allocator only needs to check if there is enough space in the block group
      after the allocation pointer to satisfy the extent allocation request.
      Therefore the allocator never manages bitmaps or clusters. Also, add
      assertions to the corresponding functions.
      
      As zone append writing is used, it would be unnecessary to track the
      allocation offset, as the allocator only needs to check available space.
      But by tracking and returning the offset as an allocated region, we can
      skip modification of ordered extents and checksum information when there
      is no IO reordering.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
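The sequential allocation described in this commit can be illustrated with a minimal user-space sketch. It is not the kernel implementation: the struct `zoned_bg_sketch` and the function `zoned_alloc_sketch()` are hypothetical names, and locking is left out. It only shows the single check the message talks about (is there enough space after the allocation pointer?) and the returned offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a zoned block group. */
struct zoned_bg_sketch {
	uint64_t start;        /* logical start address of the block group */
	uint64_t length;       /* size of the block group in bytes */
	uint64_t alloc_offset; /* bytes already allocated, counted from start */
};

/*
 * Allocate num_bytes right after the allocation pointer. On success the
 * logical start of the new extent is stored in *ret_start and the pointer
 * advances. No bitmaps or clusters are consulted.
 */
bool zoned_alloc_sketch(struct zoned_bg_sketch *bg, uint64_t num_bytes,
			uint64_t *ret_start)
{
	assert(bg->alloc_offset <= bg->length);

	/* The only check: is there enough room after the pointer? */
	if (num_bytes > bg->length - bg->alloc_offset)
		return false;

	*ret_start = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes;
	return true;
}
```

Returning the start address, rather than only a success flag, mirrors the point made in the message: when no IO reordering happens, the predicted address matches the written one and nothing needs to be fixed up afterwards.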
    • btrfs: zoned: track unusable bytes for zones · 169e0da9
      Naohiro Aota committed
      In a zoned filesystem a once written then freed region is not usable
      until the underlying zone has been reset. So we need to distinguish such
      unusable space from usable free space.
      
      Therefore we need to introduce the "zone_unusable" field to the block
      group structure, and "bytes_zone_unusable" to the space_info structure
      to track the unusable space.
      
      Pinned bytes are always reclaimed to the unusable space. But when an
      allocated region is returned before being used, e.g. because the block
      group becomes read-only between allocation time and reservation time, we
      can safely return the region to the block group. For this situation, this
      commit introduces "btrfs_add_free_space_unused". It behaves the same as
      btrfs_add_free_space() on regular filesystems. On zoned filesystems, it
      rewinds the allocation offset.
      
      Because the read-only bytes track free but unusable bytes when the block
      group is read-only, we need to migrate the zone_unusable bytes to
      read-only bytes when a block group is marked read-only.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
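The accounting rules in this commit can be sketched roughly as follows. This is a simplified, single-threaded illustration under stated assumptions: the types `bg_sketch` and `space_info_sketch` and all three functions are hypothetical, and only the fields named in the message (`zone_unusable`, `bytes_zone_unusable`, the allocation offset, read-only bytes) are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-space-info counters. */
struct space_info_sketch {
	uint64_t bytes_zone_unusable;
	uint64_t bytes_readonly;
};

/* Hypothetical, simplified block group. */
struct bg_sketch {
	uint64_t alloc_offset;  /* sequential allocation pointer */
	uint64_t zone_unusable; /* written, then freed; needs a zone reset */
	bool ro;
};

/* A region that was written and later freed cannot be reused yet. */
void free_written_sketch(struct bg_sketch *bg, struct space_info_sketch *si,
			 uint64_t size)
{
	bg->zone_unusable += size;
	si->bytes_zone_unusable += size;
}

/* A region allocated last but never written is handed back by rewinding. */
void return_unused_sketch(struct bg_sketch *bg, uint64_t size)
{
	assert(size <= bg->alloc_offset);
	bg->alloc_offset -= size;
}

/* Marking the group read-only migrates unusable bytes to read-only bytes. */
void mark_readonly_sketch(struct bg_sketch *bg, struct space_info_sketch *si)
{
	if (bg->ro)
		return;
	assert(si->bytes_zone_unusable >= bg->zone_unusable);
	si->bytes_readonly += bg->zone_unusable;
	si->bytes_zone_unusable -= bg->zone_unusable;
	bg->zone_unusable = 0;
	bg->ro = true;
}
```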
    • btrfs: zoned: calculate allocation offset for conventional zones · a94794d5
      Naohiro Aota committed
      Conventional zones do not have a write pointer, so we cannot use it to
      determine the allocation offset for sequential allocation if a block
      group contains a conventional zone.
      
      But instead, we can consider the end of the highest addressed extent in
      the block group for the allocation offset.
      
      For a new block group, we cannot calculate the allocation offset by
      consulting the extent tree, because it can cause a deadlock by taking
      the extent buffer lock after the chunk mutex, which is already taken in
      btrfs_make_block_group(). Since it is a new block group anyway, we can
      simply set the allocation offset to 0.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
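A rough sketch of the rule for conventional zones follows. It assumes the extents of a block group are available as a plain array, which is a simplification: the commit message says the kernel consults the extent tree instead. The names `extent_sketch` and `conventional_alloc_offset_sketch()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: logical start and length in bytes. */
struct extent_sketch {
	uint64_t start;
	uint64_t len;
};

/*
 * Return the allocation offset of a block group that has no usable write
 * pointer: the end of the highest addressed extent, relative to the block
 * group start. A new block group simply starts at offset 0.
 */
uint64_t conventional_alloc_offset_sketch(uint64_t bg_start, uint64_t bg_len,
					  const struct extent_sketch *extents,
					  size_t nr, bool new_bg)
{
	uint64_t highest_end = bg_start;

	if (new_bg)
		return 0;

	for (size_t i = 0; i < nr; i++) {
		uint64_t end = extents[i].start + extents[i].len;

		/* Every extent must lie inside the block group. */
		assert(extents[i].start >= bg_start);
		assert(end <= bg_start + bg_len);
		if (end > highest_end)
			highest_end = end;
	}
	return highest_end - bg_start;
}
```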
    • btrfs: zoned: load zone's allocation offset · 08e11a3d
      Naohiro Aota committed
      A zoned filesystem must allocate blocks at the zones' write pointer. The
      device's write pointer position can be mapped to a logical address within
      a block group. To facilitate this, add an "alloc_offset" to the
      block-group to track the logical addresses of the write pointer.
      
      This logical address is populated in btrfs_load_block_group_zone_info()
      from the write pointers of corresponding zones.
      
      For now, zoned filesystems only support the single profile. Supporting
      non-single profiles with zone append writing is not trivial. For example,
      in the DUP profile, we send a zone append write IO to two zones on a
      device. The device replies with the written LBAs for the IOs. If the
      offsets of the returned addresses from the beginning of the zone are
      different, then it results in different logical addresses.
      
      We need a fine-grained logical to physical mapping to support such
      separated physical addresses. Since that would require an additional
      metadata type, disable non-single profiles for now.
      
      This commit supports the case where all the zones in a block group are
      sequential. The next patch will handle the case of a block group having a
      conventional zone.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
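For the single profile, mapping a zone's write pointer to an allocation offset could look roughly like the sketch below. It assumes zone positions are reported in 512-byte sectors and models only three zone conditions. The type `zone_sketch`, the enum values and `zone_alloc_offset_sketch()` are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_SKETCH 9 /* assumption: 512-byte sectors */

/* Hypothetical subset of zone conditions. */
enum zone_cond_sketch {
	ZONE_SKETCH_EMPTY,
	ZONE_SKETCH_OPEN,
	ZONE_SKETCH_FULL,
};

/* Hypothetical zone report entry; all positions are in sectors. */
struct zone_sketch {
	uint64_t start; /* first sector of the zone */
	uint64_t len;   /* zone length */
	uint64_t wp;    /* write pointer position */
	enum zone_cond_sketch cond;
};

/* Translate the write pointer into a byte offset from the zone start. */
uint64_t zone_alloc_offset_sketch(const struct zone_sketch *zone)
{
	switch (zone->cond) {
	case ZONE_SKETCH_EMPTY:
		return 0;
	case ZONE_SKETCH_FULL:
		return zone->len << SECTOR_SHIFT_SKETCH;
	case ZONE_SKETCH_OPEN:
	default:
		assert(zone->wp >= zone->start);
		assert(zone->wp <= zone->start + zone->len);
		return (zone->wp - zone->start) << SECTOR_SHIFT_SKETCH;
	}
}
```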
    • btrfs: zoned: verify device extent is aligned to zone · 381a696e
      Naohiro Aota committed
      Add a check in verify_one_dev_extent() to ensure that a device extent on
      a zoned block device is aligned to the respective zone boundary.
      
      If it isn't, mark the filesystem as unclean.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
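The alignment check itself is small. A sketch, assuming that both the start and the length of the device extent must be multiples of the zone size (the message only says "aligned to the respective zone boundary", so checking the length as well is an assumption), with the hypothetical name `dev_extent_zone_aligned_sketch()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a device extent starts on a zone boundary and covers a
 * whole number of zones. All values are in bytes.
 */
bool dev_extent_zone_aligned_sketch(uint64_t physical_offset,
				    uint64_t physical_len,
				    uint64_t zone_size)
{
	assert(zone_size > 0);
	return physical_offset % zone_size == 0 &&
	       physical_len % zone_size == 0;
}
```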
    • btrfs: zoned: implement zoned chunk allocator · 1cd6121f
      Naohiro Aota committed
      Implement a zoned chunk and device extent allocator. One device zone
      becomes a device extent so that a zone reset affects only this device
      extent and does not change the state of blocks in the neighbor device
      extents.
      
      To implement the allocator, we need to extend the following functions for
      a zoned filesystem.
      
      - init_alloc_chunk_ctl
      - dev_extent_search_start
      - dev_extent_hole_check
      - decide_stripe_size
      
      init_alloc_chunk_ctl_zoned() is mostly the same as the regular one. It
      always sets the stripe_size to the zone size and aligns the parameters to
      the zone size.
      
      dev_extent_search_start() only aligns the start offset to zone boundaries.
      We don't care about the first 1MB as in a regular filesystem because we
      reserve the first two zones for superblock logging anyway.
      
      dev_extent_hole_check_zoned() checks if zones in given hole are either
      conventional or empty sequential zones. Also, it skips zones reserved for
      superblock logging.
      
      With the change to the hole, the new hole may now contain pending extents.
      So, in this case, loop again to check that.
      
      Finally, decide_stripe_size_zoned() should shrink the number of devices
      instead of stripe size because we need to honor stripe_size == zone_size.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
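The last point, shrinking the number of devices instead of the stripe size, can be sketched as below. This is a deliberately reduced model that assumes one copy, no parity and one stripe per device. The struct `chunk_ctl_sketch` and the function `decide_stripe_size_zoned_sketch()` are hypothetical and do not mirror the kernel's allocation control structure.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced chunk allocation parameters (sizes in bytes). */
struct chunk_ctl_sketch {
	uint64_t stripe_size;    /* fixed to the zone size on zoned */
	uint64_t max_chunk_size; /* upper bound for the chunk size */
	uint64_t ndevs;          /* devices taking part in the chunk */
	uint64_t chunk_size;     /* resulting chunk size */
};

/*
 * Keep stripe_size == zone_size and, if the chunk would grow beyond
 * max_chunk_size, use fewer devices instead of a smaller stripe.
 */
void decide_stripe_size_zoned_sketch(struct chunk_ctl_sketch *ctl,
				     uint64_t zone_size)
{
	assert(zone_size > 0 && ctl->ndevs > 0);
	assert(ctl->max_chunk_size >= zone_size);

	ctl->stripe_size = zone_size;
	if (ctl->stripe_size * ctl->ndevs > ctl->max_chunk_size)
		ctl->ndevs = ctl->max_chunk_size / ctl->stripe_size;
	ctl->chunk_size = ctl->stripe_size * ctl->ndevs;
}
```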
    • btrfs: zoned: allow zoned filesystems on non-zoned block devices · 3c9daa09
      Johannes Thumshirn committed
      Run a zoned filesystem on non-zoned devices. This is done by "slicing up"
      the block device into statically sized chunks and faking a conventional
      zone on each of them. The emulated zone size is determined from the size
      of a device extent.
      
      This is mainly aimed at testing of zoned filesystems, i.e. the zoned
      chunk allocator, on regular block devices.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
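"Slicing up" a regular device can be pictured with the following sketch. It fills a caller-provided array with back-to-back emulated zones. Clamping the last zone to the end of the device is a simplifying choice made here, and the names `emu_zone_sketch` and `emulate_zones_sketch()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulated conventional zone, in bytes. */
struct emu_zone_sketch {
	uint64_t start;
	uint64_t len;
};

/*
 * Cover a device of dev_size bytes with emulated zones of zone_size bytes.
 * Returns the number of zones written to zones[] (at most max_zones).
 */
size_t emulate_zones_sketch(uint64_t dev_size, uint64_t zone_size,
			    struct emu_zone_sketch *zones, size_t max_zones)
{
	size_t n = 0;

	assert(zone_size > 0);
	for (uint64_t pos = 0; pos < dev_size && n < max_zones;
	     pos += zone_size) {
		uint64_t remaining = dev_size - pos;

		zones[n].start = pos;
		/* Sketch-only choice: the last zone may be shorter. */
		zones[n].len = remaining < zone_size ? remaining : zone_size;
		n++;
	}
	return n;
}
```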
    • btrfs: zoned: disallow fitrim on zoned filesystems · 1cb3dc3f
      Naohiro Aota committed
      The implementation of fitrim depends on the space cache, which is not used
      and is disabled for the zoned extent allocator. So the current code does
      not work with zoned filesystems.
      
      In the future, we can implement fitrim for zoned filesystems by enabling
      space cache (but, only for fitrim) or scanning the extent tree at fitrim
      time.  For now, disallow fitrim on zoned filesystems.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: do not load fs_info::zoned from incompat flag · b53429ba
      Johannes Thumshirn committed
      Don't set the zoned flag in fs_info as soon as we're encountering the
      incompat filesystem flag for a zoned filesystem on mount. The zoned flag
      in fs_info is in a union together with the zone_size, so setting it too
      early will result in setting an incorrect zone_size as well.
      
      Once the correct zone_size is read from the device, we can rely on the
      zoned flag in fs_info as well to determine if the filesystem is zoned.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: release path before calling to btrfs_load_block_group_zone_info · 4afd2fe8
      Johannes Thumshirn committed
      Since we have no write pointer in conventional zones, we cannot
      determine the allocation offset from it. Instead, we set the allocation
      offset after the highest addressed extent. This is done by reading the
      extent tree in btrfs_load_block_group_zone_info().
      
      However, this function is called from btrfs_read_block_groups(), so the
      read lock for the tree node could be recursively taken.
      
      To avoid this unsafe locking scenario, release the path before reading
      the extent tree to get the allocation offset.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: use regular super block location on zone emulation · d6639b35
      Naohiro Aota committed
      A zoned filesystem currently has a superblock at the beginning of the
      superblock logging zones if the zones are conventional. This difference
      in superblock position causes a chicken-and-egg problem for filesystems
      with emulated zones. Since the device is a regular (non-zoned) device,
      we cannot know if the filesystem is regular or zoned while reading the
      superblock. But, to load the superblock, we need to see if it is
      emulated zoned or not.
      
      Place the superblocks at the same location as they are on a regular
      filesystem on regular devices to solve the problem. This is possible
      because it's ensured that all the superblock locations are in an
      (emulated) conventional zone on regular devices.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: defer loading zone info after opening trees · 73651042
      Naohiro Aota committed
      This is a preparation patch to implement zone emulation on a regular
      device.
      
      To emulate a zoned filesystem on a regular (non-zoned) device, we need to
      decide an emulated zone size. Instead of making it a compile-time static
      value, we'll make it configurable at mkfs time. Since we have one zone ==
      one device extent restriction, we can determine the emulated zone size
      from the size of a device extent. We can extend btrfs_get_dev_zone_info()
      to show a regular device filled with conventional zones once the zone size
      is decided.
      
      The current call site of btrfs_get_dev_zone_info() during the mount process
      is earlier than loading the file system trees so that we don't know the
      size of a device extent at this point. Thus we can't slice a regular device
      to conventional zones.
      
      This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone
      info for all the devices. And, it places this function in open_ctree()
      after loading the trees.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • iomap: support REQ_OP_ZONE_APPEND · c3b0e880
      Naohiro Aota committed
      A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
      max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages()
      builds such a restricted bio using __bio_iov_append_get_pages() if
      bio_op(bio) == REQ_OP_ZONE_APPEND.
      
      To utilize it, we need to set the bio_op before calling
      bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
      that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
      and restricted bio.
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • block: add bio_add_zone_append_page · ae29333f
      Johannes Thumshirn committed
      Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
      is intended to be used by file systems that directly add pages to a bio
      instead of using bio_iov_iter_get_pages().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix extent buffer leak on failure to copy root · 72c9925f
      Filipe Manana committed
      At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
      returning without unlocking and releasing our reference on the extent
      buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
      
      So fix that by unlocking the extent buffer and dropping our reference on
      it before returning.
      
      Fixes: be20aa9d ("Btrfs: Add mount option to turn off data cow")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
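The shape of this fix, releasing what was acquired before returning on an error path, can be shown with a toy model. Everything here is hypothetical: `eb_sketch` stands in for an extent buffer, and the `inc_ref_result` parameter stands in for the return value of btrfs_inc_ref().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy extent buffer: a reference count and a lock flag. */
struct eb_sketch {
	int refs;
	bool locked;
};

/*
 * Model of the error path: the buffer is "allocated" locked with one
 * reference. If the next step fails, unlock it and drop the reference
 * before returning, so nothing is leaked.
 */
int copy_root_sketch(struct eb_sketch *cow, int inc_ref_result)
{
	/* Stand-in for the allocation: locked, one reference held. */
	cow->refs = 1;
	cow->locked = true;

	if (inc_ref_result) {
		cow->locked = false; /* unlock the buffer */
		cow->refs--;         /* drop our reference */
		assert(cow->refs == 0);
		return inc_ref_result;
	}

	/* Success: the locked, referenced buffer goes to the caller. */
	return 0;
}
```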
    • btrfs: explain page locking and readahead in read_extent_buffer_pages() · 2c4d8cb7
      Qu Wenruo committed
      In read_extent_buffer_pages(), if we failed to lock the page atomically,
      we just exit with return value 0.
      
      This is counter-intuitive, as normally if we can't lock what we need, we
      would return something like EAGAIN.
      
      But that return hides under (wait == WAIT_NONE) branch, which only gets
      triggered for readahead.
      
      And for readahead, if we failed to lock the page, it means the extent
      buffer is either being read by another thread, or has been read and is
      under modification. Either way the eb will be or has been cached, thus
      readahead has no need to wait for it.
      
      Add comment on this counter-intuitive behavior.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow read-only mount of 4K sector size fs on 64K page system · 0bb3eb3e
      Qu Wenruo committed
      This adds the basic RO mount ability for 4K sector size on 64K page
      system.
      
      Currently we only plan to support 4K and 64K page systems.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: integrate page status update for data read path into begin/end_page_read · 92082d40
      Qu Wenruo committed
      In the btrfs data page read path, the page status updates are handled in
      two different locations:
      
        btrfs_do_read_page()
        {
            while (cur <= end) {
                /* No need to read from disk */
                if (HOLE/PREALLOC/INLINE) {
                    memset();
                    set_extent_uptodate();
                    continue;
                }
                /* Read from disk */
                ret = submit_extent_page(end_bio_extent_readpage);
            }
        }
      
        end_bio_extent_readpage()
        {
            endio_readpage_update_page_status();
        }
      
      This is fine for sectorsize == PAGE_SIZE case, as for above loop we
      should only hit one branch and then exit.
      
      But for subpage, there is more work to be done in page status update:
      
      - Page Unlock condition
        Unlike regular page size == sectorsize case, we can no longer just
        unlock a page.
        Only the last reader of the page can unlock the page.
        This means, we can unlock the page either in the while() loop, or in
        the endio function.
      
      - Page uptodate condition
        Since we have multiple sectors to read for a page, we can only mark
        the full page uptodate if all sectors are uptodate.
      
      To handle both subpage and regular cases, introduce a pair of functions
      to help handling page status update:
      
      - begin_page_read()
        For regular case, it does nothing.
        For subpage case, it updates the reader counters so that later
        end_page_read() can know who is the last one to unlock the page.
      
      - end_page_read()
        This is just endio_readpage_update_page_status() renamed.
        The original name is a little too long and too specific for endio.
      
        The new thing added is the condition for page unlock.
        Now for subpage data, we unlock the page if we're the last reader.
      
      This not only provides the basis for subpage data read, but also hides
      the special handling of page read from the main read loop.
      
      Also, since we're changing how the page lock is handled, there are two
      existing error paths where we need to manually unlock the page before
      calling begin_page_read().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
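The "last reader unlocks the page" rule for the subpage case can be sketched as a simple counter. This is a single-threaded illustration with hypothetical names (`page_sketch`, `begin_page_read_sketch()`, `end_page_read_sketch()`), assuming a 64K page with 4K sectors. Per-sector uptodate tracking is left out, and real code would need an atomic counter.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTORS_PER_PAGE_SKETCH 16u /* assumption: 64K page, 4K sectors */

/* Toy page: a lock flag and the number of sectors still being read. */
struct page_sketch {
	bool locked;
	unsigned int readers;
};

/* Subpage case: register one pending reader per sector of the page. */
void begin_page_read_sketch(struct page_sketch *page)
{
	assert(page->locked);
	page->readers += SECTORS_PER_PAGE_SKETCH;
}

/*
 * Called once per finished range of nr_sectors, either from the read
 * loop (holes, inline extents) or from the endio handler. Whoever
 * finishes the last sectors unlocks the page.
 */
void end_page_read_sketch(struct page_sketch *page, unsigned int nr_sectors)
{
	assert(nr_sectors > 0 && nr_sectors <= page->readers);
	page->readers -= nr_sectors;
	if (page->readers == 0)
		page->locked = false;
}
```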
    • btrfs: introduce btrfs_subpage for data inodes · 32443de3
      Qu Wenruo committed
      To support subpage sector size, data also needs extra info to track
      which sectors in a page are uptodate/dirty/...
      
      This patch will make pages for data inodes get btrfs_subpage structure
      attached, and detached when the page is freed.
      
      This patch also slightly changes the timing when
      set_page_extent_mapped() is called to make sure:
      
      - We have page->mapping set
        page->mapping->host is used to grab btrfs_fs_info, thus we can only
        call this function after page is mapped to an inode.
      
        One call site attaches pages to inode manually, thus we have to modify
        the timing of set_page_extent_mapped() a bit.
      
      - As soon as possible, before other operations
        Since memory allocation can fail, we have to do extra error handling.
        Calling set_page_extent_mapped() as soon as possible can simplify the
        error handling for several call sites.
      
      The idea is pretty much the same as iomap_page, but with more bitmaps
      for btrfs specific cases.
      
      Currently the plan is to switch to iomap if iomap can provide sector
      aligned write back (only write back dirty sectors, but not the full
      page; data balance requires this feature).
      
      So we will stick to btrfs specific bitmap for now.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce subpage metadata validation check · 371cdc07
      Qu Wenruo committed
      For subpage metadata validation check, there are some differences:
      
      - Read must finish in one bvec
        Since we're just reading one subpage range in one page, it should
        never be split into two bios nor two bvecs.
      
      - How to grab the existing eb
        Instead of grabbing eb using page->private, we have to search the radix
        tree as we don't have any direct pointer at hand.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support subpage in endio_readpage_update_page_status() · 4325cb22
      Qu Wenruo committed
      To handle subpage status update, add the following:
      
      - Use btrfs_page_*() subpage-aware helpers to update page status
        Now we can handle both cases well.
      
      - No page unlock for subpage metadata
        Since subpage metadata doesn't utilize page locking at all, skip it.
        For subpage data locking, it's handled in later commits.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce read_extent_buffer_subpage() · 4012daf7
      Qu Wenruo committed
      Introduce a helper, read_extent_buffer_subpage(), to do the subpage
      extent buffer read.
      
      The differences between the regular and subpage routines are:
      
      - No page locking
        Here we completely rely on extent locking.
        Page locking can reduce the concurrency greatly, as if we lock one
        page to read one extent buffer, all the other extent buffers in the
        same page will have to wait.
      
      - Extent uptodate condition
        Besides the existing PageUptodate() and EXTENT_BUFFER_UPTODATE checks,
        we also need to check btrfs_subpage::uptodate_bitmap.
      
      - No page iteration
        Just one page, no need to loop; this greatly simplifies the subpage
        routine.
      
      This patch only implements the bio submit part, no endio support yet.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support subpage in try_release_extent_buffer() · d1e86e3f
      Qu Wenruo committed
      Unlike the original try_release_extent_buffer(),
      try_release_subpage_extent_buffer() will iterate through all the ebs in
      the page, and try to release each.
      
      We can release the full page only after there's no private attached,
      which means all ebs of that page have been released as well.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support subpage in btrfs_clone_extent_buffer · 92d83e94
      Qu Wenruo committed
      For btrfs_clone_extent_buffer(), it's mostly the same code as
      __alloc_dummy_extent_buffer(), except it has an extra page copy.
      
      So to make it subpage compatible, we only need to:
      
      - Call set_extent_buffer_uptodate() instead of SetPageUptodate()
        This will set correct uptodate bit for subpage and regular sector size
        cases.
      
      Since we're calling set_extent_buffer_uptodate() which will also set
      EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit
      either.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support subpage in set/clear_extent_buffer_uptodate() · 251f2acc
      Qu Wenruo committed
      To support subpage in set_extent_buffer_uptodate and
      clear_extent_buffer_uptodate we only need to use the subpage-aware
      helpers to update the page bits.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce helpers for subpage error status · 03a816b3
      Qu Wenruo committed
      Introduce the following functions to handle subpage error status:
      
      - btrfs_subpage_set_error()
      - btrfs_subpage_clear_error()
      - btrfs_subpage_test_error()
        These helpers can only be called when the page has subpage attached
        and the range is ensured to be inside the page.
      
      - btrfs_page_set_error()
      - btrfs_page_clear_error()
      - btrfs_page_test_error()
        These helpers can handle both regular sector size and subpage without
        problem.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce helpers for subpage uptodate status · a1d767c1
      Qu Wenruo committed
      Introduce the following functions to handle subpage uptodate status:
      
      - btrfs_subpage_set_uptodate()
      - btrfs_subpage_clear_uptodate()
      - btrfs_subpage_test_uptodate()
        These helpers can only be called when the page has subpage attached
        and the range is ensured to be inside the page.
      
      - btrfs_page_set_uptodate()
      - btrfs_page_clear_uptodate()
      - btrfs_page_test_uptodate()
        These helpers can handle both regular sector size and subpage,
        although the caller should still ensure that the range is inside the
        page.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
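A per-sector status bitmap of the kind these helpers manage might look like the sketch below. It assumes a 64K page with 4K sectors, so that 16 bits cover one page, and it takes the offset within the page directly. The struct `subpage_sketch` and the `*_sketch()` functions are hypothetical and only model the set/clear/test trio for the uptodate bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH   65536u /* assumption: 64K pages */
#define SECTOR_SIZE_SKETCH 4096u  /* assumption: 4K sectors */

/* Hypothetical per-page structure: one uptodate bit per sector. */
struct subpage_sketch {
	uint16_t uptodate_bitmap;
};

/* Build the mask covering [offset, offset + len) inside the page. */
uint16_t subpage_calc_bitmap_sketch(uint32_t offset, uint32_t len)
{
	unsigned int bit_start = offset / SECTOR_SIZE_SKETCH;
	unsigned int nbits = len / SECTOR_SIZE_SKETCH;

	/* The range must be sector aligned and inside the page. */
	assert(len > 0);
	assert(offset % SECTOR_SIZE_SKETCH == 0);
	assert(len % SECTOR_SIZE_SKETCH == 0);
	assert(offset + len <= PAGE_SIZE_SKETCH);
	return (uint16_t)(((1ul << nbits) - 1) << bit_start);
}

void subpage_set_uptodate_sketch(struct subpage_sketch *sp,
				 uint32_t offset, uint32_t len)
{
	sp->uptodate_bitmap |= subpage_calc_bitmap_sketch(offset, len);
}

void subpage_clear_uptodate_sketch(struct subpage_sketch *sp,
				   uint32_t offset, uint32_t len)
{
	sp->uptodate_bitmap &= (uint16_t)~subpage_calc_bitmap_sketch(offset, len);
}

/* True only if every sector in the range is uptodate. */
bool subpage_test_uptodate_sketch(const struct subpage_sketch *sp,
				  uint32_t offset, uint32_t len)
{
	uint16_t mask = subpage_calc_bitmap_sketch(offset, len);

	return (sp->uptodate_bitmap & mask) == mask;
}
```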
    • btrfs: attach private to dummy extent buffer pages · 09bc1f0f
      Qu Wenruo committed
      There are locations where we allocate dummy extent buffers for temporary
      usage, like in tree_mod_log_rewind() or get_old_root().
      
      These dummy extent buffers will be handled by the same eb accessors, and
      if they don't have page::private, subpage eb accessors could fail.
      
      To address such problems, make __alloc_dummy_extent_buffer() attach
      page private for dummy extent buffers too.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: support subpage for extent buffer page release · 8ff8466d
      Qu Wenruo committed
      In btrfs_release_extent_buffer_pages(), we need to add extra handling
      for subpage.
      
      Introduce a helper, detach_extent_buffer_page(), to do different
      handling for regular and subpage cases.
      
      For subpage case, handle detaching page private.
      
      For unmapped (dummy or cloned) ebs, we can detach the page private
      immediately as the page can only be attached to one unmapped eb.
      
      For mapped ebs, we have to ensure there is no eb in the page range
      before we detach the page private, as page->private is shared between
      all ebs in the same page.
      
      But there is a subpage specific race, where we can race with extent
      buffer allocation, and clear the page private while new eb is still
      being utilized, like this:
      
        Extent buffer A is the new extent buffer which will be allocated,
        while extent buffer B is the last existing extent buffer of the page.
      
        		T1 (eb A) 	 |		T2 (eb B)
        -------------------------------+------------------------------
        alloc_extent_buffer()		 | btrfs_release_extent_buffer_pages()
        |- p = find_or_create_page()   | |
        |- attach_extent_buffer_page() | |
        |				 | |- detach_extent_buffer_page()
        |				 |    |- if (!page_range_has_eb())
        |				 |    |  No new eb in the page range yet
        |				 |    |  As new eb A hasn't yet been
        |				 |    |  inserted into radix tree.
        |				 |    |- btrfs_detach_subpage()
        |				 |       |- detach_page_private();
        |- radix_tree_insert()	 |
      
        Then we have a metadata eb whose page has no private bit.
      
      To avoid such race, we introduce a subpage metadata-specific member,
      btrfs_subpage::eb_refs.
      
      In alloc_extent_buffer() we increase eb_refs in the critical section of
      private_lock.  Then page_range_has_eb() will return true for
      detach_extent_buffer_page(), and will not detach page private.
      
      The section is marked by:
      
      - btrfs_page_inc_eb_refs()
      - btrfs_page_dec_eb_refs()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8ff8466d
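The race described in the commit above can be replayed in a small user-space model. This is only a sketch: `Subpage`, `Page`, `tree_ebs` and `run_interleaving` are hypothetical names, locking is omitted because the steps run in the fixed order of the diagram, and placing the decrement right after the radix tree insert is an assumption of this model.

```python
class Subpage:
    """Stand-in for btrfs_subpage; only the eb_refs counter is modelled."""
    def __init__(self):
        self.eb_refs = 0


class Page:
    """Stand-in for a metadata page shared by several extent buffers."""
    def __init__(self):
        self.private = Subpage()   # models page->private, attached by eb B
        self.tree_ebs = {"B"}      # ebs of this page present in the radix tree


def page_range_has_eb(page):
    # An eb counts if it is in the tree OR an allocation has pinned the page.
    return bool(page.tree_ebs) or page.private.eb_refs > 0


def run_interleaving(use_eb_refs):
    """Replay the T1/T2 ordering from the diagram. Returns True if the
    page private survives until eb A has been inserted into the tree."""
    page = Page()

    # T1: attach_extent_buffer_page(), inside the private_lock section
    if use_eb_refs:
        page.private.eb_refs += 1      # models btrfs_page_inc_eb_refs()

    # T2: eb B is released and detach_extent_buffer_page() runs
    page.tree_ebs.discard("B")
    if not page_range_has_eb(page):
        page.private = None            # models detach_page_private()

    # T1: radix_tree_insert() of eb A, then drop the temporary pin
    page.tree_ebs.add("A")
    if use_eb_refs:
        page.private.eb_refs -= 1      # models btrfs_page_dec_eb_refs()

    return page.private is not None
```

In this model, `run_interleaving(False)` ends with eb A in the tree but no page private, while `run_interleaving(True)` keeps it attached.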
    • Q
      btrfs: make grab_extent_buffer_from_page() handle subpage case · 81982210
      Qu Wenruo committed
      For subpage case, grab_extent_buffer() can't really get an extent buffer
      just from btrfs_subpage.
      
      We have radix tree lock protecting us from inserting the same eb into
      the tree.  Thus we don't really need to do the extra hassle, just let
      alloc_extent_buffer() handle the existing eb in radix tree.
      
      Now if two ebs are being allocated at the same time, one will fail with
      -EEXIST when inserting into the radix tree.
      
      So for grab_extent_buffer(), just always return NULL for subpage case.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      81982210
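The "insert, and reuse the winner on -EEXIST" behaviour described in the commit above can be sketched in user space as follows. `EbTree` and the dictionary-based extent buffers are hypothetical stand-ins for the radix tree and `struct extent_buffer`; the two functions only borrow the kernel names for readability and do not reproduce their real signatures.

```python
import errno
import threading


class EbTree:
    """Toy stand-in for the extent buffer radix tree, keyed by start bytenr."""
    def __init__(self):
        self._lock = threading.Lock()   # models the radix tree lock
        self._ebs = {}

    def insert(self, start, eb):
        with self._lock:
            if start in self._ebs:
                return -errno.EEXIST    # someone else inserted first
            self._ebs[start] = eb
            return 0

    def lookup(self, start):
        with self._lock:
            return self._ebs.get(start)


def grab_extent_buffer(is_subpage, page_private):
    # Subpage case: page private is shared by several ebs, so it cannot
    # identify a single eb. Always report "nothing found".
    if is_subpage:
        return None
    return page_private


def alloc_extent_buffer(tree, start, owner):
    new_eb = {"start": start, "owner": owner}
    if tree.insert(start, new_eb) == -errno.EEXIST:
        return tree.lookup(start)       # lose the race, use the existing eb
    return new_eb
```

Two callers that allocate the same `start` end up sharing the object inserted by whichever caller reached the tree first.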
    • Q
      btrfs: make attach_extent_buffer_page() handle subpage case · 760f991f
      Qu Wenruo committed
      For subpage case, we need to allocate additional memory for each
      metadata page.
      
      So we need to:
      
      - Allow attach_extent_buffer_page() to return int to indicate allocation
        failure
      
      - Allow manual preallocation of subpage memory for alloc_extent_buffer()
        As we don't want to use GFP_ATOMIC under spinlock, we introduce
        btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
        purpose.
        (The simple wrapper for btrfs_free_subpage() is for a later conversion
         to kmem_cache. Already internally tested without problem)
      
      - Preallocate btrfs_subpage structure for alloc_extent_buffer()
        We don't want to call memory allocation with spinlock held, so
        do preallocation before we acquire mapping->private_lock.
      
      - Handle subpage and regular case differently in
        attach_extent_buffer_page()
        For regular case, no change, just do the usual thing.
        For subpage case, allocate new memory or use the preallocated memory.
      
      For future subpage metadata, we will make use of radix tree to grab
      extent buffer.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      760f991f
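The preallocation pattern from the commit above (allocate before taking the lock, attach under the lock, free the spare afterwards) can be sketched in user space like this. `Subpage`, `Page` and `attach_with_prealloc` are hypothetical names; the comments mark which kernel step each line is meant to model.

```python
import threading


class Subpage:
    """Stand-in for the per-page btrfs_subpage structure."""


class Page:
    def __init__(self):
        self.private = None   # models page->private


def attach_with_prealloc(private_lock, page):
    """Attach a Subpage to the page without allocating under the lock.
    Returns True if the preallocated structure was consumed."""
    prealloc = Subpage()              # models btrfs_alloc_subpage(), lock not held

    with private_lock:                # models mapping->private_lock
        if page.private is None:
            page.private = prealloc   # ownership moves to the page
            prealloc = None

    used = prealloc is None
    prealloc = None                   # models btrfs_free_subpage() on the spare
    return used
```

The first call on a page consumes its preallocation; a later call finds the page private already attached and discards its spare outside the lock.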
    • Q
      btrfs: introduce the skeleton of btrfs_subpage structure · cac06d84
      Qu Wenruo committed
      For sectorsize < page size support, we need a structure to record extra
      status info for each sector of a page.
      
      Introduce the skeleton structure; all subpage related code will go to
      subpage.[ch].
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cac06d84
    • Q
      btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support · 62c053fb
      Qu Wenruo committed
      For the incoming subpage support, UNMAPPED extent buffer will have
      different behavior in btrfs_release_extent_buffer().
      
      This means we need to set UNMAPPED bit early before calling
      btrfs_release_extent_buffer().
      
      Currently there is only one caller which relies on
      btrfs_release_extent_buffer() in its error path while setting the
      UNMAPPED bit late:
      - btrfs_clone_extent_buffer()
      
      Make it subpage compatible by setting the UNMAPPED bit early. Since
      we're here, also set the UPTODATE bit early.
      
      There is another caller, __alloc_dummy_extent_buffer(), setting the
      UNMAPPED bit late, but that function cleans up the allocated page
      manually, thus there is no need for any modification.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      62c053fb
    • Q
      btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK · 6869b0a8
      Qu Wenruo committed
      PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
      __process_pages_contig(), to let the function know to clear page dirty
      bit and then set page writeback.
      
      However, page writeback and dirty bits are conflicting (at least for
      sector size == PAGE_SIZE case). This means these two have to be always
      updated together.
      
      This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
      PAGE_START_WRITEBACK.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6869b0a8
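The reasoning in the commit above can be illustrated with a toy flag model. The numeric flag values and the dictionary-based page are assumptions made for this sketch; only the three flag names come from the commit message.

```python
from enum import IntFlag


class OldOps(IntFlag):
    PAGE_CLEAR_DIRTY = 1      # value is an assumption
    PAGE_SET_WRITEBACK = 2    # value is an assumption


class NewOps(IntFlag):
    PAGE_START_WRITEBACK = 1  # value is an assumption


def process_page_old(page, ops):
    # Two independent flags: a caller may pass one without the other.
    if ops & OldOps.PAGE_CLEAR_DIRTY:
        page["dirty"] = False
    if ops & OldOps.PAGE_SET_WRITEBACK:
        page["writeback"] = True


def process_page_new(page, ops):
    # One flag: the two state changes can no longer be separated.
    if ops & NewOps.PAGE_START_WRITEBACK:
        page["dirty"] = False
        page["writeback"] = True
```

With the old flags, passing only `PAGE_SET_WRITEBACK` leaves a page both dirty and under writeback, which is the conflicting state the commit message refers to. The merged flag makes that combination impossible to request.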
    • F
      btrfs: make concurrent fsyncs wait less when waiting for a transaction commit · d0c2f4fa
      Filipe Manana committed
      Often an fsync needs to fall back to a transaction commit for several
      reasons (to ensure consistency after a power failure, a new block group
      was allocated or a temporary error such as ENOMEM or ENOSPC happened).
      
      In that case the log is marked as needing a full commit and any concurrent
      tasks attempting to log inodes or commit the log will also fall back to the
      transaction commit. When this happens they all wait for the task that first
      started the transaction commit to finish it. However, they wait until the
      full transaction commit happens, which is not needed: they only need to
      wait for the superblocks to be persisted, not for the unpinning of all the
      extents pinned during the transaction's lifetime. Even for short lived
      transactions that can be a few thousand extents and take a significant
      amount of time to complete. For dbench workloads I have observed up to
      4~5 milliseconds spent unpinning extents in the worst cases, with the
      number of pinned extents between 2 and 3 thousand.
      
      So allow fsync tasks to skip waiting for the unpinning of extents when
      they call btrfs_commit_transaction() and they were not the task that
      started the transaction commit (that one has to do it, the alternative
      would be to offload the transaction commit to another task so that it
      could avoid waiting for the extent unpinning or offload the extent
      unpinning to another task).
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      After applying the entire patchset, dbench shows improvements with respect
      to throughput and latency. The script used to measure it is the following:
      
        $ cat dbench-test.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      The test was run on a physical machine with 12 cores (Intel Core i7), 64G
      of RAM, using an NVMe device and a non-debug kernel configuration (Debian's
      default configuration).
      
      Before applying patchset, 32 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9627107     0.153    61.938
       Close        7072076     0.001     3.175
       Rename        407633     1.222    44.439
       Unlink       1943895     0.658    44.440
       Deltree          256    17.339   110.891
       Mkdir            128     0.003     0.009
       Qpathinfo    8725406     0.064    17.850
       Qfileinfo    1529516     0.001     2.188
       Qfsinfo      1599884     0.002     1.457
       Sfileinfo     784200     0.005     3.562
       Find         3373513     0.411    30.312
       WriteX       4802132     0.053    29.054
       ReadX       15089959     0.002     5.801
       LockX          31344     0.002     0.425
       UnlockX        31344     0.001     0.173
       Flush         674724     5.952   341.830
      
      Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms
      
      After applying patchset, 32 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9931568     0.111    25.597
       Close        7295730     0.001     2.171
       Rename        420549     0.982    49.714
       Unlink       2005366     0.497    39.015
       Deltree          256    11.149    89.242
       Mkdir            128     0.002     0.014
       Qpathinfo    9001863     0.049    20.761
       Qfileinfo    1577730     0.001     2.546
       Qfsinfo      1650508     0.002     3.531
       Sfileinfo     809031     0.005     5.846
       Find         3480259     0.309    23.977
       WriteX       4952505     0.043    41.283
       ReadX       15568127     0.002     5.476
       LockX          32338     0.002     0.978
       UnlockX        32338     0.001     2.032
       Flush         696017     7.485   228.835
      
      Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms
      
       --> +4.1% throughput, -39.6% max latency
      
      Before applying patchset, 64 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    8956748     0.342   108.312
       Close        6579660     0.001     3.823
       Rename        379209     2.396    81.897
       Unlink       1808625     1.108   131.148
       Deltree          256    25.632   172.176
       Mkdir            128     0.003     0.018
       Qpathinfo    8117615     0.131    55.916
       Qfileinfo    1423495     0.001     2.635
       Qfsinfo      1488496     0.002     5.412
       Sfileinfo     729472     0.007     8.643
       Find         3138598     0.855    78.321
       WriteX       4470783     0.102    79.442
       ReadX       14038139     0.002     7.578
       LockX          29158     0.002     0.844
       UnlockX        29158     0.001     0.567
       Flush         627746    14.168   506.151
      
      Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms
      
      After applying patchset, 64 clients:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9069003     0.303    43.193
       Close        6662328     0.001     3.888
       Rename        383976     2.194    46.418
       Unlink       1831080     1.022    43.873
       Deltree          256    24.037   155.763
       Mkdir            128     0.002     0.005
       Qpathinfo    8219173     0.137    30.233
       Qfileinfo    1441203     0.001     3.204
       Qfsinfo      1507092     0.002     4.055
       Sfileinfo     738775     0.006     5.431
       Find         3177874     0.936    38.170
       WriteX       4526152     0.084    39.518
       ReadX       14213562     0.002    24.760
       LockX          29522     0.002     1.221
       UnlockX        29522     0.001     0.694
       Flush         635652    14.358   422.039
      
      Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms
      
       --> +6.8% throughput, -18.1% max latency
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d0c2f4fa
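The summary percentages in the commit above (+4.1% / -39.6% and +6.8% / -18.1%) appear to match the symmetric percentage difference, that is, the change divided by the mean of the two values, rather than the change relative to the "before" value. This is inferred from the numbers and is not stated in the commit message. A minimal sketch of both calculations, with the hypothetical helper names `pct_diff_symmetric` and `pct_change_from_before`:

```python
def pct_diff_symmetric(before, after):
    """Change relative to the mean of both values, in percent."""
    return (after - before) / ((before + after) / 2) * 100


def pct_change_from_before(before, after):
    """Change relative to the 'before' value, in percent."""
    return (after - before) / before * 100


# (before, after) pairs taken from the dbench summary lines above
runs = {
    "32 clients throughput (MB/sec)": (1008.02, 1049.91),
    "32 clients max latency (ms)": (341.833, 228.847),
    "64 clients throughput (MB/sec)": (924.738, 990.13),
    "64 clients max latency (ms)": (506.154, 422.043),
}

for name, (before, after) in runs.items():
    print(f"{name}: symmetric {pct_diff_symmetric(before, after):+.1f}%, "
          f"from before {pct_change_from_before(before, after):+.1f}%")
```

Measured against the "before" value instead, the 32 clients run works out to roughly +4.2% throughput and -33.1% max latency.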
    • F
      btrfs: remove unnecessary check_parent_dirs_for_sync() · 64d6b281
      Filipe Manana committed
      Whenever we fsync an inode, if it is a directory, a regular file that was
      created in the current transaction or has last_unlink_trans set to the
      generation of the current transaction, we check if any of its ancestor
      inodes (and the inode itself if it is a directory) can not be logged and
      need a fallback to a full transaction commit - if so, we return with a
      value of 1 in order to fall back to a transaction commit.
      
      However we often do not need to fall back to a transaction commit because:
      
      1) The ancestor inode is not an immediate parent, and therefore there is
         no explicit request to log it, and logging it is needed neither to
         guarantee the consistency of the inode originally asked to be logged
         (fsynced) nor that of its immediate parent;
      
      2) The ancestor inode was already logged before, in which case any link,
         unlink or rename operation updates the log as needed.
      
      So for these two cases we can avoid an unnecessary transaction commit.
      Therefore remove check_parent_dirs_for_sync() and add a check at the top
      of btrfs_log_inode() to make us fall back immediately to a transaction
      commit when we are logging a directory inode that can not be logged and
      needs a full transaction commit. All we need to protect is the case where
      after renaming a file someone fsyncs only the old directory, which would
      result in losing the renamed file after a log replay.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      64d6b281
    • F
      btrfs: skip logging inodes already logged when logging new entries · 0e44cb3f
      Filipe Manana committed
      When logging new directory entries of a directory, we log the inodes of
      new dentries and the inodes of dentries pointing to directories that
      may have been created in past transactions. For the case of directories
      we log in full mode, which can be particularly expensive for large
      directories.
      
      We do use btrfs_inode_in_log() to skip already logged inodes, however for
      that helper to return true, it requires the log transaction used to
      log the inode to be already committed. This means that when we have more
      than one task using the same log transaction we can end up logging an
      inode multiple times, which is a waste of time and not necessary since
      the log will be committed by one of the tasks and the others will wait for
      the log transaction to be committed before returning to user space.
      
      So simply replace the use of btrfs_inode_in_log() with the new helper
      function need_log_inode(), introduced in a previous commit.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0e44cb3f
    • F
      btrfs: skip logging directories already logged when logging all parents · 3e6a86a1
      Filipe Manana committed
      Sometimes when we fsync an inode we need to do a full log of all its
      ancestors (due to unlink, link or rename operations), which can be an
      expensive operation, especially if the directories are large.
      
      However if we find an ancestor directory inode that is already logged in
      the current transaction, and has no inserted/updated/deleted xattrs since
      it was last logged, we can skip logging the directory again. It is safe
      to skip that since we know that for logged directories, any link, unlink
      or rename operations that involve the directory will update the log as
      necessary.
      
      So use the helper need_log_dir(), introduced in a previous commit, to
      detect already logged directories that can be skipped.
      
      This patch is part of a patchset comprised of the following patches:
      
        btrfs: remove unnecessary directory inode item update when deleting dir entry
        btrfs: stop setting nbytes when filling inode item for logging
        btrfs: avoid logging new ancestor inodes when logging new inode
        btrfs: skip logging directories already logged when logging all parents
        btrfs: skip logging inodes already logged when logging new entries
        btrfs: remove unnecessary check_parent_dirs_for_sync()
        btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
      
      Performance results, after applying all patches, are mentioned in the
      change log of the last patch.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e6a86a1