1. 27 Oct 2021, 13 commits
    • btrfs: cleanup for extent_write_locked_range() · 2bd0fc93
      Committed by Qu Wenruo
      There are several cleanups for extent_write_locked_range(); most of them
      are pure cleanups, with some preparation for future subpage support.
      
      - Add a proper comment for which call sites are suitable
        Unlike regular synchronized extent write back, if async COW or zoned
        COW happens, we have all pages in the range still locked.
      
        Thus for those (only) two call sites, we need this function to put the
        page contents into bios and submit them.
      
      - Remove the @mode parameter
        Both existing call sites pass WB_SYNC_ALL, so the @mode parameter is
        not needed.
      
      - Better error handling
        Currently if we hit an error during the page iteration loop, we
        overwrite @ret, so only the last error gets recorded (see the sketch
        after this list).
      
        Here we add the @found_error and @first_error variables to record
        whether we hit any error, and the first error we hit, so the first
        error won't get lost.
      
      - Don't reuse @start as the cursor
        We reuse the parameter @start as the cursor to iterate the range, not
        a big problem, but since we're here, introduce a proper @cur as the
        cursor.
      
      - Remove an impossible branch
        Since all pages are still locked after the ordered extent is inserted,
        there is no way a page can get its dirty bit cleared.
        Remove the branch where the page is not dirty and replace it with an
        ASSERT().
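      A minimal user-space sketch of the first-error-wins pattern described
      above (the names and the failing submit stub are illustrative, not the
      actual btrfs code):

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-in for submitting one page's content; fails on two pages. */
        static int submit_page(unsigned long index)
        {
                return (index == 2 || index == 5) ? -5 /* -EIO */ : 0;
        }

        int write_locked_range(unsigned long first, unsigned long last)
        {
                bool found_error = false;
                int first_error = 0;
                unsigned long cur;      /* dedicated cursor, parameters stay intact */

                for (cur = first; cur <= last; cur++) {
                        int ret = submit_page(cur);

                        if (ret < 0 && !found_error) {
                                found_error = true;
                                first_error = ret;      /* later errors won't overwrite it */
                        }
                }
                return found_error ? first_error : 0;
        }

        int main(void)
        {
                printf("ret = %d\n", write_locked_range(0, 9)); /* ret = -5 */
                return 0;
        }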
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor submit_compressed_extents() · b4ccace8
      Committed by Qu Wenruo
      We have a big chunk of code inside a while() loop, with tons of strange
      jumps for error handling.  It's definitely not up to today's coding
      standards.  Move the code into a new function, submit_one_async_extent().
      
      Since we're here, also do the following changes:
      
      - Comment style change
        To follow the current scheme
      
      - Don't fall back to non-compressed write when hitting ENOSPC
        If we hit ENOSPC for a compressed write, we could not reserve more
        space for a non-compressed write either, so go down the error path
        directly.
        This removes the retry: label.
      
      - Add more comments for the super long parameter list
        Explain what each parameter is for, so we don't need to check the
        prototype.
      
      - Move the error handling to submit_one_async_extent()
        Thus no strange code like:
      
        out_free:
      	...
      	goto again;
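      Schematically, the refactor has the following shape; this is a hedged
      sketch with stub types, where only the control flow mirrors the commit
      and everything else is illustrative:

        #include <stdio.h>

        struct async_extent { long start; };
        struct async_chunk {
                struct async_extent extents[3];
                int next, nr, error;
        };

        /* All per-extent work and its error cleanup live here, so a failure
         * simply returns instead of jumping to out_free:/retry: labels. */
        static int submit_one_async_extent(struct async_extent *ae)
        {
                return ae->start < 0 ? -5 /* -EIO */ : 0;
        }

        static void submit_compressed_extents(struct async_chunk *chunk)
        {
                while (chunk->next < chunk->nr) {
                        struct async_extent *ae = &chunk->extents[chunk->next++];
                        int ret = submit_one_async_extent(ae);

                        if (ret < 0 && !chunk->error)
                                chunk->error = ret;
                }
        }

        int main(void)
        {
                struct async_chunk c = { .extents = { {0}, {-1}, {8192} }, .nr = 3 };

                submit_compressed_extents(&c);
                printf("error = %d\n", c.error); /* error = -5 */
                return 0;
        }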
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused function btrfs_bio_fits_in_stripe() · 6aabd858
      Committed by Qu Wenruo
      As the last caller in compression.c has been removed, we don't need that
      function anymore.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: add bitmap for PageChecked flag · e4f94347
      Committed by Qu Wenruo
      Although btrfs makes very limited use of the PageChecked flag, it's
      still a page flag that is not yet subpage compatible.
      
      Fix it by introducing btrfs_subpage::checked_offset to do the
      conversion.
      
      Most call sites, especially the free-space cache, COW fixup and
      btrfs_invalidatepage(), work in full page mode anyway.
      
      The other call sites work in subpage compatible mode.
      
      Some call sites need extra modification:
      
      - btrfs_drop_pages()
        Needs extra parameter to get the real range we need to clear checked
        flag.
      
        Also since btrfs_drop_pages() will accept pages beyond the dirtied
        range, update btrfs_subpage_clamp_range() to handle such cases by
        setting @len to 0 if the page is beyond the target range (see the
        sketch after this list).
      
      - btrfs_invalidatepage()
        We need to call the subpage helper before calling
        __btrfs_releasepage(), or it will trigger an ASSERT() as
        page->private will be cleared.
      
      - btrfs_verify_data_csum()
        In theory we don't need the io_bio->csum check anymore, but it won't
        hurt to keep it.  Just change the comment.
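      A small standalone sketch of the clamping behavior described for
      btrfs_subpage_clamp_range() above (illustrative, not the btrfs helper
      itself):

        #include <stdio.h>

        #define PAGE_SIZE 65536UL

        /* Clamp [*start, *start + *len) to the page starting at page_start,
         * and set *len to 0 when the page and the range don't overlap. */
        static void subpage_clamp_range(unsigned long page_start,
                                        unsigned long *start, unsigned long *len)
        {
                unsigned long page_end = page_start + PAGE_SIZE;
                unsigned long end = *start + *len;

                if (*start >= page_end || end <= page_start) {
                        *len = 0;       /* page is beyond the target range */
                        return;
                }
                if (*start < page_start)
                        *start = page_start;
                if (end > page_end)
                        end = page_end;
                *len = end - *start;
        }

        int main(void)
        {
                unsigned long start = 4096, len = 8192;

                subpage_clamp_range(0, &start, &len);
                printf("[%lu, +%lu)\n", start, len);    /* [4096, +8192) */

                start = 200000; len = 4096;             /* no overlap with page 0 */
                subpage_clamp_range(0, &start, &len);
                printf("len = %lu\n", len);             /* len = 0 */
                return 0;
        }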
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't pass compressed pages to btrfs_writepage_endio_finish_ordered() · 58469174
      Committed by Qu Wenruo
      Since async_extent holds the compressed page, it would trigger the new
      ASSERT() in btrfs_mark_ordered_io_finished() which checks that the range
      is inside the page.
      
      Now that btrfs_writepage_endio_finish_ordered() can accept
      @page == NULL, just pass NULL to it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use async_chunk::async_cow to replace the confusing pending pointer · 9e895a8f
      Committed by Qu Wenruo
      For the structure async_chunk, we use a very strange member layout to
      grab the async_cow structure that owns this async_chunk.
      
      At initialization, it goes like this:
      
      		async_chunk[i].pending = &ctx->num_chunks;
      
      Then at async_cow_free() we do a super weird freeing:
      
      	/*
      	 * Since the pointer to 'pending' is at the beginning of the array of
      	 * async_chunk's, freeing it ensures the whole array has been freed.
      	 */
      	if (atomic_dec_and_test(async_chunk->pending))
      		kvfree(async_chunk->pending);
      
      This is absolutely an abuse of kvfree().
      
      Replace async_chunk::pending with async_chunk::async_cow, so that we can
      grab the async_cow structure directly, without this strange dance.
      
      And with this change, there is no requirement for any specific member
      location.
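      In outline, the change swaps the layout trick for an explicit
      back-pointer; a hedged user-space sketch with simplified fields and
      plain reference counting instead of atomics:

        #include <stdio.h>
        #include <stdlib.h>

        struct async_cow;

        struct async_chunk {
                struct async_cow *async_cow;    /* direct back-pointer (after) */
                /* before: atomic_t *pending aimed at async_cow->num_chunks,
                 * and kvfree(async_chunk->pending) relied on 'pending' being
                 * the first member of async_cow -- the layout abuse above */
        };

        struct async_cow {
                int num_chunks;
                struct async_chunk chunks[];    /* flexible array member */
        };

        static void async_cow_free(struct async_chunk *chunk)
        {
                struct async_cow *cow = chunk->async_cow;

                /* Free the owning structure directly, no layout tricks. */
                if (--cow->num_chunks == 0)
                        free(cow);
        }

        int main(void)
        {
                struct async_cow *cow = malloc(sizeof(*cow) +
                                               2 * sizeof(cow->chunks[0]));

                cow->num_chunks = 2;
                for (int i = 0; i < 2; i++)
                        cow->chunks[i].async_cow = cow;

                async_cow_free(&cow->chunks[0]);
                async_cow_free(&cow->chunks[1]);  /* last reference frees cow */
                puts("freed");
                return 0;
        }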
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Committed by Filipe Manana
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
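      A hedged sketch of the single-pass idea: the struct mirrors the batch
      described above, while field and function names are illustrative:

        #include <stdio.h>

        /* Keys, per-item data sizes, precomputed total, and item count. */
        struct item_batch {
                const unsigned long *keys;
                const unsigned int *data_sizes;
                unsigned int total_data_size;
                int nr;
        };

        static void insert_empty_items(const struct item_batch *batch)
        {
                /* No summing loop here anymore: the caller supplied the total. */
                printf("reserving %u bytes for %d items\n",
                       batch->total_data_size, batch->nr);
        }

        int main(void)
        {
                unsigned long keys[3] = { 1, 2, 3 };
                unsigned int sizes[3];
                unsigned int total = 0;

                /* One loop populates the sizes *and* accumulates the total. */
                for (int i = 0; i < 3; i++) {
                        sizes[i] = 30 + i;
                        total += sizes[i];
                }

                struct item_batch batch = {
                        .keys = keys, .data_sizes = sizes,
                        .total_data_size = total, .nr = 3,
                };
                insert_empty_items(&batch);
                return 0;
        }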
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Committed by Qu Wenruo
      Previously we had "struct btrfs_bio", which records the IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bios.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      This commit changes the meaning of struct btrfs_bio. A name like
      btrfs_logical_bio was suggested, but it's a bit long and we'd prefer to
      use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: keep track of the last logged keys when logging a directory · dc287224
      Committed by Filipe Manana
      After the first time we log a directory in the current transaction, for
      each directory item in a changed leaf of the subvolume tree, we have to
      check if we previously logged the item, in order to overwrite it in case
      its data changed or skip it in case its data hasn't changed.
      
      Checking whether we have logged each item before not only wastes time,
      but also adds lock contention on the log tree. So in order to minimize
      the number of times we do such checks, keep track of the offset of the
      last key we logged for a directory and, the next time we log the
      directory, skip the checks for any new keys that have an offset greater
      than the offset we previously saved. This is especially effective for
      index keys, because the offset for these keys comes from a monotonically
      increasing counter.
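      A hedged sketch of the skip logic (field and function names are
      illustrative, not the actual btrfs code):

        #include <stdio.h>

        struct dir_log_state {
                unsigned long long last_index_offset;   /* highest key logged */
        };

        /* Index key offsets come from a monotonically increasing counter,
         * so any key above the saved offset must be new: log it without
         * checking the log tree for a previous copy. */
        static int need_logged_check(const struct dir_log_state *s,
                                     unsigned long long key_offset)
        {
                return key_offset <= s->last_index_offset;
        }

        int main(void)
        {
                struct dir_log_state s = { .last_index_offset = 100 };

                printf("%d\n", need_logged_check(&s, 50));   /* 1: recheck */
                printf("%d\n", need_logged_check(&s, 101));  /* 0: definitely new */
                return 0;
        }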
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 5/5.
      
      The following test was used on a non-debug kernel to measure the impact
      it has on a directory fsync:
      
        $ cat test-dir-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_NEW_FILES=100000
        NUM_FILE_DELETES=1000
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will log the new dir items and the inodes
        # they point to, because these are new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
        # sync to force transaction commit and wipeout the log.
        sync
      
        del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will only log dir items, there are no
        # dentries pointing to new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
      
        umount $MNT
      
      Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:
      
      **** before patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 848 ms after adding 100000 files
      dir fsync took 175 ms after deleting 1000 files
      
      **** after patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 758 ms after adding 100000 files  (-11.2%)
      dir fsync took 63 ms after deleting 1000 files   (-94.1%)
      
      **** before patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 9945 ms after adding 1000000 files
      dir fsync took 473 ms after deleting 1000 files
      
      **** after patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 8677 ms after adding 1000000 files (-13.6%)
      dir fsync took 146 ms after deleting 1000 files   (-105.6%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check for relocation inodes on zoned btrfs in should_nocow · 2adada88
      Committed by Johannes Thumshirn
      Prepare for allowing preallocation for relocation inodes.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Committed by Johannes Thumshirn
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert latest_bdev type to btrfs_device and rename · d24fa5c1
      Committed by Anand Jain
      In preparation to fix a bug in btrfs_show_devname().
      
      Convert fs_devices::latest_bdev type from struct block_device to struct
      btrfs_device and rename the member to fs_devices::latest_dev, so that
      btrfs_show_devname() can use fs_devices::latest_dev::name.
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish fully written block group · be1a1d7a
      Committed by Naohiro Aota
      If we have written up to the zone capacity, the device automatically
      deactivates the zone. Sync up the block group side (the active BG list
      and the zone_is_active flag) with it.
      
      We need to do this for both data BGs and metadata BGs. On the data side,
      we add a hook to btrfs_finish_ordered_io(). On the metadata side, we use
      end_extent_buffer_writeback().
      
      To reduce excess lookups of a block group, we mark the last extent
      buffer in a block group with the EXTENT_BUFFER_ZONE_FINISH flag. This
      cannot be done for data (ordered_extent), because the address may change
      due to REQ_OP_ZONE_APPEND.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 25 Aug 2021, 1 commit
  3. 23 Aug 2021, 26 commits
    • btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls · 4d4340c9
      Committed by Christian Brauner
      Creating subvolumes and snapshots is one of the core features of btrfs
      and is even available to unprivileged users. Make it possible to use
      subvolume and snapshot creation on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped permission inode op · 3bc71ba0
      Committed by Christian Brauner
      Enable btrfs_permission() to handle idmapped mounts. This is just a
      matter of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped setattr inode op · d4d09464
      Committed by Christian Brauner
      Enable btrfs_setattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped tmpfile inode op · 98b6ab5f
      Committed by Christian Brauner
      Enable btrfs_tmpfile() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped symlink inode op · 5a052108
      Committed by Christian Brauner
      Enable btrfs_symlink() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped mkdir inode op · b0b3e44d
      Committed by Christian Brauner
      Enable btrfs_mkdir() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped create inode op · e93ca491
      Committed by Christian Brauner
      Enable btrfs_create() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped mknod inode op · 72105277
      Committed by Christian Brauner
      Enable btrfs_mknod() to handle idmapped mounts. This is just a matter of
      passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped getattr inode op · c020d2ea
      Committed by Christian Brauner
      Enable btrfs_getattr() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow idmapped rename inode op · ca07274c
      Committed by Christian Brauner
      Enable btrfs_rename() to handle idmapped mounts. This is just a matter
      of passing down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle idmaps in btrfs_new_inode() · b3b6f5b9
      Committed by Christian Brauner
      Extend btrfs_new_inode() to take the idmapped mount into account when
      initializing a new inode. This is just a matter of passing down the
      mount's userns. The rest is taken care of in inode_init_owner(). This is
      a preliminary patch to make the individual btrfs inode operations
      idmapped mount aware.
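      As a rough user-space model of what passing down the mount's userns
      buys (this illustrates idmapping itself; it is not the kernel's
      inode_init_owner()): the owner recorded for the new inode is the
      caller's uid translated through the mount's idmapping.

        #include <stdio.h>

        /* Toy idmapping: one contiguous range, like a single uid_map line. */
        struct idmap {
                unsigned int ns_first;          /* first uid inside the namespace */
                unsigned int host_first;        /* first uid on the host */
                unsigned int count;
        };

        /* Map a host uid into the mount's namespace; -1 if unmapped. */
        static long map_uid_to_ns(const struct idmap *m, unsigned int host_uid)
        {
                if (host_uid < m->host_first || host_uid >= m->host_first + m->count)
                        return -1;
                return m->ns_first + (host_uid - m->host_first);
        }

        int main(void)
        {
                /* "0 100000 65536": ns uid 0 corresponds to host uid 100000. */
                struct idmap mount_map = { .ns_first = 0, .host_first = 100000,
                                           .count = 65536 };

                /* A caller with host uid 100000 creating a file through the
                 * idmapped mount ends up owning it as ns uid 0. */
                printf("owner = %ld\n", map_uid_to_ns(&mount_map, 100000));
                return 0;
        }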
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: add asserts on splitting extent_map · 63fb5879
      Committed by Naohiro Aota
      We call split_zoned_em() on an extent_map when submitting a bio for it.
      Thus, we can assume the extent_map is PINNED, not LOGGING, and in the
      modified list. Add ASSERT()s to ensure the extent_maps after the split
      also have the proper flags set and are in the modified list.
      Suggested-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary NULL check for the new inode during rename exchange · 1c167b87
      Committed by Filipe Manana
      At the very end of btrfs_rename_exchange(), in case an error happened,
      we check if 'new_inode' is NULL, but that is not needed since during a
      rename exchange, unlike regular renames, 'new_inode' can never be NULL,
      and if it were, we would have crashed much earlier, as we dereference it
      multiple times.
      
      So remove the check because it is not necessary and because it is causing
      static checkers to emit a warning. I probably introduced the check by
      copy-pasting similar code from btrfs_rename(), where 'new_inode' can be
      NULL, in commit 86e8aa0e ("Btrfs: unpin logs if rename exchange
      operation fails").
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: verity metadata orphan items · 70524253
      Committed by Boris Burkov
      Writing out the verity data is too large of an operation to do in a
      single transaction. If we are interrupted before we finish creating
      fsverity metadata for a file, or fail to clean up already created
      metadata after a failure, we could leak the verity items that we already
      committed.
      
      To address this issue, we use the orphan mechanism. When we start
      enabling verity on a file, we also add an orphan item for that inode.
      When we are finished, we delete the orphan. However, if we are
      interrupted midway, the orphan will be present at mount time and we can
      clean up the half-formed verity state.
      
      There is a possible race with a normal unlink operation: if unlink and
      verity run on the same file in parallel, it is possible for verity to
      succeed and delete the still legitimate orphan added by unlink. Then, if
      we are interrupted and mount in that state, we will never clean up the
      inode properly. This is also possible for a file created with O_TMPFILE.
      Check nlink==0 before deleting to avoid this race.
      
      A final thing to note is that this is a resurrection of using orphans to
      signal an operation besides "delete this inode". The old case was to
      signal the need to do a truncate. That case still technically applies
      for mounting very old file systems, so we need to take some care to not
      clobber it. To that end, we just have to be careful that verity orphan
      cleanup is a no-op for non-verity files.
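      A toy model of the orphan lifecycle and the nlink check described
      above (illustrative, not the btrfs implementation):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_inode { int nlink; bool has_orphan; };

        static void begin_enable_verity(struct toy_inode *inode)
        {
                inode->has_orphan = true;  /* crash marker: cleaned up at mount */
        }

        static void finish_enable_verity(struct toy_inode *inode)
        {
                /* Guard against racing with unlink/O_TMPFILE: if nlink dropped
                 * to zero, the orphan now belongs to the delete path, keep it. */
                if (inode->nlink > 0)
                        inode->has_orphan = false;
        }

        int main(void)
        {
                struct toy_inode a = { .nlink = 1 }, b = { .nlink = 0 };

                begin_enable_verity(&a); finish_enable_verity(&a);
                begin_enable_verity(&b); finish_enable_verity(&b);
                printf("a orphan: %d, b orphan: %d\n", a.has_orphan, b.has_orphan);
                /* a orphan: 0, b orphan: 1 */
                return 0;
        }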
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initial fsverity support · 14605409
      Committed by Boris Burkov
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verity pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then rollback to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add ro compat flags to inodes · 77eea05e
      Committed by Boris Burkov
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
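      The promotion problem is easy to reproduce in isolation; a minimal
      standalone C program demonstrating the mask behavior described above:

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* Flags defined with a plain literal 1 are signed ints, and
                 * one of them is (1 << 31), so the OR of them is negative. */
                int mask = (1 << 31) | 0xabc;   /* 0x80000abc as a signed int */

                /* The promotion to u64 sign-extends... */
                uint64_t mask64 = mask;
                printf("mask64  = 0x%016llx\n", (unsigned long long)mask64);
                /* mask64  = 0xffffffff80000abc */

                /* ...so negating it zeroes every upper bit: unknown high
                 * flags can no longer be detected by flags & ~mask. */
                printf("~mask64 = 0x%016llx\n", (unsigned long long)~mask64);
                /* ~mask64 = 0x000000007ffff543 */
                return 0;
        }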
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum itself getting corrupted in just the
      right way).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not add leading 1s
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the dead comment in writepage_delalloc() · 7361b4ae
      Committed by Qu Wenruo
      When btrfs_run_delalloc_range() failed, we will error out.
      
      But there was a strange comment mentioning that
      btrfs_run_delalloc_range() could return a value > 0 to indicate that
      the IO had already started.
      
      Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack
      usage") introduced the comment, but unfortunately at that time we were
      already using @page_started to indicate that case, and still returned 0.
      
      Furthermore, even if that comment were right (which it is not), we would
      return -EIO if the IO had already started.
      
      By all means the comment is incorrect, just remove the comment along
      with the dead check.
      
      Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
      make sure we either return 0 or error, no positive return value.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not pin logs too early during renames · bd54f381
      Committed by Filipe Manana
      During renames we pin the logs of the roots a bit too early, before the
      calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
      since those will not change anything in a log tree.
      
      In a scenario where we have multiple and diverse filesystem operations
      running in parallel, those calls can take a significant amount of time,
      due to lock contention on extent buffers, and delay log commits from other
      tasks for longer than necessary.
      
      So just pin logs after calls to btrfs_insert_inode_ref() and right before
      the first operation that can update a log tree.
      
      The following script that uses dbench was used for testing:
      
        $ cat dbench-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 16
      
        umount $MNT
      
      The tests were run on a machine with 12 cores, 64G of RAM, an NVMe
      device, and a non-debug kernel config (Debian's default config).
      
      The results compare a branch without this patch and without the previous
      patch in the series, that has the subject:
      
       "btrfs: eliminate some false positives when checking if inode was logged"
      
      Versus the same branch with these two patches applied.
      
      dbench with 8 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4391359     0.009   249.745
       Close        3225882     0.001     3.243
       Rename        185953     0.065   240.643
       Unlink        886669     0.049   249.906
       Deltree          112     2.455   217.433
       Mkdir             56     0.002     0.004
       Qpathinfo    3980281     0.004     3.109
       Qfileinfo     697579     0.001     0.187
       Qfsinfo       729780     0.002     2.424
       Sfileinfo     357764     0.004     1.415
       Find         1538861     0.016     4.863
       WriteX       2189666     0.010     3.327
       ReadX        6883443     0.002     0.729
       LockX          14298     0.002     0.073
       UnlockX        14298     0.001     0.042
       Flush         307777     2.447   303.663
      
      Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms
      
      dbench with 8 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4269920     0.009   213.532
       Close        3136653     0.001     0.690
       Rename        180805     0.082   213.858
       Unlink        862189     0.050   172.893
       Deltree          112     2.998   218.328
       Mkdir             56     0.002     0.003
       Qpathinfo    3870158     0.004     5.072
       Qfileinfo     678375     0.001     0.194
       Qfsinfo       709604     0.002     0.485
       Sfileinfo     347850     0.004     1.304
       Find         1496310     0.017     5.504
       WriteX       2129613     0.010     2.882
       ReadX        6693066     0.002     1.517
       LockX          13902     0.002     0.075
       UnlockX        13902     0.001     0.055
       Flush         299276     2.511   220.189
      
      Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms
      
      +3.2% throughput, -31.8% max latency
      
      dbench with 16 clients, results before:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5978334     0.028   156.507
       Close        4391598     0.001     1.345
       Rename        253136     0.241   155.057
       Unlink       1207220     0.182   257.344
       Deltree          160     6.123    36.277
       Mkdir             80     0.003     0.005
       Qpathinfo    5418817     0.012     6.867
       Qfileinfo     949929     0.001     0.941
       Qfsinfo       993560     0.002     1.386
       Sfileinfo     486904     0.004     2.829
       Find         2095088     0.059     8.164
       WriteX       2982319     0.017     9.029
       ReadX        9371484     0.002     4.052
       LockX          19470     0.002     0.461
       UnlockX        19470     0.001     0.990
       Flush         418936     2.740   347.902
      
      Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms
      
      dbench with 16 clients, results after:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5711833     0.029   131.240
       Close        4195897     0.001     1.732
       Rename        241849     0.204   147.831
       Unlink       1153341     0.184   231.322
       Deltree          160     6.086    30.198
       Mkdir             80     0.003     0.021
       Qpathinfo    5177011     0.012     7.150
       Qfileinfo     907768     0.001     0.793
       Qfsinfo       949205     0.002     1.431
       Sfileinfo     465317     0.004     2.454
       Find         2001541     0.058     7.819
       WriteX       2850661     0.017     9.110
       ReadX        8952289     0.002     3.991
       LockX          18596     0.002     0.655
       UnlockX        18596     0.001     0.179
       Flush         400342     2.879   293.607
      
      Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms
      
      +4.6% throughput, -16.9% max latency
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop unnecessary ASSERT from btrfs_submit_direct() · 42b5d73b
      Committed by Naohiro Aota
      When on a SINGLE block group, btrfs_get_io_geometry() will return "the
      size of the block group - the offset of the logical address within the
      block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
      filesystems, we can have an up to 8 GiB block group, and so an up to
      8 GiB geom.len as well. With this setup, we easily hit
      "ASSERT(geom.len <= INT_MAX);".
      
      The ASSERT appears to guard btrfs_bio_clone_partial() and bio_trim(),
      which both take "int" (now u64 due to the previous patch). So to be
      precise, the ASSERT should check if clone_len <= UINT_MAX. But
      clone_len is already capped by bio.bi_iter.bi_size, which is an
      unsigned int, so the ASSERT is not necessary.
      
      Drop the ASSERT and compare submit_len and geom.len as u64, letting the
      implicit casting do the conversion.
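      A minimal illustration of why the assertion trips on zoned setups and
      why comparing in u64 is safe (the lengths here are made up):

        #include <limits.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* Up to 8 GiB zone size means geom.len can be up to 8 GiB. */
                uint64_t geom_len = 8ULL << 30;
                uint64_t submit_len = 1ULL << 20;

                /* The old check, trivially violated by large zoned block
                 * groups even though nothing is actually wrong. */
                printf("geom_len <= INT_MAX? %d\n",
                       geom_len <= (uint64_t)INT_MAX);         /* 0 */

                /* Compare in u64 instead: no truncation, no assert needed. */
                uint64_t clone_len = submit_len < geom_len ? submit_len : geom_len;
                printf("clone_len = %llu\n", (unsigned long long)clone_len);
                return 0;
        }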
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking · b3776305
      Committed by Josef Bacik
      sync_inode() has some holes that can cause problems if we're under heavy
      ENOSPC pressure.  If there's writeback running on a separate thread
      sync_inode() will skip writing the inode altogether.  What we really
      want is to make sure writeback has been started on all the pages to make
      sure we can see the ordered extents and wait on them if appropriate.
      Switch to this new helper which will allow us to accomplish this and
      avoid ENOSPC'ing early.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: wait on async extents when flushing delalloc · e1646070
      Committed by Josef Bacik
      I've been debugging an early ENOSPC problem in production and finally
      root caused it to this problem.  When we switched to per-inode flushing
      in 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") I pulled out the async extent handling, because we
      were doing the correct thing by calling filemap_flush() if we had async
      extents set.  This would properly wait on any async extents by locking
      the page in the second flush, thus making sure our ordered extents were
      properly set up.
      
      However when I switched us back to page based flushing, I used
      sync_inode(), which allows us to pass in our own wbc.  The problem here
      is that sync_inode() is smarter than the filemap_* helpers; it tries to
      avoid calling writepages at all.  This means that our second call could
      skip calling do_writepages altogether, and thus not wait on the pagelock
      for the async helpers.  This means we could come back before any ordered
      extents were created and then simply continue on in our flushing
      mechanisms and ENOSPC out when we have plenty of space to use.
      
      Fix this by putting back the async pages logic in shrink_delalloc.  This
      allows us to bulk write out everything that we need to, and then we can
      wait in one place for the async helpers to catch up, and then wait on
      any ordered extents that are created.
      
      Fixes: e076ab2a ("btrfs: shrink delalloc pages instead of full inodes")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: wake up async_delalloc_pages waiters after submit · ac98141d
      Committed by Josef Bacik
      We use the async_delalloc_pages mechanism to make sure that we've
      completed our async work before trying to continue our delalloc
      flushing.  The reason for this is we need to see any ordered extents
      that were created by our delalloc flushing.  However we're waking up
      before we do the submit work, which is before we create the ordered
      extents.  This is a pretty wide race window where we could potentially
      think there are no ordered extents and thus exit shrink_delalloc
      prematurely.  Fix this by waking us up after we've done the work to
      create ordered extents.
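      A user-space model of the ordering fix, with pthreads standing in for
      the kernel's atomic counter and wait queue (illustrative, not the
      btrfs code; build with -lpthread):

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        static int async_pages = 1;     /* pending async work items */
        static int ordered_extents;     /* created by the submit step */

        static void *async_worker(void *arg)
        {
                (void)arg;
                pthread_mutex_lock(&lock);
                /* Buggy order would be: decrement and wake *before* the
                 * submit work, letting the waiter see async_pages == 0
                 * while no ordered extents exist yet. Fixed order: */
                ordered_extents++;      /* do the submit work first... */
                async_pages--;          /* ...then let waiters proceed */
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                pthread_create(&t, NULL, async_worker, NULL);
                pthread_mutex_lock(&lock);
                while (async_pages > 0)
                        pthread_cond_wait(&cond, &lock);
                /* With the fixed ordering this is guaranteed to print 1. */
                printf("ordered extents visible: %d\n", ordered_extents);
                pthread_mutex_unlock(&lock);
                pthread_join(t, NULL);
                return 0;
        }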
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Committed by Qu Wenruo
      Now that we support data and metadata read-write for subpage, remove
      the RO requirement for subpage mounts.
      
      There are some extra limitations though:
      
      - For now, subpage RW mounts are still considered experimental
        Thus the mount warning will still be there.
      
      - No compression support
        There is still quite a lot of hardcoded PAGE_SIZE usage, and quite a
        few call sites use extent_clear_unlock_delalloc() to unlock
        locked_page, which will screw up the subpage helpers.
      
        Now for subpage RW mounts, no matter what mount option or inode attr
        is set, no writes will be compressed, although reading compressed
        data works without problems.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly due to the fact that filemap_fdatawrite_range() will
        trigger writeback of more than the specified range.
        In fallocate calls, this behavior can make us write back data that
        could be inlined before we enlarge the i_size.
      
        This is a very special corner case, and even the current btrfs check
        won't report an error on such an inline + regular extent combination.
        But considering how much effort has been put into preventing such
        inline + regular mixes, I'd prefer to cut off inline extents
        completely until we have a good solution.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix false alert when relocating partial preallocated data extents · e3c62324
      Committed by Qu Wenruo
      [BUG]
      When relocating partial preallocated data extents (part of the
      preallocated extent is written) on subpage, this can cause the
      following false alert and make the relocation fail:
      
        BTRFS info (device dm-3): balance: start -d
        BTRFS info (device dm-3): relocating block group 13631488 flags data
        BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
        BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
        BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
        BTRFS info (device dm-3): balance: ended with status: -5
      
      The minimal script to reproduce looks like this:
      
        mkfs.btrfs -f -s 4k $dev
        mount $dev -o nospace_cache $mnt
        xfs_io -f -c "falloc 0 8k" $mnt/file
        xfs_io -f -c "pwrite 0 4k" $mnt/file
        btrfs balance start -d $mnt
      
      [CAUSE]
      For the data reloc inode, btrfs_verify_data_csum() checks if the full
      range has the EXTENT_NODATASUM bit; if *all* bytes of the range have
      the EXTENT_NODATASUM bit, it skips the range.
      This works pretty well for regular sectorsize, as in that case
      btrfs_verify_data_csum() is called for each sector, thus no problem at
      all.
      
      But for the subpage case, btrfs_verify_data_csum() is called on each
      bvec, which can contain several sectors, and since it checks *all*
      bytes for the EXTENT_NODATASUM bit, if some part of the range has a
      csum, we will continue checking all the sectors.
      
      The preallocated sectors don't have any csum, so the csum obviously
      won't match, causing the false alert.
      
      [FIX]
      Move the EXTENT_NODATASUM check into the main loop, so that we can
      check each sector for the EXTENT_NODATASUM bit in the subpage case.
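      A sketch of the per-sector check (the sector layout and the NODATASUM
      predicate are illustrative):

        #include <stdbool.h>
        #include <stdio.h>

        #define SECTORSIZE 4096UL

        /* Toy predicate: the second sector is preallocated and carries
         * EXTENT_NODATASUM, the first one has a checksum. */
        static bool sector_is_nodatasum(unsigned long off)
        {
                return off >= SECTORSIZE;
        }

        static void verify_bvec(unsigned long start, unsigned long len)
        {
                /* Check per sector, as in the fix: skip only the sectors
                 * that actually have EXTENT_NODATASUM instead of requiring
                 * the whole bvec range to have it. */
                for (unsigned long off = start; off < start + len;
                     off += SECTORSIZE) {
                        if (sector_is_nodatasum(off))
                                printf("sector %lu: skip csum\n", off);
                        else
                                printf("sector %lu: verify csum\n", off);
                }
        }

        int main(void)
        {
                verify_bvec(0, 2 * SECTORSIZE);
                return 0;
        }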
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix a potential use-after-free in writeback helper · 7c11d0ae
      Committed by Qu Wenruo
      [BUG]
      There is a possible use-after-free bug when running generic/095.
      
       BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
       Faulting instruction address: 0xc000000000283654
       c000000000283078 do_raw_spin_unlock+0x88/0x230
       c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
       c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
       c0000000009e0458 end_bio_extent_writepage+0x158/0x270
       c000000000b6fd14 bio_endio+0x254/0x270
       c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
       c000000000b6fd14 bio_endio+0x254/0x270
       c000000000b781fc blk_update_request+0x46c/0x670
       c000000000b8b394 blk_mq_end_request+0x34/0x1d0
       c000000000d82d1c lo_complete_rq+0x11c/0x140
       c000000000b880a4 blk_complete_reqs+0x84/0xb0
       c0000000012b2ca4 __do_softirq+0x334/0x680
       c0000000001dd878 irq_exit+0x148/0x1d0
       c000000000016f4c do_IRQ+0x20c/0x240
       c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0
      
      [CAUSE]
      There is a very small race window in generic/095, like the following.
      
      	Thread 1		|		Thread 2
      --------------------------------+------------------------------------
        end_bio_extent_writepage()	| btrfs_releasepage()
        |- spin_lock_irqsave()	| |
        |- end_page_writeback()	| |
        |				| |- if (PageWriteback() ||...)
        |				| |- clear_page_extent_mapped()
        |				|    |- kfree(subpage);
        |- spin_unlock_irqrestore().
      
      The race can also happen between writeback and btrfs_invalidatepage(),
      although that would be much harder to hit, as btrfs_invalidatepage()
      has much more work to do before the clear_page_extent_mapped() call.
      
      [FIX]
      Here we "wait" for the subapge spinlock to be released before we detach
      subpage structure.
      So this patch will introduce a new function, wait_subpage_spinlock(), to
      do the "wait" by acquiring the spinlock and release it.
      
      Since the caller has ensured the page is not dirty nor writeback, and
      page is already locked, the only way to hold the subpage spinlock is
      from endio function.
      Thus we only need to acquire the spinlock to wait for any existing
      holder.
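      The "wait" itself is just a lock/unlock pair; a minimal user-space
      sketch of the idea using a POSIX spinlock (build with -lpthread):

        #include <pthread.h>

        /* Taking and immediately dropping the lock guarantees that any
         * endio currently holding it has finished before we proceed to
         * detach and free the subpage structure. */
        static void wait_subpage_spinlock(pthread_spinlock_t *lock)
        {
                pthread_spin_lock(lock);
                pthread_spin_unlock(lock);
        }

        int main(void)
        {
                pthread_spinlock_t lock;

                pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
                wait_subpage_spinlock(&lock);   /* safe to free the subpage now */
                pthread_spin_destroy(&lock);
                return 0;
        }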
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: disable inline extent creation · 7367253a
      Committed by Qu Wenruo
      [BUG]
      When running the following fsx command (extracted from generic/127) on
      a subpage filesystem, it can create an inline extent alongside regular
      extents:
      
        fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx
      
      The offending extent would look like:
      
        item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
          index 2 namelen 4 name: file
        item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
          generation 7 type 0 (inline)
          inline extent data size 707 ram_bytes 707 compression 0 (none)
        item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
          generation 7 type 2 (prealloc)
          prealloc data disk byte 102346752 nr 4096
          prealloc data offset 0 nr 4096
      
      [CAUSE]
      For subpage filesystems, writeback is triggered in page units, which
      means that even if we just want to write back the range [16K, 20K) on a
      64K page system, we will still try to write back any dirty sector in
      the range [0, 64K).
      
      This is never a problem if sectorsize == PAGE_SIZE, but for subpage it
      can cause unexpected problems.
      
      For the above test case, the last few operations from fsx are:
      
       9055 trunc      from 0x40000 to 0x2c3
       9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)
      
      In operation 9055, we dirtied sector [0, 4096); then in the falloc, we
      call btrfs_wait_ordered_range(inode, start=4096, len=4096), only
      expecting to write back any dirty data in [4096, 8192), but nothing
      else.
      
      Unfortunately, in the subpage case, the above btrfs_wait_ordered_range()
      will trigger writeback of the range [0, 64K), which includes the data
      at [0, 4096).
      
      And since at the call site we haven't yet increased i_size, which is
      still 707, cow_file_range() can insert an inline extent.
      
      The result is the above inline + regular extent combination.
      
      [WORKAROUND]
      I don't really have any good short-term solution yet, as this means all
      operations that can trigger writeback need to be reviewed for any
      i_size change.
      
      So here I choose to disable inline extent creation for the subpage case
      as a workaround.  We have done tons of work just to avoid such extents,
      so I don't want to create an exception just for subpage.
      
      This only affects inline extent creation; subpage has no problem
      reading existing inline extents.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>