1. 25 Jul, 2022 40 commits
    • C
      btrfs: raid56: use fixed stripe length everywhere · ff18a4af
      Christoph Hellwig committed
      The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN, but some
      functions still pass it as an argument, which is not necessary. The fixed
      value has been used for a long time, and though the stripe length should
      be configurable by the super block member stripesize, this hasn't been
      implemented and would require more changes, so we don't need to keep this
      code around until then.
      
      Partially based on a patch from Qu Wenruo.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: remove the inode cache check at btrfs_is_free_space_inode() · 0201fceb
      Filipe Manana committed
      The inode cache feature was removed in kernel 5.11, and we no longer have
      any code that reads from or writes to inode caches. We may still mount a
      filesystem that has inode caches, but they are ignored.
      
      Remove the check for an inode cache from btrfs_is_free_space_inode(),
      since we no longer have code to trigger reads from an inode cache or
      writes to an inode cache. The check at send.c is still needed, because
      in case we find a filesystem with an inode cache, we must ignore it.
      Also leave the checks at tree-checker.c, as they are sanity checks.
      
      This eliminates a dead branch and reduces the amount of code since it's
      in an inline function.
      
      Before:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620662	 189240	  29032	1838934	 1c0f56	fs/btrfs/btrfs.ko
      
      After:
      
      $ size fs/btrfs/btrfs.ko
         text	   data	    bss	    dec	    hex	filename
      1620502	 189240	  29032	1838774	 1c0eb6	fs/btrfs/btrfs.ko
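The saving can be read off the two `size` rows directly. A minimal sketch that parses the rows quoted above and computes the per-column difference; the helper name `parse_size_row` is made up for this illustration:

```python
def parse_size_row(row: str) -> dict:
    """Parse one data row of `size` output: text, data, bss, dec, hex, filename."""
    text, data, bss, dec, hex_total, filename = row.split()
    return {
        "text": int(text),
        "data": int(data),
        "bss": int(bss),
        "dec": int(dec),
        "hex": int(hex_total, 16),  # same total as "dec", printed in hex
        "filename": filename,
    }

# Rows copied from the Before/After output in the commit message.
before = parse_size_row("1620662\t 189240\t  29032\t1838934\t 1c0f56\tfs/btrfs/btrfs.ko")
after = parse_size_row("1620502\t 189240\t  29032\t1838774\t 1c0eb6\tfs/btrfs/btrfs.ko")

saved = {key: before[key] - after[key] for key in ("text", "data", "bss", "dec")}
print(saved)
```

Only the text segment shrinks (by 160 bytes); data and bss are unchanged.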
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: sysfs: remove BIG_METADATA feature files · 74860816
      Nikolay Borisov committed
      This flag has been merged in 3.10 and is effectively always-on. Its
      status depends on the host page size so there's another way to guarantee
      compatibility with old kernels.
      
      Due to a bug introduced in 6f93e834 ("btrfs: fix upper limit for
      max_inline for page size 64K") the flag is not persisted among features
      in the superblock so it's not reliable.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: sysfs: remove MIXED_BACKREF feature file · 0766837b
      Nikolay Borisov committed
      This feature has been the default for about 13 years. At this point it's
      safe to consider it an indispensable feature of BTRFS, so there's
      no need to advertise it in sysfs. Remove the global sysfs feature file;
      the per-filesystem feature file has never been there.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: don't print 'has skinny extents' anymore on mount · 49f468c9
      Nikolay Borisov committed
      Skinny extents have been a default mkfs feature since version 3.18
      (introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs:
      make skinny-metadata default")). The message really doesn't bring any
      value to users, so simply remove it.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: don't print 'flagging with big metadata' anymore on mount · 6b769dac
      Nikolay Borisov committed
      Added in commit 727011e0 ("Btrfs: allow metadata blocks larger than
      the page size") in 2010 and it's been default for mkfs since 3.12
      (2013).  The message doesn't really convey any useful information to
      users. Remove it.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: clean up chained assignments · c1867eb3
      David Sterba committed
      The chained assignments may be convenient to write, but make readability
      a bit worse as it's too easy to overlook that there are several values
      set on the same line while this is rather an exception.  Making it
      consistent everywhere avoids surprises.
      
      The pattern where inode times are initialized reuses the first value and
      the order is mtime, ctime. In other blocks the assignments are expanded
      so the order of variables is similar to the neighboring code.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: merge calculations for simple striped profiles in btrfs_rmap_block · ac067734
      David Sterba committed
      Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1)
      and RAID10 (map->sub_stripes is 2), with equivalent results.
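Why one expression can serve both profiles: dividing by sub_stripes is a no-op when it is 1. The sketch below assumes the shared expression has the form "multiply by the number of stripes, add the stripe index, divide by sub_stripes". That form and all function names are assumptions made for illustration, not the kernel code:

```python
def stripe_nr_merged(stripe_nr: int, num_stripes: int, index: int, sub_stripes: int) -> int:
    """Assumed merged form: one expression for both RAID0 and RAID10."""
    return (stripe_nr * num_stripes + index) // sub_stripes

def stripe_nr_raid0_only(stripe_nr: int, num_stripes: int, index: int) -> int:
    """Assumed RAID0-only form, without the division."""
    return stripe_nr * num_stripes + index

# RAID0 has sub_stripes == 1, so the division changes nothing.
for stripe_nr in range(8):
    for index in range(4):
        merged = stripe_nr_merged(stripe_nr, num_stripes=4, index=index, sub_stripes=1)
        assert merged == stripe_nr_raid0_only(stripe_nr, num_stripes=4, index=index)

# RAID10 has sub_stripes == 2, so pairs of mirrored stripes map to the same number.
print([stripe_nr_merged(0, 4, i, 2) for i in range(4)])
```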
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: use mask for all RAID1* profiles in btrfs_calc_avail_data_space · d09cb9e1
      David Sterba committed
      There's a sequence of hard coded values for RAID1 profiles that are
      already stored in the raid_attr table that should be used instead.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA · e26b04c4
      Nikolay Borisov committed
      Commit 6f93e834 seemingly inadvertently moved the code responsible
      for flagging the filesystem as having BIG_METADATA to a place where
      setting the flag was essentially lost. This means that
      filesystems created with kernels containing this bug (starting with 5.15)
      can potentially be mounted by older (pre-3.4) kernels. In reality, the
      chances of this happening are low because other incompat flags have been
      introduced in the meantime. Still, the correct behavior is to set the
      INCOMPAT_BIG_METADATA flag and persist this in the superblock.
      
      Fixes: 6f93e834 ("btrfs: fix upper limit for max_inline for page size 64K")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: print checksum type and implementation at mount time · c8a5f8ca
      David Sterba committed
      Per user request, print the checksum type and implementation at mount
      time among the messages. The checksum is user configurable and the
      actual crypto implementation is useful to see for performance reasons.
      The same information is also available after mount in the
      /sys/fs/btrfs/FSID/checksum file.
      
      Example:
      
        [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm
      
      Link: https://github.com/kdave/btrfs-progs/issues/483
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: reset block group chunk force if we have to wait · 1314ca78
      Josef Bacik committed
      If you try to force a chunk allocation, but you race with another chunk
      allocation, you will end up waiting on the chunk allocation that just
      occurred and then allocate another chunk.  If you have many threads all
      doing this at once you can way over-allocate chunks.
      
      Fix this by resetting force to NO_FORCE, that way if we think we need to
      allocate we can, otherwise we don't force another chunk allocation if
      one is already happening.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: add new command FILEATTR for file attributes · 48247359
      David Sterba committed
      There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS
      and later from XFLAGS interfaces, now commonly found under the
      'fileattr' API. This corresponds to the individual inode bits and that's
      part of the on-disk format, so this is suitable for the protocol. The
      other interfaces contain a lot of cruft or bits that btrfs does not
      support yet.
      
      Currently the value is u64 and matches btrfs_inode_item. Not all the
      bits can be set by ioctls (like NODATASUM or READONLY), but we can send
      them over the protocol and leave it up to the receiving side what and
      how to apply.
      
      As some of the flags, eg. IMMUTABLE, can prevent any further changes,
      the receiving side needs to understand that and apply the changes in the
      right order, or possibly with some intermediate steps. This should be
      easier, future proof and simpler on the protocol layer than implementing
      in kernel.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: add OTIME as utimes attribute for proto 2+ by default · 22a5b2ab
      David Sterba committed
      When send v1 was introduced the otime (inode creation time) was not
      available, however the attribute in btrfs send protocol exists. Though
      it would be possible to add it for v1 too as the attribute would be
      ignored by v1 receive, let's not change the layout of v1 and only add
      that to v2+.  The otime cannot be changed and is only informative.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: output mirror number for bad metadata · 8f0ed7d4
      Qu Wenruo committed
      When handling a real world transid mismatch image, it's hard to know
      which copy is corrupted, as the error messages just look like this:
      
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
      
      We don't even know if the retry is caused by btrfs or the VFS retry.
      
      To make things a little easier to read, add mirror number for all
      related tree block read errors.
      
      So the above messages would look like this:
      
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
        BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0
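With the mirror number in the message, log lines can be grouped per copy. A minimal sketch of a parser for the new format, assuming the message layout is exactly as in the example lines above; the regular expression and the function name are illustrative only:

```python
import re

# Matches the new-style warning shown above; field order is taken from the example.
CSUM_WARNING = re.compile(
    r"checksum verify failed on logical (?P<logical>\d+) mirror (?P<mirror>\d+) "
    r"wanted (?P<wanted>0x[0-9a-f]+) found (?P<found>0x[0-9a-f]+) level (?P<level>\d+)"
)

def parse_csum_warning(line: str):
    """Return the fields of a checksum warning as a dict, or None if no match."""
    match = CSUM_WARNING.search(line)
    if match is None:
        return None
    return {
        "logical": int(match["logical"]),
        "mirror": int(match["mirror"]),
        "wanted": int(match["wanted"], 16),
        "found": int(match["found"], 16),
        "level": int(match["level"]),
    }

line = ("BTRFS warning (device dm-3): checksum verify failed on logical 30408704 "
        "mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0")
print(parse_csum_warning(line))
```

Old-style messages without "logical" and "mirror" do not match and yield None.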
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update messages, add "logical" ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: replace unnecessary goto with direct return at cow_file_range() · aaafa1eb
      Naohiro Aota committed
      The 'goto out' statements in the exit block of cow_file_range() are not
      necessary and jump backwards. Replace them with return, while still
      keeping 'goto out' in the main code.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ keep goto in the main code, update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: fix error handling of fallback uncompress write · 71aa147b
      Naohiro Aota committed
      When cow_file_range() fails in the middle of the allocation loop, it
      unlocks the pages but leaves the ordered extents intact. Thus, we need
      to call btrfs_cleanup_ordered_extents() to finish the created ordered
      extents.
      
      Also, we need to call end_extent_writepage() if locked_page is available
      because btrfs_cleanup_ordered_extents() never processes the region on
      the locked_page.
      
      Furthermore, we need to set the mapping as error if locked_page is
      unavailable before unlocking the pages, so that the errno is properly
      propagated to the user space.
      
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page · 99826e4c
      Naohiro Aota committed
      btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
      is not usable for submit_uncompressed_range() which can have NULL
      locked_page.
      
      Add support for the locked_page == NULL case. Also, rewrite the
      redundant "page_offset(locked_page)".
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: ensure pages are unlocked on cow_file_range() failure · 9ce7466f
      Naohiro Aota committed
      There is a hung_task report on zoned btrfs like below.
      
      https://github.com/naota/linux/issues/59
      
        [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
        [726.329839]       Not tainted 5.16.0-rc1+ #1
        [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
        [726.331608] Call Trace:
        [726.331611]  <TASK>
        [726.331614]  __schedule+0x2e5/0x9d0
        [726.331622]  schedule+0x58/0xd0
        [726.331626]  io_schedule+0x3f/0x70
        [726.331629]  __folio_lock+0x125/0x200
        [726.331634]  ? find_get_entries+0x1bc/0x240
        [726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
        [726.331642]  truncate_inode_pages_range+0x5b2/0x770
        [726.331649]  truncate_inode_pages_final+0x44/0x50
        [726.331653]  btrfs_evict_inode+0x67/0x480
        [726.331658]  evict+0xd0/0x180
        [726.331661]  iput+0x13f/0x200
        [726.331664]  do_unlinkat+0x1c0/0x2b0
        [726.331668]  __x64_sys_unlink+0x23/0x30
        [726.331670]  do_syscall_64+0x3b/0xc0
        [726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [726.331677] RIP: 0033:0x7fb9490a171b
        [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
        [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
        [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
        [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
        [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
        [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
        [726.331693]  </TASK>
      
      While debugging the issue, we found that running fstests generic/551 on
      a 5GB non-zoned null_blk device in the emulated zoned mode also hit a
      similar hang.
      
      Also, we can reproduce the same symptom with an error injected
      cow_file_range() setup.
      
      The hang occurs when cow_file_range() fails in the middle of
      allocation. cow_file_range() called from do_allocation_zoned() can
      split the given region ([start, end]) for allocation depending on
      current block group usages. When btrfs can allocate bytes for one part
      of the split regions but fails for the other region (e.g. because of
      -ENOSPC), we return the error leaving the pages in the succeeded regions
      locked. Technically, this occurs only when @unlock == 0. Otherwise, we
      unlock the pages in an allocated region after creating an ordered
      extent.
      
      Considering the callers of cow_file_range(unlock=0) won't write out
      the pages, we can unlock the pages on error exit from
      cow_file_range(). So, we can ensure all the pages except @locked_page
      are unlocked on error case.
      
      In summary, cow_file_range now behaves like this:
      
      - page_started == 1 (return value)
        - All the pages are unlocked. IO is started.
      - unlock == 1
        - All the pages except @locked_page are unlocked in any case
      - unlock == 0
        - On success, all the pages are locked for writing them out
        - On failure, all the pages except @locked_page are unlocked
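The summary list can be restated as a small decision function. This is only a model of the list above, not kernel code. The function name is made up, and checking page_started before the unlock flag is an assumption of this sketch:

```python
def pages_after_cow_file_range(page_started: bool, unlock: bool, success: bool) -> str:
    """Describe the page lock state after the call, following the summary list.

    Assumption of this sketch: page_started is checked before the unlock flag.
    """
    if page_started:
        return "all unlocked, IO started"
    if unlock:
        # Same outcome on success and on failure.
        return "all unlocked except locked_page"
    if success:
        return "all locked for writeout"
    # unlock == 0 and failure: the case this patch changes.
    return "all unlocked except locked_page"

for unlock in (True, False):
    for success in (True, False):
        print(unlock, success, pages_after_cow_file_range(False, unlock, success))
```

The only combination that leaves pages locked is unlock == 0 with success, where the caller goes on to write the pages out.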
      
      Fixes: 42c01100 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • I
      btrfs: sysfs: export commit stats · 140a8ff7
      Ioannis Angelakopoulos committed
      Export commit stats in file
      
        /sys/fs/btrfs/UUID/commit_stats
      
      with example output like:
      
        commits 123
        last_commit_ms 11
        max_commit_ms 150
        total_commit_ms 2000
      
      The values are in one file so reading them at a single time will give a
      more consistent view. The stats are internally tracked in nanoseconds so
      the cumulative values should not suffer from rounding errors.
      
      Writing 0 to the file 'commit_stats' will reset max_commit_ms.
      Initial values are set at first mount of the filesystem.
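Because all four values live in one file, a single read yields a consistent snapshot that is easy to parse. A minimal sketch, assuming the "key value" per-line layout of the example output above; the function name is illustrative, and the text is hard-coded rather than read from sysfs:

```python
def parse_commit_stats(text: str) -> dict:
    """Parse 'key value' lines as shown in the example output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        stats[key] = int(value)
    return stats

# Example output copied from the commit message.
example = """commits 123
last_commit_ms 11
max_commit_ms 150
total_commit_ms 2000
"""

stats = parse_commit_stats(example)
average_ms = stats["total_commit_ms"] / stats["commits"]
print(stats, round(average_ms, 2))
```

On a live system the text would come from reading /sys/fs/btrfs/UUID/commit_stats, with UUID replaced by the filesystem's UUID.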
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • I
      btrfs: collect commit stats, count, duration · e55958c8
      Ioannis Angelakopoulos committed
      Track several stats about transaction commit, to be later exported via
      sysfs:
      
      - number of commits so far
      - duration of the last commit in ns
      - maximum commit duration seen so far in ns
      - total duration for all commits so far in ns
      
      The update of the commit stats occurs after the commit thread has gone
      through all the logic that checks if there is another thread committing
      at the same time. This means that we only account for actual commit work
      in the commit stats we report and not the time the thread spends waiting
      until it is ready to do the commit work.
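The four counters listed above can be modeled in a few lines. The sketch below is a userspace model with hypothetical class and field names, not the kernel data structure. It also shows why accumulating in nanoseconds and converting once loses less than summing values already rounded to milliseconds:

```python
from dataclasses import dataclass

NSEC_PER_MSEC = 1_000_000

@dataclass
class CommitStats:
    """Hypothetical model of the four tracked values, all in nanoseconds."""
    commit_count: int = 0
    last_commit_ns: int = 0
    max_commit_ns: int = 0
    total_commit_ns: int = 0

    def record(self, duration_ns: int) -> None:
        """Account one finished commit; waiting time is assumed to be excluded."""
        self.commit_count += 1
        self.last_commit_ns = duration_ns
        self.max_commit_ns = max(self.max_commit_ns, duration_ns)
        self.total_commit_ns += duration_ns

    def total_ms(self) -> int:
        """Convert the accumulated total once, at reporting time."""
        return self.total_commit_ns // NSEC_PER_MSEC

stats = CommitStats()
durations_ns = [1_600_000, 1_600_000, 1_600_000]  # three commits of 1.6 ms each
for duration in durations_ns:
    stats.record(duration)

# Summing per-commit values that were truncated to ms first loses 0.6 ms each time.
sum_of_rounded_ms = sum(d // NSEC_PER_MSEC for d in durations_ns)
print(stats.total_ms(), sum_of_rounded_ms)
```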
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • C
      btrfs: remove extent writepage address space operation · f3e90c1c
      Christoph Hellwig committed
      Same as in commit 21b4ee70 ("xfs: drop ->writepage completely"): we
      can remove the callback as it's only used in one place - single page
      writeback from memory reclaim and is not called for cgroup writeback at
      all.
      
      We only allow such writeback from kswapd, not from direct memory
      reclaim, and so it is rarely used. When it comes from kswapd, it is
      effectively random dirty page shoot-down, which is horrible for IO
      patterns. We can rely on background writeback to clean all dirty pages
      in an efficient way and not let it be interrupted by kswapd.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: use boolean types for current inode status · 9555e1f1
      David Sterba committed
      The new, new_gen and deleted indicate a status, use boolean type instead
      of int.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: remove old TODO regarding ERESTARTSYS · cec3dad9
      David Sterba committed
      The whole send operation is restartable and handling properly a buffer
      write may not be easy. We can't know what caused that and if a short
      delay and retry will fix it or how many retries should be performed in
      case it's a temporary condition.
      
      The error value is returned to the ioctl caller, so in case it's a
      transient problem, the user would be notified about the reason. Remove
      the TODO note as there's no plan to handle ERESTARTSYS.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: simplify includes · 8234d3f6
      David Sterba committed
      We don't need the whole ctree.h in send.h, none of the data types
      defined there are used.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: send: drop __KERNEL__ ifdef from send.h · e3b4b904
      David Sterba committed
      We don't need this ifdef as the header file is not shared, the protocol
      definition used by userspace should be from libbtrfs or libbtrfsutil.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • C
      btrfs: increase direct io read size limit to 256 sectors · ee5b46a3
      Christoph Hellwig committed
      Btrfs currently limits direct I/O reads to a single sector, which goes
      back to commit c329861d ("Btrfs: don't allocate a separate csums
      array for direct reads") from Josef.  That commit changes the direct I/O
      code to ".. use the private part of the io_tree for our csums.", but ten
      years later that isn't how checksums for direct reads work, instead they
      use a csums allocation on a per-btrfs_dio_private basis (which has its
      own performance problem for small I/O, but that will be addressed later).
      
      There is no fundamental limit in btrfs itself to limit the I/O size
      except for the size of the checksum array that scales linearly with
      the number of sectors in an I/O.  Pick a somewhat arbitrary limit of
      256 sectors, which matches what buffered reads typically see as
      the upper limit, as the limit for direct I/O as well.
      
      This significantly improves direct read performance.  For example a fio
      run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
      throughput:
      
      Baseline:
      
      READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
      
      With this patch:
      
      READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
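To put the numbers in proportion, the following sketch computes the checksum array size for a 256-sector read and the speedup from the two fio results. The 4 KiB sector size is an assumption (a common sectorsize, not stated in the message), and the helper name is illustrative:

```python
SECTOR_SIZE = 4096  # assumption: 4 KiB sectors

# Per-sector checksum sizes in bytes for the btrfs checksum algorithms.
CSUM_SIZES = {"crc32c": 4, "xxhash64": 8, "sha256": 32, "blake2b": 32}

def csum_array_bytes(io_bytes: int, csum_size: int, sector_size: int = SECTOR_SIZE) -> int:
    """Checksum bytes needed for one I/O: one checksum per (partial) sector."""
    sectors = -(-io_bytes // sector_size)  # ceiling division
    return sectors * csum_size

max_io_bytes = 256 * SECTOR_SIZE  # the new limit, 1 MiB under the 4 KiB assumption
for name, size in CSUM_SIZES.items():
    print(name, csum_array_bytes(max_io_bytes, size), "bytes of checksums")

# Throughput figures copied from the fio output in the commit message (MiB/s).
speedup = 196 / 65.3
print(round(speedup, 2))
```

Even with the largest checksum type, the array for a full-size read is only a few KiB, which is why the limit is described as somewhat arbitrary rather than fundamental.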
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() · f6065f8e
      Qu Wenruo committed
      [BUG]
      There is a small workload which will always fail with recent kernels:
      (A simplified version from btrfs/125 test case)
      
        mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
        sync
        umount $mnt
        btrfs dev scan -u $dev3
        mount -o degraded $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
        umount $mnt
        btrfs dev scan
        mount $dev1 $mnt
        btrfs balance start --full-balance $mnt
        umount $mnt
      
      The failure is always a failure to read some tree blocks:
      
        BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        ...
      
      [CAUSE]
      With the recently added debug output, we can see all RAID56 operations
      related to full stripe 38928384:
      
        56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
        56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
        56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
        56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
        56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
        56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
      
      Before we enter raid56_parity_recover(), we have triggered some metadata
      write for the full stripe 38928384, which leads us to read all the
      sectors from disk.
      
      Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
      avoid unnecessary read.
      
      This means, for that full stripe, after any partial write, we will have
      stale data, along with P/Q calculated using that stale data.
      
      Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
      which has data stripes" we haven't submitted all the corrupted P/Q to disk.
      
      When we really need to recover a certain range, i.e. in
      raid56_parity_recover(), we will use the cached rbio, along with its
      cached sectors (the full stripe is all cached).
      
      This explains why no raid56_scrub_read_recover() event is
      triggered.
      
      Since we have the cached P/Q which is calculated using the stale data,
      the recovered one will just be stale.
      
      In our particular test case, it will always return the same incorrect
      metadata, thus causing the same error message "parent transid verify
      failed on 39010304 wanted 9 found 7" again and again.
      
      [BTRFS DESTRUCTIVE RMW PROBLEM]
      
      Test case btrfs/125 (and above workload) always has its trouble with
      the destructive read-modify-write (RMW) cycle:
      
              0       32K     64K
      Data1:  | Good  | Good  |
      Data2:  | Bad   | Bad   |
      Parity: | Good  | Good  |
      
      In the above case, if we trigger any write into Data1, we will use the bad
      data in Data2 to re-generate parity, killing the only chance to recover
      Data2, thus Data2 is lost forever.
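The effect can be reproduced with plain XOR parity. This is a toy model of a two-data-stripe RAID5 layout with made-up byte values, not the btrfs implementation:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (RAID5-style parity)."""
    return bytes(x ^ y for x, y in zip(a, b))

# Initial consistent state: parity = Data1 XOR Data2.
data1 = b"\xee" * 8
data2_good = b"\xaa" * 8
parity = xor_bytes(data1, data2_good)

# Data2 on disk goes bad (stale), while Data1 and parity are still good.
data2_on_disk = b"\x00" * 8

# Recovery is still possible: rebuild Data2 from Data1 and parity.
recovered = xor_bytes(data1, parity)
print("before RMW, recovered good Data2:", recovered == data2_good)

# Destructive RMW: a write to Data1 re-generates parity from the bad Data2.
new_data1 = b"\xff" * 8
parity_after_rmw = xor_bytes(new_data1, data2_on_disk)

# Now "recovery" only reproduces the bad data; the good copy is gone.
recovered_after_rmw = xor_bytes(new_data1, parity_after_rmw)
print("after RMW, recovered good Data2:", recovered_after_rmw == data2_good)
```

After the read-modify-write, the parity is consistent with the bad Data2, so no combination of on-disk blocks can produce the good contents any more.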
      
      This destructive RMW cycle is not specific to btrfs RAID56, but there
      are some btrfs specific behaviors making the case even worse:
      
      - Btrfs will cache sectors for unrelated vertical stripes.
      
        In above example, if we're only writing into 0~32K range, btrfs will
        still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
        This behavior is to cache sectors for later update.
      
        Incidentally commit d4e28d9b ("btrfs: raid56: make steal_rbio()
        subpage compatible") has a bug which makes RAID56 never trust the
        cached sectors, thus slightly improving the situation for recovery.
      
        Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
        steal_rbio" will revert the behavior back to the old one.
      
      - Btrfs raid56 partial write will update all P/Q sectors and cache them
      
        This means that even if the data at (64K ~ 96K) of Data2 is free space
        and only (96K ~ 128K) of Data2 is really stale data, when we write
        into that (96K ~ 128K) range we will update all the parity sectors for
        the full stripe.
      
        This unnecessary behavior will completely kill the chance of recovery.
      
        Thankfully, an unrelated optimization "btrfs: only write the sectors
        in the vertical stripe which has data stripes" will prevent
        submitting the write bio for untouched vertical sectors.
      
        That optimization will keep the on-disk P/Q untouched for a chance for
        later recovery.
      
      [FIX]
      Although we have no good way to completely fix the destructive RMW
      (unless we go full scrub for each partial write), we can still limit the
      damage.
      
      With patch "btrfs: only write the sectors in the vertical stripe which
      has data stripes" now we won't really submit the P/Q of unrelated
      vertical stripes, so the on-disk P/Q should still be fine.
      
      Now what we really need to do is just drop all the cached sectors when
      doing recovery.
      
      By this, we have a chance to read the original P/Q from disk, and have a
      chance to recover the stale data, while still keeping the cache to speed
      up the regular write path.
      
      In fact, just dropping all the cache for recovery path is good enough to
      allow the test case btrfs/125 along with the small script to pass
      reliably.
      
      The lack of metadata write after the degraded mount, and forced metadata
      COW is saving us this time.
      
      So this patch will fix the behavior by not trusting any cache in
      __raid56_parity_recover(), to solve the problem while still keeping the
      cache useful.
      
      But please note that this test pass DOES NOT mean we have solved the
      destructive RMW problem, we just do damage control a little
      better.
      
      Related patches:
      
      - btrfs: only write the sectors in the vertical stripe
      - d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      - btrfs: update stripe_sectors::uptodate in steal_rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6065f8e
    • C
      btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished · 711f447b
      Christoph Hellwig committed
      finish_func is always set to finish_ordered_fn, so remove it and also
      the now pointless and somewhat confusingly named
      __endio_write_update_ordered wrapper.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      711f447b
    • N
      btrfs: batch up release of reserved metadata for delayed items used for deletion · 1f4f639f
      Nikolay Borisov committed
      With Filipe's recent rework of the delayed inode code one aspect which
      isn't batched is the release of the reserved metadata of delayed inode's
      delete items. With this patch on top of Filipe's rework and running the
      same test as provided in the description of a patch titled
      "btrfs: improve batch deletion of delayed dir index items" I observe
      the following change of the number of calls to btrfs_block_rsv_release:
      
      Before this change:
      - block_rsv_release:                      1004
      - btrfs_delete_delayed_items_total_time: 14602
      - delete_batches:                          505
      
      After:
      - block_rsv_release:                       510
      - btrfs_delete_delayed_items_total_time: 13643
      - delete_batches:                          507
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1f4f639f
    • Q
      btrfs: warn about dev extents that are inside the reserved range · 3613249a
      Qu Wenruo committed
      Btrfs on-disk format has reserved the first 1MiB for the primary super
      block (at 64KiB offset) and bootloaders may also use this space.
      
      This reservation has only been in place since the btrfs-progs v4.1
      release, so older filesystems may have dev extents inside the range.
      Although the kernel ensures we never touch the reserved range of super
      blocks, it's better to inform the end users; a balance will resolve the
      problem.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update changelog and message ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      3613249a
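      The check described in the commit above can be sketched roughly as
      follows. This is a standalone illustration, not the kernel code:
      RESERVED_RANGE_BYTES, PRIMARY_SUPER_OFFSET and
      dev_extent_in_reserved_range() are hypothetical names, and only the
      1MiB and 64KiB values come from the changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the sizes are the ones quoted in the changelog. */
#define RESERVED_RANGE_BYTES (1024ULL * 1024ULL) /* first 1MiB of each device */
#define PRIMARY_SUPER_OFFSET (64ULL * 1024ULL)   /* primary super block at 64KiB */

/* The primary super block lives inside the reserved range. */
static_assert(PRIMARY_SUPER_OFFSET < RESERVED_RANGE_BYTES,
	      "primary super block must sit inside the reserved range");

/*
 * Return true when a dev extent starting at @physical_start (byte offset
 * on the device) begins inside the reserved range and so deserves a warning.
 */
static bool dev_extent_in_reserved_range(uint64_t physical_start)
{
	return physical_start < RESERVED_RANGE_BYTES;
}
```

      A caller would only warn on a hit rather than fail, since the changelog
      says a balance relocates the offending extents.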
    • Q
      btrfs: use named constant for reserved device space · 37f85ec3
      Qu Wenruo committed
      There's a reserved space on each device of size 1MiB that can be used by
      bootloaders or to avoid accidental overwrite. Use a symbolic constant
      with the explaining comment instead of hard coding the value and
      multiple comments.
      
      Note: since btrfs-progs v4.1, mkfs.btrfs reserves the first 1MiB for
      the primary super block (at offset 64KiB); until then the range could
      have been used by mistake. The kernel has always respected the 1MiB
      range for writes.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      37f85ec3
    • D
      bfceac7f
    • D
      btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino · e3059ec0
      David Sterba committed
      There's only one function we pass to iterate_inodes_from_logical as
      iterator, so we can drop the indirection and call it directly, after
      moving the function to backref.c.
      Signed-off-by: David Sterba <dsterba@suse.com>
      e3059ec0
    • D
      btrfs: simplify parameters of backref iterators · 875d1daa
      David Sterba committed
      The inode reference iterator interface takes parameters that are derived
      from the context parameter, but as it's a void* type the values are
      passed individually.
      
      Change the ctx type to inode_fs_path as it's the only thing we pass and
      drop any parameters that are derived from that.
      Signed-off-by: David Sterba <dsterba@suse.com>
      875d1daa
    • D
      btrfs: call inode_to_path directly and drop indirection · ad6240f6
      David Sterba committed
      The functions for iterating inode references take a function parameter
      but there's only one value, inode_to_path(). Remove the indirection and
      call the function. As paths_from_inode would become just an alias for
      iterate_irefs(), merge the two into one function.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad6240f6
    • Q
      btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies() · 6d322b48
      Qu Wenruo committed
      For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies
      directly, only for RAID5 and RAID6 we need some extra handling as
      there's no table value for that.
      
      For RAID10 there's a change from sub_stripes to ncopies. The values are
      the same but semantically we want to use number of copies, as this is
      what btrfs_num_copies does.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d322b48
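      The table-driven lookup described in the commit above can be sketched
      as follows. This is a simplified standalone model, not the kernel
      function: the enum, ncopies_table and num_copies() are hypothetical
      names, and the RAID5/RAID6 special-case return values are assumptions
      made for illustration, since the changelog only says those profiles
      need extra handling.

```c
#include <assert.h>

/* Hypothetical profile indices, for illustration only. */
enum raid_profile {
	PROFILE_SINGLE,
	PROFILE_DUP,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID1C3,
	PROFILE_RAID1C4,
	PROFILE_RAID10,
	PROFILE_RAID5,
	PROFILE_RAID6,
	NR_PROFILES,
};

/* Copies per profile. RAID5/RAID6 are left at 0: no table value applies. */
static const int ncopies_table[NR_PROFILES] = {
	[PROFILE_SINGLE]  = 1,
	[PROFILE_DUP]     = 2,
	[PROFILE_RAID0]   = 1,
	[PROFILE_RAID1]   = 2,
	[PROFILE_RAID1C3] = 3,
	[PROFILE_RAID1C4] = 4,
	[PROFILE_RAID10]  = 2,
};

/* Number of ways a block can be read back for a given profile. */
static int num_copies(enum raid_profile profile, int num_stripes)
{
	assert((unsigned int)profile < NR_PROFILES);

	/* Assumption: the direct read plus one rebuild from parity. */
	if (profile == PROFILE_RAID5)
		return 2;
	/* Assumption: one rebuild attempt per stripe combination. */
	if (profile == PROFILE_RAID6)
		return num_stripes;

	/* Every other profile reads its value straight from the table. */
	return ncopies_table[profile];
}
```

      For RAID10 the table returns 2, the same number the old sub_stripes
      lookup produced, which matches the changelog's note that only the
      semantics change.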
    • Q
      btrfs: use btrfs_raid_array to calculate number of parity stripes · 0b30f719
      Qu Wenruo committed
      Use the raid table instead of hard coded values and rename the helper as
      it is exported.  This could make later extension on RAID56 based
      profiles easier.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0b30f719
    • Q
      btrfs: use btrfs_chunk_max_errors() to replace tolerance calculation · 6dead96c
      Qu Wenruo committed
      In __btrfs_map_block() we have an assignment to @max_errors using
      nr_parity_stripes().
      
      Although it works for RAID56 it's confusing.  Replace it with
      btrfs_chunk_max_errors().
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6dead96c
    • Q
      btrfs: remove parameter dev_extent_len from scrub_stripe() · bc88b486
      Qu Wenruo committed
      For scrub_stripe() we can easily calculate the dev extent length as we
      have the full info of the chunk.
      
      Thus there is no need to pass @dev_extent_len from the caller, and we
      introduce a helper, btrfs_calc_stripe_length(), to do the calculation
      from extent_map structure.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc88b486
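      The dev extent length calculation mentioned in the commit above can be
      sketched from chunk-level information alone. This is a standalone
      model under stated assumptions, not the kernel helper: struct
      profile_attr, data_stripes() and dev_extent_length() are hypothetical
      names, and the sketch assumes the chunk length is split evenly across
      the data stripes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-profile attributes, in the spirit of a raid table. */
struct profile_attr {
	int ncopies; /* number of copies of each block */
	int nparity; /* number of parity stripes */
};

static const struct profile_attr ATTR_RAID1  = { .ncopies = 2, .nparity = 0 };
static const struct profile_attr ATTR_RAID10 = { .ncopies = 2, .nparity = 0 };
static const struct profile_attr ATTR_RAID5  = { .ncopies = 1, .nparity = 1 };
static const struct profile_attr ATTR_RAID6  = { .ncopies = 1, .nparity = 2 };

/* Stripes that carry unique data: drop parity, then fold the copies. */
static int data_stripes(struct profile_attr attr, int num_stripes)
{
	assert(attr.ncopies > 0 && num_stripes > attr.nparity);
	return (num_stripes - attr.nparity) / attr.ncopies;
}

/* Per-device extent length for a chunk of @chunk_len bytes. */
static uint64_t dev_extent_length(uint64_t chunk_len,
				  struct profile_attr attr, int num_stripes)
{
	return chunk_len / (uint64_t)data_stripes(attr, num_stripes);
}
```

      For example, a 2GiB RAID5 chunk over three devices has two data
      stripes, so each device holds a 1GiB dev extent; a RAID1 chunk has one
      data stripe, so each device holds the full chunk length.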