1. 20 Jun, 2017 1 commit
  2. 10 Jun, 2017 1 commit
  3. 09 Jun, 2017 2 commits
  4. 01 Jun, 2017 1 commit
    • btrfs: use correct types for page indices in btrfs_page_exists_in_range · cc2b702c
      Committed by David Sterba
      Variables start_idx and end_idx are supposed to hold a page index
      derived from the file offsets. The int type is not the right one though:
      offsets larger than 1 << 44 will silently have their high bits trimmed off
      (1 << 44 is 16TiB).
      
      What can go wrong, if start is below the boundary and end gets trimmed:
      - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot)
      - the final check "if (page->index <= end_idx)" will unexpectedly fail
      
      The function will return false, i.e. "there's no page in the range",
      although there is at least one.
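The truncation described above can be reproduced in isolation. The following is a minimal sketch, assuming 4 KiB pages (a shift of 12) and a 32-bit index; the helper names are hypothetical and are not the kernel's, and the narrow index is modelled with uint32_t so that the wrap-around is well defined in C.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, so a file offset maps to a page index by >> 12. */
#define MODEL_PAGE_SHIFT 12

/* Buggy variant: the page index is narrowed to 32 bits, as with an int. */
static uint32_t page_index_narrow(uint64_t offset)
{
	return (uint32_t)(offset >> MODEL_PAGE_SHIFT);
}

/* Fixed variant: the index keeps the full width of the offset. */
static uint64_t page_index_wide(uint64_t offset)
{
	return offset >> MODEL_PAGE_SHIFT;
}

/* Models the final "page->index <= end_idx" check with a narrowed end index. */
static int page_in_range_narrow(uint64_t page_offset, uint64_t end_offset)
{
	return page_index_wide(page_offset) <= page_index_narrow(end_offset);
}

/* The same check with a full-width end index. */
static int page_in_range_wide(uint64_t page_offset, uint64_t end_offset)
{
	return page_index_wide(page_offset) <= page_index_wide(end_offset);
}
```

With a page just below the 16TiB boundary and a range end just above it, the narrowed end index wraps to 0, so the narrow check reports no page in the range while the wide check finds it.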
      
      btrfs_page_exists_in_range is used to prevent races in:
      
      * hole punching, where we make sure there are no pages in the
        truncated range, otherwise we'll wait for them to finish and redo
        truncation, but we're going to replace the pages with holes anyway so
        the only problem is the intermediate state
      
      * lock_extent_direct: we want to make sure there are no pages before we
        lock and start DIO, to prevent stale data reads
      
      For practical occurrence of the bug, there are several constraints.  The
      file must be quite large, the affected range must cross the 16TiB
      boundary and the internal state of the file pages and pending operations
      must match.  Also, we must not have started any ordered data in the
      range, otherwise we don't even reach the buggy function check.
      
      DIO locking tries hard in several places to avoid deadlocks with
      buffered IO and avoids waiting for ranges. The worst consequence seems
      to be stale data read.
      
      CC: Liu Bo <bo.li.liu@oracle.com>
      CC: stable@vger.kernel.org	# 3.16+
      Fixes: fc4adbff ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking")
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 26 Apr, 2017 5 commits
    • Btrfs: fix reported number of inode blocks · a7e3b975
      Committed by Filipe Manana
      Currently when there are buffered writes that were not yet flushed and
      they fall within allocated ranges of the file (that is, not in holes or
      beyond eof assuming there are no prealloc extents beyond eof), btrfs
      simply reports an incorrect number of used blocks through the stat(2)
      system call (or any of its variants), regardless of mount options or
      inode flags (compress, compress-force, nodatacow). This is because the
      number of blocks used that is reported is based on the current number
      of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
      inode. The latter covers bytes that fall both within allocated regions
      of the file and within holes.
      
      Example scenarios where the number of reported blocks is wrong while the
      buffered writes are not flushed:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
      
        # The following should have reported 64K...
        $ du -h /mnt/sdc/foo1
        128K	/mnt/sdc/foo1
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo1
        64K	/mnt/sdc/foo1
      
        $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 65536
        64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
      
        # The following should have reported 128K...
        $ du -h /mnt/sdc/foo2
        192K	/mnt/sdc/foo2
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo2
        128K	/mnt/sdc/foo2
      
      So the number of used file blocks is simply incorrect, unlike in other
      filesystems such as ext4 and xfs for example, but only while the buffered
      writes are not flushed.
      
      Fix this by tracking the number of delalloc bytes that fall within holes
      and beyond eof of a file, and using this new counter instead when
      reporting the number of used blocks for an inode.
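The difference between the two ways of counting can be sketched with a simplified model. This is only an illustration: the struct and function names are hypothetical, block-size alignment is left out, and only the counter name new_delalloc_bytes comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical, simplified model of the counters named in the text. */
struct inode_counters {
	uint64_t inode_bytes;        /* bytes already accounted in the vfs inode */
	uint64_t delalloc_bytes;     /* all unflushed buffered-write bytes */
	uint64_t new_delalloc_bytes; /* unflushed bytes in holes or beyond eof */
};

/* stat(2) reports usage in 512-byte units, hence the shift by 9. */
static uint64_t blocks_before_fix(const struct inode_counters *c)
{
	/* Overwrites of already allocated ranges are counted twice. */
	return (c->inode_bytes + c->delalloc_bytes) >> 9;
}

static uint64_t blocks_after_fix(const struct inode_counters *c)
{
	/* Only delalloc bytes that will allocate new space are added. */
	return (c->inode_bytes + c->new_delalloc_bytes) >> 9;
}
```

For the foo1 scenario (64K flushed, then 64K rewritten in place), the first function yields 256 blocks (128K) and the second 128 blocks (64K), matching the du output before and after the sync.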
      
      Another problem is that the delalloc bytes counter
      is reset when writeback starts (by clearing the EXTENT_DELALLOC flag from
      the respective range in the inode's iotree) and the vfs inode's bytes
      counter is only incremented when writeback finishes (through
      insert_reserved_file_extent()). Therefore while writeback is ongoing we
      simply report a wrong number of blocks used by an inode if the write
      operation covers a range previously unallocated. While this change does
      not fix this problem, it does minimize it a lot by shortening that time
      window, as the new delalloc bytes counter (new_delalloc_bytes) is only
      decremented when writeback finishes right before updating the vfs inode's
      bytes counter. Fully fixing this second problem is not trivial and will
      be addressed later by a different patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix incorrect space accounting after failure to insert inline extent · 1c81ba23
      Committed by Filipe Manana
      When using compression, if we fail to insert an inline extent we
      incorrectly end up attempting to free the reserved data space twice,
      once through extent_clear_unlock_delalloc(), because we pass it the
      flag EXTENT_DO_ACCOUNTING, and once through a direct call to
      btrfs_free_reserved_data_space_noquota(). This results in a trace
      like the following:
      
      [  834.576240] ------------[ cut here ]------------
      [  834.576825] WARNING: CPU: 2 PID: 486 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.579501] Modules linked in: btrfs crc32c_generic xor raid6_pq ppdev i2c_piix4 acpi_cpufreq psmouse tpm_tis parport_pc pcspkr serio_raw tpm_tis_core sg parport evdev i2c_core tpm button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [  834.592116] CPU: 2 PID: 486 Comm: kworker/u32:4 Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  834.593316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  834.595273] Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
      [  834.596103] Call Trace:
      [  834.596103]  dump_stack+0x67/0x90
      [  834.596103]  __warn+0xc2/0xdd
      [  834.596103]  warn_slowpath_null+0x1d/0x1f
      [  834.596103]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  834.596103]  compress_file_range.constprop.42+0x2fa/0x3fc [btrfs]
      [  834.596103]  ? submit_compressed_extents+0x3a7/0x3a7 [btrfs]
      [  834.596103]  async_cow_start+0x32/0x4d [btrfs]
      [  834.596103]  btrfs_scrubparity_helper+0x187/0x3e7 [btrfs]
      [  834.596103]  btrfs_delalloc_helper+0xe/0x10 [btrfs]
      [  834.596103]  process_one_work+0x273/0x4e4
      [  834.596103]  worker_thread+0x1eb/0x2ca
      [  834.596103]  ? rescuer_thread+0x2b6/0x2b6
      [  834.596103]  kthread+0x100/0x108
      [  834.596103]  ? __list_del_entry+0x22/0x22
      [  834.596103]  ret_from_fork+0x2e/0x40
      [  834.611656] ---[ end trace 719902fe6bdef08f ]---
      
      So fix this by not calling btrfs_free_reserved_data_space_noquota() directly
      if an error happened.
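Why a second release trips a warning can be shown with a toy counter. This sketch assumes the warning fires when more bytes are released than are currently reserved; all names are hypothetical stand-ins, not the kernel functions.

```c
#include <stdint.h>

/* Hypothetical model of a space reservation counter. */
struct space_model {
	uint64_t bytes_may_use; /* bytes currently reserved */
	unsigned int warnings;  /* how often a release exceeded the reservation */
};

static void model_reserve(struct space_model *s, uint64_t len)
{
	s->bytes_may_use += len;
}

/* Releasing more than is reserved is treated as the warning condition. */
static void model_release(struct space_model *s, uint64_t len)
{
	if (s->bytes_may_use < len) {
		s->warnings++;
		s->bytes_may_use = 0;
		return;
	}
	s->bytes_may_use -= len;
}

/* Error path after a failed inline extent insert. */
static void model_inline_insert_failed(struct space_model *s, uint64_t len,
				       int also_release_directly)
{
	/* First release: the range is cleared with accounting enabled. */
	model_release(s, len);
	/* Second release: the direct call that the fix skips on error. */
	if (also_release_directly)
		model_release(s, len);
}
```

Reserving 4096 bytes and running the error path with the direct call enabled records one warning; with it disabled the counter returns to zero cleanly.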
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix invalid attempt to free reserved space on failure to cow range · a315e68f
      Committed by Filipe Manana
      When attempting to COW a file range (we are starting writeback and doing
      COW), if we manage to reserve an extent for the range we will write into
      but fail after reserving it and before creating the respective ordered
      extent, we end up in an error path where we attempt to decrement the
      data space's bytes_may_use counter after we already did it while
      reserving the extent, leading to a warning/trace like the following:
      
      [  847.621524] ------------[ cut here ]------------
      [  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
      [  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  847.648601] Call Trace:
      [  847.648601]  dump_stack+0x67/0x90
      [  847.648601]  __warn+0xc2/0xdd
      [  847.648601]  warn_slowpath_null+0x1d/0x1f
      [  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
      [  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
      [  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
      [  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
      [  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
      [  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
      [  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
      [  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
      [  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
      [  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  ? mark_lock+0x24/0x201
      [  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
      [  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
      [  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
      [  847.648601]  do_writepages+0x23/0x2c
      [  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
      [  847.648601]  filemap_fdatawrite_range+0x13/0x15
      [  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
      [  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
      [  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
      [  847.648601]  vfs_fsync_range+0x8c/0x9e
      [  847.648601]  vfs_fsync+0x1c/0x1e
      [  847.648601]  do_fsync+0x31/0x4a
      [  847.648601]  SyS_fsync+0x10/0x14
      [  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  847.648601] RIP: 0033:0x7f5b05200800
      [  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
      [  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
      [  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
      [  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
      [  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
      [  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
      [  847.685787] ---[ end trace 2a4a3e15382508e8 ]---
      
      So fix this by not attempting to decrement the data space info's
      bytes_may_use counter if we already reserved the extent and an error
      happened before creating the ordered extent. We are already correctly
      freeing the reserved extent if an error happens, so there's no additional
      measure needed.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • btrfs: Handle delalloc error correctly to avoid ordered extent hang · 52427260
      Committed by Qu Wenruo
      [BUG]
      If run_delalloc_range() returns an error and some ordered extents have
      already been created, btrfs will hang with the following backtrace:
      
      Call Trace:
       __schedule+0x2d4/0xae0
       schedule+0x3d/0x90
       btrfs_start_ordered_extent+0x160/0x200 [btrfs]
       ? wake_atomic_t_function+0x60/0x60
       btrfs_run_ordered_extent_work+0x25/0x40 [btrfs]
       btrfs_scrubparity_helper+0x1c1/0x620 [btrfs]
       btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
       process_one_work+0x2af/0x720
       ? process_one_work+0x22b/0x720
       worker_thread+0x4b/0x4f0
       kthread+0x10f/0x150
       ? process_one_work+0x720/0x720
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x2e/0x40
      
      [CAUSE]
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|                       |<---------- cleanup range --------->|
       ||
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
      
      The problem is caused by the error handler of run_delalloc_range(), which
      doesn't handle any created ordered extents, leaving them waiting on
      btrfs_finish_ordered_io() to finish.
      
      However, after run_delalloc_range() returns an error, __extent_writepage()
      won't submit a bio, so btrfs_writepage_end_io_hook() won't be triggered
      except for the first page, and btrfs_finish_ordered_io() won't be triggered
      for the created ordered extents either.
      
      So OE 2~n will hang forever, and if OE 1 is larger than one page, it
      will also hang.
      
      [FIX]
      Introduce the btrfs_cleanup_ordered_extents() function to clean up created
      ordered extents and finish them manually.
      
      The function is based on the existing
      btrfs_endio_direct_write_update_ordered() function, modified to
      act just like btrfs_writepage_end_io_hook() but to handle a specified
      range rather than a single page.
      
      After fix, delalloc error will be handled like:
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<>|<--------  ----------->|<------ old error handler --------->|
       ||          ||
       ||          \_=> Cleaned up by cleanup_ordered_extents()
       \_=> First page handled by end_extent_writepage() in __extent_writepage()
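The range split shown in the diagram can be sketched as a small helper. This is an illustration under stated assumptions: 4 KiB pages, a page-aligned start, and hypothetical names; it only computes which part of the created ordered extents would need manual cleanup once the first page has been handled by the writepage path.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096ULL

struct byte_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Given the byte range covered by the created ordered extents, return the
 * part that must be finished manually: everything except the first page,
 * which is already handled elsewhere.
 */
static struct byte_range manual_cleanup_range(uint64_t oe_start, uint64_t oe_len)
{
	struct byte_range r;

	if (oe_len > MODEL_PAGE_SIZE) {
		r.start = oe_start + MODEL_PAGE_SIZE;
		r.len = oe_len - MODEL_PAGE_SIZE;
	} else {
		/* Only the first page was covered: nothing left to clean up. */
		r.start = oe_start + oe_len;
		r.len = 0;
	}
	return r;
}
```

For ordered extents covering 16K from offset 0, the helper returns the 12K range starting at 4096; for a single page it returns an empty range.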
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error · 4dbd80fb
      Committed by Qu Wenruo
      [BUG]
      When btrfs_reloc_clone_csum() reports an error, it can underflow metadata
      and lead to a kernel assertion on outstanding extents in
      run_delalloc_nocow() and cow_file_range().
      
       BTRFS info (device vdb5): relocating block group 12582912 flags data
       BTRFS info (device vdb5): found 1 extents
       assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858
      
      Currently, due to another bug blocking ordered extents, the bug is only
      reproducible under a certain block group layout and using error injection.
      
      a) Create one data block group with one 4K extent in it.
         This avoids the bug that hangs btrfs due to an ordered extent which
         never finishes
      b) Make btrfs_reloc_clone_csum() always fail
      c) Relocate that block group
      
      [CAUSE]
      run_delalloc_nocow() and cow_file_range() handle errors from
      btrfs_reloc_clone_csum() incorrectly:
      
      (The ASCII chart shows a more generic case of this bug than the
      bug mentioned above)
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
                          |<----------- cleanup range --------------->|
      |<-----------  ----------->|
                   \/
       btrfs_finish_ordered_io() range
      
      So the error handler, which calls extent_clear_unlock_delalloc() with the
      EXTENT_DELALLOC and EXTENT_DO_ACCOUNTING bits, and btrfs_finish_ordered_io()
      will both cover OE n and free its metadata, causing metadata underflow.
      
      [FIX]
      The fix is to ensure that after calling btrfs_add_ordered_extent(), we only
      call the error handler after increasing the iteration offset, so that the
      cleanup range won't cover any created ordered extent.
      
      |<------------------ delalloc range --------------------------->|
      | OE 1 | OE 2 | ... | OE n |
      |<-----------  ----------->|<---------- cleanup range --------->|
                   \/
       btrfs_finish_ordered_io() range
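The effect of advancing the iteration offset before jumping to the error handler can be sketched numerically. This is a simplified model with a hypothetical helper name; it assumes equally sized ordered extents and that at least one has been created when the failure happens.

```c
#include <stdint.h>

/*
 * Start of the range handed to the error handler when a step fails right
 * after the nr_created-th ordered extent (OE n) was added.
 *
 * advance_first == 0 models the buggy order: the offset still points at
 * OE n, so the cleanup range overlaps it.
 * advance_first != 0 models the fix: the offset is moved past OE n first.
 */
static uint64_t cleanup_range_start(uint64_t delalloc_start, uint64_t oe_len,
				    unsigned int nr_created, int advance_first)
{
	/* Offset of the most recently created ordered extent. */
	uint64_t cur = delalloc_start + (uint64_t)(nr_created - 1) * oe_len;

	if (advance_first)
		cur += oe_len;
	return cur;
}
```

With three 4K ordered extents starting at offset 0, the buggy order begins cleanup at 8192, which is where OE 3 starts, while the fixed order begins at 12288, just past it.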
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  6. 18 Apr, 2017 3 commits
  7. 12 Apr, 2017 2 commits
    • Btrfs: fix segmentation fault when doing dio read · 97bf5a55
      Committed by Liu Bo
      Commit 2dabb324 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
      introduced this bug while iterating over bio pages in the dio read endio
      hook, and it can end up causing a segmentation fault in the dio reading
      task.
      
      The cause is 'if (nr_sectors--)', which makes the code assume that
      there is one more block in the same page. The page offset is therefore
      increased, the bio created to repair the bad block gets an incorrect
      bvec.bv_offset, and a later access of the page content throws a
      segmentation fault.
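The off-by-one comes from the post-decrement being tested before it takes effect. The following is a minimal sketch of that pattern in isolation, not the kernel code: the function names are hypothetical, and it assumes nr_sectors is at least 1 on entry.

```c
/*
 * Counts how many blocks are visited when the continuation test is written
 * as "if (nr_sectors--)": the old value is tested, so one extra block is
 * visited after the last real one.
 */
static int blocks_visited_postdec(int nr_sectors)
{
	int visited = 0;

next_block:
	visited++;
	if (nr_sectors--)
		goto next_block;
	return visited;
}

/* Decrementing first and then testing visits exactly nr_sectors blocks. */
static int blocks_visited_dec_then_test(int nr_sectors)
{
	int visited = 0;

next_block:
	visited++;
	nr_sectors--;
	if (nr_sectors)
		goto next_block;
	return visited;
}
```

With a single block per page, the first variant visits two blocks, which corresponds to stepping the page offset past the only valid block.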
      
      This also adds ASSERT to check page offset against page size.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix invalid dereference in btrfs_retry_endio · 2e949b0a
      Committed by Liu Bo
      When doing directIO repair, we have this oops:
      
      [ 1458.532816] general protection fault: 0000 [#1] SMP
      ...
      [ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
      [ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
      [ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
      ...
      [ 1458.543261] Call Trace:
      [ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
      [ 1458.544374]  bio_endio+0xed/0x100
      [ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
      [ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
      [ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
      [ 1458.546224]  process_one_work+0x34d/0xb70
      [ 1458.546570]  ? process_one_work+0x29e/0xb70
      [ 1458.546938]  worker_thread+0x1cf/0x960
      [ 1458.547263]  ? process_one_work+0xb70/0xb70
      [ 1458.547624]  kthread+0x17d/0x180
      [ 1458.547909]  ? kthread_create_on_node+0x70/0x70
      [ 1458.548300]  ret_from_fork+0x31/0x40
      
      It turns out that btrfs_retry_endio is trying to get inode from a directIO
      page.
      
      This fixes the problem by using the saved inode pointer, done->inode.
      btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.
      
      Also clean up unused @start (which is too trivial for a separate patch).
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 29 Mar, 2017 1 commit
    • Btrfs: bring back repair during read · 9d0d1c8b
      Committed by Liu Bo
      Commit 20a7db8a ("btrfs: add dummy callback for readpage_io_failed
      and drop checks") made a cleanup around readpage_io_failed_hook, and
      it was supposed to keep the original semantics, but it also
      unexpectedly disabled repair during read for dup, raid1 and raid10.
      
      This fixes the problem by letting data's inode call the generic
      readpage_io_failed callback by returning -EAGAIN from its
      readpage_io_failed_hook in order to notify end_bio_extent_readpage to
      do the rest.  We don't call it directly because the generic one takes
      an offset from end_bio_extent_readpage() to calculate the index in the
      checksum array and inode's readpage_io_failed_hook doesn't offer that
      offset.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ keep the const function attribute ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 18 Mar, 2017 1 commit
    • btrfs: add missing memset while reading compressed inline extents · e1699d2d
      Committed by Zygo Blaxell
      This is a story about 4 distinct (and very old) btrfs bugs.
      
      Commit c8b97818 ("Btrfs: Add zlib compression support") added
      three data corruption bugs for inline extents (bugs #1-3).
      
      Commit 93c82d57 ("Btrfs: zero page past end of inline file items")
      fixed bug #1:  uncompressed inline extents followed by a hole and more
      extents could get non-zero data in the hole as they were read.  The fix
      was to add a memset in btrfs_get_extent to zero out the hole.
      
      Commit 166ae5a4 ("btrfs: fix inline compressed read err corruption")
      fixed bug #2:  compressed inline extents which contained non-zero bytes
      might be replaced with zero bytes in some cases.  This patch removed an
      unhelpful memset from uncompress_inline, but the case where memset is
      required was missed.
      
      There is also a memset in the decompression code, but this only covers
      decompressed data that is shorter than the ram_bytes from the extent
      ref record.  This memset doesn't cover the region between the end of the
      decompressed data and the end of the page.  It has also moved around a
      few times over the years, so there's no single patch to refer to.
      
      This patch fixes bug #3:  compressed inline extents followed by a hole
      and more extents could get non-zero data in the hole as they were read
      (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
      The fix is the same:  zero out the hole in the compressed case too,
      by putting a memset back in uncompress_inline, but this time with
      correct parameters.
      
      The last and oldest bug, bug #0, is the cause of the offending inline
      extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
      of behavior somewhere in the btrfs write code.  In a few special cases,
      an inline extent and hole are allowed to persist where they normally
      would be combined with later extents in the file.
      
      A fast reproducer for bug #0 is presented below.  A few offending extents
      are also created in the wild during large rsync transfers with the -S
      flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
      will produce a handful of offending files as well.  Once an offending
      file is created, it can present different content to userspace each
      time it is read.
      
      Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
      kernel back to v3.5 has this behavior.  There are fossil records of this
      bug's effects in commits all the way back to v2.6.32.  I have no reason
      to believe bug #0 wasn't present at the beginning of btrfs compression
      support in v2.6.29, but I can't easily test kernels that old to be sure.
      
      It is not clear whether bug #0 is worth fixing.  A fix would likely
      require injecting extra reads into currently write-only paths, and most
      of the exceptional cases caused by bug #0 are already handled now.
      
      Whether we like them or not, bug #0's inline extents followed by holes
      are part of the btrfs de-facto disk format now, and we need to be able
      to read them without data corruption or an infoleak.  So enough about
      bug #0, let's get back to bug #3 (this patch).
      
      An example of on-disk structure leading to data corruption found in
      the wild:
      
              item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                      inode generation 50 transid 50 size 47424 nbytes 49141
                      block group 0 mode 100644 links 1 uid 0 gid 0
                      rdev 0 flags 0x0(none)
              item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                      inode ref index 3 namelen 10 name: DB_File.so
              item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                      inline extent data size 1341 ram 4085 compress(zlib)
              item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                      extent data disk byte 5367308288 nr 20480
                      extent data offset 0 nr 45056 ram 45056
                      extent compression(zlib)
      
      Different data appears in userspace during each read of the 11 bytes
      between 4085 and 4096.  The extent in item 63 is not long enough to
      fill the first page of the file, so a memset is required to fill the
      space between item 63 (ending at 4085) and item 64 (beginning at 4096)
      with zero.
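The required zeroing can be sketched as follows. This is a simplified illustration, not the btrfs uncompress_inline code: the function name is hypothetical, 4 KiB pages are assumed, and the decompression step is replaced by a plain copy of already decompressed bytes.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096

/*
 * Copy len decompressed bytes into a page-sized buffer and zero the rest of
 * the page, so that stale page contents never reach the reader.
 */
static void fill_page_from_inline(unsigned char *page,
				  const unsigned char *data, size_t len)
{
	if (len > MODEL_PAGE_SIZE)
		len = MODEL_PAGE_SIZE;
	memcpy(page, data, len);
	/* The tail between the end of the data and the end of the page. */
	if (len < MODEL_PAGE_SIZE)
		memset(page + len, 0, MODEL_PAGE_SIZE - len);
}
```

For the extent above (4085 bytes of data), this zeroes the 11 bytes from offset 4085 up to 4096.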
      
      Here is a reproducer from Liu Bo, which demonstrates another method
      of creating the same inline extent and hole pattern:
      
      Using 'page_poison=on' kernel command line (or enable
      CONFIG_PAGE_POISONING) run the following:
      
      	# touch foo
      	# chattr +c foo
      	# xfs_io -f -c "pwrite -W 0 1000" foo
      	# xfs_io -f -c "falloc 4 8188" foo
      	# od -x foo
      	# echo 3 >/proc/sys/vm/drop_caches
      	# od -x foo
      
      This produces the following on my box:
      
      Correct output:  file contains 1000 data bytes followed
      by zeros:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
      	0001760 0000 0000 0000 0000 0000 0000 0000 0000
      	*
      	0020000
      
      Actual output:  the data after the first 1000 bytes
      will be different each run:
      
      	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
      	*
      	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
      	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
      	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
      	(...)
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 03 Mar, 2017 1 commit
    • statx: Add a system call to make enhanced file info available · a528d35e
      Committed by David Howells
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were too ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabel might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
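
      The 256-byte buffer size can be checked against the layout above.  The
      following is only a sketch: it mirrors the two structures using
      fixed-width types from <stdint.h> (the *_sketch names are made up for
      this example) and lets the compiler verify the size and a few offsets.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct statx_timestamp as listed above. */
struct statx_timestamp_sketch {
	int64_t tv_sec;
	int32_t tv_nsec;
	int32_t reserved;
};

/* Mirror of struct statx as listed above. */
struct statx_sketch {
	uint32_t stx_mask;
	uint32_t stx_blksize;
	uint64_t stx_attributes;
	uint32_t stx_nlink;
	uint32_t stx_uid;
	uint32_t stx_gid;
	uint16_t stx_mode;
	uint16_t spare0[1];
	uint64_t stx_ino;
	uint64_t stx_size;
	uint64_t stx_blocks;
	uint64_t spare1[1];
	struct statx_timestamp_sketch stx_atime;
	struct statx_timestamp_sketch stx_btime;
	struct statx_timestamp_sketch stx_ctime;
	struct statx_timestamp_sketch stx_mtime;
	uint32_t stx_rdev_major;
	uint32_t stx_rdev_minor;
	uint32_t stx_dev_major;
	uint32_t stx_dev_minor;
	uint64_t spare2[14];
};

/* 16 + 16 + 32 + 4*16 + 16 + 14*8 = 256 bytes, with no padding needed. */
_Static_assert(sizeof(struct statx_timestamp_sketch) == 16, "timestamp size");
_Static_assert(offsetof(struct statx_sketch, stx_atime) == 64, "atime offset");
_Static_assert(offsetof(struct statx_sketch, stx_rdev_major) == 128, "rdev offset");
_Static_assert(sizeof(struct statx_sketch) == 256, "buffer size");
```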
      
      The defined bits in the mask argument and in stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spare*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
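
      Under the sign convention described above (seconds and nanoseconds
      share the same sign), a timestamp can be folded into a single signed
      nanosecond count by plain addition.  This is a minimal sketch;
      timestamp_to_ns() is a hypothetical helper and overflows for times more
      than roughly 292 years from 1970.

```c
#include <stdint.h>

/* Combine seconds and nanoseconds that carry the same sign. */
static int64_t timestamp_to_ns(int64_t tv_sec, int32_t tv_nsec)
{
	return tv_sec * INT64_C(1000000000) + tv_nsec;
}
```

      For instance, one and a half seconds before the epoch would be
      represented as tv_sec = -1 and tv_nsec = -500000000, which this helper
      maps to -1500000000.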
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
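
      A tool could decode stx_attributes along the following lines.  This is
      a sketch only: the numeric values are those of the matching FS_*_FL
      flags and of STATX_ATTR_AUTOMOUNT in the uapi header, written out
      literally here so the example is self-contained, and
      describe_attributes() is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct attr_name {
	uint64_t bit;
	const char *name;
};

static const struct attr_name attr_names[] = {
	{ 0x00000004, "compressed" },	/* STATX_ATTR_COMPRESSED */
	{ 0x00000010, "immutable" },	/* STATX_ATTR_IMMUTABLE */
	{ 0x00000020, "append" },	/* STATX_ATTR_APPEND */
	{ 0x00000040, "nodump" },	/* STATX_ATTR_NODUMP */
	{ 0x00000800, "encrypted" },	/* STATX_ATTR_ENCRYPTED */
	{ 0x00001000, "automount" },	/* STATX_ATTR_AUTOMOUNT */
};

/* Write a space-separated list of set attributes into buf (len > 0). */
static void describe_attributes(uint64_t attrs, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(attr_names) / sizeof(attr_names[0]); i++) {
		if (!(attrs & attr_names[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, attr_names[i].name, len - strlen(buf) - 1);
	}
}
```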
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise it will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
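
      Because a field may be filled in without being valid, a caller would
      normally compare what it asked for with what stx_mask reports.  The
      sketch below shows that comparison; the SKETCH_* constants copy the
      mask values from the uapi header so the example stands alone, and
      statx_missing() is a hypothetical helper.

```c
#include <stdint.h>

/* Mask values as defined in include/uapi/linux/stat.h. */
#define SKETCH_STATX_UID		0x00000008U
#define SKETCH_STATX_GID		0x00000010U
#define SKETCH_STATX_BASIC_STATS	0x000007ffU
#define SKETCH_STATX_BTIME		0x00000800U
#define SKETCH_STATX_ALL		0x00000fffU

/* Return the requested bits that were not reported as valid. */
static uint32_t statx_missing(uint32_t requested, uint32_t stx_mask)
{
	return requested & ~stx_mask;
}
```

      For example, if everything is requested and the filesystem reports a
      mask of 0x7ff, the only missing bit is the one for stx_btime, so that
      field should be treated as absent rather than as a real creation time.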
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000

      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a528d35e
  11. 28 Feb, 2017 22 commits