1. 16 Nov, 2017 1 commit
    • Btrfs: fix reported number of inode blocks after buffered append writes · e3b8a485
      Filipe Manana committed
      The patch from commit a7e3b975 ("Btrfs: fix reported number of inode
      blocks") introduced a regression where if we do a buffered write starting
      at position equal to or greater than the file's size and then stat(2) the
      file before writeback is triggered, the number of used blocks does not
      change (unless there's a prealloc/unwritten extent). Example:
      
        $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
        $ du -h foobar
        0	foobar
        $ sync
        $ du -h foobar
        64K	foobar
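The same observation can be made programmatically. The following is a minimal sketch, assuming a Linux system where `st_blocks` is counted in 512-byte units (which is what `du` relies on); the helper names are hypothetical, and `fsync` on the single file is used here in place of the system-wide `sync` from the example. The numbers it prints depend on the filesystem, so no expected output is given.

```python
import os
import tempfile


def reported_bytes(path):
    """Disk usage as reported by stat(2): st_blocks is in 512-byte units on Linux."""
    return os.stat(path).st_blocks * 512


def append_and_measure(directory, size=64 * 1024):
    """Do a buffered write of `size` bytes and sample usage before and after fsync."""
    path = os.path.join(directory, "foobar")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"\xab" * size)   # buffered write, not flushed yet
        before = reported_bytes(path)  # roughly what `du` would show now
        os.fsync(fd)                   # force writeback for this file
        after = reported_bytes(path)
    finally:
        os.close(fd)
    return before, after


if __name__ == "__main__":
    # Point `dir=` at a btrfs mount to check the behaviour described above.
    with tempfile.TemporaryDirectory() as d:
        before, after = append_and_measure(d)
        print(f"before flush: {before} bytes, after flush: {after} bytes")
```

On a kernel with the regression, `before` would be expected to be 0 on btrfs while `after` is 64K; with the fix both should agree.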
      
      The first version of that patch didn't have this regression; the second
      version, which was the one committed, was made only to address a
      performance regression detected by the Intel test robots using fs_mark.
      
      This fixes the regression by setting the new delalloc bit in the range, and
      doing it at btrfs_dirty_pages() while setting the regular delalloc bit as
      well, so that we set both bits at once and avoid navigating the inode's io
      tree twice. Doing it at btrfs_dirty_pages() is also the most meaningful
      place, as we should set the new delalloc bit whenever we set the delalloc
      bit, which happens only if we copied bytes into the pages at
      __btrfs_buffered_write().
      
      This was making some of LTP's du tests fail, which can be quickly run
      using a command line like the following:
      
        $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
      
      Fixes: a7e3b975 ("Btrfs: fix reported number of inode blocks")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 15 Nov, 2017 1 commit
    • Btrfs: add write_flags for compression bio · f82b7359
      Liu Bo committed
      The compression code path has only flagged bios with REQ_OP_WRITE no
      matter where the bios come from, but the write could be a sync write if
      fsync starts the writeback, or a normal writeback write if the wb kthread
      starts a periodic writeback.
      
      This breaks the rule that sync writes and writeback writes need to be
      differentiated from each other, because from the POV of the block layer,
      all bios need to be recognized by these flags in order to do some
      management, e.g. throttling.
      
      This passes writeback_control to the compression write path so that it can
      send bios with proper flags to the block layer.
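The idea of deriving bio flags from the writeback context can be modelled in a few lines. This is only an illustrative sketch, not the kernel code: the flag values are made up, and the names `WB_SYNC_ALL`, `WB_SYNC_NONE` and `REQ_SYNC` are assumptions borrowed from kernel terminology rather than taken from the commit message.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative bit values only; the kernel's real values differ.
REQ_OP_WRITE = 0x1
REQ_SYNC = 0x2


class SyncMode(Enum):
    WB_SYNC_NONE = 0  # periodic/background writeback
    WB_SYNC_ALL = 1   # data-integrity writeback, e.g. started by fsync


@dataclass
class WritebackControl:
    sync_mode: SyncMode


def compression_write_flags(wbc):
    """Every compressed bio is a write; only sync writeback also gets REQ_SYNC."""
    flags = REQ_OP_WRITE
    if wbc.sync_mode is SyncMode.WB_SYNC_ALL:
        flags |= REQ_SYNC
    return flags


if __name__ == "__main__":
    for mode in SyncMode:
        print(mode.name, hex(compression_write_flags(WritebackControl(mode))))
```

With the flags carried on each bio, a lower layer can tell the two kinds of write apart without knowing who started the writeback.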
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 30 Oct, 2017 1 commit
    • Btrfs: remove bio_flags which indicates a meta block of log-tree · 18fdc679
      Liu Bo committed
      Since both committing a transaction and writing the log tree do
      plugging on metadata IO, we can unify them to use %sync_writers to benefit
      both cases, instead of checking bio_flags while writing meta blocks of
      the log tree.
      
      We can remove this bio_flags because, in order to write dirty blocks,
      the log tree also uses btrfs_write_marked_extents(), inside which we
      have enabled %sync_writers. Therefore every write goes in a
      synchronous way, and so does checksumming.
      
      Please also note that bio_flags is applied per-context while
      %sync_writers is applied per-inode, so this might incur some overhead, i.e.
      
      1) while log tree is flushing its dirty blocks via
         btrfs_write_marked_extents(), in which %sync_writers is increased
         by one.
      
      2) in the meantime, some writeback operations may happen upon btrfs's
         metadata inode, so these writes go synchronously, too.
      
      However, AFAICS, the overhead is not a big one, while the win is that
      we unify the two places that need synchronous writes and remove a
      special hack/flag.
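The per-inode nature of %sync_writers, and the overhead it implies, can be illustrated with a toy model. This is a sketch under assumptions: the class and method names are hypothetical, and "inline" versus "async" merely stand for synchronous and deferred checksumming.

```python
from contextlib import contextmanager


class MetadataInode:
    """Toy stand-in for the btrfs metadata inode with its sync_writers counter."""

    def __init__(self):
        self.sync_writers = 0

    def checksum_mode(self):
        # Any write submitted while the counter is non-zero goes synchronously.
        return "inline" if self.sync_writers > 0 else "async"


@contextmanager
def write_marked_extents(inode):
    """Model of the window in which sync_writers is raised by one."""
    inode.sync_writers += 1
    try:
        yield
    finally:
        inode.sync_writers -= 1


if __name__ == "__main__":
    inode = MetadataInode()
    print("idle:", inode.checksum_mode())
    with write_marked_extents(inode):
        # Unrelated writeback on the same inode is also affected here.
        print("during log-tree flush:", inode.checksum_mode())
    print("after:", inode.checksum_mode())
```

Because the counter lives on the inode rather than on the submitting context, any writeback that happens to hit the inode inside the window takes the synchronous path too, which is the overhead described in points 1) and 2).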
      
      This removes the bio_flags related stuff for writing log-tree.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 16 Aug, 2017 1 commit
    • btrfs: struct-funcs, constify readers · 1cbb1f45
      Jeff Mahoney committed
      We have reader helpers for most of the on-disk structures. They take an
      extent_buffer and a pointer used as an offset into the buffer, and they
      are read-only.  We should mark them as const and, in turn, allow consumers
      of these interfaces to mark the buffers const as well.
      
      No impact on code, but serves as documentation that a buffer is intended
      not to be modified.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 15 Jul, 2017 1 commit
  6. 30 Jun, 2017 2 commits
  7. 20 Jun, 2017 9 commits
  8. 09 Jun, 2017 1 commit
  9. 26 Apr, 2017 2 commits
    • Btrfs: fix reported number of inode blocks · a7e3b975
      Filipe Manana committed
      Currently when there are buffered writes that were not yet flushed and
      they fall within allocated ranges of the file (that is, not in holes or
      beyond eof assuming there are no prealloc extents beyond eof), btrfs
      simply reports an incorrect number of used blocks through the stat(2)
      system call (or any of its variants), regardless of mount options or
      inode flags (compress, compress-force, nodatacow). This is because the
      number of blocks used that is reported is based on the current number
      of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
      inode. The latter covers bytes that fall both within allocated regions
      of the file and within holes.
      
      Example scenarios where the number of reported blocks is wrong while the
      buffered writes are not flushed:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
      
        # The following should have reported 64K...
        $ du -h /mnt/sdc/foo1
        128K	/mnt/sdc/foo1
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo1
        64K	/mnt/sdc/foo1
      
        $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 65536
        64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
      
        # The following should have reported 128K...
        $ du -h /mnt/sdc/foo2
        192K	/mnt/sdc/foo2
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo2
        128K	/mnt/sdc/foo2
      
      So the number of used file blocks is simply incorrect, unlike in other
      filesystems such as ext4 and xfs for example, but only while the buffered
      writes are not flushed.
      
      Fix this by tracking the number of delalloc bytes that fall within holes
      and beyond eof of a file, and use this new counter instead when reporting
      the number of used blocks for an inode.
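The difference between the two counters can be worked through with the scenarios above. The following is a simplified model, not the btrfs implementation: file layout is represented as a list of `(offset, length)` allocated ranges, and the function names are hypothetical.

```python
K = 1024


def uncovered_bytes(start, length, allocated):
    """Bytes of [start, start + length) that no allocated (offset, length) range covers."""
    end = start + length
    covered = 0
    pos = start
    for a_start, a_len in sorted(allocated):
        lo = max(pos, a_start)
        hi = min(end, a_start + a_len)
        if hi > lo:
            covered += hi - lo
            pos = hi
    return length - covered


def reported_bytes(inode_bytes, write, allocated, fixed):
    """Usage reported while one buffered write of (offset, length) is unflushed."""
    start, length = write
    if fixed:
        pending = uncovered_bytes(start, length, allocated)  # holes / beyond eof only
    else:
        pending = length                                     # all delalloc bytes
    return inode_bytes + pending


if __name__ == "__main__":
    # foo1: 64K already on disk, then a 64K overwrite at offset 0.
    print(reported_bytes(64 * K, (0, 64 * K), [(0, 64 * K)], fixed=False) // K, "K")
    print(reported_bytes(64 * K, (0, 64 * K), [(0, 64 * K)], fixed=True) // K, "K")
```

For foo1 the old accounting yields 128K and the new one 64K; for foo2 (a 128K preallocated extent, write at offset 64K) the values are 192K and 128K, matching the `du` output shown earlier.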
      
      Another, different problem is that the delalloc bytes counter
      is reset when writeback starts (by clearing the EXTENT_DELALLOC flag from
      the respective range in the inode's iotree) and the vfs inode's bytes
      counter is only incremented when writeback finishes (through
      insert_reserved_file_extent()). Therefore while writeback is ongoing we
      simply report a wrong number of blocks used by an inode if the write
      operation covers a range previously unallocated. While this change does
      not fix this problem, it does minimize it a lot by shortening that time
      window, as the new delalloc bytes counter (new_delalloc_bytes) is only
      decremented when writeback finishes, right before updating the vfs inode's
      bytes counter. Fully fixing this second problem is not trivial and will
      be addressed later by a different patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix invalid attempt to free reserved space on failure to cow range · a315e68f
      Filipe Manana committed
      When attempting to COW a file range (we are starting writeback and doing
      COW), if we manage to reserve an extent for the range we will write into
      but fail after reserving it and before creating the respective ordered
      extent, we end up in an error path where we attempt to decrement the
      data space's bytes_may_use counter after we already did it while
      reserving the extent, leading to a warning/trace like the following:
      
      [  847.621524] ------------[ cut here ]------------
      [  847.625441] WARNING: CPU: 5 PID: 4905 at fs/btrfs/extent-tree.c:4316 btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.633704] Modules linked in: btrfs crc32c_generic xor raid6_pq acpi_cpufreq i2c_piix4 ppdev psmouse tpm_tis serio_raw pcspkr parport_pc tpm_tis_core i2c_core sg
      [  847.644616] CPU: 5 PID: 4905 Comm: xfs_io Not tainted 4.10.0-rc8-btrfs-next-37+ #2
      [  847.648601] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  847.648601] Call Trace:
      [  847.648601]  dump_stack+0x67/0x90
      [  847.648601]  __warn+0xc2/0xdd
      [  847.648601]  warn_slowpath_null+0x1d/0x1f
      [  847.648601]  btrfs_free_reserved_data_space_noquota+0x60/0x9f [btrfs]
      [  847.648601]  btrfs_clear_bit_hook+0x140/0x258 [btrfs]
      [  847.648601]  clear_state_bit+0x87/0x128 [btrfs]
      [  847.648601]  __clear_extent_bit+0x222/0x2b7 [btrfs]
      [  847.648601]  clear_extent_bit+0x17/0x19 [btrfs]
      [  847.648601]  extent_clear_unlock_delalloc+0x3b/0x6b [btrfs]
      [  847.648601]  cow_file_range.isra.39+0x387/0x39a [btrfs]
      [  847.648601]  run_delalloc_nocow+0x4d7/0x70e [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  run_delalloc_range+0xa7/0x2b5 [btrfs]
      [  847.648601]  writepage_delalloc.isra.31+0xb9/0x15c [btrfs]
      [  847.648601]  __extent_writepage+0x249/0x2e8 [btrfs]
      [  847.648601]  extent_write_cache_pages.constprop.33+0x28b/0x36c [btrfs]
      [  847.648601]  ? arch_local_irq_save+0x9/0xc
      [  847.648601]  ? mark_lock+0x24/0x201
      [  847.648601]  extent_writepages+0x4b/0x5c [btrfs]
      [  847.648601]  ? btrfs_writepage_start_hook+0xed/0xed [btrfs]
      [  847.648601]  btrfs_writepages+0x28/0x2a [btrfs]
      [  847.648601]  do_writepages+0x23/0x2c
      [  847.648601]  __filemap_fdatawrite_range+0x5a/0x61
      [  847.648601]  filemap_fdatawrite_range+0x13/0x15
      [  847.648601]  btrfs_fdatawrite_range+0x20/0x46 [btrfs]
      [  847.648601]  start_ordered_ops+0x19/0x23 [btrfs]
      [  847.648601]  btrfs_sync_file+0x136/0x42c [btrfs]
      [  847.648601]  vfs_fsync_range+0x8c/0x9e
      [  847.648601]  vfs_fsync+0x1c/0x1e
      [  847.648601]  do_fsync+0x31/0x4a
      [  847.648601]  SyS_fsync+0x10/0x14
      [  847.648601]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  847.648601] RIP: 0033:0x7f5b05200800
      [  847.648601] RSP: 002b:00007ffe204f71c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  847.648601] RAX: ffffffffffffffda RBX: ffffffff8109637b RCX: 00007f5b05200800
      [  847.648601] RDX: 00000000008bd0a0 RSI: 00000000008bd2e0 RDI: 0000000000000003
      [  847.648601] RBP: ffffc90001d67f98 R08: 000000000000ffff R09: 000000000000001f
      [  847.648601] R10: 00000000000001f6 R11: 0000000000000246 R12: 0000000000000046
      [  847.648601] R13: ffffc90001d67f78 R14: 00007f5b054be740 R15: 00007f5b054be740
      [  847.648601]  ? trace_hardirqs_off_caller+0x3f/0xaa
      [  847.685787] ---[ end trace 2a4a3e15382508e8 ]---
      
      So fix this by not attempting to decrement the data space info's
      bytes_may_use counter if we already reserved the extent and an error
      happened before creating the ordered extent. We are already correctly
      freeing the reserved extent if an error happens, so there's no additional
      measure needed.
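The double decrement can be reproduced in a small accounting model. This is a sketch under assumptions: the class and function names are hypothetical, and an exception stands in for the kernel WARNING in the trace above.

```python
class AccountingError(Exception):
    """Stands in for the kernel WARNING shown in the trace above."""


class DataSpaceInfo:
    def __init__(self):
        self.bytes_may_use = 0
        self.bytes_reserved = 0

    def reserve_data_space(self, n):
        self.bytes_may_use += n

    def reserve_extent(self, n):
        # Reserving the extent already moves the bytes out of bytes_may_use.
        self.bytes_may_use -= n
        self.bytes_reserved += n

    def free_reserved_data_space(self, n):
        if self.bytes_may_use < n:
            raise AccountingError("bytes_may_use would go negative")
        self.bytes_may_use -= n

    def free_reserved_extent(self, n):
        self.bytes_reserved -= n


def failed_cow(space, n, fixed):
    """Extent reserved, then an error before the ordered extent is created."""
    space.reserve_data_space(n)
    space.reserve_extent(n)
    if not fixed:
        space.free_reserved_data_space(n)  # second decrement of bytes_may_use
    space.free_reserved_extent(n)          # always correct in the error path


if __name__ == "__main__":
    space = DataSpaceInfo()
    failed_cow(space, 4096, fixed=True)
    print(space.bytes_may_use, space.bytes_reserved)
```

With `fixed=True` both counters return to zero; with `fixed=False` the model raises, mirroring the warning.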
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  10. 18 Apr, 2017 1 commit
  11. 28 Feb, 2017 9 commits
  12. 17 Feb, 2017 2 commits
  13. 06 Dec, 2016 2 commits
  14. 30 Nov, 2016 3 commits
  15. 04 Oct, 2016 1 commit
  16. 26 Sep, 2016 2 commits
  17. 25 Aug, 2016 1 commit
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang committed
      This patch fixes some false ENOSPC errors. The test script below
      reproduces one such false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The script above fails with ENOSPC even though the fs still has enough
      free space to satisfy the request. See the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      On CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
      is delayed until extent_clear_unlock_delalloc().
      Assume that in this case btrfs_reserve_extent() reserved 128MB of data
      and CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data
      space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allocate a new data chunk, call
      btrfs_start_delalloc_roots(), or commit the current transaction in order to
      reserve some free space, which is obviously a lot of work. But it is not
      necessary: as long as bytes_may_use is decreased timely (here, by the
      128MB just reserved), we still have free space.
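The check above is plain arithmetic, so it can be tried with concrete numbers. In this sketch the 128MB and 100MB figures come from the text; the total of 300MB and the function name are assumptions made only for illustration.

```python
M = 1024 * 1024


def needs_more_space(request, total_bytes, bytes_used, bytes_reserved,
                     bytes_pinned, bytes_readonly, bytes_may_use):
    """True if `request` exceeds the free space computed by the formula above."""
    free = (total_bytes - bytes_used - bytes_reserved - bytes_pinned
            - bytes_readonly - bytes_may_use)
    return request > free


if __name__ == "__main__":
    # CPU 1 has reserved 128MB, but bytes_may_use still includes those 128MB.
    stale = needs_more_space(100 * M, total_bytes=300 * M, bytes_used=0,
                             bytes_reserved=128 * M, bytes_pinned=0,
                             bytes_readonly=0, bytes_may_use=128 * M)
    # Same state with bytes_may_use already decreased by the reserved 128MB.
    timely = needs_more_space(100 * M, total_bytes=300 * M, bytes_used=0,
                              bytes_reserved=128 * M, bytes_pinned=0,
                              bytes_readonly=0, bytes_may_use=0)
    print(stale, timely)
```

In the stale case only 44MB appear free, so the expensive reclaim path would be taken; with the timely update 172MB are free and the 100MB request fits.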
      
      To fix this issue, this patch chooses to update bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For the compress path, the
      real extent length may not be equal to the file content length, so a
      ram_bytes argument is introduced for btrfs_reserve_extent(),
      find_free_extent() and btrfs_add_reserved_bytes(), because bytes_may_use is
      increased by the file content length. The compress path can then update
      bytes_may_use correctly. Also, we can now discard
      RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.
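Why ram_bytes is needed becomes clear when the two counters are updated side by side. The following is a simplified model, not the kernel code: `SpaceInfo` and its methods are hypothetical names, and the 128K/32K compression figures are made-up example values.

```python
from dataclasses import dataclass

K = 1024
M = 1024 * K


@dataclass
class SpaceInfo:
    bytes_may_use: int = 0
    bytes_reserved: int = 0

    def check_data_free_space(self, ram_bytes):
        # The delalloc reservation is made in units of file content length.
        self.bytes_may_use += ram_bytes

    def add_reserved_bytes(self, ram_bytes, num_bytes):
        # Timely update: release what was promised (ram_bytes) at the moment
        # the on-disk extent (num_bytes) is reserved.
        self.bytes_may_use -= ram_bytes
        self.bytes_reserved += num_bytes


if __name__ == "__main__":
    # fallocate of 64M: both lengths are equal.
    s = SpaceInfo()
    s.check_data_free_space(64 * M)
    s.add_reserved_bytes(64 * M, 64 * M)
    print((s.bytes_may_use + s.bytes_reserved) // M, "M accounted")

    # Compressed write: 128K of file data ends up in a 32K extent.
    c = SpaceInfo()
    c.check_data_free_space(128 * K)
    c.add_reserved_bytes(128 * K, 32 * K)
    print(c.bytes_may_use, c.bytes_reserved // K)
```

In the fallocate case the total accounted space stays at 64M instead of the 128M from the call graph above. In the compressed case, decrementing by the extent length instead of ram_bytes would leave 96K stuck in bytes_may_use.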
      
      As we know, EXTENT_DO_ACCOUNTING is usually used for the error path. In
      run_delalloc_nocow(), for an inode marked as NODATACOW or an extent marked
      as PREALLOC, we also need to update bytes_may_use, but cannot pass
      EXTENT_DO_ACCOUNTING, because it also clears the metadata reservation. So
      here we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
      btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.
      
      Meanwhile, __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both the successful and
      failed paths, so btrfs_prealloc_file_range()'s callers do not need to call
      btrfs_free_reserved_data_space() any more.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>