1. 06 Jul 2017, 1 commit
  2. 26 Apr 2017, 2 commits
    • Btrfs: fix reported number of inode blocks · a7e3b975
      Committed by Filipe Manana
      Currently, when there are buffered writes that have not yet been
      flushed and that fall within allocated ranges of the file (that is,
      not in holes or beyond eof, assuming there are no prealloc extents
      beyond eof), btrfs reports an incorrect number of used blocks through
      the stat(2) system call (or any of its variants), regardless of mount
      options or inode flags (compress, compress-force, nodatacow). This is
      because the number of blocks reported is based on the current number
      of bytes in the vfs inode plus the number of delalloc bytes in the
      btrfs inode. The latter covers both bytes that fall within allocated
      regions of the file and bytes that fall within holes.
      
      Example scenarios where the number of reported blocks is wrong while the
      buffered writes are not flushed:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
      
        # The following should have reported 64K...
        $ du -h /mnt/sdc/foo1
        128K	/mnt/sdc/foo1
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo1
        64K	/mnt/sdc/foo1
      
        $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 65536
        64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
      
        # The following should have reported 128K...
        $ du -h /mnt/sdc/foo2
        192K	/mnt/sdc/foo2
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo2
        128K	/mnt/sdc/foo2
      
      So, while buffered writes are not yet flushed, btrfs reports an
      incorrect number of used file blocks, unlike other filesystems such
      as ext4 and xfs.
      
      Fix this by tracking the number of delalloc bytes that fall within
      holes or beyond the eof of a file, and using this new counter instead
      when reporting the number of used blocks for an inode.
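
      With the fix, the reported block count matches the written data even
      before flushing; a quick check (a sketch reusing the mount point and
      style of the examples above, with a hypothetical file foo3):

        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo3
        $ sync
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo3
        $ du -h /mnt/sdc/foo3
        64K	/mnt/sdc/foo3
        $ stat -c "%b blocks of 512 bytes" /mnt/sdc/foo3
        128 blocks of 512 bytes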
      
      A different problem also exists: the delalloc bytes counter is reset
      when writeback starts (by clearing the EXTENT_DELALLOC flag from the
      respective range in the inode's io tree), while the vfs inode's bytes
      counter is only incremented when writeback finishes (through
      insert_reserved_file_extent()). Therefore, while writeback is ongoing,
      we report a wrong number of blocks used by an inode if the write
      operation covers a previously unallocated range. While this change
      does not fix that problem, it minimizes it a lot by shortening the
      time window: the new delalloc bytes counter (new_delalloc_bytes) is
      only decremented when writeback finishes, right before updating the
      vfs inode's bytes counter. Fully fixing this second problem is not
      trivial and will be addressed later by a different patch.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix extent map leak during fallocate error path · be2d253c
      Committed by Filipe Manana
      If the call to btrfs_qgroup_reserve_data() failed, we were leaking an
      extent map structure. The failure can happen either due to an -ENOMEM
      condition or, when quotas are enabled, due to -EDQUOT for example.
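
      A minimal reproducer sketch for the -EDQUOT case (device and mount
      point are hypothetical; qgroup 0/5 is the default subvolume):

        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
        $ btrfs quota enable /mnt/sdc
        $ btrfs qgroup limit 1M 0/5 /mnt/sdc
        $ xfs_io -f -c "falloc 0 16M" /mnt/sdc/foo
        # expected to fail with EDQUOT once the 1M qgroup limit is hit

      Before this fix, each such failed fallocate leaked one extent map
      structure (visible, for example, as growth of the btrfs_extent_map
      slab in /proc/slabinfo).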
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
  3. 18 Apr 2017, 1 commit
  4. 28 Feb 2017, 14 commits
  5. 14 Feb 2017, 3 commits
  6. 10 Dec 2016, 1 commit
    • fs: try to clone files first in vfs_copy_file_range · a76b5b04
      Committed by Christoph Hellwig
      A clone is a perfectly fine implementation of a file copy, so most
      file systems just implement the copy that way.  Instead of duplicating
      this logic, move it to the VFS.  Currently btrfs and XFS implement
      copies the same way as clones, so there is no behavior change for
      them; cifs only implements clones and grows support for
      copy_file_range with this patch.  NFS implements both, so this will
      allow copy_file_range to work on servers that only implement CLONE,
      and to be a lot more efficient on servers that implement both CLONE
      and COPY.
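
      For example, on a clone-capable filesystem a copy issued through
      copy_file_range(2) can now be served as a reflink (a sketch, assuming
      an xfs_io recent enough to have the copy_range command; paths are
      hypothetical):

        $ xfs_io -f -c "pwrite 0 128M" /mnt/src
        $ xfs_io -f -c "copy_range /mnt/src" /mnt/dst
        # completes without copying data blocks, just like an explicit clone:
        $ xfs_io -f -c "reflink /mnt/src" /mnt/dst2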
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  7. 06 Dec 2016, 5 commits
  8. 30 Nov 2016, 4 commits
    • Btrfs: fix enospc in hole punching · 2cdaf447
      Committed by Robbie Ko
      Hole punching can result in adding new leaves (and as a consequence
      new nodes) to the tree, because when we find file extent items that
      span beyond the hole range we may end up not deleting them (just
      adjusting them, reducing their range by reducing their length or
      increasing their offset field) and adding new file extent items
      representing holes.
      
      So after splitting a leaf (therefore creating a new one) to insert a new
      file extent item representing a hole, a new node might be added to each
      level of the tree in the worst case scenario (since there's a new key
      and every parent node was full).
      
      For example, if a file has an extent item representing the range 0 to
      64MB and we punch a hole in the range 1MB to 20MB, the existing extent
      item is duplicated and one of the copies is adjusted to represent the
      range 0 to 1MB, the other copy is adjusted to represent the range 20MB
      to 64MB, and a new file extent item representing a hole in the range
      1MB to 20MB is inserted.
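
      The split is easy to observe with fiemap (a sketch; mount point and
      file name are hypothetical, and the layout assumes a single 64MB
      extent as in the example above):

        $ xfs_io -f -c "falloc 0 64m" /mnt/sdc/foo
        $ sync
        $ xfs_io -c "fpunch 1m 19m" -c "fiemap -v" /mnt/sdc/foo
        # fiemap should now show data for [0, 1MB), a hole for [1MB, 20MB)
        # and data for [20MB, 64MB): one extent item became two, plus a
        # third item representing the hole.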
      
      Fix this by using btrfs_calc_trans_metadata_size() instead of
      btrfs_calc_trunc_metadata_size(), so that enough metadata space is
      reserved for the worst possible case.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      [Modified changelog for clarity and correctness]
    • Btrfs: abort transaction if fill_holes() fails · f94480bd
      Committed by Josef Bacik
      At this point we will have dropped extent entries from the file, so if
      we fail to insert the new hole entries then we are leaving the fs in a
      corrupt state (albeit an easily fixed one).  Abort the transaction if
      this happens so we can avoid corrupting the fs.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix file extent corruption · 62fe51c1
      Committed by Josef Bacik
      In order to do hole punching we have a block reserve to hold the
      reservation we need to drop the extents in our range.  Since we could
      end up dropping a lot of extents we set rsv->failfast so we can just
      loop around again and drop the remainder of the range.  Unfortunately
      we unconditionally fill the hole extents in and start from the last
      extent we encountered, which we may or may not have dropped.  So this
      can result in overlapping file extent entries, which can be tripped
      over in a variety of ways, either by hitting BUG_ON(!ret) in
      fill_holes() after the search, or in btrfs_set_item_key_safe() in
      btrfs_drop_extents() at a later time by an unrelated task.  Fix this
      by only setting drop_end to the last extent we actually did drop.
      This way our holes are filled in properly for the range that we did
      drop, and the rest of the range that remains to be dropped is actually
      dropped.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused headers, statfs.h · 926b9233
      Committed by David Sterba
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 28 Sep 2016, 1 commit
  10. 26 Sep 2016, 2 commits
  11. 16 Sep 2016, 1 commit
  12. 25 Aug 2016, 2 commits
    • Btrfs: fix lockdep warning on deadlock against an inode's log mutex · 28a23593
      Committed by Filipe Manana
      Commit 44f714da ("Btrfs: improve performance on fsync against new
      inode after rename/unlink"), which landed in 4.8-rc2, introduced the
      possibility of a deadlock due to double locking of an inode's log
      mutex by the same task, which lockdep reports as:
      
      [23045.433975] =============================================
      [23045.434748] [ INFO: possible recursive locking detected ]
      [23045.435426] 4.7.0-rc6-btrfs-next-34+ #1 Not tainted
      [23045.436044] ---------------------------------------------
      [23045.436044] xfs_io/3688 is trying to acquire lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     but task is already holding lock:
      [23045.436044]  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     other info that might help us debug this:
      [23045.436044]  Possible unsafe locking scenario:
      
      [23045.436044]        CPU0
      [23045.436044]        ----
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]   lock(&ei->log_mutex);
      [23045.436044]
                      *** DEADLOCK ***
      
      [23045.436044]  May be due to missing lock nesting notation
      
      [23045.436044] 3 locks held by xfs_io/3688:
      [23045.436044]  #0:  (&sb->s_type->i_mutex_key#15){+.+...}, at: [<ffffffffa035f2ae>] btrfs_sync_file+0x14e/0x425 [btrfs]
      [23045.436044]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffff8118446b>] __sb_start_write+0x5f/0xb0
      [23045.436044]  #2:  (&ei->log_mutex){+.+...}, at: [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]
                     stack backtrace:
      [23045.436044] CPU: 4 PID: 3688 Comm: xfs_io Not tainted 4.7.0-rc6-btrfs-next-34+ #1
      [23045.436044] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [23045.436044]  0000000000000000 ffff88022f5f7860 ffffffff8127074d ffffffff82a54b70
      [23045.436044]  ffffffff82a54b70 ffff88022f5f7920 ffffffff81092897 ffff880228015d68
      [23045.436044]  0000000000000000 ffffffff82a54b70 ffffffff829c3f00 ffff880228015d68
      [23045.436044] Call Trace:
      [23045.436044]  [<ffffffff8127074d>] dump_stack+0x67/0x90
      [23045.436044]  [<ffffffff81092897>] __lock_acquire+0xcbb/0xe4e
      [23045.436044]  [<ffffffff8109155f>] ? mark_lock+0x24/0x201
      [23045.436044]  [<ffffffff8109179a>] ? mark_held_locks+0x5e/0x74
      [23045.436044]  [<ffffffff81092de0>] lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffff81092de0>] ? lock_acquire+0x12f/0x1c3
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff814a51a4>] mutex_lock_nested+0x77/0x3a7
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa039705e>] ? btrfs_release_delayed_node+0xb/0xd [btrfs]
      [23045.436044]  [<ffffffffa038552d>] btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa038552d>] ? btrfs_log_inode+0x13a/0xc95 [btrfs]
      [23045.436044]  [<ffffffff810a0ed1>] ? vprintk_emit+0x453/0x465
      [23045.436044]  [<ffffffffa0385a61>] btrfs_log_inode+0x66e/0xc95 [btrfs]
      [23045.436044]  [<ffffffffa03c084d>] log_new_dir_dentries+0x26c/0x359 [btrfs]
      [23045.436044]  [<ffffffffa03865aa>] btrfs_log_inode_parent+0x4a6/0x628 [btrfs]
      [23045.436044]  [<ffffffffa0387552>] btrfs_log_dentry_safe+0x5a/0x75 [btrfs]
      [23045.436044]  [<ffffffffa035f464>] btrfs_sync_file+0x304/0x425 [btrfs]
      [23045.436044]  [<ffffffff811acaf4>] vfs_fsync_range+0x8c/0x9e
      [23045.436044]  [<ffffffff811acb22>] vfs_fsync+0x1c/0x1e
      [23045.436044]  [<ffffffff811acc79>] do_fsync+0x31/0x4a
      [23045.436044]  [<ffffffff811ace99>] SyS_fsync+0x10/0x14
      [23045.436044]  [<ffffffff814a88e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [23045.436044]  [<ffffffff8108f039>] ? trace_hardirqs_off_caller+0x3f/0xaa
      
      An example reproducer for this is:
      
         $ mkfs.btrfs -f /dev/sdb
         $ mount /dev/sdb /mnt
         $ mkdir /mnt/dir
         $ touch /mnt/dir/foo
         $ sync
         $ mv /mnt/dir/foo /mnt/dir/bar
         $ touch /mnt/dir/foo
         $ xfs_io -c "fsync" /mnt/dir/bar
      
      This happens because, while logging the inode of file bar, we end up
      logging its parent directory (since its inode has a last_unlink_trans
      field matching the current transaction id, due to the rename
      operation), which in turn logs the inodes for all its new dentries.
      That way the new inode for the new file named foo gets logged, which
      in turn triggers another logging attempt for the inode we are
      fsync'ing, since that inode had an old name that corresponds to the
      name of the new inode.
      
      Fix this by ensuring that, when logging the inode for a new dentry
      whose name matches an old name of some other inode, we do not log the
      original inode that we are fsync'ing again.
      
      Fixes: 44f714da ("Btrfs: improve performance on fsync against new inode after rename/unlink")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Committed by Wang Xiaoguang
      This patch fixes some false ENOSPC errors; the test script below
      reproduces one of them:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      The above script fails with ENOSPC even though the fs still has enough
      free space to satisfy the request. See the call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note: the bytes_may_use decrease operation is only done
              |   at the end of btrfs_fallocate(), which is too late.
      
      Here is another simple case, for a buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      In CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() does not decrease bytes_may_use; the
      decrease is delayed until extent_clear_unlock_delalloc() runs. Assume
      that in this case btrfs_reserve_extent() reserved 128MB of data and
      CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data
      space.
      If

        100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
                data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
                data_sinfo->bytes_readonly - data_sinfo->bytes_may_use

      then btrfs_check_data_free_space() will try to allocate a new data
      chunk, call btrfs_start_delalloc_roots(), or commit the current
      transaction in order to reclaim some free space, which is obviously a
      lot of work. But none of that is necessary: as long as bytes_may_use
      is decreased in a timely manner (the 128MB subtracted as soon as the
      extent is reserved), there is still free space available.
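
      Expressed as shell arithmetic (a sketch; the variable names simply
      mirror the btrfs_space_info fields above, and "want" stands for the
      100MB being reserved):

        if (( want > total_bytes - bytes_used - bytes_reserved \
                     - bytes_pinned - bytes_readonly - bytes_may_use )); then
            echo "no room: flush delalloc, allocate a chunk or commit"
        fi

      With bytes_may_use still inflated by the stale 128MB, the condition
      fires even though the space is really free.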
      
      To fix this issue, this patch chooses to update bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For the compress
      path, the real extent length may not be equal to the file content
      length, so a ram_bytes argument is introduced for
      btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(); this is because bytes_may_use is increased
      by the file content length, and with ram_bytes the compress path can
      update bytes_may_use correctly. Also, we can now discard
      RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE.
      
      As we know, EXTENT_DO_ACCOUNTING is usually used for the error path.
      In run_delalloc_nocow(), for inodes marked as NODATACOW or extents
      marked as PREALLOC, we also need to update bytes_may_use, but we
      cannot pass EXTENT_DO_ACCOUNTING because it also clears the metadata
      reservation. So we introduce the EXTENT_CLEAR_DATA_RESV flag to tell
      btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use.
      
      Meanwhile, __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both the successful
      and the failed path, so btrfs_prealloc_file_range()'s callers do not
      need to call btrfs_free_reserved_data_space() any more.
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 01 Aug 2016, 1 commit
    • Btrfs: add missing check for writeback errors on fsync · 0596a904
      Committed by Filipe Manana
      When we start an fsync we start ordered extents for all delalloc ranges.
      However before attempting to log the inode, we only wait for those ordered
      extents if we are not doing a full sync (bit BTRFS_INODE_NEEDS_FULL_SYNC
      is set in the inode's flags). This means that if an ordered extent
      completes with an IO error before we check if we can skip logging the
      inode, we will not catch and report the IO error to user space. This is
      because on an IO error, when the ordered extent completes, we do not
      update the inode, so if the inode was not previously updated by the
      current transaction we end up not logging it through calls to fsync,
      and therefore never check its mapping flags for the presence of IO
      errors.
      
      Fix this by checking for errors in the flags of the inode's mapping when
      we notice we can skip logging the inode.
      
      This caused sporadic failures in the test generic/331 (which explicitly
      tests for IO errors during an fsync call).
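
      A rough sketch of the failure injection that test performs (the
      xfstests _dmerror helpers do this more carefully; device names are
      hypothetical):

        size=$(blockdev --getsz /dev/sdc)
        dmsetup create err-test --table "0 $size linear /dev/sdc 0"
        mkfs.btrfs -f /dev/mapper/err-test
        mount /dev/mapper/err-test /mnt
        xfs_io -f -c "pwrite 0 64K" /mnt/foo
        # switch the device to the error target so writeback fails ...
        dmsetup suspend --nolockfs err-test
        dmsetup load err-test --table "0 $size error"
        dmsetup resume err-test
        # ... then fsync must report the IO error (EIO) to user space
        xfs_io -c "fsync" /mnt/foo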
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  14. 26 Jul 2016, 2 commits