1. 08 Oct 2015, 1 commit
  2. 22 Sep 2015, 1 commit
    • C
      Btrfs: Direct I/O: Fix space accounting · 50745b0a
      Committed by chandan
      The following call trace is seen when the generic/095 test is executed:
      
      WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
      Modules linked in:
      CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
       ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
       0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
       ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
      Call Trace:
       [<ffffffff81984058>] dump_stack+0x45/0x57
       [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
       [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
       [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
       [<ffffffff8117ce07>] destroy_inode+0x37/0x60
       [<ffffffff8117cf39>] evict+0x109/0x170
       [<ffffffff8117cfd5>] dispose_list+0x35/0x50
       [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
       [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
       [<ffffffff81165951>] kill_anon_super+0x11/0x20
       [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
       [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
       [<ffffffff811660cf>] deactivate_super+0x5f/0x70
       [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
       [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
       [<ffffffff81069c06>] task_work_run+0x96/0xb0
       [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
       [<ffffffff8198cbc2>] int_signal+0x12/0x17
      
      This means that the inode had non-zero "outstanding extents" during
      eviction. This occurs because, during direct I/O, a task that successfully
      used up its reserved data space sets the BTRFS_INODE_DIO_READY bit but does
      not clear it after finishing the DIO write. A future DIO write could then
      fail, and its unused reserved space would not be freed because of the
      previously set BTRFS_INODE_DIO_READY bit.
      
      Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
      following issue,
      |-----------------------------------+-------------------------------------|
      | Task A                            | Task B                              |
      |-----------------------------------+-------------------------------------|
      | Start direct i/o write on inode X.|                                     |
      | reserve space                     |                                     |
      | Allocate ordered extent           |                                     |
      | release reserved space            |                                     |
      | Set BTRFS_INODE_DIO_READY bit.    |                                     |
      |                                   | splice()                            |
      |                                   | Transfer data from pipe buffer to   |
      |                                   | destination file.                   |
      |                                   | - kmap(pipe buffer page)            |
      |                                   | - Start direct i/o write on         |
      |                                   |   inode X.                          |
      |                                   |   - reserve space                   |
      |                                   |   - dio_refill_pages()              |
      |                                   |     - sdio->blocks_available == 0   |
      |                                   |     - Since a kernel address is     |
      |                                   |       being passed instead of a     |
      |                                   |       user space address,           |
      |                                   |       iov_iter_get_pages() returns  |
      |                                   |       -EFAULT.                      |
      |                                   |   - Since BTRFS_INODE_DIO_READY is  |
      |                                   |     set, we don't release reserved  |
      |                                   |     space.                          |
      |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
      | -EIOCBQUEUED is returned.         |                                     |
      |-----------------------------------+-------------------------------------|
      
      Hence this commit introduces "struct btrfs_dio_data" to track the usage of
      reserved data space. The remaining unused "reserve space" can now be freed
      reliably.
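      As a rough sketch of the idea (field names here are illustrative and not
      necessarily the ones used by the actual patch), the per-call bookkeeping
      replaces the shared inode bit:

        /* Illustrative sketch only, not the literal patch. */
        struct btrfs_dio_data {
                u64 outstanding_extents;  /* extents reserved for this DIO write */
                u64 reserve;              /* reserved data bytes not yet consumed */
        };

        /*
         * The intent is that btrfs_direct_IO() keeps one of these per call,
         * the get_blocks path decrements ->reserve as space is actually used,
         * and the error path releases only whatever remains in ->reserve.
         * Two concurrent DIO writes (e.g. the splice case above) then no
         * longer share state.
         */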
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      50745b0a
  3. 15 Sep 2015, 1 commit
    • J
      btrfs: skip waiting on ordered range for special files · a30e577c
      Committed by Jeff Mahoney
      In btrfs_evict_inode, we properly truncate the page cache for evicted
      inodes but then we call btrfs_wait_ordered_range for every inode as well.
      It's the right thing to do for regular files but results in incorrect
      behavior for device inodes for block devices.
      
      filemap_fdatawrite_range gets called with inode->i_mapping which gets
      resolved to the block device inode before getting passed to
      wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
      next depends on whether there's an open file handle associated with the
      inode.  If there is, we write to the block device, which is unexpected
      behavior.  If there isn't, we fall through normally and inode->i_data is used.
      We can also end up racing against open/close which can result in crashes
      when i_mapping points to a block device inode that has been closed.
      
      Since there can't be any page cache associated with special file inodes,
      it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
      the problem.
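      The shape of the fix is roughly the following (a sketch of the evict path,
      not the exact diff):

        /* Sketch; details of the real btrfs_evict_inode() are omitted. */
        static void example_evict_inode(struct inode *inode)
        {
                truncate_inode_pages_final(&inode->i_data);

                /*
                 * Special files (block/char devices, fifos, sockets) have no
                 * page cache to flush here, and waiting would redirect
                 * writeback to the block device inode, so skip the wait.
                 */
                if (!special_file(inode->i_mode))
                        btrfs_wait_ordered_range(inode, 0, (u64)-1);

                /* ... rest of eviction ... */
        }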
      
      Cc: <stable@vger.kernel.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
      Tested-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      a30e577c
  4. 01 Sep 2015, 1 commit
  5. 14 Aug 2015, 1 commit
  6. 09 Aug 2015, 2 commits
    • C
      Btrfs: add support for blkio controllers · da2f0f74
      Committed by Chris Mason
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required: we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is that we are able to throttle nicely on single disk
      filesystems.  A little more work is required for multi-device filesystems.
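      As a rough illustration of the writepages side (the real call sites in
      btrfs are more involved, and the helper below is hypothetical), the
      submission path ties each bio to the writeback_control's blkcg and
      accounts the pages it adds:

        /* Hypothetical helper; shows the two wbc calls named above. */
        static void example_account_writepage(struct writeback_control *wbc,
                                              struct bio *bio, struct page *page)
        {
                wbc_init_bio(wbc, bio);                /* associate bio with wbc's blkcg */
                bio_add_page(bio, page, PAGE_SIZE, 0);
                wbc_account_io(wbc, page, PAGE_SIZE);  /* charge the I/O to that cgroup */
        }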
      Signed-off-by: Chris Mason <clm@fb.com>
      da2f0f74
    • F
      Btrfs: fix stale dir entries after unlink, inode eviction and fsync · bde6c242
      Committed by Filipe Manana
      If we remove a hard link from an inode, the inode gets evicted, we then
      fsync the inode and then power fail/crash. When the log tree is
      replayed, the parent directory inode still has entries pointing to
      the name that no longer exists, while our inode no longer has the
      BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
      leaving the filesystem in an inconsistent state. The stale directory
      entries can not be deleted (an attempt to delete them causes -ESTALE
      errors), which makes it impossible to delete the parent directory.
      
      This happens because we track the id of the transaction where the last
      unlink operation for the inode happened (last_unlink_trans) in an
      in-memory only field of the inode, that is, a value that is never
      persisted in the inode item stored on the fs/subvol btree. So if an
      inode is evicted and loaded again, the value for last_unlink_trans is
      set to 0, which prevents the fsync from logging the parent directory
      at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
      to the id of the transaction that last modified the inode when we
      load the inode. This is a pessimistic approach but it always ensures
      correctness with the trade-off of occasional full transaction commits
      when an fsync is done against the inode in the same transaction where
      it was evicted and reloaded when our inode is a directory and often
      logging its parent unnecessarily when our inode is not a directory.
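      In code terms the pessimistic initialization looks roughly like this, done
      when the inode is loaded (a sketch, not the exact hunk):

        /* Sketch of the pessimistic initialization when loading an inode. */
        static void example_after_inode_load(struct inode *inode)
        {
                /*
                 * last_unlink_trans is never persisted, so after an evict and
                 * reload we can not know whether an unlink happened in the
                 * current transaction. Assume the worst: pretend the last
                 * unlink happened in the transaction that last modified the
                 * inode.
                 */
                BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans;
        }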
      
      The following test case for fstests triggers the problem:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with 2 hard links.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/testdir/foo
        ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Now remove one of the links, trigger inode eviction and then fsync
        # our inode.
        unlink $SCRATCH_MNT/testdir/bar
        echo 2 > /proc/sys/vm/drop_caches
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
      
        # Silently drop all writes on our scratch device to simulate a power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount the fs to trigger log/journal replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now verify our directory entries.
        echo "Entries in testdir:"
        ls -1 $SCRATCH_MNT/testdir
      
        # If we remove our inode, its parent should become empty and therefore we should
        # be able to remove the parent.
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        # The fstests framework will call fsck against our filesystem which will verify
        # that all metadata is in a consistent state.
      
        status=0
        exit
      
      The test failed on btrfs with:
      
        generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
          --- tests/generic/098.out	2015-07-23 18:01:12.616175932 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad	2015-07-23 18:04:58.924138308 +0100
          @@ -1,3 +1,6 @@
           QA output created by 098
           Entries in testdir:
          +bar
           foo
          +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
          +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
          ...
          (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad'  to see the entire diff)
        _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
      
        $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
        (...)
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
           unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
           unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
        Checking filesystem on /dev/sdc
        (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bde6c242
  7. 29 Jul 2015, 1 commit
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Committed by Christoph Hellwig
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
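      For bio submitters and drivers the change boils down to something like the
      following sketch of the old vs. new completion style:

        /* Before: error passed to the callback and/or encoded in BIO_UPTODATE. */
        static void old_style_end_io(struct bio *bio, int error)
        {
                if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
                        error = -EIO;
                /* ... */
        }

        /* After: the errno lives in the bio itself and survives chaining. */
        static void new_style_end_io(struct bio *bio)
        {
                int error = bio->bi_error;
                /* ... */
        }

        static void complete_with_error(struct bio *bio, int error)
        {
                bio->bi_error = error;
                bio_endio(bio);         /* ->bi_end_io(bio) reads bio->bi_error */
        }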
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      4246a0b6
  8. 12 Jul 2015, 1 commit
    • F
      Btrfs: fix shrinking truncate when the no_holes feature is enabled · c1aa4575
      Committed by Filipe Manana
      If the no_holes feature is enabled, we attempt to shrink a file to a size
      that ends up in the middle of a hole, and we don't have any file extent
      items in the fs/subvol tree that go beyond the new file size (or any
      ordered extents that will insert such file extent items), we end up not
      updating the inode's disk_i_size; we only update the inode's i_size.
      
      This means that after unmounting and mounting the filesystem, or after
      the inode is evicted and reloaded, its i_size ends up being incorrect
      (an inode's i_size is set to the disk_i_size field when an inode is
      loaded). This happens when btrfs_truncate_inode_items() doesn't find
      any file extent items to drop - in this case it never makes a call to
      btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
      
      Example reproducer:
      
        $ mkfs.btrfs -O no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create our test file with some data and durably persist it.
        $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
        $ sync
      
        # Append some data to the file, increasing its size, and leave a hole
  # between the old size and the start offset of the following write. So
        # our file gets a hole in the range [128Kb, 256Kb[.
        $ xfs_io -c "truncate 160K" /mnt/foo
      
        # We expect to see our file with a size of 160Kb, with the first 128Kb
        # of data all having the value 0xaa and the remaining 32Kb of data all
        # having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0500000
      
        # Now cleanly unmount and mount again the filesystem.
        $ umount /mnt
        $ mount /dev/sdd /mnt
      
        # We expect to get the same result as before, a file with a size of
        # 160Kb, with the first 128Kb of data all having the value 0xaa and the
        # remaining 32Kb of data all having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000
      
      In the example above the file size/data do not match what they were before
      the remount.
      
      Fix this by always calling btrfs_ordered_update_i_size() with a size
      matching the size the file was truncated to if btrfs_truncate_inode_items()
      is not called for a log tree and no file extent items were dropped. This
      ensures the same behaviour as when the no_holes feature is not enabled.
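      The fix amounts to something along these lines at the end of the truncation
      loop (a sketch; the surrounding code in btrfs_truncate_inode_items() is more
      involved and the variable name is assumed):

        /* Sketch of the tail of btrfs_truncate_inode_items(). */
                /*
                 * Even when no file extent items were dropped (the new size
                 * fell inside a hole with no items beyond it), the on-disk
                 * i_size still has to be pulled down to the truncated size.
                 * Log tree truncation is excluded, as before.
                 */
                if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
                        btrfs_ordered_update_i_size(inode, new_size, NULL);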
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      c1aa4575
  9. 02 Jul 2015, 4 commits
    • L
      Btrfs: fix warning of bytes_may_use · ddba1bfc
      Committed by Liu Bo
      While running generic/019, dmesg got several warnings from
      btrfs_free_reserved_data_space().
      
      Test generic/019 produces some disk failures, so submitting the dio gets
      errors, in which case btrfs_direct_IO() goes to the error handling and frees
      bytes_may_use, but the problem is that bytes_may_use has already been freed
      during get_block().

      This adds a runtime flag to record whether we've gone through get_block();
      if so, we don't do the cleanup work.
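      Conceptually the flag works as in the sketch below; the wrapper around the
      actual dio submission is a placeholder, the flag name is the inode runtime
      bit discussed in the commits above, and the btrfs_delalloc_* space helpers
      are assumed to behave as their names suggest:

        /* Conceptual sketch; submit_the_dio() is a placeholder. */
        extern ssize_t submit_the_dio(struct inode *inode);

        static ssize_t example_direct_io(struct inode *inode, u64 count)
        {
                ssize_t ret;

                if (btrfs_delalloc_reserve_space(inode, count))
                        return -ENOSPC;

                /*
                 * The get_block path sets BTRFS_INODE_DIO_READY once it has
                 * consumed (and itself released) the reservation.
                 */
                ret = submit_the_dio(inode);

                if (ret < 0 && ret != -EIOCBQUEUED &&
                    !test_and_clear_bit(BTRFS_INODE_DIO_READY,
                                        &BTRFS_I(inode)->runtime_flags))
                        btrfs_delalloc_release_space(inode, count);

                return ret;
        }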
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ddba1bfc
    • L
      Btrfs: fix hang when failing to submit bio of directIO · ad9ee205
      Committed by Liu Bo
      The hang is uncovered by generic/019.

      btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
      an error, thus the added ordered extents will never get processed, which
      blocks processes that are waiting for them via btrfs_start_ordered_extent().
      
      This fixes the above, and meanwhile finish_ordered_fn will do the space
      accounting work.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      ad9ee205
    • F
      Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() · 9c6429d9
      Committed by Filipe Manana
      The comment was not correct in the part where it says the bio's endio
      callback might not have been called yet - update it to mention that, by
      that time, the endio callback execution might merely still be in progress.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9c6429d9
    • F
      Btrfs: fix memory corruption on failure to submit bio for direct IO · 61de718f
      Committed by Filipe Manana
      If we fail to submit a bio for a direct IO request, we were grabbing the
      corresponding ordered extent and decrementing its reference count twice,
      once for our lookup reference and once for the ordered tree reference.
      This was a problem because it caused the ordered extent to be freed
      without removing it from the ordered tree and any lists it might be
      attached to, leaving dangling pointers to the ordered extent around.
      Example trace with CONFIG_DEBUG_PAGEALLOC=y:
      
      [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
      [161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
      [161779.860636] PGD 34d818067 PUD 0
      [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [161779.860636] Call Trace:
      [161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
      [161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
      [161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
      [161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
      [161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
      [161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
      [161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
      (...)
      
      We were also not freeing the btrfs_dio_private we allocated previously,
      which kmemleak reported with the following trace in its sysfs file:
      
      unreferenced object 0xffff8803f553bf80 (size 96):
        comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
        hex dump (first 32 bytes):
          88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
          00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81161ffe>] create_object+0x172/0x29a
          [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
          [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
          [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
          [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
          [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
          [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
          [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
          [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
          [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
          [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
          [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
          [<ffffffff81165da9>] vfs_write+0xa0/0xe4
          [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
          [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      For read requests we weren't doing any cleanup either (none of the work
      done by btrfs_endio_direct_read()), so a failure submitting a bio for a
      read request would leave a range in the inode's io_tree locked forever,
      blocking any future operations (both reads and writes) against that range.
      
      So fix this by making sure we do the same cleanup that we do for the case
      where the bio submission succeeds.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      61de718f
  10. 03 Jun 2015, 1 commit
    • F
      Btrfs: fix hang during inode eviction due to concurrent readahead · 6ca07097
      Committed by Filipe Manana
      Zygo Blaxell and other users have reported occasional hangs while an
      inode is being evicted, leading to traces like the following:
      
      [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
      [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
      [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
      [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
      [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
      [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
      [ 5281.981396] Call Trace:
      [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
      [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
      [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
      [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
      [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
      [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
      [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
      [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
      [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
      [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 5282.001299] 1 lock held by rm/20488:
      [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
      
      This happens when we have readahead, which calls readpages(), happening
      right before the inode eviction handler is invoked. So the reason is
      essentially:
      
      1) readpages() is called while a reference on the inode is held, so
         eviction can not be triggered before readpages() returns. It also
         locks one or more ranges in the inode's io_tree (which is done at
         extent_io.c:__do_contiguous_readpages());
      
      2) readpages() submits several read bios, all with an end io callback
         that runs extent_io.c:end_bio_extent_readpage() and that is executed
         by other task when a bio finishes, corresponding to a work queue
         (fs_info->end_io_workers) worker kthread. This callback unlocks
         the ranges in the inode's io_tree that were previously locked in
         step 1;
      
      3) readpages() returns, the reference on the inode is dropped;
      
      4) One or more of the read bios previously submitted are still not
         complete (their end io callback was not yet invoked or has not
         yet finished execution);
      
      5) Inode eviction is triggered (through an unlink call for example).
         The inode reference count was not incremented before submitting
         the read bios, therefore this is possible;
      
      6) The eviction handler starts executing and enters the loop that
         iterates over all extent states in the inode's io_tree;
      
      7) The loop picks one extent state record and uses its ->start and
         ->end fields, after releasing the inode's io_tree spinlock, to
         call lock_extent_bits() and clear_extent_bit(). The call to lock
         the range [state->start, state->end] blocks because the whole
         range or a part of it was locked by the previous call to
         readpages() and the corresponding end io callback, which unlocks
         the range was not yet executed;
      
      8) The end io callback for the read bio is executed and unlocks the
         range [state->start, state->end] (or a superset of that range).
         And at clear_extent_bit() the extent_state record state is used
         as a second argument to split_state(), which sets state->start to
         a larger value;
      
      9) The task executing the eviction handler is woken up by the task
         executing the bio's end io callback (through clear_state_bit) and
         the eviction handler locks the range
         [old value for state->start, state->end]. Shortly after, when
         calling clear_extent_bit(), it unlocks the range
         [new value for state->start, state->end], so it ends up unlocking
         only part of the range that it locked, leaving an extent state
         record in the io_tree that represents the unlocked subrange;
      
      10) The eviction handler loop, in its next iteration, gets the
          extent_state record for the subrange that it did not unlock in the
          previous step and then tries to lock it, resulting in a hang.
      
      So fix this by not using the ->start and ->end fields of an existing
      extent_state record. This is a simple solution, and an alternative
      could be to bump the inode's reference count before submitting each
      read bio and having it dropped in the bio's end io callback. But that
      would be a more invasive/complex change and would not protect against
      other possible places that are not holding a reference on the inode
      as well. Something to consider in the future.
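      The core of the fix is to snapshot the range boundaries while still holding
      the io_tree spinlock, so that concurrent splitting of the extent_state record
      by the endio path cannot change what gets locked and unlocked. A trimmed
      sketch of the eviction loop:

        /* Sketch; error handling and some of the real flags are trimmed. */
        static void example_evict_truncate(struct extent_io_tree *io_tree)
        {
                spin_lock(&io_tree->lock);
                while (!RB_EMPTY_ROOT(&io_tree->state)) {
                        struct extent_state *state;
                        struct extent_state *cached_state = NULL;
                        u64 start, end;

                        state = rb_entry(rb_first(&io_tree->state),
                                         struct extent_state, rb_node);
                        start = state->start;  /* copy before dropping the lock;   */
                        end = state->end;      /* 'state' may be split concurrently */
                        spin_unlock(&io_tree->lock);

                        lock_extent_bits(io_tree, start, end, 0, &cached_state);
                        clear_extent_bit(io_tree, start, end,
                                         EXTENT_LOCKED | EXTENT_DELALLOC |
                                         EXTENT_DO_ACCOUNTING, 1, 1,
                                         &cached_state, GFP_NOFS);

                        spin_lock(&io_tree->lock);
                }
                spin_unlock(&io_tree->lock);
        }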
      
      Many thanks to Zygo Blaxell for reporting the issue on the mailing list,
      providing a set of scripts to trigger it, and testing this fix.
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6ca07097
  11. 26 Apr 2015, 1 commit
    • Y
      Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Committed by Yang Dongsheng
      We need to fill the inode when we find a node for it in delayed_nodes_tree.
      But currently we do not fill ->last_trans, which causes xfstests
      generic/311 to fail. The scenario of generic/311 is shown below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we find a node in
      delayed_nodes_tree and fill the inode we are looking up with its
      information. But ->last_trans is not filled, so fsync() checks
      ->last_trans, finds it is 0, and assumes this inode is already in a
      committed tree, so it does not record the extents for it.
      
      Fix:
      	This patch fills ->last_trans properly and sets the
      runtime_flags if needed in this situation. Then we get the log
      entries we expect after (6) and generic/311 passes.
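      A sketch of the fill path (the accessor follows the usual btrfs stack
      helpers; the exact hunk may differ):

        /* Sketch of btrfs_fill_inode(); only the relevant assignment is shown. */
        static void example_fill_inode(struct inode *inode,
                                       struct btrfs_inode_item *inode_item)
        {
                /* ... copy size, mode, uid/gid, timestamps from inode_item ... */

                /*
                 * Without this, an inode refilled from the delayed node has
                 * last_trans == 0 and fsync() wrongly believes it is already
                 * fully committed.
                 */
                BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
        }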
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: Miao Xie <miaoxie@huawei.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      6e17d30b
  12. 25 Apr 2015, 1 commit
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Committed by Jens Axboe
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
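      Sketched out, the gate in the dio core looks something like the fragment
      below; inode_dio_begin()/inode_dio_end() stand for the inc/dec of
      inode->i_dio_count, and the helper names should be treated as assumptions:

        /* Sketch of the gate inside do_blockdev_direct_IO(). */
                if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                        inode_dio_begin(inode);  /* atomic_inc(&inode->i_dio_count) */

                /* ... submit and complete the I/O ... */

                if (!(dio->flags & DIO_SKIP_DIO_COUNT))
                        inode_dio_end(inode);    /* dec and wake up any truncate waiter */

      A block device caller then passes DIO_SKIP_DIO_COUNT in its dio flags so the
      shared atomic is never touched.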
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
  13. 16 Apr 2015, 1 commit
  14. 13 Apr 2015, 3 commits
    • D
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Committed by Dongsheng Yang
      There are two problems in qgroup:

      a). The page cache size is 4K, so even when we are writing only 1K of
      data, qgroup will reserve 4K. This makes the last 3K of a qgroup
      unavailable to the user.

      b). When the user is writing inline data, qgroup does not reserve for
      it, which means there is a window in which we can exceed the limit of
      a qgroup.
      
      The main idea of this patch is to reserve the data size of write_bytes
      rather than reserve_bytes. That means qgroup does not care about the
      data size btrfs will reserve for the user, but only about the data size
      the user is going to write. We reserve it when the user wants to write
      and release it when the transaction is committed.

      In this way, qgroup is freed from the complex reservation procedure in
      btrfs: it only does the reservation when the user wants to write and
      the accounting when the data is written in commit_transaction().
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e2d1f923
    • D
      Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use. · 31193213
      Committed by Dongsheng Yang
      Currently, for pre_alloc or delay_alloc, the bytes are accounted in
      space_info by three counters:
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      But in qgroup there are only two counters to account those bytes,
      qgroup->reserved and qgroup->excl: qg->reserved accounts for bytes in
      space_info->bytes_may_use and qg->excl accounts for bytes in
      space_info->used. So the bytes in space_info->reserved are not accounted
      in qgroup at all, and there is a window in which we can exceed the quota
      limit while bytes sit in space_info->reserved.
      
      Example:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer     excl     max_rfer max_excl parent  child
      -------- ----     ----     -------- -------- ------  -----
      0/5      20987904 20987904 0        10485760 ---     ---
      
      qg->excl is 20987904, which is larger than max_excl 10485760.
      
      This patch introduces a new counter named may_use to qgroup, so that
      there are three counters in qgroup to account bytes in space_info,
      as below:
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      qgroup->may_use           --- qgroup->reserved     --- qgroup->excl
      
      With this patch applied:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer    excl    max_rfer max_excl parent  child
      -------- ----    ----    -------- -------- ------  -----
      0/5      9453568 9453568 0        10485760 ---     ---
      Reported-by: Cyril SCETBON <cyril.scetbon@free.fr>
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      31193213
    • Z
      btrfs: Fix NO_SPACE bug caused by delayed-iput · d7c15171
      Committed by Zhao Lei
      Steps to reproduce:
        while true; do
          dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
          rm /btrfs_dir/file
          sync
        done
      
        And we'll see dd fail because btrfs returns NO_SPACE.
      
      Reason:
        Normally, btrfs_commit_transaction() calls btrfs_run_delayed_iputs()
        at the end to free fs space for the next write, but sometimes the work
        is not done in time, because the btrfs-cleaner thread took the delayed
        iputs off the list earlier but only does the iput() after the next
        write.
      
        This is the log:
        [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
      
        [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
        [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
        [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
      
        [ 2569.191081] comm=dd begin
        [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
      
        [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
        [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
        ...
        [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
        [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
      
      Fix:
        Make btrfs_commit_transaction() wait, at its end, for the currently
        running btrfs-cleaner delayed iputs to finish.
      
      Test:
        Use a script similar to the above (but more complex):
        before patch:
          7 failed in 100 * 20 loop.
        after patch:
          0 failed in 100 * 20 loop.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d7c15171
  15. 12 Apr 2015, 3 commits
  16. 11 Apr 2015, 4 commits
    • J
      Btrfs: don't steal from the global reserve if we don't have the space · 3bce876f
      Committed by Josef Bacik
      btrfs_evict_inode() needs to be more careful about stealing from the
      global_rsv.  We don't want to end up aborting the commit with ENOSPC just
      because the evict_inode code was too greedy.
      Signed-off-by: Chris Mason <clm@fb.com>
      3bce876f
    • C
      Btrfs: refill block reserves during truncate · 28f75a0e
      Committed by Chris Mason
      When truncate starts, it allocates some space in the block reserves so
      that we'll have enough to update metadata along the way.
      
      For very large files, we can easily go through all of that space as we
      loop through the extents.  This changes truncate to refill the space
      reservation as it progresses through the file.
      Signed-off-by: Chris Mason <clm@fb.com>
      28f75a0e
    • J
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Committed by Josef Bacik
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: Chris Mason <clm@fb.com>
      1262133b
    • C
      btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Committed by Chris Mason
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
      Signed-off-by: Chris Mason <clm@fb.com>
      28ed1345
  17. 02 Apr 2015, 1 commit
  18. 26 Mar 2015, 1 commit
  19. 18 Mar 2015, 3 commits
    • J
      Btrfs: fix outstanding_extents accounting in DIO · e1cbbfa5
      Committed by Josef Bacik
      We are keeping track of how many extents we need to reserve properly based on
      the amount we want to write, but we were still incrementing outstanding_extents
      if we wrote less than what we requested.  This isn't quite right since we will
      be limited to our max extent size.  So instead let's do something horrible!  Keep
      track of how many outstanding_extents we reserved, and decrement each time we
      allocate an extent.  If we use our entire reserve make sure to jack up
      outstanding_extents on the inode so the accounting works out properly.  Thanks,
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      e1cbbfa5
    • J
      Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5
      Committed by Josef Bacik
      I introduced a regression wrt outstanding_extents accounting.  These are tricky
      areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
      at any time.  So add sanity tests to cover the various conditions that are
      tricky in order to make sure we don't introduce regressions in the future.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      6a3891c5
    • J
      Btrfs: account merges/splits properly · ba117213
      Committed by Josef Bacik
      My fix
      
      Btrfs: fix merge delalloc logic
      
      only fixed half of the problems; it didn't fix the case where we have two large
      extents on either side and then join them together with a new small extent.  We
      need to instead keep track of how many extents we have accounted for with each
      side of the new extent, and then see how many extents we need for the new large
      extent.  If they match then we know we need to keep our reservation, otherwise
      we need to drop our reservation.  This shows up with a case like this
      
      [BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
      
      Previously the logic would have said that the number of extents required for
      the new size (3) is larger than the number of extents required for the largest
      side (2), therefore we need to keep our reservation.  But this isn't the case,
      since both sides require a reservation of 2, which leads to 4 for the whole
      range currently reserved, while we only need 3, so we need to drop one of the
      reservations.  The same problem existed for splits: we'd think we only need 3
      extents when creating the hole but in reality we need 4.  Thanks,
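      The arithmetic behind this example can be checked with the usual "extents
      needed for a byte range" computation (standalone sketch, assuming the 128MiB
      BTRFS_MAX_EXTENT_SIZE):

        /* Standalone check of the reservation arithmetic in the example above. */
        #include <stdio.h>
        #include <stdint.h>

        #define MAX_EXTENT_SIZE (128ULL * 1024 * 1024)  /* BTRFS_MAX_EXTENT_SIZE */
        #define SZ_4K           (4ULL * 1024)

        static uint64_t extents_for(uint64_t bytes)
        {
                /* round up: one reservation per started max-sized extent */
                return (bytes + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE;
        }

        int main(void)
        {
                uint64_t side = MAX_EXTENT_SIZE + SZ_4K;  /* each large extent */
                uint64_t merged = 2 * side + SZ_4K;       /* plus the 4K hole  */

                printf("per side : %llu extents\n",
                       (unsigned long long)extents_for(side));        /* 2 */
                printf("reserved : %llu extents in total\n",
                       (unsigned long long)(2 * extents_for(side)));  /* 4 */
                printf("needed   : %llu extents after the merge\n",
                       (unsigned long long)extents_for(merged));      /* 3 */
                /* 4 reserved vs. 3 needed: one reservation must be dropped. */
                return 0;
        }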
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      ba117213
  20. 14 Mar 2015, 1 commit
    • J
      Btrfs: fix merge delalloc logic · 8461a3de
      Committed by Josef Bacik
      My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
      regression when re-dirtying already dirty areas.  We have logic in split to make
      sure we are taking the largest space into account but didn't have it for merge,
      so it was sometimes making us think we were turning a tiny extent into a huge
      extent, when in reality we already had a huge extent and needed to use the other
      side in our logic.  This fixes the regression that was reported by a user
      on the list.  Thanks,
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      8461a3de
  21. 04 Mar 2015, 1 commit
  22. 03 Mar 2015, 1 commit
  23. 15 Feb 2015, 3 commits
    • J
      Btrfs: account for large extents with enospc · dcab6a3b
      Committed by Josef Bacik
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents, and our enospc accounting is
      quite a bit lower than what we need.  To keep track of this, make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      dcab6a3b
    • J
      Btrfs: don't set and clear delalloc for O_DIRECT writes · 3266789f
      Committed by Josef Bacik
      We do this to get the space accounting, but this is just needless churn on the
      io_tree, so just drop setting/clearing delalloc and just drop the reserved data
      space when we have a successful allocation.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3266789f
    • J
      Btrfs: only adjust outstanding_extents when we do a short write · 3e05bde8
      Committed by Josef Bacik
      We have this weird dance where we always inc outstanding_extents when we do
      an O_DIRECT write, even if we allocate the entire range.  To get around this we
      also drop the metadata space if we successfully write.  This is an unnecessary
      dance, we only need to jack up outstanding_extents if we don't satisfy the
      entire range request in get_blocks_direct, otherwise we are good using our
      original reservation.  So drop the unconditional inc and the drop of the
      metadata space that we have for the unconditional inc.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      3e05bde8
  24. 03 Feb 2015, 2 commits