1. 07 1月, 2016 6 次提交
    • D
      btrfs: cleanup, use enum values for btrfs_path reada · e4058b54
      David Sterba 提交于
      Replace the integers by enums for better readability. The value 2 does
      not have any meaning since a7175319
      "Btrfs: do less aggressive btree readahead" (2009-01-22).
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e4058b54
    • D
      btrfs: constify static arrays · 4d4ab6d6
      David Sterba 提交于
      There are a few statically initialized arrays that can be made const.
      The remaining (like file_system_type, sysfs attributes or prop handlers)
      do not allow that due to type mismatch when passed to the APIs or
      because the structures are modified through other members.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4d4ab6d6
    • D
      btrfs: constify remaining structs with function pointers · 20e5506b
      David Sterba 提交于
      * struct extent_io_ops
      * struct btrfs_free_space_op
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      20e5506b
    • D
      btrfs: don't use slab cache for struct btrfs_delalloc_work · 100d5702
      David Sterba 提交于
      Although we prefer to use separate caches for various structs, it seems
      better not to do that for struct btrfs_delalloc_work. Objects of this
      type are allocated rarely, when transaction commit calls
      btrfs_start_delalloc_roots, requesting delayed iputs.
      
      The objects are temporary (with some IO involved) but still allocated
      and freed within __start_delalloc_inodes. Memory allocation failure is
      handled.
      
      The slab cache is empty most of the time (observed on several systems),
      so if we need to allocate a new slab object, the first one has to
      allocate a full page. In a potential case of low memory conditions this
      might fail with higher probability compared to using the generic slab
      caches.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      100d5702
    • D
      btrfs: put delayed item hook into inode · 8089fe62
      David Sterba 提交于
      Inodes for delayed iput allocate a trivial helper structure, let's place
      the list hook directly into the inode and save a kmalloc (killing a
      __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
      
      The inode can be put into the delayed_iputs list more than once and we
      have to keep the count. This means we can't use the list_splice to
      process a bunch of inodes because we'd lost track of the count if the
      inode is put into the delayed iputs again while it's processed.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8089fe62
    • J
      Btrfs: igrab inode in writepage · be7bd730
      Josef Bacik 提交于
      We hit this panic on a few of our boxes this week where we have an
      ordered_extent with an NULL inode.  We do an igrab() of the inode in writepages,
      but weren't doing it in writepage which can be called directly from the VM on
      dirty pages.  If the inode has been unlinked then we could have I_FREEING set
      which means igrab() would return NULL and we get this panic.  Fix this by trying
      to igrab in btrfs_writepage, and if it returns NULL then just redirty the page
      and return AOP_WRITEPAGE_ACTIVATE; so the VM knows it wasn't successful.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      be7bd730
  2. 25 11月, 2015 1 次提交
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  3. 10 11月, 2015 1 次提交
  4. 09 11月, 2015 1 次提交
    • F
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 1d512cb7
      Filipe Manana 提交于
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      1d512cb7
  5. 05 11月, 2015 1 次提交
    • F
      Btrfs: fix extent accounting for partial direct IO writes · 9c9464cc
      Filipe Manana 提交于
      When doing a write using direct IO we can end up not doing the whole write
      operation using the direct IO path, in that case we fallback to a buffered
      write to do the remaining IO. This happens for example if the range we are
      writing to contains a compressed extent.
      When we do a partial write and fallback to buffered IO, due to the
      existence of a compressed extent for example, we end up not adjusting the
      outstanding extents counter of our inode which ends up getting decremented
      twice, once by the DIO ordered extent for the partial write and once again
      by btrfs_direct_IO(), resulting in an arithmetic underflow at
      extent-tree.c:drop_outstanding_extent(). For example if we have:
      
        extents        [ prealloc extent ] [ compressed extent ]
        offsets        A        B          C       D           E
      
      and at the moment our inode's outstanding extents counter is 0, if we do a
      direct IO write against the range [B, D[ (which has a length smaller than
      128Mb), we end up bumping our inode's outstanding extents counter to 1, we
      create a DIO ordered extent for the range [B, C[ and then fallback to a
      buffered write for the range [C, D[. The direct IO handler
      (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
      1, leaving it with a value of 0, through a call to
      btrfs_delalloc_release_space() and then shortly after the DIO ordered
      extent finishes and calls btrfs_delalloc_release_metadata() which ends
      up to attempt to decrement the inode's outstanding extents counter by 1,
      resulting in an assertion failure at drop_outstanding_extent() because
      the operation would result in an arithmetic underflow (0 - 1). This
      produces the following trace:
      
        [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
        [125471.338844] ------------[ cut here ]------------
        [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
        [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
        [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
        [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
        [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
        [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
        [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
        [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
        [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
        [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
        [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
        [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
        [125471.340745] Stack:
        [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
        [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
        [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
        [125471.340745] Call Trace:
        [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
        [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
        [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
        [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
        [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
        [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
        [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
        [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745]  RSP <ffff88040a11bc78>
        [125471.539620] ---[ end trace 144259f7838b4aa4 ]---
      
      So fix this by ensuring we adjust the outstanding extents counter when we
      do the fallback just like we do for the case where the whole write can be
      done through the direct IO path.
      
      We were also adjusting the outstanding extents counter by a constant value
      of 1, which is incorrect because we were ignorning that we account extents
      in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.
      
      The following test case for fstests reproduces this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "falloc"
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create a compressed extent covering the range [700K, 800K[.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Create prealloc extent covering the range [600K, 700K[.
        $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
      
        # Write 80K of data to the range [640K, 720K[ using direct IO. This
        # range covers both the prealloc extent and the compressed extent.
        # Because there's a compressed extent in the range we are writing to,
        # the DIO write code path ends up only writing the first 60k of data,
        # which goes to the prealloc extent, and then falls back to buffered IO
        # for writing the remaining 20K of data - because that remaining data
        # maps to a file range containing a compressed extent.
        # When falling back to buffered IO, we used to trigger an assertion when
        # releasing reserved space due to bad accounting of the inode's
        # outstanding extents counter, which was set to 1 but we ended up
        # decrementing it by 1 twice, once through the ordered extent for the
        # 60K of data we wrote using direct IO, and once through the main direct
        # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
        # wrote less than 80K of data (60K).
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now similar test as above but for very large write operations. This
        # triggers special cases for an inode's outstanding extents accounting,
        # as internally btrfs logically splits extents into 128Mb units.
        $XFS_IO_PROG -f -s \
            -c "pwrite -S 0xaa -b 128M 258M 128M" \
            -c "falloc 0 258M" \
            $SCRATCH_MNT/bar | _filter_xfs_io
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
            | _filter_xfs_io
      
        # Now verify the file contents are correct and that they are the same
        # even after unmounting and mounting the fs again (or evicting the page
        # cache).
        #
        # For file foo, all bytes in the range [0, 640K[ must have a value of
        # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
        # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
        #
        # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
        # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
        # bytes in the range [259M, 386M[ must have a value of 0xaa.
        #
        echo "File digests before remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
        _scratch_remount
        echo "File digests after remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
      
        status=0
        exit
      
      Fixes: e1cbbfa5 ("Btrfs: fix outstanding_extents accounting in DIO")
      Fixes: 3e05bde8 ("Btrfs: only adjust outstanding_extents when we do a short write")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      9c9464cc
  6. 27 10月, 2015 1 次提交
  7. 26 10月, 2015 1 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
  8. 22 10月, 2015 10 次提交
  9. 17 10月, 2015 1 次提交
    • F
      Btrfs: fix truncation of compressed and inlined extents · 0305cd5f
      Filipe Manana 提交于
      When truncating a file to a smaller size which consists of an inline
      extent that is compressed, we did not discard (or made unusable) the
      data between the new file size and the old file size, wasting metadata
      space and allowing for the truncated data to be leaked and the data
      corruption/loss mentioned below.
      We were also not correctly decrementing the number of bytes used by the
      inode, we were setting it to zero, giving a wrong report for callers of
      the stat(2) syscall. The fsck tool also reported an error about a mismatch
      between the nbytes of the file versus the real space used by the file.
      
      Now because we weren't discarding the truncated region of the file, it
      was possible for a caller of the clone ioctl to actually read the data
      that was truncated, allowing for a security breach without requiring root
      access to the system, using only standard filesystem operations. The
      scenario is the following:
      
         1) User A creates a file which consists of an inline and compressed
            extent with a size of 2000 bytes - the file is not accessible to
            any other users (no read, write or execution permission for anyone
            else);
      
         2) The user truncates the file to a size of 1000 bytes;
      
         3) User A makes the file world readable;
      
         4) User B creates a file consisting of an inline extent of 2000 bytes;
      
         5) User B issues a clone operation from user A's file into its own
            file (using a length argument of 0, clone the whole range);
      
         6) User B now gets to see the 1000 bytes that user A truncated from
            its file before it made its file world readbale. User B also lost
            the bytes in the range [1000, 2000[ bytes from its own file, but
            that might be ok if his/her intention was reading stale data from
            user A that was never supposed to be public.
      
      Note that this contrasts with the case where we truncate a file from 2000
      bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
      this case reading any byte from the range [1000, 2000[ will return a value
      of 0x00, instead of the original data.
      
      This problem exists since the clone ioctl was added and happens both with
      and without my recent data loss and file corruption fixes for the clone
      ioctl (patch "Btrfs: fix file corruption and data loss after cloning
      inline extents").
      
      So fix this by truncating the compressed inline extents as we do for the
      non-compressed case, which involves decompressing, if the data isn't already
      in the page cache, compressing the truncated version of the extent, writing
      the compressed content into the inline extent and then truncate it.
      
      The following test case for fstests reproduces the problem. In order for
      the test to pass both this fix and my previous fix for the clone ioctl
      that forbids cloning a smaller inline extent into a larger one,
      which is titled "Btrfs: fix file corruption and data loss after cloning
      inline extents", are needed. Without that other fix the test fails in a
      different way that does not leak the truncated data, instead part of
      destination file gets replaced with zeroes (because the destination file
      has a larger inline extent than the source).
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create our test files. File foo is going to be the source of a clone operation
        # and consists of a single inline extent with an uncompressed size of 512 bytes,
        # while file bar consists of a single inline extent with an uncompressed size of
        # 256 bytes. For our test's purpose, it's important that file bar has an inline
        # extent with a size smaller than foo's inline extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128"   \
                -c "pwrite -S 0x2a 128 384" \
                $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io
      
        # Now durably persist all metadata and data. We do this to make sure that we get
        # on disk an inline extent with a size of 512 bytes for file foo.
        sync
      
        # Now truncate our file foo to a smaller size. Because it consists of a
        # compressed and inline extent, btrfs did not shrink the inline extent to the
        # new size (if the extent was not compressed, btrfs would shrink it to 128
        # bytes), it only updates the inode's i_size to 128 bytes.
        $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo
      
        # Now clone foo's inline extent into bar.
        # This clone operation should fail with errno EOPNOTSUPP because the source
        # file consists only of an inline extent and the file's size is smaller than
        # the inline extent of the destination (128 bytes < 256 bytes). However the
        # clone ioctl was not prepared to deal with a file that has a size smaller
        # than the size of its inline extent (something that happens only for compressed
        # inline extents), resulting in copying the full inline extent from the source
        # file into the destination file.
        #
        # Note that btrfs' clone operation for inline extents consists of removing the
        # inline extent from the destination inode and copy the inline extent from the
        # source inode into the destination inode, meaning that if the destination
        # inode's inline extent is larger (N bytes) than the source inode's inline
        # extent (M bytes), some bytes (N - M bytes) will be lost from the destination
        # file. Btrfs could copy the source inline extent's data into the destination's
        # inline extent so that we would not lose any data, but that's currently not
        # done due to the complexity that would be needed to deal with such cases
        # (specially when one or both extents are compressed), returning EOPNOTSUPP, as
        # it's normally not a very common case to clone very small files (only case
        # where we get inline extents) and copying inline extents does not save any
        # space (unlike for normal, non-inlined extents).
        $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Now because the above clone operation used to succeed, and due to foo's inline
        # extent not being shinked by the truncate operation, our file bar got the whole
        # inline extent copied from foo, making us lose the last 128 bytes from bar
        # which got replaced by the bytes in range [128, 256[ from foo before foo was
        # truncated - in other words, data loss from bar and being able to read old and
        # stale data from foo that should not be possible to read anymore through normal
        # filesystem operations. Contrast with the case where we truncate a file from a
        # size N to a smaller size M, truncate it back to size N and then read the range
        # [M, N[, we should always get the value 0x00 for all the bytes in that range.
      
        # We expected the clone operation to fail with errno EOPNOTSUPP and therefore
        # not modify our file's bar data/metadata. So its content should be 256 bytes
        # long with all bytes having the value 0xbb.
        #
        # Without the btrfs bug fix, the clone operation succeeded and resulted in
        # leaking truncated data from foo, the bytes that belonged to its range
        # [128, 256[, and losing data from bar in that same range. So reading the
        # file gave us the following content:
        #
        # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
        # *
        # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
        # *
        # 0000400
        echo "File bar's content after the clone operation:"
        od -t x1 $SCRATCH_MNT/bar
      
        # Also because the foo's inline extent was not shrunk by the truncate
        # operation, btrfs' fsck, which is run by the fstests framework everytime a
        # test completes, failed reporting the following error:
        #
        #  root 5 inode 257 errors 400, nbytes wrong
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      0305cd5f
  10. 11 10月, 2015 1 次提交
  11. 08 10月, 2015 1 次提交
  12. 22 9月, 2015 1 次提交
    • C
      Btrfs: Direct I/O: Fix space accounting · 50745b0a
      chandan 提交于
      The following call trace is seen when generic/095 test is executed,
      
      WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
      Modules linked in:
      CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
       ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
       0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
       ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
      Call Trace:
       [<ffffffff81984058>] dump_stack+0x45/0x57
       [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
       [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
       [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
       [<ffffffff8117ce07>] destroy_inode+0x37/0x60
       [<ffffffff8117cf39>] evict+0x109/0x170
       [<ffffffff8117cfd5>] dispose_list+0x35/0x50
       [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
       [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
       [<ffffffff81165951>] kill_anon_super+0x11/0x20
       [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
       [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
       [<ffffffff811660cf>] deactivate_super+0x5f/0x70
       [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
       [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
       [<ffffffff81069c06>] task_work_run+0x96/0xb0
       [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
       [<ffffffff8198cbc2>] int_signal+0x12/0x17
      
      This means that the inode had non-zero "outstanding extents" during
      eviction. This occurs because, during direct I/O a task which successfully
      used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does
      not clear the bit after finishing the DIO write. A future DIO write could
      actually fail and the unused reserve space won't be freed because of the
      previously set BTRFS_INODE_DIO_READY bit.
      
      Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
      following issue,
      |-----------------------------------+-------------------------------------|
      | Task A                            | Task B                              |
      |-----------------------------------+-------------------------------------|
      | Start direct i/o write on inode X.|                                     |
      | reserve space                     |                                     |
      | Allocate ordered extent           |                                     |
      | release reserved space            |                                     |
      | Set BTRFS_INODE_DIO_READY bit.    |                                     |
      |                                   | splice()                            |
      |                                   | Transfer data from pipe buffer to   |
      |                                   | destination file.                   |
      |                                   | - kmap(pipe buffer page)            |
      |                                   | - Start direct i/o write on         |
      |                                   |   inode X.                          |
      |                                   |   - reserve space                   |
      |                                   |   - dio_refill_pages()              |
      |                                   |     - sdio->blocks_available == 0   |
      |                                   |     - Since a kernel address is     |
      |                                   |       being passed instead of a     |
      |                                   |       user space address,           |
      |                                   |       iov_iter_get_pages() returns  |
      |                                   |       -EFAULT.                      |
      |                                   |   - Since BTRFS_INODE_DIO_READY is  |
      |                                   |     set, we don't release reserved  |
      |                                   |     space.                          |
      |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
      | -EIOCBQUEUED is returned.         |                                     |
      |-----------------------------------+-------------------------------------|
      
      Hence this commit introduces "struct btrfs_dio_data" to track the usage of
      reserved data space. The remaining unused "reserve space" can now be freed
      reliably.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50745b0a
  13. 15 9月, 2015 1 次提交
    • J
      btrfs: skip waiting on ordered range for special files · a30e577c
      Jeff Mahoney 提交于
      In btrfs_evict_inode, we properly truncate the page cache for evicted
      inodes but then we call btrfs_wait_ordered_range for every inode as well.
      It's the right thing to do for regular files but results in incorrect
      behavior for device inodes for block devices.
      
      filemap_fdatawrite_range gets called with inode->i_mapping which gets
      resolved to the block device inode before getting passed to
      wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi.  What happens
      next depends on whether there's an open file handle associated with the
      inode.  If there is, we write to the block device, which is unexpected
      behavior.  If there isn't, we through normally and inode->i_data is used.
      We can also end up racing against open/close which can result in crashes
      when i_mapping points to a block device inode that has been closed.
      
      Since there can't be any page cache associated with special file inodes,
      it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
      the problem.
      
      Cc: <stable@vger.kernel.org>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911Tested-by: NChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      a30e577c
  14. 01 9月, 2015 1 次提交
  15. 14 8月, 2015 1 次提交
  16. 09 8月, 2015 2 次提交
    • C
      Btrfs: add support for blkio controllers · da2f0f74
      Chris Mason 提交于
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required, we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is able to throttle nicely on single disk filesystems.  A
      little more work is required for multi-device filesystems.
      Signed-off-by: NChris Mason <clm@fb.com>
      da2f0f74
    • F
      Btrfs: fix stale dir entries after unlink, inode eviction and fsync · bde6c242
      Filipe Manana 提交于
      If we remove a hard link from an inode, the inode gets evicted, then
      we fsync the inode and then power fail/crash, when the log tree is
      replayed, the parent directory inode still has entries pointing to
      the name that no longer exists, while our inode no longer has the
      BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
      leaving the filesystem in an inconsistent state. The stale directory
      entries can not be deleted (an attempt to delete them causes -ESTALE
      errors), which makes it impossible to delete the parent directory.
      
      This happens because we track the id of the transaction where the last
      unlink operation for the inode happened (last_unlink_trans) in an
      in-memory only field of the inode, that is, a value that is never
      persisted in the inode item stored on the fs/subvol btree. So if an
      inode is evicted and loaded again, the value for last_unlink_trans is
      set to 0, which prevents the fsync from logging the parent directory
      at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
      to the id of the transaction that last modified the inode when we
      load the inode. This is a pessimistic approach but it always ensures
      correctness with the trade off of ocassional full transaction commits
      when an fsync is done against the inode in the same transaction where
      it was evicted and reloaded when our inode is a directory and often
      logging its parent unnecessarily when our inode is not a directory.
      
      The following test case for fstests triggers the problem:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with 2 hard links.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/testdir/foo
        ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Now remove one of the links, trigger inode eviction and then fsync
        # our inode.
        unlink $SCRATCH_MNT/testdir/bar
        echo 2 > /proc/sys/vm/drop_caches
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
      
        # Silently drop all writes on our scratch device to simulate a power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount the fs to trigger log/journal replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now verify our directory entries.
        echo "Entries in testdir:"
        ls -1 $SCRATCH_MNT/testdir
      
        # If we remove our inode, its parent should become empty and therefore we should
        # be able to remove the parent.
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        # The fstests framework will call fsck against our filesystem which will verify
        # that all metadata is in a consistent state.
      
        status=0
        exit
      
      The test failed on btrfs with:
      
        generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
          --- tests/generic/098.out	2015-07-23 18:01:12.616175932 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad	2015-07-23 18:04:58.924138308 +0100
          @@ -1,3 +1,6 @@
           QA output created by 098
           Entries in testdir:
          +bar
           foo
          +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
          +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
          ...
          (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad'  to see the entire diff)
        _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
      
        $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
        (...)
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
           unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
           unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
        Checking filesystem on /dev/sdc
        (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bde6c242
  17. 29 7月, 2015 1 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  18. 12 7月, 2015 1 次提交
    • F
      Btrfs: fix shrinking truncate when the no_holes feature is enabled · c1aa4575
      Filipe Manana 提交于
      If the no_holes feature is enabled, we attempt to shrink a file to a size
      that ends up in the middle of a hole and we don't have any file extent
      items in the fs/subvol tree that go beyond the new file size (or any
      ordered extents that will insert such file extent items), we end up not
      updating the inode's disk_i_size, we only update the inode's i_size.
      
      This means that after unmounting and mounting the filesystem, or after
      the inode is evicted and reloaded, its i_size ends up being incorrect
      (an inode's i_size is set to the disk_i_size field when an inode is
      loaded). This happens when btrfs_truncate_inode_items() doesn't find
      any file extent items to drop - in this case it never makes a call to
      btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
      
      Example reproducer:
      
        $ mkfs.btrfs -O no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create our test file with some data and durably persist it.
        $ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
        $ sync
      
        # Append some data to the file, increasing its size, and leave a hole
        # between the old size and the start offset if the following write. So
        # our file gets a hole in the range [128Kb, 256Kb[.
        $ xfs_io -c "truncate 160K" /mnt/foo
      
        # We expect to see our file with a size of 160Kb, with the first 128Kb
        # of data all having the value 0xaa and the remaining 32Kb of data all
        # having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0500000
      
        # Now cleanly unmount and mount again the filesystem.
        $ umount /mnt
        $ mount /dev/sdd /mnt
      
        # We expect to get the same result as before, a file with a size of
        # 160Kb, with the first 128Kb of data all having the value 0xaa and the
        # remaining 32Kb of data all having the value 0x00.
        $ od -t x1 /mnt/foo
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0400000
      
      In the example above the file size/data do not match what they were before
      the remount.
      
      Fix this by always calling btrfs_ordered_update_i_size() with a size
      matching the size the file was truncated to if btrfs_truncate_inode_items()
      is not called for a log tree and no file extent items were dropped. This
      ensures the same behaviour as when the no_holes feature is not enabled.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      c1aa4575
  19. 02 7月, 2015 4 次提交
    • L
      Btrfs: fix warning of bytes_may_use · ddba1bfc
      Liu Bo 提交于
      While running generic/019, dmesg got several warnings from
      btrfs_free_reserved_data_space().
      
      Test generic/019 produces some disk failures so sumbit dio will get errors,
      in which case, btrfs_direct_IO() goes to the error handling and free
      bytes_may_use, but the problem is that bytes_may_use has been free'd
      during get_block().
      
      This adds a runtime flag to show if we've gone through get_block(), if so,
      don't do the cleanup work.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ddba1bfc
    • L
      Btrfs: fix hang when failing to submit bio of directIO · ad9ee205
      Liu Bo 提交于
      The hang is uncoverd by generic/019.
      
      btrfs_endio_direct_write() skips the "finish_ordered_fn" part when it hits
      an error, thus those added ordered extents will never get processed, which
      block processes that waiting for them via btrfs_start_ordered_extent().
      
      This fixes the above, and meanwhile finish_ordered_fn will do the space
      accounting work.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ad9ee205
    • F
      Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() · 9c6429d9
      Filipe Manana 提交于
      The comment was not correct about the part where it says the endio
      callback of the bio might have not yet been called - update it
      to mention that by that time the endio callback execution might
      still be in progress only.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9c6429d9
    • F
      Btrfs: fix memory corruption on failure to submit bio for direct IO · 61de718f
      Filipe Manana 提交于
      If we fail to submit a bio for a direct IO request, we were grabbing the
      corresponding ordered extent and decrementing its reference count twice,
      once for our lookup reference and once for the ordered tree reference.
      This was a problem because it caused the ordered extent to be freed
      without removing it from the ordered tree and any lists it might be
      attached to, leaving dangling pointers to the ordered extent around.
      Example trace with CONFIG_DEBUG_PAGEALLOC=y:
      
      [161779.858707] BUG: unable to handle kernel paging request at 0000000087654330
      [161779.859983] IP: [<ffffffff8124ca68>] rb_prev+0x22/0x3b
      [161779.860636] PGD 34d818067 PUD 0
      [161779.860636] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      (...)
      [161779.860636] Call Trace:
      [161779.860636]  [<ffffffffa06b36a6>] __tree_search+0xd9/0xf9 [btrfs]
      [161779.860636]  [<ffffffffa06b3708>] tree_search+0x42/0x63 [btrfs]
      [161779.860636]  [<ffffffffa06b4868>] ? btrfs_lookup_ordered_range+0x2d/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06b4873>] btrfs_lookup_ordered_range+0x38/0xa5 [btrfs]
      [161779.860636]  [<ffffffffa06aab8e>] btrfs_get_blocks_direct+0x11b/0x615 [btrfs]
      [161779.860636]  [<ffffffff8119727f>] do_blockdev_direct_IO+0x5ff/0xb43
      [161779.860636]  [<ffffffffa06aaa73>] ? btrfs_page_exists_in_range+0x1ad/0x1ad [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffffa06a10ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
      [161779.860636]  [<ffffffffa06a2c9a>] ? btrfs_get_extent_fiemap+0x1bc/0x1bc [btrfs]
      [161779.860636]  [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
      [161779.860636]  [<ffffffffa06affaa>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs]
      [161779.860636]  [<ffffffffa06b004c>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
      (...)
      
      We were also not freeing the btrfs_dio_private we allocated previously,
      which kmemleak reported with the following trace in its sysfs file:
      
      unreferenced object 0xffff8803f553bf80 (size 96):
        comm "xfs_io", pid 4501, jiffies 4295039588 (age 173.936s)
        hex dump (first 32 bytes):
          88 6c 9b f5 02 88 ff ff 00 00 00 00 00 00 00 00  .l..............
          00 00 00 00 00 00 00 00 00 00 c4 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81161ffe>] create_object+0x172/0x29a
          [<ffffffff8145870f>] kmemleak_alloc+0x25/0x41
          [<ffffffff81154e64>] kmemleak_alloc_recursive.constprop.40+0x16/0x18
          [<ffffffff811579ed>] kmem_cache_alloc_trace+0xfb/0x148
          [<ffffffffa03d8cff>] btrfs_submit_direct+0x65/0x16a [btrfs]
          [<ffffffff811968dc>] dio_bio_submit+0x62/0x8f
          [<ffffffff811975fe>] do_blockdev_direct_IO+0x97e/0xb43
          [<ffffffff811977f5>] __blockdev_direct_IO+0x32/0x34
          [<ffffffffa03d70ae>] btrfs_direct_IO+0x198/0x21f [btrfs]
          [<ffffffff81112ca1>] generic_file_direct_write+0xb3/0x128
          [<ffffffffa03e604d>] btrfs_file_write_iter+0x201/0x3e0 [btrfs]
          [<ffffffff8116586a>] __vfs_write+0x7c/0xa5
          [<ffffffff81165da9>] vfs_write+0xa0/0xe4
          [<ffffffff81166675>] SyS_pwrite64+0x64/0x82
          [<ffffffff81464fd7>] system_call_fastpath+0x12/0x6f
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      For read requests we weren't doing any cleanup either (none of the work
      done by btrfs_endio_direct_read()), so a failure submitting a bio for a
      read request would leave a range in the inode's io_tree locked forever,
      blocking any future operations (both reads and writes) against that range.
      
      So fix this by making sure we do the same cleanup that we do for the case
      where the bio submission succeeds.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      61de718f
  20. 03 6月, 2015 1 次提交
    • F
      Btrfs: fix hang during inode eviction due to concurrent readahead · 6ca07097
      Filipe Manana 提交于
      Zygo Blaxell and other users have reported occasional hangs while an
      inode is being evicted, leading to traces like the following:
      
      [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
      [ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
      [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
      [ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
      [ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
      [ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
      [ 5281.981396] Call Trace:
      [ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
      [ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
      [ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
      [ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
      [ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
      [ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
      [ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
      [ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
      [ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
      [ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
      [ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
      [ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
      [ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [ 5282.001299] 1 lock held by rm/20488:
      [ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b
      
      This happens when we have readahead, which calls readpages(), happening
      right before the inode eviction handler is invoked. So the reason is
      essentially:
      
      1) readpages() is called while a reference on the inode is held, so
         eviction can not be triggered before readpages() returns. It also
         locks one or more ranges in the inode's io_tree (which is done at
         extent_io.c:__do_contiguous_readpages());
      
      2) readpages() submits several read bios, all with an end io callback
         that runs extent_io.c:end_bio_extent_readpage() and that is executed
         by other task when a bio finishes, corresponding to a work queue
         (fs_info->end_io_workers) worker kthread. This callback unlocks
         the ranges in the inode's io_tree that were previously locked in
         step 1;
      
      3) readpages() returns, the reference on the inode is dropped;
      
      4) One or more of the read bios previously submitted are still not
         complete (their end io callback was not yet invoked or has not
         yet finished execution);
      
      5) Inode eviction is triggered (through an unlink call for example).
         The inode reference count was not incremented before submitting
         the read bios, therefore this is possible;
      
      6) The eviction handler starts executing and enters the loop that
         iterates over all extent states in the inode's io_tree;
      
      7) The loop picks one extent state record and uses its ->start and
         ->end fields, after releasing the inode's io_tree spinlock, to
         call lock_extent_bits() and clear_extent_bit(). The call to lock
         the range [state->start, state->end] blocks because the whole
         range or a part of it was locked by the previous call to
         readpages() and the corresponding end io callback, which unlocks
         the range was not yet executed;
      
      8) The end io callback for the read bio is executed and unlocks the
         range [state->start, state->end] (or a superset of that range).
         And at clear_extent_bit() the extent_state record state is used
         as a second argument to split_state(), which sets state->start to
         a larger value;
      
      9) The task executing the eviction handler is woken up by the task
         executing the bio's end io callback (through clear_state_bit) and
         the eviction handler locks the range
         [old value for state->start, state->end]. Shortly after, when
         calling clear_extent_bit(), it unlocks the range
         [new value for state->start, state->end], so it ends up unlocking
         only part of the range that it locked, leaving an extent state
         record in the io_tree that represents the unlocked subrange;
      
      10) The eviction handler loop, in its next iteration, gets the
          extent_state record for the subrange that it did not unlock in the
          previous step and then tries to lock it, resulting in an hang.
      
      So fix this by not using the ->start and ->end fields of an existing
      extent_state record. This is a simple solution, and an alternative
      could be to bump the inode's reference count before submitting each
      read bio and having it dropped in the bio's end io callback. But that
      would be a more invasive/complex change and would not protect against
      other possible places that are not holding a reference on the inode
      as well. Something to consider in the future.
      
      Many thanks to Zygo Blaxell for reporting, in the mailing list, the
      issue, a set of scripts to trigger it and testing this fix.
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6ca07097
  21. 26 4月, 2015 1 次提交
    • Y
      Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Yang Dongsheng 提交于
      We need to fill inode when we found a node for it in delayed_nodes_tree.
      But we did not fill the ->last_trans currently, it will cause the test
      of xfstest/generic/311 fail. Scenario of the 311 is shown as below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we would find a node
      in delayed_nodes_tree and fill the inode we are lookup with the
      information. But the ->last_trans is not filled, then the fsync()
      will check the ->last_trans and found it's 0 then say this inode
      is already in our tree which is commited, not recording the extents
      for it.
      
      Fix:
      	This patch fill the ->last_trans properly and set the
      runtime_flags if needed in this situation. Then we can get the
      log entries we expected after (6) and generic/311 passed.
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: NMiao Xie <miaoxie@huawei.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6e17d30b
  22. 25 4月, 2015 1 次提交
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe 提交于
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0f07d0