1. 10 11月, 2015 1 次提交
  2. 07 11月, 2015 2 次提交
    • M
      mm, fs: introduce mapping_gfp_constraint() · c62d2555
      Michal Hocko 提交于
      There are many places which use mapping_gfp_mask to restrict a more
      generic gfp mask which would be used for allocations which are not
      directly related to the page cache but they are performed in the same
      context.
      
      Let's introduce a helper function which makes the restriction explicit and
      easier to track.  This patch doesn't introduce any functional changes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c62d2555
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  3. 03 11月, 2015 3 次提交
    • F
      Btrfs: fix hole punching when using the no-holes feature · 2959a32a
      Filipe Manana 提交于
      When we are using the no-holes feature, if we punch a hole into a file
      range that already contains a hole which overlaps the range we are passing
      to fallocate(), we end up removing the extent map that represents the
      existing hole without adding a new one. This happens because with the
      no-holes feature we do not have explicit extent items to represent holes
      and therefore the call to __btrfs_drop_extents(), made from
      btrfs_punch_hole(), returns an end offset to the variable drop_end that
      is smaller than the end of the range passed to fallocate(), while it
      drops all existing extent maps in that range.
      Normally having a missing extent map is not a problem, for example for
      a readpages() operation we just end up building the extent map by
      looking at the fs/subvol tree for a matching extent item (or a lack of
      one for implicit holes). However for an fsync that uses the fast path,
      which needs to look at the list of modified extent maps, this means
      the fsync will not record information about the complete hole we had
      before the fallocate() call into the log tree, resulting in a file with
      content/layout that does not match what we had neither before nor after
      the hole punch operation.
      
      The following test case for fstests reproduces the issue. It fails without
      this change because we get a file with a different digest after the fsync
      log replay and also with a different extent/hole layout.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
           _cleanup_flakey
           rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/punch
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "fpunch"
        _require_xfs_io_command "fiemap"
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        # This test was motivated by an issue found in btrfs when the btrfs
        # no-holes feature is enabled (introduced in kernel 3.14). So enable
        # the feature if the fs being tested is btrfs.
        if [ $FSTYP == "btrfs" ]; then
            _require_btrfs_fs_feature "no_holes"
            _require_btrfs_mkfs_feature "no-holes"
            MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
        fi
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create out test file with some data and then fsync it.
        # We do the fsync only to make sure the last fsync we do in this test
        # triggers the fast code path of btrfs' fsync implementation, a
        # condition necessary to trigger the bug btrfs had.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 128K" \
                        -c "fsync"                  \
                        $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Now punch a hole against the range [96K, 128K[.
        $XFS_IO_PROG -c "fpunch 96K 32K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the previous range
        # and ends beyond eof.
        $XFS_IO_PROG -c "fpunch 64K 128K" $SCRATCH_MNT/foobar
      
        # Punch another hole against a range that overlaps the first range
        # ([96K, 128K[) and ends at eof.
        $XFS_IO_PROG -c "fpunch 32K 96K" $SCRATCH_MNT/foobar
      
        # Fsync our file. We want to verify that, after a power failure and
        # mounting the filesystem again, the file content reflects all the hole
        # punch operations.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap before power failure:"
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        # Silently drop all writes and umount to simulate a crash/power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger log replay and validate file
        # contents.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        echo "File digest after log replay:"
        # Must match the same digest we got before the power failure.
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        echo "Fiemap after log replay:"
        # Must match the same extent listing we got before the power failure.
        $XFS_IO_PROG -c "fiemap -v" $SCRATCH_MNT/foobar | _filter_fiemap
      
        _unmount_flakey
      
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2959a32a
    • C
      Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state · 13a0db5a
      chandan 提交于
      When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
      and nodesize set to 64k), the following call trace is observed,
      
      WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
      Modules linked in:
      CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d9 #54
      task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
      NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
      REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d9)
      MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
      CFAR: c0000000004a3b30 SOFTE: 1
      GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
      GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
      GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
      GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
      GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
      GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
      GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
      GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
      NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
      LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      Call Trace:
      [c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
      [c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
      [c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
      [c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
      [c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
      [c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
      [c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
      [c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
      [c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
      [c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
      [c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
      [c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
      [c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
      [c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
      [c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
      [c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
      [c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
      [c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
      [c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
      [c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
      [c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
      [c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
      [c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
      [c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
      [c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
      Instruction dump:
      fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
      812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c
      
      The above call trace is seen even on x86_64; albeit very rarely and that too
      with nodesize set to 64k and with nospace_cache mount option being used.
      
      The reason for the above call trace is,
      btrfs_remove_chunk
        check_system_chunk
          Allocate chunk if required
        For each physical stripe on underlying device,
          btrfs_free_dev_extent
            ...
            Take lock on Device tree's root node
            btrfs_cow_block("dev tree's root node");
              btrfs_reserve_extent
                find_free_extent
      	    index = BTRFS_RAID_DUP;
      	    have_caching_bg = false;
      
                  When in LOOP_CACHING_NOWAIT state, Assume we find a block group
      	    which is being cached; Hence have_caching_bg is set to true
      
                  When repeating the search for the next RAID index, we set
      	    have_caching_bg to false.
      
      Hence right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
      skip LOOP_CACHING_WAIT state and move to LOOP_ALLOC_CHUNK state where we
      allocate a chunk and try to add entries corresponding to the chunk's physical
      stripe into the device tree. When doing so the task deadlocks itself waiting
      for the blocking lock on the root node of the device tree.
      
      This commit fixes the issue by introducing a new local variable to help
      indicate as to whether a block group of any RAID type is being cached.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      13a0db5a
    • Q
      btrfs: Fix a data space underflow warning · 485290a7
      Qu Wenruo 提交于
      Even with quota disabled, generic/127 will trigger a kernel warning by
      underflow data space info.
      
      The bug is caused by buffered write, which in case of short copy, the
      start parameter for btrfs_delalloc_release_space() is wrong, and
      round_up/down() in btrfs_delalloc_release() extents the range to page
      aligned, decreasing one more page than expected.
      
      This patch will fix it by passing correct start.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      485290a7
  4. 27 10月, 2015 10 次提交
  5. 26 10月, 2015 2 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
    • F
      Btrfs: fix regression when running delayed references · 2c3cf7d5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a refactoring/rework of the delayed
      references implementation in order to fix certain problems with qgroups.
      However that rework introduced one more regression that leads to the
      following trace when running delayed references for metadata:
      
      [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
      [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
      [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
      [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: ffff88010c4c8000
      [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
      [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000
      [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000
      [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8
      [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000
      [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000
      [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0
      [35908.065201] Stack:
      [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000
      [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8
      [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000
      [35908.065201] Call Trace:
      [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
      [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
      [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs]
      [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
      [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
      [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b [btrfs]
      [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
      [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
      [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
      [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
      [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
      [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
      [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
      [35908.065201] RIP  [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs]
      [35908.065201]  RSP <ffff88010c4cbb08>
      [35908.310885] ---[ end trace fe4299baf0666457 ]---
      
      This happens because the new delayed references code no longer merges
      delayed references that have different sequence values. The following
      steps are an example sequence leading to this issue:
      
      1) Transaction N starts, fs_info->tree_mod_seq has value 0;
      
      2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
         bytenr A is created, with a value of 1 and a seq value of 0;
      
      3) fs_info->tree_mod_seq is incremented to 1;
      
      4) Extent buffer A is deleted through btrfs_del_items(), which calls
         btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
         later returns the metadata extent associated to extent buffer A to
         the free space cache (the range is not pinned), because the extent
         buffer was created in the current transaction (N) and writeback never
         happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
         in the extent buffer).
         This creates the delayed reference Ref2 for bytenr A, with a value
         of -1 and a seq value of 1;
      
      5) Delayed reference Ref2 is not merged with Ref1 when we create it,
         because they have different sequence numbers (decided at
         add_delayed_ref_tail_merge());
      
      6) fs_info->tree_mod_seq is incremented to 2;
      
      7) Some task attempts to allocate a new extent buffer (done at
         extent-tree.c:find_free_extent()), but due to heavy fragmentation
         and running low on metadata space the clustered allocation fails
         and we fall back to unclustered allocation, which finds the
         extent at offset A, so a new extent buffer at offset A is allocated.
         This creates delayed reference Ref3 for bytenr A, with a value of 1
         and a seq value of 2;
      
      8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
         all have different seq values;
      
      9) We start running the delayed references (__btrfs_run_delayed_refs());
      
      10) The delayed Ref1 is the first one being applied, which ends up
          creating an inline extent backref in the extent tree;
      
      10) Next the delayed reference Ref3 is selected for execution, and not
          Ref2, because select_delayed_ref() always gives a preference for
          positive references (that have an action of BTRFS_ADD_DELAYED_REF);
      
      11) When running Ref3 we encounter alreay the inline extent backref
          in the extent tree at insert_inline_extent_backref(), which makes
          us hit the following BUG_ON:
      
              BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
      
          This is always true because owner corresponds to the level of the
          extent buffer/btree node in the btree.
      
      For the scenario described above we hit the BUG_ON because we never merge
      references that have different seq values.
      
      We used to do the merging before the 4.2 kernel, more specifically, before
      the commmits:
      
        c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
        c43d160f ("btrfs: delayed-ref: Cleanup the unneeded functions.")
      
      This issue became more exposed after the following change that was added
      to 4.2 as well:
      
        cffc3374 ("Btrfs: fix order by which delayed references are run")
      
      Which in turn fixed another regression by the two commits previously
      mentioned.
      
      So fix this by bringing back the delayed reference merge code, with the
      proper adaptations so that it operates against the new data structure
      (linked list vs old red black tree implementation).
      
      This issue was hit running fstest btrfs/063 in a loop. Several people have
      reported this issue in the mailing list when running on kernels 4.2+.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Fixes: c6fc2454 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.")
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      2c3cf7d5
  6. 22 10月, 2015 22 次提交