1. 23 Feb, 2016 1 commit
    • Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo committed
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar issue that produces exactly the same warning, but even with
      that commit we still run into this one.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, under high memory pressure, tasks which hold dev_replace.lock can
      be interrupted by kswapd. kswapd then tries to release memory occupied
      by superblocks, inodes and dentries, which may call evict_inode, and we end
      up in
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and this
      leads to an ABBA deadlock.
      
      To fix this, we can use the "blocking rwlock" approach used for extent_buffer, but
      things are simpler here since we only need to turn the read side's spinlock into
      a blocking lock.
      
      With this, btrfs/011 no longer produces warnings in dmesg.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 11 Feb, 2016 1 commit
    • btrfs: scrub: use GFP_KERNEL on the submission path · 58c4e173
      David Sterba committed
      Scrub is not on the critical writeback path, so we don't need to use
      GFP_NOFS for all allocations. The failures are handled and stats passed
      back to userspace.
      
      Let's use GFP_KERNEL on the paths where everything is ok, i.e. setting up
      the global structures and the IO submission paths.
      
      Functions that do the repair and fixups still use GFP_NOFS as we might
      want to skip any other filesystem activity if we encounter an error.
      This could turn out to be unnecessary, but requires more review compared
      to the easy cases in this patch.
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 23 Jan, 2016 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Al Viro committed
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  4. 20 Jan, 2016 1 commit
  5. 16 Jan, 2016 1 commit
  6. 07 Jan, 2016 3 commits
  7. 03 Dec, 2015 1 commit
  8. 25 Nov, 2015 3 commits
    • Btrfs: fix scrub preventing unused block groups from being deleted · 758f2dfc
      Filipe Manana committed
      Currently scrub can race with the cleaner kthread when the latter attempts
      to delete an unused block group. The result is that the cleaner kthread
      can never delete the block group later, unless the block group
      becomes used and unused again. The following diagram illustrates that
      race:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs and
           removes it from that list
      
                                                   scrub_enumerate_chunks()
      
                                                     searches device tree using
                                                     its commit root
      
                                                     finds device extent for
                                                     block group X
      
                                                     gets block group X from the tree
                                                     fs_info->block_group_cache_tree
                                                     (via btrfs_lookup_block_group())
      
                                                     sets bg X to RO
      
           sees the block group is
           already RO and therefore
           doesn't delete it nor adds
           it back to unused list
      
      So fix this by making scrub add the block group again to the list of
      unused block groups if the block group is still unused when it finished
      scrubbing it and it hasn't been removed already.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix race between scrub and block group deletion · 020d5b73
      Filipe Manana committed
      Scrub can race with the cleaner kthread deleting block groups that are
      unused (and with relocation too) leading to a failure with error -EINVAL
      that gets returned to user space.
      
      The following diagram illustrates how it happens:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs
      
           sets block group to RO
      
             btrfs_remove_chunk(bg X)
      
               deletes device extents
      
                                               scrub_enumerate_chunks()
      
                                                 searches device tree using
                                                 its commit root
      
                                                 finds device extent for
                                                 block group X
      
                                                 gets block group X from the tree
                                                 fs_info->block_group_cache_tree
                                                 (via btrfs_lookup_block_group())
      
                                                 sets bg X to RO (again)
      
                btrfs_remove_block_group(bg X)
      
                  deletes block group from
                  fs_info->block_group_cache_tree
      
                  removes extent map from
                  fs_info->mapping_tree
      
                                                     scrub_chunk(offset X)
      
                                                       searches fs_info->mapping_tree
                                                       for extent map starting at
                                                       offset X
      
                                                          --> doesn't find any such
                                                              extent map
                                                          --> returns -EINVAL and scrub
                                                              errors out to userspace
                                                              with -EINVAL
      
      Fix this by dealing with an extent map lookup failure as an indicator of
      block group deletion.
      Issue reproduced with fstest btrfs/071.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Continue replace when set_block_ro failed · 76a8efa1
      Zhaolei committed
      xfstests/011 fails on a node with a small filesystem.
      It can be reproduced with the following script:
        DEV_LIST=(/dev/vdd /dev/vde)   # bash array, expanded below via "${DEV_LIST[@]}"
        DEV_REPLACE="/dev/vdf"
      
        do_test()
        {
            local mkfs_opt="$1"
            local size="$2"
      
            dmesg -c >/dev/null
            umount $SCRATCH_MNT &>/dev/null
      
            echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
            mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
            mount "${DEV_LIST[0]}" $SCRATCH_MNT
      
            echo -n "Writing big files"
            dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
            for ((i = 1; i <= size; i++)); do
                echo -n .
                /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
            done
            echo
      
            echo Start replace
            btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
                dmesg
                return 1
            }
            return 0
        }
      
        # Set size to a value near the fs size;
        # for example, 1897 can trigger this bug on a 2.6G device.
        #
        do_test "-d raid1 -m raid1" 1897
      
      The system reports that the replace failed, with the following warning in dmesg:
       [  134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
       [  135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
       [  135.543505] ------------[ cut here ]------------
       [  135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
       [  135.545276] Modules linked in:
       [  135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
       [  135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
       [  135.547798]  ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
       [  135.548774]  ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
       [  135.549774]  ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
       [  135.550758] Call Trace:
       [  135.551086]  [<ffffffff817fe7b5>] dump_stack+0x44/0x55
       [  135.551737]  [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
       [  135.552487]  [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
       [  135.553211]  [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
       [  135.554051]  [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
       [  135.554722]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
       [  135.555506]  [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
       [  135.556304]  [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
       [  135.557009]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
       [  135.557855]  [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
       [  135.558669]  [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
       [  135.559374]  [<ffffffff81202124>] SyS_ioctl+0x74/0x80
       [  135.559987]  [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
       [  135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
      
      Reason:
       When a large amount of data is written to the fs, all free space gets
       allocated to data chunks.
       Operations such as scrub need to call set_block_ro(), and when there is
       only one metadata chunk in the system (or all other metadata chunks
       are full), the function tries to allocate a new chunk and
       fails because there is no space left on the device.
      
      Fix:
       A set_block_ro() failure for a metadata chunk is not a problem,
       because scrub_lock pauses commit_transaction at the same time and
       metadata is always COWed, so in-flight writepages will not
       write data to the same place as scrub/replace.
       Letting replace continue in this case is safe.
      
      Tested by above script, and xfstests/011, plus 100 times xfstests/070.
      
      Changelog v1->v2:
      1: Add detail comments in source and commit-message.
      2: Add dmesg detail into commit-message.
      3: Limit return value of -ENOSPC to be passed.
      All suggested by: Filipe Manana <fdmanana@gmail.com>
      Suggested-by: Filipe Manana <fdmanana@gmail.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 11 Nov, 2015 6 commits
  10. 08 Oct, 2015 3 commits
  11. 01 Sep, 2015 2 commits
  12. 14 Aug, 2015 1 commit
  13. 09 Aug, 2015 13 commits
    • Btrfs: fix parity scrub of RAID 5/6 with missing device · 4a770891
      Omar Sandoval committed
      When testing the previous patch, Zhao Lei reported a similar bug when
      attempting to scrub a degraded RAID 5/6 filesystem with a missing
      device, leading to NULL pointer dereferences from the RAID 5/6 parity
      scrubbing code.
      
      The first cause was the same as in the previous patch: attempting to
      call bio_add_page() on a missing block device. To fix this,
      scrub_extent_for_parity() can just mark the sectors on the missing
      device as errors instead of attempting to read from it.
      
      Additionally, the code uses scrub_remap_extent() to map the extent of
      the corresponding data stripe, but the extent wasn't already mapped. If
      scrub_remap_extent() finds a missing block device, it doesn't initialize
      extent_dev, so we're left with a NULL struct btrfs_device. The solution
      is to use btrfs_map_block() directly.
      Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix device replace of a missing RAID 5/6 device · 73ff61db
      Omar Sandoval committed
      The original implementation of device replace on RAID 5/6 seems to have
      missed support for replacing a missing device. When this is attempted,
      we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
      crashes when we try to dereference it. This happens because
      btrfs_map_block() has no choice but to return us the missing device
      because RAID 5/6 don't have any alternate mirrors to read from, and a
      missing device has a NULL bdev.
      
      The idea implemented here is to handle the missing device case
      separately, which better only happen when we're replacing a missing RAID
      5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
      reconstruct the data from parity, check it with
      scrub_recheck_block_checksum(), and write it out with
      scrub_write_block_to_dev_replace().
      Reported-by: Philip <bugzilla@philip-seeger.de>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation · b4ee1782
      Omar Sandoval committed
      The current RAID 5/6 recovery code isn't quite prepared to handle
      missing devices. In particular, it expects a bio that we previously
      attempted to use in the read path, meaning that it has valid pages
      allocated. However, missing devices have a NULL blkdev, and we can't
      call bio_add_page() on a bio with a NULL blkdev. We could do manual
      manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
      a separate path that allows us to manually add pages to the rbio.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove misleading handling of missing device scrub · 03679ade
      Omar Sandoval committed
      scrub_submit() claims that it can handle a bio with a NULL block device,
      but this is misleading, as calling bio_add_page() on a bio with a NULL
      ->bi_bdev would've already crashed. Delete this, as we're about to
      properly handle a missing block device.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix data checksum error cause by replace with io-load. · 55e3a601
      Zhaolei committed
      xfstests btrfs/070 sometimes fails.
      On my test machine, its failure rate is about 30%.
      In another VM (VMware), its failure rate is about 50%.
      
      Reason:
        btrfs/070 runs replace and defrag simultaneously with fsstress;
        afterwards, scrub finds checksum errors.
      
        Actually, this has nothing to do with the defrag operation;
        replace together with fsstress is enough to trigger the bug.
      
        Debugging showed that new data written to the target device can be
        overwritten with old data from the source device by the replace code.
        To avoid this, we can set the target block group read-only for the
        duration of the replace, so new data requested by other operations
        will not be written to the same place the replace code is writing.
      
        Before patch (4.1-rc3):
          30% failed in 100 xfstests runs.
        After patch:
          0% failed in 300 xfstests runs.
      
      It also happened in btrfs/071, as it is another scrub-with-IO-load test.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks() · b708ce96
      Zhaolei committed
      Using the newly introduced scrub_pause_on/off() makes this code block
      cleaner and more readable.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off() · 0e22be89
      Zhaolei committed
      This reduces existing duplicated code that is similar to
      scrub_blocked_if_needed() but cannot call it because of small
      differences.
      It is also used by my next patch, which has the same need.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Bypass unrelated items before accessing its contents in scrub · d7cad238
      Zhao Lei committed
      When we access extent_root in scrub_stripe() and
      scrub_raid56_parity(), we need to skip unrelated tree items first,
      before using their contents in other condition checks.
      
      This is not a bug fix; it only puts the code in a logical order.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Load only necessary csums into list in scrub · fe8cf654
      Zhao Lei committed
      We need not load the csums of the whole stripe in scrub, because the
      stripe is trimmed before use. That is to say, the data we really need
      csums for is the range [extent_logical, extent_logical + extent_len).
      
      This patch changes scrub_stripe() to use the above segment for
      btrfs_lookup_csums_range().
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix calculate typo caused by ambiguous meaning of logic_end · a0dd59de
      Zhao Lei committed
      For example, in scrub_raid56_parity(), the following lines are used
      to judge whether all data has been processed:
       place1: if (key.objectid > logic_end) ...
       place2: if (logic_start >= logic_end) ...
       ...
       (place2 is a typo, it should be ">"; it was copied from another
        place, where logic_end's meaning is different, long story...)
      
      We could fix the above typo directly, but the root cause is the
      ambiguous meaning of logic_end in scrub raid56 parity.
      
      In other places, XXX_end points to data which is not included,
      and we need to process the segment [XXX_start, XXX_end).
      
      But for scrub raid56 parity, logic_end points to the last data that
      needs processing, which introduced many "+ 1" and "- 1" in the code, as
      below:
       length = sparity->logic_end - sparity->logic_start + 1
       logic_end - logic_start + 1
       stripe_logical + increment - 1
      
      This patch changes logic_end's meaning in the raid56 parity functions
      and data structures to match the usual convention, along with the above bugfix.
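
The two conventions for logic_end can be compared in a small standalone model. This is only an illustrative sketch in Python, not the kernel code: the function names and the example offsets are hypothetical, and only the "+ 1" arithmetic and the ">" versus ">=" comparisons come from the message above.

```python
def length_inclusive(logic_start, logic_end):
    """Length when logic_end is the last byte to process (old raid56 parity meaning)."""
    return logic_end - logic_start + 1


def length_exclusive(logic_start, logic_end):
    """Length when logic_end is one past the last byte, i.e. [logic_start, logic_end)."""
    return logic_end - logic_start


def done_inclusive(pos, logic_end):
    """With an inclusive end, the byte at logic_end still has to be processed."""
    return pos > logic_end


def done_exclusive(pos, logic_end):
    """With an exclusive end, reaching logic_end means everything is processed."""
    return pos >= logic_end


# Hypothetical example range: 8192 bytes starting at offset 4096.
start, size = 4096, 8192
last_byte = start + size - 1   # inclusive end
one_past = start + size        # exclusive end

print(length_inclusive(start, last_byte), length_exclusive(start, one_past))
# Mixing the conventions (">=" against an inclusive end) stops one position early.
print(done_inclusive(last_byte, last_byte), done_exclusive(last_byte, last_byte))
```

Both length helpers describe the same range; the mistake only appears when a comparison written for one convention is applied to an end value that follows the other.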
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Free checksum list on scrub_extent() fail · 6fa96d72
      Zhao Lei committed
      When scrub_extent() fails, we need to free the previously created
      checksum list.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Check cancel and pause in interval of scrub operation · f2f66a2f
      Zhao Lei committed
      The old code checks for cancel and pause requests inside the scrub
      stripe operation, like:
        loop() {
          if (parity) {
            scrub_parity_stripe();
            continue;
          }
      
          check_cancel_and_pause()
      
          scrub_normal_stripe();
        }
      
      The reason is that when raid56 stripe scrub was introduced, the new
      code was simply inserted at the front of the loop.
      
      Better:
        loop() {
          check_cancel_and_pause()
      
          if (parity)
            scrub_parity_stripe();
          else
            scrub_normal_stripe();
        }
      
      This patch moves the code to achieve the above sequence.
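
The effect of the reordering can be shown with a small model of the two loops. This is a sketch in Python rather than the kernel code; `scrub_old`, `scrub_new` and the `cancel_at` parameter are hypothetical names used only to illustrate where the cancel check sits.

```python
def scrub_old(stripes, cancel_at):
    """Old ordering: parity stripes 'continue' before the cancel/pause check."""
    done = []
    for i, kind in enumerate(stripes):
        if kind == "parity":
            done.append((kind, i))
            continue                      # skips the check below
        if cancel_at is not None and i >= cancel_at:
            break                         # cancel is only noticed on data stripes
        done.append((kind, i))
    return done


def scrub_new(stripes, cancel_at):
    """New ordering: the cancel/pause check runs first on every iteration."""
    done = []
    for i, kind in enumerate(stripes):
        if cancel_at is not None and i >= cancel_at:
            break
        done.append((kind, i))            # parity and data handled after the check
    return done


# Three parity stripes followed by one data stripe; cancel requested at index 1.
stripes = ["parity", "parity", "parity", "data"]
print(scrub_old(stripes, cancel_at=1))
print(scrub_new(stripes, cancel_at=1))
```

In this model the old loop keeps processing parity stripes after the cancel request and only stops at the first data stripe, while the new loop stops at the next iteration regardless of stripe type.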
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix scrub panic when leaf crosses stripes · a323e813
      Zhao Lei committed
      Scrub panics in the following sequence of operations:
        mkfs.ext4 /dev/vdh
        btrfs-convert /dev/vdh
        mount /dev/vdh /mnt/tmp1
        btrfs scrub start -B /dev/vdh
        (panic)
      
      Reason:
        1: In some cases, a leaf created by btrfs-convert is split across 2
           stripes.
        2: Scrub bypassed part of the above bad leaf data, but the remaining
           data caused a panic in scrub_checksum_tree_block().
      
      For reason 1:
        we can get the following information after some simple operations.
        a. mkfs.ext4 /dev/vdh
           btrfs-convert /dev/vdh
        b. btrfs-debug-tree /dev/vdh
           we can see the following item in the extent tree:
           item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33
           Its logical address range is [27054080, 27070464)
           and it crosses 2 stripes:
           [27000832, 27066368)
           [27066368, 27131904)
        This will be fixed in btrfs-progs (btrfs-convert, btrfsck, ...)
      
      For reason 2:
        Scrub tries to do a "bypass" in this case, but the result is a
        "panic", because the current code lacks some conditions in the bypass
        and lets some bad leaf data escape.
      
      This patch fixes the above scrub code.
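
The stripe-crossing arithmetic for the item above can be checked with a few lines. This is an illustrative sketch in Python, not code from scrub or btrfs-progs: the helper name is hypothetical, and the 65536-byte stripe length is inferred from the two stripe ranges quoted in the message.

```python
STRIPE_LEN = 65536  # inferred: 27066368 - 27000832, from the ranges quoted above


def stripes_spanned(logical, length, stripe_start, stripe_len=STRIPE_LEN):
    """Number of stripes touched by the half-open range [logical, logical + length)."""
    first = (logical - stripe_start) // stripe_len
    last = (logical + length - 1 - stripe_start) // stripe_len
    return last - first + 1


# Values from the commit message: tree block [27054080, 27070464).
logical = 27054080
length = 27070464 - logical          # 16384 bytes
stripe_start = 27000832

print(length, stripes_spanned(logical, length, stripe_start))
```

A tree block of the same size that starts at the stripe boundary 27000832 would stay inside a single stripe, which is the case scrub expects.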
      
      Before patch:
        # btrfs scrub start -B /dev/vdh
        (panic)
      
      After patch:
        # btrfs scrub start -B /dev/vdh
        scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e
        ...
        # dmesg
        ...
        [   59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
        [   59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
        #
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  14. 29 Jul, 2015 1 commit
    • block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig committed
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not being persistent
      when bios are queued up, and of not being passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  15. 01 Jul, 2015 1 commit
  16. 10 Jun, 2015 1 commit
    • btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() · 20b2e302
      Zhao Lei committed
      lockdep reports the following warning in testing:
       [25176.843958] =================================
       [25176.844519] [ INFO: inconsistent lock state ]
       [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
       [25176.845591] ---------------------------------
       [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
       [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
       [25176.847838] {SOFTIRQ-ON-W} state was registered at:
       [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
       [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
       [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
       [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
       [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
       [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
       [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
       [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
       [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
       [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
       [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
       [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
       [25176.854935] irq event stamp: 51506
       [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
       [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
       [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
       [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
       [25176.857746]
       other info that might help us debug this:
       [25176.858845]  Possible unsafe locking scenario:
       [25176.859981]        CPU0
       [25176.860537]        ----
       [25176.861059]   lock(&wr_ctx->wr_lock);
       [25176.861705]   <Interrupt>
       [25176.862272]     lock(&wr_ctx->wr_lock);
       [25176.862881]
        *** DEADLOCK ***
      
      Reason:
       Above warning is caused by:
       Interrupt
       -> bio_endio()
       -> ...
       -> scrub_put_ctx()
       -> scrub_free_ctx() *1
       -> ...
       -> mutex_lock(&wr_ctx->wr_lock);
      
       scrub_put_ctx() is allowed to be called in the end_bio interrupt, but
       by design it should never call scrub_free_ctx(sctx) in interrupt
       context (*1 above), because btrfs_scrub_dev() takes one additional
       reference on sctx->refs, which makes scrub_free_ctx() only get called
       within btrfs_scrub_dev().
      
       Now the code does not behave as intended, because the free sequence in
       scrub_pending_bio_dec() has a gap.
      
       Current code:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
       -> scrub_free_ctx()                |
       -----------------------------------+-----------------------------------
      
       We expected:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
                                          | -> scrub_free_ctx()
       -----------------------------------+-----------------------------------
      
      Fix:
       Move scrub_pending_bio_dec() to a workqueue, so that this function does
       not run in interrupt context.
       Tested by checking the trace log while debugging.
      
      Changelog v1->v2:
       Use a workqueue instead of adjusting the function call sequence as in v1,
       because v1 would introduce a bug pointed out by:
       Filipe David Manana <fdmanana@gmail.com>
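
The two orderings in the tables above can be replayed with a toy reference counter. This is a simplified model in Python and not the kernel implementation: the `Ctx` class, the `run` helper and the context labels are hypothetical, and the starting count of 2 assumes one reference held by btrfs_scrub_dev() and one by the in-flight bio.

```python
class Ctx:
    """Toy stand-in for the scrub context with its refs counter."""

    def __init__(self):
        self.refs = 2          # assumption: btrfs_scrub_dev() + one in-flight bio
        self.freed_in = None   # context in which the final put happened

    def put(self, context):
        """Model of scrub_put_ctx(): the put that reaches zero frees the context."""
        self.refs -= 1
        if self.refs == 0:
            self.freed_in = context


def run(order):
    """Apply the puts in the given order and report where the free happened."""
    ctx = Ctx()
    for context in order:
        ctx.put(context)
    return ctx.freed_in


# "Current code": btrfs_scrub_dev() drops its reference first,
# so the bio completion path performs the final put.
print(run(["process", "interrupt"]))
# "We expected": the bio completion path puts first,
# and btrfs_scrub_dev() frees in process context.
print(run(["interrupt", "process"]))
# With the fix, the bio-side put runs from a workqueue, so either order is safe.
print(run(["process", "workqueue"]))
```

The model only captures which caller performs the final put; it does not model the wake_up() on list_wait that opens the window between the two puts.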
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>