1. 16 5月, 2016 1 次提交
  2. 06 5月, 2016 3 次提交
  3. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  4. 12 3月, 2016 1 次提交
  5. 23 2月, 2016 1 次提交
    • L
      Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo 提交于
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar one that has the exactly same warning, but with that, we still
      run to this.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, with high memory pressure, tasks which hold dev_replace.lock can
      be interrupted by kswapd and then kswapd is intended to release memory occupied
      by superblock, inodes and dentries, where we may call evict_inode, and it comes
      to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
      to a ABBA deadlock.
      
      To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
      things are simpler here since we only needs read's spinlock to blocking lock.
      
      With this, btrfs/011 no more produces warnings in dmesg.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73beece9
  6. 11 2月, 2016 1 次提交
    • D
      btrfs: scrub: use GFP_KERNEL on the submission path · 58c4e173
      David Sterba 提交于
      Scrub is not on the critical writeback path we don't need to use
      GFP_NOFS for all allocations. The failures are handled and stats passed
      back to userspace.
      
      Let's use GFP_KERNEL on the paths where everything is ok, ie. setup the
      global structures and the IO submission paths.
      
      Functions that do the repair and fixups still use GFP_NOFS as we might
      want to skip any other filesystem activity if we encounter an error.
      This could turn out to be unnecessary, but requires more review compared
      to the easy cases in this patch.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      58c4e173
  7. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  8. 20 1月, 2016 1 次提交
  9. 16 1月, 2016 1 次提交
  10. 07 1月, 2016 3 次提交
  11. 03 12月, 2015 1 次提交
  12. 25 11月, 2015 3 次提交
    • F
      Btrfs: fix scrub preventing unused block groups from being deleted · 758f2dfc
      Filipe Manana 提交于
      Currently scrub can race with the cleaner kthread when the later attempts
      to delete an unused block group, and the result is preventing the cleaner
      kthread from ever deleting later the block group - unless the block group
      becomes used and unused again. The following diagram illustrates that
      race:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs and
           removes it from that list
      
                                                   scrub_enumerate_chunks()
      
                                                     searches device tree using
                                                     its commit root
      
                                                     finds device extent for
                                                     block group X
      
                                                     gets block group X from the tree
                                                     fs_info->block_group_cache_tree
                                                     (via btrfs_lookup_block_group())
      
                                                     sets bg X to RO
      
           sees the block group is
           already RO and therefore
           doesn't delete it nor adds
           it back to unused list
      
      So fix this by making scrub add the block group again to the list of
      unused block groups if the block group is still unused when it finished
      scrubbing it and it hasn't been removed already.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      758f2dfc
    • F
      Btrfs: fix race between scrub and block group deletion · 020d5b73
      Filipe Manana 提交于
      Scrub can race with the cleaner kthread deleting block groups that are
      unused (and with relocation too) leading to a failure with error -EINVAL
      that gets returned to user space.
      
      The following diagram illustrates how it happens:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs
      
           sets block group to RO
      
             btrfs_remove_chunk(bg X)
      
               deletes device extents
      
                                               scrub_enumerate_chunks()
      
                                                 searches device tree using
                                                 its commit root
      
                                                 finds device extent for
                                                 block group X
      
                                                 gets block group X from the tree
                                                 fs_info->block_group_cache_tree
                                                 (via btrfs_lookup_block_group())
      
                                                 sets bg X to RO (again)
      
                btrfs_remove_block_group(bg X)
      
                  deletes block group from
                  fs_info->block_group_cache_tree
      
                  removes extent map from
                  fs_info->mapping_tree
      
                                                     scrub_chunk(offset X)
      
                                                       searches fs_info->mapping_tree
                                                       for extent map starting at
                                                       offset X
      
                                                          --> doesn't find any such
                                                              extent map
                                                          --> returns -EINVAL and scrub
                                                              errors out to userspace
                                                              with -EINVAL
      
      Fix this by dealing with an extent map lookup failure as an indicator of
      block group deletion.
      Issue reproduced with fstest btrfs/071.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      020d5b73
    • Z
      btrfs: Continue replace when set_block_ro failed · 76a8efa1
      Zhaolei 提交于
      xfstests/011 failed in node with small_size filesystem.
      Can be reproduced by following script:
        DEV_LIST="/dev/vdd /dev/vde"
        DEV_REPLACE="/dev/vdf"
      
        do_test()
        {
            local mkfs_opt="$1"
            local size="$2"
      
            dmesg -c >/dev/null
            umount $SCRATCH_MNT &>/dev/null
      
            echo  mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}"
            mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1
            mount "${DEV_LIST[0]}" $SCRATCH_MNT
      
            echo -n "Writing big files"
            dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1
            for ((i = 1; i <= size; i++)); do
                echo -n .
                /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1
            done
            echo
      
            echo Start replace
            btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || {
                dmesg
                return 1
            }
            return 0
        }
      
        # Set size to value near fs size
        # for example, 1897 can trigger this bug in 2.6G device.
        #
        ./do_test "-d raid1 -m raid1" 1897
      
      System will report replace fail with following warning in dmesg:
       [  134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started
       [  135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28
       [  135.543505] ------------[ cut here ]------------
       [  135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440()
       [  135.545276] Modules linked in:
       [  135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256
       [  135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
       [  135.547798]  ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000
       [  135.548774]  ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4
       [  135.549774]  ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70
       [  135.550758] Call Trace:
       [  135.551086]  [<ffffffff817fe7b5>] dump_stack+0x44/0x55
       [  135.551737]  [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0
       [  135.552487]  [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20
       [  135.553211]  [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440
       [  135.554051]  [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0
       [  135.554722]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
       [  135.555506]  [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0
       [  135.556304]  [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580
       [  135.557009]  [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0
       [  135.557855]  [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70
       [  135.558669]  [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90
       [  135.559374]  [<ffffffff81202124>] SyS_ioctl+0x74/0x80
       [  135.559987]  [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f
       [  135.560842] ---[ end trace 2a5c1fc3205abbdd ]---
      
      Reason:
       When big data writen to fs, the whole free space will be allocated
       for data chunk.
       And operation as scrub need to set_block_ro(), and when there is
       only one metadata chunk in system(or other metadata chunks
       are all full), the function will try to allocate a new chunk,
       and failed because no space in device.
      
      Fix:
       When set_block_ro failed for metadata chunk, it is not a problem
       because scrub_lock paused commit_trancaction in same time, and
       metadata are always cowed, so the on-the-fly writepages will not
       write data into same place with scrub/replace.
       Let replace continue in this case is no problem.
      
      Tested by above script, and xfstests/011, plus 100 times xfstests/070.
      
      Changelog v1->v2:
      1: Add detail comments in source and commit-message.
      2: Add dmesg detail into commit-message.
      3: Limit return value of -ENOSPC to be passed.
      All suggested by: Filipe Manana <fdmanana@gmail.com>
      Suggested-by: NFilipe Manana <fdmanana@gmail.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      76a8efa1
  13. 11 11月, 2015 6 次提交
  14. 08 10月, 2015 3 次提交
  15. 01 9月, 2015 2 次提交
  16. 14 8月, 2015 1 次提交
  17. 09 8月, 2015 10 次提交
    • O
      Btrfs: fix parity scrub of RAID 5/6 with missing device · 4a770891
      Omar Sandoval 提交于
      When testing the previous patch, Zhao Lei reported a similar bug when
      attempting to scrub a degraded RAID 5/6 filesystem with a missing
      device, leading to NULL pointer dereferences from the RAID 5/6 parity
      scrubbing code.
      
      The first cause was the same as in the previous patch: attempting to
      call bio_add_page() on a missing block device. To fix this,
      scrub_extent_for_parity() can just mark the sectors on the missing
      device as errors instead of attempting to read from it.
      
      Additionally, the code uses scrub_remap_extent() to map the extent of
      the corresponding data stripe, but the extent wasn't already mapped. If
      scrub_remap_extent() finds a missing block device, it doesn't initialize
      extent_dev, so we're left with a NULL struct btrfs_device. The solution
      is to use btrfs_map_block() directly.
      Reported-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4a770891
    • O
      Btrfs: fix device replace of a missing RAID 5/6 device · 73ff61db
      Omar Sandoval 提交于
      The original implementation of device replace on RAID 5/6 seems to have
      missed support for replacing a missing device. When this is attempted,
      we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
      crashes when we try to dereference it. This happens because
      btrfs_map_block() has no choice but to return us the missing device
      because RAID 5/6 don't have any alternate mirrors to read from, and a
      missing device has a NULL bdev.
      
      The idea implemented here is to handle the missing device case
      separately, which better only happen when we're replacing a missing RAID
      5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
      reconstruct the data from parity, check it with
      scrub_recheck_block_checksum(), and write it out with
      scrub_write_block_to_dev_replace().
      Reported-by: NPhilip <bugzilla@philip-seeger.de>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      73ff61db
    • O
      Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation · b4ee1782
      Omar Sandoval 提交于
      The current RAID 5/6 recovery code isn't quite prepared to handle
      missing devices. In particular, it expects a bio that we previously
      attempted to use in the read path, meaning that it has valid pages
      allocated. However, missing devices have a NULL blkdev, and we can't
      call bio_add_page() on a bio with a NULL blkdev. We could do manual
      manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
      a separate path that allows us to manually add pages to the rbio.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b4ee1782
    • O
      Btrfs: remove misleading handling of missing device scrub · 03679ade
      Omar Sandoval 提交于
      scrub_submit() claims that it can handle a bio with a NULL block device,
      but this is misleading, as calling bio_add_page() on a bio with a NULL
      ->bi_bdev would've already crashed. Delete this, as we're about to
      properly handle a missing block device.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      03679ade
    • Z
      btrfs: Fix data checksum error cause by replace with io-load. · 55e3a601
      Zhaolei 提交于
      xfstests btrfs/070 sometimes failed.
      In my test machine, its fail rate is about 30%.
      In another vm(vmware), its fail rate is about 50%.
      
      Reason:
        btrfs/070 do replace and defrag with fsstress simultaneously,
        after above operation, checksum error is found by scrub.
      
        Actually, it have no relationship with defrag operation, only
        replace with fsstress can trigger this bug.
      
        New data writen to target device have possibility rewrited by
        old data from source device by replace code in debug, to avoid
        above problem, we can set target block group to readonly in
        replace period, so new data requested by other operation will
        not write to same place with replace code.
      
        Before patch(4.1-rc3):
          30% failed in 100 xfstests.
        After patch:
          0% failed in 300 xfstests.
      
      It also happened in btrfs/071 as it's another scrub with IO load tests.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      55e3a601
    • Z
      btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks() · b708ce96
      Zhaolei 提交于
      Use new intruduced scrub_pause_on/off() can make this code block
      clean and more readable.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b708ce96
    • Z
      btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off() · 0e22be89
      Zhaolei 提交于
      It can reduce current duplicated code which is similar to
      scrub_blocked_if_needed() but can not call it because little
      different.
      It also used by my next patch which is in same case.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0e22be89
    • Z
      btrfs: Bypass unrelated items before accessing its contents in scrub · d7cad238
      Zhao Lei 提交于
      When we access extent_root in scrub_stripe() and
      scrub_raid56_parity(), we need bypass unrelated tree item firstly
      before using its contents to do other condition.
      
      It is not a bug fix, only making code sequence in logic.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7cad238
    • Z
      btrfs: Load only necessary csums into list in scrub · fe8cf654
      Zhao Lei 提交于
      We need not load csum of whole strip in scrub because strip is trimed
      before use, it is to say, what we really need to calculate csum is
      data between [extent_logical, extent_len).
      
      This patch changed to use above segment for btrfs_lookup_csums_range()
      in scrub_stripe()
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fe8cf654
    • Z
      btrfs: Fix calculate typo caused by ambiguous meaning of logic_end · a0dd59de
      Zhao Lei 提交于
      For example, in scrub_raid56_parity(), following lines are used
      to judge is all data processed:
       place1: if (key.objectid > logic_end) ...
       place2: if (logic_start >= logic_end) ...
       ...
       (place2 is typo, is should be ">", it is copied from other
        place, where logic_end's meaning is different, long story...)
      
      We can fix above typo directly, but the root reason is ambiguous
      meaning of logic_end in scrub raid56 parity.
      
      In other place, XXX_end is pointed to data which is not included,
      and we need to process segment of [XXX_start, XXX_end).
      
      But for scrub raid56 parity, logic_end is pointed to lattest data
      need to process, and introduced many "+ 1" and "- 1" in code as
      below:
       length = sparity->logic_end - sparity->logic_start + 1
       logic_end - logic_start + 1
       stripe_logical + increment - 1
      
      This patch changed logic_end's meaning to make it in normal understanding
      in raid56 parity functions and data struct alone with above bugfix.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a0dd59de