1. 28 4月, 2016 4 次提交
  2. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  3. 14 3月, 2016 1 次提交
  4. 23 2月, 2016 1 次提交
    • L
      Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo 提交于
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar one that has the exactly same warning, but with that, we still
      run to this.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, with high memory pressure, tasks which hold dev_replace.lock can
      be interrupted by kswapd and then kswapd is intended to release memory occupied
      by superblock, inodes and dentries, where we may call evict_inode, and it comes
      to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
      to a ABBA deadlock.
      
      To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
      things are simpler here since we only needs read's spinlock to blocking lock.
      
      With this, btrfs/011 no more produces warnings in dmesg.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73beece9
  5. 11 2月, 2016 3 次提交
  6. 30 1月, 2016 1 次提交
  7. 22 1月, 2016 1 次提交
  8. 20 1月, 2016 4 次提交
  9. 19 1月, 2016 1 次提交
  10. 16 1月, 2016 2 次提交
  11. 08 1月, 2016 1 次提交
    • F
      Btrfs: fix fitrim discarding device area reserved for boot loader's use · 8cdc7c5b
      Filipe Manana 提交于
      As of the 4.3 kernel release, the fitrim ioctl can now discard any region
      of a disk that is not allocated to any chunk/block group, including the
      first megabyte which is used for our primary superblock and by the boot
      loader (grub for example).
      
      Fix this by not allowing to trim/discard any region in the device starting
      with an offset not greater than min(alloc_start_mount_option, 1Mb), just
      as it was not possible before 4.3.
      
      A reproducer test case for xfstests follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
        # reserved for a boot loader to use (GRUB for example) and btrfs should never
        # use them - neither for allocating metadata/data nor should trim/discard them.
        # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
        $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
        $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
      
        # Now mount the filesystem and perform a fitrim against it.
        _scratch_mount
        _require_batched_discard $SCRATCH_MNT
        $FSTRIM_PROG $SCRATCH_MNT
      
        # Now unmount the filesystem and verify the content of the ranges was not
        # modified (no trim/discard happened on them).
        _scratch_unmount
        echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
        od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
        od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
      
        status=0
        exit
      Reported-by: NVincent Petry  <PVince81@yahoo.fr>
      Reported-by: NAndrei Borzenkov <arvidjaar@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
      Fixes: 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      8cdc7c5b
  12. 07 1月, 2016 7 次提交
    • S
      Btrfs: Check metadata redundancy on balance · ee592d07
      Sam Tygier 提交于
      When converting a filesystem via balance check that metadata mode
      is at least as redundant as the data mode. For example give warning
      when:
      -dconvert=raid1 -mconvert=single
      Signed-off-by: NSam Tygier <samtygier@yahoo.co.uk>
      [ minor message reformatting ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee592d07
    • D
      btrfs: cleanup, use enum values for btrfs_path reada · e4058b54
      David Sterba 提交于
      Replace the integers by enums for better readability. The value 2 does
      not have any meaning since a7175319
      "Btrfs: do less aggressive btree readahead" (2009-01-22).
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e4058b54
    • B
      Btrfs: use linux/sizes.h to represent constants · ee22184b
      Byongho Lee 提交于
      We use many constants to represent size and offset value.  And to make
      code readable we use '256 * 1024 * 1024' instead of '268435456' to
      represent '256MB'.  However we can make far more readable with 'SZ_256MB'
      which is defined in the 'linux/sizes.h'.
      
      So this patch replaces 'xxx * 1024 * 1024' kind of expression with
      single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
      not a power of 2. And I haven't touched to '4096' & '8192' because it's
      more intuitive than 'SZ_4KB' & 'SZ_8KB'.
      Signed-off-by: NByongho Lee <bhlee.kernel@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee22184b
    • D
      btrfs: cleanup, remove stray return statements · 7928d672
      David Sterba 提交于
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7928d672
    • D
      93a3d467
    • D
      btrfs: handle invalid num_stripes in sys_array · f5cdedd7
      David Sterba 提交于
      We can handle the special case of num_stripes == 0 directly inside
      btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
      catch other unhandled cases where we fail to validate external data.
      
      A crafted or corrupted image crashes at mount time:
      
      BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
      BTRFS info (device loop0): disk space caching is enabled
      BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
      Kernel panic - not syncing: BUG!
      CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
      Stack:
       637af890 60062489 602aeb2e 604192ba
       60387961 00000011 637af8a0 6038a835
       637af9c0 6038776b 634ef32b 00000000
      Call Trace:
       [<6001c86d>] show_stack+0xfe/0x15b
       [<6038a835>] dump_stack+0x2a/0x2c
       [<6038776b>] panic+0x13e/0x2b3
       [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
       [<601cfbbe>] open_ctree+0x192d/0x27af
       [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<600d710b>] do_mount+0xa35/0xbc9
       [<600d7557>] SyS_mount+0x95/0xc8
       [<6001e884>] handle_syscall+0x6b/0x8e
      Reported-by: NJiri Slaby <jslaby@suse.com>
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      CC: stable@vger.kernel.org	# 3.19+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5cdedd7
    • Z
      btrfs: Support convert to -d dup for btrfs-convert · c5ca8781
      Zhao Lei 提交于
      Since we will add support for -d dup for non-mixed filesystem,
      kernel need to support converting to this raid-type.
      
      This patch remove limitation of above case.
      
      Tested by following script:
      (combination of dup conversion with fsck):
      
      export TEST_DEV='/dev/vdc'
      export TEST_DIR='/var/ltf/tester/mnt'
      
      do_dup_test()
      {
          local m_from="$1"
          local d_from="$2"
          local m_to="$3"
          local d_to="$4"
      
          echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
      
          umount "$TEST_DIR" &>/dev/null
          ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
          mount "$TEST_DEV" "$TEST_DIR" || return 1
      
          cp -a /sbin/* "$TEST_DIR"
      
          [[ "$m_from" != "$m_to" ]] && {
              ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
          }
      
          [[ "$d_from" != "$d_to" ]] && {
      	local opt=()
      	[[ "$d_to" == single ]] && opt+=("-f")
              ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
          }
      
          umount "$TEST_DIR" || return 1
          ./btrfsck "$TEST_DEV" || return 1
          echo
      
          return 0
      }
      
      test_all()
      {
          for m_from in single dup; do
          for d_from in single dup; do
          for m_to in single dup; do
          for d_to in single dup; do
          do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
          done
          done
          done
          done
      }
      
      test_all
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5ca8781
  13. 24 12月, 2015 1 次提交
  14. 17 12月, 2015 1 次提交
    • F
      Btrfs: fix race when finishing dev replace leading to transaction abort · 50460e37
      Filipe Manana 提交于
      During the final phase of a device replace operation, I ran into a
      transaction abort that resulted in the following trace:
      
      [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
      [23919.664742] BTRFS: Transaction aborted (error -2)
      [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
      [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
      [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
      [23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
      [23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
      [23919.696716] Call Trace:
      [23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
      [23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
      [23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
      [23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
      [23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
      [23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
      [23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
      [23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
      [23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
      [23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [23919.733330] ---[ end trace 166ef301a335832a ]---
      
      This is due to a race between device replace and chunk allocation, which
      the following diagram illustrates:
      
               CPU 1                                    CPU 2
      
       btrfs_dev_replace_finishing()
      
         at this point
          dev_replace->tgtdev->devid ==
          BTRFS_DEV_REPLACE_DEVID (0ULL)
      
         ...
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
                                                     btrfs_fallocate()
                                                       btrfs_alloc_data_chunk_ondemand()
                                                         btrfs_join_transaction()
                                                           --> starts a new transaction
                                                         do_chunk_alloc()
                                                           lock fs_info->chunk_mutex
                                                             btrfs_alloc_chunk()
                                                               --> creates extent map for
                                                                   the new chunk with
                                                                   em->bdev->map->stripes[i]->dev->devid
                                                                   == X (X > 0)
                                                               --> extent map is added to
                                                                   fs_info->mapping_tree
                                                               --> initial phase of bg A
                                                                   allocation completes
                                                           unlock fs_info->chunk_mutex
      
         lock fs_info->chunk_mutex
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates fs_info->mapping_tree and
               replaces the device in every extent
               map's map->stripes[] with
               dev_replace->tgtdev, which still has
               an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
      
                                                         btrfs_end_transaction()
                                                           btrfs_create_pending_block_groups()
                                                             --> starts final phase of
                                                                 bg A creation (update device,
                                                                 extent, and chunk trees, etc)
                                                             btrfs_finish_chunk_alloc()
      
                                                               btrfs_update_device()
                                                                 --> attempts to update a device
                                                                     item with ID == 0ULL
                                                                     (BTRFS_DEV_REPLACE_DEVID)
                                                                     which is the current ID of
                                                                     bg A's
                                                                     em->bdev->map->stripes[i]->dev->devid
                                                                 --> doesn't find such item
                                                                     returns -ENOENT
                                                                 --> the device id should have been X
                                                                     and not 0ULL
      
                                                             got -ENOENT from
                                                             btrfs_finish_chunk_alloc()
                                                             and aborts current transaction
      
         finishes setting up the target device,
         namely it sets tgtdev->devid to the value
         of srcdev->devid, which is X (and X > 0)
      
         frees the srcdev
      
         unlock fs_info->chunk_mutex
      
      So fix this by taking the device list mutex when processing the chunk's
      extent map stripes to update the device items. This avoids getting the
      wrong device id and use-after-free problems if the task finishing a
      chunk allocation grabs the replaced device, which is freed while the
      dev replace task is holding the device list mutex.
      
      This happened while running fstest btrfs/071.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      50460e37
  15. 10 12月, 2015 1 次提交
  16. 07 12月, 2015 1 次提交
  17. 03 12月, 2015 1 次提交
  18. 25 11月, 2015 4 次提交
    • H
      btrfs: fix balance range usage filters in 4.4-rc · dba72cb3
      Holger Hoffstätte 提交于
      There's a regression in 4.4-rc since commit bc309467
      (btrfs: extend balance filter usage to take minimum and maximum) in that
      existing (non-ranged) balance with -dusage=x no longer works; all chunks
      are skipped.
      
      After staring at the code for a while and wondering why a non-ranged
      balance would even need min and max thresholds (..which then were not
      set correctly, leading to the bug) I realized that the only problem
      was the fact that the filter functions were named wrong, thanks to
      patching copypasta. Simply renaming both functions lets the existing
      btrfs-progs call balance with -dusage=x and now the non-ranged filter
      function is invoked, properly using only a single chunk limit.
      Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dba72cb3
    • D
      btrfs: fix rcu warning during device replace · 31388ab2
      David Sterba 提交于
      The test btrfs/011 triggers a rcu warning
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.4.0-rc1-default+ #286 Tainted: G        W
      -------------------------------
      fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      4 locks held by btrfs/28786:
      
      0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
      1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
      2:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
      3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 28786 Comm: btrfs Tainted: G        W       4.4.0-rc1-default+ #286
      Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
      0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
      0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
      ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
      Call Trace:
      [<ffffffff8141d47b>] dump_stack+0x4f/0x74
      [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
      [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
      [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
      [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
      [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
      [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
      [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
      [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
      [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
      [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
      [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
      [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
      [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
      [<ffffffff811e3336>] ? __fget_light+0x86/0xb0
      [<ffffffff811e3369>] ? __fdget+0x9/0x20
      [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
      [<ffffffff811d7483>] SyS_ioctl+0x53/0x80
      [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This is because of unprotected use of rcu_dereference in
      btrfs_scratch_superblocks. We can't add rcu locks around the whole
      function because we read the superblock.
      
      The fix will use the rcu string buffer directly without the rcu locking.
      Thi is safe as the device will not go away in the meantime. We're
      holding the device list mutexes.
      
      Restructuring the code to narrow down the rcu section turned out to be
      impossible, we need to call filp_open (through update_dev_time) on the
      buffer and this could call kmalloc/__might_sleep. We could call kstrdup
      with GFP_ATOMIC but it's not absolutely necessary.
      
      Fixes: 12b1c263 (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31388ab2
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana 提交于
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible heigth (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However after adding the orphan item, writepages() can by called by the VM
      subsystem against the btree inode when we are under memory pressure, which
      causes writeback to start for the nodes we COWed before, this forces the
      operation to remove the free space item to COW again some (or all of) the
      same nodes (in the tree of tree roots). Even without writepages() being
      called, we could fail with ENOSPC because these items are located in
      multiple trees and one of them might have a higher heigth and require
      node/leaf splits at many levels, exhausting all the reserved space before
      removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7fd01182
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  19. 11 11月, 2015 1 次提交
    • Z
      btrfs: Fix lost-data-profile caused by balance bg · 2c9fe835
      Zhao Lei 提交于
      Reproduce:
       (In integration-4.3 branch)
      
       TEST_DEV=(/dev/vdg /dev/vdh)
       TEST_DIR=/mnt/tmp
      
       umount "$TEST_DEV" >/dev/null
       mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
      
       mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
       btrfs balance start -dusage=0 $TEST_DIR
       btrfs filesystem usage $TEST_DIR
      
       dd if=/dev/zero of="$TEST_DIR"/file count=100
       btrfs filesystem usage $TEST_DIR
      
      Result:
       We can see "no data chunk" in first "btrfs filesystem usage":
       # btrfs filesystem usage $TEST_DIR
       Overall:
          ...
       Metadata,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
          /dev/vdg      122.88MiB
          /dev/vdh      122.88MiB
       System,single: Size:4.00MiB, Used:0.00B
          /dev/vdg        4.00MiB
       System,RAID1: Size:8.00MiB, Used:16.00KiB
          /dev/vdg        8.00MiB
          /dev/vdh        8.00MiB
       Unallocated:
          /dev/vdg        1.06GiB
          /dev/vdh        1.07GiB
      
       And "data chunks changed from raid1 to single" in second
       "btrfs filesystem usage":
       # btrfs filesystem usage $TEST_DIR
       Overall:
          ...
       Data,single: Size:256.00MiB, Used:0.00B
          /dev/vdh      256.00MiB
       Metadata,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
          /dev/vdg      122.88MiB
          /dev/vdh      122.88MiB
       System,single: Size:4.00MiB, Used:0.00B
          /dev/vdg        4.00MiB
       System,RAID1: Size:8.00MiB, Used:16.00KiB
          /dev/vdg        8.00MiB
          /dev/vdh        8.00MiB
       Unallocated:
          /dev/vdg        1.06GiB
          /dev/vdh      841.92MiB
      
      Reason:
       btrfs balance delete last data chunk in case of no data in
       the filesystem, then we can see "no data chunk" by "fi usage"
       command.
      
       And when we do write operation to fs, the only available data
       profile is 0x0, result is all new chunks are allocated single type.
      
      Fix:
       Allocate a data chunk explicitly to ensure we don't lose the
       raid profile for data.
      
      Test:
       Test by above script, and confirmed the logic by debug output.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c9fe835
  20. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  21. 27 10月, 2015 2 次提交