1. 28 Apr, 2016 14 commits
  2. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
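
The rewrite rules listed in this commit can be approximated textually. The following is only a rough regex sketch in Python, not what coccinelle does (coccinelle matches on the parsed C expression, not on text); the function name `rewrite` and the sample input lines are made up for illustration.

```python
import re

# Identifier renames taken from the commit message.
RENAMES = {
    "PAGE_CACHE_SIZE": "PAGE_SIZE",
    "PAGE_CACHE_SHIFT": "PAGE_SHIFT",
    "PAGE_CACHE_MASK": "PAGE_MASK",
    "PAGE_CACHE_ALIGN": "PAGE_ALIGN",
    "page_cache_get": "get_page",
    "page_cache_release": "put_page",
}

# " << (PAGE_CACHE_SHIFT - PAGE_SHIFT)" and the ">>" variant are shifts
# by zero once both constants are equal, so they are dropped entirely.
NOOP_SHIFT = re.compile(
    r"\s*(?:<<|>>)\s*\(\s*PAGE_CACHE_SHIFT\s*-\s*PAGE_SHIFT\s*\)"
)
# Word boundaries keep longer identifiers that merely contain one of
# the names (for example page_cache_get_speculative) untouched.
RENAME = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")


def rewrite(line: str) -> str:
    """Apply the no-op shift removal first, then the plain renames."""
    line = NOOP_SHIFT.sub("", line)
    return RENAME.sub(lambda m: RENAMES[m.group(1)], line)


if __name__ == "__main__":
    print(rewrite("n = nr << (PAGE_CACHE_SHIFT - PAGE_SHIFT);"))
    print(rewrite("page_cache_release(page);"))
```

The shift removal has to run before the renames, otherwise `PAGE_CACHE_SHIFT - PAGE_SHIFT` would first turn into `PAGE_SHIFT - PAGE_SHIFT` and no longer match.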
  3. 14 Mar, 2016 1 commit
  4. 23 Feb, 2016 1 commit
    • Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo committed
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar problem that produces exactly the same warning, but even
      with that fix we still run into this one.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, under high memory pressure, tasks that hold dev_replace.lock can
      be interrupted by kswapd, which then tries to release memory occupied by
      superblocks, inodes and dentries. That may call evict_inode, which leads
      to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and that
      leads to an ABBA deadlock.
      
      To fix this, we can use the "blocking rwlock" approach used for extent_buffer, but
      things are simpler here since we only need to turn the read side's spinlock into a
      blocking lock.
      
      With this, btrfs/011 no longer produces warnings in dmesg.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 11 Feb, 2016 3 commits
  6. 30 Jan, 2016 1 commit
  7. 22 Jan, 2016 1 commit
  8. 20 Jan, 2016 4 commits
  9. 19 Jan, 2016 1 commit
  10. 16 Jan, 2016 2 commits
  11. 08 Jan, 2016 1 commit
    • Btrfs: fix fitrim discarding device area reserved for boot loader's use · 8cdc7c5b
      Filipe Manana committed
      As of the 4.3 kernel release, the fitrim ioctl can now discard any region
      of a disk that is not allocated to any chunk/block group, including the
      first megabyte which is used for our primary superblock and by the boot
      loader (grub for example).
      
      Fix this by not allowing to trim/discard any region in the device starting
      with an offset not greater than min(alloc_start_mount_option, 1Mb), just
      as it was not possible before 4.3.
      
      A reproducer test case for xfstests follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
        # reserved for a boot loader to use (GRUB for example) and btrfs should never
        # use them - neither for allocating metadata/data nor should trim/discard them.
        # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
        $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
        $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
      
        # Now mount the filesystem and perform a fitrim against it.
        _scratch_mount
        _require_batched_discard $SCRATCH_MNT
        $FSTRIM_PROG $SCRATCH_MNT
      
        # Now unmount the filesystem and verify the content of the ranges was not
        # modified (no trim/discard happened on them).
        _scratch_unmount
        echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
        od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
        od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
      
        status=0
        exit
      Reported-by: Vincent Petry <PVince81@yahoo.fr>
      Reported-by: Andrei Borzenkov <arvidjaar@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
      Fixes: 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
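
The idea of keeping discards out of the reserved area at the start of the device can be sketched as a simple range clamp. This Python fragment is only an illustration and not the kernel code: the function name `clamp_discard_range` and the `reserved_end` parameter are hypothetical, and the 1 MiB default is an assumption standing in for the boundary described in the commit message.

```python
SZ_1M = 1024 * 1024


def clamp_discard_range(start, length, reserved_end=SZ_1M):
    """Drop the part of [start, start + length) that lies below reserved_end.

    Returns the adjusted (start, length), or None when nothing outside
    the reserved area remains. reserved_end is a hypothetical parameter.
    """
    end = start + length
    new_start = max(start, reserved_end)
    if new_start >= end:
        return None
    return (new_start, end - new_start)


if __name__ == "__main__":
    # A free region covering the first 4 MiB loses its first megabyte.
    print(clamp_discard_range(0, 4 * SZ_1M))
    # A region that lies entirely inside the reserved area is skipped.
    print(clamp_discard_range(0, 64 * 1024))
```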
  12. 07 Jan, 2016 7 commits
    • Btrfs: Check metadata redundancy on balance · ee592d07
      Sam Tygier committed
      When converting a filesystem via balance, check that the metadata mode
      is at least as redundant as the data mode. For example, give a warning
      when:
      -dconvert=raid1 -mconvert=single
      Signed-off-by: Sam Tygier <samtygier@yahoo.co.uk>
      [ minor message reformatting ]
      Signed-off-by: David Sterba <dsterba@suse.com>
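
A check of this kind can be sketched by ranking each profile and comparing the two ranks. The Python fragment below is a minimal illustration, not the kernel's implementation: it assumes redundancy is measured by the number of device failures a profile tolerates, and the table and the function name `redundancy_warning` are made up for the example.

```python
# Assumed ranking: how many failed devices each profile can survive.
TOLERATED_FAILURES = {
    "single": 0,
    "dup": 0,
    "raid0": 0,
    "raid1": 1,
    "raid10": 1,
    "raid5": 1,
    "raid6": 2,
}


def redundancy_warning(data_profile, meta_profile):
    """Return a warning string if metadata is less redundant than data."""
    if TOLERATED_FAILURES[meta_profile] < TOLERATED_FAILURES[data_profile]:
        return (
            f"metadata profile {meta_profile} has lower redundancy "
            f"than data profile {data_profile}"
        )
    return None


if __name__ == "__main__":
    # Mirrors the example from the commit message:
    # -dconvert=raid1 -mconvert=single
    print(redundancy_warning("raid1", "single"))
```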
    • btrfs: cleanup, use enum values for btrfs_path reada · e4058b54
      David Sterba committed
      Replace the integers by enums for better readability. The value 2 does
      not have any meaning since a7175319
      "Btrfs: do less aggressive btree readahead" (2009-01-22).
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: use linux/sizes.h to represent constants · ee22184b
      Byongho Lee committed
      We use many constants to represent size and offset values.  To make the
      code readable we write '256 * 1024 * 1024' instead of '268435456' to
      represent '256MB'.  However, we can make it far more readable with
      'SZ_256M', which is defined in 'linux/sizes.h'.
      
      So this patch replaces 'xxx * 1024 * 1024' expressions with a single
      'SZ_xxxM' if 'xxx' is a power of 2, and with 'xxx * SZ_1M' if 'xxx' is
      not a power of 2. I haven't touched '4096' and '8192' because they are
      more intuitive than 'SZ_4K' and 'SZ_8K'.
      Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
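
The replacement rule from this commit (a single constant for powers of two, a product with the 1 MiB constant otherwise) can be sketched as a small helper. This is an illustrative Python fragment; it assumes the `SZ_<n>M` spelling used in linux/sizes.h, and the helper name `size_expr` is hypothetical.

```python
SZ_1M = 1 << 20  # 1 MiB


def size_expr(mib):
    """Render 'mib * 1024 * 1024' the way the commit describes."""
    is_power_of_two = mib > 0 and (mib & (mib - 1)) == 0
    if is_power_of_two:
        return f"SZ_{mib}M"
    return f"{mib} * SZ_1M"


if __name__ == "__main__":
    # 256 is a power of two, 100 is not.
    print(size_expr(256))
    print(size_expr(100))
    # The arithmetic from the commit message: 256 MiB in bytes.
    print(256 * SZ_1M == 268435456)
```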
    • btrfs: cleanup, remove stray return statements · 7928d672
      David Sterba committed
      Signed-off-by: David Sterba <dsterba@suse.com>
    • 93a3d467
    • btrfs: handle invalid num_stripes in sys_array · f5cdedd7
      David Sterba committed
      We can handle the special case of num_stripes == 0 directly inside
      btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
      catch other unhandled cases where we fail to validate external data.
      
      A crafted or corrupted image crashes at mount time:
      
      BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
      BTRFS info (device loop0): disk space caching is enabled
      BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
      Kernel panic - not syncing: BUG!
      CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
      Stack:
       637af890 60062489 602aeb2e 604192ba
       60387961 00000011 637af8a0 6038a835
       637af9c0 6038776b 634ef32b 00000000
      Call Trace:
       [<6001c86d>] show_stack+0xfe/0x15b
       [<6038a835>] dump_stack+0x2a/0x2c
       [<6038776b>] panic+0x13e/0x2b3
       [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
       [<601cfbbe>] open_ctree+0x192d/0x27af
       [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<600d710b>] do_mount+0xa35/0xbc9
       [<600d7557>] SyS_mount+0x95/0xc8
       [<6001e884>] handle_syscall+0x6b/0x8e
      Reported-by: Jiri Slaby <jslaby@suse.com>
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      CC: stable@vger.kernel.org	# 3.19+
      Signed-off-by: David Sterba <dsterba@suse.com>
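
The pattern here is to validate data read from disk before handing it to a helper that assumes it is sane. A minimal Python sketch under stated assumptions follows: the names `chunk_item_size` and `read_sys_array_entry`, the `-EIO` return value, and the byte sizes in the usage lines are illustrative and not taken from the kernel sources.

```python
import errno


def chunk_item_size(num_stripes, chunk_size, stripe_size):
    """Size of a chunk item, assuming the chunk struct embeds one stripe.

    Raises on num_stripes == 0, playing the role of the BUG_ON that
    catches callers which failed to validate external data.
    """
    if num_stripes == 0:
        raise ValueError("num_stripes must be non-zero")
    return chunk_size + stripe_size * (num_stripes - 1)


def read_sys_array_entry(num_stripes, chunk_size, stripe_size):
    """Reject an invalid on-disk value before computing the item size."""
    if num_stripes == 0:
        # Assumed error code: report a corrupted image instead of crashing.
        return -errno.EIO
    return chunk_item_size(num_stripes, chunk_size, stripe_size)


if __name__ == "__main__":
    # Hypothetical struct sizes, used only to make the example concrete.
    print(read_sys_array_entry(3, 80, 32))
    print(read_sys_array_entry(0, 80, 32))
```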
    • btrfs: Support convert to -d dup for btrfs-convert · c5ca8781
      Zhao Lei committed
      Since we will add support for -d dup on non-mixed filesystems, the
      kernel needs to support converting to this raid type.
      
      This patch removes the limitation for that case.
      
      Tested by following script:
      (combination of dup conversion with fsck):
      
      export TEST_DEV='/dev/vdc'
      export TEST_DIR='/var/ltf/tester/mnt'
      
      do_dup_test()
      {
          local m_from="$1"
          local d_from="$2"
          local m_to="$3"
          local d_to="$4"
      
          echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
      
          umount "$TEST_DIR" &>/dev/null
          ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
          mount "$TEST_DEV" "$TEST_DIR" || return 1
      
          cp -a /sbin/* "$TEST_DIR"
      
          [[ "$m_from" != "$m_to" ]] && {
              ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
          }
      
          [[ "$d_from" != "$d_to" ]] && {
              local opt=()
              [[ "$d_to" == single ]] && opt+=("-f")
              ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
          }
      
          umount "$TEST_DIR" || return 1
          ./btrfsck "$TEST_DEV" || return 1
          echo
      
          return 0
      }
      
      test_all()
      {
          for m_from in single dup; do
          for d_from in single dup; do
          for m_to in single dup; do
          for d_to in single dup; do
          do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
          done
          done
          done
          done
      }
      
      test_all
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 24 Dec, 2015 1 commit
  14. 17 Dec, 2015 1 commit
    • Btrfs: fix race when finishing dev replace leading to transaction abort · 50460e37
      Filipe Manana committed
      During the final phase of a device replace operation, I ran into a
      transaction abort that resulted in the following trace:
      
      [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
      [23919.664742] BTRFS: Transaction aborted (error -2)
      [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
      [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
      [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
      [23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
      [23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
      [23919.696716] Call Trace:
      [23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
      [23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
      [23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
      [23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
      [23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
      [23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
      [23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
      [23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
      [23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
      [23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [23919.733330] ---[ end trace 166ef301a335832a ]---
      
      This is due to a race between device replace and chunk allocation, which
      the following diagram illustrates:
      
               CPU 1                                    CPU 2
      
       btrfs_dev_replace_finishing()
      
         at this point
          dev_replace->tgtdev->devid ==
          BTRFS_DEV_REPLACE_DEVID (0ULL)
      
         ...
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
                                                     btrfs_fallocate()
                                                       btrfs_alloc_data_chunk_ondemand()
                                                         btrfs_join_transaction()
                                                           --> starts a new transaction
                                                         do_chunk_alloc()
                                                           lock fs_info->chunk_mutex
                                                             btrfs_alloc_chunk()
                                                               --> creates extent map for
                                                                   the new chunk with
                                                                   em->bdev->map->stripes[i]->dev->devid
                                                                   == X (X > 0)
                                                               --> extent map is added to
                                                                   fs_info->mapping_tree
                                                               --> initial phase of bg A
                                                                   allocation completes
                                                           unlock fs_info->chunk_mutex
      
         lock fs_info->chunk_mutex
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates fs_info->mapping_tree and
               replaces the device in every extent
               map's map->stripes[] with
               dev_replace->tgtdev, which still has
               an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
      
                                                         btrfs_end_transaction()
                                                           btrfs_create_pending_block_groups()
                                                             --> starts final phase of
                                                                 bg A creation (update device,
                                                                 extent, and chunk trees, etc)
                                                             btrfs_finish_chunk_alloc()
      
                                                               btrfs_update_device()
                                                                 --> attempts to update a device
                                                                     item with ID == 0ULL
                                                                     (BTRFS_DEV_REPLACE_DEVID)
                                                                     which is the current ID of
                                                                     bg A's
                                                                     em->bdev->map->stripes[i]->dev->devid
                                                                 --> doesn't find such item
                                                                     returns -ENOENT
                                                                 --> the device id should have been X
                                                                     and not 0ULL
      
                                                             got -ENOENT from
                                                             btrfs_finish_chunk_alloc()
                                                             and aborts current transaction
      
         finishes setting up the target device,
         namely it sets tgtdev->devid to the value
         of srcdev->devid, which is X (and X > 0)
      
         frees the srcdev
      
         unlock fs_info->chunk_mutex
      
      So fix this by taking the device list mutex when processing the chunk's
      extent map stripes to update the device items. This avoids getting the
      wrong device id and use-after-free problems if the task finishing a
      chunk allocation grabs the replaced device, which is freed while the
      dev replace task is holding the device list mutex.
      
      This happened while running fstest btrfs/071.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
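
The interleaving in the diagram, and the effect of taking the device list mutex on the chunk allocation side, can be modelled with two threads. This is a toy Python model only, not kernel code: the `Device` class, the device id 5, and the events used to force the racy ordering are all assumptions made for illustration.

```python
import threading

BTRFS_DEV_REPLACE_DEVID = 0  # placeholder id carried by the target device


class Device:
    def __init__(self, devid):
        self.devid = devid


def observed_devid(finisher_takes_mutex):
    """Return the devid the chunk allocation side reads from the stripe."""
    device_list_mutex = threading.Lock()
    swapped = threading.Event()    # stripe already points at the target
    read_done = threading.Event()  # unlocked reader has sampled the devid
    src, tgt = Device(5), Device(BTRFS_DEV_REPLACE_DEVID)
    stripe = {"dev": src}
    seen = []

    def dev_replace_finishing():
        with device_list_mutex:
            stripe["dev"] = tgt      # mapping now uses tgtdev, id still 0
            swapped.set()
            if not finisher_takes_mutex:
                read_done.wait()     # force the racy interleaving
            tgt.devid = src.devid    # target finally gets the real id

    def finish_chunk_alloc():
        swapped.wait()
        if finisher_takes_mutex:
            with device_list_mutex:  # blocks until the replace is done
                seen.append(stripe["dev"].devid)
        else:
            seen.append(stripe["dev"].devid)
            read_done.set()

    threads = [
        threading.Thread(target=dev_replace_finishing),
        threading.Thread(target=finish_chunk_alloc),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen[0]


if __name__ == "__main__":
    print(observed_devid(finisher_takes_mutex=False))
    print(observed_devid(finisher_takes_mutex=True))
```

Without the mutex the reader samples the stripe inside the window where the target device still carries the placeholder id; with the mutex it cannot run until the replace side has assigned the real id.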
  15. 10 Dec, 2015 1 commit