1. 28 4月, 2016 2 次提交
    • A
      btrfs: introduce device delete by devid · 6b526ed7
      Anand Jain 提交于
      This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct
      btrfs_ioctl_vol_args_v2 to carry devid as an user argument.
      
      The patch won't delete the old ioctl interface and so kernel remains
      backward compatible with user land progs.
      
      Test case/script:
      echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
      mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
      mount /dev/sdd /btrfs
      dmsetup suspend bad_disk
      echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
      dmsetup resume bad_disk
      echo "bad disk failed. now deleting/replacing"
      btrfs dev del  3  /btrfs
      echo $?
      btrfs fi show /btrfs
      umount /btrfs
      btrfs-show-super /dev/sdd | egrep num_device
      dmsetup remove bad_disk
      wipefs -a /dev/sdf
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reported-by: NMartin <m_btrfs@ml1.co.uk>
      [ adjust messages, s/disk/device/ ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6b526ed7
    • A
      btrfs: create helper btrfs_find_device_by_user_input() · 24e0474b
      Anand Jain 提交于
      The patch renames btrfs_dev_replace_find_srcdev() to
      btrfs_find_device_by_user_input() and moves it to volumes.c, so that
      delete device can use it.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      24e0474b
  2. 07 1月, 2016 1 次提交
    • B
      Btrfs: use linux/sizes.h to represent constants · ee22184b
      Byongho Lee 提交于
      We use many constants to represent size and offset value.  And to make
      code readable we use '256 * 1024 * 1024' instead of '268435456' to
      represent '256MB'.  However we can make far more readable with 'SZ_256MB'
      which is defined in the 'linux/sizes.h'.
      
      So this patch replaces 'xxx * 1024 * 1024' kind of expression with
      single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is
      not a power of 2. And I haven't touched to '4096' & '8192' because it's
      more intuitive than 'SZ_4KB' & 'SZ_8KB'.
      Signed-off-by: NByongho Lee <bhlee.kernel@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee22184b
  3. 03 12月, 2015 1 次提交
  4. 25 11月, 2015 1 次提交
  5. 27 10月, 2015 5 次提交
  6. 22 10月, 2015 2 次提交
  7. 14 10月, 2015 1 次提交
    • D
      btrfs: check unsupported filters in balance arguments · 8eb93459
      David Sterba 提交于
      We don't verify that all the balance filter arguments supplemented by
      the flags are actually known to the kernel. Thus we let it silently pass
      and do nothing.
      
      At the moment this means only the 'limit' filter, but we're going to add
      a few more soon so it's better to have that fixed. Also in older stable
      kernels so that it works with newer userspace tools.
      
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eb93459
  8. 02 10月, 2015 1 次提交
  9. 01 10月, 2015 1 次提交
    • A
      Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks · 12b1c263
      Anand Jain 提交于
      This patch updates and renames btrfs_scratch_superblocks, (which is used
      by the replace device thread), with those fixes from the scratch
      superblock code section of btrfs_rm_device(). The fixes are:
        Scratch all copies of superblock
        Notify kobject that superblock has been changed
        Update time on the device
      
      So that btrfs_rm_device() can use the function
      btrfs_scratch_superblocks() instead of its own scratch code. And further
      replace deivce code which similarly releases device back to the system,
      will have the fixes from the btrfs device delete.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      [renamed to btrfs_scratch_superblock]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      12b1c263
  10. 29 9月, 2015 1 次提交
  11. 29 7月, 2015 1 次提交
    • J
      btrfs: iterate over unused chunk space in FITRIM · 499f377f
      Jeff Mahoney 提交于
      Since we now clean up block groups automatically as they become
      empty, iterating over block groups is no longer sufficient to discard
      unused space.
      
      This patch iterates over the unused chunk space and discards any regions
      that are unallocated, regardless of whether they were ever used.  This is
      a change for btrfs but is consistent with other file systems.
      
      We do this in a transactionless manner since the discard process can take
      a substantial amount of time and a transaction would need to be started
      before the acquisition of the device list lock.  That would mean a
      transaction would be held open across /all/ of the discards collectively.
      In order to prevent other threads from allocating or freeing chunks, we
      hold the chunks lock across the search and discard calls.  We release it
      between searches to allow the file system to perform more-or-less
      normally.  Since the running transaction can commit and disappear while
      we're using the transaction pointer, we take a reference to it and
      release it after the search.  This is safe since it would happen normally
      at the end of the transaction commit after any locks are released anyway.
      We also take the commit_root_sem to protect against a transaction starting
      and committing while we're running.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      499f377f
  12. 27 5月, 2015 3 次提交
  13. 22 5月, 2015 1 次提交
    • M
      block: remove management of bi_remaining when restoring original bi_end_io · 326e1dbb
      Mike Snitzer 提交于
      Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
      non-chains") regressed all existing callers that followed this pattern:
       1) saving a bio's original bi_end_io
       2) wiring up an intermediate bi_end_io
       3) restoring the original bi_end_io from intermediate bi_end_io
       4) calling bio_endio() to execute the restored original bi_end_io
      
      The regression was due to BIO_CHAIN only ever getting set if
      bio_inc_remaining() is called.  For the above pattern it isn't set until
      step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
      the first bio_endio(), in step 2 above, never decremented __bi_remaining
      before calling the intermediate bi_end_io -- leaving __bi_remaining with
      the value 1 instead of 0.  When bio_inc_remaining() occurred during step
      3 it brought it to a value of 2.  When the second bio_endio() was
      called, in step 4 above, it should've called the original bi_end_io but
      it didn't because there was an extra reference that wasn't dropped (due
      to atomic operations being optimized away since BIO_CHAIN wasn't set
      upfront).
      
      Fix this issue by removing the __bi_remaining management complexity for
      all callers that use the above pattern -- bio_chain() is the only
      interface that _needs_ to be concerned with __bi_remaining.  For the
      above pattern callers just expect the bi_end_io they set to get called!
      Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
      that aren't associated with the bio_chain() interface.
      
      Also, the bio_inc_remaining() interface has been moved local to bio.c.
      
      Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      326e1dbb
  14. 17 2月, 2015 1 次提交
  15. 22 1月, 2015 3 次提交
  16. 03 12月, 2014 3 次提交
    • F
      Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana 提交于
      Our fs trim operation, which is completely transactionless (doesn't start
      or joins an existing transaction) consists of visiting all block groups
      and then for each one to iterate its free space entries and perform a
      discard operation against the space range represented by the free space
      entries. However before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, remove the extent map that maps
      logical address to physical addresses and the corresponding chunk metadata
      from the the chunk and device trees. After that and before the discard
      completes, the current running transaction can finish and a new one start,
      allowing for new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      04216820
    • M
      Btrfs, replace: write dirty pages into the replace target device · 2c8cdd6e
      Miao Xie 提交于
      The implementation is simple:
      - In order to avoid changing the code logic of btrfs_map_bio and
        RAID56, we add the stripes of the replace target devices at the
        end of the stripe array in btrfs bio, and we sort those target
        device stripes in the array. And we keep the number of the target
        device stripes in the btrfs bio.
      - Except write operation on RAID56, all the other operation don't
        take the target device stripes into account.
      - When we do write operation, we read the data from the common devices
        and calculate the parity. Then write the dirty data and new parity
        out, at this time, we will find the relative replace target stripes
        and wirte the relative data into it.
      
      Note: The function that copying old data on the source device to
      the target device was implemented in the past, it is similar to
      the other RAID type.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      2c8cdd6e
    • M
      Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted · af8e2d1d
      Miao Xie 提交于
      This patch implement the RAID5/6 common data repair function, the
      implementation is similar to the scrub on the other RAID such as
      RAID1, the differentia is that we don't read the data from the
      mirror, we use the data repair function of RAID5/6.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      af8e2d1d
  17. 25 11月, 2014 1 次提交
    • Q
      btrfs: Fix a lockdep warning when running xfstest. · 084b6e7c
      Qu Wenruo 提交于
      The following lockdep warning is triggered during xfstests:
      
      [ 1702.980872] =========================================================
      [ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
      [ 1702.981482] 3.18.0-rc1 #27 Not tainted
      [ 1702.981781] ---------------------------------------------------------
      [ 1702.982095] kswapd0/77 just changed the state of lock:
      [ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
      [ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1702.984675]
      other info that might help us debug this:
      [ 1702.985524] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1702.986799]  Possible interrupt unsafe locking scenario:
      
      [ 1702.987681]        CPU0                    CPU1
      [ 1702.988137]        ----                    ----
      [ 1702.988598]   lock(&fs_info->dev_replace.lock);
      [ 1702.989069]                                local_irq_disable();
      [ 1702.989534]                                lock(&delayed_node->mutex);
      [ 1702.990038]                                lock(&found->groups_sem);
      [ 1702.990494]   <Interrupt>
      [ 1702.990938]     lock(&delayed_node->mutex);
      [ 1702.991407]
       *** DEADLOCK ***
      
      It is because the btrfs_kobj_{add/rm}_device() will call memory
      allocation with GFP_KERNEL,
      which may flush fs page cache to free space, waiting for it self to do
      the commit, causing the deadlock.
      
      To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
      dev_replace lock range, also involing split the
      btrfs_rm_dev_replace_srcdev() function into remove and free parts.
      
      Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
      lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
      called out of the lock range.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      084b6e7c
  18. 23 9月, 2014 1 次提交
    • J
      Btrfs: remove empty block groups automatically · 47ab2a6c
      Josef Bacik 提交于
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting or on mount notice that a block
      group has a used count of 0 then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      47ab2a6c
  19. 18 9月, 2014 10 次提交