1. 07 1月, 2016 3 次提交
    • D
      93a3d467
    • D
      btrfs: handle invalid num_stripes in sys_array · f5cdedd7
      David Sterba 提交于
      We can handle the special case of num_stripes == 0 directly inside
      btrfs_read_sys_array. The BUG_ON in btrfs_chunk_item_size is there to
      catch other unhandled cases where we fail to validate external data.
      
      A crafted or corrupted image crashes at mount time:
      
      BTRFS: device fsid 9006933e-2a9a-44f0-917f-514252aeec2c devid 1 transid 7 /dev/loop0
      BTRFS info (device loop0): disk space caching is enabled
      BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!
      Kernel panic - not syncing: BUG!
      CPU: 0 PID: 313 Comm: mount Not tainted 4.2.5-00657-ge047887-dirty #25
      Stack:
       637af890 60062489 602aeb2e 604192ba
       60387961 00000011 637af8a0 6038a835
       637af9c0 6038776b 634ef32b 00000000
      Call Trace:
       [<6001c86d>] show_stack+0xfe/0x15b
       [<6038a835>] dump_stack+0x2a/0x2c
       [<6038776b>] panic+0x13e/0x2b3
       [<6020f099>] btrfs_read_sys_array+0x25d/0x2ff
       [<601cfbbe>] open_ctree+0x192d/0x27af
       [<6019c2c1>] btrfs_mount+0x8f5/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<6019bcb0>] btrfs_mount+0x2e4/0xb9a
       [<600bc9a7>] mount_fs+0x11/0xf3
       [<600d5167>] vfs_kern_mount+0x75/0x11a
       [<600d710b>] do_mount+0xa35/0xbc9
       [<600d7557>] SyS_mount+0x95/0xc8
       [<6001e884>] handle_syscall+0x6b/0x8e
      Reported-by: NJiri Slaby <jslaby@suse.com>
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      CC: stable@vger.kernel.org	# 3.19+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f5cdedd7
    • Z
      btrfs: Support convert to -d dup for btrfs-convert · c5ca8781
      Zhao Lei 提交于
      Since we will add support for -d dup for non-mixed filesystem,
      kernel need to support converting to this raid-type.
      
      This patch remove limitation of above case.
      
      Tested by following script:
      (combination of dup conversion with fsck):
      
      export TEST_DEV='/dev/vdc'
      export TEST_DIR='/var/ltf/tester/mnt'
      
      do_dup_test()
      {
          local m_from="$1"
          local d_from="$2"
          local m_to="$3"
          local d_to="$4"
      
          echo "Convert from -m $m_from -d $d_from to -m $m_to -d $d_to"
      
          umount "$TEST_DIR" &>/dev/null
          ./mkfs.btrfs -f -m "$m_from" -d "$d_from" "$TEST_DEV" >/dev/null || return 1
          mount "$TEST_DEV" "$TEST_DIR" || return 1
      
          cp -a /sbin/* "$TEST_DIR"
      
          [[ "$m_from" != "$m_to" ]] && {
              ./btrfs balance start -f -mconvert="$m_to" "$TEST_DIR" || return 1
          }
      
          [[ "$d_from" != "$d_to" ]] && {
      	local opt=()
      	[[ "$d_to" == single ]] && opt+=("-f")
              ./btrfs balance start "${opt[@]}" -dconvert="$d_to" "$TEST_DIR" || return 1
          }
      
          umount "$TEST_DIR" || return 1
          ./btrfsck "$TEST_DEV" || return 1
          echo
      
          return 0
      }
      
      test_all()
      {
          for m_from in single dup; do
          for d_from in single dup; do
          for m_to in single dup; do
          for d_to in single dup; do
          do_dup_test "$m_from" "$d_from" "$m_to" "$d_to" || return 1
          done
          done
          done
          done
      }
      
      test_all
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5ca8781
  2. 10 12月, 2015 1 次提交
  3. 25 11月, 2015 4 次提交
    • H
      btrfs: fix balance range usage filters in 4.4-rc · dba72cb3
      Holger Hoffstätte 提交于
      There's a regression in 4.4-rc since commit bc309467
      (btrfs: extend balance filter usage to take minimum and maximum) in that
      existing (non-ranged) balance with -dusage=x no longer works; all chunks
      are skipped.
      
      After staring at the code for a while and wondering why a non-ranged
      balance would even need min and max thresholds (..which then were not
      set correctly, leading to the bug) I realized that the only problem
      was the fact that the filter functions were named wrong, thanks to
      patching copypasta. Simply renaming both functions lets the existing
      btrfs-progs call balance with -dusage=x and now the non-ranged filter
      function is invoked, properly using only a single chunk limit.
      Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dba72cb3
    • D
      btrfs: fix rcu warning during device replace · 31388ab2
      David Sterba 提交于
      The test btrfs/011 triggers a rcu warning
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.4.0-rc1-default+ #286 Tainted: G        W
      -------------------------------
      fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      4 locks held by btrfs/28786:
      
      0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
      1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
      2:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
      3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 28786 Comm: btrfs Tainted: G        W       4.4.0-rc1-default+ #286
      Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
      0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
      0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
      ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
      Call Trace:
      [<ffffffff8141d47b>] dump_stack+0x4f/0x74
      [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
      [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
      [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
      [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
      [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
      [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
      [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
      [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
      [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
      [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
      [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
      [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
      [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
      [<ffffffff811e3336>] ? __fget_light+0x86/0xb0
      [<ffffffff811e3369>] ? __fdget+0x9/0x20
      [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
      [<ffffffff811d7483>] SyS_ioctl+0x53/0x80
      [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This is because of unprotected use of rcu_dereference in
      btrfs_scratch_superblocks. We can't add rcu locks around the whole
      function because we read the superblock.
      
      The fix will use the rcu string buffer directly without the rcu locking.
      Thi is safe as the device will not go away in the meantime. We're
      holding the device list mutexes.
      
      Restructuring the code to narrow down the rcu section turned out to be
      impossible, we need to call filp_open (through update_dev_time) on the
      buffer and this could call kmalloc/__might_sleep. We could call kstrdup
      with GFP_ATOMIC but it's not absolutely necessary.
      
      Fixes: 12b1c263 (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31388ab2
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana 提交于
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible heigth (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However after adding the orphan item, writepages() can by called by the VM
      subsystem against the btree inode when we are under memory pressure, which
      causes writeback to start for the nodes we COWed before, this forces the
      operation to remove the free space item to COW again some (or all of) the
      same nodes (in the tree of tree roots). Even without writepages() being
      called, we could fail with ENOSPC because these items are located in
      multiple trees and one of them might have a higher heigth and require
      node/leaf splits at many levels, exhausting all the reserved space before
      removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7fd01182
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  4. 11 11月, 2015 1 次提交
    • Z
      btrfs: Fix lost-data-profile caused by balance bg · 2c9fe835
      Zhao Lei 提交于
      Reproduce:
       (In integration-4.3 branch)
      
       TEST_DEV=(/dev/vdg /dev/vdh)
       TEST_DIR=/mnt/tmp
      
       umount "$TEST_DEV" >/dev/null
       mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
      
       mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
       btrfs balance start -dusage=0 $TEST_DIR
       btrfs filesystem usage $TEST_DIR
      
       dd if=/dev/zero of="$TEST_DIR"/file count=100
       btrfs filesystem usage $TEST_DIR
      
      Result:
       We can see "no data chunk" in first "btrfs filesystem usage":
       # btrfs filesystem usage $TEST_DIR
       Overall:
          ...
       Metadata,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
          /dev/vdg      122.88MiB
          /dev/vdh      122.88MiB
       System,single: Size:4.00MiB, Used:0.00B
          /dev/vdg        4.00MiB
       System,RAID1: Size:8.00MiB, Used:16.00KiB
          /dev/vdg        8.00MiB
          /dev/vdh        8.00MiB
       Unallocated:
          /dev/vdg        1.06GiB
          /dev/vdh        1.07GiB
      
       And "data chunks changed from raid1 to single" in second
       "btrfs filesystem usage":
       # btrfs filesystem usage $TEST_DIR
       Overall:
          ...
       Data,single: Size:256.00MiB, Used:0.00B
          /dev/vdh      256.00MiB
       Metadata,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
          /dev/vdg      122.88MiB
          /dev/vdh      122.88MiB
       System,single: Size:4.00MiB, Used:0.00B
          /dev/vdg        4.00MiB
       System,RAID1: Size:8.00MiB, Used:16.00KiB
          /dev/vdg        8.00MiB
          /dev/vdh        8.00MiB
       Unallocated:
          /dev/vdg        1.06GiB
          /dev/vdh      841.92MiB
      
      Reason:
       btrfs balance delete last data chunk in case of no data in
       the filesystem, then we can see "no data chunk" by "fi usage"
       command.
      
       And when we do write operation to fs, the only available data
       profile is 0x0, result is all new chunks are allocated single type.
      
      Fix:
       Allocate a data chunk explicitly to ensure we don't lose the
       raid profile for data.
      
      Test:
       Test by above script, and confirmed the logic by debug output.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c9fe835
  5. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  6. 27 10月, 2015 3 次提交
  7. 22 10月, 2015 4 次提交
  8. 11 10月, 2015 1 次提交
  9. 08 10月, 2015 3 次提交
  10. 02 10月, 2015 1 次提交
  11. 01 10月, 2015 3 次提交
  12. 29 9月, 2015 5 次提交
  13. 01 9月, 2015 1 次提交
    • Z
      btrfs: Add raid56 support for updating · 943c6e99
      Zhao Lei 提交于
       num_tolerated_disk_barrier_failures in btrfs_balance
      
      Code for updating fs_info->num_tolerated_disk_barrier_failures in
      btrfs_balance() lacks raid56 support.
      
      Reason:
       Above code was wroten in 2012-08-01, together with
       btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.
      
       Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
       later to support raid56, but code in btrfs_balance() was not
       updated together.
      
      Fix:
       Merge above similar code to a common function:
       btrfs_get_num_tolerated_disk_barrier_failures()
       and make it support both case.
      
       It can fix this bug with a bonus of cleanup, and make these code
       never in above no-sync state from now on.
      Suggested-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      943c6e99
  14. 22 8月, 2015 1 次提交
    • C
      btrfs: fix compile when block cgroups are not enabled · 3a9508b0
      Chris Mason 提交于
      bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
      This adds an ifdef around them.  It's not perfect, but our
      use of bi_ioc is being removed in the 4.3 merge window.
      
      The bi_css usage really should go into bio_clone, but I want to make
      sure that doesn't introduce problems for other bio_clone use cases.
      Signed-off-by: NChris Mason <clm@fb.com>
      3a9508b0
  15. 20 8月, 2015 1 次提交
    • M
      btrfs: use __GFP_NOFAIL in alloc_btrfs_bio · 277fb5fc
      Michal Hocko 提交于
      alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
      transaction but this allocation context is rather weak wrt. reclaim
      capabilities. The page allocator currently tries hard to not fail these
      allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
      still fail if the _current_ process is the OOM killer victim. Moreover
      there is an attempt to move away from the default no-fail behavior and
      allow these allocation to fail more eagerly. This would lead to:
      
      [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
      
      which is clearly undesirable and the nofail behavior should be explicit
      if the allocation failure cannot be tolerated.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      277fb5fc
  16. 14 8月, 2015 1 次提交
  17. 09 8月, 2015 3 次提交
    • C
      Btrfs: add support for blkio controllers · da2f0f74
      Chris Mason 提交于
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required, we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is able to throttle nicely on single disk filesystems.  A
      little more work is required for multi-device filesystems.
      Signed-off-by: NChris Mason <clm@fb.com>
      da2f0f74
    • Z
      btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk() · dc2ee4e2
      Zhaolei 提交于
      Remove chunk_objectid argument from btrfs_relocate_chunk() because
      it is not necessary, it can also cleanup some code in caller for
      prepare its value.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dc2ee4e2
    • Z
      btrfs: Fix data checksum error cause by replace with io-load. · 55e3a601
      Zhaolei 提交于
      xfstests btrfs/070 sometimes failed.
      In my test machine, its fail rate is about 30%.
      In another vm(vmware), its fail rate is about 50%.
      
      Reason:
        btrfs/070 do replace and defrag with fsstress simultaneously,
        after above operation, checksum error is found by scrub.
      
        Actually, it have no relationship with defrag operation, only
        replace with fsstress can trigger this bug.
      
        New data writen to target device have possibility rewrited by
        old data from source device by replace code in debug, to avoid
        above problem, we can set target block group to readonly in
        replace period, so new data requested by other operation will
        not write to same place with replace code.
      
        Before patch(4.1-rc3):
          30% failed in 100 xfstests.
        After patch:
          0% failed in 300 xfstests.
      
      It also happened in btrfs/071 as it's another scrub with IO load tests.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      55e3a601
  18. 29 7月, 2015 2 次提交
    • J
      btrfs: iterate over unused chunk space in FITRIM · 499f377f
      Jeff Mahoney 提交于
      Since we now clean up block groups automatically as they become
      empty, iterating over block groups is no longer sufficient to discard
      unused space.
      
      This patch iterates over the unused chunk space and discards any regions
      that are unallocated, regardless of whether they were ever used.  This is
      a change for btrfs but is consistent with other file systems.
      
      We do this in a transactionless manner since the discard process can take
      a substantial amount of time and a transaction would need to be started
      before the acquisition of the device list lock.  That would mean a
      transaction would be held open across /all/ of the discards collectively.
      In order to prevent other threads from allocating or freeing chunks, we
      hold the chunks lock across the search and discard calls.  We release it
      between searches to allow the file system to perform more-or-less
      normally.  Since the running transaction can commit and disappear while
      we're using the transaction pointer, we take a reference to it and
      release it after the search.  This is safe since it would happen normally
      at the end of the transaction commit after any locks are released anyway.
      We also take the commit_root_sem to protect against a transaction starting
      and committing while we're running.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      499f377f
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  19. 01 7月, 2015 1 次提交
    • F
      Btrfs: fix race between balance and unused block group deletion · 67c5e7d4
      Filipe Manana 提交于
      We have a race between deleting an unused block group and balancing the
      same block group that leads to an assertion failure/BUG(), producing the
      following trace:
      
      [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
      [181631.220591] ------------[ cut here ]------------
      [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
      [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
      [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
      [181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
      [181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
      [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
      [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
      [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
      [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
      [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
      [181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
      [181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
      [181631.224566] Stack:
      [181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
      [181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
      [181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
      [181631.224566] Call Trace:
      [181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
      [181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
      [181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
      [181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
      [181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      [181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
      [181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
      [181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
      [181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
      [181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
      (...)
      
      The sequence of steps leading to this are:
      
                 CPU 0                                         CPU 1
      
        btrfs_balance()
          btrfs_relocate_chunk()
      
            btrfs_relocate_block_group(bg X)
              btrfs_lookup_block_group(bg X)
      
                                                     cleaner_kthread
                                                        locks fs_info->cleaner_mutex
      
                                                        btrfs_delete_unused_bgs()
                                                          finds bg X, which became
                                                          unused in the previous
                                                          transaction
      
                                                          checks bg X ->ro == 0,
                                                          so it proceeds
              sets bg X ->ro to 1
              (btrfs_set_block_group_ro(bg X))
      
              blocks on fs_info->cleaner_mutex
                                                          btrfs_remove_chunk(bg X)
                                                        unlocks fs_info->cleaner_mutex
      
              acquires fs_info->cleaner_mutex
              relocate_block_group()
                --> does nothing, no extents found in
                    the extent tree from bg X
              unlocks fs_info->cleaner_mutex
      
            btrfs_relocate_block_group(bg X) returns
      
          btrfs_remove_chunk(bg X)
             extent map not found
                --> ASSERT(0)
      
      Fix this by using a new mutex to make sure these 2 operations, block
      group relocation and removal, are serialized.
      
      This issue is reproducible by running fstests generic/038 (which stresses
      chunk allocation and automatic removal of unused block groups) together
      with the following balance loop:
      
          while true; do btrfs balance start -dusage=0 <mountpoint> ; done
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      67c5e7d4