1. 24 12月, 2015 1 次提交
  2. 17 12月, 2015 1 次提交
    • F
      Btrfs: fix race when finishing dev replace leading to transaction abort · 50460e37
      Filipe Manana 提交于
      During the final phase of a device replace operation, I ran into a
      transaction abort that resulted in the following trace:
      
      [23919.655368] WARNING: CPU: 10 PID: 30175 at fs/btrfs/extent-tree.c:9843 btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]()
      [23919.664742] BTRFS: Transaction aborted (error -2)
      [23919.665749] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix4 parport psmouse acpi_cpufreq processor i2c_core evdev microcode pcspkr button serio_raw ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic ata_piix virtio_pci floppy virtio_ring libata e1000 virtio scsi_mod [last unloaded: btrfs]
      [23919.679442] CPU: 10 PID: 30175 Comm: fsstress Not tainted 4.3.0-rc5-btrfs-next-17+ #1
      [23919.682392] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [23919.689151]  0000000000000000 ffff8804020cbb50 ffffffff812566f4 ffff8804020cbb98
      [23919.692604]  ffff8804020cbb88 ffffffff8104d0a6 ffffffffa03eea69 ffff88041b678a48
      [23919.694230]  ffff88042ac38000 ffff88041b678930 00000000fffffffe ffff8804020cbbf0
      [23919.696716] Call Trace:
      [23919.698669]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [23919.700597]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [23919.701958]  [<ffffffffa03eea69>] ? btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.703612]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
      [23919.705047]  [<ffffffffa03eea69>] btrfs_create_pending_block_groups+0x15e/0x1ab [btrfs]
      [23919.706967]  [<ffffffffa0402097>] __btrfs_end_transaction+0x84/0x2dd [btrfs]
      [23919.708611]  [<ffffffffa0402300>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [23919.710099]  [<ffffffffa03ef0b8>] btrfs_alloc_data_chunk_ondemand+0x121/0x28b [btrfs]
      [23919.711970]  [<ffffffffa0413025>] btrfs_fallocate+0x7d3/0xc6d [btrfs]
      [23919.713602]  [<ffffffff8108b78f>] ? lock_acquire+0x10d/0x194
      [23919.714756]  [<ffffffff81086dbc>] ? percpu_down_read+0x51/0x78
      [23919.716155]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.718918]  [<ffffffff8116ef1d>] ? __sb_start_write+0x5f/0xb0
      [23919.724170]  [<ffffffff8116b579>] vfs_fallocate+0x170/0x1ff
      [23919.725482]  [<ffffffff8117c1d7>] ioctl_preallocate+0x89/0x9b
      [23919.726790]  [<ffffffff8117c5ef>] do_vfs_ioctl+0x406/0x4e6
      [23919.728428]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [23919.729642]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [23919.730782]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [23919.731847]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [23919.733330] ---[ end trace 166ef301a335832a ]---
      
      This is due to a race between device replace and chunk allocation, which
      the following diagram illustrates:
      
               CPU 1                                    CPU 2
      
       btrfs_dev_replace_finishing()
      
         at this point
          dev_replace->tgtdev->devid ==
          BTRFS_DEV_REPLACE_DEVID (0ULL)
      
         ...
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
                                                     btrfs_fallocate()
                                                       btrfs_alloc_data_chunk_ondemand()
                                                         btrfs_join_transaction()
                                                           --> starts a new transaction
                                                         do_chunk_alloc()
                                                           lock fs_info->chunk_mutex
                                                             btrfs_alloc_chunk()
                                                               --> creates extent map for
                                                                   the new chunk with
                                                                   em->bdev->map->stripes[i]->dev->devid
                                                                   == X (X > 0)
                                                               --> extent map is added to
                                                                   fs_info->mapping_tree
                                                               --> initial phase of bg A
                                                                   allocation completes
                                                           unlock fs_info->chunk_mutex
      
         lock fs_info->chunk_mutex
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates fs_info->mapping_tree and
               replaces the device in every extent
               map's map->stripes[] with
               dev_replace->tgtdev, which still has
               an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
      
                                                         btrfs_end_transaction()
                                                           btrfs_create_pending_block_groups()
                                                             --> starts final phase of
                                                                 bg A creation (update device,
                                                                 extent, and chunk trees, etc)
                                                             btrfs_finish_chunk_alloc()
      
                                                               btrfs_update_device()
                                                                 --> attempts to update a device
                                                                     item with ID == 0ULL
                                                                     (BTRFS_DEV_REPLACE_DEVID)
                                                                     which is the current ID of
                                                                     bg A's
                                                                     em->bdev->map->stripes[i]->dev->devid
                                                                 --> doesn't find such item
                                                                     returns -ENOENT
                                                                 --> the device id should have been X
                                                                     and not 0ULL
      
                                                             got -ENOENT from
                                                             btrfs_finish_chunk_alloc()
                                                             and aborts current transaction
      
         finishes setting up the target device,
         namely it sets tgtdev->devid to the value
         of srcdev->devid, which is X (and X > 0)
      
         frees the srcdev
      
         unlock fs_info->chunk_mutex
      
      So fix this by taking the device list mutex when processing the chunk's
      extent map stripes to update the device items. This avoids getting the
      wrong device id and use-after-free problems if the task finishing a
      chunk allocation grabs the replaced device, which is freed while the
      dev replace task is holding the device list mutex.
      
      This happened while running fstest btrfs/071.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      50460e37
  3. 10 12月, 2015 1 次提交
  4. 07 12月, 2015 1 次提交
  5. 03 12月, 2015 1 次提交
  6. 25 11月, 2015 4 次提交
    • H
      btrfs: fix balance range usage filters in 4.4-rc · dba72cb3
      Holger Hoffstätte 提交于
      There's a regression in 4.4-rc since commit bc309467
      (btrfs: extend balance filter usage to take minimum and maximum) in that
      existing (non-ranged) balance with -dusage=x no longer works; all chunks
      are skipped.
      
      After staring at the code for a while and wondering why a non-ranged
      balance would even need min and max thresholds (..which then were not
      set correctly, leading to the bug) I realized that the only problem
      was the fact that the filter functions were named wrong, thanks to
      patching copypasta. Simply renaming both functions lets the existing
      btrfs-progs call balance with -dusage=x and now the non-ranged filter
      function is invoked, properly using only a single chunk limit.
      Signed-off-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Fixes: bc309467 ("btrfs: extend balance filter usage to take minimum and maximum")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dba72cb3
    • D
      btrfs: fix rcu warning during device replace · 31388ab2
      David Sterba 提交于
      The test btrfs/011 triggers a rcu warning
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.4.0-rc1-default+ #286 Tainted: G        W
      -------------------------------
      fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      4 locks held by btrfs/28786:
      
      0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs]
      1:  (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs]
      2:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs]
      3:  (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs]
      
      stack backtrace:
      CPU: 0 PID: 28786 Comm: btrfs Tainted: G        W       4.4.0-rc1-default+ #286
      Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010
      0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001
      0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78
      ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220
      Call Trace:
      [<ffffffff8141d47b>] dump_stack+0x4f/0x74
      [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140
      [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs]
      [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10
      [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80
      [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs]
      [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs]
      [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs]
      [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs]
      [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs]
      [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0
      [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs]
      [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs]
      [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70
      [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0
      [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0
      [<ffffffff811e3336>] ? __fget_light+0x86/0xb0
      [<ffffffff811e3369>] ? __fdget+0x9/0x20
      [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80
      [<ffffffff811d7483>] SyS_ioctl+0x53/0x80
      [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This is because of unprotected use of rcu_dereference in
      btrfs_scratch_superblocks. We can't add rcu locks around the whole
      function because we read the superblock.
      
      The fix will use the rcu string buffer directly without the rcu locking.
      Thi is safe as the device will not go away in the meantime. We're
      holding the device list mutexes.
      
      Restructuring the code to narrow down the rcu section turned out to be
      impossible, we need to call filp_open (through update_dev_time) on the
      buffer and this could call kmalloc/__might_sleep. We could call kstrdup
      with GFP_ATOMIC but it's not absolutely necessary.
      
      Fixes: 12b1c263 (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks)
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31388ab2
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana 提交于
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible heigth (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However after adding the orphan item, writepages() can by called by the VM
      subsystem against the btree inode when we are under memory pressure, which
      causes writeback to start for the nodes we COWed before, this forces the
      operation to remove the free space item to COW again some (or all of) the
      same nodes (in the tree of tree roots). Even without writepages() being
      called, we could fail with ENOSPC because these items are located in
      multiple trees and one of them might have a higher heigth and require
      node/leaf splits at many levels, exhausting all the reserved space before
      removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      Reported-by: NStefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7fd01182
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  7. 11 11月, 2015 1 次提交
    • Z
      btrfs: Fix lost-data-profile caused by balance bg · 2c9fe835
      Zhao Lei 提交于
      Reproduce:
       (In integration-4.3 branch)
      
       TEST_DEV=(/dev/vdg /dev/vdh)
       TEST_DIR=/mnt/tmp
      
       umount "$TEST_DEV" >/dev/null
       mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"
      
       mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
       btrfs balance start -dusage=0 $TEST_DIR
       btrfs filesystem usage $TEST_DIR
      
       dd if=/dev/zero of="$TEST_DIR"/file count=100
       btrfs filesystem usage $TEST_DIR
      
      Result:
       We can see "no data chunk" in first "btrfs filesystem usage":
       # btrfs filesystem usage $TEST_DIR
       Overall:
          ...
       Metadata,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
          /dev/vdg      122.88MiB
          /dev/vdh      122.88MiB
       System,single: Size:4.00MiB, Used:0.00B
          /dev/vdg        4.00MiB
       System,RAID1: Size:8.00MiB, Used:16.00KiB
          /dev/vdg        8.00MiB
          /dev/vdh        8.00MiB
       Unallocated:
          /dev/vdg        1.06GiB
          /dev/vdh        1.07GiB
      
       And "data chunks changed from raid1 to single" in second
       "btrfs filesystem usage":
       # btrfs filesystem usage $TEST_DIR
       Overall:
          ...
       Data,single: Size:256.00MiB, Used:0.00B
          /dev/vdh      256.00MiB
       Metadata,single: Size:8.00MiB, Used:0.00B
          /dev/vdg        8.00MiB
       Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
          /dev/vdg      122.88MiB
          /dev/vdh      122.88MiB
       System,single: Size:4.00MiB, Used:0.00B
          /dev/vdg        4.00MiB
       System,RAID1: Size:8.00MiB, Used:16.00KiB
          /dev/vdg        8.00MiB
          /dev/vdh        8.00MiB
       Unallocated:
          /dev/vdg        1.06GiB
          /dev/vdh      841.92MiB
      
      Reason:
       btrfs balance delete last data chunk in case of no data in
       the filesystem, then we can see "no data chunk" by "fi usage"
       command.
      
       And when we do write operation to fs, the only available data
       profile is 0x0, result is all new chunks are allocated single type.
      
      Fix:
       Allocate a data chunk explicitly to ensure we don't lose the
       raid profile for data.
      
      Test:
       Test by above script, and confirmed the logic by debug output.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2c9fe835
  8. 07 11月, 2015 1 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  9. 27 10月, 2015 3 次提交
  10. 22 10月, 2015 4 次提交
  11. 11 10月, 2015 1 次提交
  12. 08 10月, 2015 3 次提交
  13. 02 10月, 2015 1 次提交
  14. 01 10月, 2015 3 次提交
  15. 29 9月, 2015 5 次提交
  16. 01 9月, 2015 1 次提交
    • Z
      btrfs: Add raid56 support for updating · 943c6e99
      Zhao Lei 提交于
       num_tolerated_disk_barrier_failures in btrfs_balance
      
      Code for updating fs_info->num_tolerated_disk_barrier_failures in
      btrfs_balance() lacks raid56 support.
      
      Reason:
       Above code was wroten in 2012-08-01, together with
       btrfs_calc_num_tolerated_disk_barrier_failures()'s first version.
      
       Then, btrfs_calc_num_tolerated_disk_barrier_failures() got updated
       later to support raid56, but code in btrfs_balance() was not
       updated together.
      
      Fix:
       Merge above similar code to a common function:
       btrfs_get_num_tolerated_disk_barrier_failures()
       and make it support both case.
      
       It can fix this bug with a bonus of cleanup, and make these code
       never in above no-sync state from now on.
      Suggested-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      943c6e99
  17. 22 8月, 2015 1 次提交
    • C
      btrfs: fix compile when block cgroups are not enabled · 3a9508b0
      Chris Mason 提交于
      bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on.
      This adds an ifdef around them.  It's not perfect, but our
      use of bi_ioc is being removed in the 4.3 merge window.
      
      The bi_css usage really should go into bio_clone, but I want to make
      sure that doesn't introduce problems for other bio_clone use cases.
      Signed-off-by: NChris Mason <clm@fb.com>
      3a9508b0
  18. 20 8月, 2015 1 次提交
    • M
      btrfs: use __GFP_NOFAIL in alloc_btrfs_bio · 277fb5fc
      Michal Hocko 提交于
      alloc_btrfs_bio relies on GFP_NOFS allocation when committing the
      transaction but this allocation context is rather weak wrt. reclaim
      capabilities. The page allocator currently tries hard to not fail these
      allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can
      still fail if the _current_ process is the OOM killer victim. Moreover
      there is an attempt to move away from the default no-fail behavior and
      allow these allocation to fail more eagerly. This would lead to:
      
      [   37.928625] kernel BUG at fs/btrfs/extent_io.c:4045
      
      which is clearly undesirable and the nofail behavior should be explicit
      if the allocation failure cannot be tolerated.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      277fb5fc
  19. 14 8月, 2015 1 次提交
  20. 09 8月, 2015 3 次提交
    • C
      Btrfs: add support for blkio controllers · da2f0f74
      Chris Mason 提交于
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required, we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is able to throttle nicely on single disk filesystems.  A
      little more work is required for multi-device filesystems.
      Signed-off-by: NChris Mason <clm@fb.com>
      da2f0f74
    • Z
      btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk() · dc2ee4e2
      Zhaolei 提交于
      Remove chunk_objectid argument from btrfs_relocate_chunk() because
      it is not necessary, it can also cleanup some code in caller for
      prepare its value.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dc2ee4e2
    • Z
      btrfs: Fix data checksum error cause by replace with io-load. · 55e3a601
      Zhaolei 提交于
      xfstests btrfs/070 sometimes failed.
      In my test machine, its fail rate is about 30%.
      In another vm(vmware), its fail rate is about 50%.
      
      Reason:
        btrfs/070 do replace and defrag with fsstress simultaneously,
        after above operation, checksum error is found by scrub.
      
        Actually, it have no relationship with defrag operation, only
        replace with fsstress can trigger this bug.
      
        New data writen to target device have possibility rewrited by
        old data from source device by replace code in debug, to avoid
        above problem, we can set target block group to readonly in
        replace period, so new data requested by other operation will
        not write to same place with replace code.
      
        Before patch(4.1-rc3):
          30% failed in 100 xfstests.
        After patch:
          0% failed in 300 xfstests.
      
      It also happened in btrfs/071 as it's another scrub with IO load tests.
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      55e3a601
  21. 29 7月, 2015 2 次提交
    • J
      btrfs: iterate over unused chunk space in FITRIM · 499f377f
      Jeff Mahoney 提交于
      Since we now clean up block groups automatically as they become
      empty, iterating over block groups is no longer sufficient to discard
      unused space.
      
      This patch iterates over the unused chunk space and discards any regions
      that are unallocated, regardless of whether they were ever used.  This is
      a change for btrfs but is consistent with other file systems.
      
      We do this in a transactionless manner since the discard process can take
      a substantial amount of time and a transaction would need to be started
      before the acquisition of the device list lock.  That would mean a
      transaction would be held open across /all/ of the discards collectively.
      In order to prevent other threads from allocating or freeing chunks, we
      hold the chunks lock across the search and discard calls.  We release it
      between searches to allow the file system to perform more-or-less
      normally.  Since the running transaction can commit and disappear while
      we're using the transaction pointer, we take a reference to it and
      release it after the search.  This is safe since it would happen normally
      at the end of the transaction commit after any locks are released anyway.
      We also take the commit_root_sem to protect against a transaction starting
      and committing while we're running.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      499f377f
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6