1. 22 10月, 2015 2 次提交
  2. 08 10月, 2015 3 次提交
  3. 29 9月, 2015 3 次提交
  4. 09 8月, 2015 3 次提交
  5. 29 7月, 2015 1 次提交
  6. 01 7月, 2015 1 次提交
    • F
      Btrfs: fix race between balance and unused block group deletion · 67c5e7d4
      Filipe Manana 提交于
      We have a race between deleting an unused block group and balancing the
      same block group that leads to an assertion failure/BUG(), producing the
      following trace:
      
      [181631.208236] BTRFS: assertion failed: 0, file: fs/btrfs/volumes.c, line: 2622
      [181631.220591] ------------[ cut here ]------------
      [181631.222959] kernel BUG at fs/btrfs/ctree.h:4062!
      [181631.223932] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [181631.224566] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse acpi_cpufreq parpor$
      [181631.224566] CPU: 8 PID: 17451 Comm: btrfs Tainted: G        W       4.1.0-rc5-btrfs-next-10+ #1
      [181631.224566] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [181631.224566] task: ffff880127e09590 ti: ffff8800b5824000 task.ti: ffff8800b5824000
      [181631.224566] RIP: 0010:[<ffffffffa03f19f6>]  [<ffffffffa03f19f6>] assfail.constprop.50+0x1e/0x20 [btrfs]
      [181631.224566] RSP: 0018:ffff8800b5827ae8  EFLAGS: 00010246
      [181631.224566] RAX: 0000000000000040 RBX: ffff8800109fc218 RCX: ffffffff81095dce
      [181631.224566] RDX: 0000000000005124 RSI: ffffffff81464819 RDI: 00000000ffffffff
      [181631.224566] RBP: ffff8800b5827ae8 R08: 0000000000000001 R09: 0000000000000000
      [181631.224566] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800109fc200
      [181631.224566] R13: ffff880020095000 R14: ffff8800b1a13f38 R15: ffff880020095000
      [181631.224566] FS:  00007f70ca0b0c80(0000) GS:ffff88013ec00000(0000) knlGS:0000000000000000
      [181631.224566] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [181631.224566] CR2: 00007f2872ab6e68 CR3: 00000000a717c000 CR4: 00000000000006e0
      [181631.224566] Stack:
      [181631.224566]  ffff8800b5827ba8 ffffffffa03f3916 ffff8800b5827b38 ffffffffa03d080e
      [181631.224566]  ffffffffa03d1423 ffff880020095000 ffff88001233c000 0000000000000001
      [181631.224566]  ffff880020095000 ffff8800b1a13f38 0000000a69c00000 0000000000000000
      [181631.224566] Call Trace:
      [181631.224566]  [<ffffffffa03f3916>] btrfs_remove_chunk+0xa4/0x6bb [btrfs]
      [181631.224566]  [<ffffffffa03d080e>] ? join_transaction.isra.8+0xb9/0x3ba [btrfs]
      [181631.224566]  [<ffffffffa03d1423>] ? wait_current_trans.isra.13+0x22/0xfc [btrfs]
      [181631.224566]  [<ffffffffa03f3fbc>] btrfs_relocate_chunk.isra.29+0x8f/0xa7 [btrfs]
      [181631.224566]  [<ffffffffa03f54df>] btrfs_balance+0xaa4/0xc52 [btrfs]
      [181631.224566]  [<ffffffffa03fd388>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
      [181631.224566]  [<ffffffff810872f9>] ? trace_hardirqs_on+0xd/0xf
      [181631.224566]  [<ffffffffa04019a3>] btrfs_ioctl+0xfe2/0x2220 [btrfs]
      [181631.224566]  [<ffffffff812603ed>] ? __this_cpu_preempt_check+0x13/0x15
      [181631.224566]  [<ffffffff81084669>] ? arch_local_irq_save+0x9/0xc
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff81138def>] ? handle_mm_fault+0x834/0xcd2
      [181631.224566]  [<ffffffff8103e48c>] ? __do_page_fault+0x211/0x424
      [181631.224566]  [<ffffffff811755e6>] do_vfs_ioctl+0x3c6/0x479
      (...)
      
      The sequence of steps leading to this are:
      
                 CPU 0                                         CPU 1
      
        btrfs_balance()
          btrfs_relocate_chunk()
      
            btrfs_relocate_block_group(bg X)
              btrfs_lookup_block_group(bg X)
      
                                                     cleaner_kthread
                                                        locks fs_info->cleaner_mutex
      
                                                        btrfs_delete_unused_bgs()
                                                          finds bg X, which became
                                                          unused in the previous
                                                          transaction
      
                                                          checks bg X ->ro == 0,
                                                          so it proceeds
              sets bg X ->ro to 1
              (btrfs_set_block_group_ro(bg X))
      
              blocks on fs_info->cleaner_mutex
                                                          btrfs_remove_chunk(bg X)
                                                        unlocks fs_info->cleaner_mutex
      
              acquires fs_info->cleaner_mutex
              relocate_block_group()
                --> does nothing, no extents found in
                    the extent tree from bg X
              unlocks fs_info->cleaner_mutex
      
            btrfs_relocate_block_group(bg X) returns
      
          btrfs_remove_chunk(bg X)
             extent map not found
                --> ASSERT(0)
      
      Fix this by using a new mutex to make sure these 2 operations, block
      group relocation and removal, are serialized.
      
      This issue is reproducible by running fstests generic/038 (which stresses
      chunk allocation and automatic removal of unused block groups) together
      with the following balance loop:
      
          while true; do btrfs balance start -dusage=0 <mountpoint> ; done
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      67c5e7d4
  7. 11 6月, 2015 1 次提交
  8. 10 6月, 2015 2 次提交
    • Z
      btrfs: Fix lockdep warning of wr_ctx->wr_lock in scrub_free_wr_ctx() · 20b2e302
      Zhao Lei 提交于
      lockdep report following warning in test:
       [25176.843958] =================================
       [25176.844519] [ INFO: inconsistent lock state ]
       [25176.845047] 4.1.0-rc3 #22 Tainted: G        W
       [25176.845591] ---------------------------------
       [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
       [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes:
       [25176.847246]  (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs]
       [25176.847838] {SOFTIRQ-ON-W} state was registered at:
       [25176.848396]   [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10
       [25176.848955]   [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0
       [25176.849491]   [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410
       [25176.850029]   [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs]
       [25176.850575]   [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs]
       [25176.851110]   [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs]
       [25176.851660]   [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs]
       [25176.852189]   [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs]
       [25176.852771]   [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs]
       [25176.853315]   [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570
       [25176.853868]   [<ffffffff8121c851>] SyS_ioctl+0x41/0x80
       [25176.854406]   [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f
       [25176.854935] irq event stamp: 51506
       [25176.855511] hardirqs last  enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0
       [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0
       [25176.856642] softirqs last  enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640
       [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120
       [25176.857746]
       other info that might help us debug this:
       [25176.858845]  Possible unsafe locking scenario:
       [25176.859981]        CPU0
       [25176.860537]        ----
       [25176.861059]   lock(&wr_ctx->wr_lock);
       [25176.861705]   <Interrupt>
       [25176.862272]     lock(&wr_ctx->wr_lock);
       [25176.862881]
        *** DEADLOCK ***
      
      Reason:
       Above warning is caused by:
       Interrupt
       -> bio_endio()
       -> ...
       -> scrub_put_ctx()
       -> scrub_free_ctx() *1
       -> ...
       -> mutex_lock(&wr_ctx->wr_lock);
      
       scrub_put_ctx() is allowed to be called in end_bio interrupt, but
       in code design, it will never call scrub_free_ctx(sctx) in interrupe
       context(above *1), because btrfs_scrub_dev() get one additional
       reference of sctx->refs, which makes scrub_free_ctx() only called
       withine btrfs_scrub_dev().
      
       Now the code runs out of our wish, because free sequence in
       scrub_pending_bio_dec() have a gap.
      
       Current code:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
       -> scrub_free_ctx()                |
       -----------------------------------+-----------------------------------
      
       We expected:
       -----------------------------------+-----------------------------------
       scrub_pending_bio_dec()            |  btrfs_scrub_dev
       -----------------------------------+-----------------------------------
       atomic_dec(&sctx->bios_in_flight); |
       wake_up(&sctx->list_wait);         |
       scrub_put_ctx(sctx);               |
       -> atomic_dec_and_test(&sctx->refs)|
                                          | scrub_put_ctx()
                                          | -> atomic_dec_and_test(&sctx->refs)
                                          | -> scrub_free_ctx()
       -----------------------------------+-----------------------------------
      
      Fix:
       Move scrub_pending_bio_dec() to a workqueue, to avoid this function run
       in interrupt context.
       Tested by check tracelog in debug.
      
      Changelog v1->v2:
       Use workqueue instead of adjust function call sequence in v1,
       because v1 will introduce a bug pointed out by:
       Filipe David Manana <fdmanana@gmail.com>
      Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      20b2e302
    • F
      Btrfs: fix necessary chunk tree space calculation when allocating a chunk · 4617ea3a
      Filipe Manana 提交于
      When allocating a new chunk or removing one we need to update num_devs
      device items and insert or remove a chunk item in the chunk tree, so
      in the worst case the space needed in the chunk space_info is:
      
        btrfs_calc_trunc_metadata_size(chunk_root, num_devs) +
           btrfs_calc_trans_metadata_size(chunk_root, 1)
      
      That is, in the worst case we need to cow num_devs paths and cow 1 other
      path that can result in splitting every node and leaf, and each path
      consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring
      some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes,
      which were unnecessary since updating the existing device items does
      not result in splitting the nodes and leaf since after updating them
      they remain with the same size.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4617ea3a
  9. 03 6月, 2015 5 次提交
    • F
      Btrfs: fix -ENOSPC on block group removal · 39c2d7fa
      Filipe Manana 提交于
      Unlike when attempting to allocate a new block group, where we check
      that we have enough space in the system space_info to update the device
      items and insert a new chunk item in the chunk tree, we were not checking
      if the system space_info had enough space for updating the device items
      and deleting the chunk item in the chunk tree. This often lead to -ENOSPC
      error when attempting to allocate blocks for the chunk tree (during btree
      node/leaf COW operations) while updating the device items or deleting the
      chunk item, which resulted in the current transaction being aborted and
      turning the filesystem into read-only mode.
      
      While running fstests generic/038, which stresses allocation of block
      groups and removal of unused block groups, with a large scratch device
      (750Gb) this happened often, despite more than enough unallocated space,
      and resulted in the following trace:
      
      [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
      [68663.600407] BTRFS: Transaction aborted (error -28)
      (...)
      [68663.730829] Call Trace:
      [68663.732585]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [68663.734334]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [68663.739980]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [68663.757153]  [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [68663.760925]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [68663.762854]  [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs]
      [68663.764073]  [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
      [68663.765130]  [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs]
      [68663.765998]  [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs]
      [68663.767068]  [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs]
      [68663.768227]  [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c
      [68663.769081]  [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs]
      [68663.799485]  [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs]
      [68663.809208]  [<ffffffff8105f367>] kthread+0xef/0xf7
      [68663.828795]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
      [68663.844942]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
      [68663.846486]  [<ffffffff81435a88>] ret_from_fork+0x58/0x90
      [68663.847760]  [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad
      [68663.849503] ---[ end trace 798477c6d6dbaad6 ]---
      [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left
      
      So fix this by verifying that enough space exists in system space_info,
      and reserving the space in the chunk block reserve, before attempting to
      delete the block group and allocate a new system chunk if we don't have
      enough space to perform the necessary updates and delete in the chunk
      tree. Like for the block group creation case, we don't error our if we
      fail to allocate a new system chunk, since we might end up not needing
      it (no node/leaf splits happen during the COW operations and/or we end
      up not needing to COW any btree nodes or leafs because they were already
      COWed in the current transaction and their writeback didn't start yet).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      39c2d7fa
    • F
      Btrfs: fix -ENOSPC when finishing block group creation · 4fbcdf66
      Filipe Manana 提交于
      While creating a block group, we often end up getting ENOSPC while updating
      the chunk tree, which leads to a transaction abortion that produces a trace
      like the following:
      
      [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
      [30670.117777] BTRFS: Transaction aborted (error -28)
      (...)
      [30670.163567] Call Trace:
      [30670.163906]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
      [30670.164522]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
      [30670.165171]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
      [30670.166323]  [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.167213]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
      [30670.167862]  [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
      [30670.169116]  [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
      [30670.170593]  [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
      [30670.171960]  [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
      [30670.174649]  [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
      [30670.176092]  [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
      [30670.177218]  [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
      [30670.178622]  [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
      [30670.179642]  [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
      [30670.180692]  [<ffffffff81152863>] SyS_fallocate+0x47/0x62
      [30670.186737]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
      [30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
      
      This is because we don't do proper space reservation for the chunk block
      reserve when we have multiple tasks allocating chunks in parallel.
      
      So block group creation has 2 phases, and the first phase essentially
      checks if there is enough space in the system space_info, allocating a
      new system chunk if there isn't, while the second phase updates the
      device, extent and chunk trees. However, because the updates to the
      chunk tree happen in the second phase, if we have N tasks, each with
      its own transaction handle, allocating new chunks in parallel and if
      there is only enough space in the system space_info to allocate M chunks,
      where M < N, none of the tasks ends up allocating a new system chunk in
      the first phase and N - M tasks will get -ENOSPC when attempting to
      update the chunk tree in phase 2 if they need to COW any nodes/leafs
      from the chunk tree.
      
      Fix this by doing proper reservation in the chunk block reserve.
      
      The issue could be reproduced by running fstests generic/038 in a loop,
      which eventually triggered the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4fbcdf66
    • Q
      btrfs: Fix superblock csum type check. · 1f6e4b3f
      Qu Wenruo 提交于
      Old csum type check is wrong and can't catch csum_type 1(not supported).
      
      Fix it to avoid hostile 0 division.
      Reported-by: NLukas Lueg <lukas.lueg@gmail.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      1f6e4b3f
    • D
      btrfs: add 'cold' compiler annotations to all error handling functions · c0d19e2b
      David Sterba 提交于
      The annotated functios will be placed into .text.unlikely section. The
      annotation also hints compiler to move the code out of the hot paths,
      and may implicitly mark if-statement leading to that block as unlikely.
      
      This is a heuristic, the impact on the generated code is not
      significant.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      c0d19e2b
    • D
      btrfs: report exact callsite where transaction abort occurs · 1a9a8a71
      David Sterba 提交于
      WARN is called from a single location and all bugreports say that's in
      super.c __btrfs_abort_transaction. This is slightly confusing as we'd
      rather want to know the exact callsite. Whereas this information is
      printed in the syslog below the stacktrace, this requires further look
      and we usually see only the headline from WARNING.
      
      Moving the WARN into the macro has to inline some code and increases
      code by a few kilobytes:
      
        text    data     bss     dec     hex filename
      835481   20305   14120  869906   d4612 btrfs.ko.before
      842883   20305   14120  877308   d62fc btrfs.ko.after
      
      The delta is +7k (130+ calls), measured on 3.19 x86_64, distro config.
      The increase is not small and could lead to worse icache use. The code
      is on error/exit paths that can be recognized by compiler as cold and
      moved out of the way so the impact is speculated to be low, if
      measurable at all.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      1a9a8a71
  10. 27 5月, 2015 1 次提交
    • A
      Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices · 2e7910d6
      Anand Jain 提交于
      This patch will provide a framework and help to create attributes
      from the structure btrfs_fs_devices which are available even before
      fs_info is created. So by moving the parent kobject super_kobj from
      fs_info to btrfs_fs_devices, it will help to create attributes
      from the btrfs_fs_devices as well.
      
      Patches on top of this patch now will be able to create the
      sys/fs/btrfs/fsid kobject and attributes from btrfs_fs_devices
      when devices are scanned and registered to the kernel.
      
      Just to note, this does not change any of the existing btrfs sysfs
      external kobject names and its attributes and not even the life
      cycle of them. Changes are internal only. And to ensure the same,
      this path has been tested with various device operations and,
      checking and comparing the sysfs kobjects and attributes with
      sysfs kobject and attributes with out this patch, and they remain
      same.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      2e7910d6
  11. 13 4月, 2015 4 次提交
    • Q
      btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created · e09fe2d2
      Qu Wenruo 提交于
      Btrfs will create qgroup on subvolume creation if quota is enabled, but
      qgroup uses the high bits(currently 16 bits) as level, to build the
      inheritance.
      
      However it is fully possible a subvolume can be created with a
      subvolumeid larger than 1 << BTRFS_QGROUP_LEVEL_SHIFT, so it will be
      considered as level 1 and can't be assigned to other qgroup in level 1.
      
      This patch will prevent such things so qgroup inheritance will not be
      screwed up.
      The downside is very clear, btrfs subvolume number limit will decrease
      from (u64 max - 256(fisrt free objectid) - 256(last free objectid)) to
      (u48 max -256(first free objectid)).
      But we still have near u48(that's 15 digits in dec), so that should not
      be a huge problem.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e09fe2d2
    • Q
      btrfs: Check qgroup level in kernel qgroup assign. · 8465ecec
      Qu Wenruo 提交于
      Although we have qgroup level check in btrfs-progs, it's not enough
      since other programe may still call ioctl directly not using
      btrfs-progs. For example, systemd.
      
      But it's btrfs-progs to be blame since we don't provide a
      full-function(like subvolume create things) btrfs library with enough
      check, and only rely on kernel ioctl.
      
      So Add level checks in kernel too.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8465ecec
    • D
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Dongsheng Yang 提交于
      There are two problems in qgroup:
      
      a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
      qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
      is not available to user.
      
      b). When user is writing a inline data, qgroup will not reserve it,
      it means this is a window we can exceed the limit of a qgroup.
      
      The main idea of this patch is reserving the data size of write_bytes
      rather than the reserve_bytes. It means qgroup will not care about
      the data size btrfs will reserve for user, but only care about the
      data size user is going to write. Then reserve it when user want to
      write and release it in transaction committed.
      
      In this way, qgroup can be released from the complex procedure in
      btrfs and only do the reserve when user want to write and account
      when the data is written in commit_transaction().
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e2d1f923
    • Z
      btrfs: Fix NO_SPACE bug caused by delayed-iput · d7c15171
      Zhao Lei 提交于
      Steps to reproduce:
        while true; do
          dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
          rm /btrfs_dir/file
          sync
        done
      
        And we'll see dd failed because btrfs return NO_SPACE.
      
      Reason:
        Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
        in end to free fs space for next write, but sometimes it hadn't
        done work on time, because btrfs-cleaner thread get delayed-iputs
        from list before, but do iput() after next write.
      
        This is log:
        [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
      
        [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
        [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
        [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
      
        [ 2569.191081] comm=dd begin
        [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
      
        [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
        [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
        ...
        [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
        [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
      
      Fix:
        Make btrfs_commit_transaction() wait current running btrfs-cleaner's
        delayed-iputs() done in end.
      
      Test:
        Use script similar to above(more complex),
        before patch:
          7 failed in 100 * 20 loop.
        after patch:
          0 failed in 100 * 20 loop.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7c15171
  12. 11 4月, 2015 5 次提交
    • C
      Btrfs: fix use after free when close_ctree frees the orphan_rsv · cdfb080e
      Chris Mason 提交于
      Near the end of close_ctree, we're calling btrfs_free_block_rsv
      to free up the orphan rsv.  The problem is this call updates the
      space_info, which has already been freed.
      
      This adds a new __ function that directly calls kfree instead of trying
      to update the space infos.
      Signed-off-by: NChris Mason <clm@fb.com>
      cdfb080e
    • C
      Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e
      Chris Mason 提交于
      We loop through all of the dirty block groups during commit and write
      the free space cache.  In order to make sure the cache is currect, we do
      this while no other writers are allowed in the commit.
      
      If a large number of block groups are dirty, this can introduce long
      stalls during the final stages of the commit, which can block new procs
      trying to change the filesystem.
      
      This commit changes the block group cache writeout to take appropriate
      locks and allow it to run earlier in the commit.  We'll still have to
      redo some of the block groups, but it means we can get most of the work
      out of the way without blocking the entire FS.
      Signed-off-by: NChris Mason <clm@fb.com>
      1bbc621e
    • C
      Btrfs: two stage dirty block group writeout · c9dc4c65
      Chris Mason 提交于
      Block group cache writeout is currently waiting on the pages for each
      block group cache before moving on to writing the next one.  This commit
      switches things around to send down all the caches and then wait on them
      in batches.
      
      The end result is much faster, since we're keeping the disk pipeline
      full.
      Signed-off-by: NChris Mason <clm@fb.com>
      c9dc4c65
    • C
      btrfs: move struct io_ctl into ctree.h and rename it · 4c6d1d85
      Chris Mason 提交于
      We'll need to put the io_ctl into the block_group cache struct, so
      name it struct btrfs_io_ctl and move it into ctree.h
      Signed-off-by: NChris Mason <clm@fb.com>
      4c6d1d85
    • C
      Btrfs: refill block reserves during truncate · 28f75a0e
      Chris Mason 提交于
      When truncate starts, it allocates some space in the block reserves so
      that we'll have enough to update metadata along the way.
      
      For very large files, we can easily go through all of that space as we
      loop through the extents.  This changes truncate to refill the space
      reservation as it progresses through the file.
      Signed-off-by: NChris Mason <clm@fb.com>
      28f75a0e
  13. 18 3月, 2015 1 次提交
    • J
      Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5
      Josef Bacik 提交于
      I introduced a regression wrt outstanding_extents accounting.  These are tricky
      areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
      at any time.  So add sanity tests to cover the various conditions that are
      tricky in order to make sure we don't introduce regressions in the future.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6a3891c5
  14. 17 3月, 2015 1 次提交
    • J
      Btrfs: prepare block group cache before writing · dcdf7f6d
      Josef Bacik 提交于
      Writing the block group cache will modify the extent tree quite a bit because it
      truncates the old space cache and pre-allocates new stuff.  To try and cut down
      on the churn lets do the setup dance first, then later on hopefully we can avoid
      looping with newly dirtied roots.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      dcdf7f6d
  15. 04 3月, 2015 1 次提交
  16. 17 2月, 2015 1 次提交
    • D
      Btrfs: ctree: reduce args where only fs_info used · b7a0365e
      Daniel Dressler 提交于
      This patch is part of a larger project to cleanup btrfs's internal usage
      of struct btrfs_root. Many functions take btrfs_root only to grab a
      pointer to fs_info.
      
      This causes programmers to ponder which root can be passed. Since only
      the fs_info is read affected functions can accept any root, except this
      is only obvious upon inspection.
      
      This patch reduces the specificty of such functions to accept the
      fs_info directly.
      
      This patch does not address the two functions in ctree.c (insert_ptr,
      and split_item) which only use root for BUG_ONs in ctree.c
      
      This patch affects the following functions:
        1) fixup_low_keys
        2) btrfs_set_item_key_safe
      Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      b7a0365e
  17. 15 2月, 2015 1 次提交
    • J
      Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik 提交于
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents so our enospc accounting is
      quite a bit lower than what we need.  To keep track of this make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dcab6a3b
  18. 03 2月, 2015 2 次提交
    • F
      Btrfs: fix race between transaction commit and empty block group removal · d4b450cd
      Filipe Manana 提交于
      Committing a transaction can race with automatic removal of empty block
      groups (cleaner kthread), leading to a BUG_ON() in the transaction
      commit code while running btrfs_finish_extent_commit(). The following
      sequence diagram shows how it can happen:
      
                 CPU 1                                       CPU 2
      
      btrfs_commit_transaction()
        fs_info->running_transaction = NULL
        btrfs_finish_extent_commit()
          find_first_extent_bit()
            -> found range for block group X
               in fs_info->freed_extents[]
      
                                                     btrfs_delete_unused_bgs()
                                                       -> found block group X
      
                                                       Removed block group X's range
                                                       from fs_info->freed_extents[]
      
                                                       btrfs_remove_chunk()
                                                          btrfs_remove_block_group(bg X)
      
          unpin_extent_range(bg X range)
             btrfs_lookup_block_group(bg X)
                -> returns NULL
                  -> BUG_ON()
      
      The trace that results from the BUG_ON() is:
      
      [48665.187808] ------------[ cut here ]------------
      [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
      [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
      [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
      [48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
      [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
      [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
      [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
      [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
      [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
      [48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
      [48665.197388] Stack:
      [48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
      [48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
      [48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
      [48665.197388] Call Trace:
      [48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
      [48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
      [48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
      [48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
      [48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
      [48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
      [48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
      [48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
      [48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
      [48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
      [48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
      [48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
      [48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388]  RSP <ffff8801b56a7b88>
      [48665.272246] ---[ end trace b9c6ab9957521376 ]---
      
      Fix this by ensuring that unpining the block group's range in
      btrfs_finish_extent_commit() is done in a synchronized fashion
      with removing the block group's range from freed_extents[]
      in btrfs_delete_unused_bgs()
      
      This race got introduced with the change:
      
          Btrfs: remove empty block groups automatically
          commit 47ab2a6cSigned-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d4b450cd
    • D
      btrfs: kill btrfs_inode_*time helpers · a937b979
      David Sterba 提交于
      They just opencode taking address of the timespec member.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      a937b979
  19. 22 1月, 2015 2 次提交