1. 08 Jun, 2016 1 commit
  2. 31 May, 2016 1 commit
    • F
      Btrfs: fix race between device replace and discard · 2999241d
      Committed by Filipe Manana
      While we are finishing a device replace operation, we can make a discard
      operation (fs mounted with -o discard) do an invalid memory access like
      the one reported by the following trace:
      
      [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
      [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
      processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
      virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
      [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
      [ 3206.388595] RIP: 0010:[<ffffffff8124d233>]  [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
      [ 3206.388595] RSP: 0018:ffff880171b9bb80  EFLAGS: 00010246
      [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
      [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
      [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
      [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
      [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
      [ 3206.388595] FS:  00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
      [ 3206.388595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
      [ 3206.388595] Stack:
      [ 3206.388595]  0000000000004878 0000000000000000 0000000002400040 0000000000000000
      [ 3206.388595]  0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
      [ 3206.388595]  ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
      [ 3206.388595] Call Trace:
      [ 3206.388595]  [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
      [ 3206.388595]  [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
      [ 3206.388595]  [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
      [ 3206.388595]  [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
      [ 3206.388595]  [<ffffffff811a8417>] do_fsync+0x31/0x4a
      [ 3206.388595]  [<ffffffff811a8637>] SyS_fsync+0x10/0x14
      [ 3206.388595]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 3206.388595]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
      [ 3206.388595]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
      
      This happens because when we call btrfs_map_block() from
      btrfs_discard_extent() to get a btrfs_bio structure, the device replace
      operation has not finished yet, but before we use the device of one of the
      stripes from the returned btrfs_bio structure, the device object is freed.
      
      This is illustrated by the following diagram.
      
                  CPU 1                                                  CPU 2
      
       btrfs_dev_replace_start()
      
       (...)
      
       btrfs_dev_replace_finishing()
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
         (...)
      
                                                                  btrfs_sync_file()
                                                                    btrfs_start_transaction()
      
                                                                    (...)
      
                                                                    btrfs_commit_transaction()
                                                                      btrfs_finish_extent_commit()
                                                                        btrfs_discard_extent()
                                                                          btrfs_map_block()
                                                                            --> returns a struct btrfs_bio
                                                                                with a stripe that has a
                                                                                device field pointing to
                                                                                source device of the replace
                                                                                operation (the device that
                                                                                is being replaced)
      
         mutex_lock(&uuid_mutex)
         mutex_lock(&fs_info->fs_devices->device_list_mutex)
         mutex_lock(&fs_info->chunk_mutex)
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates the mapping tree and for each
               extent map that has a stripe pointing to
               the source device, it updates the stripe
               to point to the target device instead
      
         btrfs_rm_dev_replace_blocked()
           --> waits for fs_info->bio_counter to go down to 0
      
         btrfs_rm_dev_replace_remove_srcdev()
           --> removes source device from the list of devices
      
         mutex_unlock(&fs_info->chunk_mutex)
         mutex_unlock(&fs_info->fs_devices->device_list_mutex)
         mutex_unlock(&uuid_mutex)
      
         btrfs_rm_dev_replace_free_srcdev()
           --> frees the source device
      
                                                                          --> iterates over all stripes
                                                                              of the returned struct
                                                                              btrfs_bio
                                                                          --> for each stripe it
                                                                              dereferences its device
                                                                              pointer
                                                                              --> it ends up finding a
                                                                                  pointer to the device
                                                                                  used as the source
                                                                                  device for the replace
                                                                                  operation and that was
                                                                                  already freed
      
      So fix this by surrounding the call to btrfs_map_block(), and the code
      that uses the returned struct btrfs_bio, with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
      the finishing phase of the device replace operation blocks until the
      bio counter decreases to zero before it frees the source device.
      This is the same approach we use in btrfs_map_bio(), for example.
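
      The blocking scheme described above can be modelled in user space. What
      follows is a minimal Python sketch, not the kernel code: BioCounter and
      its methods are hypothetical stand-ins for fs_info->bio_counter,
      btrfs_bio_counter_inc_blocked(), btrfs_bio_counter_dec() and
      btrfs_rm_dev_replace_blocked(), and the device is just a dictionary with
      a "freed" flag.

```python
import threading


class BioCounter:
    """Toy model of fs_info->bio_counter plus a 'replace is finishing' flag."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0
        self._blocked = False

    def inc_blocked(self):
        # Wait while the replace is finishing, then register as an in-flight user.
        with self._cond:
            while self._blocked:
                self._cond.wait()
            self._count += 1

    def dec(self):
        # Drop our reference and wake anyone waiting for the counter to drain.
        with self._cond:
            self._count -= 1
            self._cond.notify_all()

    def block_and_drain(self):
        # Stop new users from entering and wait until the counter reaches zero.
        with self._cond:
            self._blocked = True
            while self._count > 0:
                self._cond.wait()

    def unblock(self):
        with self._cond:
            self._blocked = False
            self._cond.notify_all()


def run_demo():
    counter = BioCounter()
    device = {"freed": False}      # stands in for the replace source device
    mapped = threading.Event()
    seen_freed = []

    def discard():
        counter.inc_blocked()
        try:
            stripe_dev = device    # result of the "map block" step
            mapped.set()
            seen_freed.append(stripe_dev["freed"])  # "issue the discard"
        finally:
            counter.dec()

    def finish_replace():
        mapped.wait()              # force the interleaving from the diagram
        counter.block_and_drain()  # cannot return while discard() is in flight
        device["freed"] = True     # only now is the source device freed
        counter.unblock()

    threads = [threading.Thread(target=discard),
               threading.Thread(target=finish_replace)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_freed, device["freed"]
```

      In this model run_demo() always observes the device as still alive at the
      moment the discard path uses it, because the free can only happen after
      the counter has dropped back to zero.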
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      2999241d
  3. 26 May, 2016 1 commit
  4. 13 May, 2016 3 commits
    • F
      Btrfs: fix race between block group relocation and nocow writes · f78c436c
      Committed by Filipe Manana
      Relocation of a block group waits for all existing tasks flushing
      delalloc, starting direct IO writes and any ordered extents before
      starting the relocation process. However for direct IO writes that end
      up doing nocow (inode either has the flag nodatacow set or the write is
      against a prealloc extent) we have a short time window that allows for a
      race that makes relocation proceed without waiting for the direct IO
      write to complete first, resulting in data loss after the relocation
      finishes. This is illustrated by the following diagram:
      
                 CPU 1                                     CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                     direct IO write starts against
                                                     an extent in block group X
                                                     using nocow mode (inode has the
                                                     nodatacow flag or the write is
                                                     for a prealloc extent)
      
                                                     btrfs_direct_IO()
                                                       btrfs_get_blocks_direct()
                                                         --> can_nocow_extent() returns 1
      
         btrfs_inc_block_group_ro(bg X)
           --> turns block group into RO mode
      
         btrfs_wait_ordered_roots()
           --> returns and does not know about
               the DIO write happening at CPU 2
               (the task there has not created
                yet an ordered extent)
      
         relocate_block_group(bg X)
           --> rc->stage == MOVE_DATA_EXTENTS
      
           find_next_extent()
             --> returns extent that the DIO
                 write is going to write to
      
           relocate_data_extent()
      
             relocate_file_extent_cluster()
      
               --> reads the extent from disk into
                   pages belonging to the relocation
                   inode and dirties them
      
                                                         --> creates DIO ordered extent
      
                                                       btrfs_submit_direct()
                                                         --> submits bio against a location
                                                             on disk obtained from an extent
                                                             map before the relocation started
      
         btrfs_wait_ordered_range()
           --> writes all the pages read before
               to disk (belonging to the
               relocation inode)
      
         relocation finishes
      
                                                       bio completes and wrote new data
                                                       to the old location of the block
                                                       group
      
      So fix this by tracking the number of nocow writers for a block group and
      make sure relocation waits for that number to go down to 0 before starting
      to move the extents.
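
      The writer-tracking idea can be sketched as a small single-threaded
      Python model. The class and method names are hypothetical, and the
      refusal of new nocow writes once the block group is read-only is an
      assumption of this sketch rather than something stated in this message.

```python
class BlockGroup:
    """Toy model of a block group that counts in-flight nocow writers."""

    def __init__(self):
        self.read_only = False
        self.nocow_writers = 0

    def try_start_nocow_write(self):
        # Assumption of this sketch: once the group is read-only, a new
        # nocow write is refused and the caller must fall back to COW.
        if self.read_only:
            return False
        self.nocow_writers += 1
        return True

    def end_nocow_write(self):
        self.nocow_writers -= 1

    def can_start_moving_extents(self):
        # Relocation may only proceed once every nocow writer has finished.
        return self.read_only and self.nocow_writers == 0
```

      A writer that started before the block group went read-only keeps
      can_start_moving_extents() false until it calls end_nocow_write(),
      which is the waiting behaviour the fix relies on.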
      
      The same race can also happen with buffered writes in nocow mode since the
      patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
      when relocating", because we are no longer flushing all delalloc which
      served as a synchronization mechanism (due to page locking) and ensured
      the ordered extents for nocow buffered writes were created before we
      called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
      mode existed before that patch (no pages are locked or used during direct
      IO) and that patch fixed only races with direct IO writes that do cow.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      f78c436c
    • F
      Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Committed by Filipe Manana
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      ranges).
      
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                 fs_info->ordered_roots
      
                                                 btrfs_direct_IO()
                                                   __blockdev_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       btrfs_lock_extent_direct()
                                                         locks range in the io tree
                                                       btrfs_new_extent_direct()
                                                         btrfs_reserve_extent()
                                                           --> extent allocated
                                                               from bg X
      
         btrfs_inc_block_group_ro(bg X)
      
         btrfs_start_delalloc_roots()
           __start_delalloc_inodes()
             --> does nothing, no delalloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
                 fs_info->delalloc_roots
      
         btrfs_wait_ordered_roots()
           --> does not find the inode's root in the
               list fs_info->ordered_roots
      
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
      
         relocate_block_group(rc->stage ==
           MOVE_DATA_EXTENTS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
      
           iterates the extent tree, using its
           commit root and moves extents into new
           locations
      
                                                         btrfs_add_ordered_extent_dio()
                                                           --> now an ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
      
           btrfs_commit_transaction()
      
                                                         btrfs_finish_ordered_io()
                                                           btrfs_alloc_reserved_file_extent()
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
      
         relocate_block_group(rc->stage ==
           UPDATE_DATA_PTRS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
      
           sees the extent in the extent tree
      
      When this happens the relocation produces the following warning when it
      finishes:
      
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      9cfa3e34
    • F
      Btrfs: don't wait for unrelated IO to finish before relocation · 578def7c
      Committed by Filipe Manana
      Before the relocation process of a block group starts, it sets the block
      group to readonly mode, then flushes all delalloc writes and then finally
      it waits for all ordered extents to complete. This last step includes
      waiting for ordered extents destined for extents allocated in other block
      groups, making us waste time unnecessarily.
      
      So improve this by waiting only for ordered extents that fall into the
      block group's range.
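
      As an illustration only, the range filter amounts to an interval overlap
      test. In this Python sketch the OrderedExtent tuple and its field names
      are hypothetical; only the overlap logic reflects the description above.

```python
from collections import namedtuple

# Hypothetical, simplified view of an ordered extent: where it lands on disk.
OrderedExtent = namedtuple("OrderedExtent", ["disk_start", "disk_len"])


def ordered_extents_to_wait_for(ordered_extents, bg_start, bg_len):
    """Keep only the ordered extents overlapping [bg_start, bg_start + bg_len)."""
    bg_end = bg_start + bg_len
    return [
        oe for oe in ordered_extents
        if oe.disk_start < bg_end and oe.disk_start + oe.disk_len > bg_start
    ]
```

      Ordered extents that end at or before the start of the block group, or
      begin at or after its end, are skipped, so relocation no longer waits
      for IO targeting other block groups.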
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      578def7c
  5. 09 May, 2016 1 commit
  6. 29 Apr, 2016 4 commits
  7. 28 Apr, 2016 2 commits
  8. 05 Apr, 2016 1 commit
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header
      files.  I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation
      will also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
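
      To make the effect of these rules concrete, here is a rough Python
      approximation that applies the same renames to a single line of C text.
      It is a regex-based sketch, not coccinelle: it does not parse C, it
      ignores operator precedence around the dropped shift, and the convert()
      helper is a hypothetical name.

```python
import re

# Identifier renames taken from the semantic patch above.
RENAMES = {
    "PAGE_CACHE_SHIFT": "PAGE_SHIFT",
    "PAGE_CACHE_SIZE": "PAGE_SIZE",
    "PAGE_CACHE_MASK": "PAGE_MASK",
    "PAGE_CACHE_ALIGN": "PAGE_ALIGN",
    "page_cache_get": "get_page",
    "page_cache_release": "put_page",
}

# "E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)" and the ">>" variant shift by zero,
# so the whole shift suffix can be dropped.
_SHIFT_NOOP = re.compile(
    r"\s*(<<|>>)\s*\(\s*PAGE_CACHE_SHIFT\s*-\s*PAGE_SHIFT\s*\)"
)
_IDENT = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")


def convert(line):
    """Apply the no-op shift removal first, then the plain renames."""
    line = _SHIFT_NOOP.sub("", line)
    return _IDENT.sub(lambda m: RENAMES[m.group(1)], line)
```

      For example, convert("page_cache_release(page);") yields
      "put_page(page);", and a shift by (PAGE_CACHE_SHIFT - PAGE_SHIFT) simply
      disappears from the expression.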
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  9. 04 Apr, 2016 1 commit
  10. 14 Mar, 2016 1 commit
  11. 18 Feb, 2016 3 commits
    • S
      btrfs: fix build warning · 89771cc9
      Committed by Sudip Mukherjee
      We were getting this build warning:
      fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used
      	uninitialized in this function
      
      It is not a valid warning, as used_bg is never used uninitialized:
      locked is initially false, so we can never be in the section where
      'used_bg' is used. But gcc is not able to understand that, so
      initialize it at declaration to silence the warning.
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      89771cc9
    • J
      Btrfs: check reserved when deciding to background flush · baee8790
      Committed by Josef Bacik
      We will sometimes start background flushing the various enospc related things
      (delayed nodes, delalloc, etc) if we are getting close to reserving all of our
      available space.  We don't want to do this however when we are actually using
      this space as it causes unneeded thrashing.  We currently try to do this by
      checking bytes_used >= thresh, but bytes_used is only part of the equation; we
      need to use bytes_reserved as well, as this represents space that is very likely
      to become bytes_used in the future.
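
      The difference between the two checks can be sketched in a few lines of
      Python. The function names are hypothetical and thresh is treated as a
      given input, since how it is computed is not described here.

```python
def skip_flush_old(bytes_used, thresh):
    """Old check: only space already in use counts against the threshold."""
    return bytes_used >= thresh


def skip_flush_new(bytes_used, bytes_reserved, thresh):
    """New check: reserved space is counted too, since it will likely be used."""
    return bytes_used + bytes_reserved >= thresh
```

      With bytes_used = 60, bytes_reserved = 30 and thresh = 80 (arbitrary
      units), the old check still allows a background flush while the new one
      skips it.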
      
      My tracing tool will keep count of the number of times we kick off the async
      flusher, the following are counts for the entire run of generic/027
      
      		No Patch	Patch
      avg: 		5385		5009
      median:		5500		4916
      
      We skewed lower than the average with my patch and higher than the average
      without the patch. Overall it cuts the flushing by anywhere from 5-10%, which
      in the case of actual ENOSPC is quite helpful.  Thanks,
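
      The relative reduction can be recomputed from the table above with a
      short Python helper (reduction_pct is a hypothetical name used only for
      this check):

```python
def reduction_pct(before, after):
    """Relative drop from 'before' to 'after', in percent, to one decimal."""
    return round((before - after) / before * 100, 1)


avg_cut = reduction_pct(5385, 5009)     # average number of flusher kick-offs
median_cut = reduction_pct(5500, 4916)  # median number of flusher kick-offs
```

      This gives about 7.0% for the average and 10.6% for the median, in line
      with the rough range quoted above.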
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      baee8790
    • J
      Btrfs: change how we update the global block rsv · fb4b10e5
      Committed by Josef Bacik
      I'm writing a tool to visualize the enospc system in order to help debug enospc
      bugs and I found weird data and ran it down to when we update the global block
      rsv.  We add all of the remaining free space to the block rsv, do a trace event,
      then remove the extra and do another trace event.  This makes my visualization
      look silly and is unintuitive code as well.  Fix this stuff to only add the
      amount we are missing, or free the amount we are missing.  This is less clean to
      read but more explicit in what it is doing, as well as only emitting events for
      values that make sense.  Thanks,
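
      A minimal Python sketch of the "only move the difference" approach
      follows. The function and parameter names are hypothetical, and
      free_space stands for whatever unreserved space is available to the
      reserve.

```python
def update_global_rsv(reserved, size, free_space):
    """Return (new_reserved, delta) after a single adjustment.

    delta > 0: bytes added to the reserve (capped by the free space).
    delta < 0: excess bytes released from the reserve.
    """
    if reserved < size:
        delta = min(size - reserved, free_space)
    else:
        delta = -(reserved - size)
    return reserved + delta, delta
```

      Each call produces exactly one adjustment, so a trace event emitted for
      delta reports the amount actually moved rather than an add-everything
      step followed by a remove-the-extra step.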
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb4b10e5
  12. 20 Jan, 2016 3 commits
    • Z
      btrfs: Fix no_space in write and rm loop · e1746e83
      Zhao Lei committed
      I see no_space in v4.4-rc1 again in xfstests generic/102.
      It happened randomly on some nodes only
      (one of 4 physical nodes, and a kvm guest with a non-virtio block driver).
      
      By bisecting, we found that the first bad commit is:
       commit bdced438 ("block: setup bi_phys_segments after splitting")
      But that patch only triggered the bug by making bio operations
      faster (or slower).
      
      The main reason is in our space allocation code: we need to commit
      page writeback before waiting for it to complete. This patch fixes
      the bug.
      
      BTW, there is another reason for generic/102 to fail, caused by
      disabling the default mixed-blockgroup; I'll fix it in xfstests.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Z
      btrfs: merge functions for wait snapshot creation · 0bc19f90
      Zhao Lei committed
      wait_for_snapshot_creation() is in the same group as the other two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()
      
      Rename wait_for_snapshot_creation() and move it to the same place
      as the other two.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Filipe Manana committed
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because if a task acquires a semaphore
      multiple times it should invoke down_read_nested() with a different
      lockdep class for each level of recursion.
      
      Fix this by simplifying the implementation and using a mutex instead,
      which is acquired by the cleaner kthread before it runs the delayed
      iputs, instead of always acquiring a semaphore before delayed iputs
      are run from anywhere.
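
      Why a recursive read acquisition can hang is sketched below with a toy model.
      It assumes a fair reader/writer semaphore in which a queued writer blocks new
      readers; that assumption and all names are mine, not taken from the patch, and
      the model is not the kernel's rwsem implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a fair reader/writer semaphore (not the kernel's rwsem). */
struct rwsem_model {
	int readers;         /* tasks currently holding the lock for read */
	bool writer_waiting; /* a writer is queued behind the readers */
};

/* A new reader may enter only if no writer is queued ahead of it. */
static bool model_try_down_read(struct rwsem_model *sem)
{
	assert(sem);
	if (sem->writer_waiting)
		return false;
	sem->readers++;
	return true;
}

/* A writer needs exclusive access; otherwise it queues and waits. */
static bool model_try_down_write(struct rwsem_model *sem)
{
	assert(sem);
	if (sem->readers > 0) {
		sem->writer_waiting = true;
		return false;
	}
	return true;
}
```

      In this model, task A takes the lock for read, task B queues for write, and
      task A's nested read attempt then fails: A waits behind B while B waits for A
      to release its first read hold.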
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 16 Jan, 2016 2 commits
  14. 07 Jan, 2016 3 commits
    • D
      btrfs: cleanup, use enum values for btrfs_path reada · e4058b54
      David Sterba committed
      Replace the integers by enums for better readability. The value 2 does
      not have any meaning since a7175319
      "Btrfs: do less aggressive btree readahead" (2009-01-22).
      Signed-off-by: David Sterba <dsterba@suse.com>
    • B
      Btrfs: use linux/sizes.h to represent constants · ee22184b
      Byongho Lee committed
      We use many constants to represent size and offset values.  To make the
      code readable we use '256 * 1024 * 1024' instead of '268435456' to
      represent '256MB'.  However, we can make it far more readable with 'SZ_256M',
      which is defined in 'linux/sizes.h'.
      
      So this patch replaces 'xxx * 1024 * 1024' kinds of expressions with a
      single 'SZ_xxxM' if 'xxx' is a power of 2, and with 'xxx * SZ_1M' if 'xxx' is
      not a power of 2. I haven't touched '4096' & '8192' because they are
      more intuitive than 'SZ_4K' & 'SZ_8K'.
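
      The two replacement rules can be illustrated in userspace as below. Since
      linux/sizes.h is a kernel header, the two macros are redefined locally here as
      stand-ins, and mib() is a hypothetical helper added only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the constants provided by linux/sizes.h. */
#define SZ_1M   0x00100000
#define SZ_256M 0x10000000

/* Power of 2: one constant replaces the whole multiplication. */
_Static_assert(256 * 1024 * 1024 == SZ_256M, "256MB must equal SZ_256M");

/* Not a power of 2: keep the multiplier and use SZ_1M as the unit. */
_Static_assert(100 * 1024 * 1024 == 100 * SZ_1M, "100MB must equal 100 * SZ_1M");

/* Hypothetical helper: convert a count of megabytes to bytes. */
static uint64_t mib(uint64_t n)
{
	return n * SZ_1M;
}
```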
      Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: better packing of btrfs_delayed_extent_op · 35b3ad50
      David Sterba committed
      btrfs_delayed_extent_op can be packed in a better way, it's 40 bytes now
      and has 8 unused bytes. Reducing the level type to u8 makes it possible
      to squeeze it to the padding byte after key. The bitfields were switched
      to bool as there's space to store the full byte without increasing the
      whole structure, besides that the generated assembly is smaller.
      
      struct btrfs_delayed_extent_op {
      	struct btrfs_disk_key      key;                  /*     0    17 */
      	u8                         level;                /*    17     1 */
      	bool                       update_key;           /*    18     1 */
      	bool                       update_flags;         /*    19     1 */
      	bool                       is_data;              /*    20     1 */
      
      	/* XXX 3 bytes hole, try to pack */
      
      	u64                        flags_to_set;         /*    24     8 */
      
      	/* size: 32, cachelines: 1, members: 6 */
      	/* sum members: 29, holes: 1, sum holes: 3 */
      	/* last cacheline: 32 bytes */
      };
      
      The final size is 32 bytes which gives +26 objects per slab page.
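
      The layout and the +26 figure can be checked with a small userspace sketch.
      The struct names are stand-ins, the packed attribute is GCC/Clang specific,
      and objs_per_page() assumes a 4096-byte page with no allocator overhead, which
      is a simplification of how a slab actually lays out objects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for btrfs_disk_key: 17 bytes because it is packed. */
struct disk_key_sketch {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
} __attribute__((packed));

/* Stand-in mirroring the member order shown in the listing above. */
struct delayed_extent_op_sketch {
	struct disk_key_sketch key; /* offset 0, 17 bytes */
	uint8_t  level;             /* offset 17 */
	bool     update_key;        /* offset 18 */
	bool     update_flags;      /* offset 19 */
	bool     is_data;           /* offset 20, then a 3-byte hole */
	uint64_t flags_to_set;      /* offset 24 */
};

_Static_assert(sizeof(struct disk_key_sketch) == 17, "key must be packed");
_Static_assert(offsetof(struct delayed_extent_op_sketch, flags_to_set) == 24,
	       "flags_to_set should start at offset 24");
_Static_assert(sizeof(struct delayed_extent_op_sketch) == 32,
	       "structure should be 32 bytes");

/* Objects that fit in one assumed 4096-byte page, ignoring overhead. */
static unsigned int objs_per_page(unsigned int obj_size)
{
	assert(obj_size > 0);
	return 4096u / obj_size;
}
```

      Under these assumptions, 4096 / 40 gives 102 objects and 4096 / 32 gives 128,
      a difference of 26.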
      
         text	   data	    bss	    dec	    hex	filename
       938811	  43670	  23144	1005625	  f5839	fs/btrfs/btrfs.ko.before
       938747	  43670	  23144	1005561	  f57f9	fs/btrfs/btrfs.ko.after
      Signed-off-by: David Sterba <dsterba@suse.com>
  15. 31 Dec, 2015 1 commit
    • F
      Btrfs: fix race between free space endio workers and space cache writeout · 2bc0bb5f
      Filipe Manana committed
      While running a stress test I ran into the following trace/transaction
      abort:
      
      [471626.672243] ------------[ cut here ]------------
      [471626.673322] WARNING: CPU: 9 PID: 19107 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
      [471626.675492] BTRFS: Transaction aborted (error -2)
      [471626.676748] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc i2c_piix
      [471626.688802] CPU: 14 PID: 19107 Comm: fsstress Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [471626.690148] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
      [471626.691901]  0000000000000000 ffff880016037cf0 ffffffff812566f4 ffff880016037d38
      [471626.695009]  ffff880016037d28 ffffffff8104d0a6 ffffffffa040c84e 00000000fffffffe
      [471626.697490]  ffff88011fe855f8 ffff88000c484cb0 ffff88000d195000 ffff880016037d90
      [471626.699201] Call Trace:
      [471626.699804]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
      [471626.701049]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
      [471626.702542]  [<ffffffffa040c84e>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
      [471626.704326]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
      [471626.705636]  [<ffffffffa0403717>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
      [471626.707048]  [<ffffffffa040c84e>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
      [471626.708616]  [<ffffffffa048a50a>] commit_cowonly_roots+0x1d7/0x25a [btrfs]
      [471626.709950]  [<ffffffffa041e34a>] btrfs_commit_transaction+0x4c4/0x991 [btrfs]
      [471626.711286]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
      [471626.712611]  [<ffffffffa03f6df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
      [471626.715610]  [<ffffffff811962a2>] ? SyS_tee+0x226/0x226
      [471626.716718]  [<ffffffff811962c2>] sync_fs_one_sb+0x20/0x22
      [471626.717672]  [<ffffffff8116fc01>] iterate_supers+0x75/0xc2
      [471626.718800]  [<ffffffff8119669a>] sys_sync+0x52/0x80
      [471626.719990]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [471626.721835] ---[ end trace baf57f43d76693f4 ]---
      [471626.722954] BTRFS: error (device sdc) in btrfs_write_dirty_block_groups:3740: errno=-2 No such entry
      
      This is a very rare situation and it happened due to a race between a free
      space endio worker and writing the space caches for dirty block groups at
      a transaction's commit critical section. The steps leading to this are:
      
      1) A task calls btrfs_commit_transaction() and starts the writeout of the
         space caches for all currently dirty block groups (i.e. it calls
         btrfs_start_dirty_block_groups());
      
      2) The previous step starts writeback for space caches;
      
      3) When the writeback finishes it queues jobs for free space endio work
         queue (fs_info->endio_freespace_worker) that execute
         btrfs_finish_ordered_io();
      
      4) The task committing the transaction sets the transaction's state
         to TRANS_STATE_COMMIT_DOING and shortly after calls
         btrfs_write_dirty_block_groups();
      
      5) A free space endio job joins the transaction, through
         btrfs_join_transaction_nolock(), and updates a free space inode item
         in the root tree through btrfs_update_inode_fallback();
      
      6) Updating the free space inode item resulted in COWing one or more
         nodes/leaves of the root tree, and that resulted in creating a new
         metadata block group, which gets added to the transaction's list
         of dirty block groups (this is a very rare case);
      
      7) The free space endio job has not released yet its transaction handle
         at this point, so the new metadata block group was not yet fully
         created (didn't go through btrfs_create_pending_block_groups() yet);
      
      8) The transaction commit task sees the new metadata block group in
         the transaction's list of dirty block groups and processes it.
         When it attempts to update the block group's block group item in
         the extent tree, through write_one_cache_group(), it isn't able
         to find it and aborts the transaction with error -ENOENT - this
         is because the free space endio job hasn't yet released its
         transaction handle (which calls btrfs_create_pending_block_groups())
         and therefore the block group item was not yet added to the extent
         tree.
      
      Fix this by waiting for free space endio jobs if we fail to find a block
      group item in the extent tree, and then retrying the update of the block
      group item once.
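
      The wait-then-retry-once pattern described here can be sketched as follows.
      Everything in the example is hypothetical: the callback table stands in for
      the real update and wait operations, and the demo callbacks only simulate an
      item that appears once the endio jobs have finished.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical callbacks standing in for the real operations. */
struct bg_ops_sketch {
	int  (*update_item)(void *ctx);     /* returns 0 or -ENOENT */
	void (*wait_endio_jobs)(void *ctx); /* waits for free space endio jobs */
};

/* Try the update; on -ENOENT wait for endio jobs and retry exactly once. */
static int write_bg_item_retry_once(const struct bg_ops_sketch *ops, void *ctx)
{
	int ret;

	assert(ops && ops->update_item && ops->wait_endio_jobs);
	ret = ops->update_item(ctx);
	if (ret == -ENOENT) {
		ops->wait_endio_jobs(ctx);
		ret = ops->update_item(ctx);
	}
	return ret;
}

/* Demo state: the item exists only after the endio jobs have completed. */
struct demo_state {
	bool item_present;
	int attempts;
};

static int demo_update(void *ctx)
{
	struct demo_state *st = ctx;

	st->attempts++;
	return st->item_present ? 0 : -ENOENT;
}

static void demo_wait(void *ctx)
{
	struct demo_state *st = ctx;

	st->item_present = true;
}
```

      A second -ENOENT is returned to the caller unchanged, so a genuinely missing
      item still surfaces as an error.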
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  16. 30 Dec, 2015 2 commits
  17. 22 Dec, 2015 1 commit
    • F
      Btrfs: fix unprotected list operations at btrfs_write_dirty_block_groups · e44081ef
      Filipe Manana committed
      We call btrfs_write_dirty_block_groups() in the critical section of a
      transaction's commit, when no other tasks can join the transaction and
      add more block groups to the transaction's list of dirty block groups,
      so we are not taking the dirty block groups spinlock when checking for the
      list's emptiness, grabbing its first element or deleting elements from
      it.
      
      However there's a special and rare case where we can have a concurrent
      task adding elements to this list. We trigger writeback for space
      caches before at btrfs_start_dirty_block_groups() and in past iterations
      of the loop at btrfs_write_dirty_block_groups(), this means that when
      the writeback finishes (which happens asynchronously) it creates a
      task for the endio free space work queue that executes
      btrfs_finish_ordered_io() - this function is able to join the transaction,
      through btrfs_join_transaction_nolock(), and update the free space cache's
      inode item in the root tree, which can result in COWing nodes of this tree
      and therefore allocation of a new block group can happen, which gets added
      to the transaction's list of dirty block groups while the transaction
      commit task is operating on it concurrently.
      
      So fix this by taking the dirty block groups spinlock before doing
      operations on the dirty block groups list at
      btrfs_write_dirty_block_groups().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  18. 18 Dec, 2015 3 commits
    • O
      Btrfs: wire up the free space tree to the extent tree · 1e144fb8
      Omar Sandoval committed
      The free space tree is updated in tandem with the extent tree. There are
      only a handful of places where we need to hook in:
      
      1. Block group creation
      2. Block group deletion
      3. Delayed refs (extent creation and deletion)
      4. Block group caching
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • O
      Btrfs: implement the free space B-tree · a5ed9182
      Omar Sandoval committed
      The free space cache has turned out to be a scalability bottleneck on
      large, busy filesystems. When the cache for a lot of block groups needs
      to be written out, we can get extremely long commit times; if this
      happens in the critical section, things are especially bad because we
      block new transactions from happening.
      
      The main problem with the free space cache is that it has to be written
      out in its entirety and is managed in an ad hoc fashion. Using a B-tree
      to store free space fixes this: updates can be done as needed and we get
      all of the benefits of using a B-tree: checksumming, RAID handling,
      well-understood behavior.
      
      With the free space tree, we get commit times that are about the same as
      the no cache case with load times slower than the free space cache case
      but still much faster than the no cache case. Free space is represented
      with extents until it becomes more space-efficient to use bitmaps,
      giving us similar space overhead to the free space cache.
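
      The extent-versus-bitmap trade-off can be illustrated with a rough cost
      comparison. This is only a sketch of the idea: the function names, the fixed
      per-item cost and the omission of any bitmap item overhead are assumptions for
      the example and do not reflect the thresholds the patch actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes needed to describe free space as separate extent items,
 * assuming a fixed per-item cost (item_bytes is illustrative). */
static uint64_t extent_cost(uint64_t nr_extents, uint64_t item_bytes)
{
	return nr_extents * item_bytes;
}

/* Bytes needed for a bitmap with one bit per block, rounded up. */
static uint64_t bitmap_cost(uint64_t region_bytes, uint64_t block_bytes)
{
	uint64_t bits;

	assert(block_bytes > 0);
	bits = region_bytes / block_bytes;
	return (bits + 7) / 8;
}

/* Prefer the bitmap once it is strictly cheaper than the extent items. */
static bool should_use_bitmap(uint64_t nr_extents, uint64_t item_bytes,
			      uint64_t region_bytes, uint64_t block_bytes)
{
	return bitmap_cost(region_bytes, block_bytes) <
	       extent_cost(nr_extents, item_bytes);
}
```

      For a 1 MiB region with 4096-byte blocks the bitmap needs 32 bytes, so with an
      assumed 25 bytes per extent item a single free extent stays cheaper as an
      extent, while two or more fragments favour the bitmap.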
      
      The operations on the free space tree are: adding and removing free
      space, handling the creation and deletion of block groups, and loading
      the free space for a block group. We can also create the free space tree
      by walking the extent tree and clear the free space tree.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • O
      Btrfs: refactor caching_thread() · 73fa48b6
      Omar Sandoval committed
      We're also going to load the free space tree from caching_thread(), so
      we should refactor some of the common code.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  19. 10 Dec, 2015 1 commit
    • F
      Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list · 348a0013
      Filipe Manana committed
      As of my previous change titled "Btrfs: fix scrub preventing unused block
      groups from being deleted", the following warning at
      extent-tree.c:btrfs_delete_unused_bgs() can be hit when we mount a
      filesystem with "-o discard":
      
       void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
       {
       (...)
                       if (trimming) {
                               WARN_ON(!list_empty(&block_group->bg_list));
                               spin_lock(&trans->transaction->deleted_bgs_lock);
                               list_move(&block_group->bg_list,
                                         &trans->transaction->deleted_bgs);
                               spin_unlock(&trans->transaction->deleted_bgs_lock);
                               btrfs_get_block_group(block_group);
                       }
       (...)
      
      This happens because scrub can now add back the block group to the list of
      unused block groups (fs_info->unused_bgs). This is dangerous because we
      are moving the block group from the unused block groups list to the list
      of deleted block groups without holding the lock that protects the source
      list (fs_info->unused_bgs_lock).
      
      The following diagram illustrates how this happens:
      
                  CPU 1                                     CPU 2
      
       cleaner_kthread()
         btrfs_delete_unused_bgs()
      
           sees bg X in list
            fs_info->unused_bgs
      
           deletes bg X from list
            fs_info->unused_bgs
      
                                                  scrub_enumerate_chunks()
      
                                                    searches device tree using
                                                    its commit root
      
                                                    finds device extent for
                                                    block group X
      
                                                    gets block group X from the tree
                                                    fs_info->block_group_cache_tree
                                                    (via btrfs_lookup_block_group())
      
                                                    sets bg X to RO (again)
      
                                                    scrub_chunk(bg X)
      
                                                    sets bg X back to RW mode
      
                                                    adds bg X to the list
                                                    fs_info->unused_bgs again,
                                                    since it's still unused and
                                                    currently not in that list
      
           sets bg X to RO mode
      
           btrfs_remove_chunk(bg X)
      
           --> discard is enabled and bg X
               is in the fs_info->unused_bgs
               list again so the warning is
               triggered
           --> we move it from that list into
               the transaction's deleted_bgs
               list, but we can have another
               task currently manipulating
               the first list (fs_info->unused_bgs)
      
      Fix this by using the same lock (fs_info->unused_bgs_lock) to protect both
      the list of unused block groups and the list of deleted block groups. This
      makes it safe and there's not much worry for more lock contention, as this
      lock is seldom used and only the cleaner kthread adds elements to the list
      of deleted block groups. The warning goes away too, as this was previously
      an impossible case (and would have been better a BUG_ON/ASSERT) but it's
      not impossible anymore.
      Reproduced with fstest btrfs/073 (using MOUNT_OPTIONS="-o discard").
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  20. 07 Dec, 2015 1 commit
  21. 25 Nov, 2015 4 commits
    • M
      btrfs: qgroup: account shared subtree during snapshot delete · 82bd101b
      Mark Fasheh committed
      Commit 0ed4792a ('btrfs: qgroup: Switch to new extent-oriented qgroup
      mechanism.') removed our qgroup accounting during
      btrfs_drop_snapshot(). Predictably, this results in qgroup numbers
      going bad shortly after a snapshot is removed.
      
      Fix this by adding a dirty extent record when we encounter extents during
      our shared subtree walk. This effectively restores the functionality we had
      with the original shared subtree walking code in 1152651a (btrfs: qgroup:
      account shared subtrees during snapshot delete).
      
      The idea with the original patch (and this one) is that shared subtrees can
      get skipped during drop_snapshot. The shared subtree walk then allows us a
      chance to visit those extents and add them to the qgroup work for later
      processing. This ultimately makes the accounting for drop snapshot work.
      
      The new qgroup code nicely handles all the other extents during the tree
      walk via the ref dec/inc functions so we don't have to add actions beyond
      what we had originally.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix race between cleaner kthread and space cache writeout · 036a9348
      Filipe Manana committed
      When a block group becomes unused and the cleaner kthread is currently
      running, we can end up getting the current transaction aborted with error
      -ENOENT when we try to commit the transaction, leading to the following
      trace:
      
        [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]()
        [59779.272594] BTRFS: Transaction aborted (error -2)
        (...)
        [59779.291137] Call Trace:
        [59779.291621]  [<ffffffff812566f4>] dump_stack+0x4e/0x79
        [59779.292543]  [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8
        [59779.293435]  [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
        [59779.295000]  [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50
        [59779.296138]  [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs]
        [59779.297663]  [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]
        [59779.299141]  [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs]
        [59779.300359]  [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs]
        [59779.301805]  [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs]
        [59779.302893]  [<ffffffff81196634>] sync_filesystem+0x7f/0x93
        (...)
        [59779.318186] ---[ end trace 577e2daff90da33a ]---
      
      The following diagram illustrates a sequence of steps leading to this
      problem:
      
             CPU 1                                             CPU 2
      
                                 <at transaction N>
      
                                                              adds bg A to list
                                                              fs_info->unused_bgs
      
                                                              adds bg B to list
                                                              fs_info->unused_bgs
      
                                 <transaction kthread
                                  commits transaction N
                                  and wakes up the
                                  cleaner kthread>
      
        cleaner kthread
          delete_unused_bgs()
      
            sees bg A in list
            fs_info->unused_bgs
      
            btrfs_start_transaction()
      
                                 <transaction N + 1 starts>
      
            deletes bg A
      
                                                              update_block_group(bg C)
      
                                                                --> adds bg C to list
                                                                    fs_info->unused_bgs
      
            deletes bg B
      
            sees bg C in the list
            fs_info->unused_bgs
      
            btrfs_remove_chunk(bg C)
              btrfs_remove_block_group(bg C)
      
                --> checks if the block group
                    is in a dirty list, and
                    because it isn't now, it
                    does nothing
      
                --> the block group item
                    is deleted from the
                    extent tree
      
                                                                --> adds bg C to list
                                                                    transaction->dirty_bgs
      
                                                               some task calls
                                                               btrfs_commit_transaction(t N + 1)
                                                                 commit_cowonly_roots()
                                                                   btrfs_write_dirty_block_groups()
                                                                     --> sees bg C in cur_trans->dirty_bgs
                                                                     --> calls write_one_cache_group()
                                                                         which returns -ENOENT because
                                                                         it did not find the block group
                                                                         item in the extent tree
                                                                     --> transaction aborted with -ENOENT
                                                                         because write_one_cache_group()
                                                                         returned that error
      
      So fix this by adding a block group to the list of dirty block groups
      before adding it to the list of unused block groups.
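
The effect of the insertion order can be illustrated with a toy replay of the interleaving in the diagram. This is only a sketch in Python, not kernel code: the function name `commit_hits_enoent`, the plain lists standing in for `fs_info->unused_bgs` and `transaction->dirty_bgs`, and the set standing in for the extent tree are all assumptions made for illustration.

```python
def commit_hits_enoent(dirty_first: bool) -> bool:
    """Replay the interleaving from the diagram for block group "C".

    Returns True if the commit would find a dirty block group whose item
    is already gone from the extent tree (the -ENOENT case).
    """
    extent_tree = {"C"}  # block group items currently in the extent tree
    unused_bgs = []      # stands in for fs_info->unused_bgs
    dirty_bgs = []       # stands in for transaction->dirty_bgs

    if dirty_first:
        first, second = dirty_bgs, unused_bgs  # order proposed by the fix
    else:
        first, second = unused_bgs, dirty_bgs  # order that triggers the race

    # CPU 2: update_block_group(bg C) performs its first list insertion.
    first.append("C")

    # CPU 1: the cleaner runs in the window between the two insertions.
    for bg in list(unused_bgs):
        unused_bgs.remove(bg)
        if bg in dirty_bgs:       # removal also drops the bg from the dirty list
            dirty_bgs.remove(bg)
        extent_tree.discard(bg)   # the block group item is deleted

    # CPU 2: second list insertion.
    second.append("C")

    # Commit: every dirty block group must still have its item.
    return any(bg not in extent_tree for bg in dirty_bgs)
```

With the old order the cleaner can delete the item before the block group reaches the dirty list, so the commit fails. With the new order the cleaner cannot see the block group in this window at all. The model covers only this single window and ignores locking.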
      
      This happened on a stress test using fsstress plus concurrent calls to
      fallocate 20G and truncate (releasing part of the space allocated with
      fallocate).
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      036a9348
    • F
      Btrfs: fix scrub preventing unused block groups from being deleted · 758f2dfc
      Filipe Manana committed
      Currently scrub can race with the cleaner kthread when the latter attempts
      to delete an unused block group. As a result, the cleaner kthread never
      deletes the block group afterwards, unless the block group becomes used
      and then unused again. The following diagram illustrates that race:
      
                    CPU 1                                 CPU 2
      
       cleaner kthread
         btrfs_delete_unused_bgs()
      
           gets block group X from
           fs_info->unused_bgs and
           removes it from that list
      
                                                   scrub_enumerate_chunks()
      
                                                     searches device tree using
                                                     its commit root
      
                                                     finds device extent for
                                                     block group X
      
                                                     gets block group X from the tree
                                                     fs_info->block_group_cache_tree
                                                     (via btrfs_lookup_block_group())
      
                                                     sets bg X to RO
      
           sees the block group is
           already RO and therefore
           neither deletes it nor adds
           it back to the unused list
      
      So fix this by making scrub add the block group back to the list of
      unused block groups when it finishes scrubbing it, provided the block
      group is still unused and has not already been removed.
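
The condition described above can be sketched as follows. This is a simplified Python model, not the kernel implementation: `BlockGroup`, its two fields and `requeue_after_scrub` are hypothetical names, and the real code also has to take the relevant locks.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare block groups by identity, not by field values
class BlockGroup:
    used_bytes: int = 0    # 0 means the block group is unused
    removed: bool = False  # True once the block group has been deleted


def requeue_after_scrub(bg: BlockGroup, unused_bgs: list) -> bool:
    """Re-add bg to the unused list once scrub has finished with it.

    Returns True if the block group was put back on the list.
    """
    if bg.removed or bg.used_bytes != 0:
        return False  # already deleted, or in use again
    if bg in unused_bgs:
        return False  # already queued for the cleaner
    unused_bgs.append(bg)
    return True
```

The point of the check is that the cleaner gets a second chance at a block group it skipped because scrub had set it read-only.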
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      758f2dfc
    • F
      Btrfs: fix the number of transaction units needed to remove a block group · 7fd01182
      Filipe Manana committed
      We were using only 1 transaction unit when attempting to delete an unused
      block group but in reality we need 3 + N units, where N corresponds to the
      number of stripes. We were accounting only for the addition of the orphan
      item (for the block group's free space cache inode) but we were not
      accounting that we need to delete one block group item from the extent
      tree, one free space item from the tree of tree roots and N device extent
      items from the device tree.
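
The accounting above reduces to simple arithmetic. A minimal sketch in Python, where `units_to_remove_block_group` is a hypothetical helper and not a kernel function:

```python
def units_to_remove_block_group(num_stripes: int) -> int:
    """Transaction units needed to remove one unused block group (3 + N)."""
    if num_stripes < 1:
        raise ValueError("a block group has at least one stripe")
    orphan_item = 1                    # added for the free space cache inode
    block_group_item = 1               # deleted from the extent tree
    free_space_item = 1                # deleted from the tree of tree roots
    device_extent_items = num_stripes  # deleted from the device tree
    return (orphan_item + block_group_item + free_space_item
            + device_extent_items)
```

For example, a block group with two stripes needs 3 + 2 = 5 units, rather than the single unit reserved before this change.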
      
      While one unit is not enough, it worked most of the time because for each
      single unit we are too pessimistic and assume an entire tree path, with
      the highest possible height (8), needs to be COWed with eventual node
      splits at every possible level in the tree, so there was usually enough
      reserved space for removing all the items and adding the orphan item.
      
      However, after adding the orphan item, writepages() can be called by the
      VM subsystem against the btree inode when we are under memory pressure,
      which causes writeback to start for the nodes we COWed before. This
      forces the operation that removes the free space item to COW again some
      (or all) of the same nodes (in the tree of tree roots). Even without
      writepages() being called, we could fail with ENOSPC because these items
      are located in multiple trees and one of them might have a greater
      height and require node/leaf splits at many levels, exhausting all the
      reserved space before removing all the items and adding the orphan.
      
      In the kernel 4.0 release, commit 3d84be79 ("Btrfs: fix BUG_ON in
      btrfs_orphan_add() when delete unused block group"), we attempted to fix
      a BUG_ON due to ENOSPC when trying to add the orphan item by making the
      cleaner kthread reserve one transaction unit before attempting to remove
      the block group, but this was not enough. We had a couple of user reports
      still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on
      a 4.2-rc6 kernel for example:
      
          http://www.spinics.net/lists/linux-btrfs/msg46070.html
      
      So fix this by reserving all the necessary units of metadata.
      
      Reported-by: Stefan Priebe <s.priebe@profihost.ag>
      Fixes: 3d84be79 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7fd01182