1. 13 May 2016 (7 commits)
    • Btrfs: pin logs earlier when doing a rename exchange operation · 376e5a57
      Authored by Filipe Manana
      The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
      which had a race fixed by my previous patch titled "Btrfs: pin log earlier
      when renaming", and so it suffers from the same problem.
      
      We pin the logs of the affected roots after we insert the new inode
      references, leaving a time window where concurrent tasks logging the
      inodes can end up logging both the new and old references, resulting
      in log trees that, when replayed, can turn the metadata into inconsistent
      states. This behaviour was added to btrfs_rename() in 2009 without any
      explanation of why the logs were not pinned earlier, only a comment noting
      the possibility of the race. As of today it's perfectly
      safe and sane to pin the logs before we start doing any of the steps
      involved in the rename operation.
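
      A minimal sketch of the reordering (not the actual patch; it assumes the
      tree-log helpers take the subvolume root as their argument, ignores the
      cases where no log pinning is needed, and insert_new_inode_refs() is a
      hypothetical placeholder for the first rename step):

         /* Pin the logs of both subvolumes before the first step of the
          * exchange, so a concurrent fsync cannot log a half-done state. */
         btrfs_pin_log_trans(root);     /* source subvolume's log */
         btrfs_pin_log_trans(dest);     /* destination subvolume's log */

         ret = insert_new_inode_refs();         /* hypothetical first step */
         /* ... remaining rename exchange steps ... */

         btrfs_end_log_trans(dest);
         btrfs_end_log_trans(root);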
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: unpin logs if rename exchange operation fails · 86e8aa0e
      Authored by Filipe Manana
      If rename exchange operations fail at some point after we pinned any of
      the logs, we end up aborting the current transaction but never unpin the
      logs, which leaves concurrent tasks that are trying to sync the logs (as
      part of an fsync request from user space) blocked forever and prevents
      the filesystem from being unmounted.
      
      Fix this by safely unpinning the log.
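
      A sketch of the shape of the fix, assuming the pinned state is tracked
      with local booleans (illustrative names, not necessarily the exact ones
      used in the patch):

         bool root_log_pinned = false;
         bool dest_log_pinned = false;

         btrfs_pin_log_trans(root);
         root_log_pinned = true;
         btrfs_pin_log_trans(dest);
         dest_log_pinned = true;

         /* ... rename exchange steps; on failure jump to out_fail ... */

      out_fail:
         /* Unpin before aborting the transaction, so tasks blocked while
          * trying to sync the logs (fsync) are not left waiting forever. */
         if (root_log_pinned)
                 btrfs_end_log_trans(root);
         if (dest_log_pinned)
                 btrfs_end_log_trans(dest);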
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix inode leak on failure to setup whiteout inode in rename · c9901618
      Authored by Filipe Manana
      If we failed to fully set up the whiteout inode during a rename operation
      with the whiteout flag, we ended up leaking the inode: its link count was
      never decremented and its items were never removed from the fs/subvol tree.
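
      Per the description above, the error path needs to undo the link count
      bump and drop the last reference so the inode is evicted and its items
      deleted; roughly (a sketch, not the exact error labels used in inode.c):

         if (ret) {
                 /* The whiteout inode was only partially set up: drop the
                  * link we added and release our reference so the inode is
                  * not leaked. */
                 inode_dec_link_count(inode);
                 iput(inode);
         }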
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT · cdd1fedf
      Authored by Dan Fuhry
      Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
      behavior in the renameat2() syscall. This behavior is primarily used by
      overlayfs. This patch adds support for these flags to btrfs, enabling it to
      be used as a fully functional upper layer for overlayfs.
      
      RENAME_EXCHANGE support was written by Davide Italiano and originally
      submitted on 2 April 2015.
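
      For reference, the new flags are exercised from user space through
      renameat2(); a small self-contained example using the raw syscall
      (glibc had no wrapper at the time; it assumes kernel headers that
      define SYS_renameat2 and the RENAME_* flags):

         #include <fcntl.h>           /* AT_FDCWD */
         #include <linux/fs.h>        /* RENAME_EXCHANGE, RENAME_WHITEOUT */
         #include <stdio.h>
         #include <sys/syscall.h>
         #include <unistd.h>

         int main(void)
         {
                 /* Atomically swap "a" and "b"; both paths must exist. */
                 if (syscall(SYS_renameat2, AT_FDCWD, "a", AT_FDCWD, "b",
                             RENAME_EXCHANGE) != 0)
                         perror("renameat2 RENAME_EXCHANGE");

                 /* Rename "b" to "c", leaving a whiteout object at "b",
                  * which is what overlayfs relies on in its upper layer. */
                 if (syscall(SYS_renameat2, AT_FDCWD, "b", AT_FDCWD, "c",
                             RENAME_WHITEOUT) != 0)
                         perror("renameat2 RENAME_WHITEOUT");

                 return 0;
         }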
      Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
      Signed-off-by: Dan Fuhry <dfuhry@datto.com>
      [ remove unlikely ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: pin log earlier when renaming · c4aba954
      Authored by Filipe Manana
      We were pinning the log right after the first step in the rename operation
      (inserting inode ref for the new name in the destination directory)
      instead of doing it before. This behaviour was introduced in 2009 for a
      reason that was mentioned neither in the changelog nor in any comment,
      with the drawback of a small time window where concurrent log writers can
      end up logging the new inode reference for the inode we are renaming while
      the rename operation is in progress (so we can end up with a log
      containing both the new and old references). As of today there's no longer
      any reason not to pin the log before that first step, so just fix this.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: unpin log if rename operation fails · 3dc9e8f7
      Authored by Filipe Manana
      If rename operations fail at some point after we pinned the log, we end
      up aborting the current transaction but never unpin the log, which leaves
      concurrent tasks that are trying to sync the log (as part of an fsync
      request from user space) blocked forever and prevents the filesystem
      from being unmounted.
      
      Fix this by safely unpinning the log.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Authored by Filipe Manana
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      ranges).
      
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                 fs_info->ordered_roots
      
                                                 btrfs_direct_IO()
                                                   __blockdev_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       btrfs_lock_extent_direct()
                                                         locks range in the io tree
                                                       btrfs_new_extent_direct()
                                                         btrfs_reserve_extent()
                                                           --> extent allocated
                                                               from bg X
      
         btrfs_inc_block_group_ro(bg X)
      
         btrfs_start_delalloc_roots()
           __start_delalloc_inodes()
             --> does nothing, no delalloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
                 fs_info->delalloc_roots
      
         btrfs_wait_ordered_roots()
           --> does not find the inode's root in the
               list fs_info->ordered_roots
      
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
      
         relocate_block_group(rc->stage ==
           MOVE_DATA_EXTENTS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
      
           iterates the extent tree, using its
           commit root and moves extents into new
           locations
      
                                                         btrfs_add_ordered_extent_dio()
                                                           --> now a ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
      
           btrfs_commit_transaction()
      
                                                         btrfs_finish_ordered_io()
                                                           btrfs_alloc_reserved_file_extent()
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
      
         relocate_block_group(rc->stage ==
           UPDATE_DATA_PTRS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
      
           sees the extent in the extent tree
      
      When this happens the relocation produces the following warning when it
      finishes:
      
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  2. 05 April 2016 (1 commit)
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Authored by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to implement
      the page cache with bigger chunks than PAGE_SIZE.

      This promise never materialized, and it is unlikely it ever will.

      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.

      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
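
      For instance, a typical call site changes like this (an illustrative
      before/after, not lifted from any particular file):

         /* before */
         index  = pos >> PAGE_CACHE_SHIFT;
         offset = pos & (PAGE_CACHE_SIZE - 1);
         page_cache_release(page);

         /* after */
         index  = pos >> PAGE_SHIFT;
         offset = pos & (PAGE_SIZE - 1);
         put_page(page);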
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files, so
      I've called spatch on them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.

      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 March 2016 (1 commit)
    • Btrfs: fix deadlock between direct IO reads and buffered writes · ade77029
      Authored by Filipe Manana
      While running a test with a mix of buffered IO and direct IO against
      the same files I hit a deadlock reported by the following trace:
      
      [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
      [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
      [11642.149771]     0 15282      2 0x00000000
      [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
      [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
      [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
      [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
      [11642.161403] Call Trace:
      [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
      [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
      [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
      [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
      [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.226295] 2 locks held by kworker/u32:3/15282:
      [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
      [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
      [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
      [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
      [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
      [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
      [11642.243715] Call Trace:
      [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
      [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
      [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
      [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
      [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
      [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
      [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.282604] 3 locks held by kworker/u32:8/15289:
      [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
      [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
      [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
      [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
      [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
      [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
      [11642.298433] Call Trace:
      [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
      [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
      [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
      [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
      [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
      [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
      [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
      [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
      [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.310403] 3 locks held by fdm-stress/26848:
      [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
      [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
      [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
      [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
      [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
      [11642.325096] Call Trace:
      [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
      [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
      [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
      [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
      [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
      [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
      [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
      [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
      [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
      [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
      [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
      [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
      [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
      [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
      [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
      [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
      [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
      [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
      [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
      [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.350222] 4 locks held by fdm-stress/26849:
      [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
      [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
      [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
      [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
      [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
      [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
      [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
      [11642.373430] Call Trace:
      [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
      [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
      [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
      [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
      [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
      [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
      [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
      [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
      [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
      [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
      [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
      [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
      [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
      [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
      [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
      [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
      [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.416572] 1 lock held by fdm-stress/26850:
      [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
      [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
      [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
      [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
      [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
      [11642.426591] Call Trace:
      [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
      [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
      [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
      [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
      [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
      [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
      [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
      [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
      [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
      [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.441533] 3 locks held by fdm-stress/26851:
      [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
      [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
      [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      
      This happened because under specific timings the path for direct IO reads
      can deadlock with concurrent buffered writes. The diagram below shows how
      this happens for an example file that has the following layout:
      
           [  extent A  ]  [  extent B  ]  [ ....
           0K              4K              8K
      
           CPU 1                                               CPU 2                             CPU 3
      
      DIO read against range
       [0K, 8K[ starts
      
      btrfs_direct_IO()
        --> calls btrfs_get_blocks_direct()
            which finds the extent map for the
            extent A and leaves the range
            [0K, 4K[ locked in the inode's
            io tree
      
                                                         buffered write against
                                                         range [4K, 8K[ starts
      
                                                         __btrfs_buffered_write()
                                                           --> dirties page at 4K
      
                                                                                           a user space
                                                                                           task calls sync
                                                                                           for e.g or
                                                                                           writepages() is
                                                                                           invoked by mm
      
                                                                                           writepages()
                                                                                             run_delalloc_range()
                                                                                               cow_file_range()
                                                                                                 --> ordered extent X
                                                                                                     for the buffered
                                                                                                     write is created
                                                                                                     and
                                                                                                     writeback starts
      
        --> calls btrfs_get_blocks_direct()
            again, without submitting first
            a bio for reading extent A, and
            finds the extent map for extent B
      
        --> calls lock_extent_direct()
      
            --> locks range [4K, 8K[
            --> finds ordered extent X
                covering range [4K, 8K[
            --> unlocks range [4K, 8K[
      
                                                        buffered write against
                                                        range [0K, 8K[ starts
      
                                                        __btrfs_buffered_write()
                                                          prepare_pages()
                                                            --> locks pages with
                                                                offsets 0 and 4K
                                                          lock_and_cleanup_extent_if_need()
                                                            --> blocks attempting to
                                                                lock range [0K, 8K[ in
                                                                the inode's io tree,
                                                                because the range [0, 4K[
                                                                is already locked by the
                                                                direct IO task at CPU 1
      
            --> calls
                btrfs_start_ordered_extent(oe X)
      
                btrfs_start_ordered_extent(oe X)
      
                  --> At this point writeback for ordered
                      extent X has not finished yet
      
                  filemap_fdatawrite_range()
                    btrfs_writepages()
                      extent_writepages()
                        extent_write_cache_pages()
                          --> finds page with offset 0
                              with the writeback tag
                              (and not dirty)
                          --> tries to lock it
                               --> deadlock, task at CPU 2
                                   has the page locked and
                                   is blocked on the io range
                                   [0, 4K[ that was locked
                                   earlier by this task
      
      So fix this by falling back to a buffered read in the direct IO read path
      when an ordered extent for a buffered write is found.
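
      The fallback is applied where the direct IO read path locks the extent
      range and finds the ordered extent; a sketch of its shape (the exact
      flag test is an assumption based on the description above):

         /* In lock_extent_direct(): for a DIO read, never wait on an ordered
          * extent created by a buffered write (one without the
          * BTRFS_ORDERED_DIRECT flag), since that is the deadlock shown
          * above.  Bail out with -ENOTBLK so the read falls back to the
          * buffered path. */
         if (writing || test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags))
                 btrfs_start_ordered_extent(inode, ordered, 1);
         else
                 ret = -ENOTBLK;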
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 18 February 2016 (3 commits)
  5. 16 February 2016 (1 commit)
    • Btrfs: fix direct IO requests not reporting IO error to user space · 1636d1d7
      Authored by Filipe Manana
      If a bio for a direct IO request fails, we were not setting the error in
      the parent bio (the main DIO bio), so the error was never returned to
      user space from btrfs_direct_IO(): __blockdev_direct_IO() returned the
      number of bytes issued for IO instead of the error that a bio created
      and submitted by btrfs_submit_direct() got from the block layer.
      This essentially happens because the call:

         dio_end_io(dio_bio, bio->bi_error);

      does not set dio_bio->bi_error to the value of the second argument.
      So just add this missing assignment in endio callbacks, just as we do in
      the error path at btrfs_submit_direct() when we fail to clone the dio bio
      or allocate its private object. This follows the convention of what is
      done with other similar APIs such as bio_endio() where the caller is
      responsible for setting the bi_error field in the bio it passes as an
      argument to bio_endio().
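
      A sketch of the missing assignment in the DIO end_io callbacks
      (simplified; the function below stands in for
      btrfs_endio_direct_read/write):

         static void btrfs_endio_direct_sketch(struct bio *bio)
         {
                 struct btrfs_dio_private *dip = bio->bi_private;
                 struct bio *dio_bio = dip->dio_bio;

                 /* Propagate the error to the parent DIO bio;
                  * dio_end_io() does not do this for us. */
                 dio_bio->bi_error = bio->bi_error;
                 dio_end_io(dio_bio, bio->bi_error);
                 bio_put(bio);
         }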
      
      This was detected by the new generic test cases in xfstests: 271, 272,
      276 and 278. These essentially set up a dm error target, load the error
      table, do a direct IO write and then unload the error table. They
      expect the write to fail with -EIO, which was not being reported
      when testing against btrfs.
      
      Cc: stable@vger.kernel.org  # 4.3+
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  6. 11 February 2016 (2 commits)
    • btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759
      Authored by David Sterba
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX due to 32bit compatibility, unless 'pos' was intentionally set to
      a larger value, in which case it's LLONG_MAX.
      
      There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
      64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
      the increment.
      
      We can get to that situation like that:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * next call to readdir will not emit any entries, but will reach the
        bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500), with a
      debugging print added, shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
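
      A sketch of the guarded bump (the 'emitted' flag name is illustrative
      of the approach; the INT_MAX/LLONG_MAX values follow the description
      above):

         /* Only move ctx->pos to its terminal value if this call actually
          * emitted entries; otherwise a later readdir call would find
          * pos == LLONG_MAX and overflow it with ctx->pos++. */
         if (emitted) {
                 if (ctx->pos >= INT_MAX)
                         ctx->pos = LLONG_MAX;
                 else
                         ctx->pos = INT_MAX;
         }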
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      Reported-by: Victor <services@swwu.com>
      CC: stable@vger.kernel.org
      Signed-off-by: David Sterba <dsterba@suse.com>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: readdir: use GFP_KERNEL · 49e350a4
      Authored by David Sterba
      Readdir is initiated from userspace and is not on the critical
      writeback path, so we don't need to use GFP_NOFS for allocations.
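
      The change is a straight swap of the allocation flags in the readdir
      code, e.g. (illustrative line; name_ptr/name_len stand for the name
      buffer allocated there):

         /* before */
         name_ptr = kmalloc(name_len, GFP_NOFS);

         /* after: readdir runs in process context, outside writeback,
          * so GFP_KERNEL reclaim is safe here */
         name_ptr = kmalloc(name_len, GFP_KERNEL);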
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 02 February 2016 (8 commits)
  8. 26 January 2016 (1 commit)
    • Btrfs: fix race between fsync and lockless direct IO writes · de0ee0ed
      Authored by Filipe Manana
      An fsync, using the fast path, can race with a concurrent lockless direct
      IO write and end up logging a file extent item that points to an extent
      that wasn't written to yet. This is because the fast fsync path collects
      ordered extents into a local list and then collects all the new extent
      maps to log file extent items based on them, while the direct IO write
      path creates the new extent map before it creates the corresponding
      ordered extent (and before submitting the respective bio(s)).
      
      So fix this by making the direct IO write path create ordered extents
      before the extent maps and make the fast fsync path collect any new
      ordered extents after it collects the extent maps.
      Note that making the fsync handler call inode_dio_wait() (after acquiring
      the inode's i_mutex) would not work and would lead to a deadlock when
      doing AIO, as through AIO we end up in a path where the fsync handler is called
      (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
      before the inode's dio counter is decremented (inode_dio_wait() waits
      for this counter to have a value of zero).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 23 January 2016 (1 commit)
    • wrappers for ->i_mutex access · 5955102c
      Authored by Al Viro
      Parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, with
      inode_foo(inode) being mutex_foo(&inode->i_mutex).

      Please use these for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared.
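
      The helpers map one-to-one onto the mutex API, for example
      (I_MUTEX_PARENT is one of the existing lockdep subclasses):

         inode_lock(inode);         /* mutex_lock(&inode->i_mutex) */
         inode_unlock(inode);       /* mutex_unlock(&inode->i_mutex) */
         inode_trylock(inode);      /* mutex_trylock(&inode->i_mutex) */
         inode_is_locked(inode);    /* mutex_is_locked(&inode->i_mutex) */
         inode_lock_nested(inode, I_MUTEX_PARENT);
                                    /* mutex_lock_nested(&inode->i_mutex,
                                       I_MUTEX_PARENT) */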
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 20 January 2016 (2 commits)
    • btrfs: merge functions for wait snapshot creation · 0bc19f90
      Authored by Zhao Lei
      wait_for_snapshot_creation() is in the same group as the other two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()

      Rename wait_for_snapshot_creation() and move it to the same place
      as the other two.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Authored by Filipe Manana
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction commit time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
       This happens because we can recursively acquire the semaphore
       fs_info->delayed_iput_sem when attempting to allocate space to satisfy
       a file write request, as shown in the first trace above: when committing
       a transaction we acquire (down_read) the semaphore before running the
       delayed iputs, and running a delayed iput() can end up calling an
       inode's eviction handler, which in turn commits another transaction
       and attempts to acquire (down_read) the semaphore again to run more
       delayed iput operations.
       This results in a deadlock: once a writer (down_write) queues on the
       semaphore between the two read acquisitions, as the blocked fio tasks
       above show, the second down_read() blocks forever. Lockdep flags this
       too, since a task that acquires the same semaphore multiple times is
       expected to use down_read_nested() with a different lockdep class for
       each level of recursion.
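
       Reduced to a minimal sketch (the names below are placeholders, not the
       actual btrfs code), the problematic pattern looks like this:

         #include <linux/rwsem.h>

         /* stand-in for fs_info->delayed_iput_sem */
         static DECLARE_RWSEM(delayed_iput_sem);

         static void run_delayed_iputs_sketch(void)
         {
                 down_read(&delayed_iput_sem);   /* transaction commit path */
                 /*
                  * A delayed iput evicts an inode, the eviction needs space,
                  * commits another transaction and re-enters this code...
                  */
                 down_read(&delayed_iput_sem);   /* recursive read lock: lockdep
                                                  * complains, and if a down_write()
                                                  * was queued in between (see the
                                                  * blocked fio tasks above) this
                                                  * call never returns */
                 up_read(&delayed_iput_sem);
                 up_read(&delayed_iput_sem);
         }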
      
       Fix this by simplifying the implementation: use a mutex instead, one
       that is acquired by the cleaner kthread before it runs the delayed
       iputs, rather than always acquiring a semaphore before delayed iputs
       are run from anywhere.
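
       In rough terms (again a sketch with a placeholder name, not the exact
       patch), the locking becomes:

         #include <linux/mutex.h>

         /* placeholder name for the mutex introduced by the fix */
         static DEFINE_MUTEX(cleaner_delayed_iput_mutex);

         /* only the cleaner kthread serializes around the delayed iputs */
         static void cleaner_run_delayed_iputs_sketch(void)
         {
                 mutex_lock(&cleaner_delayed_iput_mutex);
                 /* ... run the delayed iputs ... */
                 mutex_unlock(&cleaner_delayed_iput_mutex);
         }

         /*
          * No lock is taken on the commit-and-evict chain itself anymore, so
          * the recursive read acquisition shown above can no longer happen.
          */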
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
       Signed-off-by: Chris Mason <clm@fb.com>
      c2d6cb16
   11. 15 January 2016, 1 commit
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
       Committed by Vladimir Davydov
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
        - inode for all filesystems. This is the most tedious part, because
          most filesystems override the alloc_inode method.
      
       The list is far from complete, so feel free to add more objects.
       Nevertheless, it should be close to the "account everything" approach
       and keep most workloads within bounds.  Malevolent users will be able
       to breach the limit, but this was possible even with the former
       "account everything" approach (simply because it did not, in fact,
       account everything).
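
       For illustration only (the cache and struct names here are hypothetical,
       not taken from this patch), opting an allocation into memcg accounting
       amounts to the following:

         #include <linux/init.h>
         #include <linux/slab.h>
         #include <linux/gfp.h>

         /* hypothetical object that user space can create in large numbers */
         struct foo_item {
                 int id;
         };

         static struct kmem_cache *foo_cachep;

         static int __init foo_cache_init(void)
         {
                 /* SLAB_ACCOUNT: objects from this cache are charged to the
                  * allocating task's memcg */
                 foo_cachep = kmem_cache_create("foo_item", sizeof(struct foo_item),
                                                0, SLAB_ACCOUNT, NULL);
                 return foo_cachep ? 0 : -ENOMEM;
         }

         /* one-off allocations on user-triggerable paths pass __GFP_ACCOUNT */
         static struct foo_item *foo_item_alloc(void)
         {
                 return kmalloc(sizeof(struct foo_item), GFP_KERNEL | __GFP_ACCOUNT);
         }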
      
      [akpm@linux-foundation.org: coding-style fixes]
       Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d097056
   12. 07 January 2016, 10 commits
   13. 01 January 2016, 2 commits
    • F
      Btrfs: fix number of transaction units required to create symlink · 9269d12b
       Committed by Filipe Manana
       We weren't accounting for the insertion of an inline extent item for
       the symlink inode, nor for the need to update the parent inode item
       (through the call to btrfs_add_nondir()). So fix this by including two
       more transaction units.
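
       In essence (a paraphrase of the change, not the literal diff, and the
       item breakdown below is an illustration of the accounting rather than
       a quote), the transaction started in btrfs_symlink() now reserves
       units for those two extra items as well:

         /*
          * 2 items for the inode item and inode ref
          * 2 items for the dir item and dir index
          * 1 item for updating the parent inode item
          * 1 item for the inline extent item
          * 1 item for an xattr if selinux is on
          */
         trans = btrfs_start_transaction(root, 7);  /* previously only 5 units */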
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
      9269d12b
    • F
      Btrfs: don't leave dangling dentry if symlink creation failed · d50866d0
       Committed by Filipe Manana
      When we are creating a symlink we might fail with an error after we
      created its inode and added the corresponding directory indexes to its
      parent inode. In this case we end up never removing the directory indexes
      because the inode eviction handler, called for our symlink inode on the
      final iput(), only removes items associated with the symlink inode and
      not with the parent inode.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt
        $ touch /mnt/foo
        $ ln -s /mnt/foo /mnt/bar
        ln: failed to create symbolic link ‘bar’: Cannot allocate memory
        $ umount /mnt
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
      	unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
        found 131073 bytes used err is 1
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124305
        file data blocks allocated: 262144
         referenced 262144
        btrfs-progs v4.2.3
      
      So fix this by adding the directory index entries as the very last
      step of symlink creation.
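
       Schematically (a sketch of the new ordering, not the literal code):

         /*
          * btrfs_symlink(), after the fix:
          *
          *   1. create the symlink inode
          *   2. write the inline extent holding the target path
          *   3. update the inode item (size, timestamps)
          *   4. add the directory index entries in the parent  -- last step
          *
          * If any of steps 1-3 fails, the final iput() can fully clean up,
          * because the parent directory does not yet reference the inode.
          */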
       Signed-off-by: Filipe Manana <fdmanana@suse.com>
      d50866d0