1. 13 May 2016, 10 commits
    • Btrfs: pin logs earlier when doing a rename exchange operation · 376e5a57
      Committed by Filipe Manana
      The btrfs_rename_exchange() started as a copy-paste from btrfs_rename(),
      which had a race fixed by my previous patch titled "Btrfs: pin log earlier
      when renaming", and so it suffers from the same problem.
      
      We pin the logs of the affected roots after we insert the new inode
      references, leaving a time window where concurrent tasks logging the
      inodes can end up logging both the new and old references, resulting
      in log trees that when replayed can turn the metadata into inconsistent
      states. This behaviour was added to btrfs_rename() in 2009 without any
      explanation of why the logs were not pinned earlier, only a comment
      noting the possibility of the race. As of today it's perfectly
      safe and sane to pin the logs before we start doing any of the steps
      involved in the rename operation.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: unpin logs if rename exchange operation fails · 86e8aa0e
      Committed by Filipe Manana
      If rename exchange operations fail at some point after we pinned any of
      the logs, we end up aborting the current transaction but never unpin the
      logs, which leaves concurrent tasks that are trying to sync the logs (as
      part of an fsync request from user space) blocked forever, preventing
      the filesystem from being unmounted.
      
      Fix this by safely unpinning the log.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix inode leak on failure to setup whiteout inode in rename · c9901618
      Committed by Filipe Manana
      If we failed to fully set up the whiteout inode during a rename operation
      with the whiteout flag, we ended up leaking the inode, neither decrementing
      its link count nor removing all its items from the fs/subvol tree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT · cdd1fedf
      Committed by Dan Fuhry
      Two new flags, RENAME_EXCHANGE and RENAME_WHITEOUT, provide for new
      behavior in the renameat2() syscall. This behavior is primarily used by
      overlayfs. This patch adds support for these flags to btrfs, enabling it to
      be used as a fully functional upper layer for overlayfs.
      
      RENAME_EXCHANGE support was written by Davide Italiano and originally
      submitted on 2 April 2015.
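      
      For reference, here is a minimal userspace sketch (not part of the patch)
      showing how the two flags are passed to the renameat2() system call; the
      paths are made up for illustration, and the raw syscall is used since
      older glibc versions lack a renameat2() wrapper:
      
        /* rename2-demo.c - illustrative only */
        #include <fcntl.h>        /* AT_FDCWD */
        #include <stdio.h>        /* perror */
        #include <sys/syscall.h>  /* SYS_renameat2 */
        #include <unistd.h>       /* syscall */
        #include <linux/fs.h>     /* RENAME_EXCHANGE, RENAME_WHITEOUT */
      
        int main(void)
        {
                /* Atomically swap the two names (both must already exist). */
                if (syscall(SYS_renameat2, AT_FDCWD, "a", AT_FDCWD, "b",
                            RENAME_EXCHANGE) == -1)
                        perror("RENAME_EXCHANGE");
      
                /* Rename "c" to "d" and leave a whiteout where "c" was,
                 * which is what overlayfs relies on in its upper layer. */
                if (syscall(SYS_renameat2, AT_FDCWD, "c", AT_FDCWD, "d",
                            RENAME_WHITEOUT) == -1)
                        perror("RENAME_WHITEOUT");
                return 0;
        }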
      Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
      Signed-off-by: Dan Fuhry <dfuhry@datto.com>
      [ remove unlikely ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: pin log earlier when renaming · c4aba954
      Committed by Filipe Manana
      We were pinning the log right after the first step in the rename operation
      (inserting inode ref for the new name in the destination directory)
      instead of doing it before. This behaviour was introduced in 2009 for a
      reason that was mentioned neither in the changelog nor in any comment,
      with the drawback of a small time window where concurrent log writers can
      end up logging the new inode reference for the inode we are renaming while
      the rename operation is in progress (so that we can end up with a log
      containing both the new and old references). As of today there's no reason
      to not pin the log before that first step anymore, so just fix this.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: unpin log if rename operation fails · 3dc9e8f7
      Committed by Filipe Manana
      If rename operations fail at some point after we pinned the log, we end
      up aborting the current transaction but never unpin the log, which leaves
      concurrent tasks that are trying to sync the log (as part of an fsync
      request from user space) blocked forever, preventing the filesystem
      from being unmounted.
      
      Fix this by safely unpinning the log.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: don't do unnecessary delalloc flushes when relocating · 9cfa3e34
      Committed by Filipe Manana
      Before we start the actual relocation process of a block group, we do
      calls to flush delalloc of all inodes and then wait for ordered extents
      to complete. However we do these flush calls just to make sure we don't
      race with concurrent tasks that have actually already started to run
      delalloc and have allocated an extent from the block group we want to
      relocate, right before we set it to readonly mode, but have not yet
      created the respective ordered extents. The flush calls make us wait
      for such concurrent tasks because they end up calling
      filemap_fdatawrite_range() (through btrfs_start_delalloc_roots() ->
      __start_delalloc_inodes() -> btrfs_alloc_delalloc_work() ->
      btrfs_run_delalloc_work()) which ends up serializing us with those tasks
      due to attempts to lock the same pages (and the delalloc flush procedure
      calls the allocator and creates the ordered extents before unlocking the
      pages).
      
      These flushing calls not only make us waste time (cpu, IO) but also reduce
      the chances of writing larger extents (applications might be writing to
      contiguous ranges and we flush before they finish dirtying the whole
      ranges).
      
      So make sure we don't flush delalloc and just wait for concurrent tasks
      that have already started flushing delalloc and have allocated an extent
      from the block group we are about to relocate.
      
      This change also ends up fixing a race with direct IO writes that makes
      relocation not wait for direct IO ordered extents. This race is
      illustrated by the following diagram:
      
              CPU 1                                       CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                 starts direct IO write,
                                                 target inode currently has no
                                                 ordered extents ongoing nor
                                                 dirty pages (delalloc regions),
                                                 therefore the root for our inode
                                                 is not in the list
                                                 fs_info->ordered_roots
      
                                                 btrfs_direct_IO()
                                                   __blockdev_direct_IO()
                                                     btrfs_get_blocks_direct()
                                                       btrfs_lock_extent_direct()
                                                         locks range in the io tree
                                                       btrfs_new_extent_direct()
                                                         btrfs_reserve_extent()
                                                           --> extent allocated
                                                               from bg X
      
         btrfs_inc_block_group_ro(bg X)
      
         btrfs_start_delalloc_roots()
           __start_delalloc_inodes()
             --> does nothing, no delalloc ranges
                 in the inode's io tree so the
                 inode's root is not in the list
                 fs_info->delalloc_roots
      
         btrfs_wait_ordered_roots()
           --> does not find the inode's root in the
               list fs_info->ordered_roots
      
           --> ends up not waiting for the direct IO
               write started by the task at CPU 2
      
         relocate_block_group(rc->stage ==
           MOVE_DATA_EXTENTS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
      
           iterates the extent tree, using its
           commit root and moves extents into new
           locations
      
                                                         btrfs_add_ordered_extent_dio()
                                                           --> now an ordered extent is
                                                               created and added to the
                                                               list root->ordered_extents
                                                               and the root added to the
                                                               list fs_info->ordered_roots
                                                           --> this is too late and the
                                                               task at CPU 1 already
                                                               started the relocation
      
           btrfs_commit_transaction()
      
                                                         btrfs_finish_ordered_io()
                                                           btrfs_alloc_reserved_file_extent()
                                                             --> adds delayed data reference
                                                                 for the extent allocated
                                                                 from bg X
      
         relocate_block_group(rc->stage ==
           UPDATE_DATA_PTRS)
      
           prepare_to_relocate()
             btrfs_commit_transaction()
               --> delayed refs are run, so an extent
                   item for the allocated extent from
                   bg X is added to extent tree
               --> commit roots are switched, so the
                   next scan in the extent tree will
                   see the extent item
      
           sees the extent in the extent tree
      
      When this happens the relocation produces the following warning when it
      finishes:
      
      [ 7260.832836] ------------[ cut here ]------------
      [ 7260.834653] WARNING: CPU: 5 PID: 6765 at fs/btrfs/relocation.c:4318 btrfs_relocate_block_group+0x245/0x2a1 [btrfs]()
      [ 7260.838268] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7260.850935] CPU: 5 PID: 6765 Comm: btrfs Not tainted 4.5.0-rc6-btrfs-next-28+ #1
      [ 7260.852998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7260.852998]  0000000000000000 ffff88020bf57bc0 ffffffff812648b3 0000000000000000
      [ 7260.852998]  0000000000000009 ffff88020bf57bf8 ffffffff81051608 ffffffffa03c1b2d
      [ 7260.852998]  ffff8800b2bbb800 0000000000000000 ffff8800b17bcc58 ffff8800399dd000
      [ 7260.852998] Call Trace:
      [ 7260.852998]  [<ffffffff812648b3>] dump_stack+0x67/0x90
      [ 7260.852998]  [<ffffffff81051608>] warn_slowpath_common+0x99/0xb2
      [ 7260.852998]  [<ffffffffa03c1b2d>] ? btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffff810516d4>] warn_slowpath_null+0x1a/0x1c
      [ 7260.852998]  [<ffffffffa03c1b2d>] btrfs_relocate_block_group+0x245/0x2a1 [btrfs]
      [ 7260.852998]  [<ffffffffa039d9de>] btrfs_relocate_chunk.isra.29+0x66/0xdb [btrfs]
      [ 7260.852998]  [<ffffffffa039f314>] btrfs_balance+0xde1/0xe4e [btrfs]
      [ 7260.852998]  [<ffffffff8127d671>] ? debug_smp_processor_id+0x17/0x19
      [ 7260.852998]  [<ffffffffa03a9583>] btrfs_ioctl_balance+0x255/0x2d3 [btrfs]
      [ 7260.852998]  [<ffffffffa03ac96a>] btrfs_ioctl+0x11e0/0x1dff [btrfs]
      [ 7260.852998]  [<ffffffff811451df>] ? handle_mm_fault+0x443/0xd63
      [ 7260.852998]  [<ffffffff81491817>] ? _raw_spin_unlock+0x31/0x44
      [ 7260.852998]  [<ffffffff8108b36a>] ? arch_local_irq_save+0x9/0xc
      [ 7260.852998]  [<ffffffff811876ab>] vfs_ioctl+0x18/0x34
      [ 7260.852998]  [<ffffffff81187cb2>] do_vfs_ioctl+0x550/0x5be
      [ 7260.852998]  [<ffffffff81190c30>] ? __fget_light+0x4d/0x71
      [ 7260.852998]  [<ffffffff81187d77>] SyS_ioctl+0x57/0x79
      [ 7260.852998]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7260.893268] ---[ end trace eb7803b24ebab8ad ]---
      
      This is because at the end of the first stage, in relocate_block_group(),
      we commit the current transaction, which makes delayed refs run, the
      commit roots are switched and so the second stage will find the extent
      item that the ordered extent added to the delayed refs. But this extent
      was not moved (ordered extent completed after first stage finished), so
      at the end of the relocation our block group item still has a positive
      used bytes counter, triggering a warning at the end of
      btrfs_relocate_block_group(). Later on when trying to read the extent
      contents from disk we hit a BUG_ON() due to the inability to map a block
      with a logical address that belongs to the block group we relocated and
      is no longer valid, resulting in the following trace:
      
      [ 7344.885290] BTRFS critical (device sdi): unable to find logical 12845056 len 4096
      [ 7344.887518] ------------[ cut here ]------------
      [ 7344.888431] kernel BUG at fs/btrfs/inode.c:1833!
      [ 7344.888431] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7344.888431] Modules linked in: btrfs crc32c_generic xor ppdev raid6_pq psmouse sg acpi_cpufreq evdev i2c_piix4 tpm_tis serio_raw tpm i2c_core pcspkr parport_pc
      [ 7344.888431] CPU: 0 PID: 6831 Comm: od Tainted: G        W       4.5.0-rc6-btrfs-next-28+ #1
      [ 7344.888431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7344.888431] task: ffff880215818600 ti: ffff880204684000 task.ti: ffff880204684000
      [ 7344.888431] RIP: 0010:[<ffffffffa037c88c>]  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431] RSP: 0018:ffff8802046878f0  EFLAGS: 00010282
      [ 7344.888431] RAX: 00000000ffffffea RBX: 0000000000001000 RCX: 0000000000000001
      [ 7344.888431] RDX: ffff88023ec0f950 RSI: ffffffff8183b638 RDI: 00000000ffffffff
      [ 7344.888431] RBP: ffff880204687908 R08: 0000000000000001 R09: 0000000000000000
      [ 7344.888431] R10: ffff880204687770 R11: ffffffff82f2d52d R12: 0000000000001000
      [ 7344.888431] R13: ffff88021afbfee8 R14: 0000000000006208 R15: ffff88006cd199b0
      [ 7344.888431] FS:  00007f1f9e1d6700(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
      [ 7344.888431] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7344.888431] CR2: 00007f1f9dc8cb60 CR3: 000000023e3b6000 CR4: 00000000000006f0
      [ 7344.888431] Stack:
      [ 7344.888431]  0000000000001000 0000000000001000 ffff880204687b98 ffff880204687950
      [ 7344.888431]  ffffffffa0395c8f ffffea0004d64d48 0000000000000000 0000000000001000
      [ 7344.888431]  ffffea0004d64d48 0000000000001000 0000000000000000 0000000000000000
      [ 7344.888431] Call Trace:
      [ 7344.888431]  [<ffffffffa0395c8f>] submit_extent_page+0xf5/0x16f [btrfs]
      [ 7344.888431]  [<ffffffffa03970ac>] __do_readpage+0x4a0/0x4f1 [btrfs]
      [ 7344.888431]  [<ffffffffa039680d>] ? btrfs_create_repair_bio+0xcb/0xcb [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8108df55>] ? trace_hardirqs_on+0xd/0xf
      [ 7344.888431]  [<ffffffffa039728c>] __do_contiguous_readpages.constprop.26+0xc2/0xe4 [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffffa039739b>] __extent_readpages.constprop.25+0xed/0x100 [btrfs]
      [ 7344.888431]  [<ffffffff81129d24>] ? lru_cache_add+0xe/0x10
      [ 7344.888431]  [<ffffffffa0397ea8>] extent_readpages+0x160/0x1aa [btrfs]
      [ 7344.888431]  [<ffffffffa037eeb4>] ? btrfs_writepage_start_hook+0xbc/0xbc [btrfs]
      [ 7344.888431]  [<ffffffff8115daad>] ? alloc_pages_current+0xa9/0xcd
      [ 7344.888431]  [<ffffffffa037cdc9>] btrfs_readpages+0x1f/0x21 [btrfs]
      [ 7344.888431]  [<ffffffff81128316>] __do_page_cache_readahead+0x168/0x1fc
      [ 7344.888431]  [<ffffffff811285a0>] ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff811285a0>] ? ondemand_readahead+0x1f6/0x207
      [ 7344.888431]  [<ffffffff8111cf34>] ? pagecache_get_page+0x2b/0x154
      [ 7344.888431]  [<ffffffff8112870e>] page_cache_sync_readahead+0x3d/0x3f
      [ 7344.888431]  [<ffffffff8111dbf7>] generic_file_read_iter+0x197/0x4e1
      [ 7344.888431]  [<ffffffff8117773a>] __vfs_read+0x79/0x9d
      [ 7344.888431]  [<ffffffff81178050>] vfs_read+0x8f/0xd2
      [ 7344.888431]  [<ffffffff81178a38>] SyS_read+0x50/0x7e
      [ 7344.888431]  [<ffffffff81492017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7344.888431] Code: 8d 4d e8 45 31 c9 45 31 c0 48 8b 00 48 c1 e2 09 48 8b 80 80 fc ff ff 4c 89 65 e8 48 8b b8 f0 01 00 00 e8 1d 42 02 00 85 c0 79 02 <0f> 0b 4c 0
      [ 7344.888431] RIP  [<ffffffffa037c88c>] btrfs_merge_bio_hook+0x54/0x6b [btrfs]
      [ 7344.888431]  RSP <ffff8802046878f0>
      [ 7344.970544] ---[ end trace eb7803b24ebab8ae ]---
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: don't wait for unrelated IO to finish before relocation · 578def7c
      Committed by Filipe Manana
      Before the relocation process of a block group starts, it sets the block
      group to readonly mode, then flushes all delalloc writes and then finally
      it waits for all ordered extents to complete. This last step includes
      waiting for ordered extents destined for extents allocated in other block
      groups, making us waste time unnecessarily.
      
      So improve this by waiting only for ordered extents that fall into the
      block group's range.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix empty symlink after creating symlink and fsync parent dir · 3f9749f6
      Committed by Filipe Manana
      If we create a symlink, fsync its parent directory, crash/power fail and
      mount the filesystem, we end up with an empty symlink, which is not only
      useless but also not allowed in linux (the symlink(2) man page is
      explicit about that).  So we just need to make sure to fully log an inode
      if it's a symlink, to ensure its inline extent gets logged, ensuring the
      same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.
      
      Example reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/testdir
        $ sync
        $ ln -s /mnt/foo /mnt/testdir/bar
        $ xfs_io -c fsync /mnt/testdir
        <power fail>
        $ mount /dev/sdb /mnt
        $ readlink /mnt/testdir/bar
        <empty string>
      
      A test case for fstests follows soon.
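      
      For reference, here is the same reproducer expressed in C (a sketch for
      illustration only; the paths match the shell example above, and the
      directory fsync is what the xfs_io command above performs):
      
        /* symlink-fsync-demo.c - illustrative only */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(void)
        {
                int dirfd;
      
                if (symlink("/mnt/foo", "/mnt/testdir/bar") == -1) {
                        perror("symlink");
                        return 1;
                }
                /* fsync the parent directory, like "xfs_io -c fsync" above. */
                dirfd = open("/mnt/testdir", O_RDONLY | O_DIRECTORY);
                if (dirfd == -1 || fsync(dirfd) == -1) {
                        perror("fsync parent directory");
                        return 1;
                }
                close(dirfd);
                return 0;   /* a power failure here used to leave an empty symlink */
        }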
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix for incorrect directory entries after fsync log replay · 657ed1aa
      Committed by Filipe Manana
      If we move a directory to a new parent and later log that parent and don't
      explicitly log the old parent, when we replay the log we can end up with
      entries for the moved directory in both the old and new parent directories.
      Besides it being illegal to have directories with multiple hard links in linux,
      this also resulted in leaving the inode item with a link count of 1.
      A similar issue also happens if we move a regular file - after the log tree
      is replayed the file has a link in both the old and new parent directories,
      when it should exist only in the new directory.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/x
        $ mkdir /mnt/y
        $ touch /mnt/x/foo
        $ mkdir /mnt/y/z
        $ sync
        $ ln /mnt/x/foo /mnt/x/bar
        $ mv /mnt/y/z /mnt/x/z
        < power fail >
        $ mount /dev/sdc /mnt
        $ ls -1Ri /mnt
        /mnt:
        257 x
        258 y
      
        /mnt/x:
        259 bar
        259 foo
        260 z
      
        /mnt/x/z:
      
        /mnt/y:
        260 z
      
        /mnt/y/z:
      
        $ umount /dev/sdc
        $ btrfs check /dev/sdc
        Checking filesystem on /dev/sdc
        UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 260 errors 2000, link count wrong
              unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
              unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
        (...)
      
      Attempting to remove the directory becomes impossible:
      
        $ mount /dev/sdc /mnt
        $ rmdir /mnt/y/z
        $ ls -lh /mnt/y
        ls: cannot access /mnt/y/z: No such file or directory
        total 0
        d????????? ? ? ? ?            ? z
        $ rmdir /mnt/x/z
        rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
        $ ls -lh /mnt/x
        ls: cannot access /mnt/x/z: Stale file handle
        total 0
        -rw-r--r-- 2 root root 0 Apr  6 18:06 bar
        -rw-r--r-- 2 root root 0 Apr  6 18:06 foo
        d????????? ? ?    ?    ?            ? z
      
      So make sure that on rename we set the last_unlink_trans value for our
      inode, even if it's a directory, to the value of the current transaction's
      ID, and that if the new parent directory is logged we fall back to a
      transaction commit.
      
      A test case for fstests is being submitted as well.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  2. 07 April 2016, 1 commit
    • Btrfs: fix file/data loss caused by fsync after rename and new inode · 56f23fdb
      Committed by Filipe Manana
      If we rename an inode A (be it a file or a directory), create a new
      inode B with the old name of inode A and under the same parent directory,
      fsync inode B and then power fail, at log tree replay time we end up
      removing inode A completely. If inode A is a directory then all its files
      are gone too.
      
      Example scenarios where this happens:
      This is reproducible with the following steps, taken from a couple of
      test cases written for fstests which are going to be submitted upstream
      soon:
      
         # Scenario 1
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir -p /mnt/a/x
         echo "hello" > /mnt/a/x/foo
         echo "world" > /mnt/a/x/bar
         sync
         mv /mnt/a/x /mnt/a/y
         mkdir /mnt/a/x
         xfs_io -c fsync /mnt/a/x
         <power failure happens>
      
         The next time the fs is mounted, log tree replay happens and
         the directory "y" does not exist nor do the files "foo" and
         "bar" exist anywhere (neither in "y" nor in "x", nor the root
         nor anywhere).
      
         # Scenario 2
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir /mnt/a
         echo "hello" > /mnt/a/foo
         sync
         mv /mnt/a/foo /mnt/a/bar
         echo "world" > /mnt/a/foo
         xfs_io -c fsync /mnt/a/foo
         <power failure happens>
      
         The next time the fs is mounted, log tree replay happens and the
         file "bar" does not exist anymore. A file with the name "foo"
         exists and it matches the second file we created.
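      
      The second scenario can also be triggered from a short C program; the
      following is a sketch for illustration only, with paths and file content
      mirroring the shell steps above:
      
        /* rename-fsync-demo.c - illustrative only */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>
      
        int main(void)
        {
                int fd;
      
                if (rename("/mnt/a/foo", "/mnt/a/bar") == -1) {
                        perror("rename");
                        return 1;
                }
                /* Create a new inode under the old name and fsync it. */
                fd = open("/mnt/a/foo", O_CREAT | O_WRONLY, 0644);
                if (fd == -1) {
                        perror("open");
                        return 1;
                }
                if (write(fd, "world\n", 6) != 6)
                        perror("write");
                if (fsync(fd) == -1)
                        perror("fsync");
                close(fd);
                return 0;   /* a power failure after this used to lose "bar" */
        }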
      
      Another related problem that does not involve file/data loss is when a
      new inode is created with the name of a deleted snapshot and we fsync it:
      
         mkfs.btrfs -f /dev/sdc
         mount /dev/sdc /mnt
         mkdir /mnt/testdir
         btrfs subvolume snapshot /mnt /mnt/testdir/snap
         btrfs subvolume delete /mnt/testdir/snap
         rmdir /mnt/testdir
         mkdir /mnt/testdir
         xfs_io -c fsync /mnt/testdir # or fsync some file inside /mnt/testdir
         <power failure>
      
         The next time the fs is mounted the log replay procedure fails because
         it attempts to delete the snapshot entry (which has dir item key type
         of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
         resulting in the following error that causes mount to fail:
      
         [52174.510532] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
         [52174.512570] ------------[ cut here ]------------
         [52174.513278] WARNING: CPU: 12 PID: 28024 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x178/0x351 [btrfs]()
         [52174.514681] BTRFS: Transaction aborted (error -2)
         [52174.515630] Modules linked in: btrfs dm_flakey dm_mod overlay crc32c_generic ppdev xor raid6_pq acpi_cpufreq parport_pc tpm_tis sg parport tpm evdev i2c_piix4 proc
         [52174.521568] CPU: 12 PID: 28024 Comm: mount Tainted: G        W       4.5.0-rc6-btrfs-next-27+ #1
         [52174.522805] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
         [52174.524053]  0000000000000000 ffff8801df2a7710 ffffffff81264e93 ffff8801df2a7758
         [52174.524053]  0000000000000009 ffff8801df2a7748 ffffffff81051618 ffffffffa03591cd
         [52174.524053]  00000000fffffffe ffff88015e6e5000 ffff88016dbc3c88 ffff88016dbc3c88
         [52174.524053] Call Trace:
         [52174.524053]  [<ffffffff81264e93>] dump_stack+0x67/0x90
         [52174.524053]  [<ffffffff81051618>] warn_slowpath_common+0x99/0xb2
         [52174.524053]  [<ffffffffa03591cd>] ? __btrfs_unlink_inode+0x178/0x351 [btrfs]
         [52174.524053]  [<ffffffff81051679>] warn_slowpath_fmt+0x48/0x50
         [52174.524053]  [<ffffffffa03591cd>] __btrfs_unlink_inode+0x178/0x351 [btrfs]
         [52174.524053]  [<ffffffff8118f5e9>] ? iput+0xb0/0x284
         [52174.524053]  [<ffffffffa0359fe8>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
         [52174.524053]  [<ffffffffa038631e>] check_item_in_log+0x1fe/0x29b [btrfs]
         [52174.524053]  [<ffffffffa0386522>] replay_dir_deletes+0x167/0x1cf [btrfs]
         [52174.524053]  [<ffffffffa038739e>] fixup_inode_link_count+0x289/0x2aa [btrfs]
         [52174.524053]  [<ffffffffa038748a>] fixup_inode_link_counts+0xcb/0x105 [btrfs]
         [52174.524053]  [<ffffffffa038a5ec>] btrfs_recover_log_trees+0x258/0x32c [btrfs]
         [52174.524053]  [<ffffffffa03885b2>] ? replay_one_extent+0x511/0x511 [btrfs]
         [52174.524053]  [<ffffffffa034f288>] open_ctree+0x1dd4/0x21b9 [btrfs]
         [52174.524053]  [<ffffffffa032b753>] btrfs_mount+0x97e/0xaed [btrfs]
         [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
         [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
         [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
         [52174.524053]  [<ffffffffa032af81>] btrfs_mount+0x1ac/0xaed [btrfs]
         [52174.524053]  [<ffffffff8108e1b7>] ? trace_hardirqs_on+0xd/0xf
         [52174.524053]  [<ffffffff8108c262>] ? lockdep_init_map+0xb9/0x1b3
         [52174.524053]  [<ffffffff8117bafa>] mount_fs+0x67/0x131
         [52174.524053]  [<ffffffff81193003>] vfs_kern_mount+0x6c/0xde
         [52174.524053]  [<ffffffff8119590f>] do_mount+0x8a6/0x9e8
         [52174.524053]  [<ffffffff811358dd>] ? strndup_user+0x3f/0x59
         [52174.524053]  [<ffffffff81195c65>] SyS_mount+0x77/0x9f
         [52174.524053]  [<ffffffff814935d7>] entry_SYSCALL_64_fastpath+0x12/0x6b
         [52174.561288] ---[ end trace 6b53049efb1a3ea6 ]---
      
      Fix this by forcing a transaction commit when such cases happen.
      This means we check in the commit root of the subvolume tree if there
      was any other inode with the same reference when the inode we are
      fsync'ing is a new inode (created in the current transaction).
      
      Test cases covering all the scenarios given above were submitted
      upstream for fstests:
      
        * fstests: generic test for fsync after renaming directory
          https://patchwork.kernel.org/patch/8694281/
      
        * fstests: generic test for fsync after renaming file
          https://patchwork.kernel.org/patch/8694301/
      
        * fstests: add btrfs test for fsync after snapshot deletion
          https://patchwork.kernel.org/patch/8670671/
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  3. 05 April 2016, 2 commits
    • mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage · ea1754a0
      Committed by Kirill A. Shutemov
      Mostly direct substitution with occasional adjustment or removing
      outdated comments.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 04 April 2016, 8 commits
  5. 31 March 2016, 1 commit
    • btrfs: fix crash/invalid memory access on fsync when using overlayfs · de17e793
      Committed by Filipe Manana
      If the lower or upper directory of an overlayfs mount belongs to a btrfs
      file system and we fsync a file through the overlayfs' merged directory,
      we ended up accessing an inode that didn't belong to btrfs as if it were
      a btrfs inode, at btrfs_sync_file(), resulting in a crash like the following:
      
      [ 7782.588845] BUG: unable to handle kernel NULL pointer dereference at 0000000000000544
      [ 7782.590624] IP: [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs]
      [ 7782.591931] PGD 4d954067 PUD 1e878067 PMD 0
      [ 7782.592016] Oops: 0002 [#6] PREEMPT SMP DEBUG_PAGEALLOC
      [ 7782.592016] Modules linked in: btrfs overlay ppdev crc32c_generic evdev xor raid6_pq psmouse pcspkr sg serio_raw acpi_cpufreq parport_pc parport tpm_tis i2c_piix4 tpm i2c_core processor button loop autofs4 ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [ 7782.592016] CPU: 10 PID: 16437 Comm: xfs_io Tainted: G      D         4.5.0-rc6-btrfs-next-26+ #1
      [ 7782.592016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 7782.592016] task: ffff88001b8d40c0 ti: ffff880137488000 task.ti: ffff880137488000
      [ 7782.592016] RIP: 0010:[<ffffffffa030b7ab>]  [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs]
      [ 7782.592016] RSP: 0018:ffff88013748be40  EFLAGS: 00010286
      [ 7782.592016] RAX: 0000000080000000 RBX: ffff880133b30c88 RCX: 0000000000000001
      [ 7782.592016] RDX: 0000000000000001 RSI: ffffffff8148fec0 RDI: 00000000ffffffff
      [ 7782.592016] RBP: ffff88013748bec0 R08: 0000000000000001 R09: 0000000000000000
      [ 7782.624248] R10: ffff88013748be40 R11: 0000000000000246 R12: 0000000000000000
      [ 7782.624248] R13: 0000000000000000 R14: 00000000009305a0 R15: ffff880015e3be40
      [ 7782.624248] FS:  00007fa83b9cb700(0000) GS:ffff88023ed40000(0000) knlGS:0000000000000000
      [ 7782.624248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 7782.624248] CR2: 0000000000000544 CR3: 00000001fa652000 CR4: 00000000000006e0
      [ 7782.624248] Stack:
      [ 7782.624248]  ffffffff8108b5cc ffff88013748bec0 0000000000000246 ffff8800b005ded0
      [ 7782.624248]  ffff880133b30d60 8000000000000000 7fffffffffffffff 0000000000000246
      [ 7782.624248]  0000000000000246 ffffffff81074f9b ffffffff8104357c ffff880015e3be40
      [ 7782.624248] Call Trace:
      [ 7782.624248]  [<ffffffff8108b5cc>] ? arch_local_irq_save+0x9/0xc
      [ 7782.624248]  [<ffffffff81074f9b>] ? ___might_sleep+0xce/0x217
      [ 7782.624248]  [<ffffffff8104357c>] ? __do_page_fault+0x3c0/0x43a
      [ 7782.624248]  [<ffffffff811a2351>] vfs_fsync_range+0x8c/0x9e
      [ 7782.624248]  [<ffffffff811a237f>] vfs_fsync+0x1c/0x1e
      [ 7782.624248]  [<ffffffff811a24d6>] do_fsync+0x31/0x4a
      [ 7782.624248]  [<ffffffff811a2700>] SyS_fsync+0x10/0x14
      [ 7782.624248]  [<ffffffff81493617>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [ 7782.624248] Code: 85 c0 0f 85 e2 02 00 00 48 8b 45 b0 31 f6 4c 29 e8 48 ff c0 48 89 45 a8 48 8d 83 d8 00 00 00 48 89 c7 48 89 45 a0 e8 fc 43 18 e1 <f0> 41 ff 84 24 44 05 00 00 48 8b 83 58 ff ff ff 48 c1 e8 07 83
      [ 7782.624248] RIP  [<ffffffffa030b7ab>] btrfs_sync_file+0x11b/0x3e9 [btrfs]
      [ 7782.624248]  RSP <ffff88013748be40>
      [ 7782.624248] CR2: 0000000000000544
      [ 7782.661994] ---[ end trace 721e14960eb939bc ]---
      
      This started happening with commit 4bacc9c9 (overlayfs: Make f_path
      always point to the overlay and f_inode to the underlay) and, even though
      after this change we could still access the btrfs inode through
      struct file->f_mapping->host or struct file->f_inode, we would still run
      into similar issues later on at check_parent_dirs_for_sync(),
      because the dentry we got (from struct file->f_path.dentry) was from
      overlayfs and not from btrfs, that is, we had no way of getting the dentry
      that belonged to btrfs (we always got the dentry that belonged to
      overlayfs).
      
      The new patch from Miklos Szeredi, titled "vfs: add file_dentry()" and
      recently submitted to linux-fsdevel, adds a file_dentry() API that allows
      us to get the btrfs dentry from the input file, and therefore to fsync
      when the upper and lower directories belong to btrfs filesystems.
      
      This issue has been reported several times by users in the mailing list
      and bugzilla. A test case for xfstests is being submitted as well.
      
      Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=101951
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109791
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org
  6. 22 March 2016, 4 commits
  7. 21 March 2016, 1 commit
    • btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums · 389f239c
      Committed by Chris Mason
      Commit c40a3d38 (Btrfs: Compute and look up csums based on
      sectorsized blocks) changed how we walk the bios while looking up
      crcs.  There's an inner loop that is jumping to the next bvec based on
      sectors and before it derefs the next bvec, it needs to make sure we're
      still in the bio.
      
      In this case, the outer loop would have decided to stop moving forward
      too, and the bvec deref is never actually used for anything.  But
      CONFIG_DEBUG_PAGEALLOC catches it because we're outside our bio.
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
  8. 18 March 2016, 1 commit
  9. 14 March 2016, 2 commits
  10. 12 March 2016, 4 commits
  11. 11 March 2016, 1 commit
  12. 04 March 2016, 1 commit
    • Btrfs: fix loading of orphan roots leading to BUG_ON · 909c3a22
      Committed by Filipe Manana
      When looking for orphan roots during mount we can end up hitting a
      BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is
      replayed and qgroups are enabled. This is because after a log tree is
      replayed, a transaction commit is made, which triggers qgroup extent
      accounting which in turn does backref walking which ends up reading and
      inserting all roots in the radix tree fs_info->fs_root_radix, including
      orphan roots (deleted snapshots). So after the log tree is replayed, when
      finding orphan roots we hit the BUG_ON with the following trace:
      
      [118209.182438] ------------[ cut here ]------------
      [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
      [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
      processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
      virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G        W       4.5.0-rc5-btrfs-next-24+ #1
      [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
      [118209.186318] RIP: 0010:[<ffffffffa04237d7>]  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
      [118209.186318] RSP: 0018:ffff8800af34faa8  EFLAGS: 00010246
      [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
      [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
      [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
      [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
      [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
      [118209.186318] FS:  00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [118209.186318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
      [118209.186318] Stack:
      [118209.186318]  fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
      [118209.186318]  fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
      [118209.186318]  0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
      [118209.186318] Call Trace:
      [118209.186318]  [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
      [118209.186318]  [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
      [118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
      [118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
      [118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
      [118209.186318]  [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
      [118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
      [118209.186318]  [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
      [118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
      [118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
      [118209.186318]  [<ffffffff81195637>] do_mount+0x8a6/0x9e8
      [118209.186318]  [<ffffffff8119598d>] SyS_mount+0x77/0x9f
      [118209.186318]  [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b
      4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
      [118209.186318] RIP  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
      [118209.186318]  RSP <ffff8800af34faa8>
      [118209.230735] ---[ end trace 83938f987d85d477 ]---
      
      So fix this by not treating the error -EEXIST, returned when attempting
      to insert a root already inserted by the backref walking code, as an error.
      
      The following test case for xfstests reproduces the bug:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        _run_btrfs_util_prog quota enable $SCRATCH_MNT
      
        # Create 2 directories with one file in one of them.
        # We use these just to trigger a transaction commit later, moving the file from
        # directory a to directory b and doing an fsync against directory a.
        mkdir $SCRATCH_MNT/a
        mkdir $SCRATCH_MNT/b
        touch $SCRATCH_MNT/a/f
        sync
      
        # Create our test file with 2 4K extents.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Create a snapshot and delete it. This doesn't really delete the snapshot
        # immediately, just makes it inaccessible and invisible to user space, the
        # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
        # which is woken up at the next transaction commit.
        # A root orphan item is inserted into the tree of tree roots, so that if a
        # power failure happens before the dedicated kernel thread does the snapshot
        # deletion, the next time the filesystem is mounted it resumes the snapshot
        # deletion.
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap
      
        # Now overwrite half of the extents we wrote before. Because we made a snapshot
        # before, which isn't really deleted yet (since no transaction commit happened
        # after we did the snapshot delete request), the non overwritten extents get
        # referenced twice, once by the default subvolume and once by the snapshot.
        $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Now move file f from directory a to directory b and fsync directory a.
        # The fsync on the directory a triggers a transaction commit (because a file
        # was moved from it to another directory) and the file fsync leaves a log tree
        # with file extent items to replay.
        mv $SCRATCH_MNT/a/f $SCRATCH_MNT/b/
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        # Now simulate a power failure and mount the filesystem to replay the log tree.
        # After the log tree was replayed, we used to hit a BUG_ON() when processing
        # the root orphan item for the deleted snapshot. This is because when processing
        # an orphan root the code expected to be the first to insert the root into
        # the fs_info->fs_root_radix radix tree, while in reality it was the second
        # caller attempting to do it - the first caller was the transaction commit that
        # took place after replaying the log tree, when updating the qgroup counters.
        _flakey_drop_and_remount
      
        echo "File digest after power failure:"
        # Must match what we got before the power failure.
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        _unmount_flakey
        status=0
        exit
      
      Fixes: 2d9e9776 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
      Cc: stable@vger.kernel.org  # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  13. 02 March 2016, 4 commits
    • Btrfs: do not collect ordered extents when logging that inode exists · 5e33a2bd
      Committed by Filipe Manana
      When logging that an inode exists, for example as part of a directory
      fsync operation, we were collecting any ordered extents for the inode but
      we ended up doing nothing with them except tagging them as processed, by
      setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
      subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
      collecting and processing them. This created a time window where a second
      fsync against the inode, using the fast path, ended up not logging the
      checksums for the new extents but it logged the extents since they were
      part of the list of modified extents. This happened because the ordered
      extents were not collected and checksums were not yet added to the csum
      tree - the ordered extents have not gone through btrfs_finish_ordered_io()
      yet (which is where we add them to the csum tree by calling
      inode.c:add_pending_csums()).
      
      So fix this by not collecting an inode's ordered extents if we are logging
      it with the LOG_INODE_EXISTS mode.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix race when checking if we can skip fsync'ing an inode · affc0ff9
      Committed by Filipe Manana
      If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
      returns false, it's possible that we had an ordered extent in progress
      (btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
      last_trans field was not greater than the id of the last committed
      transaction, but shortly after, before we checked if there were any
      ongoing ordered extents, the ordered extent had just completed and
      removed itself from the inode's ordered tree, in which case we end up not
      logging the inode, losing some data if a power failure or crash happens
      after the fsync handler returns and before the transaction is committed.
      
      Fix this by checking first if there are any ongoing ordered extents
      before comparing the inode's last_trans with the id of the last committed
      transaction - when it completes, an ordered extent always updates the
      inode's last_trans before it removes itself from the inode's ordered
      tree (at btrfs_finish_ordered_io()).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix listxattrs not listing all xattrs packed in the same item · daac7ba6
      Committed by Filipe Manana
      In the listxattrs handler, we were not listing all the xattrs that are
      packed in the same btree item, which happens when multiple xattrs have
      names that produce the same checksum value when crc32c hashed.
      
      Fix this by processing them all.
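      
      For reference, here is a minimal sketch (not part of the patch) of how
      user space walks the result of the listxattr(2) call affected here; the
      file path comes from the command line:
      
        /* listxattr-demo.c - illustrative only */
        #include <stdio.h>
        #include <string.h>
        #include <sys/xattr.h>
      
        int main(int argc, char **argv)
        {
                static char list[64 * 1024];
                ssize_t len, off;
      
                if (argc != 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }
                len = listxattr(argv[1], list, sizeof(list));
                if (len == -1) {
                        perror("listxattr");
                        return 1;
                }
                /* The buffer holds the attribute names back to back, each
                 * NUL-terminated; before the fix, btrfs could omit names that
                 * share the same crc32c hash with a listed one. */
                for (off = 0; off < len; off += strlen(list + off) + 1)
                        printf("%s\n", list + off);
                return 0;
        }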
      
      The following test case for xfstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/attr
      
        # real QA test starts here
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_attrs
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        # Create our test file with a few xattrs. The first 3 xattrs have names
        # that, when given as input to a crc32c function, result in the same checksum.
        # This made btrfs list only one of the xattrs through the listxattr system call
        # (because it packs xattrs with the same name checksum into the same btree
        # item).
        touch $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile
      
        # Now call getfattr with --dump, which calls the listxattrs system call.
        # It should list all the xattrs we have set before.
        $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch
      
        status=0
        exit
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix deadlock between direct IO reads and buffered writes · ade77029
      Committed by Filipe Manana
      While running a test with a mix of buffered IO and direct IO against
      the same files I hit a deadlock reported by the following trace:
      
      [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
      [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
      [11642.149771]     0 15282      2 0x00000000
      [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
      [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
      [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
      [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
      [11642.161403] Call Trace:
      [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
      [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
      [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
      [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
      [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.226295] 2 locks held by kworker/u32:3/15282:
      [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
      [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
      [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
      [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
      [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
      [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
      [11642.243715] Call Trace:
      [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
      [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
      [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
      [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
      [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
      [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
      [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.282604] 3 locks held by kworker/u32:8/15289:
      [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
      [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
      [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
      [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
      [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
      [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
      [11642.298433] Call Trace:
      [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
      [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
      [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
      [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
      [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
      [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
      [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
      [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
      [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.310403] 3 locks held by fdm-stress/26848:
      [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
      [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
      [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
      [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
      [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
      [11642.325096] Call Trace:
      [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
      [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
      [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
      [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
      [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
      [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
      [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
      [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
      [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
      [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
      [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
      [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
      [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
      [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
      [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
      [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
      [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
      [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
      [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
      [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.350222] 4 locks held by fdm-stress/26849:
      [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
      [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
      [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
      [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
      [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
      [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
      [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
      [11642.373430] Call Trace:
      [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
      [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
      [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
      [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
      [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
      [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
      [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
      [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
      [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
      [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
      [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
      [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
      [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
      [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
      [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
      [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
      [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.416572] 1 lock held by fdm-stress/26850:
      [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
      [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
      [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
      [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
      [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
      [11642.426591] Call Trace:
      [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
      [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
      [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
      [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
      [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
      [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
      [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
      [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
      [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
      [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.441533] 3 locks held by fdm-stress/26851:
      [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
      [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
      [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      
       This happened because, under specific timings, the direct IO read path
       can deadlock with concurrent buffered writes. The diagram below shows how
       this happens for an example file with the following layout:
      
           [  extent A  ]  [  extent B  ]  [ ....
           0K              4K              8K
      
           CPU 1                                               CPU 2                             CPU 3
      
      DIO read against range
       [0K, 8K[ starts
      
      btrfs_direct_IO()
        --> calls btrfs_get_blocks_direct()
            which finds the extent map for the
            extent A and leaves the range
            [0K, 4K[ locked in the inode's
            io tree
      
                                                         buffered write against
                                                         range [4K, 8K[ starts
      
                                                         __btrfs_buffered_write()
                                                           --> dirties page at 4K
      
                                                                                           a user space
                                                                                           task calls sync
                                                                                            for example, or
                                                                                           writepages() is
                                                                                           invoked by mm
      
                                                                                           writepages()
                                                                                             run_delalloc_range()
                                                                                               cow_file_range()
                                                                                                 --> ordered extent X
                                                                                                     for the buffered
                                                                                                     write is created
                                                                                                     and
                                                                                                     writeback starts
      
        --> calls btrfs_get_blocks_direct()
            again, without submitting first
            a bio for reading extent A, and
            finds the extent map for extent B
      
        --> calls lock_extent_direct()
      
            --> locks range [4K, 8K[
            --> finds ordered extent X
                covering range [4K, 8K[
            --> unlocks range [4K, 8K[
      
                                                        buffered write against
                                                        range [0K, 8K[ starts
      
                                                        __btrfs_buffered_write()
                                                          prepare_pages()
                                                            --> locks pages with
                                                                offsets 0 and 4K
                                                          lock_and_cleanup_extent_if_need()
                                                            --> blocks attempting to
                                                                lock range [0K, 8K[ in
                                                                the inode's io tree,
                                                                because the range [0, 4K[
                                                                is already locked by the
                                                                direct IO task at CPU 1
      
            --> calls
                btrfs_start_ordered_extent(oe X)
      
                btrfs_start_ordered_extent(oe X)
      
                  --> At this point writeback for ordered
                      extent X has not finished yet
      
                  filemap_fdatawrite_range()
                    btrfs_writepages()
                      extent_writepages()
                        extent_write_cache_pages()
                          --> finds page with offset 0
                              with the writeback tag
                              (and not dirty)
                          --> tries to lock it
                               --> deadlock, task at CPU 2
                                   has the page locked and
                                   is blocked on the io range
                                   [0, 4K[ that was locked
                                   earlier by this task
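
       To see this lock inversion in isolation, the following is a small
       stand-alone user space model of the two stuck tasks (hypothetical names,
       not btrfs code): each thread takes the lock its kernel counterpart
       already holds at the moment of the deadlock and then fails to take the
       other one; trylock is used only so the demo reports the cycle instead of
       hanging the way the real tasks do.

       #define _POSIX_C_SOURCE 200809L
       #include <pthread.h>
       #include <stdio.h>

       /* Stand-ins for the two locks involved in the circular wait. */
       static pthread_mutex_t io_range_0_4k = PTHREAD_MUTEX_INITIALIZER; /* io tree [0K, 4K[ */
       static pthread_mutex_t page0_lock    = PTHREAD_MUTEX_INITIALIZER; /* page at offset 0 */
       static pthread_barrier_t both_hold_first_lock;

       static void *dio_reader(void *arg)
       {
               (void)arg;
               pthread_mutex_lock(&io_range_0_4k);          /* CPU 1: range locked by get_blocks */
               pthread_barrier_wait(&both_hold_first_lock);
               if (pthread_mutex_trylock(&page0_lock))      /* flushing ordered extent X needs page 0 */
                       puts("dio reader: stuck on page 0 while holding range [0K, 4K[");
               else
                       pthread_mutex_unlock(&page0_lock);
               pthread_mutex_unlock(&io_range_0_4k);
               return NULL;
       }

       static void *buffered_writer(void *arg)
       {
               (void)arg;
               pthread_mutex_lock(&page0_lock);             /* CPU 2: prepare_pages() locked page 0 */
               pthread_barrier_wait(&both_hold_first_lock);
               if (pthread_mutex_trylock(&io_range_0_4k))   /* lock_and_cleanup_extent_if_need() */
                       puts("buffered writer: stuck on range [0K, 4K[ while holding page 0");
               else
                       pthread_mutex_unlock(&io_range_0_4k);
               pthread_mutex_unlock(&page0_lock);
               return NULL;
       }

       int main(void)
       {
               pthread_t a, b;

               pthread_barrier_init(&both_hold_first_lock, NULL, 2);
               pthread_create(&a, NULL, dio_reader, NULL);
               pthread_create(&b, NULL, buffered_writer, NULL);
               pthread_join(a, NULL);
               pthread_join(b, NULL);
               pthread_barrier_destroy(&both_hold_first_lock);
               return 0;
       }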
      
      So fix this by falling back to a buffered read in the direct IO read path
      when an ordered extent for a buffered write is found.
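
       A minimal sketch of that fallback idea, in user space and with made-up
       helper names rather than the actual btrfs functions (here -ENOTBLK is
       simply the "can't do this directly" signal used by this sketch): when the
       path that prepares a direct IO read finds an ordered extent from a
       buffered write in the target range, it gives up on the range instead of
       waiting for the ordered extent to complete, and the caller then services
       the read through the page cache, where no io tree range lock is held.

       #include <errno.h>
       #include <stdbool.h>
       #include <stdio.h>

       /* Hypothetical stand-in for the ordered extent lookup in the target range. */
       static bool range_has_ordered_extent(long start, long end)
       {
               (void)start;
               (void)end;
               return true; /* pretend a buffered write is in flight here */
       }

       /* Direct IO read path: refuse the range instead of waiting on the ordered extent. */
       static int lock_range_for_dio_read(long start, long end)
       {
               if (range_has_ordered_extent(start, end))
                       return -ENOTBLK; /* tell the caller to fall back */
               /* ... otherwise keep the range locked and build the read bio ... */
               return 0;
       }

       static long read_range(long start, long end)
       {
               int ret = lock_range_for_dio_read(start, end);

               if (ret == -ENOTBLK) {
                       printf("falling back to a buffered read for [%ldK, %ldK[\n",
                              start / 1024, end / 1024);
                       /* buffered path: copy from the page cache, no range lock held */
                       return end - start;
               }
               /* ... otherwise submit the direct IO bio for the range ... */
               return ret < 0 ? ret : end - start;
       }

       int main(void)
       {
               read_range(0, 8192);
               return 0;
       }
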
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ade77029