1. 23 2月, 2016 1 次提交
    • L
      Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo 提交于
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar one that has the exactly same warning, but with that, we still
      run to this.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, with high memory pressure, tasks which hold dev_replace.lock can
      be interrupted by kswapd and then kswapd is intended to release memory occupied
      by superblock, inodes and dentries, where we may call evict_inode, and it comes
      to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
      to a ABBA deadlock.
      
      To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
      things are simpler here since we only needs read's spinlock to blocking lock.
      
      With this, btrfs/011 no more produces warnings in dmesg.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73beece9
  2. 11 2月, 2016 1 次提交
    • D
      btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759
      David Sterba 提交于
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
      larger value, then it's LLONG_MAX.
      
      There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
      64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
      the increment.
      
      We can get to that situation like that:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * next call to readdir will not emit any entries, but will reach the
        bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
      print shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284Reported-by: NVictor <services@swwu.com>
      CC: stable@vger.kernel.org
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bc4ef759
  3. 08 2月, 2016 1 次提交
  4. 07 2月, 2016 1 次提交
    • H
      pty: make sure super_block is still valid in final /dev/tty close · 1f55c718
      Herton R. Krzesinski 提交于
      Considering current pty code and multiple devpts instances, it's possible
      to umount a devpts file system while a program still has /dev/tty opened
      pointing to a previosuly closed pty pair in that instance. In the case all
      ptmx and pts/N files are closed, umount can be done. If the program closes
      /dev/tty after umount is done, devpts_kill_index will use now an invalid
      super_block, which was already destroyed in the umount operation after
      running ->kill_sb. This is another "use after free" type of issue, but now
      related to the allocated super_block instance.
      
      To avoid the problem (warning at ida_remove and potential crashes) for
      this specific case, I added two functions in devpts which grabs additional
      references to the super_block, which pty code now uses so it makes sure
      the super block structure is still valid until pty shutdown is done.
      I also moved the additional inode references to the same functions, which
      also covered similar case with inode being freed before /dev/tty final
      close/shutdown.
      Signed-off-by: NHerton R. Krzesinski <herton@redhat.com>
      Cc: stable@vger.kernel.org # 2.6.29+
      Reviewed-by: NPeter Hurley <peter@hurleysoftware.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f55c718
  5. 06 2月, 2016 4 次提交
    • J
      epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT · b6a515c8
      Jason Baron 提交于
      In the current implementation of the EPOLLEXCLUSIVE flag (added for
      4.5-rc1), if epoll waiters create different POLL* sets and register them
      as exclusive against the same target fd, the current implementation will
      stop waking any further waiters once it finds the first idle waiter.
      This means that waiters could miss wakeups in certain cases.
      
      For example, when we wake up a pipe for reading we do:
      wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
      one epoll set or epfd is added to pipe p with POLLIN and a second set
      epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
      wakeup since the current implementation will stop after it finds any
      intersection of events with a waiter that is blocked in epoll_wait().
      
      We could potentially address this by requiring all epoll waiters that
      are added to p be required to pass the same set of POLL* events.  IE the
      first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
      flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
      However, I think it might be somewhat confusing interface as we would
      have to reference count the number of users for that set, and so
      userspace would have to keep track of that count, or we would need a
      more involved interface.  It also adds some shared state that we'd have
      store somewhere.  I don't think anybody will want to bloat
      __wait_queue_head for this.
      
      I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
      such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
      that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
      once we hit the first idle waiter that specifies the EPOLLIN bit, since
      any remaining waiters that only have 'POLLOUT' set wouldn't need to be
      woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
      bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
      wake bit set (there is at least one example of this I saw in fs/pipe.c),
      then we just wake the entire exclusive list.  Having both 'POLLOUT' and
      'POLLIN' both set should not be on any performance critical path, so I
      think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
      to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
      the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
      so.
      
      Since epoll waiters may be interested in other events as well besides
      EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
      doing a 'dup' call on the target fd and adding that as one normally
      would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
      events are what we are interest in balancing, I think that the 'dup'
      thing could perhaps be added to only one of the waiter threads.
      However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
      sufficient for the majority of use-cases.
      
      Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
      among multiple epfds, where between 1 and n of the epfds may receive an
      event, it does not satisfy the semantics of EPOLLONESHOT where only 1
      epfd would get an event.  Thus, it is not allowed to be specified in
      conjunction with EPOLLEXCLUSIVE.
      
      EPOLL_CTL_MOD is also not allowed if the fd was previously added as
      EPOLLEXCLUSIVE.  It seems with the limited number of flags to not be as
      interesting, but this could be relaxed at some further point.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Tested-by: NMadars Vitolins <m@silodev.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a515c8
    • D
      dax: dirty inode only if required · d2b2a28e
      Dmitry Monakhov 提交于
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2b2a28e
    • X
      ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup · c95a5180
      xuejiufei 提交于
      When recovery master down, dlm_do_local_recovery_cleanup() only remove
      the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
      Which will make umount thread falling in dead loop migrating $RECOVERY
      to the dead node.
      Signed-off-by: Nxuejiufei <xuejiufei@huawei.com>
      Reviewed-by: NJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c95a5180
    • R
      block: fix pfn_mkwrite() DAX fault handler · 9c5a05bc
      Ross Zwisler 提交于
      Previously the pfn_mkwrite() fault handler for raw block devices called
      bldev_dax_fault() -> __dax_fault() to do a full DAX page fault.
      
      Really what the pfn_mkwrite() fault handler needs to do is call
      dax_pfn_mkwrite() to make sure that the radix tree entry for the given
      PTE is marked as dirty so that a follow-up fsync or msync call will
      flush it durably to media.
      
      Fixes: 5a023cdb ("block: enable dax for raw block devices")
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c5a05bc
  6. 05 2月, 2016 3 次提交
    • F
      Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl · 0c0fe3b0
      Filipe Manana 提交于
      While doing some tests I ran into an hang on an extent buffer's rwlock
      that produced the following trace:
      
      [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
      [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
      [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [39389.800016] irq event stamp: 0
      [39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
      [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
      [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
      [39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
      [39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
      [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
      [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
      [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
      [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
      [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
      [39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
      [39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
      [39389.800016] Stack:
      [39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
      [39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
      [39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
      [39389.800016] Call Trace:
      [39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
      [39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
      [39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
      [39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
      [39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
      [39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
      [39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
      [39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
      [39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
      [39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
      [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
      [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
      [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [39389.800012] irq event stamp: 0
      [39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
      [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
      [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
      [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
      [39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
      [39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
      [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
      [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
      [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
      [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
      [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
      [39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
      [39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
      [39389.800012] Stack:
      [39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
      [39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
      [39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
      [39389.800012] Call Trace:
      [39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
      [39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
      [39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
      [39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
      [39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
      [39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
      [39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
      [39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
      [39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
      [39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
      [39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
      [39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
      [39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
      [39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
      [39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
      [39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
      [39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
      [39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
      [39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
      [39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
      [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
      [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
      [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00
      
      This happens because in the code path executed by the inode_paths ioctl we
      end up nesting two calls to read lock a leaf's rwlock when after the first
      call to read_lock() and before the second call to read_lock(), another
      task (running the delayed items as part of a transaction commit) has
      already called write_lock() against the leaf's rwlock. This situation is
      illustrated by the following diagram:
      
               Task A                       Task B
      
        btrfs_ref_to_path()               btrfs_commit_transaction()
          read_lock(&eb->lock);
      
                                            btrfs_run_delayed_items()
                                              __btrfs_commit_inode_delayed_items()
                                                __btrfs_update_delayed_inode()
                                                  btrfs_lookup_inode()
      
                                                    write_lock(&eb->lock);
                                                      --> task waits for lock
      
          read_lock(&eb->lock);
          --> makes this task hang
              forever (and task B too
      	of course)
      
      So fix this by avoiding doing the nested read lock, which is easily
      avoidable. This issue does not happen if task B calls write_lock() after
      task A does the second call to read_lock(), however there does not seem
      to exist anything in the documentation that mentions what is the expected
      behaviour for recursive locking of rwlocks (leaving the idea that doing
      so is not a good usage of rwlocks).
      
      Also, as a side effect necessary for this fix, make sure we do not
      needlessly read lock extent buffers when the input path has skip_locking
      set (used when called from send).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      0c0fe3b0
    • Y
      ceph: fix snap context leak in error path · db6aed70
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      db6aed70
    • D
      ceph: checking for IS_ERR instead of NULL · 1418bf07
      Dan Carpenter 提交于
      ceph_osdc_alloc_request() returns NULL on error, it never returns error
      pointers.
      
      Fixes: 5be0389d ('ceph: re-send AIO write request when getting -EOLDSNAP error')
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      1418bf07
  7. 04 2月, 2016 6 次提交
    • F
      Btrfs: remove no longer used function extent_read_full_page_nolock() · 7f042a83
      Filipe Manana 提交于
      Not needed after the previous patch named
      "Btrfs: fix page reading in extent_same ioctl leading to csum errors".
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      7f042a83
    • F
      Btrfs: fix page reading in extent_same ioctl leading to csum errors · 31314002
      Filipe Manana 提交于
      In the extent_same ioctl, we were grabbing the pages (locked) and
      attempting to read them without bothering about any concurrent IO
      against them. That is, we were not checking for any ongoing ordered
      extents nor waiting for them to complete, which leads to a race where
      the extent_same() code gets a checksum verification error when it
      reads the pages, producing a message like the following in dmesg
      and making the operation fail to user space with -ENOMEM:
      
      [18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868
      
      Fix this by using btrfs_readpage() for reading the pages instead of
      extent_read_full_page_nolock(), which waits for any concurrent ordered
      extents to complete and locks the io range. Also do better error handling
      and don't treat all failures as -ENOMEM, as that's clearly misleasing,
      becoming identical to the checks and operation of prepare_uptodate_page().
      
      The use of extent_read_full_page_nolock() was required before
      commit f4414602 ("btrfs: fix deadlock with extent-same and readpage"),
      as we had the range locked in an inode's io tree before attempting to
      read the pages.
      
      Fixes: f4414602 ("btrfs: fix deadlock with extent-same and readpage")
      Cc: stable@vger.kernel.org   # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      31314002
    • F
      Btrfs: fix invalid page accesses in extent_same (dedup) ioctl · e0bd70c6
      Filipe Manana 提交于
      In the extent_same ioctl we are getting the pages for the source and
      target ranges and unlocking them immediately after, which is incorrect
      because later we attempt to map them (with kmap_atomic) and access their
      contents at btrfs_cmp_data(). When we do such access the pages might have
      been relocated or removed from memory, which leads to an invalid memory
      access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
      which produces a trace like the following:
      
      186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
      parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
      crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
      unloaded: btrfs]
      [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G        W       4.4.0-rc6-btrfs-next-18+ #1
      [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
      [186736.681319] RIP: 0010:[<ffffffff81264d00>]  [<ffffffff81264d00>] memcmp+0xb/0x22
      [186736.681319] RSP: 0018:ffff880362287d70  EFLAGS: 00010287
      [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
      [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
      [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
      [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
      [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
      [186736.681319] FS:  00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
      [186736.681319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
      [186736.681319] Stack:
      [186736.681319]  ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
      [186736.681319]  0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
      [186736.681319]  00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
      [186736.681319] Call Trace:
      [186736.681319]  [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
      [186736.681319]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [186736.681319]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [186736.681319]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [186736.681319]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [186736.681319]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [186736.681319]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
      04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
      
      (gdb) list *(btrfs_ioctl+0x24cb)
      0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
      2967                    dst_addr = kmap_atomic(dst_page);
      2968
      2969                    flush_dcache_page(src_page);
      2970                    flush_dcache_page(dst_page);
      2971
      2972                    if (memcmp(addr, dst_addr, cmp_len))
      2973                            ret = BTRFS_SAME_DATA_DIFFERS;
      2974
      2975                    kunmap_atomic(addr);
      2976                    kunmap_atomic(dst_addr);
      
      So fix this by making sure we keep the pages locked and respect the same
      locking order as everywhere else: get and lock the pages first and then
      lock the range in the inode's io tree (like for example at
      __btrfs_buffered_write() and extent_readpages()). If an ordered extent
      is found after locking the range in the io tree, unlock the range,
      unlock the pages, wait for the ordered extent to complete and repeat the
      entire locking process until no overlapping ordered extents are found.
      
      Cc: stable@vger.kernel.org   # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      e0bd70c6
    • J
      proc: revert /proc/<pid>/maps [stack:TID] annotation · 65376df5
      Johannes Weiner 提交于
      Commit b7643757 ("procfs: mark thread stack correctly in
      proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
      
      Finding the task of a stack VMA requires walking the entire thread list,
      turning this into quadratic behavior: a thousand threads means a
      thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
      million combinations.
      
      The cost is not in proportion to the usefulness as described in the
      patch.
      
      Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
      /proc/<pid>/numa_maps) usable again for higher thread counts.
      
      The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
      identifying the stack VMA there is an O(1) operation.
      
      Siddesh said:
       "The end users needed a way to identify thread stacks programmatically and
        there wasn't a way to do that.  I'm afraid I no longer remember (or have
        access to the resources that would aid my memory since I changed
        employers) the details of their requirement.  However, I did do this on my
        own time because I thought it was an interesting project for me and nobody
        really gave any feedback then as to its utility, so as far as I am
        concerned you could roll back the main thread maps information since the
        information is available in the thread-specific files"
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65376df5
    • M
      numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390 · 5c2ff95e
      Michael Holzheu 提交于
      When working with hugetlbfs ptes (which are actually pmds) is not valid to
      directly use pte functions like pte_present() because the hardware bit
      layout of pmds and ptes can be different.  This is the case on s390.
      Therefore we have to convert the hugetlbfs ptes first into a valid pte
      encoding with huge_ptep_get().
      
      Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
      huge_ptep_get().  On s390 this leads to the following two problems:
      
      1) The pte_present() function returns false (instead of true) for
         PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
         completely in the "numa_maps" output.
      
      2) The pte_dirty() function always returns false for all hugetlb ptes.
         Therefore these pages are reported as "mapped=xxx" instead of
         "dirty=xxx".
      
      Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Reviewed-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c2ff95e
    • J
      ocfs2/cluster: fix memory leak in o2hb_region_release · a4a1dfa4
      Joseph Qi 提交于
      o2hb_region_release currently doesn't free o2hb_debug_buf
      hr_db_elapsed_time and hr_db_pinned malloced in o2hb_debug_create.  Also
      we should call debugfs_remove before freeing its data, to prevent the risk
      accessing debugfs rightly after its data has been freed.
      Signed-off-by: NJoseph Qi <joseph.qi@huawei.com>
      Reviewed-by: NJiufei Xue <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4a1dfa4
  8. 31 1月, 2016 2 次提交
  9. 30 1月, 2016 1 次提交
  10. 28 1月, 2016 1 次提交
  11. 27 1月, 2016 4 次提交
  12. 26 1月, 2016 3 次提交
    • D
      Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" · 80ad623e
      David Sterba 提交于
      This reverts commit 69624913. The
      cleaner thread can block freezing when there's a snapshot cleaning in
      progress and the other threads get suspended first. From the logs
      provided by Martin we're waiting for reading extent pages:
      
      kernel: PM: Syncing filesystems ... done.
      kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
      kernel: Freezing remaining freezable tasks ...
      kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
      kernel: btrfs-cleaner   D ffff88033dd13bc0     0   152      2 0x00000000
      kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
      kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
      kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
      kernel: Call Trace:
      kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
      kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
      kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
      kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
      kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
      kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
      kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
      kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
      kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
      kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
      kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
      kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
      kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
      kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
      kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
      kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
      kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
      kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
      kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
      kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
      kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
      kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
      kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
      kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      
      As this affects a released kernel (4.4) we need a minimal fix for
      stable kernels.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361Reported-by: NMartin Ziegler <ziegler@uni-freiburg.de>
      CC: stable@vger.kernel.org # 4.4
      CC: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      80ad623e
    • Q
      btrfs: async-thread: Fix a use-after-free error for trace · 0a95b851
      Qu Wenruo 提交于
      Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
      So no one use use that pointer after queue_work().
      
      Fix the user-after-free bug by move the trace line before queue_work().
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0a95b851
    • F
      Btrfs: fix race between fsync and lockless direct IO writes · de0ee0ed
      Filipe Manana 提交于
      An fsync, using the fast path, can race with a concurrent lockless direct
      IO write and end up logging a file extent item that points to an extent
      that wasn't written to yet. This is because the fast fsync path collects
      ordered extents into a local list and then collects all the new extent
      maps to log file extent items based on them, while the direct IO write
      path creates the new extent map before it creates the corresponding
      ordered extent (and submitting the respective bio(s)).
      
      So fix this by making the direct IO write path create ordered extents
      before the extent maps and make the fast fsync path collect any new
      ordered extents after it collects the extent maps.
      Note that making the fsync handler call inode_dio_wait() (after acquiring
      the inode's i_mutex) would not work and lead to a deadlock when doing
      AIO, as through AIO we end up in a path where the fsync handler is called
      (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
      before the inode's dio counter is decremented (inode_dio_wait() waits
      for this counter to have a value of zero).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      de0ee0ed
  13. 25 1月, 2016 2 次提交
  14. 23 1月, 2016 10 次提交
    • D
      vfs: abort dedupe loop if fatal signals are pending · e62e560f
      Darrick J. Wong 提交于
      If the program running dedupe receives a fatal signal during the
      dedupe loop, we should bail out to avoid tying up the system.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      e62e560f
    • T
      tree wide: use kvfree() than conditional kfree()/vfree() · 1d5cfdb0
      Tetsuo Handa 提交于
      There are many locations that do
      
        if (memory_was_allocated_by_vmalloc)
          vfree(ptr);
        else
          kfree(ptr);
      
      but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
      using is_vmalloc_addr().  Unless callers have special reasons, we can
      replace this branch with kvfree().  Please check and reply if you found
      problems.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJan Kara <jack@suse.com>
      Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Reviewed-by: NAndreas Dilger <andreas.dilger@intel.com>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Boris Petkov <bp@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d5cfdb0
    • R
      dax: never rely on bh.b_dev being set by get_block() · eab95db6
      Ross Zwisler 提交于
      Previously in DAX we assumed that calls to get_block() would set
      bh.b_bdev, and we would then use that value even in error cases for
      debugging.  This caused a NULL pointer dereference in __dax_dbg() which
      was fixed by a previous commit, but that commit only changed the one
      place where we were hitting an error.
      
      Instead, update dax.c so that we always initialize bh.b_bdev as best we
      can based on the information that DAX has.  get_block() may or may not
      update to a new value, but this at least lets us get something helpful
      from bh.b_bdev for error messages and not have to worry about whether it
      was set by get_block() or not.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eab95db6
    • R
      xfs: call dax_pfn_mkwrite() for DAX fsync/msync · 5eb88dca
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5eb88dca
    • R
      ext4: call dax_pfn_mkwrite() for DAX fsync/msync · d5be7a03
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5be7a03
    • R
      ext2: call dax_pfn_mkwrite() for DAX fsync/msync · 80b4adca
      Ross Zwisler 提交于
      To properly support the new DAX fsync/msync infrastructure filesystems
      need to call dax_pfn_mkwrite() so that DAX can track when user pages are
      dirtied.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80b4adca
    • R
      dax: add support for fsync/sync · 9973c98e
      Ross Zwisler 提交于
      To properly handle fsync/msync in an efficient way DAX needs to track
      dirty pages so it is able to flush them durably to media on demand.
      
      The tracking of dirty pages is done via the radix tree in struct
      address_space.  This radix tree is already used by the page writeback
      infrastructure for tracking dirty pages associated with an open file,
      and it already has support for exceptional (non struct page*) entries.
      We build upon these features to add exceptional entries to the radix
      tree for DAX dirty PMD or PTE pages at fault time.
      
      [dan.j.williams@intel.com: fix dax_pmd_dbg build warning]
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9973c98e
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
    • R
      dax: fix conversion of holes to PMDs · de14b9cb
      Ross Zwisler 提交于
      When we get a DAX PMD fault for a write it is possible that there could
      be some number of 4k zero pages already present for the same range that
      were inserted to service reads from a hole.  These 4k zero pages need to
      be unmapped from the VMAs and removed from the struct address_space
      radix tree before the real DAX PMD entry can be inserted.
      
      For PTE faults this same use case also exists and is handled by a
      combination of unmap_mapping_range() to unmap the VMAs and
      delete_from_page_cache() to remove the page from the address_space radix
      tree.
      
      For PMD faults we do have a call to unmap_mapping_range() (protected by
      a buffer_new() check), but nothing clears out the radix tree entry.  The
      buffer_new() check is also incorrect as the current ext4 and XFS
      filesystem code will never return a buffer_head with BH_New set, even
      when allocating new blocks over a hole.  Instead the filesystem will
      zero the blocks manually and return a buffer_head with only BH_Mapped
      set.
      
      Fix this situation by removing the buffer_new() check and adding a call
      to truncate_inode_pages_range() to clear out the radix tree entries
      before we insert the DAX PMD.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de14b9cb
    • R
      dax: fix NULL pointer dereference in __dax_dbg() · d4bbe706
      Ross Zwisler 提交于
      In __dax_pmd_fault() we currently assume that get_block() will always
      set bh.b_bdev and we unconditionally dereference it in __dax_dbg().
      
      This assumption isn't always true - when called for reads of holes
      ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is
      never set.  I hit this BUG while testing the DAX PMD fault path.
      
      Instead, initialize bh.b_bdev before passing bh into get_block().  It is
      possible that the filesystem's get_block() will update bh.b_bdev, and
      this is fine - we just want to initialize bh.b_bdev to something
      reasonable so that the calls to __dax_dbg() work and print something
      useful.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4bbe706