1. 14 3月, 2016 10 次提交
  2. 06 3月, 2016 2 次提交
  3. 01 3月, 2016 1 次提交
  4. 28 2月, 2016 6 次提交
  5. 23 2月, 2016 2 次提交
  6. 20 2月, 2016 4 次提交
    • M
      fs/pnode.c: treat zero mnt_group_id-s as unequal · 7ae8fd03
      Maxim Patlasov 提交于
      propagate_one(m) calculates "type" argument for copy_tree() like this:
      
      >    if (m->mnt_group_id == last_dest->mnt_group_id) {
      >        type = CL_MAKE_SHARED;
      >    } else {
      >        type = CL_SLAVE;
      >        if (IS_MNT_SHARED(m))
      >           type |= CL_MAKE_SHARED;
      >   }
      
      The "type" argument then governs clone_mnt() behavior with respect to flags
      and mnt_master of new mount. When we iterate through a slave group, it is
      possible that both current "m" and "last_dest" are not shared (although,
      both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
      above erroneously makes new mount shared and sets its mnt_master to
      last_source->mnt_master. The patch fixes the problem by handling zero
      mnt_group_id-s as though they are unequal.
      
      The similar problem exists in the implementation of "else" clause above
      when we have to ascend upward in the master/slave tree by calling:
      
      >    last_source = last_source->mnt_master;
      >    last_dest = last_source->mnt_parent;
      
      proper number of times. The last step is governed by
      "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
      both are zero. The patch fixes this case in the same way as the former one.
      
      [AV: don't open-code an obvious helper...]
      Signed-off-by: NMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7ae8fd03
    • A
      affs_do_readpage_ofs(): just use kmap_atomic() around memcpy() · 0bacbe52
      Al Viro 提交于
      It forgets kunmap() on a failure exit, but there's really no point keeping
      the page kmapped at all - after all, what we are doing is a bunch of memcpy()
      into the parts of page, so kmap_atomic()/kunmap_atomic() just around those
      memcpy() is enough.
      Spotted-by: NInsu Yun <wuninsu@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0bacbe52
    • M
      xattr handlers: plug a lock leak in simple_xattr_list · 0e9a7da5
      Mateusz Guzik 提交于
      The code could leak xattrs->lock on error.
      
      Problem introduced with 786534b9 "tmpfs: listxattr should
      include POSIX ACL xattrs".
      Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0e9a7da5
    • W
      fs: allow no_seek_end_llseek to actually seek · 2feb55f8
      Wouter van Kesteren 提交于
      The user-visible impact of the issue is for example that without this
      patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid.
      
      '~0ULL' is a 'unsigned long long' that when converted to a loff_t,
      which is signed, gets turned into -1. later in vfs_setpos we have
      'if (offset > maxsize)', which makes it always return EINVAL.
      
      Fixes: b25472f9 ("new helpers: no_seek_end_llseek{,_size}()")
      Signed-off-by: NWouter van Kesteren <woutershep@gmail.com>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2feb55f8
  7. 11 2月, 2016 1 次提交
    • D
      btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759
      David Sterba 提交于
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
      larger value, then it's LLONG_MAX.
      
      There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
      64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
      the increment.
      
      We can get to that situation like that:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * next call to readdir will not emit any entries, but will reach the
        bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
      print shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284Reported-by: NVictor <services@swwu.com>
      CC: stable@vger.kernel.org
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bc4ef759
  8. 08 2月, 2016 1 次提交
  9. 07 2月, 2016 1 次提交
    • H
      pty: make sure super_block is still valid in final /dev/tty close · 1f55c718
      Herton R. Krzesinski 提交于
      Considering current pty code and multiple devpts instances, it's possible
      to umount a devpts file system while a program still has /dev/tty opened
      pointing to a previosuly closed pty pair in that instance. In the case all
      ptmx and pts/N files are closed, umount can be done. If the program closes
      /dev/tty after umount is done, devpts_kill_index will use now an invalid
      super_block, which was already destroyed in the umount operation after
      running ->kill_sb. This is another "use after free" type of issue, but now
      related to the allocated super_block instance.
      
      To avoid the problem (warning at ida_remove and potential crashes) for
      this specific case, I added two functions in devpts which grabs additional
      references to the super_block, which pty code now uses so it makes sure
      the super block structure is still valid until pty shutdown is done.
      I also moved the additional inode references to the same functions, which
      also covered similar case with inode being freed before /dev/tty final
      close/shutdown.
      Signed-off-by: NHerton R. Krzesinski <herton@redhat.com>
      Cc: stable@vger.kernel.org # 2.6.29+
      Reviewed-by: NPeter Hurley <peter@hurleysoftware.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1f55c718
  10. 06 2月, 2016 4 次提交
    • J
      epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT · b6a515c8
      Jason Baron 提交于
      In the current implementation of the EPOLLEXCLUSIVE flag (added for
      4.5-rc1), if epoll waiters create different POLL* sets and register them
      as exclusive against the same target fd, the current implementation will
      stop waking any further waiters once it finds the first idle waiter.
      This means that waiters could miss wakeups in certain cases.
      
      For example, when we wake up a pipe for reading we do:
      wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
      one epoll set or epfd is added to pipe p with POLLIN and a second set
      epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
      wakeup since the current implementation will stop after it finds any
      intersection of events with a waiter that is blocked in epoll_wait().
      
      We could potentially address this by requiring all epoll waiters that
      are added to p be required to pass the same set of POLL* events.  IE the
      first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
      flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
      However, I think it might be somewhat confusing interface as we would
      have to reference count the number of users for that set, and so
      userspace would have to keep track of that count, or we would need a
      more involved interface.  It also adds some shared state that we'd have
      store somewhere.  I don't think anybody will want to bloat
      __wait_queue_head for this.
      
      I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
      such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
      that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
      once we hit the first idle waiter that specifies the EPOLLIN bit, since
      any remaining waiters that only have 'POLLOUT' set wouldn't need to be
      woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
      bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
      wake bit set (there is at least one example of this I saw in fs/pipe.c),
      then we just wake the entire exclusive list.  Having both 'POLLOUT' and
      'POLLIN' both set should not be on any performance critical path, so I
      think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
      to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
      the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
      so.
      
      Since epoll waiters may be interested in other events as well besides
      EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
      doing a 'dup' call on the target fd and adding that as one normally
      would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
      events are what we are interest in balancing, I think that the 'dup'
      thing could perhaps be added to only one of the waiter threads.
      However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
      sufficient for the majority of use-cases.
      
      Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
      among multiple epfds, where between 1 and n of the epfds may receive an
      event, it does not satisfy the semantics of EPOLLONESHOT where only 1
      epfd would get an event.  Thus, it is not allowed to be specified in
      conjunction with EPOLLEXCLUSIVE.
      
      EPOLL_CTL_MOD is also not allowed if the fd was previously added as
      EPOLLEXCLUSIVE.  It seems with the limited number of flags to not be as
      interesting, but this could be relaxed at some further point.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Tested-by: NMadars Vitolins <m@silodev.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Eric Wong <normalperson@yhbt.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6a515c8
    • D
      dax: dirty inode only if required · d2b2a28e
      Dmitry Monakhov 提交于
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2b2a28e
    • X
      ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup · c95a5180
      xuejiufei 提交于
      When recovery master down, dlm_do_local_recovery_cleanup() only remove
      the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
      Which will make umount thread falling in dead loop migrating $RECOVERY
      to the dead node.
      Signed-off-by: Nxuejiufei <xuejiufei@huawei.com>
      Reviewed-by: NJoseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c95a5180
    • R
      block: fix pfn_mkwrite() DAX fault handler · 9c5a05bc
      Ross Zwisler 提交于
      Previously the pfn_mkwrite() fault handler for raw block devices called
      bldev_dax_fault() -> __dax_fault() to do a full DAX page fault.
      
      Really what the pfn_mkwrite() fault handler needs to do is call
      dax_pfn_mkwrite() to make sure that the radix tree entry for the given
      PTE is marked as dirty so that a follow-up fsync or msync call will
      flush it durably to media.
      
      Fixes: 5a023cdb ("block: enable dax for raw block devices")
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c5a05bc
  11. 05 2月, 2016 3 次提交
    • F
      Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl · 0c0fe3b0
      Filipe Manana 提交于
      While doing some tests I ran into an hang on an extent buffer's rwlock
      that produced the following trace:
      
      [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
      [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
      [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [39389.800016] irq event stamp: 0
      [39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
      [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
      [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
      [39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
      [39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
      [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
      [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
      [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
      [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
      [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
      [39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
      [39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
      [39389.800016] Stack:
      [39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
      [39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
      [39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
      [39389.800016] Call Trace:
      [39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
      [39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
      [39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
      [39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
      [39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
      [39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
      [39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
      [39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
      [39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
      [39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
      [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
      [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
      [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [39389.800012] irq event stamp: 0
      [39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
      [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
      [39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
      [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
      [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
      [39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
      [39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
      [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
      [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
      [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
      [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
      [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
      [39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
      [39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
      [39389.800012] Stack:
      [39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
      [39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
      [39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
      [39389.800012] Call Trace:
      [39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
      [39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
      [39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
      [39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
      [39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
      [39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
      [39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
      [39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
      [39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
      [39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
      [39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
      [39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
      [39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
      [39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
      [39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
      [39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
      [39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
      [39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
      [39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
      [39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
      [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
      [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
      [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00
      
      This happens because in the code path executed by the inode_paths ioctl we
      end up nesting two calls to read lock a leaf's rwlock when after the first
      call to read_lock() and before the second call to read_lock(), another
      task (running the delayed items as part of a transaction commit) has
      already called write_lock() against the leaf's rwlock. This situation is
      illustrated by the following diagram:
      
               Task A                       Task B
      
        btrfs_ref_to_path()               btrfs_commit_transaction()
          read_lock(&eb->lock);
      
                                            btrfs_run_delayed_items()
                                              __btrfs_commit_inode_delayed_items()
                                                __btrfs_update_delayed_inode()
                                                  btrfs_lookup_inode()
      
                                                    write_lock(&eb->lock);
                                                      --> task waits for lock
      
          read_lock(&eb->lock);
          --> makes this task hang
              forever (and task B too
      	of course)
      
      So fix this by avoiding doing the nested read lock, which is easily
      avoidable. This issue does not happen if task B calls write_lock() after
      task A does the second call to read_lock(), however there does not seem
      to exist anything in the documentation that mentions what is the expected
      behaviour for recursive locking of rwlocks (leaving the idea that doing
      so is not a good usage of rwlocks).
      
      Also, as a side effect necessary for this fix, make sure we do not
      needlessly read lock extent buffers when the input path has skip_locking
      set (used when called from send).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      0c0fe3b0
    • Y
      ceph: fix snap context leak in error path · db6aed70
      Yan, Zheng 提交于
      Signed-off-by: NYan, Zheng <zyan@redhat.com>
      db6aed70
    • D
      ceph: checking for IS_ERR instead of NULL · 1418bf07
      Dan Carpenter 提交于
      ceph_osdc_alloc_request() returns NULL on error, it never returns error
      pointers.
      
      Fixes: 5be0389d ('ceph: re-send AIO write request when getting -EOLDSNAP error')
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      1418bf07
  12. 04 2月, 2016 5 次提交
    • F
      Btrfs: remove no longer used function extent_read_full_page_nolock() · 7f042a83
      Filipe Manana 提交于
      Not needed after the previous patch named
      "Btrfs: fix page reading in extent_same ioctl leading to csum errors".
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      7f042a83
    • F
      Btrfs: fix page reading in extent_same ioctl leading to csum errors · 31314002
      Filipe Manana 提交于
      In the extent_same ioctl, we were grabbing the pages (locked) and
      attempting to read them without bothering about any concurrent IO
      against them. That is, we were not checking for any ongoing ordered
      extents nor waiting for them to complete, which leads to a race where
      the extent_same() code gets a checksum verification error when it
      reads the pages, producing a message like the following in dmesg
      and making the operation fail to user space with -ENOMEM:
      
      [18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868
      
      Fix this by using btrfs_readpage() for reading the pages instead of
      extent_read_full_page_nolock(), which waits for any concurrent ordered
      extents to complete and locks the io range. Also do better error handling
      and don't treat all failures as -ENOMEM, as that's clearly misleasing,
      becoming identical to the checks and operation of prepare_uptodate_page().
      
      The use of extent_read_full_page_nolock() was required before
      commit f4414602 ("btrfs: fix deadlock with extent-same and readpage"),
      as we had the range locked in an inode's io tree before attempting to
      read the pages.
      
      Fixes: f4414602 ("btrfs: fix deadlock with extent-same and readpage")
      Cc: stable@vger.kernel.org   # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      31314002
    • F
      Btrfs: fix invalid page accesses in extent_same (dedup) ioctl · e0bd70c6
      Filipe Manana 提交于
      In the extent_same ioctl we are getting the pages for the source and
      target ranges and unlocking them immediately after, which is incorrect
      because later we attempt to map them (with kmap_atomic) and access their
      contents at btrfs_cmp_data(). When we do such access the pages might have
      been relocated or removed from memory, which leads to an invalid memory
      access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
      which produces a trace like the following:
      
      186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64
      parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4
      crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last
      unloaded: btrfs]
      [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G        W       4.4.0-rc6-btrfs-next-18+ #1
      [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
      [186736.681319] RIP: 0010:[<ffffffff81264d00>]  [<ffffffff81264d00>] memcmp+0xb/0x22
      [186736.681319] RSP: 0018:ffff880362287d70  EFLAGS: 00010287
      [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
      [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
      [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
      [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
      [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
      [186736.681319] FS:  00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
      [186736.681319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
      [186736.681319] Stack:
      [186736.681319]  ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
      [186736.681319]  0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
      [186736.681319]  00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
      [186736.681319] Call Trace:
      [186736.681319]  [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
      [186736.681319]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [186736.681319]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [186736.681319]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [186736.681319]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [186736.681319]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [186736.681319]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6
      04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
      
      (gdb) list *(btrfs_ioctl+0x24cb)
      0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
      2967                    dst_addr = kmap_atomic(dst_page);
      2968
      2969                    flush_dcache_page(src_page);
      2970                    flush_dcache_page(dst_page);
      2971
      2972                    if (memcmp(addr, dst_addr, cmp_len))
      2973                            ret = BTRFS_SAME_DATA_DIFFERS;
      2974
      2975                    kunmap_atomic(addr);
      2976                    kunmap_atomic(dst_addr);
      
      So fix this by making sure we keep the pages locked and respect the same
      locking order as everywhere else: get and lock the pages first and then
      lock the range in the inode's io tree (like for example at
      __btrfs_buffered_write() and extent_readpages()). If an ordered extent
      is found after locking the range in the io tree, unlock the range,
      unlock the pages, wait for the ordered extent to complete and repeat the
      entire locking process until no overlapping ordered extents are found.
      
      Cc: stable@vger.kernel.org   # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      e0bd70c6
    • J
      proc: revert /proc/<pid>/maps [stack:TID] annotation · 65376df5
      Johannes Weiner 提交于
      Commit b7643757 ("procfs: mark thread stack correctly in
      proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
      
      Finding the task of a stack VMA requires walking the entire thread list,
      turning this into quadratic behavior: a thousand threads means a
      thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
      million combinations.
      
      The cost is not in proportion to the usefulness as described in the
      patch.
      
      Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
      /proc/<pid>/numa_maps) usable again for higher thread counts.
      
      The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
      identifying the stack VMA there is an O(1) operation.
      
      Siddesh said:
       "The end users needed a way to identify thread stacks programmatically and
        there wasn't a way to do that.  I'm afraid I no longer remember (or have
        access to the resources that would aid my memory since I changed
        employers) the details of their requirement.  However, I did do this on my
        own time because I thought it was an interesting project for me and nobody
        really gave any feedback then as to its utility, so as far as I am
        concerned you could roll back the main thread maps information since the
        information is available in the thread-specific files"
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Shaohua Li <shli@fb.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65376df5
    • M
      numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390 · 5c2ff95e
      Michael Holzheu 提交于
      When working with hugetlbfs ptes (which are actually pmds) is not valid to
      directly use pte functions like pte_present() because the hardware bit
      layout of pmds and ptes can be different.  This is the case on s390.
      Therefore we have to convert the hugetlbfs ptes first into a valid pte
      encoding with huge_ptep_get().
      
      Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
      huge_ptep_get().  On s390 this leads to the following two problems:
      
      1) The pte_present() function returns false (instead of true) for
         PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
         completely in the "numa_maps" output.
      
      2) The pte_dirty() function always returns false for all hugetlb ptes.
         Therefore these pages are reported as "mapped=xxx" instead of
         "dirty=xxx".
      
      Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
      Signed-off-by: NMichael Holzheu <holzheu@linux.vnet.ibm.com>
      Reviewed-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c2ff95e