1. 14 Oct 2020 (2 commits)
  2. 12 Oct 2020 (1 commit)
    • mm: mmap: Fix general protection fault in unlink_file_vma() · bc4fe4cd
      By Miaohe Lin
      The syzbot reported the below general protection fault:
      
        general protection fault, probably for non-canonical address
        0xe00eeaee0000003b: 0000 [#1] PREEMPT SMP KASAN
        KASAN: maybe wild-memory-access in range [0x00777770000001d8-0x00777770000001df]
        CPU: 1 PID: 10488 Comm: syz-executor721 Not tainted 5.9.0-rc3-syzkaller #0
        RIP: 0010:unlink_file_vma+0x57/0xb0 mm/mmap.c:164
        Call Trace:
           free_pgtables+0x1b3/0x2f0 mm/memory.c:415
           exit_mmap+0x2c0/0x530 mm/mmap.c:3184
           __mmput+0x122/0x470 kernel/fork.c:1076
           mmput+0x53/0x60 kernel/fork.c:1097
           exit_mm kernel/exit.c:483 [inline]
           do_exit+0xa8b/0x29f0 kernel/exit.c:793
           do_group_exit+0x125/0x310 kernel/exit.c:903
           get_signal+0x428/0x1f00 kernel/signal.c:2757
           arch_do_signal+0x82/0x2520 arch/x86/kernel/signal.c:811
           exit_to_user_mode_loop kernel/entry/common.c:136 [inline]
           exit_to_user_mode_prepare+0x1ae/0x200 kernel/entry/common.c:167
           syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:242
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      It's because the ->mmap() callback can change vma->vm_file and fput()
      the original file.  But commit d70cec89 ("mm: mmap: merge vma after
      call_mmap() if possible") failed to catch this case and always fput()
      the original file, resulting in an extra fput().
      
      [ Thanks Hillf for pointing this extra fput() out. ]
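
      The shape of the fix, as a hedged sketch (simplified from the vma_merge()
      path in mmap_region(); not the literal upstream diff):

      	vma->vm_file = get_file(file);
      	error = call_mmap(file, vma);	/* may swap vma->vm_file and fput 'file' */

      	if (merge) {
      		/*
      		 * ->mmap() may have changed vma->vm_file, so drop the
      		 * reference held by the vma being discarded rather than
      		 * fput()ing 'file' a second time.
      		 */
      		fput(vma->vm_file);
      		vm_area_free(vma);
      		vma = merge;
      	}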
      
      Fixes: d70cec89 ("mm: mmap: merge vma after call_mmap() if possible")
      Reported-by: syzbot+c5d5a51dcbb558ca0cb5@syzkaller.appspotmail.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christian König <ckoenig.leichtzumerken@gmail.com>
      Cc: Hongxiang Lou <louhongxiang@huawei.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Link: https://lkml.kernel.org/r/20200916090733.31427-1-linmiaohe@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 25 Sep 2020 (1 commit)
  4. 04 Sep 2020 (1 commit)
  5. 08 Aug 2020 (3 commits)
  6. 25 Jul 2020 (1 commit)
  7. 30 Jun 2020 (1 commit)
    • mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls · 0a3b3c25
      By Paul E. McKenney
      A large process running on a heavily loaded system can encounter the
      following RCU CPU stall warning:
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 	3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
        	(t=21013 jiffies g=1005461 q=132576)
        NMI backtrace for cpu 3
        CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
        Hardware name: Wiwynn   HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
        Call Trace:
         <IRQ>
         dump_stack+0x46/0x60
         nmi_cpu_backtrace.cold.3+0x13/0x50
         ? lapic_can_unplug_cpu.cold.27+0x34/0x34
         nmi_trigger_cpumask_backtrace+0xba/0xca
         rcu_dump_cpu_stacks+0x99/0xc7
         rcu_sched_clock_irq.cold.87+0x1aa/0x397
         ? tick_sched_do_timer+0x60/0x60
         update_process_times+0x28/0x60
         tick_sched_timer+0x37/0x70
         __hrtimer_run_queues+0xfe/0x270
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x5e/0x120
         apic_timer_interrupt+0xf/0x20
         </IRQ>
        RIP: 0010:kmem_cache_free+0x223/0x300
        Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
        RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
        RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
        RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
        RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
        R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
        R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
         ? remove_vma+0x4f/0x60
         remove_vma+0x4f/0x60
         exit_mmap+0xd6/0x160
         mmput+0x4a/0x110
         do_exit+0x278/0xae0
         ? syscall_trace_enter+0x1d3/0x2b0
         ? handle_mm_fault+0xaa/0x1c0
         do_group_exit+0x3a/0xa0
         __x64_sys_exit_group+0x14/0x20
         do_syscall_64+0x42/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
      for a very long time given a large process.  This commit therefore adds
      a cond_resched() to this loop, providing RCU any needed quiescent states.
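
      The change itself is one line; a minimal sketch of the exit_mmap()
      teardown loop with the added rescheduling point (simplified from
      mm/mmap.c):

      	while (vma) {
      		if (vma->vm_flags & VM_ACCOUNT)
      			nr_accounted += vma_pages(vma);
      		vma = remove_vma(vma);
      		cond_resched();	/* provide RCU a quiescent state */
      	}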
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  8. 10 Jun 2020 (4 commits)
  9. 05 Jun 2020 (1 commit)
  10. 11 Apr 2020 (2 commits)
  11. 08 Apr 2020 (2 commits)
  12. 03 Apr 2020 (2 commits)
  13. 20 Feb 2020 (1 commit)
  14. 01 Feb 2020 (1 commit)
  15. 14 Jan 2020 (1 commit)
  16. 07 Jan 2020 (1 commit)
    • arm64: Revert support for execute-only user mappings · 24cecc37
      By Catalin Marinas
      The ARMv8 64-bit architecture supports execute-only user permissions by
      clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
      privileged mapping, but one from which user code running at EL0 can
      still execute.
      
      The downside, however, is that the kernel at EL1 inadvertently reading
      such a mapping would not trip over the PAN (privileged access never)
      protection.
      
      Revert the relevant bits from commit cab15ce6 ("arm64: Introduce
      execute-only page access permissions") so that PROT_EXEC implies
      PROT_READ (and therefore PTE_USER) until the architecture gains proper
      support for execute-only user mappings.
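
      In protection-table terms the revert amounts to pointing the exec-only
      entries back at readable+executable page attributes; an illustrative
      sketch (the real definitions live in
      arch/arm64/include/asm/pgtable-prot.h):

      	/* PROT_EXEC without PROT_READ, private and shared: */
      	#define __P100	PAGE_READONLY_EXEC	/* was PAGE_EXECONLY */
      	#define __S100	PAGE_READONLY_EXEC	/* was PAGE_EXECONLY */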
      
      Fixes: cab15ce6 ("arm64: Introduce execute-only page access permissions")
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 02 Dec 2019 (1 commit)
  18. 01 Dec 2019 (6 commits)
  19. 26 Sep 2019 (2 commits)
  20. 25 Sep 2019 (2 commits)
  21. 20 Aug 2019 (1 commit)
  22. 21 May 2019 (1 commit)
  23. 09 May 2019 (1 commit)
    • x86/mpx, mm/core: Fix recursive munmap() corruption · 5a28fc94
      By Dave Hansen
      This is a bit of a mess, to put it mildly.  But, it's a bug
      that only seems to have shown up in 4.20 but wasn't noticed
      until now, because nobody uses MPX.
      
      MPX has the arch_unmap() hook inside of munmap() because MPX
      uses bounds tables that protect other areas of memory.  When
      memory is unmapped, there is also a need to unmap the MPX
      bounds tables.  Barring this, unused bounds tables can eat 80%
      of the address space.
      
      But, the recursive do_munmap() that gets called via arch_unmap()
      wreaks havoc with __do_munmap()'s state.  It can result in
      freeing populated page tables, accessing bogus VMA state,
      double-freed VMAs and more.
      
      See the "long story" further below for the gory details.
      
      To fix this, call arch_unmap() before __do_munmap() has a chance
      to do anything meaningful.  Also, remove the 'vma' argument
      and force the MPX code to do its own, independent VMA lookup.
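
      A hedged sketch of the reordering in __do_munmap() (abbreviated; the
      signature matches the 4.20-era code):

      	int __do_munmap(struct mm_struct *mm, unsigned long start,
      			size_t len, struct list_head *uf, bool downgrade)
      	{
      		unsigned long end = start + PAGE_ALIGN(len);

      		/*
      		 * arch_unmap() may itself call do_munmap() (MPX does),
      		 * so run it before any VMA is split, detached or freed.
      		 */
      		arch_unmap(mm, start, end);

      		/* only now start finding and detaching VMAs ... */
      	}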
      
      == UML / unicore32 impact ==
      
      Remove unused 'vma' argument to arch_unmap().  No functional
      change.
      
      I compile-tested this on UML but not unicore32.
      
      == powerpc impact ==
      
      powerpc uses arch_unmap() to watch for munmap() on the
      VDSO and zeroes out 'current->mm->context.vdso_base'.  Moving
      arch_unmap() makes this happen earlier in __do_munmap().  But,
      'vdso_base' seems to only be used in perf and in the signal
      delivery that happens near the return to userspace.  I cannot
      find any likely impact to powerpc, other than the zeroing
      happening a little earlier.
      
      powerpc does not use the 'vma' argument and is unaffected by
      its removal.
      
      I compile-tested a 64-bit powerpc defconfig.
      
      == x86 impact ==
      
      For the common success case this is functionally identical to
      what was there before.  For the munmap() failure case, it's
      possible that some MPX tables will be zapped for memory that
      continues to be in use.  But, this is an extraordinarily
      unlikely scenario and the harm would be that MPX provides no
      protection since the bounds table got reset (zeroed).
      
      I can't imagine anyone doing this:
      
      	ptr = mmap();
      	// use ptr
      	ret = munmap(ptr);
      	if (ret)
      		// oh, there was an error, I'll
      		// keep using ptr.
      
      Because if you're doing munmap(), you are *done* with the
      memory.  There's probably no good data in there _anyway_.
      
      This passes the original reproducer from Richard Biener as
      well as the existing mpx selftests/.
      
      The long story:
      
      munmap() has a couple of pieces:
      
       1. Find the affected VMA(s)
       2. Split the start/end one(s) if necessary
       3. Pull the VMAs out of the rbtree
       4. Actually zap the memory via unmap_region(), including
          freeing page tables (or queueing them to be freed).
       5. Fix up some of the accounting (like fput()) and actually
          free the VMA itself.
      
      This specific ordering was actually introduced by:
      
        dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
      
      during the 4.20 merge window.  The previous __do_munmap() code
      was actually safe because the only thing after arch_unmap() was
      remove_vma_list().  arch_unmap() could not see 'vma' in the
      rbtree because it was detached, so it is not even capable of
      doing operations unsafe for remove_vma_list()'s use of 'vma'.
      
      Richard Biener reported a test that shows this in dmesg:
      
        [1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
        [1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576
      
      What triggered this was the recursive do_munmap() called via
      arch_unmap().  It was freeing page tables that had not been
      properly zapped.
      
      But, the problem was bigger than this.  For one, arch_unmap()
      can free VMAs.  But, the calling __do_munmap() has variables
      that *point* to VMAs and obviously can't handle them just
      getting freed while the pointer is still in use.
      
      I tried a couple of things here.  First, I tried to fix the page
      table freeing problem in isolation, but I then found the VMA
      issue.  I also tried having the MPX code return a flag if it
      modified the rbtree, which would force __do_munmap() to re-walk
      and restart.  That spiralled out of control in complexity pretty
      fast.
      
      Just moving arch_unmap() and accepting that the bonkers failure
      case might eat some bounds tables seems like the simplest viable
      fix.
      
      This was also reported in the following kernel bugzilla entry:
      
        https://bugzilla.kernel.org/show_bug.cgi?id=203123
      
      There are some reports that this commit triggered this bug:
      
        dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
      
      While that commit certainly made the issues easier to hit, I believe
      the fundamental issue has been with us as long as MPX itself, thus
      the Fixes: tag below is for one of the original MPX commits.
      
      [ mingo: Minor edits to the changelog and the patch. ]
      Reported-by: Richard Biener <rguenther@suse.de>
      Reported-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-um@lists.infradead.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: stable@vger.kernel.org
      Fixes: dd2283f2 ("mm: mmap: zap pages with read mmap_sem in munmap")
      Link: http://lkml.kernel.org/r/20190419194747.5E1AD6DC@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 20 Apr 2019 (1 commit)
    • coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 04f5866e
      By Andrea Arcangeli
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that being the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular, because growsdown and growsup can move vm_start/vm_end,
      the various loops the core dump does around the vma will not be
      consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy-user and page
      faults and replacing them with get_dump_page() for all binary formats,
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems safe enough, as it never mangles any vma field and can
      keep doing swapins in parallel to the core dumping) and to a few other
      corner cases.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2, so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
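
      A sketch of the new check (simplified from the patch; the helper keys
      off the core-dump state that the dump sets up in the mm):

      	static inline bool mmget_still_valid(struct mm_struct *mm)
      	{
      		return likely(!mm->core_state);
      	}

      	/* e.g. in find_extend_vma(), before touching vm_start/vm_end: */
      	if (!mmget_still_valid(mm))
      		return NULL;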
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Jann Horn <jannh@google.com>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>