1. 01 Nov 2017, 1 commit
    • x86/mm: fix use-after-free of vma during userfaultfd fault · cb0631fd
      Authored by Vlastimil Babka
      Syzkaller with KASAN has reported a use-after-free of vma->vm_flags in
      __do_page_fault() with the following reproducer:
      
        mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
        mmap(&(0x7f0000011000/0x3000)=nil, 0x3000, 0x1, 0x32, 0xffffffffffffffff, 0x0)
        r0 = userfaultfd(0x0)
        ioctl$UFFDIO_API(r0, 0xc018aa3f, &(0x7f0000002000-0x18)={0xaa, 0x0, 0x0})
        ioctl$UFFDIO_REGISTER(r0, 0xc020aa00, &(0x7f0000019000)={{&(0x7f0000012000/0x2000)=nil, 0x2000}, 0x1, 0x0})
        r1 = gettid()
        syz_open_dev$evdev(&(0x7f0000013000-0x12)="2f6465762f696e7075742f6576656e742300", 0x0, 0x0)
        tkill(r1, 0x7)
      
      The vma should be pinned by mmap_sem, but handle_userfault() might (in
      a return-to-userspace scenario) release it and then acquire it again,
      so when we return to __do_page_fault() (with a result other than
      VM_FAULT_RETRY), the vma might be gone.
      
      Specifically, per Andrea the scenario is
       "A return to userland to repeat the page fault later with a
        VM_FAULT_NOPAGE retval (potentially after handling any pending signal
        during the return to userland). The return to userland is identified
        whenever FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in
        vmf->flags"
      
      However, since commit a3c4fb7c ("x86/mm: Fix fault error path using
      unsafe vma pointer") there is a vma_pkey() read of vma->vm_flags after
      that point, which can thus become use-after-free.  Fix this by moving
      the read before calling handle_mm_fault().
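      
      As a minimal sketch of the fix's shape (simplified, not the verbatim
      diff; names follow arch/x86/mm/fault.c):
      
        /*
         * Read the pkey while the vma is still pinned by mmap_sem:
         * handle_mm_fault() -> handle_userfault() may drop and re-acquire
         * mmap_sem, after which the vma may already have been freed.
         */
        u32 pkey = vma_pkey(vma);
      
        fault = handle_mm_fault(vma, address, flags);
      
        /* vma may be stale here: use the saved pkey, never vma->vm_flags */
        mm_fault_error(regs, error_code, address, &pkey, fault);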
      Reported-by: syzbot <bot+6a5269ce759a7bb12754ed9622076dc93f65a1f6@syzkaller.appspotmail.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Fixes: a3c4fb7c ("x86/mm: Fix fault error path using unsafe vma pointer")
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 25 Sep 2017, 1 commit
    • x86/mm: Fix fault error path using unsafe vma pointer · a3c4fb7c
      Authored by Laurent Dufour
      commit 7b2d0dba ("x86/mm/pkeys: Pass VMA down in to fault signal
      generation code") passes a vma pointer down to the error path, but that
      happens after the mmap_sem has been released, when mm_fault_error() is
      called from __do_page_fault().
      
      This is dangerous, as the vma structure is no longer safe to use once
      the mmap_sem has been released. Since only the protection key value is
      required in the error processing, we can just pass that value down.
      
      Fix it by passing a pointer to a protection key value down to the
      fault signal generation code. Using a pointer makes it possible to
      keep the check that generates a warning message in fill_sig_info_pkey()
      when the vma was not known. If the pointer is valid, the protection
      key value can be accessed by dereferencing it.
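      
      A sketch of the resulting pointer convention (condensed; the real
      fill_sig_info_pkey() also checks the si_code):
      
        static void fill_sig_info_pkey(int si_code, siginfo_t *info, u32 *pkey)
        {
                /* NULL means the vma was not known: keep the warning */
                if (WARN_ON_ONCE(!pkey))
                        return;
                /* Dereference only once the pointer is known to be valid */
                info->si_pkey = *pkey;
        }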
      
      [ tglx: Made *pkey u32 as that's the type which is passed in siginfo ]
      
      Fixes: 7b2d0dba ("x86/mm/pkeys: Pass VMA down in to fault signal generation code")
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1504513935-12742-1-git-send-email-ldufour@linux.vnet.ibm.com
  3. 23 Sep 2017, 1 commit
    • x86/asm: Fix inline asm call constraints for Clang · f5caf621
      Authored by Josh Poimboeuf
      For inline asm statements which have a CALL instruction, we list the
      stack pointer as a constraint to convince GCC to ensure the frame
      pointer is set up first:
      
        static inline void foo()
        {
      	register void *__sp asm(_ASM_SP);
      	asm("call bar" : "+r" (__sp))
        }
      
      Unfortunately, that pattern causes Clang to corrupt the stack pointer.
      
      The fix is easy: convert the stack pointer register variable to a global
      variable.
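      
      A sketch of the before/after pattern (the names below follow the
      commit's approach; ASM_CALL_CONSTRAINT is the helper it introduces):
      
        /* Before: a local register variable per function */
        static inline void foo(void)
        {
                register void *__sp asm(_ASM_SP);
                asm("call bar" : "+r" (__sp));
        }
      
        /* After: one global register variable shared by all inline asm */
        register unsigned long current_stack_pointer asm(_ASM_SP);
        #define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)
      
        static inline void foo(void)
        {
                asm("call bar" : ASM_CALL_CONSTRAINT);
        }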
      
      It should be noted that the end result is different based on the GCC
      version.  With GCC 6.4, this patch has exactly the same result as
      before:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9820389		9491555		8816046		8516940
       after	9820389		9491555		8816046		8516940
      
      With GCC 7.2, however, GCC's behavior has changed.  It now changes its
      behavior based on the conversion of the register variable to a global.
      That somehow convinces it to *always* set up the frame pointer before
      inserting *any* inline asm.  (Therefore, listing the variable as an
      output constraint is a no-op and is no longer necessary.)  It's a bit
      overkill, but the performance impact should be negligible.  And in fact,
      there's a nice improvement with frame pointers disabled:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9796316		9468236		9076191		8790305
       after	9796957		9464267		9076381		8785949
      
      So in summary, while listing the stack pointer as an output constraint
      is no longer necessary for newer versions of GCC, it's still needed for
      older versions.
      Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Matthias Kaehlcke <mka@chromium.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/3db862e970c432ae823cf515c52b54fec8270e0e.1505942196.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 31 Aug 2017, 1 commit
  5. 29 Aug 2017, 2 commits
  6. 13 Jun 2017, 1 commit
  7. 14 Mar 2017, 2 commits
  8. 02 Mar 2017, 1 commit
  9. 15 Dec 2016, 1 commit
  10. 26 Oct 2016, 1 commit
    • x86/dumpstack: Remove kernel text addresses from stack dump · bb5e5ce5
      Authored by Josh Poimboeuf
      Printing kernel text addresses in stack dumps is of questionable value,
      especially now that address randomization is becoming common.
      
      It can be a security issue because it leaks kernel addresses.  It also
      affects the usefulness of the stack dump.  Linus says:
      
        "I actually spend time cleaning up commit messages in logs, because
        useless data that isn't actually information (random hex numbers) is
        actively detrimental.
      
        It makes commit logs less legible.
      
        It also makes it harder to parse dumps.
      
        It's not useful. That makes it actively bad.
      
        I probably look at more oops reports than most people. I have not
        found the hex numbers useful for the last five years, because they are
        just randomized crap.
      
        The stack content thing just makes code scroll off the screen etc, for
        example."
      
      The only real downside to removing these addresses is that they can be
      used to disambiguate duplicate symbol names.  However such cases are
      rare, and the context of the stack dump should be enough to be able to
      figure it out.
      
      There's now a 'faddr2line' script which can be used to convert a
      function address to a file name and line:
      
        $ ./scripts/faddr2line ~/k/vmlinux write_sysrq_trigger+0x51/0x60
        write_sysrq_trigger+0x51/0x60:
        write_sysrq_trigger at drivers/tty/sysrq.c:1098
      
      Or gdb can be used:
      
        $ echo "list *write_sysrq_trigger+0x51" |gdb ~/k/vmlinux |grep "is in"
        (gdb) 0xffffffff815b5d83 is in driver_probe_device (/home/jpoimboe/git/linux/drivers/base/dd.c:378).
      
      (But note that when there are duplicate symbol names, gdb will only show
      the first symbol it finds.  faddr2line is recommended over gdb because
      it handles duplicates and it also does function size checking.)
      
      Here's an example of what a stack dump looks like after this change:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: sysrq_handle_crash+0x45/0x80
        PGD 36bfa067 [   29.650644] PUD 7aca3067
        Oops: 0002 [#1] PREEMPT SMP
        Modules linked in: ...
        CPU: 1 PID: 786 Comm: bash Tainted: G            E   4.9.0-rc1+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
        task: ffff880078582a40 task.stack: ffffc90000ba8000
        RIP: 0010:sysrq_handle_crash+0x45/0x80
        RSP: 0018:ffffc90000babdc8 EFLAGS: 00010296
        RAX: ffff880078582a40 RBX: 0000000000000063 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000292
        RBP: ffffc90000babdc8 R08: 0000000b31866061 R09: 0000000000000000
        R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
        R13: 0000000000000007 R14: ffffffff81ee8680 R15: 0000000000000000
        FS:  00007ffb43869700(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000007a3e9000 CR4: 00000000001406e0
        Stack:
         ffffc90000babe00 ffffffff81572d08 ffffffff81572bd5 0000000000000002
         0000000000000000 ffff880079606600 00007ffb4386e000 ffffc90000babe20
         ffffffff81573201 ffff880036a3fd00 fffffffffffffffb ffffc90000babe40
        Call Trace:
         __handle_sysrq+0x138/0x220
         ? __handle_sysrq+0x5/0x220
         write_sysrq_trigger+0x51/0x60
         proc_reg_write+0x42/0x70
         __vfs_write+0x37/0x140
         ? preempt_count_sub+0xa1/0x100
         ? __sb_start_write+0xf5/0x210
         ? vfs_write+0x183/0x1a0
         vfs_write+0xb8/0x1a0
         SyS_write+0x58/0xc0
         entry_SYSCALL_64_fastpath+0x1f/0xc2
        RIP: 0033:0x7ffb42f55940
        RSP: 002b:00007ffd33bb6b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007ffb42f55940
        RDX: 0000000000000002 RSI: 00007ffb4386e000 RDI: 0000000000000001
        RBP: 0000000000000011 R08: 00007ffb4321ea40 R09: 00007ffb43869700
        R10: 00007ffb43869700 R11: 0000000000000246 R12: 0000000000778a10
        R13: 00007ffd33bb5c00 R14: 0000000000000007 R15: 0000000000000010
        Code: 34 e8 d0 34 bc ff 48 c7 c2 3b 2b 57 81 be 01 00 00 00 48 c7 c7 e0 dd e5 81 e8 a8 55 ba ff c7 05 0e 3f de 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 e8 4c 49 bc ff 84 c0 75 c3 48 c7
        RIP: sysrq_handle_crash+0x45/0x80 RSP: ffffc90000babdc8
        CR2: 0000000000000000
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/69329cb29b8f324bb5fcea14d61d224807fb6488.1477405374.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 28 Sep 2016, 1 commit
  12. 20 Sep 2016, 1 commit
  13. 09 Sep 2016, 1 commit
  14. 08 Sep 2016, 1 commit
  15. 27 Jul 2016, 1 commit
  16. 15 Jul 2016, 3 commits
  17. 20 May 2016, 1 commit
  18. 03 Mar 2016, 1 commit
    • x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA · e2155543
      Authored by Dave Hansen
      Andrey Wagin reported that a simple test case was broken by:
      
	62b5f7d0 ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
      
      This test case creates an unreadable VMA and my patch assumed
      that all writes must be to readable VMAs.
      
      The simplest fix for this is to remove the pkey-related bits
      in access_error().  For execute-only support, I believe the
      existing version is sufficient because the permissions we
      are trying to enforce are entirely expressed in vma->vm_flags.
      We just depend on pkeys to get *an* exception; it does not
      matter that PF_PK was set, or even what state PKRU is in.
      
      I will re-add the necessary bits with the full pkeys
      implementation that includes the new syscalls.
      
      The three cases that matter are:
      
      1. If a write to an execute-only VMA occurs, we will see PF_WRITE
         set, but !VM_WRITE on the VMA, and return 1.  All execute-only
         VMAs have VM_WRITE clear by definition.
      2. If a read occurs on a present PTE, we will fall in to the "read,
         present" case and return 1.
      3. If a read occurs to a non-present PTE, we will miss the "read,
         not present" case, because the execute-only VMA will have
         VM_EXEC set, and we will properly return 0 allowing the PTE to
         be populated.
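      
      A condensed sketch of how the three cases fall out of the remaining
      checks (simplified from access_error(); not the exact diff):
      
        static int access_error(unsigned long error_code,
                                struct vm_area_struct *vma)
        {
                if (error_code & PF_WRITE) {
                        /* Case 1: write to a VMA without VM_WRITE */
                        if (unlikely(!(vma->vm_flags & VM_WRITE)))
                                return 1;
                        return 0;
                }
                /* Case 2: read on a present PTE */
                if (unlikely(error_code & PF_PROT))
                        return 1;
                /* Case 3: read on a non-present PTE; execute-only VMAs
                 * have VM_EXEC set, so this passes and the PTE gets
                 * populated */
                if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
                        return 1;
                return 0;
        }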
      
      Test program:
      
       int main()
       {
      	int *p;
      	p = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	p[0] = 1;
      
      	return 0;
       }
      
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: linux-next@vger.kernel.org
      Fixes: 62b5f7d0 ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
      Link: http://lkml.kernel.org/r/20160301194133.65D0110C@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  19. 19 Feb 2016, 3 commits
    • mm/core, x86/mm/pkeys: Add execute-only protection keys support · 62b5f7d0
      Authored by Dave Hansen
      Protection keys provide new page-based protection in hardware.
      But, they have an interesting attribute: they only affect data
      accesses and never affect instruction fetches.  That means that
      if we set up some memory which is set as "access-disabled" via
      protection keys, we can still execute from it.
      
      This patch uses protection keys to set up mappings to do just that.
      If a user calls:
      
      	mmap(..., PROT_EXEC);
      or
      	mprotect(ptr, sz, PROT_EXEC);
      
      (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
      notice this, and set a special protection key on the memory.  It
      also sets the appropriate bits in the Protection Keys User Rights
      (PKRU) register so that the memory becomes unreadable and
      unwritable.
      
      I haven't found any userspace that does this today.  With this
      facility in place, we expect userspace to move to use it
      eventually.  Userspace _could_ start doing this today.  Any
      PROT_EXEC calls get converted to PROT_READ inside the kernel, and
      would transparently be upgraded to "true" PROT_EXEC with this
      code.  IOW, userspace never has to do any PROT_EXEC runtime
      detection.
      
      This feature provides enhanced protection against leaking
      executable memory contents.  This helps thwart attacks which are
      attempting to find ROP gadgets on the fly.
      
      But, the security provided by this approach is not comprehensive.
      The PKRU register which controls access permissions is a normal
      user register writable from unprivileged userspace.  An attacker
      who can execute the 'wrpkru' instruction can easily disable the
      protection provided by this feature.
      
      The protection key that is used for execute-only support is
      permanently dedicated at compile time.  This is fine for now
      because there is currently no API to set a protection key other
      than this one.
      
      Despite there being a constant PKRU value across the entire
      system, we do not set it unless this feature is in use in a
      process.  That is to preserve the PKRU XSAVE 'init state',
      which can lead to faster context switches.
      
      PKRU *is* a user register and the kernel is modifying it.  That
      means that code doing:
      
      	pkru = rdpkru()
      	pkru |= 0x100;
      	mmap(..., PROT_EXEC);
      	wrpkru(pkru);
      
      could lose the bits in PKRU that enforce execute-only
      permissions.  To avoid this, we suggest avoiding ever calling
      mmap() or mprotect() when the PKRU value is expected to be
      unstable.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/core, x86/mm/pkeys: Differentiate instruction fetches · d61172b4
      Authored by Dave Hansen
      As discussed earlier, we attempt to enforce protection keys in
      software.
      
      However, the code checks all faults to ensure that they are not
      violating protection key permissions.  It was assumed that all
      faults are either write faults where we check PKRU[key].WD (write
      disable) or read faults where we check the AD (access disable)
      bit.
      
      But, there is a third category of faults for protection keys:
      instruction faults.  Instruction faults never run afoul of
      protection keys because they do not affect instruction fetches.
      
      So, plumb the PF_INSTR bit down in to the
      arch_vma_access_permitted() function where we do the protection
      key checks.
      
      We also add a new FAULT_FLAG_INSTRUCTION.  This is because
      handle_mm_fault() is not passed the architecture-specific
      error_code where we keep PF_INSTR, so we need to encode the
      instruction fetch information in to the arch-generic fault
      flags.
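      
      A sketch of the plumbing (condensed; handle_mm_fault() still took the
      mm argument at this point):
      
        /* x86: translate the arch-specific error code bit into the
         * arch-generic fault flag before calling generic code */
        if (error_code & PF_INSTR)
                flags |= FAULT_FLAG_INSTRUCTION;
      
        fault = handle_mm_fault(mm, vma, address, flags);
      
        /* Generic code can then hand the fetch/no-fetch information down
         * to arch_vma_access_permitted(), where the pkey checks live */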
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Optimize fault handling in access_error() · 07f146f5
      Authored by Dave Hansen
      We might not strictly have to make modifications to
      access_error() to check the VMA here.
      
      If we do not, we will do this:
      
       1. app sets VMA pkey to K
       2. app touches a !present page
       3. do_page_fault(), allocates and maps page, sets pte.pkey=K
       4. return to userspace
       5. touch instruction reexecutes, but triggers PF_PK
       6. do PKEY signal
      
      What happens with this patch applied:
      
       1. app sets VMA pkey to K
       2. app touches a !present page
       3. do_page_fault() notices that K is inaccessible
       4. do PKEY signal
      
      We basically skip the fault that does an allocation.
      
      So what this lets us do is protect areas from even being
      *populated* unless it is accessible according to protection
      keys.  That seems handy to me and makes protection keys work
      more like an mprotect()'d mapping.
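      
      A condensed sketch of the access_error() addition this describes:
      
        /*
         * A protection-key violation can never be resolved by handling
         * the fault (e.g. via COW), so always signal instead of
         * populating the page.
         */
        if (unlikely(error_code & PF_PK))
                return 1;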
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210222.EBB63D8C@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 18 Feb 2016, 6 commits
    • mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys · 33a709b2
      Authored by Dave Hansen
      Today, for normal faults and page table walks, we check the VMA
      and/or PTE to ensure that it is compatible with the action.  For
      instance, if we get a write fault on a non-writeable VMA, we
      SIGSEGV.
      
      We try to do the same thing for protection keys.  Basically, we
      try to make sure that if a user does this:
      
      	mprotect(ptr, size, PROT_NONE);
      	*ptr = foo;
      
      they see the same effects with protection keys when they do this:
      
      	mprotect(ptr, size, PROT_READ|PROT_WRITE);
      	set_pkey(ptr, size, 4);
      	wrpkru(0xffffff3f); // access disable pkey 4
      	*ptr = foo;
      
      The state to do that checking is in the VMA, but we also
      sometimes have to do it on the page tables only, like when doing
      a get_user_pages_fast() where we have no VMA.
      
      We add two functions and expose them to generic code:
      
      	arch_pte_access_permitted(pte_flags, write)
      	arch_vma_access_permitted(vma, write)
      
      These are, of course, backed up in x86 arch code with checks
      against the PTE or VMA's protection key.
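      
      A sketch of the two call sites' shape (simplified, using the
      signatures listed above):
      
        /* In a fault path, where the VMA is available: */
        if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE))
                return VM_FAULT_SIGSEGV;
      
        /* In get_user_pages_fast(), where only the PTE is available: */
        if (!arch_pte_access_permitted(pte_flags(pte), write))
                return 0;       /* fall back to the slow, VMA-based path */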
      
      But, there are also cases where we do not want to respect
      protection keys.  When we ptrace(), for instance, we do not want
      to apply the tracer's PKRU permissions to the PTEs from the
      process being traced.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linux-s390@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Fill in pkey field in siginfo · 019132ff
      Authored by Dave Hansen
      This fills in the new siginfo field: si_pkey to indicate to
      userspace which protection key was set on the PTE that we faulted
      on.
      
      Note though that *ALL* protection key faults have to be generated
      by a valid, present PTE at some point.  But this code does no PTE
      lookups, which seems odd.  The reason is that we take advantage of
      the way we generate PTEs from VMAs.  All PTEs under a VMA share
      some attributes.  For instance, they are _all_ either PROT_READ
      *OR* PROT_NONE.  They also always share a protection key, so we
      never have to walk the page tables; we just use the VMA.
      
      Note that _pkey is a 64-bit value.  The current hardware only
      supports 4-bit protection keys.  We do this because there is
      _plenty_ of space in _sigfault and it is possible that future
      processors would support more than 4 bits of protection keys.
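      
      A userspace sketch of consuming the new field (assuming a libc that
      exposes si_pkey and SEGV_PKUERR; fprintf() is for illustration only,
      as it is not async-signal-safe):
      
        #include <signal.h>
        #include <stdio.h>
      
        static void segv_handler(int sig, siginfo_t *si, void *ctx)
        {
                /* SEGV_PKUERR marks a protection-key violation */
                if (si->si_code == SEGV_PKUERR)
                        fprintf(stderr, "pkey fault: key %d\n",
                                (int)si->si_pkey);
        }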
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210213.ABC488FA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Pass VMA down in to fault signal generation code · 7b2d0dba
      Authored by Dave Hansen
      During a page fault, we look up the VMA to ensure that the fault
      is in a region with a valid mapping.  But, in the top-level page
      fault code we don't need the VMA for much else.  Once we have
      decided that an access is bad, we are going to send a signal no
      matter what and do not need the VMA any more.  So we do not pass
      it down in to the signal generation code.
      
      But, for protection keys, we need the VMA.  It tells us *which*
      protection key we violated if we get a PF_PK.  So, we need to
      pass the VMA down and fill in siginfo->si_pkey.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210211.AD3B36A3@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pkeys: Add new 'PF_PK' page fault error code bit · b3ecd515
      Authored by Dave Hansen
      Note: "PK" is how the Intel SDM refers to this bit, so we also
      use that nomenclature.
      
      This only defines the bit, it does not plumb it anywhere to be
      handled.
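      
      For context, a sketch of the error code bits as defined in
      arch/x86/mm/fault.c, with PK joining the existing bits:
      
        enum x86_pf_error_code {
                PF_PROT  = 1 << 0,  /* 0: no page found, 1: protection fault */
                PF_WRITE = 1 << 1,  /* 0: read access,   1: write access */
                PF_USER  = 1 << 2,  /* 0: kernel-mode,   1: user-mode access */
                PF_RSVD  = 1 << 3,  /* reserved bit was set in a PTE */
                PF_INSTR = 1 << 4,  /* fault was an instruction fetch */
                PF_PK    = 1 << 5,  /* protection-keys violation */
        };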
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210207.DA7B43E6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Expand the exception table logic to allow new handling options · 548acf19
      Authored by Tony Luck
      Huge amounts of help from Andy Lutomirski and Borislav Petkov went
      into producing this. Andy provided the inspiration to add classes to
      the exception table with a clever bit-squeezing trick; Boris pointed
      out how much cleaner it would all be if we just had a new field.
      
      Linus Torvalds blessed the expansion with:
      
        ' I'd rather not be clever in order to save just a tiny amount of space
          in the exception table, which isn't really criticial for anybody. '
      
      The third field is another relative function pointer, this one to a
      handler that executes the actions.
      
      We start out with three handlers:
      
       1: Legacy - just jumps to the fixup IP
       2: Fault - provide the trap number in %ax to the fixup code
       3: Cleaned up legacy for the uaccess error hack
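      
      A sketch of the expanded entry and the legacy handler (condensed;
      all three fields are relative offsets from the field's own address):
      
        struct exception_table_entry {
                int insn;       /* faulting instruction */
                int fixup;      /* fixup code to jump to */
                int handler;    /* ex_handler_*() that performs the action */
        };
      
        /* Handler 1 (legacy): just jump to the fixup IP */
        bool ex_handler_default(const struct exception_table_entry *fixup,
                                struct pt_regs *regs, int trapnr)
        {
                regs->ip = ex_fixup_addr(fixup);
                return true;
        }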
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/f6af78fcbd348cf4939875cfda9c19689b5e50b8.1455732970.git.tony.luck@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Fix vmalloc_fault() to handle large pages properly · f4eafd8b
      Authored by Toshi Kani
      A kernel page fault oops with the callstack below was observed
      when a read syscall was made to a pmem device after a huge amount
      (>512GB) of vmalloc ranges had been allocated by ioremap() on an
      x86_64 system:
      
           BUG: unable to handle kernel paging request at ffff880840000ff8
           IP: vmalloc_fault+0x1be/0x300
           PGD c7f03a067 PUD 0
           Oops: 0000 [#1] SM
           Call Trace:
              __do_page_fault+0x285/0x3e0
              do_page_fault+0x2f/0x80
              ? put_prev_entity+0x35/0x7a0
              page_fault+0x28/0x30
              ? memcpy_erms+0x6/0x10
              ? schedule+0x35/0x80
              ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
              ? schedule_timeout+0x183/0x240
              btt_log_read+0x63/0x140 [nd_btt]
               :
              ? __symbol_put+0x60/0x60
              ? kernel_read+0x50/0x80
              SyS_finit_module+0xb9/0xf0
              entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      Since v4.1, ioremap() supports large page (pud/pmd) mappings in
      x86_64 and PAE.  vmalloc_fault() however assumes that the vmalloc
      range is limited to pte mappings.
      
      vmalloc faults do not normally happen in ioremap'd ranges since
      ioremap() sets up the kernel page tables, which are shared by
      user processes.  pgd_ctor() sets the kernel's PGD entries to
      user's during fork().  When allocation of the vmalloc ranges
      crosses a 512GB boundary, ioremap() allocates a new pud table
      and updates the kernel PGD entry to point to it.  If a user process's
      PGD entry does not have this update yet, a read/write syscall
      to the range will cause a vmalloc fault, which hits the Oops
      above as it does not handle a large page properly.
      
      The following changes are made to vmalloc_fault():
      
      64-bit:
      
       - No change for the PGD sync operation as it handles large
         pages already.
       - Add pud_huge() and pmd_huge() to the validation code to
         handle large pages.
       - Change pud_page_vaddr() to pud_pfn() since an ioremap range
         is not directly mapped (while the if-statement still works
         with a bogus addr).
       - Change pmd_page() to pmd_pfn() since an ioremap range is not
         backed by struct page (while the if-statement still works
         with a bogus addr).
      
      32-bit:
       - No change for the sync operation since the index3 PGD entry
         covers the entire vmalloc range, which is always valid.
         (A separate change to sync PGD entry is necessary if this
          memory layout is changed regardless of the page size.)
       - Add pmd_huge() to the validation code to handle large pages.
         This is for completeness since vmalloc_fault() won't happen
         in ioremap'd ranges as its PGD entry is always valid.
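      
      A condensed sketch of the 64-bit validation with the huge-page
      checks added (pud_ref/pmd_ref are the reference entries from the
      kernel's init_mm page tables):
      
        if (pud_huge(*pud)) {
                /* ioremap range: not backed by struct page, compare pfns */
                if (pud_pfn(*pud) != pud_pfn(*pud_ref))
                        BUG();
        } else {
                pmd = pmd_offset(pud, address);
                if (pmd_huge(*pmd)) {
                        if (pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
                                BUG();
                }
        }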
      Reported-by: Henning Schild <henning.schild@siemens.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: <stable@vger.kernel.org> # 4.1+
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-mm@kvack.org
      Cc: linux-nvdimm@lists.01.org
      Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 31 Jul 2015, 2 commits
  22. 19 May 2015, 1 commit
    • mm/fault, arch: Use pagefault_disable() to check for disabled pagefaults in the handler · 70ffdb93
      Authored by David Hildenbrand
      Introduce faulthandler_disabled() and use it to check for irq context and
      disabled pagefaults (via pagefault_disable()) in the pagefault handlers.
      
      Please note that we keep the in_atomic() checks in place - to detect
      whether in irq context (in which case preemption is always properly
      disabled).
      
      In contrast, preempt_disable() should never be used to disable pagefaults.
      With !CONFIG_PREEMPT_COUNT, preempt_disable() doesn't modify the preempt
      counter, and therefore the result of in_atomic() differs.
      We validate that condition by using might_fault() checks when calling
      might_sleep().
      
      Therefore, add a comment to faulthandler_disabled(), describing why this
      is needed.
      
      faulthandler_disabled() and pagefault_disable() are defined in
      linux/uaccess.h, so let's properly add that include to all relevant files.
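      
      A sketch of the relationship (the define is condensed from
      linux/uaccess.h; the usage follows the x86 handler):
      
        #define faulthandler_disabled() (pagefault_disabled() || in_atomic())
      
        /* Typical use in an arch pagefault handler: */
        if (unlikely(faulthandler_disabled() || !mm)) {
                bad_area_nosemaphore(regs, error_code, address);
                return;
        }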
      
      This patch is based on a patch from Thomas Gleixner.
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-7-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 23 Mar 2015, 2 commits
  24. 04 Feb 2015, 1 commit
  25. 30 Jan 2015, 1 commit
    • vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Authored by Linus Torvalds
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
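      
      A sketch of the per-arch tie-in (condensed from the shape of the x86
      mm_fault_error() path):
      
        if (fault & VM_FAULT_OOM) {
                /* existing out-of-memory path */
        } else if (fault & VM_FAULT_SIGSEGV) {
                /* new: route to the existing SIGSEGV code */
                bad_area_nosemaphore(regs, error_code, address);
        } else if (fault & (VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)) {
                do_sigbus(regs, error_code, address, fault);
        }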
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 18 Dec 2014, 1 commit
  27. 16 Dec 2014, 1 commit
    • x86: mm: consolidate VM_FAULT_RETRY handling · 26178ec1
      Authored by Linus Torvalds
      The VM_FAULT_RETRY handling was confusing and incorrect for the case of
      returning to kernel mode.  We need to handle the exception table fixup
      if we return to kernel mode due to a fatal signal - it will basically
      look, to the kernel user-mode access, like the access failed due to
      the VM going away from under it.  Which is correct - the process is
      dying - and avoids the whole "repeat endless kernel page faults" case.
      
      Handling the VM_FAULT_RETRY early and in just one place also simplifies
      the mmap_sem handling, since once we've taken care of VM_FAULT_RETRY we
      know that we can just drop the lock.  The remaining accounting and
      possible error handling is thread-local and does not need the mmap_sem.
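      
      A condensed sketch of the consolidated shape in __do_page_fault()
      (simplified; helper signatures follow the code of that era):
      
        fault = handle_mm_fault(mm, vma, address, flags);
      
        if (unlikely(fault & VM_FAULT_RETRY)) {
                /* handle_mm_fault() already released mmap_sem for us */
                if (!fatal_signal_pending(tsk)) {
                        flags |= FAULT_FLAG_TRIED;
                        goto retry;
                }
                /* Fatal signal on a kernel-mode access: exception fixup */
                if (!(error_code & PF_USER))
                        no_context(regs, error_code, address,
                                   SIGBUS, BUS_ADRERR);
                return;
        }
      
        up_read(&mm->mmap_sem);
        /* the remaining accounting and error handling is thread-local */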
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>