1. 19 5月, 2018 3 次提交
  2. 14 5月, 2018 4 次提交
    • D
      x86/pkeys: Do not special case protection key 0 · 2fa9d1cf
      Dave Hansen 提交于
      mm_pkey_is_allocated() treats pkey 0 as unallocated.  That is
      inconsistent with the manpages, and also inconsistent with
      mm->context.pkey_allocation_map.  Stop special casing it and only
      disallow values that are actually bad (< 0).
      
      The end-user visible effect of this is that you can now use
      mprotect_pkey() to set pkey=0.
      
      This is a bit nicer than what Ram proposed[1] because it is simpler
      and removes special-casing for pkey 0.  On the other hand, it does
      allow applications to pkey_free() pkey-0, but that's just a silly
      thing to do, so we are not going to protect against it.
      
      The scenario that could happen is similar to what happens if you free
      any other pkey that is in use: it might get reallocated later and used
      to protect some other data.  The most likely scenario is that pkey-0
      comes back from pkey_alloc(), an access-disable or write-disable bit
      is set in PKRU for it, and the next stack access will SIGSEGV.  It's
      not horribly different from if you mprotect()'d your stack or heap to
      be unreadable or unwritable, which is generally very foolish, but also
      not explicitly prevented by the kernel.
      
      1. http://lkml.kernel.org/r/1522112702-27853-1-git-send-email-linuxram@us.ibm.comSigned-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>p
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellermen <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Fixes: 58ab9a08 ("x86/pkeys: Check against max pkey to avoid overflows")
      Link: http://lkml.kernel.org/r/20180509171358.47FD785E@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2fa9d1cf
    • D
      x86/pkeys: Override pkey when moving away from PROT_EXEC · 0a0b1520
      Dave Hansen 提交于
      I got a bug report that the following code (roughly) was
      causing a SIGSEGV:
      
      	mprotect(ptr, size, PROT_EXEC);
      	mprotect(ptr, size, PROT_NONE);
      	mprotect(ptr, size, PROT_READ);
      	*ptr = 100;
      
      The problem is hit when the mprotect(PROT_EXEC)
      is implicitly assigned a protection key to the VMA, and made
      that key ACCESS_DENY|WRITE_DENY.  The PROT_NONE mprotect()
      failed to remove the protection key, and the PROT_NONE->
      PROT_READ left the PTE usable, but the pkey still in place
      and left the memory inaccessible.
      
      To fix this, we ensure that we always "override" the pkee
      at mprotect() if the VMA does not have execute-only
      permissions, but the VMA has the execute-only pkey.
      
      We had a check for PROT_READ/WRITE, but it did not work
      for PROT_NONE.  This entirely removes the PROT_* checks,
      which ensures that PROT_NONE now works.
      Reported-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellermen <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: stable@vger.kernel.org
      Fixes: 62b5f7d0 ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
      Link: http://lkml.kernel.org/r/20180509171351.084C5A71@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0a0b1520
    • A
      x86/cpufeature: Guard asm_volatile_goto usage for BPF compilation · b1ae32db
      Alexei Starovoitov 提交于
      Workaround for the sake of BPF compilation which utilizes kernel
      headers, but clang does not support ASM GOTO and fails the build.
      
      Fixes: d0266046 ("x86: Remove FAST_FEATURE_TESTS")
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: daniel@iogearbox.net
      Cc: peterz@infradead.org
      Cc: netdev@vger.kernel.org
      Cc: bp@alien8.de
      Cc: yhs@fb.com
      Cc: kernel-team@fb.com
      Cc: torvalds@linux-foundation.org
      Cc: davem@davemloft.net
      Link: https://lkml.kernel.org/r/20180513193222.1997938-1-ast@kernel.org
      b1ae32db
    • M
      kprobes/x86: Prohibit probing on exception masking instructions · ee6a7354
      Masami Hiramatsu 提交于
      Since MOV SS and POP SS instructions will delay the exceptions until the
      next instruction is executed, single-stepping on it by kprobes must be
      prohibited.
      
      However, kprobes usually executes those instructions directly on trampoline
      buffer (a.k.a. kprobe-booster), except for the kprobes which has
      post_handler. Thus if kprobe user probes MOV SS with post_handler, it will
      do single-stepping on the MOV SS.
      
      This means it is safe that if it is used via ftrace or perf/bpf since those
      don't use the post_handler.
      
      Anyway, since the stack switching is a rare case, it is safer just
      rejecting kprobes on such instructions.
      Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Cc: Francis Deslauriers <francis.deslauriers@efficios.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: "David S . Miller" <davem@davemloft.net>
      Link: https://lkml.kernel.org/r/152587069574.17316.3311695234863248641.stgit@devbox
      ee6a7354
  3. 13 5月, 2018 1 次提交
  4. 06 5月, 2018 3 次提交
  5. 26 4月, 2018 4 次提交
  6. 25 4月, 2018 2 次提交
    • S
      tracing/x86: Update syscall trace events to handle new prefixed syscall func names · 1c758a22
      Steven Rostedt (VMware) 提交于
      Arnaldo noticed that the latest kernel is missing the syscall event system
      directory in x86. I bisected it down to d5a00528 ("syscalls/core,
      syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()").
      
      The system call trace events are special, as there is only one trace event
      for all system calls (the raw_syscalls). But a macro that wraps the system
      calls creates meta data for them that copies the name to find the system
      call that maps to the system call table (the number). At boot up, it does a
      kallsyms lookup of the system call table to find the function that maps to
      the meta data of the system call. If it does not find a function, then that
      system call is ignored.
      
      Because the x86 system calls had "__x64_", or "__ia32_" prefixed to the
      "sys" for the names, they do not match the default compare algorithm. As
      this was a problem for power pc, the algorithm can be overwritten by the
      architecture. The solution is to have x86 have its own algorithm to do the
      compare and this brings back the system call trace events.
      
      Link: http://lkml.kernel.org/r/20180417174128.0f3457f0@gandalf.local.homeReported-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Fixes: d5a00528 ("syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      1c758a22
    • D
      x86/pti: Filter at vma->vm_page_prot population · 316d097c
      Dave Hansen 提交于
      commit ce9962bf7e22bb3891655c349faff618922d4a73
      
      0day reported warnings at boot on 32-bit systems without NX support:
      
      attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff
      WARNING: CPU: 0 PID: 1 at
      arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0:
       check_pgprot at arch/x86/include/asm/pgtable.h:535
       (inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549
       (inlined by) do_anonymous_page at mm/memory.c:3169
       (inlined by) handle_pte_fault at mm/memory.c:3961
       (inlined by) __handle_mm_fault at mm/memory.c:4087
       (inlined by) handle_mm_fault at mm/memory.c:4124
      
      The problem is that due to the recent commit which removed auto-massaging
      of page protections, filtering page permissions at PTE creation time is not
      longer done, so vma->vm_page_prot is passed unfiltered to PTE creation.
      
      Filter the page protections before they are installed in vma->vm_page_prot.
      
      Fixes: fb43d6cb ("x86/mm: Do not auto-massage page protections")
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com
      
      316d097c
  7. 23 4月, 2018 1 次提交
  8. 17 4月, 2018 1 次提交
  9. 16 4月, 2018 1 次提交
  10. 14 4月, 2018 1 次提交
  11. 12 4月, 2018 4 次提交
    • A
      Revert "x86/asm: Allow again using asm.h when building for the 'bpf' clang target" · fd97d39b
      Arnaldo Carvalho de Melo 提交于
      This reverts commit ca26cffa.
      
      Newer clang versions accept that asm(_ASM_SP) construct, and now that
      the bpf-script-test-kbuild.c script, used in one of the 'perf test LLVM'
      subtests doesn't include ptrace.h, which ended up including
      arch/x86/include/asm/asm.h, we can revert this patch.
      Suggested-by: NYonghong Song <yhs@fb.com>
      Link: https://lkml.kernel.org/r/613f0a0d-c433-8f4d-dcc1-c9889deae39e@fb.comAcked-by: NYonghong Song <yhs@fb.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-nqozcv8loq40tkqpfw997993@git.kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      fd97d39b
    • D
      x86/pti: Leave kernel text global for !PCID · 8c06c774
      Dave Hansen 提交于
      Global pages are bad for hardening because they potentially let an
      exploit read the kernel image via a Meltdown-style attack which
      makes it easier to find gadgets.
      
      But, global pages are good for performance because they reduce TLB
      misses when making user/kernel transitions, especially when PCIDs
      are not available, such as on older hardware, or where a hypervisor
      has disabled them for some reason.
      
      This patch implements a basic, sane policy: If you have PCIDs, you
      only map a minimal amount of kernel text global.  If you do not have
      PCIDs, you map all kernel text global.
      
      This policy effectively makes PCIDs something that not only adds
      performance but a little bit of hardening as well.
      
      I ran a simple "lseek" microbenchmark[1] to test the benefit on
      a modern Atom microserver.  Most of the benefit comes from applying
      the series before this patch ("entry only"), but there is still a
      signifiant benefit from this patch.
      
        No Global Lines (baseline  ): 6077741 lseeks/sec
        88 Global Lines (entry only): 7528609 lseeks/sec (+23.9%)
        94 Global Lines (this patch): 8433111 lseeks/sec (+38.8%)
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.cSigned-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205518.E3D989EB@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8c06c774
    • D
      x86/mm: Do not auto-massage page protections · fb43d6cb
      Dave Hansen 提交于
      A PTE is constructed from a physical address and a pgprotval_t.
      __PAGE_KERNEL, for instance, is a pgprot_t and must be converted
      into a pgprotval_t before it can be used to create a PTE.  This is
      done implicitly within functions like pfn_pte() by massage_pgprot().
      
      However, this makes it very challenging to set bits (and keep them
      set) if your bit is being filtered out by massage_pgprot().
      
      This moves the bit filtering out of pfn_pte() and friends.  For
      users of PAGE_KERNEL*, filtering will be done automatically inside
      those macros but for users of __PAGE_KERNEL*, they need to do their
      own filtering now.
      
      Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
      instead of massage_pgprot().  This way, we still *look* for
      unsupported bits and properly warn about them if we find them.  This
      might happen if an unfiltered __PAGE_KERNEL* value was passed in,
      for instance.
      
      - printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
      - boot crash fix from:            Tom Lendacky <thomas.lendacky@amd.com>
      - crash bisected by:              Mike Galbraith <efault@gmx.de>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reported-and-fixed-by: NArnd Bergmann <arnd@arndb.de>
      Fixed-by: NTom Lendacky <thomas.lendacky@amd.com>
      Bisected-by: NMike Galbraith <efault@gmx.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fb43d6cb
    • P
      xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Pavel Tatashin 提交于
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because the xen's PagePinned() flag is getting erased from
      struct pages when they are initialized later in boot.
      
      Juergen fixed this problem by disabling deferred pages on xen pv
      domains.  It is desirable, however, to have this feature available as it
      reduces boot time.  This fix re-enables the feature for pv-dmains, and
      fixes the problem the following way:
      
      The fix is to delay setting PagePinned flag until struct pages for all
      allocated memory are initialized, i.e.  until after free_all_bootmem().
      
      A new x86_init.hyper op init_after_bootmem() is called to let xen know
      that boot allocator is done, and hence struct pages for all the
      allocated memory are now initialized.  If deferred page initialization
      is enabled, the rest of struct pages are going to be initialized later
      in boot once page_alloc_init_late() is called.
      
      xen_after_bootmem() walks page table's pages and marks them pinned.
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Tested-by: NJuergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f84f8d1
  12. 10 4月, 2018 2 次提交
    • L
      x86/apic: Fix signedness bug in APIC ID validity checks · a774635d
      Li RongQing 提交于
      The APIC ID as parsed from ACPI MADT is validity checked with the
      apic->apic_id_valid() callback, which depends on the selected APIC type.
      
      For non X2APIC types APIC IDs >= 0xFF are invalid, but values > 0x7FFFFFFF
      are detected as valid. This happens because the 'apicid' argument of the
      apic_id_valid() callback is type 'int'. So the resulting comparison
      
         apicid < 0xFF
      
      evaluates to true for all unsigned int values > 0x7FFFFFFF which are handed
      to default_apic_id_valid(). As a consequence, invalid APIC IDs in !X2APIC
      mode are considered valid and accounted as possible CPUs.
      
      Change the apicid argument type of the apic_id_valid() callback to u32 so
      the evaluation is unsigned and returns the correct result.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: NLi RongQing <lirongqing@baidu.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: jgross@suse.com
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/1523322966-10296-1-git-send-email-lirongqing@baidu.com
      a774635d
    • D
      x86/mm: Introduce "default" kernel PTE mask · 8a57f484
      Dave Hansen 提交于
      The __PAGE_KERNEL_* page permissions are "raw".  They contain bits
      that may or may not be supported on the current processor.  They need
      to be filtered by a mask (currently __supported_pte_mask) to turn them
      into a value that we can actually set in a PTE.
      
      These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL.  But, with PTI,
      we want to be able to support _PAGE_GLOBAL (have the bit set in
      __supported_pte_mask) but not have it appear in any of these masks by
      default.
      
      This patch creates a new mask, __default_kernel_pte_mask, and applies
      it when creating all of the PAGE_KERNEL_* masks.  This makes
      PAGE_KERNEL_* safe to use anywhere (they only contain supported bits).
      It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n
      kernels but clears _PAGE_GLOBAL when PTI=y.
      
      We also make __default_kernel_pte_mask a non-GPL exported symbol
      because there are plenty of driver-available interfaces that take
      PAGE_KERNEL_* permissions.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8a57f484
  13. 09 4月, 2018 4 次提交
    • D
      syscalls/x86: Adapt syscall_wrapper.h to the new syscall stub naming convention · c76fc982
      Dominik Brodowski 提交于
      Make the code in syscall_wrapper.h more readable by naming the stub macros
      similar to the stub they provide. While at it, fix a stray newline at the
      end of the __IA32_COMPAT_SYS_STUBx macro.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-5-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c76fc982
    • D
      syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*() · d5a00528
      Dominik Brodowski 提交于
      This rename allows us to have a coherent syscall stub naming convention on
      64-bit x86 (0xffffffff prefix removed):
      
       810f0af0 t            kernel_waitid	# common (32/64) kernel helper
      
       <inline>            __do_sys_waitid	# inlined helper doing actual work
       810f0be0 t          __se_sys_waitid	# C func calling inlined helper
      
       <inline>     __do_compat_sys_waitid	# inlined helper doing actual work
       810f0d80 t   __se_compat_sys_waitid	# compat C func calling inlined helper
      
       810f2080 T         __x64_sys_waitid	# x64 64-bit-ptregs -> C stub
       810f20b0 T        __ia32_sys_waitid	# ia32 32-bit-ptregs -> C stub[*]
       810f2470 T __ia32_compat_sys_waitid	# ia32 32-bit-ptregs -> compat C stub
       810f2490 T  __x32_compat_sys_waitid	# x32 64-bit-ptregs -> compat C stub
      
          [*] This stub is unused, as the syscall table links
      	__ia32_compat_sys_waitid instead of __ia32_sys_waitid as we need
      	a compat variant here.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-4-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d5a00528
    • D
      syscalls/core, syscalls/x86: Clean up compat syscall stub naming convention · 5ac9efa3
      Dominik Brodowski 提交于
      Tidy the naming convention for compat syscall subs. Hints which describe
      the purpose of the stub go in front and receive a double underscore to
      denote that they are generated on-the-fly by the COMPAT_SYSCALL_DEFINEx()
      macro.
      
      For the generic case, this means:
      
      t            kernel_waitid	# common C function (see kernel/exit.c)
      
          __do_compat_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
      T   __se_compat_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long,
      				# casts them to unsigned long and then to
      				# the declared type)
      
      T        compat_sys_waitid      # alias to __se_compat_sys_waitid()
      				# (taking parameters as declared), to
      				# be included in syscall table
      
      For x86, the naming is as follows:
      
      t            kernel_waitid	# common C function (see kernel/exit.c)
      
          __do_compat_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
      t   __se_compat_sys_waitid      # sign-extending C function calling inlined
      				# helper (takes parameters of type long,
      				# casts them to unsigned long and then to
      				# the declared type)
      
      T __ia32_compat_sys_waitid	# IA32_EMULATION 32-bit-ptregs -> C stub,
      				# calls __se_compat_sys_waitid(); to be
      				# included in syscall table
      
      T  __x32_compat_sys_waitid	# x32 64-bit-ptregs -> C stub, calls
      				# __se_compat_sys_waitid(); to be included
      				# in syscall table
      
      If only one of IA32_EMULATION and x32 is enabled, __se_compat_sys_waitid()
      may be inlined into the stub __{ia32,x32}_compat_sys_waitid().
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-3-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5ac9efa3
    • D
      syscalls/core, syscalls/x86: Clean up syscall stub naming convention · e145242e
      Dominik Brodowski 提交于
      Tidy the naming convention for compat syscall subs. Hints which describe
      the purpose of the stub go in front and receive a double underscore to
      denote that they are generated on-the-fly by the SYSCALL_DEFINEx() macro.
      
      For the generic case, this means (0xffffffff prefix removed):
      
       810f08d0 t     kernel_waitid	# common C function (see kernel/exit.c)
      
       <inline>     __do_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
       810f1aa0 T   __se_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long;
      				# casts them to the declared type)
      
       810f1aa0 T        sys_waitid	# alias to __se_sys_waitid() (taking
      				# parameters as declared), to be included
      				# in syscall table
      
      For x86, the naming is as follows:
      
       810efc70 t     kernel_waitid	# common C function (see kernel/exit.c)
      
       <inline>     __do_sys_waitid	# inlined helper doing the actual work
      				# (takes original parameters as declared)
      
       810efd60 t   __se_sys_waitid	# sign-extending C function calling inlined
      				# helper (takes parameters of type long;
      				# casts them to the declared type)
      
       810f1140 T __ia32_sys_waitid	# IA32_EMULATION 32-bit-ptregs -> C stub,
      				# calls __se_sys_waitid(); to be included
      				# in syscall table
      
       810f1110 T        sys_waitid	# x86 64-bit-ptregs -> C stub, calls
      				# __se_sys_waitid(); to be included in
      				# syscall table
      
      For x86, sys_waitid() will be re-named to __x64_sys_waitid in a follow-up
      patch.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180409105145.5364-2-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e145242e
  14. 05 4月, 2018 4 次提交
    • D
      syscalls/x86: Unconditionally enable 'struct pt_regs' based syscalls on x86_64 · f8781c4a
      Dominik Brodowski 提交于
      Removing CONFIG_SYSCALL_PTREGS from arch/x86/Kconfig and simply selecting
      ARCH_HAS_SYSCALL_WRAPPER unconditionally on x86-64 allows us to simplify
      several codepaths.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-7-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f8781c4a
    • D
      syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32 · ebeb8c82
      Dominik Brodowski 提交于
      Extend ARCH_HAS_SYSCALL_WRAPPER for i386 emulation and for x32 on 64-bit
      x86.
      
      For x32, all we need to do is to create an additional stub for each
      compat syscall which decodes the parameters in x86-64 ordering, e.g.:
      
      	asmlinkage long __compat_sys_x32_xyzzy(struct pt_regs *regs)
      	{
      		return c_SyS_xyzzy(regs->di, regs->si, regs->dx);
      	}
      
      For i386 emulation, we need to teach compat_sys_*() to take struct
      pt_regs as its only argument, e.g.:
      
      	asmlinkage long __compat_sys_ia32_xyzzy(struct pt_regs *regs)
      	{
      		return c_SyS_xyzzy(regs->bx, regs->cx, regs->dx);
      	}
      
      In addition, we need to create additional stubs for common syscalls
      (that is, for syscalls which have the same parameters on 32-bit and
      64-bit), e.g.:
      
      	asmlinkage long __sys_ia32_xyzzy(struct pt_regs *regs)
      	{
      		return c_sys_xyzzy(regs->bx, regs->cx, regs->dx);
      	}
      
      This approach avoids leaking random user-provided register content down
      the call chain.
      
      This patch is based on an original proof-of-concept
      
       | From: Linus Torvalds <torvalds@linux-foundation.org>
       | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      and was split up and heavily modified by me, in particular to base it on
      ARCH_HAS_SYSCALL_WRAPPER.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-6-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ebeb8c82
    • D
      syscalls/x86: Use 'struct pt_regs' based syscall calling convention for 64-bit syscalls · fa697140
      Dominik Brodowski 提交于
      Let's make use of ARCH_HAS_SYSCALL_WRAPPER=y on pure 64-bit x86-64 systems:
      
      Each syscall defines a stub which takes struct pt_regs as its only
      argument. It decodes just those parameters it needs, e.g:
      
      	asmlinkage long sys_xyzzy(const struct pt_regs *regs)
      	{
      		return SyS_xyzzy(regs->di, regs->si, regs->dx);
      	}
      
      This approach avoids leaking random user-provided register content down
      the call chain.
      
      For example, for sys_recv() which is a 4-parameter syscall, the assembly
      now is (in slightly reordered fashion):
      
      	<sys_recv>:
      		callq	<__fentry__>
      
      		/* decode regs->di, ->si, ->dx and ->r10 */
      		mov	0x70(%rdi),%rdi
      		mov	0x68(%rdi),%rsi
      		mov	0x60(%rdi),%rdx
      		mov	0x38(%rdi),%rcx
      
      		[ SyS_recv() is automatically inlined by the compiler,
      		  as it is not [yet] used anywhere else ]
      		/* clear %r9 and %r8, the 5th and 6th args */
      		xor	%r9d,%r9d
      		xor	%r8d,%r8d
      
      		/* do the actual work */
      		callq	__sys_recvfrom
      
      		/* cleanup and return */
      		cltq
      		retq
      
      The only valid place in an x86-64 kernel which rightfully calls
      a syscall function on its own -- vsyscall -- needs to be modified
      to pass struct pt_regs onwards as well.
      
      To keep the syscall table generation working independent of
      SYSCALL_PTREGS being enabled, the stubs are named the same as the
      "original" syscall stubs, i.e. sys_*().
      
      This patch is based on an original proof-of-concept
      
       | From: Linus Torvalds <torvalds@linux-foundation.org>
       | Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      and was split up and heavily modified by me, in particular to base it on
      ARCH_HAS_SYSCALL_WRAPPER, to limit it to 64-bit-only for the time being,
      and to update the vsyscall to the new calling convention.
      Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180405095307.3730-4-linux@dominikbrodowski.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fa697140
    • S
      x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of... · 162ee5a8
      Sai Praneeth 提交于
      x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of this_cpu_has() in build_cr3_noflush()
      
      Linus reported the following boot warning:
      
        WARNING: CPU: 0 PID: 0 at arch/x86/include/asm/tlbflush.h:134 load_new_mm_cr3+0x114/0x170
        [...]
        Call Trace:
        switch_mm_irqs_off+0x267/0x590
        switch_mm+0xe/0x20
        efi_switch_mm+0x3e/0x50
        efi_enter_virtual_mode+0x43f/0x4da
        start_kernel+0x3bf/0x458
        secondary_startup_64+0xa5/0xb0
      
      ... after merging:
      
        03781e40: x86/efi: Use efi_switch_mm() rather than manually twiddling with %cr3
      
      When the platform supports PCID and if CONFIG_DEBUG_VM=y is enabled,
      build_cr3_noflush() (called via switch_mm()) does a sanity check to see
      if X86_FEATURE_PCID is set.
      
      Presently, build_cr3_noflush() uses "this_cpu_has(X86_FEATURE_PCID)" to
      perform the check but this_cpu_has() works only after SMP is initialized
      (i.e. per cpu cpu_info's should be populated) and this happens to be very
      late in the boot process (during rest_init()).
      
      As efi_runtime_services() are called during (early) kernel boot time
      and run time, modify build_cr3_noflush() to use boot_cpu_has() all the
      time. As suggested by Dave Hansen, this should be OK because all CPU's have
      same capabilities on x86.
      
      With this change the warning is fixed.
      
      ( Dave also suggested that we put a warning in this_cpu_has() if it's used
        early in the boot process. This is still work in progress as it affects
        MCE. )
      Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Lee Chun-Yi <jlee@suse.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Ricardo Neri <ricardo.neri@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1522870459-7432-1-git-send-email-sai.praneeth.prakhya@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      162ee5a8
  15. 03 4月, 2018 4 次提交
  16. 29 3月, 2018 1 次提交
    • L
      KVM: x86: Rename interrupt.pending to interrupt.injected · 04140b41
      Liran Alon 提交于
      For exceptions & NMIs events, KVM code use the following
      coding convention:
      *) "pending" represents an event that should be injected to guest at
      some point but it's side-effects have not yet occurred.
      *) "injected" represents an event that it's side-effects have already
      occurred.
      
      However, interrupts don't conform to this coding convention.
      All current code flows mark interrupt.pending when it's side-effects
      have already taken place (For example, bit moved from LAPIC IRR to
      ISR). Therefore, it makes sense to just rename
      interrupt.pending to interrupt.injected.
      
      This change follows logic of previous commit 664f8e26 ("KVM: X86:
      Fix loss of exception which has not yet been injected") which changed
      exception to follow this coding convention as well.
      
      It is important to note that in case !lapic_in_kernel(vcpu),
      interrupt.pending usage was and still incorrect.
      In this case, interrrupt.pending can only be set using one of the
      following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
      KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
      QEMU uses them either to re-set an "interrupt.pending" state it has
      received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
      via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
      from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR
      before sending ioctl to KVM. So it seems that indeed "interrupt.pending"
      in this case is also suppose to represent "interrupt.injected".
      However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
      is misusing (now named) interrupt.injected in order to return if
      there is a pending interrupt.
      This leads to nVMX/nSVM not be able to distinguish if it should exit
      from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
      re-inject an injected interrupt.
      Therefore, add a FIXME at these functions for handling this issue.
      
      This patch introduce no semantics change.
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      04140b41