1. 15 May 2018, 1 commit
  2. 13 May 2018, 1 commit
  3. 08 May 2018, 1 commit
    • x86/xen: Reset VCPU0 info pointer after shared_info remap · d1ecfa9d
      Committed by van der Linden, Frank
      This patch fixes crashes during boot for HVM guests on older (pre HVM
      vector callback) Xen versions. Without this, current kernels will always
      fail to boot on those Xen versions.
      
      Sample stack trace:
      
         BUG: unable to handle kernel paging request at ffffffffff200000
         IP: __xen_evtchn_do_upcall+0x1e/0x80
         PGD 1e0e067 P4D 1e0e067 PUD 1e10067 PMD 235c067 PTE 0
          Oops: 0002 [#1] SMP PTI
         Modules linked in:
         CPU: 0 PID: 512 Comm: kworker/u2:0 Not tainted 4.14.33-52.13.amzn1.x86_64 #1
         Hardware name: Xen HVM domU, BIOS 3.4.3.amazon 11/11/2016
         task: ffff88002531d700 task.stack: ffffc90000480000
         RIP: 0010:__xen_evtchn_do_upcall+0x1e/0x80
         RSP: 0000:ffff880025403ef0 EFLAGS: 00010046
         RAX: ffffffff813cc760 RBX: ffffffffff200000 RCX: ffffc90000483ef0
         RDX: ffff880020540a00 RSI: ffff880023c78000 RDI: 000000000000001c
         RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
         R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
         R13: ffff880025403f5c R14: 0000000000000000 R15: 0000000000000000
         FS:  0000000000000000(0000) GS:ffff880025400000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: ffffffffff200000 CR3: 0000000001e0a000 CR4: 00000000000006f0
          Call Trace:
         <IRQ>
         do_hvm_evtchn_intr+0xa/0x10
         __handle_irq_event_percpu+0x43/0x1a0
         handle_irq_event_percpu+0x20/0x50
         handle_irq_event+0x39/0x60
         handle_fasteoi_irq+0x80/0x140
         handle_irq+0xaf/0x120
         do_IRQ+0x41/0xd0
         common_interrupt+0x7d/0x7d
         </IRQ>
      
      During boot, the HYPERVISOR_shared_info page gets remapped to make it work
      with KASLR. This means that any pointer derived from it needs to be
      adjusted.
      
      The only value that this applies to is the vcpu_info pointer for VCPU 0.
      For PV and HVM with the callback vector feature, this gets done via the
      smp_ops prepare_boot_cpu callback. Older Xen versions do not support the
      HVM callback vector, so there is no Xen-specific smp_ops set up in that
      scenario. So, the vcpu_info pointer for VCPU 0 never gets set to the proper
      value, and the first reference of it will be bad. Fix this by resetting it
      immediately after the remap.
      Signed-off-by: Frank van der Linden <fllinden@amazon.com>
      Reviewed-by: Eduardo Valentin <eduval@amazon.com>
      Reviewed-by: Alakesh Haloi <alakeshh@amazon.com>
      Reviewed-by: Vallish Vaidyeshwara <vallish@amazon.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: xen-devel@lists.xenproject.org
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      d1ecfa9d
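      A minimal sketch of the fix described above. The function and helper names
      (xen_hvm_init_mem_mapping(), xen_vcpu_info_reset()) follow the Xen HVM setup
      code, but treat the exact placement as an assumption, not the verbatim patch:

          /* arch/x86/xen/enlighten_hvm.c (sketch) */
          static void __init xen_hvm_init_mem_mapping(void)
          {
                  early_memunmap(HYPERVISOR_shared_info, PAGE_SIZE);
                  HYPERVISOR_shared_info = __va(PFN_PHYS(shared_info_pfn));

                  /*
                   * The virtual address of the shared_info page has changed, so
                   * the vcpu_info pointer for VCPU 0 is now stale.
                   *
                   * The prepare_boot_cpu callback would normally re-initialize it,
                   * but that callback is not set up on Xen versions without the
                   * HVM callback vector, so reset the pointer here.
                   */
                  xen_vcpu_info_reset(0);
          }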
  4. 06 May 2018, 1 commit
    • KVM: x86: remove APIC Timer periodic/oneshot spikes · ecf08dad
      Committed by Anthoine Bourgeois
      Since the commit "8003c9ae: add APIC Timer periodic/oneshot mode VMX
      preemption timer support", a Windows 10 guest has some erratic timer
      spikes.
      
      Here are the results of 150000 iterations of a 1ms timer without any load:
                    Before 8003c9ae | After 8003c9ae
      Max                   1834us  |        86000us
      Mean                  1100us  |         1021us
      Deviation               59us  |          149us

      Here are the results of 150000 iterations of a 1ms timer with a cpu-z stress test:
                    Before 8003c9ae | After 8003c9ae
      Max                  32000us  |       140000us
      Mean                  1006us  |         1997us
      Deviation              140us  |        11095us
      
      The root cause of the problem is that starting an hrtimer with an expiry
      time already in the past can take more than 20 milliseconds to trigger
      the timer function.  This can be solved by forwarding such past timers
      immediately, rather than submitting them to hrtimer_start().
      In case the timer is periodic, update the target expiration and call
      hrtimer_start() with it.
      
      v2: Check if the tsc deadline is already expired. Thank you Mika.
      v3: Execute the past timers immediately rather than submitting them to
      hrtimer_start().
      v4: Rearm the periodic timer with advance_periodic_target_expiration() a
      simpler version of set_target_expiration(). Thank you Paolo.
      
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Anthoine Bourgeois <anthoine.bourgeois@blade-group.com>
      Fixes: 8003c9ae ("KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support")
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      ecf08dad
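      A simplified sketch of the approach: when (re)arming the software LAPIC
      timer, an already-expired deadline is handled directly instead of being
      handed to hrtimer_start() (function names follow arch/x86/kvm/lapic.c;
      this is an illustration, not the exact patch):

          /* arch/x86/kvm/lapic.c (sketch) */
          static void start_sw_period(struct kvm_lapic *apic)
          {
                  if (!apic->lapic_timer.period)
                          return;

                  if (ktime_after(ktime_get(), apic->lapic_timer.target_expiration)) {
                          /* Deadline already passed: fire now instead of asking
                           * hrtimer to expire a timer that is already late. */
                          apic_timer_expired(apic);

                          if (apic_lvtt_oneshot(apic))
                                  return;

                          /* Periodic mode: rearm for the next period. */
                          advance_periodic_target_expiration(apic);
                  }

                  hrtimer_start(&apic->lapic_timer.timer,
                                apic->lapic_timer.target_expiration,
                                HRTIMER_MODE_ABS_PINNED);
          }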
  5. 03 May 2018, 2 commits
    • bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Committed by Daniel Borkmann
      The JIT logic in jit_subprogs() is as follows: for all subprogs we
      allocate a bpf_prog via bpf_prog_alloc(), populate it (prog->is_func = 1
      here), and pass it to bpf_int_jit_compile(). If a failure occurred
      during JIT and prog->jited is not set, then we bail out from attempting
      to JIT the whole program, and punt to the interpreter instead. If
      JITing went successfully, we fix up BPF call offsets and do another
      pass through bpf_int_jit_compile() (extra_pass is true at that point)
      to complete JITing of the calls. Since that requires passing the JIT
      context around, addrs and jit_data from the x86 JIT are freed in the
      extra_pass in bpf_int_jit_compile() when calls are involved (if not,
      they can be freed immediately). However, if the JIT image did not
      converge in the original pass, then image is NULL while prog->is_func
      is set and extra_pass is false, so both addrs and jit_data become
      unreachable and are never cleaned up. Therefore, free them on !image as
      well (see the sketch at the end of this entry). Only the x64 JIT is
      affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      39f56ca9
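      A sketch of the cleanup path in question; the added condition is the !image
      check, the rest mirrors the existing structure of bpf_int_jit_compile()
      (simplified, not the verbatim diff):

          /* arch/x86/net/bpf_jit_comp.c, tail of bpf_int_jit_compile() (sketch) */
          if (!image || !prog->is_func || extra_pass) {
                  /*
                   * No extra pass will run (either the image did not converge,
                   * or there are no calls to fix up), so addrs/jit_data would
                   * otherwise be leaked.
                   */
                  kfree(addrs);
                  kfree(jit_data);
                  prog->aux->jit_data = NULL;
          } else {
                  /* Keep the context around for the extra call-fixup pass. */
                  jit_data->addrs = addrs;
                  jit_data->ctx = ctx;
                  jit_data->proglen = proglen;
                  jit_data->image = image;
                  jit_data->header = header;
          }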
    • bpf, x64: fix memleak when not converging after image · 3aab8884
      Committed by Daniel Borkmann
      While reviewing x64 JIT code, I noticed that we leak the previously
      allocated JIT image in the case where proglen != oldproglen during the
      JIT passes. Prior to commit e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF
      JIT compiler") we would just break out of the loop and use the image as
      the JITed prog, since it could only shrink in size anyway. After
      e0ee9c12, we bail out to the out_addrs label, where we free addrs and
      jit_data but not the image coming from bpf_jit_binary_alloc().
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3aab8884
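      A sketch of the corrected error path: when the pass lengths disagree, the
      allocated binary image is freed along with addrs and jit_data (simplified
      from the x64 JIT main loop; not the exact diff):

          /* arch/x86/net/bpf_jit_comp.c (sketch) */
          for (pass = 0; pass < 20 || image; pass++) {
                  proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
                  if (proglen <= 0) {
          out_image:
                          image = NULL;
                          if (header)
                                  bpf_jit_binary_free(header);  /* free the image too */
                          prog = orig_prog;
                          goto out_addrs;
                  }
                  if (image) {
                          if (proglen != oldproglen) {
                                  pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
                                         proglen, oldproglen);
                                  goto out_image;               /* was: goto out_addrs */
                          }
                          break;
                  }
                  if (proglen == oldproglen) {
                          header = bpf_jit_binary_alloc(proglen, &image,
                                                        1, jit_fill_hole);
                          if (!header) {
                                  prog = orig_prog;
                                  goto out_addrs;
                          }
                  }
                  oldproglen = proglen;
          }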
  6. 02 May 2018, 3 commits
  7. 28 April 2018, 1 commit
  8. 27 April 2018, 5 commits
    • kvm: apic: Flush TLB after APIC mode/address change if VPIDs are in use · a468f2db
      Committed by Junaid Shahid
      Currently, KVM flushes the TLB after a change to the APIC access page
      address or the APIC mode when EPT mode is enabled. However, even in
      shadow paging mode, a TLB flush is needed if VPIDs are being used, as
      specified in the Intel SDM Section 29.4.5.
      
      So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will
      flush if either EPT or VPIDs are in use.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      a468f2db
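      For reference, a simplified sketch of why vmx_flush_tlb() is the
      appropriate call: unlike the EPT-only variant, it also invalidates
      VPID-tagged linear mappings when shadow paging is used with VPIDs
      (field and helper names follow the VMX code of that era; abridged):

          /* arch/x86/kvm/vmx.c (sketch) */
          static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
                                             bool invalidate_gpa)
          {
                  if (enable_ept && (invalidate_gpa || !enable_vpid)) {
                          /* EPT in use: flush guest-physical mappings. */
                          ept_sync_context(construct_eptp(vcpu, vcpu->arch.mmu.root_hpa));
                  } else {
                          /* Shadow paging + VPID: linear mappings are tagged with
                           * the VPID and must be invalidated explicitly. */
                          vpid_sync_context(vpid);
                  }
          }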
    • x86/entry/64/compat: Preserve r8-r11 in int $0x80 · 8bb2610b
      Committed by Andy Lutomirski
      32-bit user code that uses int $0x80 doesn't care about r8-r11.  There is,
      however, some 64-bit user code that intentionally uses int $0x80 to invoke
      32-bit system calls.  From what I've seen, basically all such code assumes
      that r8-r15 are all preserved, but the kernel clobbers r8-r11.  Since I
      doubt that there's any code that depends on int $0x80 zeroing r8-r11,
      change the kernel to preserve them.
      
      I suspect that very little user code is broken by the old clobber, since
      r8-r11 are only rarely allocated by gcc, and they're clobbered by function
      calls, so the only way we'd see a problem is if the same function that
      invokes int $0x80 also spills something important to one of these
      registers.
      
      The current behavior seems to date back to the historical commit
      "[PATCH] x86-64 merge for 2.6.4".  Before that, all regs were
      preserved.  I can't find any explanation of why this change was made.
      
      Update the test_syscall_vdso_32 testcase as well to verify the new
      behavior, and strengthen the test to make sure that the kernel doesn't
      accidentally permute r8..r15.
      Suggested-by: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Link: https://lkml.kernel.org/r/d4c4d9985fbe64f8c9e19291886453914b48caee.1523975710.git.luto@kernel.org
      8bb2610b
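      A hypothetical userspace check in the spirit of the strengthened
      test_syscall_vdso_32 case: call a no-argument 32-bit syscall via int $0x80
      from 64-bit code and verify that r8-r11 come back unchanged (the marker
      values and the use of getpid are illustrative only):

          #include <stdio.h>

          int main(void)
          {
                  register unsigned long r8  asm("r8")  = 0x1111111111111111UL;
                  register unsigned long r9  asm("r9")  = 0x2222222222222222UL;
                  register unsigned long r10 asm("r10") = 0x3333333333333333UL;
                  register unsigned long r11 asm("r11") = 0x4444444444444444UL;
                  long ret;

                  /* 20 is __NR_getpid in the 32-bit syscall table. */
                  asm volatile("int $0x80"
                               : "=a" (ret), "+r" (r8), "+r" (r9), "+r" (r10), "+r" (r11)
                               : "0" (20L)
                               : "memory");

                  printf("getpid() via int $0x80 -> %ld\n", ret);
                  printf("r8-r11 preserved: %s\n",
                         (r8 == 0x1111111111111111UL && r9 == 0x2222222222222222UL &&
                          r10 == 0x3333333333333333UL && r11 == 0x4444444444444444UL)
                         ? "yes" : "no");
                  return 0;
          }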
    • x86/ipc: Fix x32 version of shmid64_ds and msqid64_ds · 1a512c08
      Committed by Arnd Bergmann
      A bugfix broke the x32 shmid64_ds and msqid64_ds data structure layout
      (as seen from user space) a few years ago: Originally, __BITS_PER_LONG
      was defined as 64 on x32, so we did not have padding after the 64-bit
      __kernel_time_t fields. After __BITS_PER_LONG got changed to 32,
      applications would observe extra padding.
      
      In other parts of the uapi headers we seem to have a mix of assumptions,
      expecting __BITS_PER_LONG to be either 32 or 64 for x32 applications, so
      we can't easily revert the patch that broke these two structures.
      
      Instead, this patch decouples x32 from the other architectures and moves
      it back into arch specific headers, partially reverting the even older
      commit 73a2d096 ("x86: remove all now-duplicate header files").
      
      It's not clear whether this ever made any difference, since at least
      glibc carries its own (correct) copy of both of these header files,
      so possibly no application has ever observed the definitions here.
      
      Based on a suggestion from H.J. Lu, I tried out the tool from
      https://github.com/hjl-tools/linux-header to find other such
      bugs, which pointed out the same bug in statfs(), which also has
      a separate (correct) copy in glibc.
      
      Fixes: f4b4aae1 ("x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . J . Lu" <hjl.tools@gmail.com>
      Cc: Jeffrey Walton <noloader@gmail.com>
      Cc: stable@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20180424212013.3967461-1-arnd@arndb.de
      1a512c08
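      For context, a rough sketch of the generic (pre-y2038) shmid64_ds layout:
      padding after the 64-bit time fields exists only when __BITS_PER_LONG != 64,
      which is why flipping that macro from 64 to 32 on x32 changed the layout
      seen by applications. The fix gives x32 an arch-specific header without
      this padding (abridged; not the exact header contents):

          /* include/uapi/asm-generic/shmbuf.h (abridged sketch) */
          struct shmid64_ds {
                  struct ipc64_perm       shm_perm;       /* operation perms */
                  size_t                  shm_segsz;      /* size of segment (bytes) */
                  __kernel_time_t         shm_atime;      /* last attach time */
          #if __BITS_PER_LONG != 64
                  unsigned long           __unused1;      /* padding that appeared on x32 */
          #endif
                  __kernel_time_t         shm_dtime;      /* last detach time */
          #if __BITS_PER_LONG != 64
                  unsigned long           __unused2;
          #endif
                  __kernel_time_t         shm_ctime;      /* last change time */
          #if __BITS_PER_LONG != 64
                  unsigned long           __unused3;
          #endif
                  __kernel_pid_t          shm_cpid;       /* pid of creator */
                  __kernel_pid_t          shm_lpid;       /* pid of last operator */
                  __kernel_ulong_t        shm_nattch;     /* no. of current attaches */
                  __kernel_ulong_t        __unused4;
                  __kernel_ulong_t        __unused5;
          };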
    • x86/setup: Do not reserve a crash kernel region if booted on Xen PV · 3db3eb28
      Committed by Petr Tesarik
      Xen PV domains cannot shut down and start a crash kernel. Instead,
      the crashing kernel makes a SCHEDOP_shutdown hypercall with the
      reason code SHUTDOWN_crash, cf. xen_crash_shutdown() machine op in
      arch/x86/xen/enlighten_pv.c.
      
      A crash kernel reservation is merely a waste of RAM in this case. It
      may also confuse users of kexec_load(2) and/or kexec_file_load(2).
      When flags include KEXEC_ON_CRASH or KEXEC_FILE_ON_CRASH,
      respectively, these syscalls return success, which is technically
      correct, but the crash kexec image will never be actually used.
      Signed-off-by: Petr Tesarik <ptesarik@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: xen-devel@lists.xenproject.org
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jean Delvare <jdelvare@suse.de>
      Link: https://lkml.kernel.org/r/20180425120835.23cef60c@ezekiel.suse.cz
      3db3eb28
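      A minimal sketch of the check this implies in reserve_crashkernel() (the
      message text and exact placement are assumptions; the point is simply to
      return early for Xen PV domains):

          /* arch/x86/kernel/setup.c (sketch) */
          static void __init reserve_crashkernel(void)
          {
                  /* ... parse crashkernel= from the command line ... */

                  if (xen_pv_domain()) {
                          pr_info("Ignoring crashkernel for a Xen PV domain\n");
                          return;
                  }

                  /* ... find and reserve the memory region as before ... */
          }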
    • x86/cpu/intel: Add missing TLB cpuid values · b837913f
      Committed by jacek.tomaka@poczta.fm
      Make the kernel print the correct number of TLB entries on Intel Xeon
      Phi 7210 (and others).
      
      Before:
      [ 0.320005] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
      After:
      [ 0.320005] Last level dTLB entries: 4KB 256, 2MB 128, 4MB 128, 1GB 16
      
      The entries do exist in the official Intel SDM, but the type column there
      is incorrect (it states "Cache" where it should read "TLB"); the entries
      for the values 0x6B, 0x6C and 0x6D are nevertheless correctly described
      as 'Data TLB'.
      Signed-off-by: Jacek Tomaka <jacek.tomaka@poczta.fm>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20180423161425.24366-1-jacekt@dugeo.com
      b837913f
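      The added descriptors, matching the counts in the dmesg output above, look
      roughly like this (sketch of the intel_tlb_table[] additions; descriptive
      strings approximate):

          /* arch/x86/kernel/cpu/intel.c (sketch) */
          static const struct _tlb_table intel_tlb_table[] = {
                  /* ... existing descriptors ... */
                  { 0x6b, TLB_DATA_4K,    256, " TLB_DATA 4 KByte pages, 8-way associative" },
                  { 0x6c, TLB_DATA_2M_4M, 128, " TLB_DATA 2 MByte or 4 MByte pages, 8-way associative" },
                  { 0x6d, TLB_DATA_1G,     16, " TLB_DATA 1 GByte pages, fully associative" },
                  /* ... */
                  { 0x00, 0, 0 }
          };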
  9. 26 April 2018, 6 commits
  10. 25 April 2018, 7 commits
    • bpf, x64: fix JIT emission for dead code · 1612a981
      Committed by Gianluca Borello
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One instance of such
      programs is, for example:
      
      bool flag = false
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something()
      
      In some cases llvm is not able to remove at compile time the code for
      do_something(), so the generated BPF program ends up with a large amount
      of dead instructions. In one specific real life example, there are two
      series of ~500 and ~1000 dead instructions in the program. When the
      verifier replaces them with a series of ja-1 instructions, it causes an
      interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5-byte JMP
      instructions (0xE9), since the jump offsets become increasingly large
      (> 127) as each instruction gets discovered to be 5 bytes instead of the
      estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2-byte JMPs (0xEB) because the jump offsets
      become <= 127 this time. In particular, N is defined as roughly
      127 / (5 - 2) ~= 42. So, each further pass will make the subsequent N
      JMP instructions shrink from 5 to 2 bytes, making the image shrink every
      time. This means that in order to have the entire program converge,
      there need to be, in the real example above, at least ~1000 / 42 ~= 24
      passes just for translating the dead code. If we add this number to the
      passes needed to translate the other non-dead code, it brings such a
      program to 40+ passes, and the JIT doesn't complete. Ultimately the
      userspace loader fails because such a BPF program was supposed to be
      part of a prog array owner being JITed.
      
      While it is certainly possible to try to refactor such programs to help
      the compiler remove dead code, the behavior is not really intuitive and it
      puts further burden on the BPF developer who is not expecting such
      behavior. To make things worse, such programs are working just fine in all
      the kernel releases prior to the ja-1 fix.
      
      A possible approach to mitigate this behavior consists in noticing that
      for ja-1 instructions we don't really need to rely on the estimated size
      of the previous and current instructions; we know that a -1 BPF jump
      offset can be safely translated into a 0xEB instruction with a jump
      offset of -2.
      
      This fix brings the BPF program in the previous example back to
      completion in ~9 passes.
      
      Fixes: 2a5418a1 ("bpf: improve dead code sanitizing")
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1612a981
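      A sketch of the resulting BPF_JA handling in the x64 JIT's do_jit(): the -1
      offset is special-cased so it always emits a 2-byte short jump, independent
      of the size estimates (simplified from the emitter; not the exact diff):

          /* arch/x86/net/bpf_jit_comp.c, do_jit() (sketch) */
          case BPF_JMP | BPF_JA:
                  if (insn->off == -1)
                          /* A BPF ja-1 jumps to itself; once emitted as a 2-byte
                           * x86 jmp that is a relative offset of -2, so hard-code
                           * it instead of deriving it from the size estimates. */
                          jmp_offset = -2;
                  else
                          jmp_offset = addrs[i + insn->off] - addrs[i];

                  if (!jmp_offset)
                          /* Optimize out nop jumps. */
                          break;

                  if (is_imm8(jmp_offset))
                          EMIT2(0xEB, jmp_offset);          /* 2-byte short jmp */
                  else if (is_simm32(jmp_offset))
                          EMIT1_off32(0xE9, jmp_offset);    /* 5-byte near jmp */
                  else
                          return -EFAULT;
                  break;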
    • tracing/x86: Update syscall trace events to handle new prefixed syscall func names · 1c758a22
      Committed by Steven Rostedt (VMware)
      Arnaldo noticed that the latest kernel is missing the syscall event system
      directory in x86. I bisected it down to d5a00528 ("syscalls/core,
      syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()").
      
      The system call trace events are special, as there is only one trace
      event for all system calls (the raw_syscalls). But a macro that wraps
      the system calls creates metadata for them, recording the name that is
      used to find the corresponding entry (the number) in the system call
      table. At boot up, it does a kallsyms lookup of the system call table to
      find the function that maps to the metadata of the system call. If it
      does not find a function, then that system call is ignored.
      
      Because the x86 system calls had "__x64_" or "__ia32_" prefixed to the
      "sys" in their names, they no longer match the default compare
      algorithm. As this was already a problem for PowerPC, the algorithm can
      be overridden by the architecture. The solution is to have x86 provide
      its own compare algorithm, which brings back the system call trace
      events (see the sketch at the end of this entry).
      
      Link: http://lkml.kernel.org/r/20180417174128.0f3457f0@gandalf.local.home
      Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Dominik Brodowski <linux@dominikbrodowski.net>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Fixes: d5a00528 ("syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1c758a22
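      A sketch of the x86-specific compare this describes: accept the plain
      "sys_" name as well as the "__x64_sys_" and "__ia32_sys_" prefixed forms
      (modeled on the ftrace syscall-matching hook; treat the exact offsets as
      an assumption):

          /* arch/x86/include/asm/ftrace.h (sketch) */
          #define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
          static inline bool arch_syscall_match_sym_name(const char *sym,
                                                         const char *name)
          {
                  /*
                   * Compare the symbol name with the system call name, skipping
                   * the "__x64_sys", "__ia32_sys" or plain "sys" prefix.
                   */
                  return !strcmp(sym + 3, name + 3) ||
                         (!strncmp(sym, "__x64_", 6)  && !strcmp(sym + 9,  name + 3)) ||
                         (!strncmp(sym, "__ia32_", 7) && !strcmp(sym + 10, name + 3));
          }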
    • x86/pti: Filter at vma->vm_page_prot population · 316d097c
      Committed by Dave Hansen
      commit ce9962bf7e22bb3891655c349faff618922d4a73
      
      0day reported warnings at boot on 32-bit systems without NX support:
      
      attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff
      WARNING: CPU: 0 PID: 1 at
      arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0:
       check_pgprot at arch/x86/include/asm/pgtable.h:535
       (inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549
       (inlined by) do_anonymous_page at mm/memory.c:3169
       (inlined by) handle_pte_fault at mm/memory.c:3961
       (inlined by) __handle_mm_fault at mm/memory.c:4087
       (inlined by) handle_mm_fault at mm/memory.c:4124
      
      The problem is that due to the recent commit which removed auto-massaging
      of page protections, filtering page permissions at PTE creation time is
      no longer done, so vma->vm_page_prot is passed unfiltered to PTE creation.
      
      Filter the page protections before they are installed in vma->vm_page_prot.
      
      Fixes: fb43d6cb ("x86/mm: Do not auto-massage page protections")
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com
      
      316d097c
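      A sketch of the filtering approach: the architecture provides a hook that
      vm_get_page_prot() runs the final protection value through, so unsupported
      bits never land in vma->vm_page_prot (names modeled on the
      arch_filter_pgprot()/canon_pgprot() helpers; simplified):

          /* arch/x86/include/asm/pgtable.h (sketch) */
          #define arch_filter_pgprot arch_filter_pgprot
          static inline pgprot_t arch_filter_pgprot(pgprot_t prot)
          {
                  /* Mask off bits the CPU does not support, e.g. NX on !NX hardware. */
                  return canon_pgprot(prot);
          }

          /* mm/mmap.c (sketch) */
          pgprot_t vm_get_page_prot(unsigned long vm_flags)
          {
                  pgprot_t ret = __pgprot(pgprot_val(protection_map[vm_flags &
                                          (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
                                  pgprot_val(arch_vm_get_page_prot(vm_flags)));

                  return arch_filter_pgprot(ret);
          }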
    • x86/pti: Disallow global kernel text with RANDSTRUCT · b7c21bc5
      Committed by Dave Hansen
      commit 26d35ca6c3776784f8156e1d6f80cc60d9a2a915
      
      RANDSTRUCT derives its hardening benefits from the attacker's lack of
      knowledge about the layout of kernel data structures.  Keep the kernel
      image non-global in cases where RANDSTRUCT is in use to help keep the
      layout a secret.
      
      Fixes: 8c06c774 (x86/pti: Leave kernel text global for !PCID)
      Reported-by: Kees Cook <keescook@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: https://lkml.kernel.org/r/20180420222026.D0B4AAC9@viggo.jf.intel.com
      
      b7c21bc5
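      A minimal sketch of the added condition in the helper that decides whether
      kernel text may be Global (the surrounding checks are elided; placement is
      an assumption):

          /* arch/x86/mm/pti.c (sketch) */
          static inline bool pti_kernel_image_global_ok(void)
          {
                  /* ... existing checks (PTI mode, PCID, ...) ... */

                  /*
                   * RANDSTRUCT depends on the structure layout staying secret;
                   * a user-visible (Global) kernel image would make probing that
                   * layout easier, so keep the image non-global.
                   */
                  if (IS_ENABLED(CONFIG_GCC_PLUGIN_RANDSTRUCT))
                          return false;

                  return true;
          }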
    • x86/pti: Reduce amount of kernel text allowed to be Global · a44ca8f5
      Committed by Dave Hansen
      commit abb67605203687c8b7943d760638d0301787f8d9
      
      Kees reported to me that I made too much of the kernel image global.
      It was far more than just text:
      
      	I think this is too much set global: _end is after data,
      	bss, and brk, and all kinds of other stuff that could
      	hold secrets. I think this should match what
      	mark_rodata_ro() is doing.
      
      This does exactly that.  We use __end_rodata_hpage_align as our
      marker both because it is huge-page-aligned and it does not contain
      any sections we expect to hold secrets.
      
      Kees's logic was that r/o data is in the kernel image anyway and,
      in the case of traditional distributions, can be freely downloaded
      from the web, so there's no reason to hide it.
      
      Fixes: 8c06c774 (x86/pti: Leave kernel text global for !PCID)
      Reported-by: Kees Cook <keescook@google.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222023.1C8B2B20@viggo.jf.intel.com
      
      a44ca8f5
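      A sketch of what this amounts to in pti_clone_kernel_text(): clamp the
      cloned (and thus Global-capable) range at __end_rodata_hpage_align instead
      of _end (simplified; argument details are an assumption):

          /* arch/x86/mm/pti.c (sketch) */
          void pti_clone_kernel_text(void)
          {
                  unsigned long start = PFN_ALIGN(_text);
                  /*
                   * Was: (unsigned long)_end, which also covered data, bss and
                   * brk.  __end_rodata_hpage_align is huge-page aligned and
                   * excludes everything that could hold secrets.
                   */
                  unsigned long end = (unsigned long)__end_rodata_hpage_align;

                  if (!pti_kernel_image_global_ok())
                          return;

                  pti_clone_pmds(start, end, _PAGE_RW);
          }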
    • x86/pti: Fix boot warning from Global-bit setting · 58e65b51
      Committed by Dave Hansen
      commit 231df823c4f04176f607afc4576c989895cff40e
      
      The pageattr.c code attempts to process "faults" when it goes looking
      for PTEs to change and finds non-present entries.  It allows these
      faults in the linear map which is "expected to have holes", but
      WARN()s about them elsewhere, like when called on the kernel image.
      
      However, change_page_attr_clear() is now called on the kernel image in the
      process of trying to clear the Global bit.
      
      This trips the warning in __cpa_process_fault() if a non-present PTE is
      encountered in the kernel image.  The "holes" in the kernel image result
      from free_init_pages()'s use of set_memory_np().  These holes are totally
      fine, and result from normal operation, just as they would be in the kernel
      linear map.
      
      Just silence the warning when holes in the kernel image are encountered.
      
      Fixes: 39114b7a (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image)
      Reported-by: Mariusz Ceier <mceier@gmail.com>
      Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222021.1C7D2B3F@viggo.jf.intel.com
      
      58e65b51
    • x86/pti: Fix boot problems from Global-bit setting · d2479a30
      Committed by Dave Hansen
      commit 16dce603adc9de4237b7bf2ff5c5290f34373e7b
      
      Part of the global bit _setting_ patches also includes clearing the
      Global bit when it should not be enabled.  That is done with
      set_memory_nonglobal(), which uses change_page_attr_clear() in
      pageattr.c under the covers.
      
      The TLB flushing code inside pageattr.c has checks like
      BUG_ON(irqs_disabled()), looking for interrupt disabling that might
      cause deadlocks.  But, these also trip in early boot on certain
      preempt configurations.  Just copy the existing BUG_ON() sequence from
      cpa_flush_range() to the other two sites and check for early boot.
      
      Fixes: 39114b7a (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image)
      Reported-by: Mariusz Ceier <mceier@gmail.com>
      Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222019.20C4A410@viggo.jf.intel.com
      
      d2479a30
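      A sketch of the adjusted check, following the pattern already used in
      cpa_flush_range(): tolerate disabled IRQs while early_boot_irqs_disabled is
      still set (one of the flush helpers shown; the others are analogous):

          /* arch/x86/mm/pageattr.c (sketch) */
          static void cpa_flush_all(unsigned long cache)
          {
                  /* IRQs may legitimately be off this early in boot. */
                  BUG_ON(irqs_disabled() && !early_boot_irqs_disabled);

                  on_each_cpu(__cpa_flush_all, (void *) cache, 1);
          }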
  11. 24 April 2018, 2 commits
  12. 23 April 2018, 1 commit
  13. 21 April 2018, 1 commit
  14. 20 April 2018, 4 commits
  15. 17 April 2018, 4 commits
    • x86/mm: Prevent kernel Oops in PTDUMP code with HIGHPTE=y · d6ef1f19
      Committed by Joerg Roedel
      The walk_pte_level() function just uses __va to get the virtual address of
      the PTE page, but that breaks when the PTE page is not in the direct
      mapping with HIGHPTE=y.
      
      The result is an unhandled kernel paging request at some random address
      when accessing the current_kernel or current_user file.
      
      Use the correct API to access PTE pages.
      
      Fixes: fe770bf0 ('x86: clean up the page table dumper and add 32-bit support')
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: jgross@suse.com
      Cc: JBeulich@suse.com
      Cc: hpa@zytor.com
      Cc: aryabinin@virtuozzo.com
      Cc: kirill.shutemov@linux.intel.com
      Link: https://lkml.kernel.org/r/1523971636-4137-1-git-send-email-joro@8bytes.org
      d6ef1f19
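      A sketch of the corrected walk: map the PTE page through
      pte_offset_map()/pte_unmap() instead of assuming it lives in the direct
      mapping (simplified; the note_page() arguments vary by kernel version):

          /* arch/x86/mm/dump_pagetables.c (sketch) */
          static void walk_pte_level(struct seq_file *m, struct pg_state *st,
                                     pmd_t addr, unsigned long P)
          {
                  pte_t *pte;
                  int i;

                  for (i = 0; i < PTRS_PER_PTE; i++) {
                          st->current_address = normalize_addr(P + i * PTE_LEVEL_MULT);
                          /* Works for HIGHPTE pages as well, unlike __va(). */
                          pte = pte_offset_map(&addr, st->current_address);
                          note_page(m, st, __pgprot(pte_flags(*pte)), 5);
                          pte_unmap(pte);
                  }
          }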
    • x86,sched: Allow topologies where NUMA nodes share an LLC · 1340ccfa
      Committed by Alison Schofield
      Intel's Skylake Server CPUs have a different LLC topology than previous
      generations. When in Sub-NUMA-Clustering (SNC) mode, the package is divided
      into two "slices", each containing half the cores, half the LLC, and one
      memory controller and each slice is enumerated to Linux as a NUMA
      node. This is similar to how the cores and LLC were arranged for the
      Cluster-On-Die (CoD) feature.
      
      CoD allowed the same cache line to be present in each half of the LLC.
      But, with SNC, each line is only ever present in *one* slice. This means
      that the portion of the LLC *available* to a CPU depends on the data being
      accessed:
      
          Remote socket: entire package LLC is shared
          Local socket->local slice: data goes into local slice LLC
           Local socket->remote slice: data goes into remote-slice LLC.
                                       Slightly higher latency than local slice LLC.
      
      The biggest implication from this is that a process accessing all
      NUMA-local memory only sees half the LLC capacity.
      
      The CPU describes its cache hierarchy with the CPUID instruction. One of
      the CPUID leaves enumerates the "logical processors sharing this
      cache". This information is used for scheduling decisions so that tasks
      move more freely between CPUs sharing the cache.
      
      But, the CPUID for the SNC configuration discussed above enumerates the LLC
      as being shared by the entire package. This is not 100% precise because the
      entire cache is not usable by all accesses. But, it *is* the way the
      hardware enumerates itself, and this is not likely to change.
      
      The userspace visible impact of all the above is that the sysfs info
      reports the entire LLC as being available to the entire package. As noted
      above, this is not true for local socket accesses. This patch does not
      correct the sysfs info. It is the same, pre and post patch.
      
      The current code emits the following warning:
      
       sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
      
      The warning is coming from the topology_sane() check in smpboot.c because
      the topology is not matching the expectations of the model for obvious
      reasons.
      
      To fix this, add a vendor and model specific check to never call
      topology_sane() for these systems. Also, just like "Cluster-on-Die" disable
      the "coregroup" sched_domain_topology_level and use NUMA information from
      the SRAT alone.
      
      This is OK at least on the hardware we are immediately concerned about
      because the LLC sharing happens at both the slice and at the package level,
      which are also NUMA boundaries.
      Signed-off-by: Alison Schofield <alison.schofield@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: brice.goglin@gmail.com
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180407002130.GA18984@alison-desk.jf.intel.com
      1340ccfa
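      A sketch of the vendor/model-specific exception in the LLC matching code
      (modeled on match_llc() in smpboot.c; the exact helper names are
      assumptions):

          /* arch/x86/kernel/smpboot.c (sketch) */
          static const struct x86_cpu_id snc_cpu[] = {
                  { X86_VENDOR_INTEL, 6, INTEL_FAM6_SKYLAKE_X },
                  {}
          };

          static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
          {
                  int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

                  /* Do not match if we do not have a valid APICID for cpu: */
                  if (per_cpu(cpu_llc_id, cpu1) == BAD_APICID)
                          return false;

                  /* Do not match if LLC id does not match: */
                  if (per_cpu(cpu_llc_id, cpu1) != per_cpu(cpu_llc_id, cpu2))
                          return false;

                  /*
                   * Sub-NUMA-Clustering: the LLC spans NUMA nodes on these
                   * parts, so skip the topology_sane() warning and simply
                   * report "not an LLC sibling" across nodes.
                   */
                  if (!topology_same_node(c, o) && x86_match_cpu(snc_cpu))
                          return false;

                  return topology_sane(c, o, "llc");
          }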
    • x86/processor: Remove two unused function declarations · 451cf3ca
      Committed by Dou Liyang
      early_trap_init() and cpu_set_gdt() have been removed, so remove the stale
      declarations as well.
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@chromium.org
      Cc: luto@kernel.org
      Cc: hpa@zytor.com
      Cc: bp@suse.de
      Cc: kirill.shutemov@linux.intel.com
      Link: https://lkml.kernel.org/r/20180404064527.10562-1-douly.fnst@cn.fujitsu.com
      451cf3ca
    • x86/acpi: Prevent X2APIC id 0xffffffff from being accounted · 10daf10a
      Committed by Dou Liyang
      RongQing reported that there are some X2APIC id 0xffffffff entries in his
      machine's ACPI MADT table, which makes the number of possible CPUs
      inaccurate.
      
      The reason is that the ACPI X2APIC parser has no sanity check for APIC ID
      0xffffffff, which is an invalid id in all APIC types. See "Intel® 64
      Architecture x2APIC Specification", Chapter 2.4.1.
      
      Add a sanity check to acpi_parse_x2apic() which ignores the invalid id.
      Reported-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: len.brown@intel.com
      Cc: rjw@rjwysocki.net
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/20180412014052.25186-1-douly.fnst@cn.fujitsu.com
      10daf10a
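      A minimal sketch of the sanity check in the MADT x2APIC parser (placement
      and comment wording are assumptions; the point is to skip 0xffffffff
      entries before they are registered):

          /* arch/x86/kernel/acpi/boot.c, acpi_parse_x2apic() (sketch) */
          apic_id = processor->local_apic_id;
          enabled = processor->lapic_flags & ACPI_MADT_ENABLED;

          /*
           * 0xffffffff is not a valid x2APIC ID for any APIC type
           * (Intel 64 Architecture x2APIC Specification, 2.4.1), so do not
           * let such an entry inflate the possible CPU count.
           */
          if (apic_id == 0xffffffff)
                  return 0;

          acpi_register_lapic(apic_id, processor->uid, enabled);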