1. 30 10月, 2019 15 次提交
  2. 29 10月, 2019 3 次提交
    • S
      x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu · 4dedaa73
      Sean Christopherson 提交于
      commit 7a22e03b0c02988e91003c505b34d752a51de344 upstream.
      
      Check that the per-cpu cluster mask pointer has been set prior to
      clearing a dying cpu's bit.  The per-cpu pointer is not set until the
      target cpu reaches smp_callin() during CPUHP_BRINGUP_CPU, whereas the
      teardown function, x2apic_dead_cpu(), is associated with the earlier
      CPUHP_X2APIC_PREPARE.  If an error occurs before the cpu is awakened,
      e.g. if do_boot_cpu() itself fails, x2apic_dead_cpu() will dereference
      the NULL pointer and cause a panic.
      
        smpboot: do_boot_cpu failed(-22) to wakeup CPU#1
        BUG: kernel NULL pointer dereference, address: 0000000000000008
        RIP: 0010:x2apic_dead_cpu+0x1a/0x30
        Call Trace:
         cpuhp_invoke_callback+0x9a/0x580
         _cpu_up+0x10d/0x140
         do_cpu_up+0x69/0xb0
         smp_init+0x63/0xa9
         kernel_init_freeable+0xd7/0x229
         ? rest_init+0xa0/0xa0
         kernel_init+0xa/0x100
         ret_from_fork+0x35/0x40
      
      Fixes: 023a6117 ("x86/apic/x2apic: Simplify cluster management")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191001205019.5789-1-sean.j.christopherson@intel.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4dedaa73
    • S
      x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area · 17099172
      Steve Wahl 提交于
      commit 2aa85f246c181b1fa89f27e8e20c5636426be624 upstream.
      
      Our hardware (UV aka Superdome Flex) has address ranges marked
      reserved by the BIOS. Access to these ranges is caught as an error,
      causing the BIOS to halt the system.
      
      Initial page tables mapped a large range of physical addresses that
      were not checked against the list of BIOS reserved addresses, and
      sometimes included reserved addresses in part of the mapped range.
      Including the reserved range in the map allowed processor speculative
      accesses to the reserved range, triggering a BIOS halt.
      
      Used early in booting, the page table level2_kernel_pgt addresses 1
      GiB divided into 2 MiB pages, and it was set up to linearly map a full
       1 GiB of physical addresses that included the physical address range
      of the kernel image, as chosen by KASLR.  But this also included a
      large range of unused addresses on either side of the kernel image.
      And unlike the kernel image's physical address range, this extra
      mapped space was not checked against the BIOS tables of usable RAM
      addresses.  So there were times when the addresses chosen by KASLR
      would result in processor accessible mappings of BIOS reserved
      physical addresses.
      
      The kernel code did not directly access any of this extra mapped
      space, but having it mapped allowed the processor to issue speculative
      accesses into reserved memory, causing system halts.
      
      This was encountered somewhat rarely on a normal system boot, and much
      more often when starting the crash kernel if "crashkernel=512M,high"
      was specified on the command line (this heavily restricts the physical
      address of the crash kernel, in our case usually within 1 GiB of
      reserved space).
      
      The solution is to invalidate the pages of this table outside the kernel
      image's space before the page table is activated. It fixes this problem
      on our hardware.
      
       [ bp: Touchups. ]
      Signed-off-by: NSteve Wahl <steve.wahl@hpe.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: dimitri.sivanich@hpe.com
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jordan Borgner <mail@jordan-borgner.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: mike.travis@hpe.com
      Cc: russ.anderson@hpe.com
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Link: https://lkml.kernel.org/r/9c011ee51b081534a7a15065b1681d200298b530.1569358539.git.steve.wahl@hpe.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      17099172
    • R
      xen/efi: Set nonblocking callbacks · 90a886b6
      Ross Lagerwall 提交于
      [ Upstream commit df359f0d09dc029829b66322707a2f558cb720f7 ]
      
      Other parts of the kernel expect these nonblocking EFI callbacks to
      exist and crash when running under Xen. Since the implementations of
      xen_efi_set_variable() and xen_efi_query_variable_info() do not take any
      locks, use them for the nonblocking callbacks too.
      Signed-off-by: NRoss Lagerwall <ross.lagerwall@citrix.com>
      Reviewed-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      90a886b6
  3. 18 10月, 2019 1 次提交
  4. 12 10月, 2019 4 次提交
    • S
      KVM: nVMX: Fix consistency check on injected exception error code · 63bb8b76
      Sean Christopherson 提交于
      [ Upstream commit 567926cca99ba1750be8aae9c4178796bf9bb90b ]
      
      Current versions of Intel's SDM incorrectly state that "bits 31:15 of
      the VM-Entry exception error-code field" must be zero.  In reality, bits
      31:16 must be zero, i.e. error codes are 16-bit values.
      
      The bogus error code check manifests as an unexpected VM-Entry failure
      due to an invalid code field (error number 7) in L1, e.g. when injecting
      a #GP with error_code=0x9f00.
      
      Nadav previously reported the bug[*], both to KVM and Intel, and fixed
      the associated kvm-unit-test.
      
      [*] https://patchwork.kernel.org/patch/11124749/Reported-by: NNadav Amit <namit@vmware.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      63bb8b76
    • A
      x86/purgatory: Disable the stackleak GCC plugin for the purgatory · 9dabade5
      Arvind Sankar 提交于
      [ Upstream commit ca14c996afe7228ff9b480cf225211cc17212688 ]
      
      Since commit:
      
        b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      
      kexec breaks if GCC_PLUGIN_STACKLEAK=y is enabled, as the purgatory
      contains undefined references to stackleak_track_stack.
      
      Attempting to load a kexec kernel results in this failure:
      
        kexec: Undefined symbol: stackleak_track_stack
        kexec-bzImage64: Loading purgatory failed
      
      Fix this by disabling the stackleak plugin for the purgatory.
      Signed-off-by: NArvind Sankar <nivedita@alum.mit.edu>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      Link: https://lkml.kernel.org/r/20190923171753.GA2252517@rani.riverdale.lanSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9dabade5
    • J
      KVM: nVMX: handle page fault in vmread fix · eff3a54a
      Jack Wang 提交于
      During backport f7eea636c3d5 ("KVM: nVMX: handle page fault in vmread"),
      there was a mistake the exception reference should be passed to function
      kvm_write_guest_virt_system, instead of NULL, other wise, we will get
      NULL pointer deref, eg
      
      kvm-unit-test triggered a NULL pointer deref below:
      [  948.518437] kvm [24114]: vcpu0, guest rIP: 0x407ef9 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x3, nop
      [  949.106464] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [  949.106707] PGD 0 P4D 0
      [  949.106872] Oops: 0002 [#1] SMP
      [  949.107038] CPU: 2 PID: 24126 Comm: qemu-2.7 Not tainted 4.19.77-pserver #4.19.77-1+feature+daily+update+20191005.1625+a4168bb~deb9
      [  949.107283] Hardware name: Dell Inc. Precision Tower 3620/09WH54, BIOS 2.7.3 01/31/2018
      [  949.107549] RIP: 0010:kvm_write_guest_virt_system+0x12/0x40 [kvm]
      [  949.107719] Code: c0 5d 41 5c 41 5d 41 5e 83 f8 03 41 0f 94 c0 41 c1 e0 02 e9 b0 ed ff ff 0f 1f 44 00 00 48 89 f0 c6 87 59 56 00 00 01 48 89 d6 <49> c7 00 00 00 00 00 89 ca 49 c7 40 08 00 00 00 00 49 c7 40 10 00
      [  949.108044] RSP: 0018:ffffb31b0a953cb0 EFLAGS: 00010202
      [  949.108216] RAX: 000000000046b4d8 RBX: ffff9e9f415b0000 RCX: 0000000000000008
      [  949.108389] RDX: ffffb31b0a953cc0 RSI: ffffb31b0a953cc0 RDI: ffff9e9f415b0000
      [  949.108562] RBP: 00000000d2e14928 R08: 0000000000000000 R09: 0000000000000000
      [  949.108733] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffc8
      [  949.108907] R13: 0000000000000002 R14: ffff9e9f4f26f2e8 R15: 0000000000000000
      [  949.109079] FS:  00007eff8694c700(0000) GS:ffff9e9f51a80000(0000) knlGS:0000000031415928
      [  949.109318] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  949.109495] CR2: 0000000000000000 CR3: 00000003be53b002 CR4: 00000000003626e0
      [  949.109671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  949.109845] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  949.110017] Call Trace:
      [  949.110186]  handle_vmread+0x22b/0x2f0 [kvm_intel]
      [  949.110356]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
      [  949.110549]  kvm_arch_vcpu_ioctl_run+0xa98/0x1b30 [kvm]
      [  949.110725]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
      [  949.110901]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
      [  949.111072]  do_vfs_ioctl+0xa2/0x620
      Signed-off-by: NJack Wang <jinpu.wang@cloud.ionos.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      eff3a54a
    • W
      KVM: X86: Fix userspace set invalid CR4 · 21874027
      Wanpeng Li 提交于
      commit 3ca94192278ca8de169d78c085396c424be123b3 upstream.
      
      Reported by syzkaller:
      
      	WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
      	CPU: 0 PID: 6544 Comm: a.out Tainted: G           OE     5.3.0-rc4+ #4
      	RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
      	Call Trace:
      	 vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
      	 vcpu_enter_guest+0x4dc/0x18d0 [kvm]
      	 kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
      	 kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
      	 do_vfs_ioctl+0xa2/0x690
      	 ksys_ioctl+0x6d/0x80
      	 __x64_sys_ioctl+0x1a/0x20
      	 do_syscall_64+0x74/0x720
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When CR4.UMIP is set, guest should have UMIP cpuid flag. Current
      kvm set_sregs function doesn't have such check when userspace inputs
      sregs values. SECONDARY_EXEC_DESC is enabled on writes to CR4.UMIP
      in vmx_set_cr4 though guest doesn't have UMIP cpuid flag. The testcast
      triggers handle_desc warning when executing ltr instruction since
      guest architectural CR4 doesn't set UMIP. This patch fixes it by
      adding valid CR4 and CPUID combination checking in __set_sregs.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000
      
      Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      21874027
  5. 05 10月, 2019 10 次提交
    • S
      KVM: x86: Manually calculate reserved bits when loading PDPTRS · 496cf984
      Sean Christopherson 提交于
      commit 16cfacc8085782dab8e365979356ce1ca87fd6cc upstream.
      
      Manually generate the PDPTR reserved bit mask when explicitly loading
      PDPTRs.  The reserved bits that are being tracked by the MMU reflect the
      current paging mode, which is unlikely to be PAE paging in the vast
      majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation,
      __set_sregs(), etc...  This can cause KVM to incorrectly signal a bad
      PDPTR, or more likely, miss a reserved bit check and subsequently fail
      a VM-Enter due to a bad VMCS.GUEST_PDPTR.
      
      Add a one off helper to generate the reserved bits instead of sharing
      code across the MMU's calculations and the PDPTR emulation.  The PDPTR
      reserved bits are basically set in stone, and pushing a helper into
      the MMU's calculation adds unnecessary complexity without improving
      readability.
      
      Oppurtunistically fix/update the comment for load_pdptrs().
      
      Note, the buggy commit also introduced a deliberate functional change,
      "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was
      effectively (and correctly) reverted by commit cd9ae5fe ("KVM: x86:
      Fix page-tables reserved bits").  A bit of SDM archaeology shows that
      the SDM from late 2008 had a bug (likely a copy+paste error) where it
      listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved
      for 2mb entries.  I.e. the SDM contradicted itself, and bits 6:5 are and
      always have been reserved.
      
      Fixes: 20c466b5 ("KVM: Use rsvd_bits_mask in load_pdptrs()")
      Cc: stable@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Reported-by: NDoug Reiland <doug.reiland@intel.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      496cf984
    • J
      KVM: x86: set ctxt->have_exception in x86_decode_insn() · 933e3e2b
      Jan Dakinevich 提交于
      commit c8848cee74ff05638e913582a476bde879c968ad upstream.
      
      x86_emulate_instruction() takes into account ctxt->have_exception flag
      during instruction decoding, but in practice this flag is never set in
      x86_decode_insn().
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Cc: stable@vger.kernel.org
      Cc: Denis Lunev <den@virtuozzo.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Signed-off-by: NJan Dakinevich <jan.dakinevich@virtuozzo.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      933e3e2b
    • J
      KVM: x86: always stop emulation on page fault · 9723e445
      Jan Dakinevich 提交于
      commit 8530a79c5a9f4e29e6ffb35ec1a79d81f4968ec8 upstream.
      
      inject_emulated_exception() returns true if and only if nested page
      fault happens. However, page fault can come from guest page tables
      walk, either nested or not nested. In both cases we should stop an
      attempt to read under RIP and give guest to step over its own page
      fault handler.
      
      This is also visible when an emulated instruction causes a #GP fault
      and the VMware backdoor is enabled.  To handle the VMware backdoor,
      KVM intercepts #GP faults; with only the next patch applied,
      x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL
      instead of EMULATE_DONE.   EMULATE_FAIL causes handle_exception_nmi()
      (or gp_interception() for SVM) to re-inject the original #GP because it
      thinks emulation failed due to a non-VMware opcode.  This patch prevents
      the issue as x86_emulate_instruction() will return EMULATE_DONE after
      injecting the #GP.
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Cc: stable@vger.kernel.org
      Cc: Denis Lunev <den@virtuozzo.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Signed-off-by: NJan Dakinevich <jan.dakinevich@virtuozzo.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9723e445
    • G
      x86/cpu: Add Tiger Lake to Intel family · e836cd29
      Gayatri Kammela 提交于
      [ Upstream commit 6e1c32c5dbb4b90eea8f964c2869d0bde050dbe0 ]
      
      Add the model numbers/CPUIDs of Tiger Lake mobile and desktop to the
      Intel family.
      Suggested-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NGayatri Kammela <gayatri.kammela@intel.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rahul Tanwar <rahul.tanwar@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190905193020.14707-2-tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e836cd29
    • S
      x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() · 7bbb7a9d
      Song Liu 提交于
      [ Upstream commit 825d0b73cd7526b0bb186798583fae810091cbac ]
      
      pti_clone_pmds() assumes that the supplied address is either:
      
       - properly PUD/PMD aligned
      or
       - the address is actually mapped which means that independently
         of the mapping level (PUD/PMD/PTE) the next higher mapping
         exists.
      
      If that's not the case the unaligned address can be incremented by PUD or
      PMD size incorrectly. All callers supply mapped and/or aligned addresses,
      but for the sake of robustness it's better to handle that case properly and
      to emit a warning.
      
      [ tglx: Rewrote changelog and added WARN_ON_ONCE() ]
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282352470.1938@nanos.tec.linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
      7bbb7a9d
    • T
      x86/mm/pti: Do not invoke PTI functions when PTI is disabled · 4b7d9c2a
      Thomas Gleixner 提交于
      [ Upstream commit 990784b57731192b7d90c8d4049e6318d81e887d ]
      
      When PTI is disabled at boot time either because the CPU is not affected or
      PTI has been disabled on the command line, the boot code still calls into
      pti_finalize() which then unconditionally invokes:
      
           pti_clone_entry_text()
           pti_clone_kernel_text()
      
      pti_clone_kernel_text() was called unconditionally before the 32bit support
      was added and 32bit added the call to pti_clone_entry_text().
      
      The call has no side effects as cloning the page tables into the available
      second one, which was allocated for PTI does not create damage. But it does
      not make sense either and in case that this functionality would be extended
      later this might actually lead to hard to diagnose issues.
      
      Neither function should be called when PTI is runtime disabled. Make the
      invocation conditional.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190828143124.063353972@linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
      4b7d9c2a
    • N
      x86/apic/vector: Warn when vector space exhaustion breaks affinity · b6194965
      Neil Horman 提交于
      [ Upstream commit 743dac494d61d991967ebcfab92e4f80dc7583b3 ]
      
      On x86, CPUs are limited in the number of interrupts they can have affined
      to them as they only support 256 interrupt vectors per CPU. 32 vectors are
      reserved for the CPU and the kernel reserves another 22 for internal
      purposes. That leaves 202 vectors for assignement to devices.
      
      When an interrupt is set up or the affinity is changed by the kernel or the
      administrator, the vector assignment code attempts to honor the requested
      affinity mask. If the vector space on the CPUs in that affinity mask is
      exhausted the code falls back to a wider set of CPUs and assigns a vector
      on a CPU outside of the requested affinity mask silently.
      
      While the effective affinity is reflected in the corresponding
      /proc/irq/$N/effective_affinity* files the silent breakage of the requested
      affinity can lead to unexpected behaviour for administrators.
      
      Add a pr_warn() when this happens so that adminstrators get at least
      informed about it in the syslog.
      
      [ tglx: Massaged changelog and made the pr_warn() more informative ]
      
      Reported-by: djuran@redhat.com
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: djuran@redhat.com
      Link: https://lkml.kernel.org/r/20190822143421.9535-1-nhorman@tuxdriver.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      b6194965
    • T
      x86/apic: Soft disable APIC before initializing it · b40c15c2
      Thomas Gleixner 提交于
      [ Upstream commit 2640da4cccf5cc613bf26f0998b9e340f4b5f69c ]
      
      If the APIC was already enabled on entry of setup_local_APIC() then
      disabling it soft via the SPIV register makes a lot of sense.
      
      That masks all LVT entries and brings it into a well defined state.
      
      Otherwise previously enabled LVTs which are not touched in the setup
      function stay unmasked and might surprise the just booting kernel.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190722105219.068290579@linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
      b40c15c2
    • G
      x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI fails · ce7fdd5c
      Grzegorz Halat 提交于
      [ Upstream commit 747d5a1bf293dcb33af755a6d285d41b8c1ea010 ]
      
      A reboot request sends an IPI via the reboot vector and waits for all other
      CPUs to stop. If one or more CPUs are in critical regions with interrupts
      disabled then the IPI is not handled on those CPUs and the shutdown hangs
      if native_stop_other_cpus() is called with the wait argument set.
      
      Such a situation can happen when one CPU was stopped within a lock held
      section and another CPU is trying to acquire that lock with interrupts
      disabled. There are other scenarios which can cause such a lockup as well.
      
      In theory the shutdown should be attempted by an NMI IPI after the timeout
      period elapsed. Though the wait loop after sending the reboot vector IPI
      prevents this. It checks the wait request argument and the timeout. If wait
      is set, which is true for sys_reboot() then it won't fall through to the
      NMI shutdown method after the timeout period has finished.
      
      This was an oversight when the NMI shutdown mechanism was added to handle
      the 'reboot IPI is not working' situation. The mechanism was added to deal
      with stuck panic shutdowns, which do not have the wait request set, so the
      'wait request' case was probably not considered.
      
      Remove the wait check from the post reboot vector IPI wait loop and enforce
      that the wait loop in the NMI fallback path is invoked even if NMI IPIs are
      disabled or the registration of the NMI handler fails. That second wait
      loop will then hang if not all CPUs shutdown and the wait argument is set.
      
      [ tglx: Avoid the hard to parse line break in the NMI fallback path,
        	add comments and massage the changelog ]
      
      Fixes: 7d007d21 ("x86/reboot: Use NMI to assist in shutting down if IRQ fails")
      Signed-off-by: NGrzegorz Halat <ghalat@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Don Zickus <dzickus@redhat.com>
      Link: https://lkml.kernel.org/r/20190628122813.15500-1-ghalat@redhat.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      ce7fdd5c
    • T
      x86/apic: Make apic_pending_intr_clear() more robust · d29c7b8b
      Thomas Gleixner 提交于
      [ Upstream commit cc8bf191378c1da8ad2b99cf470ee70193ace84e ]
      
      In course of developing shorthand based IPI support issues with the
      function which tries to clear eventually pending ISR bits in the local APIC
      were observed.
      
        1) O-day testing triggered the WARN_ON() in apic_pending_intr_clear().
      
           This warning is emitted when the function fails to clear pending ISR
           bits or observes pending IRR bits which are not delivered to the CPU
           after the stale ISR bit(s) are ACK'ed.
      
           Unfortunately the function only emits a WARN_ON() and fails to dump
           the IRR/ISR content. That's useless for debugging.
      
           Feng added spot on debug printk's which revealed that the stale IRR
           bit belonged to the APIC timer interrupt vector, but adding ad hoc
           debug code does not help with sporadic failures in the field.
      
           Rework the loop so the full IRR/ISR contents are saved and on failure
           dumped.
      
        2) The loop termination logic is interesting at best.
      
           If the machine has no TSC or cpu_khz is not known yet it tries 1
           million times to ack stale IRR/ISR bits. What?
      
           With TSC it uses the TSC to calculate the loop termination. It takes a
           timestamp at entry and terminates the loop when:
      
           	  (rdtsc() - start_timestamp) >= (cpu_hkz << 10)
      
           That's roughly one second.
      
           Both methods are problematic. The APIC has 256 vectors, which means
           that in theory max. 256 IRR/ISR bits can be set. In practice this is
           impossible and the chance that more than a few bits are set is close
           to zero.
      
           With the pure loop based approach the 1 million retries are complete
           overkill.
      
           With TSC this can terminate too early in a guest which is running on a
           heavily loaded host even with only a couple of IRR/ISR bits set. The
           reason is that after acknowledging the highest priority ISR bit,
           pending IRRs must get serviced first before the next round of
           acknowledge can take place as the APIC (real and virtualized) does not
           honour EOI without a preceeding interrupt on the CPU. And every APIC
           read/write takes a VMEXIT if the APIC is virtualized. While trying to
           reproduce the issue 0-day reported it was observed that the guest was
           scheduled out long enough under heavy load that it terminated after 8
           iterations.
      
           Make the loop terminate after 512 iterations. That's plenty enough
           in any case and does not take endless time to complete.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190722105219.158847694@linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
      d29c7b8b
  6. 21 9月, 2019 5 次提交
    • T
      x86/hyper-v: Fix overflow bug in fill_gva_list() · d73515a1
      Tianyu Lan 提交于
      [ Upstream commit 4030b4c585c41eeefec7bd20ce3d0e100a0f2e4d ]
      
      When the 'start' parameter is >=  0xFF000000 on 32-bit
      systems, or >= 0xFFFFFFFF'FF000000 on 64-bit systems,
      fill_gva_list() gets into an infinite loop.
      
      With such inputs, 'cur' overflows after adding HV_TLB_FLUSH_UNIT
      and always compares as less than end.  Memory is filled with
      guest virtual addresses until the system crashes.
      
      Fix this by never incrementing 'cur' to be larger than 'end'.
      Reported-by: NJong Hyun Park <park.jonghyun@yonsei.ac.kr>
      Signed-off-by: NTianyu Lan <Tianyu.Lan@microsoft.com>
      Reviewed-by: NMichael Kelley <mikelley@microsoft.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 2ffd9e33 ("x86/hyper-v: Use hypercall for remote TLB flush")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d73515a1
    • P
      x86/uaccess: Don't leak the AC flags into __get_user() argument evaluation · 37135777
      Peter Zijlstra 提交于
      [ Upstream commit 9b8bd476e78e89c9ea26c3b435ad0201c3d7dbf5 ]
      
      Identical to __put_user(); the __get_user() argument evalution will too
      leak UBSAN crud into the __uaccess_begin() / __uaccess_end() region.
      While uncommon this was observed to happen for:
      
        drivers/xen/gntdev.c: if (__get_user(old_status, batch->status[i]))
      
      where UBSAN added array bound checking.
      
      This complements commit:
      
        6ae865615fc4 ("x86/uaccess: Dont leak the AC flag into __put_user() argument evaluation")
      
      Tested-by Sedat Dilek <sedat.dilek@gmail.com>
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: broonie@kernel.org
      Cc: sfr@canb.auug.org.au
      Cc: akpm@linux-foundation.org
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: mhocko@suse.cz
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lkml.kernel.org/r/20190829082445.GM2369@hirez.programming.kicks-ass.netSigned-off-by: NSasha Levin <sashal@kernel.org>
      37135777
    • K
      perf/x86/amd/ibs: Fix sample bias for dispatched micro-ops · 7ec11cad
      Kim Phillips 提交于
      [ Upstream commit 0f4cd769c410e2285a4e9873a684d90423f03090 ]
      
      When counting dispatched micro-ops with cnt_ctl=1, in order to prevent
      sample bias, IBS hardware preloads the least significant 7 bits of
      current count (IbsOpCurCnt) with random values, such that, after the
      interrupt is handled and counting resumes, the next sample taken
      will be slightly perturbed.
      
      The current count bitfield is in the IBS execution control h/w register,
      alongside the maximum count field.
      
      Currently, the IBS driver writes that register with the maximum count,
      leaving zeroes to fill the current count field, thereby overwriting
      the random bits the hardware preloaded for itself.
      
      Fix the driver to actually retain and carry those random bits from the
      read of the IBS control register, through to its write, instead of
      overwriting the lower current count bits with zeroes.
      
      Tested with:
      
      perf record -c 100001 -e ibs_op/cnt_ctl=1/pp -a -C 0 taskset -c 0 <workload>
      
      'perf annotate' output before:
      
       15.70  65:   addsd     %xmm0,%xmm1
       17.30        add       $0x1,%rax
       15.88        cmp       %rdx,%rax
                    je        82
       17.32  72:   test      $0x1,%al
                    jne       7c
        7.52        movapd    %xmm1,%xmm0
        5.90        jmp       65
        8.23  7c:   sqrtsd    %xmm1,%xmm0
       12.15        jmp       65
      
      'perf annotate' output after:
      
       16.63  65:   addsd     %xmm0,%xmm1
       16.82        add       $0x1,%rax
       16.81        cmp       %rdx,%rax
                    je        82
       16.69  72:   test      $0x1,%al
                    jne       7c
        8.30        movapd    %xmm1,%xmm0
        8.13        jmp       65
        8.24  7c:   sqrtsd    %xmm1,%xmm0
        8.39        jmp       65
      
      Tested on Family 15h and 17h machines.
      
      Machines prior to family 10h Rev. C don't have the RDWROPCNT capability,
      and have the IbsOpCurCnt bitfield reserved, so this patch shouldn't
      affect their operation.
      
      It is unknown why commit db98c5fa ("perf/x86: Implement 64-bit
      counter support for IBS") ignored the lower 4 bits of the IbsOpCurCnt
      field; the number of preloaded random bits has always been 7, AFAICT.
      Signed-off-by: NKim Phillips <kim.phillips@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Arnaldo Carvalho de Melo" <acme@kernel.org>
      Cc: <x86@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: "Namhyung Kim" <namhyung@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20190826195730.30614-1-kim.phillips@amd.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      7ec11cad
    • J
      perf/x86/intel: Restrict period on Nehalem · 560857de
      Josh Hunt 提交于
      [ Upstream commit 44d3bbb6f5e501b873218142fe08cdf62a4ac1f3 ]
      
      We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in
      some cases when using perf:
      
      perfevents: irq loop stuck!
      WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 intel_pmu_handle_irq+0x37b/0x530
      ...
      RIP: 0010:intel_pmu_handle_irq+0x37b/0x530
      ...
      Call Trace:
      <NMI>
      ? perf_event_nmi_handler+0x2e/0x50
      ? intel_pmu_save_and_restart+0x50/0x50
      perf_event_nmi_handler+0x2e/0x50
      nmi_handle+0x6e/0x120
      default_do_nmi+0x3e/0x100
      do_nmi+0x102/0x160
      end_repeat_nmi+0x16/0x50
      ...
      ? native_write_msr+0x6/0x20
      ? native_write_msr+0x6/0x20
      </NMI>
      intel_pmu_enable_event+0x1ce/0x1f0
      x86_pmu_start+0x78/0xa0
      x86_pmu_enable+0x252/0x310
      __perf_event_task_sched_in+0x181/0x190
      ? __switch_to_asm+0x41/0x70
      ? __switch_to_asm+0x35/0x70
      ? __switch_to_asm+0x41/0x70
      ? __switch_to_asm+0x35/0x70
      finish_task_switch+0x158/0x260
      __schedule+0x2f6/0x840
      ? hrtimer_start_range_ns+0x153/0x210
      schedule+0x32/0x80
      schedule_hrtimeout_range_clock+0x8a/0x100
      ? hrtimer_init+0x120/0x120
      ep_poll+0x2f7/0x3a0
      ? wake_up_q+0x60/0x60
      do_epoll_wait+0xa9/0xc0
      __x64_sys_epoll_wait+0x1a/0x20
      do_syscall_64+0x4e/0x110
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fdeb1e96c03
      ...
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: acme@kernel.org
      Cc: Josh Hunt <johunt@akamai.com>
      Cc: bpuranda@akamai.com
      Cc: mingo@redhat.com
      Cc: jolsa@redhat.com
      Cc: tglx@linutronix.de
      Cc: namhyung@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Link: https://lkml.kernel.org/r/1566256411-18820-1-git-send-email-johunt@akamai.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      560857de
    • T
      x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines · e997c073
      Thomas Gleixner 提交于
      [ Upstream commit 3e5bedc2c258341702ddffbd7688c5e6eb01eafa ]
      
      Rahul Tanwar reported the following bug on DT systems:
      
      > 'ioapic_dynirq_base' contains the virtual IRQ base number. Presently, it is
      > updated to the end of hardware IRQ numbers but this is done only when IOAPIC
      > configuration type is IOAPIC_DOMAIN_LEGACY or IOAPIC_DOMAIN_STRICT. There is
      > a third type IOAPIC_DOMAIN_DYNAMIC which applies when IOAPIC configuration
      > comes from devicetree.
      >
      > See dtb_add_ioapic() in arch/x86/kernel/devicetree.c
      >
      > In case of IOAPIC_DOMAIN_DYNAMIC (DT/OF based system), 'ioapic_dynirq_base'
      > remains to zero initialized value. This means that for OF based systems,
      > virtual IRQ base will get set to zero.
      
      Such systems will very likely not even boot.
      
      For DT enabled machines ioapic_dynirq_base is irrelevant and not
      updated, so simply map the IRQ base 1:1 instead.
      Reported-by: NRahul Tanwar <rahul.tanwar@linux.intel.com>
      Tested-by: NRahul Tanwar <rahul.tanwar@linux.intel.com>
      Tested-by: NAndy Shevchenko <andriy.shevchenko@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: alan@linux.intel.com
      Cc: bp@alien8.de
      Cc: cheol.yong.kim@intel.com
      Cc: qi-ming.wu@intel.com
      Cc: rahul.tanwar@intel.com
      Cc: rppt@linux.ibm.com
      Cc: tony.luck@intel.com
      Link: http://lkml.kernel.org/r/20190821081330.1187-1-rahul.tanwar@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e997c073
  7. 19 9月, 2019 2 次提交
    • L
      x86/build: Add -Wnoaddress-of-packed-member to REALMODE_CFLAGS, to silence GCC9 build warning · 9d587fe2
      Linus Torvalds 提交于
      commit 42e0e95474fc6076b5cd68cab8fa0340a1797a72 upstream.
      
      One of the very few warnings I have in the current build comes from
      arch/x86/boot/edd.c, where I get the following with a gcc9 build:
      
         arch/x86/boot/edd.c: In function ‘query_edd’:
         arch/x86/boot/edd.c:148:11: warning: taking address of packed member of ‘struct boot_params’ may result in an unaligned pointer value [-Waddress-of-packed-member]
           148 |  mbrptr = boot_params.edd_mbr_sig_buffer;
               |           ^~~~~~~~~~~
      
      This warning triggers because we throw away all the CFLAGS and then make
      a new set for REALMODE_CFLAGS, so the -Wno-address-of-packed-member we
      added in the following commit is not present:
      
        6f303d60534c ("gcc-9: silence 'address-of-packed-member' warning")
      
      The simplest solution for now is to adjust the warning for this version
      of CFLAGS as well, but it would definitely make sense to examine whether
      REALMODE_CFLAGS could be derived from CFLAGS, so that it picks up changes
      in the compiler flags environment automatically.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NBorislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d587fe2
    • S
      x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large to... · eb020b77
      Steve Wahl 提交于
      x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large to fix kexec relocation errors
      
      commit e16c2983fba0fa6763e43ad10916be35e3d8dc05 upstream.
      
      The last change to this Makefile caused relocation errors when loading
      a kdump kernel.  Restore -mcmodel=large (not -mcmodel=kernel),
      -ffreestanding, and -fno-zero-initialized-bsss, without reverting to
      the former practice of resetting KBUILD_CFLAGS.
      
      Purgatory.ro is a standalone binary that is not linked against the
      rest of the kernel.  Its image is copied into an array that is linked
      to the kernel, and from there kexec relocates it wherever it desires.
      
      With the previous change to compiler flags, the error "kexec: Overflow
      in relocation type 11 value 0x11fffd000" was encountered when trying
      to load the crash kernel.  This is from kexec code trying to relocate
      the purgatory.ro object.
      
      From the error message, relocation type 11 is R_X86_64_32S.  The
      x86_64 ABI says:
      
        "The R_X86_64_32 and R_X86_64_32S relocations truncate the
         computed value to 32-bits.  The linker must verify that the
         generated value for the R_X86_64_32 (R_X86_64_32S) relocation
         zero-extends (sign-extends) to the original 64-bit value."
      
      This type of relocation doesn't work when kexec chooses to place the
      purgatory binary in memory that is not reachable with 32 bit
      addresses.
      
      The compiler flag -mcmodel=kernel allows those type of relocations to
      be emitted, so revert to using -mcmodel=large as was done before.
      
      Also restore the -ffreestanding and -fno-zero-initialized-bss flags
      because they are appropriate for a stand alone piece of object code
      which doesn't explicitly zero the bss, and one other report has said
      undefined symbols are encountered without -ffreestanding.
      
      These identical compiler flag changes need to happen for every object
      that becomes part of the purgatory.ro object, so gather them together
      first into PURGATORY_CFLAGS_REMOVE and PURGATORY_CFLAGS, and then
      apply them to each of the objects that have C source.  Do not apply
      any of these flags to kexec-purgatory.o, which is not part of the
      standalone object but part of the kernel proper.
      Tested-by: NVaibhav Rustagi <vaibhavrustagi@google.com>
      Tested-by: NAndreas Smas <andreas@lonelycoder.com>
      Signed-off-by: NSteve Wahl <steve.wahl@hpe.com>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: None
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: clang-built-linux@googlegroups.com
      Cc: dimitri.sivanich@hpe.com
      Cc: mike.travis@hpe.com
      Cc: russ.anderson@hpe.com
      Fixes: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS")
      Link: https://lkml.kernel.org/r/20190905202346.GA26595@swahl-linuxSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Cc: Andreas Smas <andreas@lonelycoder.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      eb020b77