1. 21 7月, 2019 5 次提交
    • T
      x86/irq: Handle spurious interrupt after shutdown gracefully · 9494cd39
      Thomas Gleixner 提交于
      commit b7107a67f0d125459fe41f86e8079afd1a5e0b15 upstream
      
      Since the rework of the vector management, warnings about spurious
      interrupts have been reported. Robert provided some more information and
      did an initial analysis. The following situation leads to these warnings:
      
         CPU 0                  CPU 1               IO_APIC
      
                                                    interrupt is raised
                                                    sent to CPU1
      			  Unable to handle
      			  immediately
      			  (interrupts off,
      			   deep idle delay)
         mask()
         ...
         free()
           shutdown()
           synchronize_irq()
           clear_vector()
                                do_IRQ()
                                  -> vector is clear
      
      Before the rework the vector entries of legacy interrupts were statically
      assigned and occupied precious vector space while most of them were
      unused. Due to that the above situation was handled silently because the
      vector was handled and the core handler of the assigned interrupt
      descriptor noticed that it is shut down and returned.
      
      While this has been usually observed with legacy interrupts, this situation
      is not limited to them. Any other interrupt source, e.g. MSI, can cause the
      same issue.
      
      After adding proper synchronization for level triggered interrupts, this
      can only happen for edge triggered interrupts where the IO-APIC obviously
      cannot provide information about interrupts in flight.
      
      While the spurious warning is actually harmless in this case it worries
      users and driver developers.
      
      Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead
      of VECTOR_UNUSED when the vector is freed up.
      
      If that above late handling happens the spurious detector will not complain
      and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on
      that line will trigger the spurious warning as before.
      
      Fixes: 464d1230 ("x86/vector: Switch IOAPIC to global reservation mode")
      Reported-by: NRobert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>-
      Tested-by: NRobert Hodaszi <Robert.Hodaszi@digi.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.459647741@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      9494cd39
    • T
      x86/ioapic: Implement irq_get_irqchip_state() callback · 7897f5a4
      Thomas Gleixner 提交于
      commit dfe0cf8b51b07e56ded571e3de0a4a9382517231 upstream
      
      When an interrupt is shut down in free_irq() there might be an inflight
      interrupt pending in the IO-APIC remote IRR which is not yet serviced. That
      means the interrupt has been sent to the target CPUs local APIC, but the
      target CPU is in a state which delays the servicing.
      
      So free_irq() would proceed to free resources and to clear the vector
      because synchronize_hardirq() does not see an interrupt handler in
      progress.
      
      That can trigger a spurious interrupt warning, which is harmless and just
      confuses users, but it also can leave the remote IRR in a stale state
      because once the handler is invoked the interrupt resources might be freed
      already and therefore acknowledgement is not possible anymore.
      
      Implement the irq_get_irqchip_state() callback for the IO-APIC irq chip. The
      callback is invoked from free_irq() via __synchronize_hardirq(). Check the
      remote IRR bit of the interrupt and return 'in flight' if it is set and the
      interrupt is configured in level mode. For edge mode the remote IRR has no
      meaning.
      
      As this is only meaningful for level triggered interrupts this won't cure
      the potential spurious interrupt warning for edge triggered interrupts, but
      the edge trigger case does not result in stale hardware state. This has to
      be addressed at the vector/interrupt entry level seperately.
      
      Fixes: 464d1230 ("x86/vector: Switch IOAPIC to global reservation mode")
      Reported-by: NRobert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.370295517@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      7897f5a4
    • K
      x86/boot/64: Add missing fixup_pointer() for next_early_pgt access · 94968c37
      Kirill A. Shutemov 提交于
      [ Upstream commit c1887159eb48ba40e775584cfb2a443962cf1a05 ]
      
      __startup_64() uses fixup_pointer() to access global variables in a
      position-independent fashion. Access to next_early_pgt was wrapped into the
      helper, but one instance in the 5-level paging branch was missed.
      
      GCC generates a R_X86_64_PC32 PC-relative relocation for the access which
      doesn't trigger the issue, but Clang emmits a R_X86_64_32S which leads to
      an invalid memory access and system reboot.
      
      Fixes: 187e91fe ("x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt'")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Link: https://lkml.kernel.org/r/20190620112422.29264-1-kirill.shutemov@linux.intel.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      94968c37
    • K
      x86/boot/64: Fix crash if kernel image crosses page table boundary · 729d25f4
      Kirill A. Shutemov 提交于
      [ Upstream commit 81c7ed296dcd02bc0b4488246d040e03e633737a ]
      
      A kernel which boots in 5-level paging mode crashes in a small percentage
      of cases if KASLR is enabled.
      
      This issue was tracked down to the case when the kernel image unpacks in a
      way that it crosses an 1G boundary. The crash is caused by an overrun of
      the PMD page table in __startup_64() and corruption of P4D page table
      allocated next to it. This particular issue is not visible with 4-level
      paging as P4D page tables are not used.
      
      But the P4D and the PUD calculation have similar problems.
      
      The PMD index calculation is wrong due to operator precedence, which fails
      to confine the PMDs in the PMD array on wrap around.
      
      The P4D calculation for 5-level paging and the PUD calculation calculate
      the first index correctly, but then blindly increment it which causes the
      same issue when a kernel image is located across a 512G and for 5-level
      paging across a 46T boundary.
      
      This wrap around mishandling was introduced when these parts moved from
      assembly to C.
      
      Restore it to the correct behaviour.
      
      Fixes: c88d7150 ("x86/boot/64: Rewrite startup_64() in C")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190620112345.28833-1-kirill.shutemov@linux.intel.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      729d25f4
    • C
      x86/apic: Fix integer overflow on 10 bit left shift of cpu_khz · 2a6ee369
      Colin Ian King 提交于
      [ Upstream commit ea136a112d89bade596314a1ae49f748902f4727 ]
      
      The left shift of unsigned int cpu_khz will overflow for large values of
      cpu_khz, so cast it to a long long before shifting it to avoid overvlow.
      For example, this can happen when cpu_khz is 4194305, i.e. ~4.2 GHz.
      
      Addresses-Coverity: ("Unintentional integer overflow")
      Fixes: 8c3ba8d0 ("x86, apic: ack all pending irqs when crashed/on kexec")
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: kernel-janitors@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190619181446.13635-1-colin.king@canonical.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      2a6ee369
  2. 14 7月, 2019 3 次提交
  3. 10 7月, 2019 5 次提交
    • W
      KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC · f25c0695
      Wanpeng Li 提交于
      commit bb34e690e9340bc155ebed5a3d75fc63ff69e082 upstream.
      
      Thomas reported that:
      
       | Background:
       |
       |    In preparation of supporting IPI shorthands I changed the CPU offline
       |    code to software disable the local APIC instead of just masking it.
       |    That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
       |    register.
       |
       | Failure:
       |
       |    When the CPU comes back online the startup code triggers occasionally
       |    the warning in apic_pending_intr_clear(). That complains that the IRRs
       |    are not empty.
       |
       |    The offending vector is the local APIC timer vector who's IRR bit is set
       |    and stays set.
       |
       | It took me quite some time to reproduce the issue locally, but now I can
       | see what happens.
       |
       | It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
       | (and hardware support) it behaves correctly.
       |
       | Here is the series of events:
       |
       |     Guest CPU
       |
       |     goes down
       |
       |       native_cpu_disable()
       |
       | 			apic_soft_disable();
       |
       |     play_dead()
       |
       |     ....
       |
       |     startup()
       |
       |       if (apic_enabled())
       |         apic_pending_intr_clear()	<- Not taken
       |
       |      enable APIC
       |
       |         apic_pending_intr_clear()	<- Triggers warning because IRR is stale
       |
       | When this happens then the deadline timer or the regular APIC timer -
       | happens with both, has fired shortly before the APIC is disabled, but the
       | interrupt was not serviced because the guest CPU was in an interrupt
       | disabled region at that point.
       |
       | The state of the timer vector ISR/IRR bits:
       |
       |     	     	       	        ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      1
       |
       | Now one would assume that the IRR is cleared after the INIT reset, but this
       | happens only on CPU0.
       |
       | Why?
       |
       | Because our CPU0 hotplug is just for testing to make sure nothing breaks
       | and goes through an NMI wakeup vehicle because INIT would send it through
       | the boots-trap code which is not really working if that CPU was not
       | physically unplugged.
       |
       | Now looking at a real world APIC the situation in that case is:
       |
       |     	     	       	      	ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      0
       |
       | Why?
       |
       | Once the dying CPU reenables interrupts the pending interrupt gets
       | delivered as a spurious interupt and then the state is clear.
       |
       | While that CPU0 hotplug test case is surely an esoteric issue, the APIC
       | emulation is still wrong, Even if the play_dead() code would not enable
       | interrupts then the pending IRR bit would turn into an ISR .. interrupt
       | when the APIC is reenabled on startup.
      
      From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
      * Pending interrupts in the IRR and ISR registers are held and require
        masking or handling by the CPU.
      
      In Thomas's testing, hardware cpu will not respect soft disable LAPIC
      when IRR has already been set or APICv posted-interrupt is in flight,
      so we can skip soft disable APIC checking when clearing IRR and set ISR,
      continue to respect soft disable APIC when attempting to set IRR.
      Reported-by: NRong Chen <rong.a.chen@intel.com>
      Reported-by: NFeng Tang <feng.tang@intel.com>
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f25c0695
    • P
      KVM: x86: degrade WARN to pr_warn_ratelimited · f6472f50
      Paolo Bonzini 提交于
      commit 3f16a5c318392cbb5a0c7a3d19dff8c8ef3c38ee upstream.
      
      This warning can be triggered easily by userspace, so it should certainly not
      cause a panic if panic_on_warn is set.
      
      Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com
      Suggested-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NAlexander Potapenko <glider@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f6472f50
    • K
      x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting · 6bf96773
      Kirill A. Shutemov 提交于
      [ Upstream commit 45b13b424faafb81c8c44541f093a682fdabdefc ]
      
      RDMSR in the trampoline code overwrites EDX but that register is used
      to indicate whether 5-level paging has to be enabled and if clobbered,
      leads to failure to boot on a 5-level paging machine.
      
      Preserve EDX on the stack while we are dealing with EFER.
      
      Fixes: b677dfae5aa1 ("x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode")
      Reported-by: NKyle D Pelton <kyle.d.pelton@intel.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: dave.hansen@linux.intel.com
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wei Huang <wei@redhat.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190206115253.1907-1-kirill.shutemov@linux.intel.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      6bf96773
    • P
      ftrace/x86: Remove possible deadlock between register_kprobe() and ftrace_run_update_code() · c854d9b6
      Petr Mladek 提交于
      commit d5b844a2cf507fc7642c9ae80a9d585db3065c28 upstream.
      
      The commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text
      permissions race") causes a possible deadlock between register_kprobe()
      and ftrace_run_update_code() when ftrace is using stop_machine().
      
      The existing dependency chain (in reverse order) is:
      
      -> #1 (text_mutex){+.+.}:
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             __mutex_lock+0x88/0x908
             mutex_lock_nested+0x32/0x40
             register_kprobe+0x254/0x658
             init_kprobes+0x11a/0x168
             do_one_initcall+0x70/0x318
             kernel_init_freeable+0x456/0x508
             kernel_init+0x22/0x150
             ret_from_fork+0x30/0x34
             kernel_thread_starter+0x0/0xc
      
      -> #0 (cpu_hotplug_lock.rw_sem){++++}:
             check_prev_add+0x90c/0xde0
             validate_chain.isra.21+0xb32/0xd70
             __lock_acquire+0x4b8/0x928
             lock_acquire+0x102/0x230
             cpus_read_lock+0x62/0xd0
             stop_machine+0x2e/0x60
             arch_ftrace_update_code+0x2e/0x40
             ftrace_run_update_code+0x40/0xa0
             ftrace_startup+0xb2/0x168
             register_ftrace_function+0x64/0x88
             klp_patch_object+0x1a2/0x290
             klp_enable_patch+0x554/0x980
             do_one_initcall+0x70/0x318
             do_init_module+0x6e/0x250
             load_module+0x1782/0x1990
             __s390x_sys_finit_module+0xaa/0xf0
             system_call+0xd8/0x2d0
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(text_mutex);
                                     lock(cpu_hotplug_lock.rw_sem);
                                     lock(text_mutex);
        lock(cpu_hotplug_lock.rw_sem);
      
      It is similar problem that has been solved by the commit 2d1e38f5
      ("kprobes: Cure hotplug lock ordering issues"). Many locks are involved.
      To be on the safe side, text_mutex must become a low level lock taken
      after cpu_hotplug_lock.rw_sem.
      
      This can't be achieved easily with the current ftrace design.
      For example, arm calls set_all_modules_text_rw() already in
      ftrace_arch_code_modify_prepare(), see arch/arm/kernel/ftrace.c.
      This functions is called:
      
        + outside stop_machine() from ftrace_run_update_code()
        + without stop_machine() from ftrace_module_enable()
      
      Fortunately, the problematic fix is needed only on x86_64. It is
      the only architecture that calls set_all_modules_text_rw()
      in ftrace path and supports livepatching at the same time.
      
      Therefore it is enough to move text_mutex handling from the generic
      kernel/trace/ftrace.c into arch/x86/kernel/ftrace.c:
      
         ftrace_arch_code_modify_prepare()
         ftrace_arch_code_modify_post_process()
      
      This patch basically reverts the ftrace part of the problematic
      commit 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module
      text permissions race"). And provides x86_64 specific-fix.
      
      Some refactoring of the ftrace code will be needed when livepatching
      is implemented for arm or nds32. These architectures call
      set_all_modules_text_rw() and use stop_machine() at the same time.
      
      Link: http://lkml.kernel.org/r/20190627081334.12793-1-pmladek@suse.com
      
      Fixes: 9f255b632bf12c4dd7 ("module: Fix livepatch/ftrace module text permissions race")
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Reported-by: NMiroslav Benes <mbenes@suse.cz>
      Reviewed-by: NMiroslav Benes <mbenes@suse.cz>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NPetr Mladek <pmladek@suse.com>
      [
        As reviewed by Miroslav Benes <mbenes@suse.cz>, removed return value of
        ftrace_run_update_code() as it is a void function.
      ]
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c854d9b6
    • K
      x86/CPU: Add more Icelake model numbers · 5284327f
      Kan Liang 提交于
      [ Upstream commit e35faeb64146f2015f2aec14b358ae508e4066db ]
      
      Add the CPUID model numbers of Icelake (ICL) desktop and server
      processors to the Intel family list.
      
       [ Qiuxu: Sort the macros by model number. ]
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com>
      Cc: rui.zhang@intel.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190603134122.13853-1-kan.liang@linux.intel.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      5284327f
  4. 03 7月, 2019 4 次提交
  5. 25 6月, 2019 1 次提交
  6. 22 6月, 2019 2 次提交
    • F
      x86/CPU/AMD: Don't force the CPB cap when running under a hypervisor · a35e7822
      Frank van der Linden 提交于
      [ Upstream commit 2ac44ab608705948564791ce1d15d43ba81a1e38 ]
      
      For F17h AMD CPUs, the CPB capability ('Core Performance Boost') is forcibly set,
      because some versions of that chip incorrectly report that they do not have it.
      
      However, a hypervisor may filter out the CPB capability, for good
      reasons. For example, KVM currently does not emulate setting the CPB
      bit in MSR_K7_HWCR, and unchecked MSR access errors will be thrown
      when trying to set it as a guest:
      
      	unchecked MSR access error: WRMSR to 0xc0010015 (tried to write 0x0000000001000011) at rIP: 0xffffffff890638f4 (native_write_msr+0x4/0x20)
      
      	Call Trace:
      	boost_set_msr+0x50/0x80 [acpi_cpufreq]
      	cpuhp_invoke_callback+0x86/0x560
      	sort_range+0x20/0x20
      	cpuhp_thread_fun+0xb0/0x110
      	smpboot_thread_fn+0xef/0x160
      	kthread+0x113/0x130
      	kthread_create_worker_on_cpu+0x70/0x70
      	ret_from_fork+0x35/0x40
      
      To avoid this issue, don't forcibly set the CPB capability for a CPU
      when running under a hypervisor.
      Signed-off-by: NFrank van der Linden <fllinden@amazon.com>
      Acked-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: jiaxun.yang@flygoat.com
      Fixes: 0237199186e7 ("x86/CPU/AMD: Set the CPB bit unconditionally on F17h")
      Link: http://lkml.kernel.org/r/20190522221745.GA15789@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com
      [ Minor edits to the changelog. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a35e7822
    • S
      perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints · 5a9c29cc
      Stephane Eranian 提交于
      [ Upstream commit 23e3983a466cd540ffdd2bbc6e0c51e31934f941 ]
      
      This patch fixes an bug revealed by the following commit:
      
        6b89d4c1ae85 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking")
      
      That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event code
      when matching a constraint. If code+umask were needed, then the
      INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
      This broke with some of the constraints for PEBS events.
      
      Several of them, including the one used for cycles:p, cycles:pp, cycles:ppp
      fell in that category and caused the event to be rejected in PEBS mode.
      In other words, on some platforms a cmdline such as:
      
        $ perf top -e cycles:pp
      
      would fail with -EINVAL.
      
      This patch fixes this bug by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
      when needed in the PEBS constraint tables.
      Reported-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/20190521005246.423-1-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5a9c29cc
  7. 19 6月, 2019 6 次提交
    • P
      x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled · e93ce57f
      Prarit Bhargava 提交于
      commit c7563e62a6d720aa3b068e26ddffab5f0df29263 upstream.
      
      Booting with kernel parameter "rdt=cmt,mbmtotal,memlocal,l3cat,mba" and
      executing "mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl" results in
      a NULL pointer dereference on systems which do not have local MBM support
      enabled..
      
      BUG: kernel NULL pointer dereference, address: 0000000000000020
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 0 PID: 722 Comm: kworker/0:3 Not tainted 5.2.0-0.rc3.git0.1.el7_UNSUPPORTED.x86_64 #2
      Workqueue: events mbm_handle_overflow
      RIP: 0010:mbm_handle_overflow+0x150/0x2b0
      
      Only enter the bandwith update loop if the system has local MBM enabled.
      
      Fixes: de73f38f ("x86/intel_rdt/mba_sc: Feedback loop to dynamically update mem bandwidth")
      Signed-off-by: NPrarit Bhargava <prarit@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190610171544.13474-1-prarit@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e93ce57f
    • B
      x86/mm/KASLR: Compute the size of the vmemmap section properly · 0257fc9a
      Baoquan He 提交于
      commit 00e5a2bbcc31d5fea853f8daeba0f06c1c88c3ff upstream.
      
      The size of the vmemmap section is hardcoded to 1 TB to support the
      maximum amount of system RAM in 4-level paging mode - 64 TB.
      
      However, 1 TB is not enough for vmemmap in 5-level paging mode. Assuming
      the size of struct page is 64 Bytes, to support 4 PB system RAM in 5-level,
      64 TB of vmemmap area is needed:
      
        4 * 1000^5 PB / 4096 bytes page size * 64 bytes per page struct / 1000^4 TB = 62.5 TB.
      
      This hardcoding may cause vmemmap to corrupt the following
      cpu_entry_area section, if KASLR puts vmemmap very close to it and the
      actual vmemmap size is bigger than 1 TB.
      
      So calculate the actual size of the vmemmap region needed and then align
      it up to 1 TB boundary.
      
      In 4-level paging mode it is always 1 TB. In 5-level it's adjusted on
      demand. The current code reserves 0.5 PB for vmemmap on 5-level. With
      this change, the space can be saved and thus used to increase entropy
      for the randomization.
      
       [ bp: Spell out how the 64 TB needed for vmemmap is computed and massage commit
         message. ]
      
      Fixes: eedb92ab ("x86/mm: Make virtual memory layout dynamic for CONFIG_X86_5LEVEL=y")
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Acked-by: NKirill A. Shutemov <kirill@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: kirill.shutemov@linux.intel.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190523025744.3756-1-bhe@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0257fc9a
    • A
      x86/kasan: Fix boot with 5-level paging and KASAN · 5e3d10d9
      Andrey Ryabinin 提交于
      commit f3176ec9420de0c385023afa3e4970129444ac2f upstream.
      
      Since commit d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on
      5-level paging") kernel doesn't boot with KASAN on 5-level paging machines.
      The bug is actually in early_p4d_offset() and introduced by commit
      12a8cc7f ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")
      
      early_p4d_offset() tries to convert pgd_val(*pgd) value to a physical
      address. This doesn't make sense because pgd_val() already contains the
      physical address.
      
      It did work prior to commit d52888aa2753 because the result of
      "__pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK" was the same as "pgd_val(*pgd)
      & PTE_PFN_MASK". __pa_nodebug() just set some high bits which were masked
      out by applying PTE_PFN_MASK.
      
      After the change of the PAGE_OFFSET offset in commit d52888aa2753
      __pa_nodebug(pgd_val(*pgd)) started to return a value with more high bits
      set and PTE_PFN_MASK wasn't enough to mask out all of them. So it returns a
      wrong not even canonical address and crashes on the attempt to dereference
      it.
      
      Switch back to pgd_val() & PTE_PFN_MASK to cure the issue.
      
      Fixes: 12a8cc7f ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")
      Reported-by: NKirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: kasan-dev@googlegroups.com
      Cc: stable@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190614143149.2227-1-aryabinin@virtuozzo.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e3d10d9
    • B
      x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback · ecec31ce
      Borislav Petkov 提交于
      commit 78f4e932f7760d965fb1569025d1576ab77557c5 upstream.
      
      Adric Blake reported the following warning during suspend-resume:
      
        Enabling non-boot CPUs ...
        x86: Booting SMP configuration:
        smpboot: Booting Node 0 Processor 1 APIC 0x2
        unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
         at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
        Call Trace:
         intel_set_tfa
         intel_pmu_cpu_starting
         ? x86_pmu_dead_cpu
         x86_pmu_starting_cpu
         cpuhp_invoke_callback
         ? _raw_spin_lock_irqsave
         notify_cpu_starting
         start_secondary
         secondary_startup_64
        microcode: sig=0x806ea, pf=0x80, revision=0x96
        microcode: updated to revision 0xb4, date = 2019-04-01
        CPU1 is up
      
      The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
      by microcode. The log above shows that the microcode loader callback
      happens after the PMU restoration, leading to the conjecture that
      because the microcode hasn't been updated yet, that MSR is not present
      yet, leading to the #GP.
      
      Add a microcode loader-specific hotplug vector which comes before
      the PERF vectors and thus executes earlier and makes sure the MSR is
      present.
      
      Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
      Reported-by: NAdric Blake <promarbler14@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: x86@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecec31ce
    • P
      KVM: x86/pmu: do not mask the value that is written to fixed PMUs · 9d8f338c
      Paolo Bonzini 提交于
      [ Upstream commit 2924b52117b2812e9633d5ea337333299166d373 ]
      
      According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
      each MSR may be written with any value, and the high-order 8 bits are
      sign-extended according to the value of bit 31", but the fixed counters
      in real hardware are limited to the width of the fixed counters ("bits
      beyond the width of the fixed-function counter are reserved and must be
      written as zeros").  Fix KVM to do the same.
      Reported-by: NNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      9d8f338c
    • P
      KVM: x86/pmu: mask the result of rdpmc according to the width of the counters · 04d2a113
      Paolo Bonzini 提交于
      [ Upstream commit 0e6f467ee28ec97f68c7b74e35ec1601bb1368a7 ]
      
      This patch will simplify the changes in the next, by enforcing the
      masking of the counters to RDPMC and RDMSR.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      04d2a113
  8. 15 6月, 2019 2 次提交
  9. 11 6月, 2019 2 次提交
    • J
      x86/insn-eval: Fix use-after-free access to LDT entry · b598ddc7
      Jann Horn 提交于
      commit de9f869616dd95e95c00bdd6b0fcd3421e8a4323 upstream.
      
      get_desc() computes a pointer into the LDT while holding a lock that
      protects the LDT from being freed, but then drops the lock and returns the
      (now potentially dangling) pointer to its caller.
      
      Fix it by giving the caller a copy of the LDT entry instead.
      
      Fixes: 670f928b ("x86/insn-eval: Add utility function to get segment descriptor")
      Cc: stable@vger.kernel.org
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b598ddc7
    • J
      x86/power: Fix 'nosmt' vs hibernation triple fault during resume · 4d166206
      Jiri Kosina 提交于
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command-line,
      all the HT siblings are as a result sitting in mwait or cpudile after
      going through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel, which saw 'nosmt' on its
      commandline, is going to perform resume from hibernation: if the resume
      from the hibernated image is successful, cr3 is flipped in order to point
      to the address space of the kernel that is being resumed, which in turn
      means that all the HT siblings are all of a sudden mwaiting on address
      which is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
      Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: NThomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d166206
  10. 09 6月, 2019 5 次提交
  11. 04 6月, 2019 1 次提交
  12. 31 5月, 2019 4 次提交
    • Y
      x86/mce: Handle varying MCA bank counts · 75a96196
      Yazen Ghannam 提交于
      [ Upstream commit 006c077041dc73b9490fffc4c6af5befe0687110 ]
      
      Linux reads MCG_CAP[Count] to find the number of MCA banks visible to a
      CPU. Currently, this number is the same for all CPUs and a warning is
      shown if there is a difference. The number of banks is overwritten with
      the MCG_CAP[Count] value of each following CPU that boots.
      
      According to the Intel SDM and AMD APM, the MCG_CAP[Count] value gives
      the number of banks that are available to a "processor implementation".
      The AMD BKDGs/PPRs further clarify that this value is per core. This
      value has historically been the same for every core in the system, but
      that is not an architectural requirement.
      
      Future AMD systems may have different MCG_CAP[Count] values per core,
      so the assumption that all CPUs will have the same MCG_CAP[Count] value
      will no longer be valid.
      
      Also, the first CPU to boot will allocate the struct mce_banks[] array
      using the number of banks based on its MCG_CAP[Count] value. The machine
      check handler and other functions use the global number of banks to
      iterate and index into the mce_banks[] array. So it's possible to use an
      out-of-bounds index on an asymmetric system where a following CPU sees a
      MCG_CAP[Count] value greater than its predecessors.
      
      Thus, allocate the mce_banks[] array to the maximum number of banks.
      This will avoid the potential out-of-bounds index since the value of
      mca_cfg.banks is capped to MAX_NR_BANKS.
      
      Set the value of mca_cfg.banks equal to the max of the previous value
      and the value for the current CPU. This way mca_cfg.banks will always
      represent the max number of banks detected on any CPU in the system.
      
      This will ensure that all CPUs will access all the banks that are
      visible to them. A CPU that can access fewer than the max number of
      banks will find the registers of the extra banks to be read-as-zero.
      
      Furthermore, print the resulting number of MCA banks in use. Do this in
      mcheck_late_init() so that the final value is printed after all CPUs
      have been initialized.
      
      Finally, get bank count from target CPU when doing injection with mce-inject
      module.
      
       [ bp: Remove out-of-bounds example, passify and cleanup commit message. ]
      Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20180727214009.78289-1-Yazen.Ghannam@amd.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      75a96196
    • T
      x86/mce: Fix machine_check_poll() tests for error types · 3d036cba
      Tony Luck 提交于
      [ Upstream commit f19501aa07f18268ab14f458b51c1c6b7f72a134 ]
      
      There has been a lurking "TBD" in the machine check poll routine ever
      since it was first split out from the machine check handler. The
      potential issue is that the poll routine may have just begun a read from
      the STATUS register in a machine check bank when the hardware logs an
      error in that bank and signals a machine check.
      
      That race used to be pretty small back when machine checks were
      broadcast, but the addition of local machine check means that the poll
      code could continue running and clear the error from the bank before the
      local machine check handler on another CPU gets around to reading it.
      
      Fix the code to be sure to only process errors that need to be processed
      in the poll code, leaving other logged errors alone for the machine
      check handler to find and process.
      
       [ bp: Massage a bit and flip the "== 0" check to the usual !(..) test. ]
      
      Fixes: b79109c3 ("x86, mce: separate correct machine check poller and fatal exception handler")
      Fixes: ed7290d0 ("x86, mce: implement new status bits")
      Reported-by: NAshok Raj <ashok.raj@intel.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Link: https://lkml.kernel.org/r/20190312170938.GA23035@agluck-deskSigned-off-by: NSasha Levin <sashal@kernel.org>
      3d036cba
    • P
      x86/uaccess: Fix up the fixup · fc242af8
      Peter Zijlstra 提交于
      [ Upstream commit b69656fa7ea2f75e47d7bd5b9430359fa46488af ]
      
      New tooling got confused about this:
      
        arch/x86/lib/memcpy_64.o: warning: objtool: .fixup+0x7: return with UACCESS enabled
      
      While the code isn't wrong, it is tedious (if at all possible) to
      figure out what function a particular chunk of .fixup belongs to.
      
      This then confuses the objtool uaccess validation. Instead of
      returning directly from the .fixup, jump back into the right function.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      fc242af8
    • P
      x86/ia32: Fix ia32_restore_sigcontext() AC leak · 5007453c
      Peter Zijlstra 提交于
      [ Upstream commit 67a0514afdbb8b2fc70b771b8c77661a9cb9d3a9 ]
      
      Objtool spotted that we call native_load_gs_index() with AC set.
      Re-arrange the code to avoid that.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5007453c