1. 24 11月, 2019 4 次提交
  2. 21 11月, 2019 4 次提交
  3. 13 11月, 2019 11 次提交
  4. 29 10月, 2019 2 次提交
    • S
      x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu · 4dedaa73
      Sean Christopherson 提交于
      commit 7a22e03b0c02988e91003c505b34d752a51de344 upstream.
      
      Check that the per-cpu cluster mask pointer has been set prior to
      clearing a dying cpu's bit.  The per-cpu pointer is not set until the
      target cpu reaches smp_callin() during CPUHP_BRINGUP_CPU, whereas the
      teardown function, x2apic_dead_cpu(), is associated with the earlier
      CPUHP_X2APIC_PREPARE.  If an error occurs before the cpu is awakened,
      e.g. if do_boot_cpu() itself fails, x2apic_dead_cpu() will dereference
      the NULL pointer and cause a panic.
      
        smpboot: do_boot_cpu failed(-22) to wakeup CPU#1
        BUG: kernel NULL pointer dereference, address: 0000000000000008
        RIP: 0010:x2apic_dead_cpu+0x1a/0x30
        Call Trace:
         cpuhp_invoke_callback+0x9a/0x580
         _cpu_up+0x10d/0x140
         do_cpu_up+0x69/0xb0
         smp_init+0x63/0xa9
         kernel_init_freeable+0xd7/0x229
         ? rest_init+0xa0/0xa0
         kernel_init+0xa/0x100
         ret_from_fork+0x35/0x40
      
      Fixes: 023a6117 ("x86/apic/x2apic: Simplify cluster management")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20191001205019.5789-1-sean.j.christopherson@intel.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4dedaa73
    • S
      x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area · 17099172
      Steve Wahl 提交于
      commit 2aa85f246c181b1fa89f27e8e20c5636426be624 upstream.
      
      Our hardware (UV aka Superdome Flex) has address ranges marked
      reserved by the BIOS. Access to these ranges is caught as an error,
      causing the BIOS to halt the system.
      
      Initial page tables mapped a large range of physical addresses that
      were not checked against the list of BIOS reserved addresses, and
      sometimes included reserved addresses in part of the mapped range.
      Including the reserved range in the map allowed processor speculative
      accesses to the reserved range, triggering a BIOS halt.
      
      Used early in booting, the page table level2_kernel_pgt addresses 1
      GiB divided into 2 MiB pages, and it was set up to linearly map a full
       1 GiB of physical addresses that included the physical address range
      of the kernel image, as chosen by KASLR.  But this also included a
      large range of unused addresses on either side of the kernel image.
      And unlike the kernel image's physical address range, this extra
      mapped space was not checked against the BIOS tables of usable RAM
      addresses.  So there were times when the addresses chosen by KASLR
      would result in processor accessible mappings of BIOS reserved
      physical addresses.
      
      The kernel code did not directly access any of this extra mapped
      space, but having it mapped allowed the processor to issue speculative
      accesses into reserved memory, causing system halts.
      
      This was encountered somewhat rarely on a normal system boot, and much
      more often when starting the crash kernel if "crashkernel=512M,high"
      was specified on the command line (this heavily restricts the physical
      address of the crash kernel, in our case usually within 1 GiB of
      reserved space).
      
      The solution is to invalidate the pages of this table outside the kernel
      image's space before the page table is activated. It fixes this problem
      on our hardware.
      
       [ bp: Touchups. ]
      Signed-off-by: NSteve Wahl <steve.wahl@hpe.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: dimitri.sivanich@hpe.com
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jordan Borgner <mail@jordan-borgner.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: mike.travis@hpe.com
      Cc: russ.anderson@hpe.com
      Cc: stable@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Link: https://lkml.kernel.org/r/9c011ee51b081534a7a15065b1681d200298b530.1569358539.git.steve.wahl@hpe.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      17099172
  5. 05 10月, 2019 4 次提交
    • N
      x86/apic/vector: Warn when vector space exhaustion breaks affinity · b6194965
      Neil Horman 提交于
      [ Upstream commit 743dac494d61d991967ebcfab92e4f80dc7583b3 ]
      
      On x86, CPUs are limited in the number of interrupts they can have affined
      to them as they only support 256 interrupt vectors per CPU. 32 vectors are
      reserved for the CPU and the kernel reserves another 22 for internal
      purposes. That leaves 202 vectors for assignement to devices.
      
      When an interrupt is set up or the affinity is changed by the kernel or the
      administrator, the vector assignment code attempts to honor the requested
      affinity mask. If the vector space on the CPUs in that affinity mask is
      exhausted the code falls back to a wider set of CPUs and assigns a vector
      on a CPU outside of the requested affinity mask silently.
      
      While the effective affinity is reflected in the corresponding
      /proc/irq/$N/effective_affinity* files the silent breakage of the requested
      affinity can lead to unexpected behaviour for administrators.
      
      Add a pr_warn() when this happens so that adminstrators get at least
      informed about it in the syslog.
      
      [ tglx: Massaged changelog and made the pr_warn() more informative ]
      
      Reported-by: djuran@redhat.com
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: djuran@redhat.com
      Link: https://lkml.kernel.org/r/20190822143421.9535-1-nhorman@tuxdriver.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      b6194965
    • T
      x86/apic: Soft disable APIC before initializing it · b40c15c2
      Thomas Gleixner 提交于
      [ Upstream commit 2640da4cccf5cc613bf26f0998b9e340f4b5f69c ]
      
      If the APIC was already enabled on entry of setup_local_APIC() then
      disabling it soft via the SPIV register makes a lot of sense.
      
      That masks all LVT entries and brings it into a well defined state.
      
      Otherwise previously enabled LVTs which are not touched in the setup
      function stay unmasked and might surprise the just booting kernel.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190722105219.068290579@linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
      b40c15c2
    • G
      x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI fails · ce7fdd5c
      Grzegorz Halat 提交于
      [ Upstream commit 747d5a1bf293dcb33af755a6d285d41b8c1ea010 ]
      
      A reboot request sends an IPI via the reboot vector and waits for all other
      CPUs to stop. If one or more CPUs are in critical regions with interrupts
      disabled then the IPI is not handled on those CPUs and the shutdown hangs
      if native_stop_other_cpus() is called with the wait argument set.
      
      Such a situation can happen when one CPU was stopped within a lock held
      section and another CPU is trying to acquire that lock with interrupts
      disabled. There are other scenarios which can cause such a lockup as well.
      
      In theory the shutdown should be attempted by an NMI IPI after the timeout
      period elapsed. Though the wait loop after sending the reboot vector IPI
      prevents this. It checks the wait request argument and the timeout. If wait
      is set, which is true for sys_reboot() then it won't fall through to the
      NMI shutdown method after the timeout period has finished.
      
      This was an oversight when the NMI shutdown mechanism was added to handle
      the 'reboot IPI is not working' situation. The mechanism was added to deal
      with stuck panic shutdowns, which do not have the wait request set, so the
      'wait request' case was probably not considered.
      
      Remove the wait check from the post reboot vector IPI wait loop and enforce
      that the wait loop in the NMI fallback path is invoked even if NMI IPIs are
      disabled or the registration of the NMI handler fails. That second wait
      loop will then hang if not all CPUs shutdown and the wait argument is set.
      
      [ tglx: Avoid the hard to parse line break in the NMI fallback path,
        	add comments and massage the changelog ]
      
      Fixes: 7d007d21 ("x86/reboot: Use NMI to assist in shutting down if IRQ fails")
      Signed-off-by: NGrzegorz Halat <ghalat@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Don Zickus <dzickus@redhat.com>
      Link: https://lkml.kernel.org/r/20190628122813.15500-1-ghalat@redhat.comSigned-off-by: NSasha Levin <sashal@kernel.org>
      ce7fdd5c
    • T
      x86/apic: Make apic_pending_intr_clear() more robust · d29c7b8b
      Thomas Gleixner 提交于
      [ Upstream commit cc8bf191378c1da8ad2b99cf470ee70193ace84e ]
      
      In course of developing shorthand based IPI support issues with the
      function which tries to clear eventually pending ISR bits in the local APIC
      were observed.
      
        1) O-day testing triggered the WARN_ON() in apic_pending_intr_clear().
      
           This warning is emitted when the function fails to clear pending ISR
           bits or observes pending IRR bits which are not delivered to the CPU
           after the stale ISR bit(s) are ACK'ed.
      
           Unfortunately the function only emits a WARN_ON() and fails to dump
           the IRR/ISR content. That's useless for debugging.
      
           Feng added spot on debug printk's which revealed that the stale IRR
           bit belonged to the APIC timer interrupt vector, but adding ad hoc
           debug code does not help with sporadic failures in the field.
      
           Rework the loop so the full IRR/ISR contents are saved and on failure
           dumped.
      
        2) The loop termination logic is interesting at best.
      
           If the machine has no TSC or cpu_khz is not known yet it tries 1
           million times to ack stale IRR/ISR bits. What?
      
           With TSC it uses the TSC to calculate the loop termination. It takes a
           timestamp at entry and terminates the loop when:
      
           	  (rdtsc() - start_timestamp) >= (cpu_hkz << 10)
      
           That's roughly one second.
      
           Both methods are problematic. The APIC has 256 vectors, which means
           that in theory max. 256 IRR/ISR bits can be set. In practice this is
           impossible and the chance that more than a few bits are set is close
           to zero.
      
           With the pure loop based approach the 1 million retries are complete
           overkill.
      
           With TSC this can terminate too early in a guest which is running on a
           heavily loaded host even with only a couple of IRR/ISR bits set. The
           reason is that after acknowledging the highest priority ISR bit,
           pending IRRs must get serviced first before the next round of
           acknowledge can take place as the APIC (real and virtualized) does not
           honour EOI without a preceeding interrupt on the CPU. And every APIC
           read/write takes a VMEXIT if the APIC is virtualized. While trying to
           reproduce the issue 0-day reported it was observed that the guest was
           scheduled out long enough under heavy load that it terminated after 8
           iterations.
      
           Make the loop terminate after 512 iterations. That's plenty enough
           in any case and does not take endless time to complete.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20190722105219.158847694@linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
      d29c7b8b
  6. 21 9月, 2019 1 次提交
  7. 16 9月, 2019 3 次提交
    • P
      x86/kvmclock: set offset for kvm unstable clock · 1d60902a
      Pavel Tatashin 提交于
      [ Upstream commit b5179ec4187251a751832193693d6e474d3445ac ]
      
      VMs may show incorrect uptime and dmesg printk offsets on hypervisors with
      unstable clock. The problem is produced when VM is rebooted without exiting
      from qemu.
      
      The fix is to calculate clock offset not only for stable clock but for
      unstable clock as well, and use kvm_sched_clock_read() which substracts
      the offset for both clocks.
      
      This is safe, because pvclock_clocksource_read() does the right thing and
      makes sure that clock always goes forward, so once offset is calculated
      with unstable clock, we won't get new reads that are smaller than offset,
      and thus won't get negative results.
      
      Thank you Jon DeVree for helping to reproduce this issue.
      
      Fixes: 857baa87 ("sched/clock: Enable sched clock early")
      Cc: stable@vger.kernel.org
      Reported-by: NDominique Martinet <asmadeus@codewreck.org>
      Signed-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1d60902a
    • Z
      x86, hibernate: Fix nosave_regions setup for hibernation · 4d970758
      Zhimin Gu 提交于
      [ Upstream commit cc55f7537db6af371e9c1c6a71161ee40f918824 ]
      
      On 32bit systems, nosave_regions(non RAM areas) located between
      max_low_pfn and max_pfn are not excluded from hibernation snapshot
      currently, which may result in a machine check exception when
      trying to access these unsafe regions during hibernation:
      
      [  612.800453] Disabling lock debugging due to kernel taint
      [  612.805786] mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 6: fe00000000801136
      [  612.814344] mce: [Hardware Error]: RIP !INEXACT! 60:<00000000d90be566> {swsusp_save+0x436/0x560}
      [  612.823167] mce: [Hardware Error]: TSC 1f5939fe276 ADDR dd000000 MISC 30e0000086
      [  612.830677] mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1529487426 SOCKET 0 APIC 0 microcode 24
      [  612.839581] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
      [  612.846394] mce: [Hardware Error]: Machine check: Processor context corrupt
      [  612.853380] Kernel panic - not syncing: Fatal machine check
      [  612.858978] Kernel Offset: 0x18000000 from 0xc1000000 (relocation range: 0xc0000000-0xf7ffdfff)
      
      This is because on 32bit systems, pages above max_low_pfn are regarded
      as high memeory, and accessing unsafe pages might cause expected MCE.
      On the problematic 32bit system, there are reserved memory above low
      memory, which triggered the MCE:
      
      e820 memory mapping:
      [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
      [    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000d160cfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d160d000-0x00000000d1613fff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000d1614000-0x00000000d1a44fff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d1a45000-0x00000000d1ecffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000d1ed0000-0x00000000d7eeafff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d7eeb000-0x00000000d7ffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000d8000000-0x00000000d875ffff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d8760000-0x00000000d87fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000d8800000-0x00000000d8fadfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000d8fae000-0x00000000d8ffffff] ACPI data
      [    0.000000] BIOS-e820: [mem 0x00000000d9000000-0x00000000da71bfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000da71c000-0x00000000da7fffff] ACPI NVS
      [    0.000000] BIOS-e820: [mem 0x00000000da800000-0x00000000dbb8bfff] usable
      [    0.000000] BIOS-e820: [mem 0x00000000dbb8c000-0x00000000dbffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000dd000000-0x00000000df1fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed00000-0x00000000fed03fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000041edfffff] usable
      
      Fix this problem by changing pfn limit from max_low_pfn to max_pfn.
      This fix does not impact 64bit system because on 64bit max_low_pfn
      is the same as max_pfn.
      Signed-off-by: NZhimin Gu <kookoo.gu@intel.com>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NChen Yu <yu.c.chen@intel.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4d970758
    • S
      x86/ftrace: Fix warning and considate ftrace_jmp_replace() and ftrace_call_replace() · 85a24825
      Steven Rostedt (VMware) 提交于
      [ Upstream commit 745cfeaac09ce359130a5451d90cb0bd4094c290 ]
      
      Arnd reported the following compiler warning:
      
      arch/x86/kernel/ftrace.c:669:23: error: 'ftrace_jmp_replace' defined but not used [-Werror=unused-function]
      
      The ftrace_jmp_replace() function now only has a single user and should be
      simply moved by that user. But looking at the code, it shows that
      ftrace_jmp_replace() is similar to ftrace_call_replace() except that instead
      of using the opcode of 0xe8 it uses 0xe9. It makes more sense to consolidate
      that function into one implementation that both ftrace_jmp_replace() and
      ftrace_call_replace() use by passing in the op code separate.
      
      The structure in ftrace_code_union is also modified to replace the "e8"
      field with the more appropriate name "op".
      
      Cc: stable@vger.kernel.org
      Reported-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      Link: http://lkml.kernel.org/r/20190304200748.1418790-1-arnd@arndb.de
      Fixes: d2a68c4effd8 ("x86/ftrace: Do not call function graph from dynamic trampolines")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      85a24825
  8. 10 9月, 2019 1 次提交
    • L
      Revert "x86/apic: Include the LDR when clearing out APIC registers" · 991467a4
      Linus Torvalds 提交于
      [ Upstream commit 950b07c14e8c59444e2359f15fd70ed5112e11a0 ]
      
      This reverts commit 558682b5291937a70748d36fd9ba757fb25b99ae.
      
      Chris Wilson reports that it breaks his CPU hotplug test scripts.  In
      particular, it breaks offlining and then re-onlining the boot CPU, which
      we treat specially (and the BIOS does too).
      
      The symptoms are that we can offline the CPU, but it then does not come
      back online again:
      
          smpboot: CPU 0 is now offline
          smpboot: Booting Node 0 Processor 0 APIC 0x0
          smpboot: do_boot_cpu failed(-1) to wakeup CPU#0
      
      Thomas says he knows why it's broken (my personal suspicion: our magic
      handling of the "cpu0_logical_apicid" thing), but for 5.3 the right fix
      is to just revert it, since we've never touched the LDR bits before, and
      it's not worth the risk to do anything else at this stage.
      
      [ Hotpluging of the boot CPU is special anyway, and should be off by
        default. See the "BOOTPARAM_HOTPLUG_CPU0" config option and the
        cpu0_hotplug kernel parameter.
      
        In general you should not do it, and it has various known limitations
        (hibernate and suspend require the boot CPU, for example).
      
        But it should work, even if the boot CPU is special and needs careful
        treatment       - Linus ]
      
      Link: https://lore.kernel.org/lkml/156785100521.13300.14461504732265570003@skylake-alporthouse-com/Reported-by: NChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Bandan Das <bsd@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      991467a4
  9. 06 9月, 2019 4 次提交
    • G
      x86/ptrace: fix up botched merge of spectrev1 fix · b307f99d
      Greg Kroah-Hartman 提交于
      I incorrectly merged commit 31a2fbb390fe ("x86/ptrace: Fix possible
      spectre-v1 in ptrace_get_debugreg()") when backporting it, as was
      graciously pointed out at
      https://grsecurity.net/teardown_of_a_failed_linux_lts_spectre_fix.php
      
      Resolve the upstream difference with the stable kernel merge to properly
      protect things.
      Reported-by: NBrad Spengler <spender@grsecurity.net>
      Cc: Dianzhang Chen <dianzhangchen0@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <bp@alien8.de>
      Cc: <hpa@zytor.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b307f99d
    • B
      x86/apic: Include the LDR when clearing out APIC registers · edc454cd
      Bandan Das 提交于
      commit 558682b5291937a70748d36fd9ba757fb25b99ae upstream.
      
      Although APIC initialization will typically clear out the LDR before
      setting it, the APIC cleanup code should reset the LDR.
      
      This was discovered with a 32-bit KVM guest jumping into a kdump
      kernel. The stale bits in the LDR triggered a bug in the KVM APIC
      implementation which caused the destination mapping for VCPUs to be
      corrupted.
      
      Note that this isn't intended to paper over the KVM APIC bug. The kernel
      has to clear the LDR when resetting the APIC registers except when X2APIC
      is enabled.
      
      This lacks a Fixes tag because missing to clear LDR goes way back into pre
      git history.
      
      [ tglx: Made x2apic_enabled a function call as required ]
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      edc454cd
    • B
      x86/apic: Do not initialize LDR and DFR for bigsmp · 95983265
      Bandan Das 提交于
      commit bae3a8d3308ee69a7dbdf145911b18dfda8ade0d upstream.
      
      Legacy apic init uses bigsmp for smp systems with 8 and more CPUs. The
      bigsmp APIC implementation uses physical destination mode, but it
      nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with
      multiple bit being set.
      
      This does not cause a functional problem because LDR and DFR are ignored
      when physical destination mode is active, but it triggered a problem on a
      32-bit KVM guest which jumps into a kdump kernel.
      
      The multiple bits set unearthed a bug in the KVM APIC implementation. The
      code which creates the logical destination map for VCPUs ignores the
      disabled state of the APIC and ends up overwriting an existing valid entry
      and as a result, APIC calibration hangs in the guest during kdump
      initialization.
      
      Remove the bogus LDR/DFR initialization.
      
      This is not intended to work around the KVM APIC bug. The LDR/DFR
      ininitalization is wrong on its own.
      
      The issue goes back into the pre git history. The fixes tag is the commit
      in the bitkeeper import which introduced bigsmp support in 2003.
      
        git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      
      Fixes: db7b9e9f26b8 ("[PATCH] Clustered APIC setup for >8 CPU systems")
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190826101513.5080-2-bsd@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      95983265
    • S
      uprobes/x86: Fix detection of 32-bit user mode · 941d875c
      Sebastian Mayr 提交于
      commit 9212ec7d8357ea630031e89d0d399c761421c83b upstream.
      
      32-bit processes running on a 64-bit kernel are not always detected
      correctly, causing the process to crash when uretprobes are installed.
      
      The reason for the crash is that in_ia32_syscall() is used to determine the
      process's mode, which only works correctly when called from a syscall.
      
      In the case of uretprobes, however, the function is called from a exception
      and always returns 'false' on a 64-bit kernel. In consequence this leads to
      corruption of the process's return address.
      
      Fix this by using user_64bit_mode() instead of in_ia32_syscall(), which
      is correct in any situation.
      
      [ tglx: Add a comment and the following historical info ]
      
      This should have been detected by the rename which happened in commit
      
        abfb9498 ("x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()")
      
      which states in the changelog:
      
          The is_ia32_task()/is_x32_task() function names are a big misnomer: they
          suggests that the compat-ness of a system call is a task property, which
          is not true, the compatness of a system call purely depends on how it
          was invoked through the system call layer.
          .....
      
      and then it went and blindly renamed every call site.
      
      Sadly enough this was already mentioned here:
      
         8faaed1b ("uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and
      arch_uretprobe_hijack_return_addr()")
      
      where the changelog says:
      
          TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
          not necessarily mean 32bit. Fortunately syscall-like insns can't be
          probed so it actually works, but it would be better to rename and
          use is_ia32_frame().
      
      and goes all the way back to:
      
          0326f5a9 ("uprobes/core: Handle breakpoint and singlestep exceptions")
      
      Oh well. 7+ years until someone actually tried a uretprobe on a 32bit
      process on a 64bit kernel....
      
      Fixes: 0326f5a9 ("uprobes/core: Handle breakpoint and singlestep exceptions")
      Signed-off-by: NSebastian Mayr <me@sam.st>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190728152617.7308-1-me@sam.stSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      941d875c
  10. 29 8月, 2019 2 次提交
    • T
      x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · e063b03b
      Tom Lendacky 提交于
      commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24 upstream.
      
      There have been reports of RDRAND issues after resuming from suspend on
      some AMD family 15h and family 16h systems. This issue stems from a BIOS
      not performing the proper steps during resume to ensure RDRAND continues
      to function properly.
      
      RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
      reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
      support using CPUID, including the kernel, will believe that RDRAND is
      not supported.
      
      Update the CPU initialization to clear the RDRAND CPUID bit for any family
      15h and 16h processor that supports RDRAND. If it is known that the family
      15h or family 16h system does not have an RDRAND resume issue or that the
      system will not be placed in suspend, the "rdrand=force" kernel parameter
      can be used to stop the clearing of the RDRAND CPUID bit.
      
      Additionally, update the suspend and resume path to save and restore the
      MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
      place after resuming from suspend.
      
      Note, that clearing the RDRAND CPUID bit does not prevent a processor
      that normally supports the RDRAND instruction from executing it. So any
      code that determined the support based on family and model won't #UD.
      Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
      Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "x86@kernel.org" <x86@kernel.org>
      Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e063b03b
    • T
      x86/apic: Handle missing global clockevent gracefully · 685e598e
      Thomas Gleixner 提交于
      commit f897e60a12f0b9146357780d317879bce2a877dc upstream.
      
      Some newer machines do not advertise legacy timers. The kernel can handle
      that situation if the TSC and the CPU frequency are enumerated by CPUID or
      MSRs and the CPU supports TSC deadline timer. If the CPU does not support
      TSC deadline timer the local APIC timer frequency has to be known as well.
      
      Some Ryzens machines do not advertize legacy timers, but there is no
      reliable way to determine the bus frequency which feeds the local APIC
      timer when the machine allows overclocking of that frequency.
      
      As there is no legacy timer the local APIC timer calibration crashes due to
      a NULL pointer dereference when accessing the not installed global clock
      event device.
      
      Switch the calibration loop to a non interrupt based one, which polls
      either TSC (if frequency is known) or jiffies. The latter requires a global
      clockevent. As the machines which do not have a global clockevent installed
      have a known TSC frequency this is a non issue. For older machines where
      TSC frequency is not known, there is no known case where the legacy timers
      do not exist as that would have been reported long ago.
      Reported-by: NDaniel Drake <drake@endlessm.com>
      Reported-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908091443030.21433@nanos.tec.linutronix.de
      Link: http://bugzilla.opensuse.org/show_bug.cgi?id=1142926#c12Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      685e598e
  11. 07 8月, 2019 4 次提交
    • T
      x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS · b88241ae
      Thomas Gleixner 提交于
      commit f36cf386e3fec258a341d446915862eded3e13d8 upstream
      
      Intel provided the following information:
      
       On all current Atom processors, instructions that use a segment register
       value (e.g. a load or store) will not speculatively execute before the
       last writer of that segment retires. Thus they will not use a
       speculatively written segment value.
      
      That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS
      entry paths can be excluded from the extra LFENCE if PTI is disabled.
      
      Create a separate bug flag for the through SWAPGS speculation and mark all
      out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs
      are excluded from the whole mitigation mess anyway.
      Reported-by: NAndrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NTyler Hicks <tyhicks@canonical.com>
      Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b88241ae
    • J
      x86/speculation: Enable Spectre v1 swapgs mitigations · 23e7a7b3
      Josh Poimboeuf 提交于
      commit a2059825986a1c8143fd6698774fa9d83733bb11 upstream
      
      The previous commit added macro calls in the entry code which mitigate the
      Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
      enabled.  Enable those features where applicable.
      
      The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
      
      There are different features which can affect the risk of attack:
      
      - When FSGSBASE is enabled, unprivileged users are able to place any
        value in GS, using the wrgsbase instruction.  This means they can
        write a GS value which points to any value in kernel space, which can
        be useful with the following gadget in an interrupt/exception/NMI
        handler:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	// dependent load or store based on the value of %reg
      	// for example: mov %(reg1), %reg2
      
        If an interrupt is coming from user space, and the entry code
        speculatively skips the swapgs (due to user branch mistraining), it
        may speculatively execute the GS-based load and a subsequent dependent
        load or store, exposing the kernel data to an L1 side channel leak.
      
        Note that, on Intel, a similar attack exists in the above gadget when
        coming from kernel space, if the swapgs gets speculatively executed to
        switch back to the user GS.  On AMD, this variant isn't possible
        because swapgs is serializing with respect to future GS-based
        accesses.
      
        NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
      	doesn't exist quite yet.
      
      - When FSGSBASE is disabled, the issue is mitigated somewhat because
        unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
        restricts GS values to user space addresses only.  That means the
        gadget would need an additional step, since the target kernel address
        needs to be read from user space first.  Something like:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	mov (%reg1), %reg2
      	// dependent load or store based on the value of %reg2
      	// for example: mov %(reg2), %reg3
      
        It's difficult to audit for this gadget in all the handlers, so while
        there are no known instances of it, it's entirely possible that it
        exists somewhere (or could be introduced in the future).  Without
        tooling to analyze all such code paths, consider it vulnerable.
      
        Effects of SMAP on the !FSGSBASE case:
      
        - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
          susceptible to Meltdown), the kernel is prevented from speculatively
          reading user space memory, even L1 cached values.  This effectively
          disables the !FSGSBASE attack vector.
      
        - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
          still prevents the kernel from speculatively reading user space
          memory.  But it does *not* prevent the kernel from reading the
          user value from L1, if it has already been cached.  This is probably
          only a small hurdle for an attacker to overcome.
      
      Thanks to Dave Hansen for contributing the speculative_smap() function.
      
      Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
      is serializing on AMD.
      
      [ tglx: Fixed the USER fence decision and polished the comment as suggested
        	by Dave Hansen ]
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      23e7a7b3
    • F
      x86/cpufeatures: Combine word 11 and 12 into a new scattered features word · b5dd7f61
      Fenghua Yu 提交于
      commit acec0ce081de0c36459eea91647faf99296445a3 upstream
      
      It's a waste for the four X86_FEATURE_CQM_* feature bits to occupy two
      whole feature bits words. To better utilize feature words, re-define
      word 11 to host scattered features and move the four X86_FEATURE_CQM_*
      features into Linux defined word 11. More scattered features can be
      added in word 11 in the future.
      
      Rename leaf 11 in cpuid_leafs to CPUID_LNX_4 to reflect it's a
      Linux-defined leaf.
      
      Rename leaf 12 as CPUID_DUMMY which will be replaced by a meaningful
      name in the next patch when CPUID.7.1:EAX occupies world 12.
      
      Maximum number of RMID and cache occupancy scale are retrieved from
      CPUID.0xf.1 after scattered CQM features are enumerated. Carve out the
      code into a separate function.
      
      KVM doesn't support resctrl now. So it's safe to move the
      X86_FEATURE_CQM_* features to scattered features word 11 for KVM.
      Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Aaron Lewis <aaronlewis@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Babu Moger <babu.moger@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: "Sean J Christopherson" <sean.j.christopherson@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: kvm ML <kvm@vger.kernel.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: x86 <x86@kernel.org>
      Link: https://lkml.kernel.org/r/1560794416-217638-2-git-send-email-fenghua.yu@intel.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b5dd7f61
    • B
      x86/cpufeatures: Carve out CQM features retrieval · 16ad0b63
      Borislav Petkov 提交于
      commit 45fc56e629caa451467e7664fbd4c797c434a6c4 upstream
      
      ... into a separate function for better readability. Split out from a
      patch from Fenghua Yu <fenghua.yu@intel.com> to keep the mechanical,
      sole code movement separate for easy review.
      
      No functional changes.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: x86@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      16ad0b63