1. 15 August 2019 (6 commits)
    • x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS · f5d27995
      Authored by Thomas Gleixner
      commit f36cf386e3fec258a341d446915862eded3e13d8 upstream.
      
      Intel provided the following information:
      
       On all current Atom processors, instructions that use a segment register
       value (e.g. a load or store) will not speculatively execute before the
       last writer of that segment retires. Thus they will not use a
       speculatively written segment value.
      
      That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS
      entry paths can be excluded from the extra LFENCE if PTI is disabled.
      
      Create a separate bug flag for speculation through SWAPGS and mark all
      out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs
      are excluded from the whole mitigation mess anyway. (A sketch of the
      resulting whitelist check follows this entry.)
      Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      f5d27995
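      A minimal, stand-alone sketch of the whitelist check described in this
      entry. The NO_SWAPGS and X86_BUG_SWAPGS names follow the commit text; the
      helpers are stubbed so the example compiles on its own and is not a quote
      of the actual patch:

      	#include <stdio.h>
      	#include <stdbool.h>

      	#define NO_SWAPGS       (1UL << 0)  /* whitelist bit, name from the commit text */
      	#define X86_BUG_SWAPGS  1           /* bug flag, value arbitrary for this mock */

      	static unsigned long cpu_vuln_whitelist = NO_SWAPGS;  /* pretend: an out-of-order Atom */

      	static bool cpu_matches(unsigned long which)
      	{
      		return cpu_vuln_whitelist & which;
      	}

      	static void setup_force_cpu_bug(int bug)
      	{
      		printf("bug %d set: SWAPGS entry paths keep the extra LFENCE\n", bug);
      	}

      	int main(void)
      	{
      		if (!cpu_matches(NO_SWAPGS))
      			setup_force_cpu_bug(X86_BUG_SWAPGS);
      		else
      			printf("whitelisted: no speculation through SWAPGS, LFENCE not needed\n");
      		return 0;
      	}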
    • x86/entry/64: Use JMP instead of JMPQ · 49e1ce98
      Authored by Josh Poimboeuf
      commit 64dbc122b20f75183d8822618c24f85144a5a94d upstream.
      
      Somehow the swapgs mitigation entry code patch ended up with a JMPQ
      instruction instead of JMP, where only the short jump is needed.  Some
      assembler versions apparently fail to optimize JMPQ into a two-byte JMP
      when possible, instead always using a 7-byte JMP with relocation.  For
      some reason that makes the entry code explode with a #GP during boot.
      
      Change it back to "JMP" as originally intended.
      
      Fixes: 18ec54fdd6d1 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations")
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      49e1ce98
    • x86/speculation: Enable Spectre v1 swapgs mitigations · 12351a6c
      Authored by Josh Poimboeuf
      commit a2059825986a1c8143fd6698774fa9d83733bb11 upstream.
      
      The previous commit added macro calls in the entry code which mitigate the
      Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
      enabled.  Enable those features where applicable.
      
      The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
      
      There are different features which can affect the risk of attack:
      
      - When FSGSBASE is enabled, unprivileged users are able to place any
        value in GS, using the wrgsbase instruction.  This means they can
        write a GS value which points to any value in kernel space, which can
        be useful with the following gadget in an interrupt/exception/NMI
        handler:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	// dependent load or store based on the value of %reg1
      	// for example: mov (%reg1), %reg2
      
        If an interrupt is coming from user space, and the entry code
        speculatively skips the swapgs (due to user branch mistraining), it
        may speculatively execute the GS-based load and a subsequent dependent
        load or store, exposing the kernel data to an L1 side channel leak.
      
        Note that, on Intel, a similar attack exists in the above gadget when
        coming from kernel space, if the swapgs gets speculatively executed to
        switch back to the user GS.  On AMD, this variant isn't possible
        because swapgs is serializing with respect to future GS-based
        accesses.
      
        NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
      	doesn't exist quite yet.
      
      - When FSGSBASE is disabled, the issue is mitigated somewhat because
        unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
        restricts GS values to user space addresses only.  That means the
        gadget would need an additional step, since the target kernel address
        needs to be read from user space first.  Something like:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	mov (%reg1), %reg2
      	// dependent load or store based on the value of %reg2
      	// for example: mov (%reg2), %reg3
      
        It's difficult to audit for this gadget in all the handlers, so while
        there are no known instances of it, it's entirely possible that it
        exists somewhere (or could be introduced in the future).  Without
        tooling to analyze all such code paths, consider it vulnerable.
      
        Effects of SMAP on the !FSGSBASE case:
      
        - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
          susceptible to Meltdown), the kernel is prevented from speculatively
          reading user space memory, even L1 cached values.  This effectively
          disables the !FSGSBASE attack vector.
      
        - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
          still prevents the kernel from speculatively reading user space
          memory.  But it does *not* prevent the kernel from reading the
          user value from L1, if it has already been cached.  This is probably
          only a small hurdle for an attacker to overcome.
      
      Thanks to Dave Hansen for contributing the speculative_smap() function.
      
      Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
      is serializing on AMD.
      
      [ tglx: Fixed the USER fence decision and polished the comment as suggested
        	by Dave Hansen ]
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      12351a6c
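      Whether these mitigations ended up active on a given machine can be checked
      from user space. Below is a small, self-contained C reader of the standard
      sysfs vulnerabilities file; the exact mitigation string (e.g. the mention of
      swapgs barriers) varies by kernel version and is an assumption here:

      	#include <stdio.h>

      	int main(void)
      	{
      		char line[256];
      		FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/spectre_v1", "r");

      		if (!f) {
      			perror("spectre_v1");
      			return 1;
      		}
      		if (fgets(line, sizeof(line), f))
      			printf("spectre_v1: %s", line);  /* e.g. "... swapgs barriers ..." */
      		fclose(f);
      		return 0;
      	}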
    • x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations · 06e45d3a
      Authored by Josh Poimboeuf
      commit 18ec54fdd6d18d92025af097cd042a75cf0ea24c upstream.
      
      Spectre v1 isn't only about array bounds checks.  It can affect any
      conditional checks.  The interrupt, exception, and NMI handlers in the
      kernel entry code all have conditional swapgs checks.  Those may be
      problematic in the context of Spectre v1, as kernel code can speculatively
      run with a user GS.
      
      For example:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg
      	mov (%reg), %reg1
      
      When coming from user space, the CPU can speculatively skip the swapgs, and
      then do a speculative percpu load using the user GS value.  So the user can
      speculatively force a read of any kernel value.  If a gadget exists which
      uses the percpu value as an address in another load/store, then the
      contents of the kernel value may become visible via an L1 side channel
      attack.
      
      A similar attack exists when coming from kernel space.  The CPU can
      speculatively do the swapgs, causing the user GS to get used for the rest
      of the speculative window.
      
      The mitigation is similar to a traditional Spectre v1 mitigation, except:
      
        a) index masking isn't possible, because the index (percpu offset)
           isn't user-controlled; and
      
        b) an lfence is needed in both the "from user" swapgs path and the
           "from kernel" non-swapgs path (because of the two attacks described
           above).
      
      The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a
      CR3 write when PTI is enabled.  Since CR3 writes are serializing, the
      lfences can be skipped in those cases.
      
      On the other hand, the kernel entry swapgs paths don't depend on PTI.
      
      To avoid unnecessary lfences for the user entry case, create two separate
      features for alternative patching:
      
        X86_FEATURE_FENCE_SWAPGS_USER
        X86_FEATURE_FENCE_SWAPGS_KERNEL
      
      Use these features in entry code to patch in lfences where needed.
      
      The features aren't enabled yet, so there's no functional change.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      06e45d3a
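      The primitive those alternatives patch in is LFENCE. A small user-space C
      illustration of the barrier is sketched below; it only demonstrates the
      fence intrinsic and is not the kernel entry code:

      	#include <stdio.h>
      	#include <x86intrin.h>

      	static long table[16];

      	static long read_after_barrier(int idx)
      	{
      		_mm_lfence();        /* dispatch-serializing fence, as in the FENCE_SWAPGS_* paths */
      		return table[idx];   /* this load is not issued before the fence completes */
      	}

      	int main(void)
      	{
      		printf("%ld\n", read_after_barrier(3));
      		return 0;
      	}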
    • x86/cpufeatures: Combine word 11 and 12 into a new scattered features word · 74f65670
      Authored by Fenghua Yu
      commit acec0ce081de0c36459eea91647faf99296445a3 upstream.
      
      It's a waste for the four X86_FEATURE_CQM_* feature bits to occupy two
      whole feature words. To better utilize feature words, re-define
      word 11 to host scattered features and move the four X86_FEATURE_CQM_*
      features into Linux defined word 11. More scattered features can be
      added in word 11 in the future.
      
      Rename leaf 11 in cpuid_leafs to CPUID_LNX_4 to reflect that it is a
      Linux-defined leaf.
      
      Rename leaf 12 as CPUID_DUMMY, which will be replaced by a meaningful
      name in the next patch when CPUID.7.1:EAX occupies word 12.
      
      The maximum number of RMIDs and the cache occupancy scale are retrieved
      from CPUID.0xf.1 after the scattered CQM features are enumerated. Carve
      that code out into a separate function (the sketch after this entry shows
      the CPUID fields involved).
      
      KVM doesn't support resctrl now. So it's safe to move the
      X86_FEATURE_CQM_* features to scattered features word 11 for KVM.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Aaron Lewis <aaronlewis@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Babu Moger <babu.moger@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: "Sean J Christopherson" <sean.j.christopherson@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: kvm ML <kvm@vger.kernel.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Ravi V Shankar <ravi.v.shankar@intel.com>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: x86 <x86@kernel.org>
      Link: https://lkml.kernel.org/r/1560794416-217638-2-git-send-email-fenghua.yu@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      74f65670
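      For reference, the CPUID fields mentioned above can be read from user space.
      The sketch below queries leaf 0xF, subleaf 1; the field layout (EBX =
      occupancy conversion factor, ECX = max RMID, EDX = event bitmask) is
      recalled from the SDM and should be treated as an assumption to verify:

      	#include <stdio.h>
      	#include <cpuid.h>

      	int main(void)
      	{
      		unsigned int eax, ebx, ecx, edx;

      		if (!__get_cpuid_count(0xf, 1, &eax, &ebx, &ecx, &edx)) {
      			puts("CPUID leaf 0xF not supported");
      			return 1;
      		}
      		printf("occupancy scale factor: %u bytes per count\n", ebx);
      		printf("max RMID:               %u\n", ecx);
      		printf("event bitmask (EDX):    %#x\n", edx);
      		return 0;
      	}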
    • x86/cpufeatures: Carve out CQM features retrieval · deaf49f3
      Authored by Borislav Petkov
      commit 45fc56e629caa451467e7664fbd4c797c434a6c4 upstream.
      
      ... into a separate function for better readability. Split out from a
      patch from Fenghua Yu <fenghua.yu@intel.com> to keep the mechanical,
      sole code movement separate for easy review.
      
      No functional changes.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: x86@kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      deaf49f3
  2. 06 August 2019 (2 commits)
    • x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() · e28b4155
      Authored by Will Deacon
      commit 6e693b3ffecb0b478c7050b44a4842854154f715 upstream.
      
      Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'")
      makes the access_ok() check part of the user_access_begin() preceding a
      series of 'unsafe' accesses.  This has the desirable effect of ensuring
      that all 'unsafe' accesses have been range-checked, without having to
      pick through all of the callsites to verify whether the appropriate
      checking has been made.
      
      However, the consolidated range check does not inhibit speculation, so
      it is still up to the caller to ensure that they are not susceptible to
      any speculative side-channel attacks for user addresses that ultimately
      fail the access_ok() check.
      
      This is an oversight, so use __uaccess_begin_nospec() to ensure that
      speculation is inhibited until the access_ok() check has passed.
      Reported-by: Julien Thierry <julien.thierry@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e28b4155
    • make 'user_access_begin()' do 'access_ok()' · aba1b548
      Authored by Linus Torvalds
      commit 594cc251fdd0d231d342d88b2fdff4bc42fb0690 upstream.
      
      Originally, the rule used to be that you'd have to do access_ok()
      separately, and then user_access_begin() before actually doing the
      direct (optimized) user access.
      
      But experience has shown that people then decide not to do access_ok()
      at all, and instead rely on it being implied by other operations or
      similar.  Which makes it very hard to verify that the access has
      actually been range-checked.
      
      If you use the unsafe direct user accesses, hardware features (either
      SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
      Access Never - on ARM) do force you to use user_access_begin().  But
      nothing really forces the range check.
      
      By putting the range check into user_access_begin(), we actually force
      people to do the right thing (tm), and the range check will be visible
      near the actual accesses.  We have way too long a history of people
      trying to avoid them.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      [ Shile: fix the following conflicts by adding dummy arguments ]
      Conflicts:
      	kernel/compat.c
      	kernel/exit.c
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      aba1b548
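      A stand-alone mock of the calling convention this commit establishes: the
      caller only tests user_access_begin()'s return value, because the range
      check now lives inside it. The kernel primitives are stubbed so the sketch
      compiles in user space; only the calling pattern is the point:

      	#include <stdio.h>
      	#include <stdbool.h>
      	#include <stdint.h>
      	#include <stddef.h>

      	static bool access_ok(const void *ptr, size_t len)
      	{
      		uintptr_t p = (uintptr_t)ptr;
      		/* mock rule: reject NULL and anything that wraps or ends in the top page */
      		return p && p + len >= p && p + len < UINTPTR_MAX - 0xfff;
      	}

      	static bool user_access_begin(const void *ptr, size_t len)
      	{
      		if (!access_ok(ptr, len))
      			return false;
      		/* the real x86 helper also executes a speculation barrier here */
      		return true;
      	}

      	static void user_access_end(void) { }

      	/* the real unsafe_put_user() jumps to an error label on a fault */
      	#define unsafe_put_user(val, ptr, label)  do { *(ptr) = (val); } while (0)

      	static int put_one_value(int *uptr, int val)
      	{
      		if (!user_access_begin(uptr, sizeof(*uptr)))
      			return -14;                 /* -EFAULT */
      		unsafe_put_user(val, uptr, efault);
      		user_access_end();
      		return 0;
      	}

      	int main(void)
      	{
      		int x;
      		printf("valid pointer: %d\n", put_one_value(&x, 42));
      		printf("bogus pointer: %d\n", put_one_value((int *)(UINTPTR_MAX - 8), 42));
      		return 0;
      	}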
  3. 10 July 2019 (1 commit)
  4. 05 July 2019 (3 commits)
    • kconfig: Disable x86 clocksource watchdog · bbb76873
      Authored by Jiufei Xue
      An unstable TSC will trigger the clocksource watchdog and be disabled; as a
      result, another clocksource will be elected as the current clocksource,
      which causes performance issues on our servers.
      
      RHEL7 also disabled this feature for some issues, see changelog:
      [x86] disable clocksource watchdog (Prarit Bhargava) [914709]
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      bbb76873
    • Revert "x86/tsc: Prepare warp test for TSC adjustment" · f5f62304
      Authored by Jiufei Xue
      This reverts commit 76d3b851.
      
      The return value of check_tsc_warp() is now unused, so remove it.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      f5f62304
    • Revert "x86/tsc: Try to adjust TSC if sync test fails" · 64a17a82
      Authored by Jiufei Xue
      This reverts commit cc4db268.
      
      When we hot-add and enable a vCPU, the time inside the VM jumps and then
      the VM gets stuck.
      The dmesg looks like this:
      [   48.402948] CPU2 has been hot-added
      [   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
      [   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
      [   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
      [  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
      [  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
      [  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
      [  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
      [  102.070188] KVM setup async PF for cpu 2
      [  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
      [  102.074530] Will online and init hotplugged CPU: 2
      
      This is because the TSC of the newly added vCPU is initialized to 0 while
      the others are ahead. The guest then applies the TSC ADJUST compensation,
      which causes the time jump (see the arithmetic note after this entry).
      
      Commit bd8fab39 ("KVM: x86: fix maintaining of kvm_clock stability
      on guest CPU hotplug") can fix this problem.  However, the host kernel
      version may be older, so do not adjust the TSC if the sync test fails;
      just mark it unstable.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      64a17a82
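      For scale: the warp reported above is 139318776350 TSC cycles. Assuming a
      TSC frequency of roughly 2.6 GHz (an assumption; the log does not state it),
      that corresponds to

        139318776350 cycles / 2.6e9 cycles per second ≈ 54 seconds,

      which matches the jump from the ~48 s to the ~102 s timestamps in the dmesg
      excerpt.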
  5. 03 July 2019 (4 commits)
  6. 25 June 2019 (1 commit)
  7. 22 June 2019 (2 commits)
    • x86/CPU/AMD: Don't force the CPB cap when running under a hypervisor · a35e7822
      Authored by Frank van der Linden
      [ Upstream commit 2ac44ab608705948564791ce1d15d43ba81a1e38 ]
      
      For F17h AMD CPUs, the CPB capability ('Core Performance Boost') is forcibly set,
      because some versions of that chip incorrectly report that they do not have it.
      
      However, a hypervisor may filter out the CPB capability, for good
      reasons. For example, KVM currently does not emulate setting the CPB
      bit in MSR_K7_HWCR, and unchecked MSR access errors will be thrown
      when trying to set it as a guest:
      
      	unchecked MSR access error: WRMSR to 0xc0010015 (tried to write 0x0000000001000011) at rIP: 0xffffffff890638f4 (native_write_msr+0x4/0x20)
      
      	Call Trace:
      	boost_set_msr+0x50/0x80 [acpi_cpufreq]
      	cpuhp_invoke_callback+0x86/0x560
      	sort_range+0x20/0x20
      	cpuhp_thread_fun+0xb0/0x110
      	smpboot_thread_fn+0xef/0x160
      	kthread+0x113/0x130
      	kthread_create_worker_on_cpu+0x70/0x70
      	ret_from_fork+0x35/0x40
      
      To avoid this issue, don't forcibly set the CPB capability for a CPU
      when running under a hypervisor.
      Signed-off-by: Frank van der Linden <fllinden@amazon.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: jiaxun.yang@flygoat.com
      Fixes: 0237199186e7 ("x86/CPU/AMD: Set the CPB bit unconditionally on F17h")
      Link: http://lkml.kernel.org/r/20190522221745.GA15789@dev-dsk-fllinden-2c-c1893d73.us-west-2.amazon.com
      [ Minor edits to the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a35e7822
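      A stand-alone mock of the decision described above, only to make it
      concrete; the real patch operates on the kernel's cpufeature bits, which
      are stubbed here with plain booleans:

      	#include <stdio.h>
      	#include <stdbool.h>

      	int main(void)
      	{
      		bool cpb_reported = false;       /* pretend: F17h part that misreports CPB */
      		bool under_hypervisor = true;    /* pretend: running as a KVM guest */
      		bool cpb = cpb_reported;

      		/* apply the erratum workaround only on bare metal, so a hypervisor's
      		 * decision to filter out CPB is respected */
      		if (!cpb && !under_hypervisor)
      			cpb = true;

      		printf("CPB %s\n", cpb ? "forced on" : "left as reported by CPUID");
      		return 0;
      	}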
    • perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints · 5a9c29cc
      Authored by Stephane Eranian
      [ Upstream commit 23e3983a466cd540ffdd2bbc6e0c51e31934f941 ]
      
      This patch fixes a bug revealed by the following commit:
      
        6b89d4c1ae85 ("perf/x86/intel: Fix INTEL_FLAGS_EVENT_CONSTRAINT* masking")
      
      That patch modified INTEL_FLAGS_EVENT_CONSTRAINT() to only look at the event code
      when matching a constraint. If code+umask were needed, then the
      INTEL_FLAGS_UEVENT_CONSTRAINT() macro was needed instead.
      This broke some of the constraints for PEBS events.
      
      Several of them, including the one used for cycles:p, cycles:pp, cycles:ppp,
      fell into that category and caused the event to be rejected in PEBS mode.
      In other words, on some platforms a cmdline such as:
      
        $ perf top -e cycles:pp
      
      would fail with -EINVAL.
      
      This patch fixes this bug by properly using INTEL_FLAGS_UEVENT_CONSTRAINT()
      when needed in the PEBS constraint tables.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/20190521005246.423-1-eranian@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5a9c29cc
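      A self-contained illustration of why the wrong macro broke matching: a
      constraint that encodes event code plus umask can never match once the
      comparison masks the candidate down to the event code alone. The raw values
      below are invented for the demo, not real PEBS constraints:

      	#include <stdio.h>

      	#define EVENT_MASK   0x00ffu   /* event select only  (EVENT_CONSTRAINT-style match) */
      	#define UEVENT_MASK  0xffffu   /* event select + umask (UEVENT_CONSTRAINT-style match) */

      	static int matches(unsigned int constraint, unsigned int raw, unsigned int mask)
      	{
      		return (raw & mask) == constraint;
      	}

      	int main(void)
      	{
      		unsigned int constraint = 0x01c0;  /* hypothetical code+umask constraint */
      		unsigned int event      = 0x01c0;  /* the event a user asked for */

      		printf("event-only mask : %s\n",
      		       matches(constraint, event, EVENT_MASK) ? "match" : "no match (rejected)");
      		printf("code+umask mask : %s\n",
      		       matches(constraint, event, UEVENT_MASK) ? "match" : "no match (rejected)");
      		return 0;
      	}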
  8. 19 June 2019 (6 commits)
    • x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled · e93ce57f
      Authored by Prarit Bhargava
      commit c7563e62a6d720aa3b068e26ddffab5f0df29263 upstream.
      
      Booting with kernel parameter "rdt=cmt,mbmtotal,memlocal,l3cat,mba" and
      executing "mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl" results in
      a NULL pointer dereference on systems which do not have local MBM support
      enabled.
      
      BUG: kernel NULL pointer dereference, address: 0000000000000020
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 0 PID: 722 Comm: kworker/0:3 Not tainted 5.2.0-0.rc3.git0.1.el7_UNSUPPORTED.x86_64 #2
      Workqueue: events mbm_handle_overflow
      RIP: 0010:mbm_handle_overflow+0x150/0x2b0
      
      Only enter the bandwidth update loop if the system has local MBM enabled.
      
      Fixes: de73f38f ("x86/intel_rdt/mba_sc: Feedback loop to dynamically update mem bandwidth")
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190610171544.13474-1-prarit@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e93ce57f
    • x86/mm/KASLR: Compute the size of the vmemmap section properly · 0257fc9a
      Authored by Baoquan He
      commit 00e5a2bbcc31d5fea853f8daeba0f06c1c88c3ff upstream.
      
      The size of the vmemmap section is hardcoded to 1 TB to support the
      maximum amount of system RAM in 4-level paging mode - 64 TB.
      
      However, 1 TB is not enough for vmemmap in 5-level paging mode. Assuming
      the size of struct page is 64 Bytes, to support 4 PB system RAM in 5-level,
      64 TB of vmemmap area is needed:
      
        4 PB / 4096 bytes per page * 64 bytes per struct page
          = 4 * 1000^5 / 4096 * 64 / 1000^4 TB = 62.5 TB
      
      This hardcoding may cause vmemmap to corrupt the following
      cpu_entry_area section, if KASLR puts vmemmap very close to it and the
      actual vmemmap size is bigger than 1 TB.
      
      So calculate the actual size of the vmemmap region needed and then align
      it up to 1 TB boundary.
      
      In 4-level paging mode it is always 1 TB. In 5-level it's adjusted on
      demand. The current code reserves 0.5 PB for vmemmap on 5-level. With
      this change, the space can be saved and thus used to increase entropy
      for the randomization.
      
       [ bp: Spell out how the 64 TB needed for vmemmap is computed and massage commit
         message. ]
      
      Fixes: eedb92ab ("x86/mm: Make virtual memory layout dynamic for CONFIG_X86_5LEVEL=y")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Kirill A. Shutemov <kirill@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: kirill.shutemov@linux.intel.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190523025744.3756-1-bhe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0257fc9a
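      The arithmetic above can be re-derived directly; the short program below
      uses the commit's own assumptions (4096-byte pages, 64-byte struct page,
      decimal PB/TB units):

      	#include <stdio.h>

      	int main(void)
      	{
      		const double ram_bytes        = 4.0 * 1e15;  /* 4 PB, decimal units as in the text */
      		const double page_size        = 4096.0;
      		const double struct_page_size = 64.0;

      		double vmemmap_bytes = ram_bytes / page_size * struct_page_size;

      		printf("vmemmap needed: %.1f TB\n", vmemmap_bytes / 1e12);  /* prints 62.5 */
      		return 0;
      	}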
    • x86/kasan: Fix boot with 5-level paging and KASAN · 5e3d10d9
      Authored by Andrey Ryabinin
      commit f3176ec9420de0c385023afa3e4970129444ac2f upstream.
      
      Since commit d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on
      5-level paging") the kernel doesn't boot with KASAN on 5-level paging machines.
      The bug is actually in early_p4d_offset() and was introduced by commit
      12a8cc7f ("x86/kasan: Use the same shadow offset for 4- and 5-level paging").
      
      early_p4d_offset() tries to convert pgd_val(*pgd) value to a physical
      address. This doesn't make sense because pgd_val() already contains the
      physical address.
      
      It did work prior to commit d52888aa2753 because the result of
      "__pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK" was the same as "pgd_val(*pgd)
      & PTE_PFN_MASK". __pa_nodebug() just set some high bits which were masked
      out by applying PTE_PFN_MASK.
      
      After the change of PAGE_OFFSET in commit d52888aa2753,
      __pa_nodebug(pgd_val(*pgd)) started to return a value with more high bits
      set, and PTE_PFN_MASK wasn't enough to mask out all of them. So it returned
      a wrong, not even canonical, address and crashed on the attempt to
      dereference it.
      
      Switch back to pgd_val() & PTE_PFN_MASK to cure the issue.
      
      Fixes: 12a8cc7f ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")
      Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: kasan-dev@googlegroups.com
      Cc: stable@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20190614143149.2227-1-aryabinin@virtuozzo.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e3d10d9
    • x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback · ecec31ce
      Authored by Borislav Petkov
      commit 78f4e932f7760d965fb1569025d1576ab77557c5 upstream.
      
      Adric Blake reported the following warning during suspend-resume:
      
        Enabling non-boot CPUs ...
        x86: Booting SMP configuration:
        smpboot: Booting Node 0 Processor 1 APIC 0x2
        unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
         at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
        Call Trace:
         intel_set_tfa
         intel_pmu_cpu_starting
         ? x86_pmu_dead_cpu
         x86_pmu_starting_cpu
         cpuhp_invoke_callback
         ? _raw_spin_lock_irqsave
         notify_cpu_starting
         start_secondary
         secondary_startup_64
        microcode: sig=0x806ea, pf=0x80, revision=0x96
        microcode: updated to revision 0xb4, date = 2019-04-01
        CPU1 is up
      
      The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
      by microcode. The log above shows that the microcode loader callback
      happens after the PMU restoration, leading to the conjecture that
      because the microcode hasn't been updated yet, that MSR is not present
      yet, leading to the #GP.
      
      Add a microcode loader-specific hotplug vector which comes before
      the PERF vectors and thus executes earlier and makes sure the MSR is
      present.
      
      Fixes: 400816f60c54 ("perf/x86/intel: Implement support for TSX Force Abort")
      Reported-by: Adric Blake <promarbler14@gmail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: x86@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecec31ce
    • KVM: x86/pmu: do not mask the value that is written to fixed PMUs · 9d8f338c
      Authored by Paolo Bonzini
      [ Upstream commit 2924b52117b2812e9633d5ea337333299166d373 ]
      
      According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
      each MSR may be written with any value, and the high-order 8 bits are
      sign-extended according to the value of bit 31", but the fixed counters
      in real hardware are limited to the width of the fixed counters ("bits
      beyond the width of the fixed-function counter are reserved and must be
      written as zeros").  Fix KVM to do the same.
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9d8f338c
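      A self-contained illustration of the two write behaviours contrasted above:
      general-purpose counters sign-extend bit 31 of a 32-bit write, while fixed
      counters are truncated to the counter width. The 48-bit width is only an
      example value, not a claim about a specific CPU:

      	#include <stdio.h>
      	#include <stdint.h>

      	static uint64_t write_gp_counter(uint64_t val)
      	{
      		/* low 32 bits accepted, bit 31 sign-extended into the high bits */
      		return (uint64_t)(int64_t)(int32_t)val;
      	}

      	static uint64_t write_fixed_counter(uint64_t val, unsigned int width)
      	{
      		/* bits beyond the counter width are reserved, so mask them off */
      		return val & ((1ULL << width) - 1);
      	}

      	int main(void)
      	{
      		uint64_t val = 0x1ffffffffULL;   /* more than 32 significant bits */

      		printf("GP counter    : %#llx\n", (unsigned long long)write_gp_counter(val));
      		printf("fixed counter : %#llx\n", (unsigned long long)write_fixed_counter(val, 48));
      		return 0;
      	}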
    • KVM: x86/pmu: mask the result of rdpmc according to the width of the counters · 04d2a113
      Authored by Paolo Bonzini
      [ Upstream commit 0e6f467ee28ec97f68c7b74e35ec1601bb1368a7 ]
      
      This patch simplifies the changes in the next one by enforcing the
      masking of the counters for RDPMC and RDMSR.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      04d2a113
  9. 15 June 2019 (2 commits)
  10. 11 June 2019 (2 commits)
    • x86/insn-eval: Fix use-after-free access to LDT entry · b598ddc7
      Authored by Jann Horn
      commit de9f869616dd95e95c00bdd6b0fcd3421e8a4323 upstream.
      
      get_desc() computes a pointer into the LDT while holding a lock that
      protects the LDT from being freed, but then drops the lock and returns the
      (now potentially dangling) pointer to its caller.
      
      Fix it by giving the caller a copy of the LDT entry instead.
      
      Fixes: 670f928b ("x86/insn-eval: Add utility function to get segment descriptor")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b598ddc7
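      A generic, compilable illustration of the pattern behind the fix: instead of
      returning a pointer into a table that may be freed once the lock is dropped,
      copy the entry out while the lock is held. Types and names are invented for
      the demo; it mirrors the get_desc() change in spirit only (link with
      -lpthread):

      	#include <pthread.h>
      	#include <stdio.h>
      	#include <stddef.h>

      	struct entry { unsigned long base, limit; };

      	static struct entry table_storage[4] = { { 0x1000, 0xfff }, { 0x2000, 0x1fff } };
      	static struct entry *table = table_storage;  /* may be freed/replaced under 'lock' */
      	static size_t table_len = 4;
      	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      	/* Problematic shape: returns a pointer that can dangle once the lock is dropped. */
      	static struct entry *get_entry_ptr(size_t idx)
      	{
      		struct entry *e = NULL;

      		pthread_mutex_lock(&lock);
      		if (idx < table_len)
      			e = &table[idx];
      		pthread_mutex_unlock(&lock);
      		return e;               /* another thread may free/replace the table now */
      	}

      	/* Fixed shape: hand the caller a copy made while the table was protected. */
      	static int get_entry_copy(size_t idx, struct entry *out)
      	{
      		int ret = -1;

      		pthread_mutex_lock(&lock);
      		if (idx < table_len) {
      			*out = table[idx];  /* struct copy under the lock */
      			ret = 0;
      		}
      		pthread_mutex_unlock(&lock);
      		return ret;
      	}

      	int main(void)
      	{
      		struct entry copy;

      		if (get_entry_copy(1, &copy) == 0)
      			printf("copied entry: base=%#lx limit=%#lx\n", copy.base, copy.limit);
      		(void)get_entry_ptr;    /* kept only to contrast the two shapes */
      		return 0;
      	}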
    • x86/power: Fix 'nosmt' vs hibernation triple fault during resume · 4d166206
      Authored by Jiri Kosina
      commit ec527c318036a65a083ef68d8ba95789d2212246 upstream.
      
      As explained in
      
      	0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      
      we always, no matter what, have to bring up x86 HT siblings during boot at
      least once in order to avoid first MCE bringing the system to its knees.
      
      That means that whenever 'nosmt' is supplied on the kernel command line,
      all the HT siblings end up sitting in mwait or cpuidle after going
      through the online-offline cycle at least once.
      
      This causes a serious issue though when a kernel which saw 'nosmt' on its
      command line is going to resume from hibernation: if the resume from the
      hibernated image is successful, cr3 is flipped in order to point to the
      address space of the kernel that is being resumed, which in turn means
      that all the HT siblings are all of a sudden mwaiting on an address that
      is no longer valid.
      
      That results in triple fault shortly after cr3 is switched, and machine
      reboots.
      
      Fix this by always waking up all the SMT siblings before initiating the
      'restore from hibernation' process; this guarantees that all the HT
      siblings will be properly carried over to the resumed kernel waiting in
      resume_play_dead(), and acted upon accordingly afterwards, based on the
      target kernel configuration.
      
      Symmetrically, the resumed kernel has to push the SMT siblings to mwait
      again in case it has SMT disabled; this means it has to online all
      the siblings when resuming (so that they come out of hlt) and offline
      them again to let them reach mwait.
      
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Debugged-by: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 0cc3cd21 ("cpu/hotplug: Boot HT siblings at least once")
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d166206
  11. 09 June 2019 (5 commits)
  12. 04 June 2019 (1 commit)
  13. 31 May 2019 (5 commits)
    • x86/mce: Handle varying MCA bank counts · 75a96196
      Authored by Yazen Ghannam
      [ Upstream commit 006c077041dc73b9490fffc4c6af5befe0687110 ]
      
      Linux reads MCG_CAP[Count] to find the number of MCA banks visible to a
      CPU. Currently, this number is the same for all CPUs and a warning is
      shown if there is a difference. The number of banks is overwritten with
      the MCG_CAP[Count] value of each following CPU that boots.
      
      According to the Intel SDM and AMD APM, the MCG_CAP[Count] value gives
      the number of banks that are available to a "processor implementation".
      The AMD BKDGs/PPRs further clarify that this value is per core. This
      value has historically been the same for every core in the system, but
      that is not an architectural requirement.
      
      Future AMD systems may have different MCG_CAP[Count] values per core,
      so the assumption that all CPUs will have the same MCG_CAP[Count] value
      will no longer be valid.
      
      Also, the first CPU to boot will allocate the struct mce_banks[] array
      using the number of banks based on its MCG_CAP[Count] value. The machine
      check handler and other functions use the global number of banks to
      iterate and index into the mce_banks[] array. So it's possible to use an
      out-of-bounds index on an asymmetric system where a following CPU sees a
      MCG_CAP[Count] value greater than its predecessors.
      
      Thus, allocate the mce_banks[] array to the maximum number of banks.
      This will avoid the potential out-of-bounds index since the value of
      mca_cfg.banks is capped to MAX_NR_BANKS.
      
      Set the value of mca_cfg.banks equal to the max of the previous value
      and the value for the current CPU. This way mca_cfg.banks will always
      represent the max number of banks detected on any CPU in the system.
      
      This will ensure that all CPUs will access all the banks that are
      visible to them. A CPU that can access fewer than the max number of
      banks will find the registers of the extra banks to be read-as-zero.
      
      Furthermore, print the resulting number of MCA banks in use. Do this in
      mcheck_late_init() so that the final value is printed after all CPUs
      have been initialized.
      
      Finally, get bank count from target CPU when doing injection with mce-inject
      module.
      
       [ bp: Remove out-of-bounds example, passify and cleanup commit message. ]
      Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20180727214009.78289-1-Yazen.Ghannam@amd.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      75a96196
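      A small stand-alone illustration of the per-CPU bank accounting described
      above: extract the bank count from each CPU's MCG_CAP value (bits 7:0 hold
      the count) and keep the maximum, capped at a MAX_NR_BANKS limit. The MCG_CAP
      values and the cap are made up for the demo; on real hardware the value
      comes from MSR_IA32_MCG_CAP:

      	#include <stdio.h>

      	#define MAX_NR_BANKS 32   /* demo constant, not necessarily the kernel's value */

      	int main(void)
      	{
      		/* hypothetical MCG_CAP readings from an asymmetric system */
      		unsigned long long mcg_cap[] = { 0x0c09, 0x0c0a, 0x0c17 };
      		unsigned int banks = 0;

      		for (unsigned int cpu = 0; cpu < 3; cpu++) {
      			unsigned int count = mcg_cap[cpu] & 0xff;  /* MCG_CAP[7:0] = Count */

      			if (count > banks)
      				banks = count;
      		}
      		if (banks > MAX_NR_BANKS)
      			banks = MAX_NR_BANKS;

      		printf("mce: using %u MCA banks\n", banks);
      		return 0;
      	}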
    • x86/mce: Fix machine_check_poll() tests for error types · 3d036cba
      Authored by Tony Luck
      [ Upstream commit f19501aa07f18268ab14f458b51c1c6b7f72a134 ]
      
      There has been a lurking "TBD" in the machine check poll routine ever
      since it was first split out from the machine check handler. The
      potential issue is that the poll routine may have just begun a read from
      the STATUS register in a machine check bank when the hardware logs an
      error in that bank and signals a machine check.
      
      That race used to be pretty small back when machine checks were
      broadcast, but the addition of local machine check means that the poll
      code could continue running and clear the error from the bank before the
      local machine check handler on another CPU gets around to reading it.
      
      Fix the code to be sure to only process errors that need to be processed
      in the poll code, leaving other logged errors alone for the machine
      check handler to find and process.
      
       [ bp: Massage a bit and flip the "== 0" check to the usual !(..) test. ]
      
      Fixes: b79109c3 ("x86, mce: separate correct machine check poller and fatal exception handler")
      Fixes: ed7290d0 ("x86, mce: implement new status bits")
      Reported-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Link: https://lkml.kernel.org/r/20190312170938.GA23035@agluck-desk
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3d036cba
    • x86/uaccess: Fix up the fixup · fc242af8
      Authored by Peter Zijlstra
      [ Upstream commit b69656fa7ea2f75e47d7bd5b9430359fa46488af ]
      
      New tooling got confused about this:
      
        arch/x86/lib/memcpy_64.o: warning: objtool: .fixup+0x7: return with UACCESS enabled
      
      While the code isn't wrong, it is tedious (if at all possible) to
      figure out what function a particular chunk of .fixup belongs to.
      
      This then confuses the objtool uaccess validation. Instead of
      returning directly from the .fixup, jump back into the right function.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      fc242af8
    • x86/ia32: Fix ia32_restore_sigcontext() AC leak · 5007453c
      Authored by Peter Zijlstra
      [ Upstream commit 67a0514afdbb8b2fc70b771b8c77661a9cb9d3a9 ]
      
      Objtool spotted that we call native_load_gs_index() with AC set.
      Re-arrange the code to avoid that.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5007453c
    • x86/uaccess, signal: Fix AC=1 bloat · 4614b0bb
      Authored by Peter Zijlstra
      [ Upstream commit 88e4718275c1bddca6f61f300688b4553dc8584b ]
      
      Occasionally GCC is less aggressive with inlining and the following is
      observed:
      
        arch/x86/kernel/signal.o: warning: objtool: restore_sigcontext()+0x3cc: call to force_valid_ss.isra.5() with UACCESS enabled
        arch/x86/kernel/signal.o: warning: objtool: do_signal()+0x384: call to frame_uc_flags.isra.0() with UACCESS enabled
      
      Cure this by moving this code out of the AC=1 region, since it really
      isn't needed for the user access.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4614b0bb