1. 23 Jun, 2021 7 commits
  2. 22 Jun, 2021 2 commits
    • x86/fpu: Make init_fpstate correct with optimized XSAVE · f9dfb5e3
      Thomas Gleixner committed
      The XSAVE init code initializes all enabled and supported components with
      XRSTOR(S) to init state. Then it XSAVEs the state of the components back
      into init_fpstate which is used in several places to fill in the init state
      of components.
      
      This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
      those use the init optimization and skip writing state of components which
      are in init state. So init_fpstate.xsave still contains all zeroes after
      this operation.
      
      There are two ways to solve that:
      
         1) Use XSAVE unconditionally, but that requires reshuffling the buffer when
            XSAVES is enabled because XSAVES uses the compacted format.
      
         2) Save the components which are known to have a non-zero init state by other
            means.
      
      Looking deeper, #2 is the right thing to do because all components the
      kernel supports have all-zeroes init state except the legacy features (FP,
      SSE). Those cannot be hard coded because the states are not identical on all
      CPUs, but they can be saved with FXSAVE which avoids all conditionals.
      
      Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
      a BUILD_BUG_ON() which reminds developers to validate that a newly added
      component has all zeroes init state. As a bonus remove the now unused
      copy_xregs_to_kernel_booting() crutch.
      
      The XSAVE and reshuffle method can still be implemented in the unlikely
      case that components are added which have a non-zero init state and no
      other means to save them. For now, FXSAVE is just simple and good enough.
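
The effect of the init optimization can be modelled in a few lines. This is a minimal sketch, not the kernel code: the `toy_*` types, the component count and the component size are all hypothetical, and the functions only imitate the behaviour described above (plain XSAVE writes everything, the optimized variants skip components that are in init state).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_COMPONENTS  4   /* hypothetical component count */
#define TOY_COMPONENT_SIZE 8   /* hypothetical size of one component */

struct toy_regs {
    uint8_t comp[TOY_NR_COMPONENTS][TOY_COMPONENT_SIZE];
};

struct toy_xsave_area {
    uint64_t xstate_bv;   /* bit i set: component i was written */
    uint8_t  data[TOY_NR_COMPONENTS][TOY_COMPONENT_SIZE];
};

/* Models plain XSAVE: every requested component is written out. */
static void toy_xsave(struct toy_xsave_area *buf, const struct toy_regs *regs,
                      uint64_t mask)
{
    assert(buf && regs);
    for (int i = 0; i < TOY_NR_COMPONENTS; i++) {
        if (mask & (1ULL << i)) {
            memcpy(buf->data[i], regs->comp[i], TOY_COMPONENT_SIZE);
            buf->xstate_bv |= 1ULL << i;
        }
    }
}

/* Models the init optimization: components in init state are skipped. */
static void toy_xsaveopt(struct toy_xsave_area *buf, const struct toy_regs *regs,
                         uint64_t mask, uint64_t in_init_state)
{
    toy_xsave(buf, regs, mask & ~in_init_state);
}
```

If component 0 stands for a legacy component with a non-zero init value, `toy_xsave()` copies that value into the buffer while `toy_xsaveopt()` with every component in init state leaves the buffer all zeroes, which is the situation the commit describes for init_fpstate.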
      
        [ bp: Fix a typo or two in the text. ]
      
      Fixes: 6bad06b7 ("x86, xsave: Use xsaveopt in context-switch path when supported")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
    • x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate() · 9301982c
      Thomas Gleixner committed
      sanitize_restored_user_xstate() preserves the supervisor states only
      when the fx_only argument is zero, which allows unprivileged user space
      to put supervisor states back into init state.
      
      Preserve them unconditionally.
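
A minimal sketch of the masking idea, assuming hypothetical feature masks (the `TOY_*` values are made up for illustration; the kernel derives the real masks elsewhere). The point is that the supervisor bits are OR-ed into the mask regardless of `fx_only`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bit masks, for illustration only. */
#define TOY_MASK_FPSSE       0x03ULL
#define TOY_MASK_USER        0x07ULL
#define TOY_MASK_SUPERVISOR  0x18ULL

/* Returns the xfeatures word that remains after sanitizing. */
static uint64_t toy_sanitize_xfeatures(uint64_t xfeatures,
                                       uint64_t user_xfeatures, int fx_only)
{
    /* User space may only ever select user features. */
    uint64_t mask = fx_only ? TOY_MASK_FPSSE
                            : (user_xfeatures & TOY_MASK_USER);

    assert((TOY_MASK_USER & TOY_MASK_SUPERVISOR) == 0);

    /* Supervisor state is preserved unconditionally. */
    return xfeatures & (mask | TOY_MASK_SUPERVISOR);
}
```

With `fx_only` set, the result still carries every supervisor bit that was present in `xfeatures`, while the extended user features are cleared.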
      
       [ bp: Fix a typo or two in the text. ]
      
      Fixes: 5d6b6a6f ("x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210618143444.438635017@linutronix.de
  3. 19 Jun, 2021 4 commits
    • riscv: dts: fu740: fix cache-controller interrupts · 7ede12b0
      David Abdurachmanov committed
      The order of interrupt numbers is incorrect.
      
      The order for FU740 is: DirError, DataError, DataFail, DirFail
      
      From SiFive FU740-C000 Manual:
      19 - L2 Cache DirError
      20 - L2 Cache DirFail
      21 - L2 Cache DataError
      22 - L2 Cache DataFail
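
Combining the expected order with the manual excerpt gives the sequence 19, 21, 22, 20. The mapping can be written down as a small table; the `toy_*` names are hypothetical and this is a C illustration of the ordering, not the device tree change itself.

```c
#include <assert.h>

/* Order in which the interrupts are expected, per the text above. */
enum toy_l2_event {
    TOY_DIR_ERROR,
    TOY_DATA_ERROR,
    TOY_DATA_FAIL,
    TOY_DIR_FAIL,
    TOY_NR_L2_EVENTS
};

/* Interrupt numbers taken from the FU740-C000 manual excerpt above. */
static const int toy_fu740_l2_irq[TOY_NR_L2_EVENTS] = {
    [TOY_DIR_ERROR]  = 19,
    [TOY_DATA_ERROR] = 21,
    [TOY_DATA_FAIL]  = 22,
    [TOY_DIR_FAIL]   = 20,
};

static int toy_l2_irq(enum toy_l2_event ev)
{
    assert((unsigned)ev < TOY_NR_L2_EVENTS);
    return toy_fu740_l2_irq[ev];
}
```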

      Signed-off-by: David Abdurachmanov <david.abdurachmanov@sifive.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: Ensure BPF_JIT_REGION_START aligned with PMD size · 3a02764c
      Jisheng Zhang committed
      Andreas reported that commit fc850476 ("riscv: bpf: Avoid breaking W^X")
      breaks booting with one kind of defconfig. I reproduced a kernel panic
      with that defconfig:
      
      [    0.138553] Unable to handle kernel paging request at virtual address ffffffff81201220
      [    0.139159] Oops [#1]
      [    0.139303] Modules linked in:
      [    0.139601] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc5-default+ #1
      [    0.139934] Hardware name: riscv-virtio,qemu (DT)
      [    0.140193] epc : __memset+0xc4/0xfc
      [    0.140416]  ra : skb_flow_dissector_init+0x1e/0x82
      [    0.140609] epc : ffffffff8029806c ra : ffffffff8033be78 sp : ffffffe001647da0
      [    0.140878]  gp : ffffffff81134b08 tp : ffffffe001654380 t0 : ffffffff81201158
      [    0.141156]  t1 : 0000000000000002 t2 : 0000000000000154 s0 : ffffffe001647dd0
      [    0.141424]  s1 : ffffffff80a43250 a0 : ffffffff81201220 a1 : 0000000000000000
      [    0.141654]  a2 : 000000000000003c a3 : ffffffff81201258 a4 : 0000000000000064
      [    0.141893]  a5 : ffffffff8029806c a6 : 0000000000000040 a7 : ffffffffffffffff
      [    0.142126]  s2 : ffffffff81201220 s3 : 0000000000000009 s4 : ffffffff81135088
      [    0.142353]  s5 : ffffffff81135038 s6 : ffffffff8080ce80 s7 : ffffffff80800438
      [    0.142584]  s8 : ffffffff80bc6578 s9 : 0000000000000008 s10: ffffffff806000ac
      [    0.142810]  s11: 0000000000000000 t3 : fffffffffffffffc t4 : 0000000000000000
      [    0.143042]  t5 : 0000000000000155 t6 : 00000000000003ff
      [    0.143220] status: 0000000000000120 badaddr: ffffffff81201220 cause: 000000000000000f
      [    0.143560] [<ffffffff8029806c>] __memset+0xc4/0xfc
      [    0.143859] [<ffffffff8061e984>] init_default_flow_dissectors+0x22/0x60
      [    0.144092] [<ffffffff800010fc>] do_one_initcall+0x3e/0x168
      [    0.144278] [<ffffffff80600df0>] kernel_init_freeable+0x1c8/0x224
      [    0.144479] [<ffffffff804868a8>] kernel_init+0x12/0x110
      [    0.144658] [<ffffffff800022de>] ret_from_exception+0x0/0xc
      [    0.145124] ---[ end trace f1e9643daa46d591 ]---
      
      After some investigation, I think I found the root cause: commit
      2bfc6cd8 ("move kernel mapping outside of linear mapping") moves
      BPF JIT region after the kernel:
      
      | #define BPF_JIT_REGION_START	PFN_ALIGN((unsigned long)&_end)
      
      &_end is unlikely to be aligned to PMD size, so the front of the BPF JIT
      region shares one PMD-sized mapping with part of the kernel .data section.
      Because the kernel is mapped with PMD-sized mappings, calling
      bpf_jit_binary_lock_ro() to make the first BPF JIT program ROX also makes
      that part of the kernel .data section read-only. A later write to it, for
      example a memset of the .data section, then triggers a store page fault.
      
      To fix the issue, we need to ensure the BPF JIT region is PMD size
      aligned. This patch achieves this goal by restoring the BPF JIT region
      to its original position, i.e. the 128MB before the kernel .text section. The
      modification to kasan_init.c is inspired by Alexandre.
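
The alignment problem can be checked with plain integer arithmetic. This sketch assumes 4 KiB pages and 2 MiB PMD mappings, and the `toy_*` helpers are hypothetical; it only shows that a page-aligned start can still share a PMD-sized mapping with the data just before it.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE (1ULL << 12)   /* 4 KiB pages, assumed */
#define TOY_PMD_SIZE  (1ULL << 21)   /* 2 MiB PMD mappings, assumed */

/* Round addr up to the next multiple of size (size is a power of two). */
static uint64_t toy_align_up(uint64_t addr, uint64_t size)
{
    assert(size && (size & (size - 1)) == 0);
    return (addr + size - 1) & ~(size - 1);
}

/* True when both addresses are covered by the same PMD-sized mapping. */
static int toy_same_pmd(uint64_t a, uint64_t b)
{
    return (a / TOY_PMD_SIZE) == (b / TOY_PMD_SIZE);
}
```

For an illustrative end address of 0x81201158, page alignment yields 0x81202000, which lies in the same 2 MiB mapping as the last data byte at 0x81201157. PMD alignment yields 0x81400000, which does not.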
      
      Fixes: fc850476 ("riscv: bpf: Avoid breaking W^X")
      Reported-by: Andreas Schwab <schwab@linux-m68k.org>
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' name · 314b7817
      Jisheng Zhang committed
      commit 2bfc6cd8 ("riscv: Move kernel mapping outside of linear
      mapping") makes use of MODULES_VADDR to populate kernel, BPF, modules
      mapping. Currently, MODULES_VADDR is defined as below for RV64:
      
      | #define MODULES_VADDR   (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
      
      But kasan_init() has two local variables which are also named _start and
      _end, so MODULES_VADDR is evaluated with the local variable _end
      rather than the global "_end" as we expected. Fix this issue by
      renaming the two local variables.
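
The name capture is ordinary C macro behaviour and can be reproduced in user space. In this sketch `toy_image_end` stands in for the linker symbol and `TOY_MODULES_BASE` for the macro; both names are hypothetical.

```c
#include <assert.h>

static char toy_image_end;   /* stands in for the global linker symbol */

/* Expands to whatever "toy_image_end" names at the point of use. */
#define TOY_MODULES_BASE ((const char *)&toy_image_end)

/* Defined before any shadowing, so it always sees the global. */
static const char *toy_global_end(void)
{
    return &toy_image_end;
}

static int toy_macro_hits_global_when_shadowed(void)
{
    char toy_image_end = 0;   /* local with the same name captures the macro */
    return TOY_MODULES_BASE == toy_global_end();
}

static int toy_macro_hits_global_after_rename(void)
{
    char local_end = 0;       /* renamed local no longer collides */
    (void)local_end;
    return TOY_MODULES_BASE == toy_global_end();
}
```

The first function returns 0 because the macro resolves to the local variable, and the second returns 1 once the local has a different name.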
      
      Fixes: 2bfc6cd8 ("riscv: Move kernel mapping outside of linear mapping")
      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • x86/mm: Avoid truncating memblocks for SGX memory · 28e5e44a
      Fan Du committed
      tl;dr:
      
      Several SGX users reported seeing the following message on NUMA systems:
      
        sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
      
      This turned out to be the memblock code mistakenly throwing away SGX
      memory.
      
      === Full Changelog ===
      
      The 'max_pfn' variable represents the highest known RAM address.  It can
      be used, for instance, to quickly determine for which physical addresses
      there is mem_map[] space allocated.  The numa_meminfo code makes an
      effort to throw out ("trim") all memory blocks which are above 'max_pfn'.
      
      SGX memory is not considered RAM (it is marked as "Reserved" in the
      e820) and is not taken into account by max_pfn. Despite this, SGX memory
      areas have NUMA affinity and are enumerated in the ACPI SRAT table. The
      existing SGX code uses the numa_meminfo mechanism to look up the NUMA
      affinity for its memory areas.
      
      In cases where SGX memory was above max_pfn (usually just the one EPC
      section in the highest NUMA node), the numa_memblock is truncated
      at 'max_pfn', which is below the SGX memory.  When the SGX code tries to
      look up the affinity of this memory, it fails and produces an error message:
      
        sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
      
      and assigns the memory to NUMA node 0.
      
      Instead of silently truncating the memory block at 'max_pfn' and
      dropping the SGX memory, add the truncated portion to
      'numa_reserved_meminfo'.  This allows the SGX code to later determine
      the NUMA affinity of its 'Reserved' area.
      
      Before, numa_meminfo looked like this (from 'crash'):
      
        blk = { start =          0x0, end = 0x2080000000, nid = 0x0 }
              { start = 0x2080000000, end = 0x4000000000, nid = 0x1 }
      
      numa_reserved_meminfo is empty.
      
      With this, numa_meminfo looks like this:
      
        blk = { start =          0x0, end = 0x2080000000, nid = 0x0 }
              { start = 0x2080000000, end = 0x4000000000, nid = 0x1 }
      
      and numa_reserved_meminfo has an entry for node 1's SGX memory:
      
        blk =  { start = 0x4000000000, end = 0x4080000000, nid = 0x1 }
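
The split can be sketched as a small helper. The `toy_*` names are hypothetical and the code is a user-space illustration, not the numa_meminfo implementation. The example values assume that the firmware-described block for node 1 originally spanned 0x2080000000 to 0x4080000000, which is inferred from the two listings above rather than stated in them.

```c
#include <assert.h>
#include <stdint.h>

struct toy_memblk {
    uint64_t start, end;
    int nid;
};

/*
 * Clamp *blk to the range below limit. The part at or above limit is
 * stored in *reserved instead of being dropped.
 * Returns 1 if something was moved to *reserved, 0 otherwise.
 */
static int toy_trim_block(struct toy_memblk *blk, uint64_t limit,
                          struct toy_memblk *reserved)
{
    uint64_t cut;

    assert(blk->start <= blk->end);
    if (blk->end <= limit)
        return 0;

    cut = blk->start > limit ? blk->start : limit;
    reserved->start = cut;
    reserved->end = blk->end;
    reserved->nid = blk->nid;   /* NUMA affinity is kept */
    blk->end = cut;
    return 1;
}
```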
      
       [ daveh: completely rewrote/reworked changelog ]
      
      Fixes: 5d30f92e ("x86/NUMA: Provide a range-to-target_node lookup facility")
      Reported-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20210617194657.0A99CB22@viggo.jf.intel.com
  4. 18 Jun, 2021 2 commits
    • PCI: Add AMD RS690 quirk to enable 64-bit DMA · cacf994a
      Mikel Rychliski committed
      Although the AMD RS690 chipset has 64-bit DMA support, BIOS implementations
      sometimes fail to configure the memory limit registers correctly.
      
      The Acer F690GVM mainboard uses this chipset and a Marvell 88E8056 NIC. The
      sky2 driver programs the NIC to use 64-bit DMA, which will not work:
      
        sky2 0000:02:00.0: error interrupt status=0x8
        sky2 0000:02:00.0 eth0: tx timeout
        sky2 0000:02:00.0 eth0: transmit ring 0 .. 22 report=0 done=0
      
      Other drivers required by this mainboard either don't support 64-bit DMA,
      or have it disabled using driver specific quirks. For example, the ahci
      driver has quirks to enable or disable 64-bit DMA depending on the BIOS
      version (see ahci_sb600_enable_64bit() in ahci.c). This ahci quirk matches
      against the SB600 SATA controller, but the real issue is almost certainly
      with the RS690 PCI host that it was commonly attached to.
      
      To avoid this issue in all drivers with 64-bit DMA support, fix the
      configuration of the PCI host. If the kernel is aware of physical memory
      above 4GB, but the BIOS never configured the PCI host with this
      information, update the registers with our values.
      
      [bhelgaas: drop PCI_DEVICE_ID_ATI_RS690 definition]
      Link: https://lore.kernel.org/r/20210611214823.4898-1-mikel@mikelr.com
      Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    • powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set · 60b7ed54
      Athira Rajeev committed
      On systems without any specific PMU driver support registered, running
      perf record causes an Oops.
      
      The relevant portion from call trace:
      
        BUG: Kernel NULL pointer dereference on read at 0x00000040
        Faulting instruction address: 0xc0021f0c
        Oops: Kernel access of bad area, sig: 11 [#1]
        BE PAGE_SIZE=4K PREEMPT CMPCPRO
        SAF3000 DIE NOTIFICATION
        CPU: 0 PID: 442 Comm: null_syscall Not tainted 5.13.0-rc6-s3k-dev-01645-g7649ee3d2957 #5164
        NIP:  c0021f0c LR: c00e8ad8 CTR: c00d8a5c
        NIP perf_instruction_pointer+0x10/0x60
        LR  perf_prepare_sample+0x344/0x674
        Call Trace:
          perf_prepare_sample+0x7c/0x674 (unreliable)
          perf_event_output_forward+0x3c/0x94
          __perf_event_overflow+0x74/0x14c
          perf_swevent_hrtimer+0xf8/0x170
          __hrtimer_run_queues.constprop.0+0x160/0x318
          hrtimer_interrupt+0x148/0x3b0
          timer_interrupt+0xc4/0x22c
          Decrementer_virt+0xb8/0xbc
      
      During a perf record session, perf_instruction_pointer() is called to
      capture the sample IP. This function in core-book3s accesses
      ppmu->flags. If a platform specific PMU driver is not registered, ppmu
      is set to NULL and accessing its members results in a crash. Fix this
      crash by checking if ppmu is set.
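
The shape of such a guard, as a hedged user-space sketch: `toy_pmu`, `toy_ppmu`, the flag and the fallback logic are hypothetical stand-ins, not the core-book3s code. The only point illustrated is that the pointer is tested before any member is read.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pmu {
    unsigned long flags;
};

#define TOY_PMU_NO_SIAR 0x1UL      /* hypothetical flag */

static struct toy_pmu *toy_ppmu;   /* stays NULL until a driver registers */

static unsigned long toy_instruction_pointer(unsigned long nip,
                                             unsigned long siar)
{
    /* No PMU driver: there are no flags to consult, use the regs value. */
    if (!toy_ppmu)
        return nip;

    if ((toy_ppmu->flags & TOY_PMU_NO_SIAR) || !siar)
        return nip;

    return siar;
}
```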
      
      Fixes: 2ca13a4c ("powerpc/perf: Use regs->nip when SIAR is zero")
      Cc: stable@vger.kernel.org # v5.11+
      Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/1623952506-1431-1-git-send-email-atrajeev@linux.vnet.ibm.com
  5. 16 Jun, 2021 1 commit
  6. 15 Jun, 2021 1 commit
    • powerpc: Fix initrd corruption with relative jump labels · 478036c4
      Michael Ellerman committed
      Commit b0b3b2c7 ("powerpc: Switch to relative jump labels") switched
      us to using relative jump labels. That involves changing the code,
      target and key members in struct jump_entry to be relative to the
      address of the jump_entry, rather than absolute addresses.
      
      We have two static inlines that create a struct jump_entry,
      arch_static_branch() and arch_static_branch_jump(), as well as an asm
      macro ARCH_STATIC_BRANCH, which is used by the pseries-only hypervisor
      tracing code.
      
      Unfortunately we missed updating the key to be a relative reference in
      ARCH_STATIC_BRANCH.
      
      That causes a pseries kernel to have a handful of jump_entry structs
      with bad key values. Instead of being a relative reference they instead
      hold the full address of the key.
      
      However the code doesn't expect that, it still adds the key value to the
      address of the jump_entry (see jump_entry_key()) expecting to get a
      pointer to a key somewhere in kernel data.
      
      The table of jump_entry structs sits in rodata, which comes after the
      kernel text. In a typical build this will be somewhere around 15MB. The
      address of the key will be somewhere in data, typically around 20MB.
      Adding the two values together gets us a pointer somewhere around 45MB.
      
      We then call static_key_set_entries() with that bad pointer and modify
      some members of the struct static_key we think we are pointing at.
      
      A pseries kernel is typically ~30MB in size, so writing to ~45MB won't
      corrupt the kernel itself. However if we're booting with an initrd,
      depending on the size and exact location of the initrd, we can corrupt
      the initrd. Depending on how exactly we corrupt the initrd it can either
      cause the system to not boot, or just corrupt one of the files in the
      initrd.
      
      The fix is simply to make the key value relative to the jump_entry
      struct in the ARCH_STATIC_BRANCH macro.
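
The failure mode follows directly from how a relative reference is decoded. In this sketch the `toy_*` names are hypothetical and the struct is reduced to the key member only; it is an illustration of the arithmetic, not the kernel's struct jump_entry.

```c
#include <assert.h>
#include <stdint.h>

struct toy_jump_entry {
    intptr_t key;   /* offset from the address of this member to the key */
};

/* Store the key as a reference relative to the entry. */
static void toy_set_key_relative(struct toy_jump_entry *e, const void *key)
{
    assert(e && key);
    e->key = (intptr_t)((uintptr_t)key - (uintptr_t)&e->key);
}

/* Decode: always adds the stored value to the address of the entry. */
static uintptr_t toy_entry_key(const struct toy_jump_entry *e)
{
    return (uintptr_t)&e->key + (uintptr_t)e->key;
}
```

If the stored value is an absolute address instead of an offset, the decoder still adds the entry's own address, so the result is the sum of the two addresses and points at unrelated memory.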
      
      Fixes: b0b3b2c7 ("powerpc: Switch to relative jump labels")
      Reported-by: Anastasia Kovaleva <a.kovaleva@yadro.com>
      Reported-by: Roman Bolshakov <r.bolshakov@yadro.com>
      Reported-by: Greg Kurz <groug@kaod.org>
      Reported-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Daniel Axtens <dja@axtens.net>
      Tested-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210614131440.312360-1-mpe@ellerman.id.au
  7. 14 Jun, 2021 1 commit
    • powerpc/signal64: Copy siginfo before changing regs->nip · e41d6c3f
      Michael Ellerman committed
      In commit 96d7a4e0 ("powerpc/signal64: Rewrite handle_rt_signal64()
      to minimise uaccess switches") the 64-bit signal code was rearranged to
      use user_write_access_begin/end().
      
      As part of that change the call to copy_siginfo_to_user() was moved
      later in the function, so that it could be done after the
      user_write_access_end().
      
      In particular it was moved after we modify regs->nip to point to the
      signal trampoline. That means if copy_siginfo_to_user() fails we exit
      handle_rt_signal64() with an error but with regs->nip modified, whereas
      previously we would not modify regs->nip until the copy succeeded.
      
      Returning an error from signal delivery but with regs->nip updated
      leaves the process in a sort of half-delivered state. We do immediately
      force a SEGV in signal_setup_done(), called from do_signal(), so the
      process should never run in the half-delivered state.
      
      However that SEGV is not delivered until we've gone around to
      do_notify_resume() again, so it's possible some tracing could observe
      the half-delivered state.
      
      There are other cases where we fail signal delivery with regs partly
      updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
      is very unlikely to fail as it reads back from the frame we just wrote
      to.
      
      Looking at other arches they seem to be more careful about leaving regs
      unchanged until the copy operations have succeeded, and in general that
      seems like good hygiene.
      
      So although the current behaviour is not clearly buggy, it's also not
      clearly correct. So move the call to copy_siginfo_to_user() up prior to
      the modification of regs->nip, which is closer to the old behaviour, and
      easier to reason about.
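
The ordering rule (do everything that can fail first, then commit the register change) can be sketched in user space. All `toy_*` names are hypothetical, and `toy_copy_to_user()` only imitates a copy that may fail.

```c
#include <assert.h>
#include <string.h>

struct toy_regs {
    unsigned long nip;
};

/* Stand-in for a user copy that can fail; returns 0 on success. */
static int toy_copy_to_user(void *dst, const void *src, size_t n)
{
    if (!dst)
        return -1;   /* models a faulting user address */
    memcpy(dst, src, n);
    return 0;
}

static int toy_deliver_signal(struct toy_regs *regs, void *user_info,
                              const int *info, unsigned long trampoline)
{
    assert(regs && info);

    /* Fallible step first: on error regs is left untouched. */
    if (toy_copy_to_user(user_info, info, sizeof(*info)))
        return -1;

    /* Only now commit the change to the register state. */
    regs->nip = trampoline;
    return 0;
}
```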
      
      Fixes: 96d7a4e0 ("powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210608134605.2783677-1-mpe@ellerman.id.au
  8. 13 Jun, 2021 2 commits
    • riscv: sifive: fix Kconfig errata warning · 01f5315d
      Randy Dunlap committed
      The SOC_SIFIVE Kconfig entry unconditionally selects ERRATA_SIFIVE.
      However, ERRATA_SIFIVE depends on RISCV_ERRATA_ALTERNATIVE, which is
      not set, so SOC_SIFIVE should either depend on or select
      RISCV_ERRATA_ALTERNATIVE. Use 'select' here to quieten the Kconfig
      warning.
      
      WARNING: unmet direct dependencies detected for ERRATA_SIFIVE
        Depends on [n]: RISCV_ERRATA_ALTERNATIVE [=n]
        Selected by [y]:
        - SOC_SIFIVE [=y]
      
      Fixes: 1a0e5dbd ("riscv: sifive: Add SiFive alternative ports")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: linux-riscv@lists.infradead.org
      Cc: Vincent Chen <vincent.chen@sifive.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv32: Use medany C model for modules · 5d2388db
      Khem Raj committed
      When CONFIG_CMODEL_MEDLOW is used, it ends up generating riscv_hi20_rela
      relocations in modules which are not resolved at runtime, and the
      following errors are seen:
      
      [    4.802714] virtio_input: target 00000000c1539090 can not be addressed by the 32-bit offset from PC = 39148b7b
      [    4.854800] virtio_input: target 00000000c1539090 can not be addressed by the 32-bit offset from PC = 9774456d

      Signed-off-by: Khem Raj <raj.khem@gmail.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
  9. 12 Jun, 2021 2 commits
  10. 11 Jun, 2021 8 commits
    • KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU · 654430ef
      Sean Christopherson committed
      Calculate and check the full mmu_role when initializing the MMU context
      for the nested MMU, where "full" means the bits and pieces of the role
      that aren't handled by kvm_calc_mmu_role_common().  While the nested MMU
      isn't used for shadow paging, things like the number of levels in the
      guest's page tables are surprisingly important when walking the guest
      page tables.  Failure to reinitialize the nested MMU context if L2's
      paging mode changes can result in unexpected and/or missed page faults,
      and likely other explosions.
      
      E.g. if an L1 vCPU is running both a 32-bit PAE L2 and a 64-bit L2, the
      "common" role calculation will yield the same role for both L2s.  If the
      64-bit L2 is run after the 32-bit PAE L2, L0 will fail to reinitialize
      the nested MMU context, ultimately resulting in a bad walk of L2's page
      tables as the MMU will still have a guest root_level of PT32E_ROOT_LEVEL.
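
Why a partial role hides the mode change can be shown with a toy encoding. Everything here is hypothetical (the `toy_*` names, the bit layout, and what counts as "common"); the root levels 3 and 4 merely mirror PAE and 64-bit paging.

```c
#include <assert.h>
#include <stdint.h>

struct toy_guest_mode {
    int paging, pae, long_mode;
};

/* Hypothetical "common" part: ignores the depth of the guest page tables. */
static uint32_t toy_common_role(const struct toy_guest_mode *m)
{
    assert(m);
    return m->paging ? 1u : 0u;
}

/* Hypothetical "full" role: common bits plus the guest root level. */
static uint32_t toy_full_role(const struct toy_guest_mode *m)
{
    uint32_t level = !m->paging ? 0u : m->long_mode ? 4u : m->pae ? 3u : 2u;

    return toy_common_role(m) | (level << 8);
}

/* The MMU context is rebuilt only when the cached role differs. */
static int toy_needs_reinit(uint32_t cached_role, uint32_t new_role)
{
    return cached_role != new_role;
}
```

Comparing only the common part reports no change when switching from a PAE guest to a 64-bit guest, whereas comparing the full role does.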
      
        WARNING: CPU: 4 PID: 167334 at arch/x86/kvm/vmx/vmx.c:3075 ept_save_pdptrs+0x15/0xe0 [kvm_intel]
        Modules linked in: kvm_intel]
        CPU: 4 PID: 167334 Comm: CPU 3/KVM Not tainted 5.13.0-rc1-d849817d5673-reqs #185
        Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
        RIP: 0010:ept_save_pdptrs+0x15/0xe0 [kvm_intel]
        Code: <0f> 0b c3 f6 87 d8 02 00f
        RSP: 0018:ffffbba702dbba00 EFLAGS: 00010202
        RAX: 0000000000000011 RBX: 0000000000000002 RCX: ffffffff810a2c08
        RDX: ffff91d7bc30acc0 RSI: 0000000000000011 RDI: ffff91d7bc30a600
        RBP: ffff91d7bc30a600 R08: 0000000000000010 R09: 0000000000000007
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff91d7bc30a600
        R13: ffff91d7bc30acc0 R14: ffff91d67c123460 R15: 0000000115d7e005
        FS:  00007fe8e9ffb700(0000) GS:ffff91d90fb00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000029f15a001 CR4: 00000000001726e0
        Call Trace:
         kvm_pdptr_read+0x3a/0x40 [kvm]
         paging64_walk_addr_generic+0x327/0x6a0 [kvm]
         paging64_gva_to_gpa_nested+0x3f/0xb0 [kvm]
         kvm_fetch_guest_virt+0x4c/0xb0 [kvm]
         __do_insn_fetch_bytes+0x11a/0x1f0 [kvm]
         x86_decode_insn+0x787/0x1490 [kvm]
         x86_decode_emulated_instruction+0x58/0x1e0 [kvm]
         x86_emulate_instruction+0x122/0x4f0 [kvm]
         vmx_handle_exit+0x120/0x660 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0xe25/0x1cb0 [kvm]
         kvm_vcpu_ioctl+0x211/0x5a0 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x40/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: bf627a92 ("x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210610220026.1364486-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix x86_emulator slab cache leak · dfdc0a71
      Wanpeng Li committed
      Commit c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      allocates the per-vCPU emulation context dynamically. However, the
      x86_emulator slab cache still exists after destroying the VM and
      unloading the kvm module, as shown below.
      
      grep x86_emulator /proc/slabinfo
      x86_emulator          36     36   2672   12    8 : tunables    0    0    0 : slabdata      3      3      0
      
      This patch fixes this slab cache leak by destroying the x86_emulator slab cache
      when the kvm module is unloaded.
      
      Fixes: c9b8b07c (KVM: x86: Dynamically allocate per-vCPU emulation context)
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1623387573-5969-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Call SEV Guest Decommission if ASID binding fails · 934002cd
      Alper Gun committed
      Send the SEV_CMD_DECOMMISSION command to the PSP firmware if ASID binding
      fails. If a failure happens after a successful LAUNCH_START command,
      a decommission command should be executed. Otherwise, the guest context
      is never freed inside the AMD SP. Once the firmware has no memory left
      to allocate more SEV guest contexts, the LAUNCH_START command begins
      to fail with the SEV_RET_RESOURCE_LIMIT error.
      
      The existing code calls decommission inside sev_unbind_asid, but it is
      not called if a failure happens before guest activation succeeds. If
      sev_bind_asid fails, decommission is never called. PSP firmware has a
      limit for the number of guests. If ASID binding fails many times,
      PSP firmware will not have resources to create another guest context.
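
The error-path pattern can be modelled with a counter standing in for firmware-side guest contexts. All `toy_*` functions are hypothetical and do not talk to any firmware; the sketch only shows that the context created by the launch step is released when the bind step fails.

```c
#include <assert.h>

static int toy_live_contexts;   /* models contexts held by the firmware */

static int toy_launch_start(void)
{
    toy_live_contexts++;
    return 0;
}

static void toy_decommission(void)
{
    assert(toy_live_contexts > 0);
    toy_live_contexts--;
}

static int toy_bind_asid(int should_fail)
{
    return should_fail ? -1 : 0;
}

static int toy_launch_guest(int bind_fails)
{
    int ret = toy_launch_start();

    if (ret)
        return ret;

    ret = toy_bind_asid(bind_fails);
    if (ret) {
        /* Release the context created by the launch step. */
        toy_decommission();
        return ret;
    }
    return 0;
}
```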
      
      Cc: stable@vger.kernel.org
      Fixes: 59414c98 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
      Reported-by: Peter Gonda <pgonda@google.com>
      Signed-off-by: Alper Gun <alpergun@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210610174604.2554090-1-alpergun@google.com>
    • riscv: alternative: fix typo in macro name · 858cf860
      Vitaly Wool committed
      alternative-macros.h defines ALT_NEW_CONTENT in its assembly part
      and ALT_NEW_CONSTENT in the C part. Most likely it is the latter
      that is wrong.
      
      Fixes: 6f4eea90 ("riscv: Introduce alternative mechanism to apply errata solution")
      Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • ARC: fix CONFIG_HARDENED_USERCOPY · 110febc0
      Vineet Gupta committed
      Currently enabling this triggers a warning
      
      | usercopy: Kernel memory overwrite attempt detected to kernel text (offset 155633, size 11)!
      | usercopy: BUG: failure at mm/usercopy.c:99/usercopy_abort()!
      |
      |gcc generated __builtin_trap
      |Path: /bin/busybox
      |CPU: 0 PID: 84 Comm: init Not tainted 5.4.22
      |
      |[ECR ]: 0x00090005 => gcc generated __builtin_trap
      |[EFA ]: 0x9024fcaa
      |[BLINK ]: usercopy_abort+0x8a/0x8c
      |[ERET ]: memfd_fcntl+0x0/0x470
      |[STAT32]: 0x80080802 : IE K
      |...
      |...
      |Stack Trace:
      | memfd_fcntl+0x0/0x470
      | usercopy_abort+0x8a/0x8c
      | __check_object_size+0x10e/0x138
      | copy_strings+0x1f4/0x38c
      | __do_execve_file+0x352/0x848
      | EV_Trap+0xcc/0xd0
      
      The issue is triggered by an allocation in the "init reclaimed" region.
      ARC _stext encompasses the init region (for historical reasons we wanted
      the init.text to be under .text as well). This however trips up
      __check_object_size()->check_kernel_text_object() which treats this as
      an object bleeding into kernel text.
      
      Fix that by rezoning _stext to start from regular kernel .text and leave
      out .init altogether.
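
The check that trips here is a range-overlap test. A minimal sketch follows, with a hypothetical helper name and a made-up address layout (init region at 0x1000 to 0x2000, regular text at 0x2000 to 0x9000).

```c
#include <assert.h>
#include <stdint.h>

/*
 * True when the object [ptr, ptr + n) overlaps the half-open range
 * [text_start, text_end).
 */
static int toy_overlaps_text(uintptr_t ptr, uintptr_t n,
                             uintptr_t text_start, uintptr_t text_end)
{
    assert(text_start <= text_end);

    /* No overlap if the object is entirely below or entirely above. */
    if (ptr + n <= text_start || ptr >= text_end)
        return 0;
    return 1;
}
```

An 11-byte object at 0x1800 in reclaimed init memory overlaps the text range when it starts at 0x1000, and no longer does once the range starts at 0x2000.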
      
      Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/15
      Reported-by: Evgeniy Didin <didin@synopsys.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
    • ARCv2: save ABI registers across signal handling · 96f1b001
      Vineet Gupta committed
      ARCv2 has some configuration dependent registers (r30, r58, r59) which
      could be targeted by the compiler. To keep the ABI stable, these were
      unconditionally part of the glibc ABI
      (sysdeps/unix/sysv/linux/arc/sys/ucontext.h:mcontext_t). However, we
      missed populating them (by saving/restoring them across signal
      handling).
      
      This patch fixes the issue by
       - adding arcv2 ABI regs to kernel struct sigcontext
       - populating them during signal handling
      
      Change to struct sigcontext might seem like a glibc ABI change (although
      it primarily uses ucontext_t:mcontext_t) but the fact is
       - it has only been extended (existing fields are not touched)
       - the old sigcontext was ABI incomplete to begin with anyways
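
That a struct was only extended can be checked with `offsetof`. The structs below are hypothetical, heavily reduced stand-ins (the real register sets are much larger); the sketch only shows the property being claimed: existing members keep their offsets and the new ones come after the old layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced register sets. */
struct toy_user_regs  { unsigned long r0, r1; };
struct toy_v2abi_regs { unsigned long r30, r58, r59; };

struct toy_sigcontext_old {
    struct toy_user_regs regs;
};

struct toy_sigcontext_new {
    struct toy_user_regs regs;     /* unchanged, same offset as before */
    struct toy_v2abi_regs v2abi;   /* appended after the old layout */
};

/* Returns 1 when the new layout is a pure extension of the old one. */
static int toy_layout_is_extension(void)
{
    return offsetof(struct toy_sigcontext_new, regs) ==
           offsetof(struct toy_sigcontext_old, regs) &&
           offsetof(struct toy_sigcontext_new, v2abi) >=
           sizeof(struct toy_sigcontext_old);
}
```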
      
      Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/53
      Cc: <stable@vger.kernel.org>
      Tested-by: kernel test robot <lkp@intel.com>
      Reported-by: Vladimir Isaev <isaev@synopsys.com>
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
    • riscv: code patching only works on !XIP_KERNEL · 42e0e0b4
      Jisheng Zhang committed
      Some features which need code patching, such as KPROBES, DYNAMIC_FTRACE
      and KGDB, can only work on !XIP_KERNEL. Add dependencies for these
      features that rely on code patching.

      Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
    • riscv: xip: support runtime trap patching · 5e63215c
      Vitaly Wool committed
      RISCV_ERRATA_ALTERNATIVE patches text at runtime which is currently
      not possible when the kernel is executed from the flash in XIP mode.
      Since runtime patching concerns only traps at the moment, let's just
      have all the traps reside in RAM anyway if RISCV_ERRATA_ALTERNATIVE
      is set. Thus, these functions will be patch-able even when the .text
      section is in flash.

      Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
  11. 10 Jun, 2021 6 commits
    • KVM: x86: Immediately reset the MMU context when the SMM flag is cleared · 78fcb2c9
      Sean Christopherson committed
      Immediately reset the MMU context when the vCPU's SMM flag is cleared so
      that the SMM flag in the MMU role is always synchronized with the vCPU's
      flag.  If RSM fails (which isn't correctly emulated), KVM will bail
      without calling post_leave_smm() and leave the MMU in a bad state.
      
      The bad MMU role can lead to a NULL pointer dereference when grabbing a
      shadow page's rmap for a page fault as the initial lookups for the gfn
      will happen with the vCPU's SMM flag (=0), whereas the rmap lookup will
      use the shadow page's SMM flag, which comes from the MMU (=1).  SMM has
      an entirely different set of memslots, and so the initial lookup can find
      a memslot (SMM=0) and then explode on the rmap memslot lookup (SMM=1).
      
        general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
        KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
        CPU: 1 PID: 8410 Comm: syz-executor382 Not tainted 5.13.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__gfn_to_rmap arch/x86/kvm/mmu/mmu.c:935 [inline]
        RIP: 0010:gfn_to_rmap+0x2b0/0x4d0 arch/x86/kvm/mmu/mmu.c:947
        Code: <42> 80 3c 20 00 74 08 4c 89 ff e8 f1 79 a9 00 4c 89 fb 4d 8b 37 44
        RSP: 0018:ffffc90000ffef98 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff888015b9f414 RCX: ffff888019669c40
        RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001
        RBP: 0000000000000001 R08: ffffffff811d9cdb R09: ffffed10065a6002
        R10: ffffed10065a6002 R11: 0000000000000000 R12: dffffc0000000000
        R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
        FS:  000000000124b300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000028e31000 CR4: 00000000001526e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         rmap_add arch/x86/kvm/mmu/mmu.c:965 [inline]
         mmu_set_spte+0x862/0xe60 arch/x86/kvm/mmu/mmu.c:2604
         __direct_map arch/x86/kvm/mmu/mmu.c:2862 [inline]
         direct_page_fault+0x1f74/0x2b70 arch/x86/kvm/mmu/mmu.c:3769
         kvm_mmu_do_page_fault arch/x86/kvm/mmu.h:124 [inline]
         kvm_mmu_page_fault+0x199/0x1440 arch/x86/kvm/mmu/mmu.c:5065
         vmx_handle_exit+0x26/0x160 arch/x86/kvm/vmx/vmx.c:6122
         vcpu_enter_guest+0x3bdd/0x9630 arch/x86/kvm/x86.c:9428
         vcpu_run+0x416/0xc20 arch/x86/kvm/x86.c:9494
         kvm_arch_vcpu_ioctl_run+0x4e8/0xa40 arch/x86/kvm/x86.c:9722
         kvm_vcpu_ioctl+0x70f/0xbb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3460
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:1069 [inline]
         __se_sys_ioctl+0xfb/0x170 fs/ioctl.c:1055
         do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x440ce9
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+fb0b6a7e8713aeb0319c@syzkaller.appspotmail.com
      Fixes: 9ec19493 ("KVM: x86: clear SMM flags before loading state while leaving SMM")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      78fcb2c9
    • G
      KVM: x86: Fix fall-through warnings for Clang · 551912d2
      Gustavo A. R. Silva committed
      In preparation to enable -Wimplicit-fallthrough for Clang, fix a couple
      of warnings by explicitly adding break statements instead of just letting
      the code fall through to the next case.
      
      Link: https://github.com/KSPP/linux/issues/115
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Message-Id: <20210528200756.GA39320@embeddedor>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      551912d2
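
The kind of change described in the commit above can be illustrated with a minimal sketch. The enum and function names here are hypothetical and are not taken from the KVM sources. The assumption is that Clang's -Wimplicit-fallthrough also flags a case that falls through into a label containing only a break, so each case gets its own explicit break.

```c
/* Hypothetical event codes, used only for this illustration. */
enum event_kind { EVENT_READ, EVENT_WRITE, EVENT_OTHER };

/* Map an event to a weight; every case ends with an explicit break. */
int event_weight(enum event_kind kind)
{
	int weight = 0;

	switch (kind) {
	case EVENT_READ:
		weight = 1;
		break;
	case EVENT_WRITE:
		weight = 2;
		break;	/* without this, control would fall into default */
	default:
		break;
	}

	return weight;
}
```

Adding the break does not change behavior in this sketch, because the default label does nothing; it only makes the intent explicit to the compiler and to readers.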
    • C
      KVM: SVM: fix doc warnings · 02ffbe63
      ChenXiaoSong committed
      Fix kernel-doc warnings:
      
      arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'activate' not described in 'avic_update_access_page'
      arch/x86/kvm/svm/avic.c:233: warning: Function parameter or member 'kvm' not described in 'avic_update_access_page'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'e' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'kvm' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'svm' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:781: warning: Function parameter or member 'vcpu_info' not described in 'get_pi_vcpu_info'
      arch/x86/kvm/svm/avic.c:1009: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Message-Id: <20210609122217.2967131-1-chenxiaosong2@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      02ffbe63
    • C
      x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs · a8383dfb
      CodyYao-oc committed
      The following commit:
      
         3a4ac121 ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
      
      Got the old-style NMI watchdog logic wrong and broke it for basically every
      Intel CPU where it was active. Which is only truly old CPUs, so few people noticed.
      
      On CPUs with perf events support we turn off the old-style NMI watchdog, so it
      was pretty pointless to add the logic for X86_VENDOR_ZHAOXIN to begin with ... :-/
      
      Anyway, the fix is to restore the old logic and add a 'break'.
      
      [ mingo: Wrote a new changelog. ]
      
      Fixes: 3a4ac121 ("x86/perf: Add hardware performance events support for Zhaoxin CPU.")
      Signed-off-by: CodyYao-oc <CodyYao-oc@zhaoxin.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210607025335.9643-1-CodyYao-oc@zhaoxin.com
      a8383dfb
    • T
      x86/fpu: Reset state for all signal restore failures · efa16550
      Thomas Gleixner committed
      If access_ok() or fpregs_soft_set() fails in __fpu__restore_sig() then the
      function just returns but does not clear the FPU state as it does for all
      other fatal failures.
      
      Clear the FPU state for these failures as well.
      
      Fixes: 72a671ce ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/87mtryyhhz.ffs@nanos.tec.linutronix.de
      efa16550
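
The idea in the commit above, that every failure path should leave the state cleared, can be sketched in plain C. This is not the kernel code: struct fp_state, buffer_ok() (a stand-in for access_ok()), and the 0xff "invalid content" marker are all assumptions made up for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STATE_SIZE 16

/* Hypothetical register state, standing in for the real FPU state. */
struct fp_state {
	unsigned char regs[STATE_SIZE];
};

/* Put the state back to a known (all-zero) init state. */
static void state_clear(struct fp_state *st)
{
	memset(st->regs, 0, sizeof(st->regs));
}

/* Stand-in for access_ok(): accept only a non-NULL, full-size buffer. */
static bool buffer_ok(const unsigned char *buf, size_t len)
{
	return buf != NULL && len == STATE_SIZE;
}

/* Restore state from buf; on any failure the state is cleared. */
int state_restore(struct fp_state *st, const unsigned char *buf, size_t len)
{
	int ret = -1;

	if (!buffer_ok(buf, len))
		goto out;	/* early failure still reaches the reset below */

	if (buf[0] == 0xff)	/* hypothetical content validation failure */
		goto out;

	memcpy(st->regs, buf, STATE_SIZE);
	ret = 0;
out:
	if (ret)
		state_clear(st);
	return ret;
}
```

Routing all failures through one exit label means a newly added early check cannot return without resetting the state.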
    • J
      kvm: LAPIC: Restore guard to prevent illegal APIC register access · 218bf772
      Jim Mattson committed
      Per the SDM, "any access that touches bytes 4 through 15 of an APIC
      register may cause undefined behavior and must not be executed."
      Worse, such an access in kvm_lapic_reg_read can result in a leak of
      kernel stack contents. Prior to commit 01402cf8 ("kvm: LAPIC:
      write down valid APIC registers"), such an access was explicitly
      disallowed. Restore the guard that was removed in that commit.
      
      Fixes: 01402cf8 ("kvm: LAPIC: write down valid APIC registers")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Message-Id: <20210602205224.3189316-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      218bf772
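
The SDM rule quoted in the commit above can be expressed as a small bounds check. The following is a sketch under assumptions, not the kernel's kvm_lapic_reg_read: the function name and macros are hypothetical, and it assumes each APIC register occupies a 16-byte slot of which only bytes 0 through 3 may be accessed.

```c
#include <stdbool.h>
#include <stdint.h>

#define REG_STRIDE      16u	/* assumed size of one register slot */
#define REG_VALID_BYTES  4u	/* only bytes 0..3 of a slot may be touched */

/*
 * Return true if an access of len bytes at offset stays within the
 * first four bytes of its register slot.
 */
bool apic_access_allowed(uint32_t offset, unsigned int len)
{
	unsigned int alignment = offset & (REG_STRIDE - 1);

	if (alignment > REG_VALID_BYTES)
		return false;

	/* Written as a subtraction so that a huge len cannot overflow. */
	return len <= REG_VALID_BYTES - alignment;
}
```

For example, a 4-byte read at offset 0x300 passes, while a 4-byte read at 0x302 or any read starting at 0x304 is rejected because it would touch bytes 4 through 15 of the slot.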
  12. 09 Jun, 2021 4 commits