1. 05 Aug, 2019 1 commit
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li committed
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU
      VMs on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() stays true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that the condition kvm_arch_vcpu_runnable(vcpu) is checked and found true.
      When APICv is enabled, the yield-candidate vCPU's VMCS RVI field then
      leaks (via vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that
      is spinning on a taken lock.
      
      This patch fixes it by conservatively checking only a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 22 Jul, 2019 2 commits
    • W
      KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Wanpeng Li committed
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server. PAGE_ALLOC_COSTLY_ORDER (3)
      is the order at which allocations are deemed costly to service. In serverless
      scenarios, one host can serve hundreds or thousands of firecracker/kata-container
      instances; however, new instances fail to launch once memory is too
      fragmented to allocate the kvm_vcpu struct on the host. This was observed in
      some cloud providers' production environments.
      
      This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on my
      Skylake server.
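
To see why the two struct sizes differ in practice, the sketch below computes the page order needed for a contiguous allocation. It is an illustration only: it assumes 4 KiB pages and a power-of-two rounding of the allocation, and `order_for_size()` is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4 KiB pages, as is usual on x86 */

/* Smallest order such that (PAGE_SIZE_BYTES << order) covers size bytes. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = PAGE_SIZE_BYTES;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under these assumptions, 19456 bytes needs an order-3 block (eight contiguous pages), which matches the PAGE_ALLOC_COSTLY_ORDER value quoted above, while 15168 bytes fits in an order-2 block (four pages).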
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • P
      Revert "kvm: x86: Use task structs fpu field for user" · ec269475
      Paolo Bonzini committed
      This reverts commit 240c35a3
      ("kvm: x86: Use task structs fpu field for user", 2018-11-06).
      The commit is broken and causes QEMU's FPU state to be destroyed
      when KVM_RUN is preempted.
      
      Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 13 Jul, 2019 4 commits
    • M
      asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel] · 5fba4af4
      Mike Rapoport committed
      Most architectures have identical or very similar implementations of
      pte_alloc_one_kernel(), pte_alloc_one(), pte_free_kernel() and
      pte_free().
      
      Add a generic implementation that can be reused across architectures and
      enable its use on x86.
      
      The generic implementation uses
      
      	GFP_KERNEL | __GFP_ZERO
      
      for the kernel page tables and
      
      	GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
      
      for the user page tables.
      
      The "base" functions for PTE allocation, namely __pte_alloc_one_kernel()
      and __pte_alloc_one() are intended for the architectures that require
      additional actions after actual memory allocation or must use non-default
      GFP flags.
      
      x86 is switched to use generic pte_alloc_one_kernel(), pte_free_kernel() and
      pte_free().
      
      x86 still implements pte_alloc_one() to allow run-time control of the GFP
      flags required for the "userpte" command line option.
      
      Link: http://lkml.kernel.org/r/1557296232-15361-2-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: lift the x86_32 PAE version of gup_get_pte to common code · 39656e83
      Christoph Hellwig committed
      The split low/high access is the only non-READ_ONCE version of gup_get_pte
      that did show up in the various arch implementations. Lift it to common
      code and drop the ifdef based arch override.
      
      Link: http://lkml.kernel.org/r/20190625143715.1689-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: simplify gup_fast_permitted · 26f4c328
      Christoph Hellwig committed
      Pass in the already calculated end value instead of recomputing it, and
      leave the end > start check in the callers instead of duplicating it in
      the arch code.
      
      Link: http://lkml.kernel.org/r/20190625143715.1689-3-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      asm-generic, x86: add bitops instrumentation for KASAN · 751ad98d
      Marco Elver committed
      This adds a new header to asm-generic to allow optionally instrumenting
      architecture-specific asm implementations of bitops.
      
      This change includes the required change for x86 as reference and
      changes the kernel API doc to point to bitops-instrumented.h instead.
      Rationale: the functions in x86's bitops.h are no longer the kernel API
      functions, but instead the arch_ prefixed functions, which are then
      instrumented via bitops-instrumented.h.
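
The layering described in this rationale can be sketched in plain C. This is a user-space illustration, not the kernel header: `check_write()` is a hypothetical stand-in for the sanitizer access check, `arch_set_bit()` is a generic, non-atomic C version rather than x86 asm, and `checked_accesses` exists only to make the wrapper's effect observable.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the sanitizer hook; it only counts calls. */
static unsigned long checked_accesses;

static void check_write(const volatile void *addr, size_t size)
{
    (void)addr;
    (void)size;
    checked_accesses++;
}

/* "arch_" prefixed primitive; in the kernel this would be arch-specific asm. */
static void arch_set_bit(unsigned long nr, volatile unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Public API name: run the check on the touched word, then defer to arch_. */
static void set_bit(unsigned long nr, volatile unsigned long *addr)
{
    check_write(addr + nr / BITS_PER_LONG, sizeof(unsigned long));
    arch_set_bit(nr, addr);
}
```

Callers keep using the unprefixed name, so every access passes through the check without the arch code needing to know about the instrumentation.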
      
      Other architectures can similarly add support for asm implementations of
      bitops.
      
      The documentation text was derived from x86 and existing bitops
      asm-generic versions: 1) references to x86 have been removed; 2) as a
      result, some of the text had to be reworded for clarity and consistency.
      
      Tested using lib/test_kasan with bitops tests (pre-requisite patch).
      Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=198439
      
      Link: http://lkml.kernel.org/r/20190613125950.197667-4-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 11 Jul, 2019 2 commits
  5. 10 Jul, 2019 1 commit
    • A
      x86/pgtable/32: Fix LOWMEM_PAGES constant · 26515699
      Arnd Bergmann committed
      clang points out that the computation of LOWMEM_PAGES causes a signed
      integer overflow on 32-bit x86:
      
      arch/x86/kernel/head32.c:83:20: error: signed shift result (0x100000000) requires 34 bits to represent, but 'int' only has 32 bits [-Werror,-Wshift-overflow]
                      (PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT);
                                       ^~~~~~~~~~~~
      arch/x86/include/asm/pgtable_32.h:109:27: note: expanded from macro 'LOWMEM_PAGES'
       #define LOWMEM_PAGES ((((2<<31) - __PAGE_OFFSET) >> PAGE_SHIFT))
                               ~^ ~~
      arch/x86/include/asm/pgtable_32.h:98:34: note: expanded from macro 'PAGE_TABLE_SIZE'
       #define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
      
      Use the _ULL() macro to make it a 64-bit constant.
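
The effect of the wider constant can be checked with a small stand-alone sketch. The names below are illustrative assumptions: `ULL_CONST()` is a local stand-in for `_ULL()`, and `PAGE_OFFSET_ASSUMED` uses the common 3G/1G split value rather than a value taken from this commit.

```c
#define PAGE_SHIFT 12
#define PAGE_OFFSET_ASSUMED 0xC0000000ULL /* assumption: 3G/1G split */
#define ULL_CONST(x) x##ULL               /* local stand-in for _ULL() */

/*
 * (2 << 31) shifts a 32-bit int out of range. Starting from a 64-bit
 * constant makes the intermediate value 0x100000000 representable.
 */
#define LOWMEM_PAGES_FIXED \
    (((ULL_CONST(2) << 31) - PAGE_OFFSET_ASSUMED) >> PAGE_SHIFT)

static unsigned long long lowmem_pages(void)
{
    return LOWMEM_PAGES_FIXED;
}
```

With these assumed values the result is 0x40000 pages, i.e. 1 GiB of lowmem in 4 KiB pages.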
      
      Fixes: 1e620f9b ("x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190710130522.1802800-1-arnd@arndb.de
  6. 09 Jul, 2019 2 commits
  7. 07 Jul, 2019 1 commit
  8. 03 Jul, 2019 5 commits
    • T
      x86/fsgsbase: Revert FSGSBASE support · 049331f2
      Thomas Gleixner committed
      The FSGSBASE series turned out to have serious bugs and there is still an
      open issue which is not fully understood yet.
      
      The confidence in those changes has become close to zero especially as the
      test cases which have been shipped with that series were obviously never
      run before sending the final series out to LKML.
      
        ./fsgsbase_64 >/dev/null
        Segmentation fault
      
      As the merge window is close, the only sane decision is to revert FSGSBASE
      support. The revert is necessary as this branch has been merged into
      perf/core already and rebasing all of that a few days before the merge
      window is not the most brilliant idea.
      
      I could definitely slap myself for not noticing the test case fail when
      merging that series, but TBH my expectations weren't that low back
      then. Won't happen again.
      
      Revert the following commits:
      539bca53 ("x86/entry/64: Fix and clean up paranoid_exit")
      2c7b5ac5 ("Documentation/x86/64: Add documentation for GS/FS addressing mode")
      f987c955 ("x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2")
      2032f1f9 ("x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit")
      5bf0cab6 ("x86/entry/64: Document GSBASE handling in the paranoid path")
      708078f6 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
      79e1932f ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
      1d07316b ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
      f60a83df ("x86/process/64: Use FSGSBASE instructions on thread copy and ptrace")
      1ab5f3f7 ("x86/process/64: Use FSBSBASE in switch_to() if available")
      a86b4625 ("x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions")
      8b71340d ("x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions")
      b64ed19b ("x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Chang S. Bae <chang.seok.bae@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
    • M
      clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic · dd2cb348
      Michael Kelley committed
      Continue consolidating Hyper-V clock and timer code into an ISA
      independent Hyper-V clocksource driver.
      
      Move the existing clocksource code under drivers/hv and arch/x86 to the new
      clocksource driver while separating out the ISA dependencies. Update
      Hyper-V initialization to call initialization and cleanup routines since
      the Hyper-V synthetic clock is not independently enumerated in ACPI.
      
      Update Hyper-V clocksource users in KVM and VDSO to get definitions from
      the new include file.
      
      No behavior is changed and no new functionality is added.
      Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "bp@alien8.de" <bp@alien8.de>
      Cc: "will.deacon@arm.com" <will.deacon@arm.com>
      Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
      Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
      Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: "sashal@kernel.org" <sashal@kernel.org>
      Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
      Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
      Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
      Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
      Cc: "arnd@arndb.de" <arnd@arndb.de>
      Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
      Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
      Cc: "paul.burton@mips.com" <paul.burton@mips.com>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "salyzyn@android.com" <salyzyn@android.com>
      Cc: "pcc@google.com" <pcc@google.com>
      Cc: "shuah@kernel.org" <shuah@kernel.org>
      Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
      Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
      Cc: "huw@codeweavers.com" <huw@codeweavers.com>
      Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
      Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
      Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
      Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
      Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
    • M
      clocksource/drivers: Make Hyper-V clocksource ISA agnostic · fd1fea68
      Michael Kelley committed
      Hyper-V clock/timer code and data structures are currently mixed
      in with other code in the ISA independent drivers/hv directory as
      well as the ISA dependent Hyper-V code under arch/x86.
      
      Consolidate this code and data structures into a Hyper-V clocksource driver
      to better follow the Linux model. In doing so, separate out the ISA
      dependent portions so the new clocksource driver works for x86 and for the
      in-process Hyper-V on ARM64 code.
      
      To start, move the existing clockevents code to create the new clocksource
      driver. Update the VMbus driver to call initialization and cleanup routines
      since the Hyper-V synthetic timers are not independently enumerated in
      ACPI.
      
      No behavior is changed and no new functionality is added.
      Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "bp@alien8.de" <bp@alien8.de>
      Cc: "will.deacon@arm.com" <will.deacon@arm.com>
      Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
      Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
      Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: "sashal@kernel.org" <sashal@kernel.org>
      Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
      Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
      Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
      Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
      Cc: "arnd@arndb.de" <arnd@arndb.de>
      Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
      Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
      Cc: "paul.burton@mips.com" <paul.burton@mips.com>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "salyzyn@android.com" <salyzyn@android.com>
      Cc: "pcc@google.com" <pcc@google.com>
      Cc: "shuah@kernel.org" <shuah@kernel.org>
      Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
      Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
      Cc: "huw@codeweavers.com" <huw@codeweavers.com>
      Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
      Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
      Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
      Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
      Link: https://lkml.kernel.org/r/1561955054-1838-2-git-send-email-mikelley@microsoft.com
    • T
      x86/irq: Seperate unused system vectors from spurious entry again · f8a8fe61
      Thomas Gleixner committed
      Quite some time ago the interrupt entry stubs for unused vectors in the
      system vector range got removed and directly mapped to the spurious
      interrupt vector entry point.
      
      Sounds reasonable, but it's subtly broken. The spurious interrupt vector
      entry point pushes vector number 0xFF on the stack which makes the whole
      logic in __smp_spurious_interrupt() pointless.
      
      As a consequence any spurious interrupt which comes from a vector != 0xFF
      is treated as a real spurious interrupt (vector 0xFF) and not
      acknowledged. That subsequently stalls all interrupt vectors of equal and
      lower priority, which brings the system to a grinding halt.
      
      This can happen because even on 64-bit the system vector space is not
      guaranteed to be fully populated. A full compile time handling of the
      unused vectors is not possible because quite some of them are conditionally
      populated at runtime.
      
      Bring the entry stubs back, which wastes 160 bytes if all stubs are unused,
      but gains the proper handling back. There is no point to selectively spare
      some of the stubs which are known at compile time as the required code in
      the IDT management would be way larger and convoluted.
      
      Do not route the spurious entries through common_interrupt and do_IRQ() as
      the original code did. Route it to smp_spurious_interrupt() which evaluates
      the vector number and acts accordingly now that the real vector numbers are
      handed in.
      
      Fixup the pr_warn so the actual spurious vector (0xff) is clearly
      distinguished from the other vectors and also note for the vectored case
      whether it was pending in the ISR or not.
      
       "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen."
       "Spurious interrupt vector 0xed on CPU#1. Acked."
       "Spurious interrupt vector 0xee on CPU#1. Not pending!."
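
The three messages correspond to three cases: the real spurious vector, a vector that is pending in the ISR and must be acknowledged, and a vector that is not pending. The following user-space sketch models that decision. It is not the kernel handler: `classify_spurious()` is a hypothetical name, and the 256-bit ISR is simulated as eight 32-bit words instead of being read from the local APIC.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SPURIOUS_VECTOR 0xFFu

/* Returns 1 if the vector has to be acknowledged (EOI), 0 otherwise. */
static int classify_spurious(unsigned int vector, int cpu,
                             const uint32_t isr[8], char *msg, size_t len)
{
    vector &= 0xFFu; /* keep the simulated ISR lookup in bounds */

    if (vector == SPURIOUS_VECTOR) {
        snprintf(msg, len,
                 "Spurious APIC interrupt (vector 0xFF) on CPU#%d, should never happen.",
                 cpu);
        return 0;
    }

    /* Each ISR word covers 32 vectors; test the bit for this vector. */
    if (isr[vector >> 5] & (UINT32_C(1) << (vector & 0x1f))) {
        snprintf(msg, len, "Spurious interrupt vector 0x%02x on CPU#%d. Acked.",
                 vector, cpu);
        return 1;
    }

    snprintf(msg, len, "Spurious interrupt vector 0x%02x on CPU#%d. Not pending!.",
             vector, cpu);
    return 0;
}
```

Only the middle case returns 1, which is the situation where a missing acknowledgement would block vectors of equal and lower priority.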
      
      Fixes: 2414e021 ("x86: Avoid building unused IRQ entry stubs")
      Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de
    • T
      x86/irq: Handle spurious interrupt after shutdown gracefully · b7107a67
      Thomas Gleixner committed
      Since the rework of the vector management, warnings about spurious
      interrupts have been reported. Robert provided some more information and
      did an initial analysis. The following situation leads to these warnings:
      
         CPU 0                  CPU 1               IO_APIC
      
                                                    interrupt is raised
                                                    sent to CPU1
      			  Unable to handle
      			  immediately
      			  (interrupts off,
      			   deep idle delay)
         mask()
         ...
         free()
           shutdown()
           synchronize_irq()
           clear_vector()
                                do_IRQ()
                                  -> vector is clear
      
      Before the rework the vector entries of legacy interrupts were statically
      assigned and occupied precious vector space while most of them were
      unused. Due to that the above situation was handled silently because the
      vector was handled and the core handler of the assigned interrupt
      descriptor noticed that it is shut down and returned.
      
      While this has been usually observed with legacy interrupts, this situation
      is not limited to them. Any other interrupt source, e.g. MSI, can cause the
      same issue.
      
      After adding proper synchronization for level triggered interrupts, this
      can only happen for edge triggered interrupts where the IO-APIC obviously
      cannot provide information about interrupts in flight.
      
      While the spurious warning is actually harmless in this case it worries
      users and driver developers.
      
      Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead
      of VECTOR_UNUSED when the vector is freed up.
      
      If that above late handling happens the spurious detector will not complain
      and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on
      that line will trigger the spurious warning as before.
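
The resulting behavior is a small state machine: a freed vector tolerates exactly one late interrupt before warnings resume. The sketch below models it with an enum for readability. That is an assumption made for illustration, as are the helper names `free_vector()` and `handle_unassigned_vector()`. The kernel's actual representation of vector entries is not shown in this commit message.

```c
enum vector_state {
    VECTOR_UNUSED,   /* never assigned, or late-interrupt grace already used */
    VECTOR_SHUTDOWN, /* freed recently; one in-flight interrupt is tolerated */
    VECTOR_ASSIGNED  /* has a live interrupt descriptor */
};

/* Freeing marks the slot SHUTDOWN instead of UNUSED. */
static void free_vector(enum vector_state *slot)
{
    *slot = VECTOR_SHUTDOWN;
}

/* Returns 1 if a spurious-interrupt warning should be emitted. */
static int handle_unassigned_vector(enum vector_state *slot)
{
    if (*slot == VECTOR_SHUTDOWN) {
        *slot = VECTOR_UNUSED; /* swallow the late interrupt silently */
        return 0;
    }
    return *slot == VECTOR_UNUSED;
}
```

A second interrupt on the same freed slot finds it in the unused state and is reported, so genuinely spurious interrupts remain visible.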
      
      Fixes: 464d1230 ("x86/vector: Switch IOAPIC to global reservation mode")
      Reported-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Robert Hodaszi <Robert.Hodaszi@digi.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.459647741@linutronix.de
  9. 01 Jul, 2019 1 commit
  10. 29 Jun, 2019 1 commit
    • T
      x86/timer: Skip PIT initialization on modern chipsets · c8c40767
      Thomas Gleixner committed
      Recent Intel chipsets including Skylake and ApolloLake have a special
      ITSSPRC register which allows the 8254 PIT to be gated.  When gated, the
      8254 registers can still be programmed as normal, but there are no IRQ0
      timer interrupts.
      
      Some products such as the Connex L1430 and exone go Rugged E11 use this
      register to ship with the PIT gated by default. This causes Linux to fail
      to boot:
      
        Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
        apic=debug and send a report.
      
      The panic happens before the framebuffer is initialized, so to the user, it
      appears as an early boot hang on a black screen.
      
      Affected products typically have a BIOS option that can be used to enable
      the 8254 and make Linux work (Chipset -> South Cluster Configuration ->
      Miscellaneous Configuration -> 8254 Clock Gating), however it would be best
      to make Linux support the no-8254 case.
      
      Modern systems allow the TSC and local APIC timer frequencies to be
      discovered, so the calibration against the PIT is not required. These
      systems have always-running timers, and the local APIC timer also works in
      deep power states.
      
      So the setup of the PIT, including the IO-APIC timer interrupt delivery
      checks, is a pointless exercise.
      
      Skip the PIT setup and the IO-APIC timer interrupt checks on these systems,
      which avoids the panic caused by non ticking PITs and also speeds up the
      boot process.
      
      Thanks to Daniel for providing the changelog, initial analysis of the
      problem and testing against a variety of machines.
      Reported-by: Daniel Drake <drake@endlessm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: linux@endlessm.com
      Cc: rafael.j.wysocki@intel.com
      Cc: hdegoede@redhat.com
      Link: https://lkml.kernel.org/r/20190628072307.24678-1-drake@endlessm.com
  11. 28 Jun, 2019 3 commits
  12. 26 Jun, 2019 2 commits
    • Z
      x86/speculation/mds: Eliminate leaks by trace_hardirqs_on() · ab3765a0
      Zhenzhong Duan committed
      Move mds_idle_clear_cpu_buffers() after trace_hardirqs_on() to ensure
      all store buffer entries are flushed.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: jgross@suse.com
      Cc: ndesaulniers@google.com
      Cc: gregkh@linuxfoundation.org
      Link: https://lkml.kernel.org/r/1561260904-29669-2-git-send-email-zhenzhong.duan@oracle.com
    • T
      lib/vdso: Make delta calculation work correctly · 9d90b93b
      Thomas Gleixner committed
      The x86 vdso implementation on which the generic vdso library is based
      has subtle (unfortunately undocumented) twists:
      
       1) The code assumes that the clocksource mask is U64_MAX which means that
          no bits are masked. Which is true for any valid x86 VDSO clocksource.
          Stupidly it still did the mask operation for no reason and at the wrong
          place right after reading the clocksource.
      
       2) It contains a sanity check to catch the case where slightly
          unsynchronized TSC values can be observed which would cause the delta
          calculation to make a huge jump. It therefore checks whether the
          current TSC value is larger than the value on which the current
          conversion is based. If it's not larger the base value is used to
          prevent time jumps.
      
      #1 is not only stupid for the x86 case because it does the masking for no
      reason, it is also completely wrong for clocksources with a smaller mask
      which can legitimately wrap around during a conversion period. The core
      timekeeping code does it correctly by applying the mask after the delta
      calculation:
      
      	(now - base) & mask
      
      #2 is equally broken for clocksources which have smaller masks and can wrap
      around during a conversion period because there the now > base check is
      just wrong and causes stale time stamps and time going backwards issues.
      
      Unbreak it by:
      
        1) Removing the mask operation from the clocksource read which makes the
           fallback detection work for all clocksources
      
        2) Replacing the conditional delta calculation with an overridable inline
           function.
      
      #2 could reuse clocksource_delta() from the timekeeping code but that
      results in a significant performance hit for the x86 VDSO. The timekeeping
      core code must have the non-optimized version as it has to operate
      correctly with clocksources which have smaller masks as well, to handle the
      case where TSC is discarded as timekeeper clocksource and replaced by HPET
      or pmtimer. For the VDSO there is no replacement clocksource. If TSC is
      unusable the syscall is enforced which does the right thing.
      
      To accommodate the needs of various architectures, provide an
      overridable inline function which defaults to the regular delta
      calculation with masking:
      
      	(now - base) & mask
      
      Override it for x86 with the non-masking and checking version.
      
      This unbreaks the ARM64 syscall fallback operation, allows clocksources
      with arbitrary width to be used and preserves the performance
      optimization for x86.
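
The difference between the two delta calculations shows up as soon as a narrow counter wraps. The sketch below is a stand-alone illustration with hypothetical function names (`delta_masked()`, `delta_checked()`). It mirrors the formulas described in this message rather than reproducing the vdso library code.

```c
#include <stdint.h>

/*
 * Generic default: subtract first, then mask. Unsigned wraparound makes
 * the result correct even if the counter wrapped since base was taken.
 */
static uint64_t delta_masked(uint64_t now, uint64_t base, uint64_t mask)
{
    return (now - base) & mask;
}

/*
 * x86-style variant as described above: no masking, and a now value that
 * is not ahead of base yields a zero delta, i.e. the base time is used.
 */
static uint64_t delta_checked(uint64_t now, uint64_t base)
{
    return now > base ? now - base : 0;
}
```

For a 24-bit counter that advanced from 0xFFFFF0 to 0x000010, `delta_masked(0x000010, 0xFFFFF0, 0xFFFFFF)` yields 0x20 ticks, whereas the checked variant sees `now < base` and returns 0, which is the stale-timestamp behavior the commit describes for clocksources with smaller masks.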
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: linux-arch@vger.kernel.org
      Cc: LAK <linux-arm-kernel@lists.infradead.org>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Cc: catalin.marinas@arm.com
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux@armlinux.org.uk
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: paul.burton@mips.com
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: salyzyn@android.com
      Cc: pcc@google.com
      Cc: shuah@kernel.org
      Cc: 0x7f454c46@gmail.com
      Cc: linux@rasmusvillemoes.dk
      Cc: huw@codeweavers.com
      Cc: sthotton@marvell.com
      Cc: andre.przywara@arm.com
      Cc: Andy Lutomirski <luto@kernel.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1906261159230.32342@nanos.tec.linutronix.de
  13. 25 Jun, 2019 3 commits
  14. 24 Jun, 2019 3 commits
    • F
      x86/umwait: Initialize umwait control values · bd688c69
      Fenghua Yu committed
      umwait or tpause allows the processor to enter a light-weight
      power/performance optimized state (C0.1 state) or an improved
      power/performance optimized state (C0.2 state) for a period specified by
      the instruction, until the system time limit is reached, or, for umwait,
      until a store to the monitored address range occurs.
      
      IA32_UMWAIT_CONTROL MSR register allows the OS to enable/disable C0.2 on
      the processor and to set the maximum time the processor can reside in C0.1
      or C0.2.
      
      By default C0.2 is enabled so the user wait instructions can enter the
      C0.2 state to save more power with slower wakeup time.
      
      Andy Lutomirski proposed to set the maximum umwait time to 100000 cycles by
      default. A quote from Andy:
      
        "What I want to avoid is the case where it works dramatically differently
         on NO_HZ_FULL systems as compared to everything else. Also, UMWAIT may
         behave a bit differently if the max timeout is hit, and I'd like that
         path to get exercised widely by making it happen even on default
         configs."
      
      A sysfs interface to adjust the time and the C0.2 enablement is provided in
      a follow up change.
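
A minimal sketch of how a control value combining the maximum time and the C0.2 enablement might be composed. The bit layout is an assumption taken from the Intel SDM description of IA32_UMWAIT_CONTROL (bit 0 set disables C0.2, bits 31:2 hold the time limit) and should be checked against the manual; the macro and function names here are hypothetical and are not the kernel's.

```c
#include <stdint.h>

/* Assumed layout of IA32_UMWAIT_CONTROL (verify against the Intel SDM). */
#define UMWAIT_C02_DISABLE 0x1u    /* bit 0: set to disable C0.2 */
#define UMWAIT_TIME_MASK   (~0x3u) /* bits 31:2: maximum wait time */

/*
 * Compose the low 32 bits of the control value.
 * The two low bits of max_time are dropped because they overlap the
 * flag bits, so the time constant is effectively used as a mask.
 */
static inline uint32_t umwait_ctrl_val(uint32_t max_time, uint32_t c02_disable)
{
	return (max_time & UMWAIT_TIME_MASK) | (c02_disable & UMWAIT_C02_DISABLE);
}
```

Under these assumptions, the default described above (100000 cycles, C0.2 enabled) corresponds to `umwait_ctrl_val(100000, 0)`.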
      
      [ tglx: Renamed MSR_IA32_UMWAIT_CONTROL_MAX_TIME to
        	MSR_IA32_UMWAIT_CONTROL_TIME_MASK because the constant is used as
        	mask throughout the code.
      	Massaged comments and changelog ]
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ashok Raj <ashok.raj@intel.com>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-3-git-send-email-fenghua.yu@intel.com
      bd688c69
    • F
      x86/cpufeatures: Enumerate user wait instructions · 6dbbf5ec
      Fenghua Yu committed
      umonitor, umwait, and tpause are a set of user wait instructions.
      
      umonitor arms address monitoring hardware using an address. The
      address range is determined by using CPUID.0x5. A store to
      an address within the specified address range triggers the
      monitoring hardware to wake up the processor waiting in umwait.
      
      umwait instructs the processor to enter an implementation-dependent
      optimized state while monitoring a range of addresses. The optimized
      state may be either a light-weight power/performance optimized state
      (C0.1 state) or an improved power/performance optimized state
      (C0.2 state).
      
      tpause instructs the processor to enter an implementation-dependent
      optimized state (C0.1 or C0.2) and wake up when the time-stamp counter
      reaches the specified timeout.
      
      The three instructions may be executed at any privilege level.
      
      The instructions provide a power-saving method while waiting in
      user space. Additionally, they can allow a sibling hyperthread to
      make faster progress while this thread is waiting. One example of an
      application usage of umwait is when waiting for input data from another
      application, such as a user level multi-threaded packet processing
      engine.
      
      Availability of the user wait instructions is indicated by the presence
      of the CPUID feature flag WAITPKG CPUID.0x07.0x0:ECX[5].
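
A user-space sketch of the feature check described above, reading CPUID leaf 0x07, subleaf 0x0 and testing ECX bit 5. It relies on `__get_cpuid_count()` from the GCC/Clang `<cpuid.h>` header; the helper names `ecx_has_waitpkg` and `cpu_has_waitpkg` are hypothetical, and on non-x86 builds the probe simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define WAITPKG_ECX_BIT 5 /* CPUID.0x07.0x0:ECX[5] */

/* Test the WAITPKG bit in an ECX value returned by CPUID leaf 7, subleaf 0. */
static inline bool ecx_has_waitpkg(uint32_t ecx)
{
	return (ecx >> WAITPKG_ECX_BIT) & 1u;
}

/* Query the running CPU; returns false if leaf 7 is unavailable. */
static inline bool cpu_has_waitpkg(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return ecx_has_waitpkg(ecx);
#else
	return false;
#endif
}
```

An application would call `cpu_has_waitpkg()` once at startup and fall back to an ordinary spin or sleep loop when it returns false.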
      
      Detailed information on the instructions and CPUID feature WAITPKG flag
      can be found in the latest Intel Architecture Instruction Set Extensions
      and Future Features Programming Reference and Intel 64 and IA-32
      Architectures Software Developer's Manual.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ashok Raj <ashok.raj@intel.com>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-2-git-send-email-fenghua.yu@intel.com
      6dbbf5ec
    • A
      x86/vdso: Give the [ph]vclock_page declarations real types · ecf9db3d
      Andy Lutomirski committed
      Clean up the vDSO code a bit by giving pvclock_page and hvclock_page
      their actual types instead of u8[PAGE_SIZE].  This shouldn't
      materially affect the generated code.
      
      Heavily based on a patch from Linus.
      
      [ tglx: Adapted to the unified VDSO code ]
      Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/6920c5188f8658001af1fc56fd35b815706d300c.1561241273.git.luto@kernel.org
      ecf9db3d
  15. 23 Jun, 2019 2 commits
    • V
      x86/vdso: Add clock_getres() entry point · f66501dc
      Vincenzo Frascino committed
      The generic vDSO library provides an implementation of clock_getres()
      that can be leveraged by each architecture.
      
      Add the clock_getres() VDSO entry point on x86.
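
From user space nothing changes in how the call is made; the C library may route `clock_getres()` through the vDSO entry point when one is available. A minimal sketch of such a call, assuming a POSIX system that provides `CLOCK_MONOTONIC` (the helper name `monotonic_resolution_ns` is hypothetical):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/*
 * Return the resolution of CLOCK_MONOTONIC in nanoseconds,
 * or -1 if clock_getres() fails.
 */
static long long monotonic_resolution_ns(void)
{
	struct timespec res;

	if (clock_getres(CLOCK_MONOTONIC, &res) != 0)
		return -1;
	return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}
```

Whether a given call is served by the vDSO or by a real syscall depends on the C library and kernel in use; the result is the same either way.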
      
      [ tglx: Massaged changelog and cleaned up the function signature formatting ]
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Huw Davies <huw@codeweavers.com>
      Cc: Shijith Thotton <sthotton@marvell.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Link: https://lkml.kernel.org/r/20190621095252.32307-24-vincenzo.frascino@arm.com
      f66501dc
    • V
      x86/vdso: Switch to generic vDSO implementation · 7ac87074
      Vincenzo Frascino committed
      The x86 vDSO library requires some adaptations to take advantage of the
      newly introduced generic vDSO library.
      
      Introduce the following changes:
       - Modification of vdso.c to be compliant with the common vdso datapage
       - Use of lib/vdso for gettimeofday
      
      [ tglx: Massaged changelog and cleaned up the function signature formatting ]
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Huw Davies <huw@codeweavers.com>
      Cc: Shijith Thotton <sthotton@marvell.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Link: https://lkml.kernel.org/r/20190621095252.32307-23-vincenzo.frascino@arm.com
      7ac87074
  16. 22 Jun, 2019 7 commits