1. 19 12月, 2019 1 次提交
  2. 21 11月, 2019 5 次提交
  3. 14 11月, 2019 1 次提交
    • S
      KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast() · ed69a6cb
      Sean Christopherson 提交于
      Acquire the per-VM slots_lock when zapping all shadow pages as part of
      toggling nx_huge_pages.  The fast zap algorithm relies on exclusivity
      (via slots_lock) to identify obsolete vs. valid shadow pages, because it
      uses a single bit for its generation number. Holding slots_lock also
      obviates the need to acquire a read lock on the VM's srcu.
      
      Failing to take slots_lock when toggling nx_huge_pages allows multiple
      instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
      user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
      (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
      temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
      to enforce exclusivity).
      
      Concurrent fast zap instances causes obsolete shadow pages to be
      incorrectly identified as valid due to the single bit generation number
      wrapping, which results in stale shadow pages being left in KVM's MMU
      and leads to all sorts of undesirable behavior.
      The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
      toggling nx_huge_pages via its module param.
      
      Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
      using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
      an ulong-sized generation instead of relying on exclusivity for
      correctness, but all callers except the recently added set_nx_huge_pages()
      needed to hold slots_lock anyways.  Therefore, this patch does not have
      to be backported to stable kernels.
      
      Given that toggling nx_huge_pages is by no means a fast path, force it
      to conform to the current approach instead of reintroducing the previous
      generation count.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ed69a6cb
  4. 13 11月, 2019 2 次提交
  5. 12 11月, 2019 6 次提交
  6. 07 11月, 2019 2 次提交
  7. 06 11月, 2019 4 次提交
  8. 05 11月, 2019 5 次提交
    • M
      x86/tsc: Respect tsc command line paraemeter for clocksource_tsc_early · 63ec58b4
      Michael Zhivich 提交于
      The introduction of clocksource_tsc_early broke the functionality of
      "tsc=reliable" and "tsc=nowatchdog" command line parameters, since
      clocksource_tsc_early is unconditionally registered with
      CLOCK_SOURCE_MUST_VERIFY and thus put on the watchdog list.
      
      This can cause the TSC to be declared unstable during boot:
      
        clocksource: timekeeping watchdog on CPU0: Marking clocksource
                     'tsc-early' as unstable because the skew is too large:
        clocksource: 'refined-jiffies' wd_now: fffb7018 wd_last: fffb6e9d
                     mask: ffffffff
        clocksource: 'tsc-early' cs_now: 68a6a7070f6a0 cs_last: 68a69ab6f74d6
                     mask: ffffffffffffffff
        tsc: Marking TSC unstable due to clocksource watchdog
      
      The corresponding elapsed times are cs_nsec=1224152026 and wd_nsec=378942392, so
      the watchdog differs from TSC by 0.84 seconds.
      
      This happens when HPET is not available and jiffies are used as the TSC
      watchdog instead and the jiffies update is not happening due to lost timer
      interrupts in periodic mode, which can happen e.g. with expensive debug
      mechanisms enabled or under massive overload conditions in virtualized
      environments.
      
      Before the introduction of the early TSC clocksource the command line
      parameters "tsc=reliable" and "tsc=nowatchdog" could be used to work around
      this issue.
      
      Restore the behaviour by disabling the watchdog if requested on the kernel
      command line.
      
      [ tglx: Clarify changelog ]
      
      Fixes: aa83c457 ("x86/tsc: Introduce early tsc clocksource")
      Signed-off-by: NMichael Zhivich <mzhivich@akamai.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191024175945.14338-1-mzhivich@akamai.com
      63ec58b4
    • T
      x86/dumpstack/64: Don't evaluate exception stacks before setup · e361362b
      Thomas Gleixner 提交于
      Cyrill reported the following crash:
      
        BUG: unable to handle page fault for address: 0000000000001ff0
        #PF: supervisor read access in kernel mode
        RIP: 0010:get_stack_info+0xb3/0x148
      
      It turns out that if the stack tracer is invoked before the exception stack
      mappings are initialized in_exception_stack() can erroneously classify an
      invalid address as an address inside of an exception stack:
      
          begin = this_cpu_read(cea_exception_stacks);  <- 0
          end = begin + sizeof(exception stacks);
      
      i.e. any address between 0 and end will be considered as exception stack
      address and the subsequent code will then try to derefence the resulting
      stack frame at a non mapped address.
      
       end = begin + (unsigned long)ep->size;
           ==> end = 0x2000
      
       regs = (struct pt_regs *)end - 1;
           ==> regs = 0x2000 - sizeof(struct pt_regs *) = 0x1ff0
      
       info->next_sp   = (unsigned long *)regs->sp;
           ==> Crashes due to accessing 0x1ff0
      
      Prevent this by checking the validity of the cea_exception_stack base
      address and bailing out if it is zero.
      
      Fixes: afcd21da ("x86/dumpstack/64: Use cpu_entry_area instead of orig_ist")
      Reported-by: NCyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NCyrill Gorcunov <gorcunov@gmail.com>
      Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1910231950590.1852@nanos.tec.linutronix.de
      e361362b
    • J
      x86/apic/32: Avoid bogus LDR warnings · fe6f85ca
      Jan Beulich 提交于
      The removal of the LDR initialization in the bigsmp_32 APIC code unearthed
      a problem in setup_local_APIC().
      
      The code checks unconditionally for a mismatch of the logical APIC id by
      comparing the early APIC id which was initialized in get_smp_config() with
      the actual LDR value in the APIC.
      
      Due to the removal of the bogus LDR initialization the check now can
      trigger on bigsmp_32 APIC systems emitting a warning for every booting
      CPU. This is of course a false positive because the APIC is not using
      logical destination mode.
      
      Restrict the check and the possibly resulting fixup to systems which are
      actually using the APIC in logical destination mode.
      
      [ tglx: Massaged changelog and added Cc stable ]
      
      Fixes: bae3a8d3 ("x86/apic: Do not initialize LDR and DFR for bigsmp")
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/666d8f91-b5a8-1afd-7add-821e72a35f03@suse.com
      fe6f85ca
    • H
      timekeeping/vsyscall: Update VDSO data unconditionally · 52338415
      Huacai Chen 提交于
      The update of the VDSO data is depending on __arch_use_vsyscall() returning
      True. This is a leftover from the attempt to map the features of various
      architectures 1:1 into generic code.
      
      The usage of __arch_use_vsyscall() in the actual vsyscall implementations
      got dropped and replaced by the requirement for the architecture code to
      return U64_MAX if the global clocksource is not usable in the VDSO.
      
      But the __arch_use_vsyscall() check in the update code stayed which causes
      the VDSO data to be stale or invalid when an architecture actually
      implements that function and returns False when the current clocksource is
      not usable in the VDSO.
      
      As a consequence the VDSO implementations of clock_getres(), time(),
      clock_gettime(CLOCK_.*_COARSE) operate on invalid data and return bogus
      information.
      
      Remove the __arch_use_vsyscall() check from the VDSO update function and
      update the VDSO data unconditionally.
      
      [ tglx: Massaged changelog and removed the now useless implementations in
        	asm-generic/ARM64/MIPS ]
      
      Fixes: 44f57d78 ("timekeeping: Provide a generic update_vsyscall() implementation")
      Signed-off-by: NHuacai Chen <chenhc@lemote.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: linux-mips@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1571887709-11447-1-git-send-email-chenhc@lemote.com
      52338415
    • J
      kvm: x86: mmu: Recovery of shattered NX large pages · 1aa9b957
      Junaid Shahid 提交于
      The page table pages corresponding to broken down large pages are zapped in
      FIFO order, so that the large page can potentially be recovered, if it is
      not longer being used for execution.  This removes the performance penalty
      for walking deeper EPT page tables.
      
      By default, one large page will last about one hour once the guest
      reaches a steady state.
      Signed-off-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      1aa9b957
  9. 04 11月, 2019 5 次提交
  10. 02 11月, 2019 1 次提交
    • E
      powerpc/bpf: Fix tail call implementation · 7de08690
      Eric Dumazet 提交于
      We have seen many crashes on powerpc hosts while loading bpf programs.
      
      The problem here is that bpf_int_jit_compile() does a first pass
      to compute the program length.
      
      Then it allocates memory to store the generated program and
      calls bpf_jit_build_body() a second time (and a third time
      later)
      
      What I have observed is that the second bpf_jit_build_body()
      could end up using few more words than expected.
      
      If bpf_jit_binary_alloc() put the space for the program
      at the end of the allocated page, we then write on
      a non mapped memory.
      
      It appears that bpf_jit_emit_tail_call() calls
      bpf_jit_emit_common_epilogue() while ctx->seen might not
      be stable.
      
      Only after the second pass we can be sure ctx->seen wont be changed.
      
      Trying to avoid a second pass seems quite complex and probably
      not worth it.
      
      Fixes: ce076141 ("powerpc/bpf: Implement support for tail calls")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20191101033444.143741-1-edumazet@google.com
      7de08690
  11. 01 11月, 2019 6 次提交
  12. 31 10月, 2019 2 次提交