1. 05 April 2019, 2 commits
    • KVM: PPC: Book3S: Protect memslots while validating user address · 345077c8
      Committed by Alexey Kardashevskiy
      Guest physical to user address translation uses KVM memslots, and reading
      these requires holding the kvm->srcu lock. However, the recently introduced
      kvmppc_tce_validate() broke the rule (see the lockdep warning below).
      
      This moves srcu_read_lock(&vcpu->kvm->srcu) earlier so that it protects
      kvmppc_tce_validate() as well (a sketch follows at the end of this entry).
      
      =============================
      WARNING: suspicious RCU usage
      5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
      -----------------------------
      include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by qemu-system-ppc/8020:
       #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]
      
      stack backtrace:
      CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
      Call Trace:
      [c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
      [c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
      [c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
      [c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
      [c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
      [c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
      [c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
      [c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70
      
      Fixes: 42de7b9e ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
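      A minimal sketch of the shape of the fix in kvmppc_h_put_tce(), with the
      TCE translation and table update elided (variable names follow the
      surrounding kernel code, but this is an abbreviation, not the full patch):

        idx = srcu_read_lock(&vcpu->kvm->srcu);

        /* Validation walks memslots via kvmppc_tce_to_ua(), so it must
         * run under the SRCU read lock as well. */
        ret = kvmppc_tce_validate(stt, tce);
        if (ret != H_SUCCESS)
                goto unlock_exit;

        /* ... translate the TCE and update the table ... */

        unlock_exit:
        srcu_read_unlock(&vcpu->kvm->srcu, idx);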
    • KVM: PPC: Book3S HV: Preserve PSSCR FAKE_SUSPEND bit on guest exit · 7cb9eb10
      Committed by Suraj Jitindar Singh
      There is a hardware bug in some POWER9 processors where a treclaim in
      fake suspend mode can cause an inconsistency in the XER[SO] bit across
      the threads of a core, the workaround being to force the core into SMT4
      when doing the treclaim.
      
      The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
      thread is in fake suspend or real suspend. The important difference is
      that thread reconfiguration is blocked in real suspend mode but not in
      fake suspend mode.
      
      When we exit a guest which was in fake suspend mode, we force the core
      into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
      However on the new exit path introduced with the function
      kvmhv_run_single_vcpu() we restore the host PSSCR before calling
      kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
      put the thread into real suspend mode when we clear the
      PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
      and the thread which is trying to get the core into SMT4 before it can
      do the treclaim spins forever since it itself is blocking thread
      reconfiguration. The result is that the core is essentially lost.
      
      This results in a trace such as:
      [   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
      [   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
      [   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
      [   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
      [   93.512914] CFAR: c000000000098a5c IRQMASK: 3
      [   93.512915] PACATMSCRATCH: 0000000000000001
      [   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
      [   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
      [   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
      [   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
      [   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
      [   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
      [   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
      [   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
      [   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
      [   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
      [   93.512960] Call Trace:
      [   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
      [   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
      [   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
      [   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
      [   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
      [   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
      [   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
      [   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
      [   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
      [   93.512983] Instruction dump:
      [   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
      [   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214
      
      To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
      kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
      perform the treclaim. Note kvmppc_save_tm_hv() clears the
      PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
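      The shape of the fix on the kvmhv_run_single_vcpu() exit path, as a
      hedged sketch (vcpu_fake_suspend stands in for wherever the guest's
      fake-suspend state is tracked; the exact expression differs upstream):

        /* Restore the host PSSCR but carry the FAKE_SUSPEND bit over, so a
         * thread in fake suspend is not dropped into real suspend before
         * kvmppc_save_tm_hv() has forced the core into SMT4 and done the
         * treclaim.  kvmppc_save_tm_hv() itself clears the bit afterwards,
         * so nothing further is needed. */
        mtspr(SPRN_PSSCR, host_psscr |
              (vcpu_fake_suspend ? PSSCR_FAKE_SUSPEND : 0));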
  2. 29 March 2019, 11 commits
    • KVM: x86: update %rip after emulating IO · 45def77e
      Committed by Sean Christopherson
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison have negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
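      A sketch of the snapshot-and-compare approach for the fast-PIO "out"
      path (abbreviated; the "in" path is analogous):

        static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
        {
                vcpu->arch.pio.count = 0;
                /* If the linear RIP no longer matches, e.g. userspace reset
                 * the vCPU, don't skip an instruction the guest never ran. */
                if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip)))
                        return 1;
                return kvm_skip_emulated_instruction(vcpu);
        }

        /* ...and where the fast-PIO exit to userspace is set up: */
        vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);
        vcpu->arch.complete_userspace_io = complete_fast_pio_out;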
    • x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb
      Committed by Vitaly Kuznetsov
      When userspace initializes guest vCPUs it may want to zero all supported
      MSRs, including Hyper-V related ones such as HV_X64_MSR_STIMERn_CONFIG/
      HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
      timers on guest entry only") we began doing stimer_mark_pending()
      unconditionally on every config change.
      
      The issue I'm observing manifests itself as follows:
      - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
        pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
      - kvm_hv_has_stimer_pending() starts returning true;
      - kvm_vcpu_has_events() starts returning true;
      - kvm_arch_vcpu_runnable() starts returning true;
      - when kvm_arch_vcpu_ioctl_run() gets into
        (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
        - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
          immediately, avoiding normal wait path;
        - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
          userspace to retry.
      
      So instead of the normal wait path we get a busy loop on all secondary
      vCPUs before they get an INIT signal. This seems undesirable, especially
      given that it happens even when Hyper-V extensions are not used.
      
      Generally, it seems pointless to mark a stimer as pending in
      stimer_pending_bitmap and arm KVM_REQ_HV_STIMER, as the only thing
      kvm_hv_process_stimers() will do is clear the corresponding bit. We can
      simply not mark disabled timers as pending instead.
      
      Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
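      A sketch of the resulting logic in stimer_set_config() (abbreviated;
      stimer_set_count() gets the same treatment):

        stimer->config = config;
        /* Only arm KVM_REQ_HV_STIMER for a timer that can actually fire;
         * a disabled stimer leaves kvm_hv_process_stimers() nothing to do. */
        if (stimer->config & HV_STIMER_ENABLE)
                stimer_mark_pending(stimer, false);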
    • kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0
      Committed by Xiaoyao Li
      MSR_IA32_ARCH_CAPABILITIES is emulated unconditionally, even if the
      host doesn't support it. We should move it from the msrs_to_save array
      to the emulated_msrs array to report to userspace that the guest
      supports this MSR.
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
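      A sketch of the move in arch/x86/kvm/x86.c (both arrays already exist
      there; entries abbreviated):

        static u32 msrs_to_save[] = {
                /* ... MSR_IA32_ARCH_CAPABILITIES no longer listed here ... */
        };

        static u32 emulated_msrs[] = {
                /* ... */
                MSR_IA32_ARCH_CAPABILITIES,
        };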
    • KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Committed by Sean Christopherson
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support, under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
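      A sketch of the common read path, assuming a vcpu->arch.arch_capabilities
      field backs the emulation (the write side in kvm_set_msr_common() is
      symmetric):

        case MSR_IA32_ARCH_CAPABILITIES:
                if (!msr_info->host_initiated &&
                    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
                        return 1;
                msr_info->data = vcpu->arch.arch_capabilities;
                break;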
    • kvm: mmu: Use range based flushing in slot_handle_level_range · f285c633
      Committed by Ben Gardon
      Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
      in slot_handle_level_range. When range based flushes are not enabled
      kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
      
      This changes the behavior of many functions that indirectly use
      slot_handle_level_range, iff the range based flushes are enabled. The
      only potential problem I see with this is that kvm->tlbs_dirty will be
      cleared less often, however the only caller of slot_handle_level_range that
      checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which
      checks it and does a kvm_flush_remote_tlbs after calling
      kvm_unmap_hva_range anyway.
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine with and
      	without this patch. The patch introduced no new failures.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
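      A sketch of the substitution inside slot_handle_level_range(), where
      start_gfn/end_gfn stand for the bounds the walk was given (names are
      illustrative; the upstream expression may differ):

        if (flush && lock_flush_tlb) {
                /* Falls back to a full kvm_flush_remote_tlbs() internally
                 * when range-based flushing is not enabled. */
                kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
                                end_gfn - start_gfn + 1);
                flush = false;
        }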
    • KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iff KVM is supported · 3d9683cf
      Committed by Masahiro Yamada
      I do not see any consistency in the headers_install handling of
      <linux/kvm_para.h> and <asm/kvm_para.h>.
      
      According to my analysis of Linux 5.1-rc1, there are 3 groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          alpha, arm, hexagon, mips, powerpc, s390, sparc, x86
      
       [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not
      
          arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc,
          parisc, sh, unicore32, xtensa
      
       [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          csky, nds32, riscv
      
      This does not match the actual KVM support. At least, [2] is
      half-baked.
      
      Nor do arch maintainers appear to care about this. For example,
      commit 0add5371 ("microblaze: Add missing kvm_para.h to Kbuild")
      exported <asm/kvm_para.h> to user-space in order to fix an in-kernel
      build error.
      
      We have two ways to make this consistent:
      
       [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all
           architectures, irrespective of the KVM support
      
       [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h>
           to the KVM support
      
      My first attempt was [A] because the code looks cleaner, but Paolo
      suggested [B].
      
      So, this commit goes with [B].
      
      For most architectures, <asm/kvm_para.h> was moved to the kernel-space.
      I changed include/uapi/linux/Kbuild so that it checks the generated
      asm/kvm_para.h as well as the checked-in ones.
      
      After this commit, there will be two groups:
      
       [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported
      
          arm, arm64, mips, powerpc, s390, x86
      
       [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported
      
          alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze,
          nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Committed by Wei Yang
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
        non-zero.
      
      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never returns zero.
      
      Based on these two reasons, we can merge the two *if* clauses and use the
      return value from kvm_mmu_calculate_mmu_pages() directly. This simplifies
      the code and also eliminates the possibility of a reader believing that
      nr_mmu_pages could be zero.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
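      The merged form, as a sketch:

        /* In kvm_arch_commit_memory_region(): recalculate only when
         * userspace has not pinned the limit, and use the (always
         * non-zero) result directly. */
        if (!kvm->arch.n_requested_mmu_pages)
                kvm_mmu_change_mmu_pages(kvm, kvm_mmu_calculate_mmu_pages(kvm));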
    • kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields · 711eff3a
      Committed by Krish Sadhukhan
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check is performed on vmentry of L2 guests:
      
          On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
          field and the IA32_SYSENTER_EIP field must each contain a canonical
          address.
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
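      A sketch of the added check in the nested host-state validation
      (64-bit only, since canonical addresses are an Intel 64 concept):

        #ifdef CONFIG_X86_64
                if (is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu) ||
                    is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu))
                        return -EINVAL;
        #endif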
    • KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486
      Committed by Singh, Brijesh
      Erratum #1096:
      
      On a nested data page fault when CR.SMAP=1 and the guest data read
      generates a SMAP violation, the GuestInstrBytes field of the VMCB on a
      VMEXIT will incorrectly return 0h instead of the correct guest
      instruction bytes.
      
      Recommended Workaround:
      
      To determine what instruction the guest was executing, the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround cannot be implemented for SEV guests
      because guest memory is encrypted with a guest-specific key, so the
      instruction decoder is unable to decode the instruction bytes. If we
      hit this erratum in an SEV guest, log a message and request a guest
      shutdown.
      Reported-by: Venkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
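      A sketch of the workaround on the nested-page-fault path (control flow
      abbreviated; the surrounding function decides whether to emulate):

        if (unlikely(insn_len == 0)) {           /* erratum #1096 signature */
                if (sev_guest(vcpu->kvm)) {
                        /* Can't decode encrypted guest memory: give up. */
                        pr_err_ratelimited("KVM: SEV guest hit errata#1096\n");
                        kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                        return false;
                }
                /* Otherwise decode the instruction at the guest %rip. */
        }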
    • KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' · 47c42e6b
      Committed by Sean Christopherson
      The cr4_pae flag is a bit of a misnomer; its purpose is really to track
      whether the guest PTE that is being shadowed is a 4-byte entry or an
      8-byte entry.  Prior to supporting nested EPT, the size of the gpte was
      reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
      but it was mostly harmless since the size of the gpte never mattered.
      Now that a spte may be tracking an indirect EPT entry, relying on
      CR4.PAE is wrong and ill-named.
      
      For direct shadow pages, force the gpte_size to '1' as they are always
      8-byte entries; EPT entries can only be 8 bytes, and KVM always uses
      8-byte entries for NPT and its identity map (when running with EPT but
      not unrestricted guest).
      
      Likewise, nested EPT entries are always 8-bytes.  Nested EPT presents a
      unique scenario as the size of the entries are not dictated by CR4.PAE,
      but neither is the shadow page a direct map.  To handle this scenario,
      set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
      to denote a nested EPT shadow page.  Use the information to avoid
      incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
      
      Providing a consistent and accurate gpte_size fixes a bug reported by
      Vitaly where fast_cr3_switch() always fails when switching from L2 to
      L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
      whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
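      A sketch of the marker encoding (the helper name is illustrative, not
      from the patch):

        /* cr0_wp=1 together with smap_andnot_wp=1 ("SMAP and not WP") is
         * an impossible combination for a real guest configuration, so it
         * can unambiguously tag a nested-EPT shadow page. */
        static inline bool sp_is_nested_ept(union kvm_mmu_page_role role)
        {
                return role.cr0_wp && role.smap_andnot_wp;
        }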
    • KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT · 552c69b1
      Committed by Sean Christopherson
      Explicitly zero out quadrant and invalid instead of inheriting them from
      the root_mmu.  Functionally, this patch is a nop as we (should) never
      set quadrant for a direct mapped (EPT) root_mmu and nested EPT is only
      allowed if EPT is used for L1, and the root_mmu will never be invalid at
      this point.
      
      Explicitly setting flags sets the stage for repurposing the legacy
      paging bits in role, e.g. nxe, cr0_wp, and sm{a,e}p_andnot_wp, at which
      point 'smm' would be the only flag to be inherited from root_mmu.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
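      A sketch of the resulting role computation (field set abbreviated; the
      exact upstream code may assign more fields):

        union kvm_mmu_role role = {0};  /* quadrant and invalid stay zero */

        role.base.smm = vcpu->arch.root_mmu.mmu_role.base.smm;  /* only smm inherited */
        role.base.level = PT64_ROOT_4LEVEL;
        role.base.direct = false;
        role.base.ad_disabled = !accessed_dirty;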
  3. 23 March 2019, 2 commits
  4. 22 March 2019, 1 commit
    • x86/mm/pti: Make local symbols static · 4fe64a62
      Committed by Valdis Kletnieks
      With 'make C=2 W=1', sparse and gcc both complain:
      
        CHECK   arch/x86/mm/pti.c
      arch/x86/mm/pti.c:84:3: warning: symbol 'pti_mode' was not declared. Should it be static?
      arch/x86/mm/pti.c:605:6: warning: symbol 'pti_set_kernel_image_nonglobal' was not declared. Should it be static?
        CC      arch/x86/mm/pti.o
      arch/x86/mm/pti.c:605:6: warning: no previous prototype for 'pti_set_kernel_image_nonglobal' [-Wmissing-prototypes]
        605 | void pti_set_kernel_image_nonglobal(void)
            |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      pti_set_kernel_image_nonglobal() is only used locally. 'pti_mode' exists in
      drivers/hwtracing/intel_th/pti.c as well, but it's a completely unrelated
      local (static) symbol.
      
      Make both static.
      Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/27680.1552376873@turing-police
      
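      The change itself, sketched (enum values as they appear in
      arch/x86/mm/pti.c; function body unchanged):

        static enum pti_mode {
                PTI_AUTO = 0,
                PTI_FORCE_OFF,
                PTI_FORCE_ON
        } pti_mode;

        static void pti_set_kernel_image_nonglobal(void)
        {
                /* body unchanged */
        }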
  5. 21 March 2019, 10 commits
  6. 20 March 2019, 3 commits
    • powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations · 8bc08689
      Committed by Ben Hutchings
      MAX_PHYSMEM_BITS only needs to be defined if CONFIG_SPARSEMEM is
      enabled, and that was the case before commit 4ffe713b
      ("powerpc/mm: Increase the max addressable memory to 2PB").
      
      On 32-bit systems, where CONFIG_SPARSEMEM is not enabled, we now
      define it as 46.  That is larger than the real number of physical
      address bits, and breaks calculations in zsmalloc:
      
        mm/zsmalloc.c:130:49: warning: right shift count is negative
          MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
                                                         ^~
        ...
        mm/zsmalloc.c:253:21: error: variably modified 'size_class' at file scope
          struct size_class *size_class[ZS_SIZE_CLASSES];
                             ^~~~~~~~~~
      
      Fixes: 4ffe713b ("powerpc/mm: Increase the max addressable memory to 2PB")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
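      The guard, sketched (the commit text implies CONFIG_SPARSEMEM; the
      exact preprocessor condition upstream may be spelled differently):

        #ifdef CONFIG_SPARSEMEM
        #define MAX_PHYSMEM_BITS        46
        #endif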
    • KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memory · a6ecfb11
      Committed by Marc Zyngier
      When halting a guest, QEMU flushes the virtual ITS caches, which
      amounts to writing to the various tables that the guest has allocated.
      
      When doing this, we fail to take the srcu lock, and the kernel
      shouts loudly if running a lockdep kernel:
      
      [   69.680416] =============================
      [   69.680819] WARNING: suspicious RCU usage
      [   69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
      [   69.682096] -----------------------------
      [   69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [   69.683225]
      [   69.683225] other info that might help us debug this:
      [   69.683225]
      [   69.683975]
      [   69.683975] rcu_scheduler_active = 2, debug_locks = 1
      [   69.684598] 6 locks held by qemu-system-aar/4097:
      [   69.685059]  #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [   69.686087]  #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [   69.686919]  #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.687698]  #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.688475]  #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.689978]  #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.690729]
      [   69.690729] stack backtrace:
      [   69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
      [   69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [   69.692831] Call trace:
      [   69.694072]  lockdep_rcu_suspicious+0xcc/0x110
      [   69.694490]  gfn_to_memslot+0x174/0x190
      [   69.694853]  kvm_write_guest+0x50/0xb0
      [   69.695209]  vgic_its_save_tables_v0+0x248/0x330
      [   69.695639]  vgic_its_set_attr+0x298/0x3a0
      [   69.696024]  kvm_device_ioctl_attr+0x9c/0xd8
      [   69.696424]  kvm_device_ioctl+0x8c/0xf8
      [   69.696788]  do_vfs_ioctl+0xc8/0x960
      [   69.697128]  ksys_ioctl+0x8c/0xa0
      [   69.697445]  __arm64_sys_ioctl+0x28/0x38
      [   69.697817]  el0_svc_common+0xd8/0x138
      [   69.698173]  el0_svc_handler+0x38/0x78
      [   69.698528]  el0_svc+0x8/0xc
      
      The fix is to obviously take the srcu lock, just like we do on the
      read side of things since bf308242. One wonders why this wasn't
      fixed at the same time, but hey...
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
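      A sketch of the pattern, mirroring the read side fixed in bf308242
      (abi->save_tables() is what ends up in kvm_write_guest(); the exact
      wrapping point upstream may differ):

        idx = srcu_read_lock(&kvm->srcu);
        ret = abi->save_tables(its);
        srcu_read_unlock(&kvm->srcu, idx);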
    • KVM: arm64: Reset the PMU in preemptible context · ebff0b0e
      Committed by Marc Zyngier
      We've become very cautious to now always reset the vcpu when nothing
      is loaded on the physical CPU. To do so, we now disable preemption
      and do a kvm_arch_vcpu_put() to make sure we have all the state
      in memory (and that it won't be loaded behind our back).
      
      This now causes issues with resetting the PMU, which calls into perf.
      Perf itself uses mutexes, which clashes with the lack of preemption.
      It is worth realizing that the PMU is fully emulated, and that
      no PMU state is ever loaded on the physical CPU. This means we can
      perfectly reset the PMU outside of the non-preemptible section.
      
      Fixes: e761a927 ("KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded")
      Reported-by: Julien Grall <julien.grall@arm.com>
      Tested-by: Julien Grall <julien.grall@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
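      A sketch of the reordering in kvm_reset_vcpu() (abbreviated):

        /* The PMU is fully emulated and never loaded on the physical CPU,
         * so reset it before preemption is disabled; it may take mutexes
         * via perf. */
        kvm_pmu_vcpu_reset(vcpu);

        preempt_disable();
        loaded = (vcpu->cpu != -1);
        if (loaded)
                kvm_arch_vcpu_put(vcpu);
        /* ... the rest of the reset runs non-preemptible ... */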
  7. 19 March 2019, 9 commits
  8. 18 March 2019, 2 commits
    • powerpc/6xx: fix setup and use of SPRN_SPRG_PGDIR for hash32 · 4622a2d4
      Committed by Christophe Leroy
      Not only the 603 but all 6xx need SPRN_SPRG_PGDIR to be initialised at
      startup. This patch moves it from __setup_cpu_603() to start_here()
      and __secondary_start(), close to the initialisation of SPRN_THREAD.
      
      Previously, the virtual address of the PGDIR was retrieved from the
      thread struct. Now that it is the physical address that is stored in
      SPRN_SPRG_PGDIR, hash_page() must not convert it to a physical address
      anymore. This patch removes the conversion.
      
      Fixes: 93c4a162 ("powerpc/6xx: Store PGDIR physical address in a SPRG")
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038 · b5b4453e
      Committed by Michael Ellerman
      Jakub Drnec reported:
        Setting the realtime clock can sometimes make the monotonic clock go
        back by over a hundred years. Decreasing the realtime clock across
        the y2k38 threshold is one reliable way to reproduce. Allegedly this
        can also happen just by running ntpd, I have not managed to
        reproduce that other than booting with rtc at >2038 and then running
        ntp. When this happens, anything with timers (e.g. openjdk) breaks
        rather badly.
      
      And included a test case (slightly edited for brevity):
        #define _POSIX_C_SOURCE 199309L
        #include <stdio.h>
        #include <time.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        long get_time(void) {
          struct timespec tp;
          clock_gettime(CLOCK_MONOTONIC, &tp);
          return tp.tv_sec + tp.tv_nsec / 1000000000;
        }
      
        int main(void) {
          long last = get_time();
          while(1) {
            long now = get_time();
            if (now < last) {
              printf("clock went backwards by %ld seconds!\n", last - now);
            }
            last = now;
            sleep(1);
          }
          return 0;
        }
      
      Which when run concurrently with:
       # date -s 2040-1-1
       # date -s 2037-1-1
      
      Will detect the clock going backward.
      
      The root cause is that wtom_clock_sec in struct vdso_data is only a
      32-bit signed value, even though we set its value to be equal to
      tk->wall_to_monotonic.tv_sec which is 64-bits.
      
      Because the monotonic clock starts at zero when the system boots, the
      wall_to_monotonic.tv_sec offset is negative for current and future
      dates. Currently on a freshly booted system the offset will be in the
      vicinity of negative 1.5 billion seconds.
      
      However if the wall clock is set past the Y2038 boundary, the offset
      from wall to monotonic becomes less than negative 2^31, and no longer
      fits in 32-bits. When that value is assigned to wtom_clock_sec it is
      truncated and becomes positive, causing the VDSO assembly code to
      calculate CLOCK_MONOTONIC incorrectly.
      
      That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
      it is not meant to do. Worse, if the time is then set back before the
      Y2038 boundary CLOCK_MONOTONIC will jump backward.
      
      We can fix it simply by storing the full 64-bit offset in the
      vdso_data, and using that in the VDSO assembly code. We also shuffle
      some of the fields in vdso_data to avoid creating a hole.
      
      The original commit that added the CLOCK_MONOTONIC support to the VDSO
      did actually use a 64-bit value for wtom_clock_sec, see commit
      a7f290da ("[PATCH] powerpc: Merge vdso's and add vdso support to
      32 bits kernel") (Nov 2005). However just 3 days later it was
      converted to 32-bits in commit 0c37ec2a ("[PATCH] powerpc: vdso
      fixes (take #2)"), and the bug has existed since then AFAICS.
      
      Fixes: 0c37ec2a ("[PATCH] powerpc: vdso fixes (take #2)")
      Cc: stable@vger.kernel.org # v2.6.15+
      Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
      Reported-by: Jakub Drnec <jaydee@email.cz>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
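      A sketch of the struct vdso_data change (neighbouring fields shown only
      to illustrate the hole-avoiding shuffle; the real layout has more
      members):

        struct vdso_data {
                /* ... */
                __s64 wtom_clock_sec;   /* was __s32: truncated once the
                                           offset fell below -2^31 */
                __s32 wtom_clock_nsec;
                /* ... */
        };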