1. 28 Sep, 2020 2 commits
    • KVM: x86: Introduce MSR filtering · 1a155254
      Alexander Graf committed
      It's not desirable to have all MSRs always handled by KVM kernel space. Some
      MSRs would be useful to handle in user space to either emulate behavior (like
      uCode updates) or differentiate whether they are valid based on the CPU model.
      
      To allow user space to specify which MSRs it wants to see handled by KVM,
      this patch introduces a new ioctl to push filter rules with bitmaps into
      KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
      With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
      denied MSR events to user space to operate on.
      
      If no filter is populated, MSR handling stays identical to before.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Message-Id: <20200925143422.21718-8-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
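The bitmap-based decision described in the MSR filtering commit can be sketched as below. This is a minimal model, not the kernel's code: `struct msr_range`, `msr_allowed` and the `default_allow` parameter are hypothetical names, and the convention that a set bit means "allow" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter range covering MSRs [base, base + nmsrs). */
struct msr_range {
    uint32_t base;          /* first MSR index covered by this range */
    uint32_t nmsrs;         /* number of MSRs covered */
    const uint8_t *bitmap;  /* one bit per MSR; 1 = allow, 0 = deny (assumed) */
};

/*
 * Return true if access to `msr` is allowed.
 * With no ranges installed everything is allowed, matching "if no filter
 * is populated, MSR handling stays identical to before". MSRs outside
 * every range fall back to `default_allow`.
 */
bool msr_allowed(const struct msr_range *ranges, size_t nranges,
                 bool default_allow, uint32_t msr)
{
    assert(ranges != NULL || nranges == 0);

    if (nranges == 0)
        return true;

    for (size_t i = 0; i < nranges; i++) {
        const struct msr_range *r = &ranges[i];

        if (msr < r->base || msr - r->base >= r->nmsrs)
            continue;

        uint32_t bit = msr - r->base;
        return (r->bitmap[bit / 8] >> (bit % 8)) & 1;
    }
    return default_allow;
}
```

A denied access would then either be rejected or, with KVM_CAP_X86_USER_SPACE_MSR, deflected to user space; that second step is not modelled here.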
    • KVM: x86: Allow deflecting unknown MSR accesses to user space · 1ae09954
      Alexander Graf committed
      MSRs are weird. Some of them are normal control registers, such as EFER.
      Some however are registers that really are model specific, not very
      interesting to virtualization workloads, and not performance critical.
      Others again are really just windows into package configuration.
      
      Out of these MSRs, only the first category is necessary to implement in
      kernel space. Rarely accessed MSRs, MSRs that should be fine-tuned against
      certain CPU models and MSRs that contain information on the package level
      are much better suited for user space to process. However, over time we have
      accumulated a lot of MSRs that are not in the first category, but are still
      handled by in-kernel KVM code.
      
      This patch adds a generic interface to handle WRMSR and RDMSR from user
      space. With this, any future MSR that is part of the latter categories can
      be handled in user space.
      
      Furthermore, it allows us to replace the existing "ignore_msrs" logic with
      something that applies per-VM rather than on the full system. That way you
      can run productive VMs in parallel to experimental ones where you don't care
      about proper MSR handling.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20200925143422.21718-3-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 12 Sep, 2020 1 commit
    • KVM: MIPS: Change the definition of kvm type · 15e9e35c
      Huacai Chen committed
      MIPS defines two kvm types:
      
       #define KVM_VM_MIPS_TE          0
       #define KVM_VM_MIPS_VZ          1
      
      In Documentation/virt/kvm/api.rst it is said that "You probably want to
      use 0 as machine type", which implies that type 0 is the "automatic" or
      "default" type. And, in user space, libvirt uses the null-machine (with
      type 0) to detect the KVM capability, which returns "KVM not supported"
      on a VZ platform.
      
      I tried to fix it in QEMU, but the result is ugly:
      https://lists.nongnu.org/archive/html/qemu-devel/2020-08/msg05629.html
      
      Thomas Huth suggested that I change the definition of the kvm type instead:
      https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg03281.html
      
      So I define the types like this:
      
       #define KVM_VM_MIPS_AUTO        0
       #define KVM_VM_MIPS_VZ          1
       #define KVM_VM_MIPS_TE          2
      
      Since VZ and TE cannot co-exist, using type 0 on a TE platform will
      still return success (so old user-space tools have no problems on new
      kernels); the advantage is that using type 0 on a VZ platform will not
      return failure. So, the only problem is "new user-space tools use type
      2 on old kernels", but if we treat this as a kernel bug, we can backport
      this patch to old stable kernels.
      Signed-off-by: Huacai Chen <chenhc@lemote.com>
      Message-Id: <1599734031-28746-1-git-send-email-chenhc@lemote.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
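The compatibility argument in the MIPS commit can be checked with a small model. The three constants come from the commit message; `mips_resolve_vm_type` and its `host_has_vz` parameter are hypothetical and only illustrate the described behaviour, not the kernel's actual code path.

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_VM_MIPS_AUTO 0
#define KVM_VM_MIPS_VZ   1
#define KVM_VM_MIPS_TE   2

/*
 * Hypothetical helper: decide whether a requested VM type can be created.
 * Since VZ and TE cannot co-exist, a host without VZ is treated as a TE
 * host. Returns the effective type, or -1 if the request cannot be met.
 */
int mips_resolve_vm_type(int requested, bool host_has_vz)
{
    int host_type = host_has_vz ? KVM_VM_MIPS_VZ : KVM_VM_MIPS_TE;

    assert(requested >= 0);

    if (requested == KVM_VM_MIPS_AUTO)
        return host_type;   /* type 0 succeeds on either platform */
    if (requested == host_type)
        return host_type;
    return -1;              /* explicit type the host does not provide */
}
```

Under this model an old tool passing 0 keeps working on a TE host, and the same 0 now also succeeds on a VZ host, which is the behaviour libvirt's probe relies on.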
  3. 21 Aug, 2020 1 commit
  4. 11 Jul, 2020 1 commit
    • KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support · 3edd6839
      Mohammed Gamal committed
      This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which
      allows userspace to query if the underlying architecture would
      support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
      (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X)
      
      The complications in this patch are due to unexpected (but documented)
      behaviour we see with NPF vmexit handling on AMD processors.  If
      SVM is modified to add guest physical address checks in the NPF
      and guest #PF paths, we see the following error multiple times in
      the 'access' test in kvm-unit-tests:
      
                  test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
                  Dump mapping: address: 0x123400000000
                  ------L4: 24c3027
                  ------L3: 24c4027
                  ------L2: 24c5021
                  ------L1: 1002000021
      
      This is because the PTE's accessed bit is set by the CPU hardware before
      the NPF vmexit. This is handled completely by hardware and cannot be fixed
      in software.
      
      Therefore, availability of the new capability depends on a boolean variable
      allow_smaller_maxphyaddr which is set individually by VMX and SVM init
      routines. On VMX it's always set to true, on SVM it's only set to true
      when NPT is not enabled.
      
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Message-Id: <20200710154811.418214-10-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
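The availability rule for KVM_CAP_SMALLER_MAXPHYADDR stated above is simple enough to write down. In this sketch `enum cpu_vendor`, both function names and the warning policy are hypothetical; only the VMX/SVM rule itself is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical vendor tag; the kernel has separate VMX and SVM init routines. */
enum cpu_vendor { VENDOR_VMX, VENDOR_SVM };

/* Rule from the message: always true on VMX, true on SVM only without NPT. */
bool allow_smaller_maxphyaddr(enum cpu_vendor vendor, bool npt_enabled)
{
    if (vendor == VENDOR_VMX)
        return true;
    return !npt_enabled;
}

/*
 * Hypothetical userspace policy: warn when the guest is given fewer
 * physical address bits than the host has and the capability is absent.
 */
bool should_warn_phys_bits(unsigned guest_bits, unsigned host_bits,
                           bool cap_present)
{
    assert(guest_bits > 0 && host_bits > 0);
    return guest_bits < host_bits && !cap_present;
}
```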
  5. 09 Jul, 2020 1 commit
  6. 23 Jun, 2020 1 commit
  7. 01 Jun, 2020 3 commits
  8. 25 Apr, 2020 1 commit
  9. 21 Apr, 2020 1 commit
  10. 26 Mar, 2020 1 commit
    • KVM: PPC: Book3S HV: Add a capability for enabling secure guests · 9a5788c6
      Paul Mackerras committed
      At present, on Power systems with Protected Execution Facility
      hardware and an ultravisor, a KVM guest can transition to being a
      secure guest at will.  Userspace (QEMU) has no way of knowing
      whether a host system is capable of running secure guests.  This
      will present a problem in future when the ultravisor is capable of
      migrating secure guests from one host to another, because
      virtualization management software will have no way to ensure that
      secure guests only run in domains where all of the hosts can
      support secure guests.
      
      This adds a VM capability which has two functions: (a) userspace
      can query it to find out whether the host can support secure guests,
      and (b) userspace can enable it for a guest, which allows that
      guest to become a secure guest.  If userspace does not enable it,
      KVM will return an error when the ultravisor does the hypercall
      that indicates that the guest is starting to transition to a
      secure guest.  The ultravisor will then abort the transition and
      the guest will terminate.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Ram Pai <linuxram@us.ibm.com>
  11. 17 Mar, 2020 1 commit
    • KVM: x86: enable dirty log gradually in small chunks · 3c9bd400
      Jay Zhou committed
      Enabling dirty logging for the first time can hold kvm->mmu_lock for an
      extended period of time. The main cost is clearing all the D-bits of
      last level SPTEs. This situation can benefit from manual dirty log
      protect as well, which can reduce the mmu_lock time taken. The
      sequence is like this:
      
      1. Initialize all the bits of the dirty bitmap to 1 when enabling
         dirty log for the first time
      2. Only write protect the huge pages
      3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
      4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level
         SPTEs gradually in small chunks
      
      On an Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz, I ran some tests with
      a 128G Windows VM and measured the time taken by
      memory_global_dirty_log_start. Here are the numbers:
      
      VM Size        Before    After optimization
      128G           460ms     10ms
      Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
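Steps 1 and 4 of the sequence above can be modelled on a plain bitmap. This is a sketch under assumptions: `dirty_log_enable`, `dirty_log_clear_chunk` and the 64-page `CHUNK_PAGES` constant are hypothetical, and the SPTE work a real implementation would do under mmu_lock is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_PAGES 64u  /* one uint64_t of bitmap per chunk (assumed size) */

/* Step 1: mark every page dirty when logging is first enabled. */
void dirty_log_enable(uint64_t *bitmap, size_t npages)
{
    assert(npages % CHUNK_PAGES == 0);
    memset(bitmap, 0xff, npages / 8);
}

/*
 * Step 4: clear one chunk and report which of its pages were dirty.
 * A real implementation would also clear the D-bit of the matching leaf
 * SPTEs here, holding mmu_lock only for this one chunk.
 */
uint64_t dirty_log_clear_chunk(uint64_t *bitmap, size_t chunk)
{
    uint64_t was_dirty = bitmap[chunk];

    bitmap[chunk] = 0;
    return was_dirty;
}
```

The point of the design is that the expensive per-page work moves from the one-off enable call into many short clear calls, which is consistent with the 460ms to 10ms change reported in the table.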
  12. 28 Feb, 2020 4 commits
  13. 31 Jan, 2020 1 commit
  14. 28 Nov, 2019 1 commit
  15. 22 Oct, 2019 3 commits
    • KVM: arm64: Provide VCPU attributes for stolen time · 58772e9a
      Steven Price committed
      Allow user space to inform the KVM host where in the physical memory
      map the paravirtualized time structures should be located.
      
      User space can set an attribute on the VCPU providing the IPA base
      address of the stolen time structure for that VCPU. This must be
      repeated for every VCPU in the VM.
      
      The address is given in terms of the physical address visible to
      the guest and must be 64 byte aligned. The guest will discover the
      address via a hypercall.
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
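The 64-byte alignment requirement on the stolen time address can be checked before the attribute is set on each VCPU. The helpers below are hypothetical, and the layout of one 64-byte slot per VCPU is purely an assumption of this sketch; the commit only requires that each address be 64-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STOLEN_TIME_ALIGN 64u

/* The IPA handed to KVM must be 64-byte aligned. */
bool stolen_time_ipa_valid(uint64_t ipa)
{
    return (ipa % STOLEN_TIME_ALIGN) == 0;
}

/* Hypothetical layout: consecutive 64-byte slots, one per VCPU. */
uint64_t stolen_time_ipa_for_vcpu(uint64_t base, unsigned vcpu_index)
{
    assert(stolen_time_ipa_valid(base));
    return base + (uint64_t)vcpu_index * STOLEN_TIME_ALIGN;
}
```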
    • KVM: arm/arm64: Allow user injection of external data aborts · da345174
      Christoffer Dall committed
      In some scenarios, such as a buggy guest or incorrect configuration of the
      VMM and firmware description data, userspace will detect a memory access
      to a portion of the IPA space that is not mapped to any MMIO region.
      
      In this case, the appropriate action is to inject an external abort
      to the guest.  The kernel already has functionality to inject an
      external abort, but we need to wire up a signal from user space that
      lets user space tell the kernel to do this.
      
      It turns out, we already have the set event functionality which we can
      perfectly reuse for this.
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm/arm64: Allow reporting non-ISV data aborts to userspace · c726200d
      Christoffer Dall committed
      For a long time, if a guest accessed memory outside of a memslot using
      any of the load/store instructions in the architecture which doesn't
      supply decoding information in the ESR_EL2 (the ISV bit is not set), the
      kernel would print the following message and terminate the VM as a
      result of returning -ENOSYS to userspace:
      
        load/store instruction decoding not implemented
      
      The reason behind this message is that KVM assumes that all accesses
      outside a memslot are MMIO accesses which should be handled by
      userspace, and we originally expected to eventually implement some sort
      of decoding of load/store instructions where the ISV bit was not set.
      
      However, it turns out that many of the instructions which don't provide
      decoding information on abort are not safe to use for MMIO accesses, and
      the remaining few that would potentially make sense to use on MMIO
      accesses, such as those with register writeback, are not used in
      practice.  It also turns out that fetching an instruction from guest
      memory can be a pretty horrible affair, involving stopping all CPUs on
      SMP systems, handling multiple corner cases of address translation in
      software, and more.  It doesn't appear likely that we'll ever implement
      this in the kernel.
      
      What is much more common is that a user has misconfigured his/her guest
      and is actually not accessing an MMIO region, but just hitting some
      random hole in the IPA space.  In this scenario, the error message above
      is almost misleading and has led to a great deal of confusion over the
      years.
      
      It is, nevertheless, ABI to userspace, and we therefore need to
      introduce a new capability that userspace explicitly enables to change
      behavior.
      
      This patch introduces KVM_CAP_ARM_NISV_TO_USER (NISV meaning Non-ISV)
      which does exactly that, and introduces a new exit reason to report the
      event to userspace.  User space can then emulate an exception to the
      guest, restart the guest, suspend the guest, or take any other
      appropriate action as per the policy of the running system.
      Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Reviewed-by: Alexander Graf <graf@amazon.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
  16. 21 Oct, 2019 1 commit
    • KVM: PPC: Report single stepping capability · 1a9167a2
      Fabiano Rosas committed
      When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
      the next instruction to be single stepped via the
      KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.
      
      This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
      to inform userspace about the state of single stepping support.
      
      We currently don't have support for guest single stepping implemented
      in Book3S HV so the capability is only present for Book3S PR and
      BookE.
      Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  17. 24 Sep, 2019 1 commit
  18. 20 Sep, 2019 1 commit
  19. 11 Sep, 2019 1 commit
  20. 09 Sep, 2019 1 commit
    • KVM: arm/arm64: vgic: Allow more than 256 vcpus for KVM_IRQ_LINE · 92f35b75
      Marc Zyngier committed
      While parts of the VGIC support a large number of vcpus (we
      bravely allow up to 512), other parts are more limited.
      
      One of these limits is visible in the KVM_IRQ_LINE ioctl, which
      only allows 256 vcpus to be signalled when using the CPU or PPI
      types. Unfortunately, we've cornered ourselves badly by allocating
      all the bits in the irq field.
      
      Since the irq_type subfield (8 bit wide) is currently only taking
      the values 0, 1 and 2 (and we have been careful not to allow anything
      else), let's reduce this field to only 4 bits, and allocate the
      remaining 4 bits to a vcpu2_index, which acts as a multiplier:
      
        vcpu_id = 256 * vcpu2_index + vcpu_index
      
      With that, and a new capability (KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)
      allowing this to be discovered, it becomes possible to inject
      PPIs to up to 4096 vcpus. But please just don't.
      
      Whilst we're there, add a clarification about the use of KVM_IRQ_LINE
      on arm, which is not completely conditioned by KVM_CAP_IRQCHIP.
      Reported-by: Zenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
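The split of the irq field described above can be sketched as a pack/unpack pair. The field widths (4-bit irq_type, 4-bit vcpu2_index) and the formula come from the commit message; the exact bit positions used here (vcpu2_index in bits 31..28, irq_type in 27..24, vcpu_index in 23..16, interrupt number in 15..0) follow the KVM API documentation as I understand it and should be checked against your headers. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define IRQ_VCPU2_SHIFT 28
#define IRQ_VCPU2_MASK  0xfu
#define IRQ_TYPE_SHIFT  24
#define IRQ_TYPE_MASK   0xfu
#define IRQ_VCPU_SHIFT  16
#define IRQ_VCPU_MASK   0xffu
#define IRQ_NUM_MASK    0xffffu

/* Build the irq field for a given type, vcpu id (0..4095) and number. */
uint32_t irq_line_pack(uint32_t irq_type, uint32_t vcpu_id, uint32_t irq_num)
{
    assert(irq_type <= IRQ_TYPE_MASK);
    assert(vcpu_id < 4096);
    assert(irq_num <= IRQ_NUM_MASK);

    uint32_t vcpu2_index = vcpu_id / 256;  /* the "multiplier" part */
    uint32_t vcpu_index = vcpu_id % 256;   /* the original 8-bit part */

    return (vcpu2_index << IRQ_VCPU2_SHIFT) |
           (irq_type << IRQ_TYPE_SHIFT) |
           (vcpu_index << IRQ_VCPU_SHIFT) |
           irq_num;
}

/* Recover vcpu_id = 256 * vcpu2_index + vcpu_index from the irq field. */
uint32_t irq_line_vcpu_id(uint32_t irq)
{
    uint32_t vcpu2_index = (irq >> IRQ_VCPU2_SHIFT) & IRQ_VCPU2_MASK;
    uint32_t vcpu_index = (irq >> IRQ_VCPU_SHIFT) & IRQ_VCPU_MASK;

    return 256 * vcpu2_index + vcpu_index;
}
```

For vcpu ids below 256 the vcpu2_index bits are zero, so the packed value is identical to what the old 8-bit irq_type layout produced for types 0, 1 and 2.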
  21. 24 Jul, 2019 1 commit
  22. 11 Jul, 2019 1 commit
    • KVM: x86: PMU Event Filter · 66bb8a06
      Eric Hankland committed
      Some events can provide a guest with information about other guests or the
      host (e.g. L3 cache stats); providing the capability to restrict access
      to a "safe" set of events would limit the potential for the PMU to be used
      in any side channel attacks. This change introduces a new VM ioctl that
      sets an event filter. If the guest attempts to program a counter for
      any blacklisted or non-whitelisted event, the kernel counter won't be
      created, so any RDPMC/RDMSR will show 0 instances of that event.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [Lots of changes. All remaining bugs are probably mine. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
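The "blacklisted or non-whitelisted" rule amounts to a list lookup combined with an action. The sketch below is a model only: `struct pmu_filter`, the enum values and `pmu_event_permitted` are hypothetical names, and a linear search stands in for whatever lookup the kernel uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum pmu_filter_action { PMU_FILTER_ALLOW, PMU_FILTER_DENY };

/* Hypothetical filter: a list of event codes plus what the list means. */
struct pmu_filter {
    enum pmu_filter_action action;
    size_t nevents;
    const uint64_t *events;
};

/* Return true if the guest may program a counter for `event`. */
bool pmu_event_permitted(const struct pmu_filter *f, uint64_t event)
{
    bool listed = false;

    if (f == NULL)
        return true;    /* no filter installed: nothing is restricted */

    assert(f->events != NULL || f->nevents == 0);
    for (size_t i = 0; i < f->nevents; i++) {
        if (f->events[i] == event) {
            listed = true;
            break;
        }
    }
    /* Allow list: must be listed. Deny list: must not be listed. */
    return f->action == PMU_FILTER_ALLOW ? listed : !listed;
}
```

When this returns false, the behaviour described in the commit is that no kernel counter is created, so the guest simply reads zero for that event.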
  23. 05 Jun, 2019 1 commit
  24. 08 May, 2019 1 commit
  25. 30 Apr, 2019 2 commits
  26. 24 Apr, 2019 1 commit
    • KVM: arm64: Add capability to advertise ptrauth for guest · a243c16d
      Amit Daniel Kachhap committed
      This patch advertises capabilities for two CPU features: address
      pointer authentication and generic pointer authentication. These
      capabilities depend upon system support for pointer authentication and
      VHE mode.
      
      The current arm64 KVM partially implements pointer authentication, and
      support for address and generic authentication is tied together. However,
      separate ABI requirements are added for each of them so that any future
      isolated implementation will not require any ABI changes.
      Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: kvmarm@lists.cs.columbia.edu
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
  27. 29 Mar, 2019 3 commits
    • KVM: arm64: Add a capability to advertise SVE support · 555f3d03
      Dave Martin committed
      To provide a uniform way to check for KVM SVE support amongst other
      features, this patch adds a suitable capability KVM_CAP_ARM_SVE,
      and reports it as present when SVE is available.
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Julien Thierry <julien.thierry@arm.com>
      Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
    • KVM: arm/arm64: Add KVM_ARM_VCPU_FINALIZE ioctl · 7dd32a0d
      Dave Martin committed
      Some aspects of vcpu configuration may be too complex to be
      completed inside KVM_ARM_VCPU_INIT.  Thus, there may be a
      requirement for userspace to do some additional configuration
      before various other ioctls will work in a consistent way.
      
      In particular this will be the case for SVE, where userspace will
      need to negotiate the set of vector lengths to be made available to
      the guest before the vcpu becomes fully usable.
      
      In order to provide an explicit way for userspace to confirm that
      it has finished setting up a particular vcpu feature, this patch
      adds a new ioctl KVM_ARM_VCPU_FINALIZE.
      
      When userspace has opted into a feature that requires finalization,
      typically by means of a feature flag passed to KVM_ARM_VCPU_INIT, a
      matching call to KVM_ARM_VCPU_FINALIZE is now required before
      KVM_RUN or KVM_GET_REG_LIST is allowed.  Individual features may
      impose additional restrictions where appropriate.
      
      No existing vcpu features are affected by this, so current
      userspace implementations will continue to work exactly as before,
      with no need to issue KVM_ARM_VCPU_FINALIZE.
      
      As implemented in this patch, KVM_ARM_VCPU_FINALIZE is currently a
      placeholder: no finalizable features exist yet, so the ioctl is not
      required and will always yield EINVAL.  Subsequent patches will add
      the finalization logic to make use of this ioctl for SVE.
      
      No functional change for existing userspace.
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Julien Thierry <julien.thierry@arm.com>
      Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
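The finalization contract can be pictured as a small state machine: features that need finalization block KVM_RUN until they have been finalized, and finalizing a feature that does not need it fails with EINVAL. The struct and function names below are hypothetical, and treating a repeated finalize as harmless is an assumption of this sketch rather than something the commit states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vcpu bookkeeping, one bit per feature. */
struct vcpu_state {
    uint32_t needs_finalize;  /* features opted into at init time */
    uint32_t finalized;       /* features already finalized */
};

/* Model of the finalize call: returns 0 on success, -EINVAL otherwise. */
int vcpu_finalize(struct vcpu_state *v, uint32_t feature)
{
    assert(v != NULL);

    /* Unknown or non-finalizable feature: the placeholder behaviour. */
    if (feature == 0 || (feature & ~v->needs_finalize) != 0)
        return -EINVAL;

    v->finalized |= feature;  /* repeat calls are a no-op here (assumed) */
    return 0;
}

/* KVM_RUN and KVM_GET_REG_LIST are only allowed once nothing is pending. */
bool vcpu_can_run(const struct vcpu_state *v)
{
    return (v->needs_finalize & ~v->finalized) == 0;
}
```

With `needs_finalize` left at zero, `vcpu_can_run` is always true and `vcpu_finalize` always returns -EINVAL, which matches the "no functional change for existing userspace" claim.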
    • KVM: Allow 2048-bit register access via ioctl interface · 2b953ea3
      Dave Martin committed
      The Arm SVE architecture defines registers that are up to 2048 bits
      in size (with some possibility of further future expansion).
      
      In order to avoid the need for an excessively large number of
      ioctls when saving and restoring a vcpu's registers, this patch
      adds a #define to make support for individual 2048-bit registers
      through the KVM_{GET,SET}_ONE_REG ioctl interface official.  This
      will allow each SVE register to be accessed in a single call.
      
      There are sufficient spare bits in the register id size field for
      this change, so there is no ABI impact, providing that
      KVM_GET_REG_LIST does not enumerate any 2048-bit register unless
      userspace explicitly opts in to the relevant architecture-specific
      features.
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Tested-by: zhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
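The "spare bits in the register id size field" can be illustrated by decoding the size from an id. The constants below mirror the values in include/uapi/linux/kvm.h as I understand them (a 4-bit field at bit 52 holding log2 of the size in bytes); treat them as assumptions and verify against your headers. `reg_size_bytes` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KVM_REG_SIZE_SHIFT 52
#define KVM_REG_SIZE_MASK  0x00f0000000000000ULL
#define KVM_REG_SIZE_U64   0x0030000000000000ULL
#define KVM_REG_SIZE_U1024 0x0070000000000000ULL
#define KVM_REG_SIZE_U2048 0x0080000000000000ULL  /* the new encoding */

/* Decode the register size in bytes: 1 << (value of the size field). */
size_t reg_size_bytes(uint64_t reg_id)
{
    unsigned log2_bytes =
        (unsigned)((reg_id & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT);

    assert(log2_bytes <= 8);  /* encodings above U2048 are not defined here */
    return (size_t)1 << log2_bytes;
}
```

The field value 8 was previously unused, so defining it gives 256-byte (2048-bit) registers without changing how any existing id decodes.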
  28. 15 Dec, 2018 1 commit
    • x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID · 2bc39970
      Vitaly Kuznetsov committed
      With every new Hyper-V Enlightenment we implement we're forced to add a
      KVM_CAP_HYPERV_* capability. While this approach works it is fairly
      inconvenient: the majority of the enlightenments we do have corresponding
      CPUID feature bit(s) and userspace has to know this anyways to be able to
      expose the feature to the guest.
      
      Add KVM_GET_SUPPORTED_HV_CPUID ioctl (backed by KVM_CAP_HYPERV_CPUID, "one
      cap to rule them all!") returning all Hyper-V CPUID feature leaves.
      
      Using the existing KVM_GET_SUPPORTED_CPUID doesn't seem to be possible:
      Hyper-V CPUID feature leaves intersect with KVM's (e.g. 0x40000000,
      0x40000001) and we would probably confuse userspace in case we decide to
      return these twice.
      
      KVM_CAP_HYPERV_CPUID's number is interim: we intend to drop
      KVM_CAP_HYPERV_STIMER_DIRECT and use its number instead.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  29. 14 Dec, 2018 1 commit
    • kvm: introduce manual dirty log reprotect · 2a31b9db
      Paolo Bonzini committed
      There are two problems with KVM_GET_DIRTY_LOG.  First, and less important,
      it can take kvm->mmu_lock for an extended period of time.  Second, its user
      can actually see many false positives in some cases.  The latter is due
      to a benign race like this:
      
        1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
           them.
        2. The guest modifies the pages, causing them to be marked dirty.
        3. Userspace actually copies the pages.
        4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
           they were not written to since (3).
      
      This is especially a problem for large guests, where the time between
      (1) and (3) can be substantial.  This patch introduces a new
      capability which, when enabled, makes KVM_GET_DIRTY_LOG not
      write-protect the pages it returns.  Instead, userspace has to
      explicitly clear the dirty log bits just before using the content
      of the page.  The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
      64-page granularity rather than requiring to sync a full memslot;
      this way, the mmu_lock is taken for small amounts of time, and
      only a small amount of time will pass between write protection
      of pages and the sending of their content.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
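The 64-page granularity of KVM_CLEAR_DIRTY_LOG implies that userspace has to shape its clear requests accordingly. The check below is a sketch: `clear_range_valid` is a hypothetical helper, and the exception allowing an unaligned page count when the range ends exactly at the end of the memslot is my reading of the KVM API documentation, not something stated in this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CLEAR_GRANULE 64u  /* pages per bitmap word */

/*
 * Return true if [first_page, first_page + num_pages) is an acceptable
 * range to clear within a memslot of `slot_pages` pages.
 */
bool clear_range_valid(uint64_t first_page, uint64_t num_pages,
                       uint64_t slot_pages)
{
    /* The start must sit on a 64-page boundary. */
    if (first_page % CLEAR_GRANULE != 0)
        return false;

    /* The range must lie inside the memslot (written to avoid overflow). */
    if (first_page > slot_pages || num_pages > slot_pages - first_page)
        return false;

    /* The length must be a multiple of 64 unless it reaches the slot end. */
    if (num_pages % CLEAR_GRANULE != 0 &&
        first_page + num_pages != slot_pages)
        return false;

    return true;
}
```

Clearing in such small aligned ranges just before copying the corresponding pages is what keeps both the mmu_lock hold time and the window for false positives short.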