1. 19 4月, 2019 1 次提交
  2. 29 3月, 2019 2 次提交
    • D
      KVM: arm/arm64: Add KVM_ARM_VCPU_FINALIZE ioctl · 7dd32a0d
      Dave Martin 提交于
      Some aspects of vcpu configuration may be too complex to be
      completed inside KVM_ARM_VCPU_INIT.  Thus, there may be a
      requirement for userspace to do some additional configuration
      before various other ioctls will work in a consistent way.
      
      In particular this will be the case for SVE, where userspace will
      need to negotiate the set of vector lengths to be made available to
      the guest before the vcpu becomes fully usable.
      
      In order to provide an explicit way for userspace to confirm that
      it has finished setting up a particular vcpu feature, this patch
      adds a new ioctl KVM_ARM_VCPU_FINALIZE.
      
      When userspace has opted into a feature that requires finalization,
      typically by means of a feature flag passed to KVM_ARM_VCPU_INIT, a
      matching call to KVM_ARM_VCPU_FINALIZE is now required before
      KVM_RUN or KVM_GET_REG_LIST is allowed.  Individual features may
      impose additional restrictions where appropriate.
      
      No existing vcpu features are affected by this, so current
      userspace implementations will continue to work exactly as before,
      with no need to issue KVM_ARM_VCPU_FINALIZE.
      
      As implemented in this patch, KVM_ARM_VCPU_FINALIZE is currently a
      placeholder: no finalizable features exist yet, so ioctl is not
      required and will always yield EINVAL.  Subsequent patches will add
      the finalization logic to make use of this ioctl for SVE.
      
      No functional change for existing userspace.
      Signed-off-by: NDave Martin <Dave.Martin@arm.com>
      Reviewed-by: NJulien Thierry <julien.thierry@arm.com>
      Tested-by: Nzhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      7dd32a0d
    • D
      KVM: arm/arm64: Add hook for arch-specific KVM initialisation · 0f062bfe
      Dave Martin 提交于
      This patch adds a kvm_arm_init_arch_resources() hook to perform
      subarch-specific initialisation when starting up KVM.
      
      This will be used in a subsequent patch for global SVE-related
      setup on arm64.
      
      No functional change.
      Signed-off-by: NDave Martin <Dave.Martin@arm.com>
      Reviewed-by: NJulien Thierry <julien.thierry@arm.com>
      Tested-by: Nzhang.lei <zhang.lei@jp.fujitsu.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      0f062bfe
  3. 01 3月, 2019 1 次提交
  4. 23 2月, 2019 1 次提交
  5. 22 2月, 2019 1 次提交
  6. 21 2月, 2019 10 次提交
    • L
      Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()" · a67794ca
      Lan Tianyu 提交于
      The value of "dirty_bitmap[i]" is already check before setting its value
      to mask. The following check of "mask" is redundant. The check of "mask" was
      introduced by commit 58d2930f ("KVM: Eliminate extra function calls in
      kvm_get_dirty_log_protect()"), revert it.
      Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a67794ca
    • N
      KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behaviour in case
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns will be multiplied by grow factor
      (halt_poll_ns_grow) which will require several grow iteration in order
      to reach a value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from value of 0 is slower
      than growing it from a positive value less than halt_poll_ns_grow_start.
      Which is misleading and inaccurate.
      
      Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start in any case that
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      Regardless if vcpu->halt_poll_ns is 0.
      
      use READ_ONCE to get a consistent number for all cases.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dee339b5
    • N
      KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner 提交于
      The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
      start value when raising up vcpu->halt_poll_ns.
      It actually sets the first timeout to the first polling session.
      This value has significant effect on how tolerant we are to outliers.
      On the standard case, higher value is better - we will spend more time
      in the polling busyloop, handle events/interrupts faster and result
      in better performance.
      But on outliers it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we will still waste time on the first
      iteration.
      The optimal value changes between different workloads. It depends on
      outliers rate and polling sessions length.
      As this value has significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      49113d36
    • N
      KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behavior in case
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_pol_ns will be set to zero.
      That results in shrinking instead of growing.
      
      Fix issue by changing grow_halt_poll_ns() to not modify
      vcpu->halt_poll_ns in case halt_poll_ns_grow is zero
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Suggested-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7fa08e71
    • S
      KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5
      Sean Christopherson 提交于
      ...now that KVM won't explode by moving it out of bit 0.  Using bit 63
      eliminates the need to jump over bit 0, e.g. when calculating a new
      memslots generation or when propagating the memslots generation to an
      MMIO spte.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      164bf7e5
    • S
      KVM: Remove the hack to trigger memslot generation wraparound · 0e32958e
      Sean Christopherson 提交于
      x86 captures a subset of the memslot generation (19 bits) in its MMIO
      sptes so that it can expedite emulated MMIO handling by checking only
      the releveant spte, i.e. doesn't need to do a full page fault walk.
      
      Because the MMIO sptes capture only 19 bits (due to limited space in
      the sptes), there is a non-zero probability that the MMIO generation
      could wrap, e.g. after 500k memslot updates.  Since normal usage is
      extremely unlikely to result in 500k memslot updates, a hack was added
      by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio
      wrap-around value") to offset the MMIO generation in order to trigger
      a wraparound, e.g. after 150 memslot updates.
      
      When separate memslot generation sequences were assigned to each
      address space, commit 00f034a1 ("KVM: do not bias the generation
      number in kvm_current_mmio_generation") moved the offset logic into the
      initialization of the memslot generation itself so that the per-address
      space bit(s) were not dropped/corrupted by the MMIO shenanigans.
      
      Remove the offset hack for three reasons:
      
        - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
          wrapping the generation doesn't actually test the interesting case
          of having stale MMIO sptes with the new generation number, e.g. old
          sptes with a generation number of 0.
      
        - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
          performance rather important since the probability of invalidating
          MMIO sptes jumps from "effectively never" to "fairly likely".  This
          limits what can be done in future patches, e.g. to simplify the
          invalidation code, as doing so without proper caution could lead to
          a noticeable performance regression.
      
        - Forcing the memslots generation, which is a 64-bit number, to wrap
          prevents KVM from assuming the memslots generation will never wrap.
          This in turn prevents KVM from using an arbitrary bit for the
          "update in-progress" flag, e.g. using bit 63 would immediately
          collide with using a large value as the starting generation number.
          The "update in-progress" flag is effectively forced into bit 0 so
          that it's (subtly) taken into account when incrementing the
          generation.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0e32958e
    • S
      KVM: Explicitly define the "memslot update in-progress" bit · 361209e0
      Sean Christopherson 提交于
      KVM uses bit 0 of the memslots generation as an "update in-progress"
      flag, which is used by x86 to prevent caching MMIO access while the
      memslots are changing.  Although the intended behavior is flag-like,
      e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
      caching data from in-flux memslots, the implementation oftentimes treats
      the bit as part of the generation number itself, e.g. incrementing the
      generation increments twice, once to set the flag and once to clear it.
      
      Prior to commit 4bd518f1 ("KVM: use separate generations for
      each address space"), incorporating the "update in-progress" bit into
      the generation number largely made sense, e.g. "real" generations are
      even, "bogus" generations are odd, most code doesn't need to be aware of
      the bit, etc...
      
      Now that unique memslots generation numbers are assigned to each address
      space, stealthing the in-progress status into the generation number
      results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
      over bit 0 when initializing the memslots generation without any hint as
      to why.
      
      Explicitly define the flag and convert as much code as possible (which
      isn't much) to actually treat it like a flag.  This paves the way for
      eventually using a different bit for "update in-progress" so that it can
      be a flag in truth instead of a awkward extension to the generation
      number.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      361209e0
    • S
      KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson 提交于
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      15248258
    • B
      kvm: Add memcg accounting to KVM allocations · b12ce36a
      Ben Gardon 提交于
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In they include:
              vcpu->run
              kvm->coalesced_mmio_ring
      There allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b12ce36a
    • G
      kvm: Use struct_size() in kmalloc() · 90952cd3
      Gustavo A. R. Silva 提交于
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      90952cd3
  7. 20 2月, 2019 14 次提交
  8. 08 2月, 2019 2 次提交
    • J
      KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing · 0db5e022
      James Morse 提交于
      To split up APEIs in_nmi() path, the caller needs to always be
      in_nmi(). KVM shouldn't have to know about this, pull the RAS plumbing
      out into a header file.
      
      Currently guest synchronous external aborts are claimed as RAS
      notifications by handle_guest_sea(), which is hidden in the arch codes
      mm/fault.c. 32bit gets a dummy declaration in system_misc.h.
      
      There is going to be more of this in the future if/when the kernel
      supports the SError-based firmware-first notification mechanism and/or
      kernel-first notifications for both synchronous external abort and
      SError. Each of these will come with some Kconfig symbols and a
      handful of header files.
      
      Create a header file for all this.
      
      This patch gives handle_guest_sea() a 'kvm_' prefix, and moves the
      declarations to kvm_ras.h as preparation for a future patch that moves
      the ACPI-specific RAS code out of mm/fault.c.
      Signed-off-by: NJames Morse <james.morse@arm.com>
      Reviewed-by: NPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Tested-by: NTyler Baicar <tbaicar@codeaurora.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      0db5e022
    • J
      kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) · cfa39381
      Jann Horn 提交于
      kvm_ioctl_create_device() does the following:
      
      1. creates a device that holds a reference to the VM object (with a borrowed
         reference, the VM's refcount has not been bumped yet)
      2. initializes the device
      3. transfers the reference to the device to the caller's file descriptor table
      4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
         reference
      
      The ownership transfer in step 3 must not happen before the reference to the VM
      becomes a proper, non-borrowed reference, which only happens in step 4.
      After step 3, an attacker can close the file descriptor and drop the borrowed
      reference, which can cause the refcount of the kvm object to drop to zero.
      
      This means that we need to grab a reference for the device before
      anon_inode_getfd(), otherwise the VM can disappear from under us.
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: stable@kernel.org
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cfa39381
  9. 07 2月, 2019 3 次提交
  10. 26 1月, 2019 1 次提交
  11. 24 1月, 2019 3 次提交
  12. 12 1月, 2019 1 次提交