1. 20 3月, 2019 4 次提交
    • S
      KVM: arm/arm64: Enforce PTE mappings at stage2 when needed · a80868f3
      Suzuki K Poulose 提交于
      commit 6794ad54 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
      made the checks to skip huge mappings, stricter. However it introduced
      a bug where we still use huge mappings, ignoring the flag to
      use PTE mappings, by not reseting the vma_pagesize to PAGE_SIZE.
      
      Also, the checks do not cover the PUD huge pages, that was
      under review during the same period. This patch fixes both
      the issues.
      
      Fixes : 6794ad54 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
      Reported-by: NZenghui Yu <yuzenghui@huawei.com>
      Cc: Zenghui Yu <yuzenghui@huawei.com>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      a80868f3
    • M
      KVM: arm/arm64: vgic-its: Take the srcu lock when parsing the memslots · 7494cec6
      Marc Zyngier 提交于
      Calling kvm_is_visible_gfn() implies that we're parsing the memslots,
      and doing this without the srcu lock is frown upon:
      
      [12704.164532] =============================
      [12704.164544] WARNING: suspicious RCU usage
      [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G        W
      [12704.164573] -----------------------------
      [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [12704.164602] other info that might help us debug this:
      [12704.164616] rcu_scheduler_active = 2, debug_locks = 1
      [12704.164631] 6 locks held by qemu-system-aar/13968:
      [12704.164644]  #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [12704.164691]  #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [12704.164726]  #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164761]  #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164794]  #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164827]  #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [12704.164861] stack backtrace:
      [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G        W         5.1.0-rc1-00008-g600025238f51-dirty #16
      [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [12704.164896] Call trace:
      [12704.164910]  dump_backtrace+0x0/0x138
      [12704.164920]  show_stack+0x24/0x30
      [12704.164934]  dump_stack+0xbc/0x104
      [12704.164946]  lockdep_rcu_suspicious+0xcc/0x110
      [12704.164958]  gfn_to_memslot+0x174/0x190
      [12704.164969]  kvm_is_visible_gfn+0x28/0x70
      [12704.164980]  vgic_its_check_id.isra.0+0xec/0x1e8
      [12704.164991]  vgic_its_save_tables_v0+0x1ac/0x330
      [12704.165001]  vgic_its_set_attr+0x298/0x3a0
      [12704.165012]  kvm_device_ioctl_attr+0x9c/0xd8
      [12704.165022]  kvm_device_ioctl+0x8c/0xf8
      [12704.165035]  do_vfs_ioctl+0xc8/0x960
      [12704.165045]  ksys_ioctl+0x8c/0xa0
      [12704.165055]  __arm64_sys_ioctl+0x28/0x38
      [12704.165067]  el0_svc_common+0xd8/0x138
      [12704.165078]  el0_svc_handler+0x38/0x78
      [12704.165089]  el0_svc+0x8/0xc
      
      Make sure the lock is taken when doing this.
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Reviewed-by: NEric Auger <eric.auger@redhat.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      7494cec6
    • M
      KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memory · a6ecfb11
      Marc Zyngier 提交于
      When halting a guest, QEMU flushes the virtual ITS caches, which
      amounts to writing to the various tables that the guest has allocated.
      
      When doing this, we fail to take the srcu lock, and the kernel
      shouts loudly if running a lockdep kernel:
      
      [   69.680416] =============================
      [   69.680819] WARNING: suspicious RCU usage
      [   69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
      [   69.682096] -----------------------------
      [   69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      [   69.683225]
      [   69.683225] other info that might help us debug this:
      [   69.683225]
      [   69.683975]
      [   69.683975] rcu_scheduler_active = 2, debug_locks = 1
      [   69.684598] 6 locks held by qemu-system-aar/4097:
      [   69.685059]  #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
      [   69.686087]  #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
      [   69.686919]  #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.687698]  #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.688475]  #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.689978]  #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
      [   69.690729]
      [   69.690729] stack backtrace:
      [   69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
      [   69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
      [   69.692831] Call trace:
      [   69.694072]  lockdep_rcu_suspicious+0xcc/0x110
      [   69.694490]  gfn_to_memslot+0x174/0x190
      [   69.694853]  kvm_write_guest+0x50/0xb0
      [   69.695209]  vgic_its_save_tables_v0+0x248/0x330
      [   69.695639]  vgic_its_set_attr+0x298/0x3a0
      [   69.696024]  kvm_device_ioctl_attr+0x9c/0xd8
      [   69.696424]  kvm_device_ioctl+0x8c/0xf8
      [   69.696788]  do_vfs_ioctl+0xc8/0x960
      [   69.697128]  ksys_ioctl+0x8c/0xa0
      [   69.697445]  __arm64_sys_ioctl+0x28/0x38
      [   69.697817]  el0_svc_common+0xd8/0x138
      [   69.698173]  el0_svc_handler+0x38/0x78
      [   69.698528]  el0_svc+0x8/0xc
      
      The fix is to obviously take the srcu lock, just like we do on the
      read side of things since bf308242. One wonders why this wasn't
      fixed at the same time, but hey...
      
      Fixes: bf308242 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      a6ecfb11
    • M
      arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabled · ca71228b
      Marc Zyngier 提交于
      The normal interrupt flow is not to enable the vgic when no virtual
      interrupt is to be injected (i.e. the LRs are empty). But when a guest
      is likely to use GICv4 for LPIs, we absolutely need to switch it on
      at all times. Otherwise, VLPIs only get delivered when there is something
      in the LRs, which doesn't happen very often.
      Reported-by: NNianyao Tang <tangnianyao@huawei.com>
      Tested-by: NShameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      ca71228b
  2. 01 3月, 2019 1 次提交
  3. 23 2月, 2019 1 次提交
  4. 22 2月, 2019 1 次提交
  5. 21 2月, 2019 10 次提交
    • L
      Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()" · a67794ca
      Lan Tianyu 提交于
      The value of "dirty_bitmap[i]" is already check before setting its value
      to mask. The following check of "mask" is redundant. The check of "mask" was
      introduced by commit 58d2930f ("KVM: Eliminate extra function calls in
      kvm_get_dirty_log_protect()"), revert it.
      Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a67794ca
    • N
      KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behaviour in case
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns will be multiplied by grow factor
      (halt_poll_ns_grow) which will require several grow iteration in order
      to reach a value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from value of 0 is slower
      than growing it from a positive value less than halt_poll_ns_grow_start.
      Which is misleading and inaccurate.
      
      Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start in any case that
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      Regardless if vcpu->halt_poll_ns is 0.
      
      use READ_ONCE to get a consistent number for all cases.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dee339b5
    • N
      KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner 提交于
      The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
      start value when raising up vcpu->halt_poll_ns.
      It actually sets the first timeout to the first polling session.
      This value has significant effect on how tolerant we are to outliers.
      On the standard case, higher value is better - we will spend more time
      in the polling busyloop, handle events/interrupts faster and result
      in better performance.
      But on outliers it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we will still waste time on the first
      iteration.
      The optimal value changes between different workloads. It depends on
      outliers rate and polling sessions length.
      As this value has significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      49113d36
    • N
      KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behavior in case
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_pol_ns will be set to zero.
      That results in shrinking instead of growing.
      
      Fix issue by changing grow_halt_poll_ns() to not modify
      vcpu->halt_poll_ns in case halt_poll_ns_grow is zero
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Suggested-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7fa08e71
    • S
      KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5
      Sean Christopherson 提交于
      ...now that KVM won't explode by moving it out of bit 0.  Using bit 63
      eliminates the need to jump over bit 0, e.g. when calculating a new
      memslots generation or when propagating the memslots generation to an
      MMIO spte.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      164bf7e5
    • S
      KVM: Remove the hack to trigger memslot generation wraparound · 0e32958e
      Sean Christopherson 提交于
      x86 captures a subset of the memslot generation (19 bits) in its MMIO
      sptes so that it can expedite emulated MMIO handling by checking only
      the releveant spte, i.e. doesn't need to do a full page fault walk.
      
      Because the MMIO sptes capture only 19 bits (due to limited space in
      the sptes), there is a non-zero probability that the MMIO generation
      could wrap, e.g. after 500k memslot updates.  Since normal usage is
      extremely unlikely to result in 500k memslot updates, a hack was added
      by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio
      wrap-around value") to offset the MMIO generation in order to trigger
      a wraparound, e.g. after 150 memslot updates.
      
      When separate memslot generation sequences were assigned to each
      address space, commit 00f034a1 ("KVM: do not bias the generation
      number in kvm_current_mmio_generation") moved the offset logic into the
      initialization of the memslot generation itself so that the per-address
      space bit(s) were not dropped/corrupted by the MMIO shenanigans.
      
      Remove the offset hack for three reasons:
      
        - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
          wrapping the generation doesn't actually test the interesting case
          of having stale MMIO sptes with the new generation number, e.g. old
          sptes with a generation number of 0.
      
        - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
          performance rather important since the probability of invalidating
          MMIO sptes jumps from "effectively never" to "fairly likely".  This
          limits what can be done in future patches, e.g. to simplify the
          invalidation code, as doing so without proper caution could lead to
          a noticeable performance regression.
      
        - Forcing the memslots generation, which is a 64-bit number, to wrap
          prevents KVM from assuming the memslots generation will never wrap.
          This in turn prevents KVM from using an arbitrary bit for the
          "update in-progress" flag, e.g. using bit 63 would immediately
          collide with using a large value as the starting generation number.
          The "update in-progress" flag is effectively forced into bit 0 so
          that it's (subtly) taken into account when incrementing the
          generation.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0e32958e
    • S
      KVM: Explicitly define the "memslot update in-progress" bit · 361209e0
      Sean Christopherson 提交于
      KVM uses bit 0 of the memslots generation as an "update in-progress"
      flag, which is used by x86 to prevent caching MMIO access while the
      memslots are changing.  Although the intended behavior is flag-like,
      e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
      caching data from in-flux memslots, the implementation oftentimes treats
      the bit as part of the generation number itself, e.g. incrementing the
      generation increments twice, once to set the flag and once to clear it.
      
      Prior to commit 4bd518f1 ("KVM: use separate generations for
      each address space"), incorporating the "update in-progress" bit into
      the generation number largely made sense, e.g. "real" generations are
      even, "bogus" generations are odd, most code doesn't need to be aware of
      the bit, etc...
      
      Now that unique memslots generation numbers are assigned to each address
      space, stealthing the in-progress status into the generation number
      results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
      over bit 0 when initializing the memslots generation without any hint as
      to why.
      
      Explicitly define the flag and convert as much code as possible (which
      isn't much) to actually treat it like a flag.  This paves the way for
      eventually using a different bit for "update in-progress" so that it can
      be a flag in truth instead of a awkward extension to the generation
      number.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      361209e0
    • S
      KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson 提交于
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      15248258
    • B
      kvm: Add memcg accounting to KVM allocations · b12ce36a
      Ben Gardon 提交于
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In they include:
              vcpu->run
              kvm->coalesced_mmio_ring
      There allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b12ce36a
    • G
      kvm: Use struct_size() in kmalloc() · 90952cd3
      Gustavo A. R. Silva 提交于
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      90952cd3
  6. 20 2月, 2019 14 次提交
  7. 08 2月, 2019 2 次提交
    • J
      KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing · 0db5e022
      James Morse 提交于
      To split up APEIs in_nmi() path, the caller needs to always be
      in_nmi(). KVM shouldn't have to know about this, pull the RAS plumbing
      out into a header file.
      
      Currently guest synchronous external aborts are claimed as RAS
      notifications by handle_guest_sea(), which is hidden in the arch codes
      mm/fault.c. 32bit gets a dummy declaration in system_misc.h.
      
      There is going to be more of this in the future if/when the kernel
      supports the SError-based firmware-first notification mechanism and/or
      kernel-first notifications for both synchronous external abort and
      SError. Each of these will come with some Kconfig symbols and a
      handful of header files.
      
      Create a header file for all this.
      
      This patch gives handle_guest_sea() a 'kvm_' prefix, and moves the
      declarations to kvm_ras.h as preparation for a future patch that moves
      the ACPI-specific RAS code out of mm/fault.c.
      Signed-off-by: NJames Morse <james.morse@arm.com>
      Reviewed-by: NPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Tested-by: NTyler Baicar <tbaicar@codeaurora.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      0db5e022
    • J
      kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974) · cfa39381
      Jann Horn 提交于
      kvm_ioctl_create_device() does the following:
      
      1. creates a device that holds a reference to the VM object (with a borrowed
         reference, the VM's refcount has not been bumped yet)
      2. initializes the device
      3. transfers the reference to the device to the caller's file descriptor table
      4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
         reference
      
      The ownership transfer in step 3 must not happen before the reference to the VM
      becomes a proper, non-borrowed reference, which only happens in step 4.
      After step 3, an attacker can close the file descriptor and drop the borrowed
      reference, which can cause the refcount of the kvm object to drop to zero.
      
      This means that we need to grab a reference for the device before
      anon_inode_getfd(), otherwise the VM can disappear from under us.
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: stable@kernel.org
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cfa39381
  8. 07 2月, 2019 3 次提交
  9. 26 1月, 2019 1 次提交
  10. 24 1月, 2019 3 次提交