1. 19 Sep, 2014 10 commits
  2. 17 Sep, 2014 4 commits
  3. 16 Sep, 2014 1 commit
    • Z
      kvm: ioapic: conditionally delay irq delivery during eoi broadcast · 184564ef
      Zhang Haoyu committed
      Currently, we call ioapic_service() immediately when we find the irq is still
      active during eoi broadcast. But for real hardware, there's some delay between
      the EOI writing and irq delivery.  If we do not emulate this behavior, and
      re-inject the interrupt immediately after the guest sends an EOI and re-enables
      interrupts, a guest might spend all its time in the ISR if it has a broken
      handler for a level-triggered interrupt.
      
      Such livelock actually happens with Windows guests when resuming from
      hibernation.
      
      As there's no way to distinguish a broken handler from newly raised interrupts, this patch
      delays an interrupt if 10,000 consecutive EOIs found that the interrupt was
      still high.  The guest can then make a little forward progress, until a proper
      IRQ handler is set or until some detection routine in the guest (such as
      Linux's note_interrupt()) recognizes the situation.
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
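The counting scheme described in this commit message can be sketched as a small userspace model. This is an illustration only, not the kernel code: `pin_state`, `handle_eoi`, the `eoi_action` values, and `SUCCESSIVE_EOI_LIMIT` are hypothetical names, and the delayed delivery itself is reduced to a return value.

```c
#include <stdbool.h>

/* Hypothetical threshold mirroring the "10,000 consecutive EOIs" in the text. */
#define SUCCESSIVE_EOI_LIMIT 10000

/* Toy model of one level-triggered IOAPIC pin; not the kernel's structure. */
struct pin_state {
    bool line_high;          /* interrupt line is still asserted */
    unsigned int eoi_streak; /* consecutive EOIs that found the line high */
};

enum eoi_action { EOI_NONE, EOI_REINJECT_NOW, EOI_REINJECT_DELAYED };

/* Decide what to do when the guest sends an EOI for this pin. */
enum eoi_action handle_eoi(struct pin_state *pin)
{
    if (!pin->line_high) {
        pin->eoi_streak = 0;         /* the handler cleared the source */
        return EOI_NONE;
    }
    if (++pin->eoi_streak == SUCCESSIVE_EOI_LIMIT) {
        pin->eoi_streak = 0;         /* start counting again after the pause */
        return EOI_REINJECT_DELAYED; /* give the guest time to make progress */
    }
    return EOI_REINJECT_NOW;         /* normal case: re-inject immediately */
}
```

In this model a well-behaved handler never reaches the limit, because any EOI that finds the line low resets the streak.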
  4. 11 Sep, 2014 1 commit
  5. 05 Sep, 2014 3 commits
  6. 03 Sep, 2014 2 commits
    • D
      kvm: fix potentially corrupt mmio cache · ee3d1570
      David Matlack committed
      vcpu exits and memslot mutations can run concurrently as long as the
      vcpu does not acquire the slots mutex. Thus it is theoretically possible
      for memslots to change underneath a vcpu that is handling an exit.
      
      If we increment the memslot generation number again after
      synchronize_srcu_expedited(), vcpus can safely cache memslot generation
      without maintaining a single rcu_dereference through an entire vm exit.
      And much of the x86/kvm code does not maintain a single rcu_dereference
      of the current memslots during each exit.
      
      We can prevent the following case:
      
         vcpu (CPU 0)                             | thread (CPU 1)
      --------------------------------------------+--------------------------
      1  vm exit                                  |
      2  srcu_read_lock(&kvm->srcu)               |
      3  decide to cache something based on       |
           old memslots                           |
      4                                           | change memslots
                                                  | (increments generation)
      5                                           | synchronize_srcu(&kvm->srcu);
      6  retrieve generation # from new memslots  |
      7  tag cache with new memslot generation    |
      8  srcu_read_unlock(&kvm->srcu)             |
      ...                                         |
         <action based on cache occurs even       |
          though the caching decision was based   |
          on the old memslots>                    |
      ...                                         |
         <action *continues* to occur until next  |
          memslot generation change, which may    |
          be never>                               |
                                                  |
      
      By incrementing the generation after synchronizing with kvm->srcu readers,
      we ensure that the generation retrieved in (6) will become invalid soon
      after (8).
      
      Keeping the existing increment is not strictly necessary, but we
      do keep it and just move it for consistency from update_memslots to
      install_new_memslots.  It invalidates old cached MMIOs immediately,
      instead of having to wait for the end of synchronize_srcu_expedited,
      which makes the code more clearly correct in case CPU 1 is preempted
      right after synchronize_srcu() returns.
      
      To avoid halving the generation space in SPTEs, always presume that the
      low bit of the generation is zero when reconstructing a generation number
      out of an SPTE.  This effectively disables MMIO caching in SPTEs during
      the call to synchronize_srcu_expedited.  Using the low bit this way is
      somewhat like a seqcount---where the protected thing is a cache, and
      instead of retrying we can simply punt if we observe the low bit to be 1.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
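The seqcount-like use of the low generation bit can be illustrated with a toy model. This is a sketch under assumptions, not the kernel implementation: `slots_model`, `begin_update`, `end_update`, `make_tag`, and `tag_valid` are hypothetical names, and the SRCU wait between the two increments is only implied.

```c
#include <stdbool.h>

/* Toy stand-in for the memslot generation counter (not the kernel struct). */
struct slots_model {
    unsigned int generation; /* even when stable, odd while an update is in flight */
};

/* Writer: first increment, made when the new memslots are published. */
void begin_update(struct slots_model *s) { s->generation++; }

/* Writer: second increment, made after waiting for SRCU readers to finish. */
void end_update(struct slots_model *s) { s->generation++; }

/* Reader: a cached tag is always stored with the low bit presumed zero. */
unsigned int make_tag(const struct slots_model *s)
{
    return s->generation & ~1u;
}

/* A cache entry is usable only if its tag equals the current generation. */
bool tag_valid(const struct slots_model *s, unsigned int tag)
{
    return tag == s->generation;
}
```

While the generation is odd, no tag can match, so caching is effectively disabled for the duration of the update. A tag created in that window equals the pre-update generation and therefore also fails to match once `end_update()` has run.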
    • P
      KVM: do not bias the generation number in kvm_current_mmio_generation · 00f034a1
      Paolo Bonzini committed
      The next patch will give a meaning (a la seqcount) to the low bit of the
      generation number.  Ensure that it matches between kvm->memslots->generation
      and kvm_current_mmio_generation().
      
      Cc: stable@vger.kernel.org
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 29 Aug, 2014 2 commits
  8. 28 Aug, 2014 3 commits
  9. 22 Aug, 2014 1 commit
  10. 21 Aug, 2014 1 commit
  11. 19 Aug, 2014 2 commits
    • C
      virt/kvm/assigned-dev.c: Set 'dev->irq_source_id' to '-1' after free it · 30d1e0e8
      Chen Gang committed
      As a generic function, deassign_guest_irq() assumes it can be called
      even if assign_guest_irq() was not called successfully (which can be
      triggered by ioctl from user mode, indirectly).
      
      So in the assign_guest_irq() failure path, we need to set 'dev->irq_source_id'
      to -1 after freeing it, or deassign_guest_irq() may free
      it again.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
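The double-free hazard described above can be sketched with a small model. Only the field name `irq_source_id` comes from the commit text; the struct, the helper functions, and the `free_calls` counter are hypothetical and exist just to make the release count observable.

```c
/* Toy model of an assigned device; only the field name follows the commit. */
struct assigned_dev_model {
    int irq_source_id; /* -1 means "no source id held" */
};

/* Counts releases so that a double free would be visible. */
int free_calls;

/* Hypothetical stand-in for releasing an irq source id. */
static void release_source_id(int id)
{
    (void)id;
    free_calls++;
}

/* Failure path of an assign operation: release the id, then mark it gone. */
void assign_failed_cleanup(struct assigned_dev_model *dev)
{
    release_source_id(dev->irq_source_id);
    dev->irq_source_id = -1; /* without this, deassign() would free it again */
}

/* Generic teardown that may run even after a failed assign. */
void deassign(struct assigned_dev_model *dev)
{
    if (dev->irq_source_id == -1)
        return; /* nothing left to release */
    release_source_id(dev->irq_source_id);
    dev->irq_source_id = -1;
}
```

Resetting the field to the sentinel value is what lets the generic teardown stay safe regardless of whether the earlier assign succeeded.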
    • M
      kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601) · 350b8bdd
      Michael S. Tsirkin committed
      The third parameter of kvm_iommu_put_pages is wrong;
      it should be 'gfn - slot->base_gfn'.
      
      By making gfn very large, a malicious guest or userspace can cause kvm to
      go to this error path, and subsequently to pass a huge value as size.
      Alternatively, if gfn is small, then pages would be pinned but never
      unpinned, causing a host memory leak and local DoS.
      
      Passing a reasonable but large value could be the most dangerous case,
      because it would unpin a page that should have stayed pinned, and thus
      allow the device to DMA into arbitrary memory.  However, this cannot
      happen because of the conditions that can trigger the error:
      
      - out of memory (where you can't allocate even a single page)
        should not be possible for the attacker to trigger
      
      - when exceeding the iommu's address space, guest pages after gfn
        will also exceed the iommu's address space, and inside
        kvm_iommu_put_pages() the iommu_iova_to_phys() will fail.  The
        page thus would not be unpinned at all.
      Reported-by: Jack Morgenstein <jackm@mellanox.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
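The arithmetic behind the corrected third parameter can be shown in isolation. This is a minimal sketch, assuming the parameter is a count of pages to unwind starting at the slot's base gfn; `pages_to_unwind` and `buggy_pages_to_unwind` are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

typedef uint64_t gfn_t; /* guest frame number, as a plain integer here */

/*
 * Number of pages already mapped when mapping fails at failed_gfn,
 * i.e. the pages in [base_gfn, failed_gfn). This corresponds to
 * 'gfn - slot->base_gfn' in the commit text.
 */
uint64_t pages_to_unwind(gfn_t base_gfn, gfn_t failed_gfn)
{
    return failed_gfn - base_gfn;
}

/* The mistaken variant: passes the absolute gfn as if it were a count. */
uint64_t buggy_pages_to_unwind(gfn_t base_gfn, gfn_t failed_gfn)
{
    (void)base_gfn;
    return failed_gfn;
}
```

For a slot starting at gfn 0x100000 that fails at gfn 0x100010, the correct count is 0x10 pages, whereas the mistaken variant yields 0x100010.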
  12. 06 Aug, 2014 1 commit
    • P
      KVM: Move more code under CONFIG_HAVE_KVM_IRQFD · c77dcacb
      Paolo Bonzini committed
      Commit e4d57e1e (KVM: Move irq notifier implementation into
      eventfd.c, 2014-06-30) included the irq notifier code unconditionally
      in eventfd.c, while it was under CONFIG_HAVE_KVM_IRQCHIP before.
      
      Similarly, commit 297e2105 (KVM: Give IRQFD its own separate enabling
      Kconfig option, 2014-06-30) moved code from CONFIG_HAVE_IRQ_ROUTING
      to CONFIG_HAVE_KVM_IRQFD but forgot to move the pieces that used to be
      under CONFIG_HAVE_KVM_IRQCHIP.
      
      Together, this broke compilation without CONFIG_KVM_XICS.  Fix by adding
      or changing the #ifdefs so that they point at CONFIG_HAVE_KVM_IRQFD.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 05 Aug, 2014 5 commits
  14. 31 Jul, 2014 2 commits
    • M
      KVM: arm64: GICv3: mandate page-aligned GICV region · fb3ec679
      Marc Zyngier committed
      Just like GICv2 was fixed in 63afbe7a
      (kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform),
      mandate the GICV region to be both aligned on a page boundary and
      its size to be a multiple of page size.
      
      This prevents a guest from being able to poke at regions where we
      have no idea what is sitting there.
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
    • P
      KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table · 0f6c0a74
      Paolo Bonzini committed
      Currently, the EOI exit bitmap (used for APICv) does not include
      interrupts that are masked.  However, this can cause a bug that manifests
      as an interrupt storm inside the guest.  Alex Williamson reported the
      bug and is the one who really debugged this; I only wrote the patch. :)
      
      The scenario involves a multi-function PCI device with OHCI and EHCI
      USB functions and an audio function, all assigned to the guest, where
      both USB functions use legacy INTx interrupts.
      
      As soon as the guest boots, interrupts for these devices turn into an
      interrupt storm in the guest; the host does not see the interrupt storm.
      Basically the EOI path does not work, and the guest continues to see the
      interrupt over and over, even after it attempts to mask it at the APIC.
      The bug is only visible with older kernels (RHEL6.5, based on 2.6.32
      with not many changes in the area of APIC/IOAPIC handling).
      
      Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ)
      on in the eoi_exit_bitmap and TMR, and things then work.  What happens
      is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap.
      It does not have bit 59 set because the RTE was masked, so the IOAPIC
      never sees the EOI and the interrupt continues to fire in the guest.
      
      My guess was that the guest is masking the interrupt in the redirection
      table in the interrupt routine, i.e. while the interrupt is set in a
      LAPIC's ISR.  The simplest fix is to ignore the masking state; we would
      rather have an unnecessary exit than a missed IRQ ACK, and anyway
      IOAPIC interrupts are not as performance-sensitive as, for example, MSIs.
      Alex tested this patch and it fixed his bug.
      
      [Thanks to Alex for his precise description of the problem
       and initial debugging effort.  A lot of the text above is
       based on emails exchanged with him.]
      Reported-by: Alex Williamson <alex.williamson@redhat.com>
      Tested-by: Alex Williamson <alex.williamson@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
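The effect of ignoring the mask bit when computing the EOI exit bitmap can be sketched as follows. This is a simplified model, not the kernel's scan routine: `rte_model`, `build_eoi_exit_bitmap`, and `bitmap_test` are hypothetical, and the model assumes that being level-triggered is the only criterion for listing a vector.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy redirection table entry; the field names are illustrative. */
struct rte_model {
    uint8_t vector;
    bool level_triggered;
    bool masked;
};

/* Build a 256-bit bitmap of vectors whose EOI should cause an exit. */
void build_eoi_exit_bitmap(const struct rte_model *tbl, int n, uint64_t bitmap[4])
{
    for (int i = 0; i < 4; i++)
        bitmap[i] = 0;

    for (int i = 0; i < n; i++) {
        if (!tbl[i].level_triggered)
            continue;
        /* tbl[i].masked is deliberately ignored: a masked entry may
         * still have an interrupt in service that needs its EOI seen. */
        bitmap[tbl[i].vector / 64] |= UINT64_C(1) << (tbl[i].vector % 64);
    }
}

/* Check whether a vector is present in the bitmap. */
bool bitmap_test(const uint64_t bitmap[4], uint8_t vector)
{
    return (bitmap[vector / 64] >> (vector % 64)) & 1;
}
```

With a masked, level-triggered entry on vector 59, as in the scenario above, the bit is still set, at the cost of an occasional unnecessary exit.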
  15. 30 Jul, 2014 1 commit
    • W
      kvm: arm64: vgic: fix hyp panic with 64k pages on juno platform · 63afbe7a
      Will Deacon committed
      If the physical address of GICV isn't page-aligned, then we end up
      creating a stage-2 mapping of the page containing it, which causes us to
      map neighbouring memory locations directly into the guest.
      
      As an example, consider a platform with GICV at physical 0x2c02f000
      running a 64k-page host kernel. If qemu maps this into the guest at
      0x80010000, then guest physical addresses 0x80010000 - 0x8001efff will
      map host physical region 0x2c020000 - 0x2c02efff. Accesses to these
      physical regions may cause UNPREDICTABLE behaviour, for example, on the
      Juno platform this will cause an SError exception to EL3, which brings
      down the entire physical CPU resulting in RCU stalls / HYP panics / host
      crashing / wasted weeks of debugging.
      
      SBSA recommends that systems alias the 4k GICV across the bounding 64k
      region, in which case GICV physical could be described as 0x2c020000 in
      the above scenario.
      
      This patch fixes the problem by failing the vgic probe if the physical
      base address or the size of GICV aren't page-aligned. Note that this
      generated a warning in dmesg about freeing enabled IRQs, so I had to
      move the IRQ enabling later in the probe.
      
      Cc: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joel Schopp <joel.schopp@amd.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Acked-by: Peter Maydell <peter.maydell@linaro.org>
      Acked-by: Joel Schopp <joel.schopp@amd.com>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
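The alignment check and the address arithmetic from the example above can be worked through in a few lines. This is a sketch only: `is_page_aligned`, `page_base`, and `gicv_region_ok` are hypothetical helpers, the page size is passed explicitly rather than taken from the kernel's configuration, and the 4k region size is taken from the "4k GICV" wording in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if value is a multiple of page_size (page_size must be a power of two). */
bool is_page_aligned(uint64_t value, uint64_t page_size)
{
    return (value & (page_size - 1)) == 0;
}

/* Start of the page that contains addr. */
uint64_t page_base(uint64_t addr, uint64_t page_size)
{
    return addr & ~(page_size - 1);
}

/* Accept the region only if both its base and its size are page-aligned. */
bool gicv_region_ok(uint64_t base, uint64_t size, uint64_t page_size)
{
    return is_page_aligned(base, page_size) && is_page_aligned(size, page_size);
}
```

With 64k pages (0x10000), a GICV at 0x2c02f000 lives in the page starting at 0x2c020000, so mapping that page exposes the neighbouring 0x2c020000 - 0x2c02efff range; the check rejects this layout, while the same base with 4k pages passes.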
  16. 28 Jul, 2014 1 commit