1. 25 11月, 2016 1 次提交
    • R
      KVM: x86: fix out-of-bounds access in lapic · 444fdad8
      Radim Krčmář 提交于
      Cluster xAPIC delivery incorrectly assumed that dest_id <= 0xff.
      With enabled KVM_X2APIC_API_USE_32BIT_IDS in KVM_CAP_X2APIC_API, a
      userspace can send an interrupt with dest_id that results in
      out-of-bounds access.
      
      Found by syzkaller:
      
        BUG: KASAN: slab-out-of-bounds in kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 at addr ffff88003d9ca750
        Read of size 8 by task syz-executor/22923
        CPU: 0 PID: 22923 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
         [...]
        Call Trace:
         [...] __dump_stack lib/dump_stack.c:15
         [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
         [...] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
         [...] print_address_description mm/kasan/report.c:194
         [...] kasan_report_error mm/kasan/report.c:283
         [...] kasan_report+0x231/0x500 mm/kasan/report.c:303
         [...] __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:329
         [...] kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 arch/x86/kvm/lapic.c:824
         [...] kvm_irq_delivery_to_apic+0x132/0x9a0 arch/x86/kvm/irq_comm.c:72
         [...] kvm_set_msi+0x111/0x160 arch/x86/kvm/irq_comm.c:157
         [...] kvm_send_userspace_msi+0x201/0x280 arch/x86/kvm/../../../virt/kvm/irqchip.c:74
         [...] kvm_vm_ioctl+0xba5/0x1670 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3015
         [...] vfs_ioctl fs/ioctl.c:43
         [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
         [...] SYSC_ioctl fs/ioctl.c:694
         [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
         [...] entry_SYSCALL_64_fastpath+0x1f/0xc2
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Fixes: e45115b6 ("KVM: x86: use physical LAPIC array for logical x2APIC")
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      444fdad8
  2. 19 8月, 2016 1 次提交
  3. 04 8月, 2016 1 次提交
    • W
      KVM: lapic: fix access preemption timer stuff even if kernel_irqchip=off · 91005300
      Wanpeng Li 提交于
      BUG: unable to handle kernel NULL pointer dereference at 000000000000008c
      IP: [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm]
      PGD 0
      Oops: 0000 [#1] SMP
      Call Trace:
       kvm_arch_vcpu_load+0x86/0x260 [kvm]
       vcpu_load+0x46/0x60 [kvm]
       kvm_vcpu_ioctl+0x79/0x7c0 [kvm]
       ? __lock_is_held+0x54/0x70
       do_vfs_ioctl+0x96/0x6a0
       ? __fget_light+0x2a/0x90
       SyS_ioctl+0x79/0x90
       do_syscall_64+0x7c/0x1e0
       entry_SYSCALL64_slow_path+0x25/0x25
      RIP  [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm]
       RSP <ffff8800db1f3d70>
      CR2: 000000000000008c
      ---[ end trace a55fb79d2b3b4ee8 ]---
      
      This can be reproduced steadily by kernel_irqchip=off.
      
      We should not access preemption timer stuff if lapic is emulated in userspace.
      This patch fix it by avoiding access preemption timer stuff when kernel_irqchip=off.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      91005300
  4. 14 7月, 2016 10 次提交
  5. 01 7月, 2016 2 次提交
    • W
      KVM: vmx: fix missed cancellation of TSC deadline timer · 196f20ca
      Wanpeng Li 提交于
      INFO: rcu_sched detected stalls on CPUs/tasks:
       1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
       (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
      Task dump for CPU 1:
      qemu-system-x86 R  running task        0  3529   3525 0x00080808
       ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
       ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
       0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
      Call Trace:
      ? kvm_write_guest_cached+0xb9/0x160 [kvm]
      ? __delay+0xf/0x20
      ? wait_lapic_expire+0x14a/0x200 [kvm]
      ? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
      ? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
      ? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
      ? __fget+0x5/0x210
      ? do_vfs_ioctl+0x96/0x6a0
      ? __fget_light+0x2a/0x90
      ? SyS_ioctl+0x79/0x90
      ? do_syscall_64+0x7c/0x1e0
      ? entry_SYSCALL64_slow_path+0x25/0x25
      
      This can be reproduced readily by running a full dynticks guest(since hrtimer
      in guest is heavily used) w/ lapic_timer_advance disabled.
      
      If fail to program hardware preemption timer, we will fallback to hrtimer based
      method, however, a previous programmed preemption timer miss to cancel in this
      scenario which results in one hardware preemption timer and one hrtimer emulated
      tsc deadline timer run simultaneously. So sometimes the target guest deadline
      tsc is earlier than guest tsc, which leads to the computation in vmx_set_hv_timer
      can underflow and cause delta_tsc to be set a huge value, then host soft lockup
      as above.
      
      This patch fix it by cancelling the previous programmed preemption timer if there
      is once we failed to program the new preemption timer and fallback to hrtimer
      based method.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      196f20ca
    • W
      KVM: x86: introduce cancel_hv_tscdeadline · bd97ad0e
      Wanpeng Li 提交于
      Introduce cancel_hv_tscdeadline() to encapsulate preemption
      timer cancel stuff.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bd97ad0e
  6. 27 6月, 2016 1 次提交
  7. 16 6月, 2016 2 次提交
    • Y
      KVM: x86: support using the vmx preemption timer for tsc deadline timer · ce7a058a
      Yunhong Jiang 提交于
      The VMX preemption timer can be used to virtualize the TSC deadline timer.
      The VMX preemption timer is armed when the vCPU is running, and a VMExit
      will happen if the virtual TSC deadline timer expires.
      
      When the vCPU thread is blocked because of HLT, KVM will switch to use
      an hrtimer, and then go back to the VMX preemption timer when the vCPU
      thread is unblocked.
      
      This solution avoids the complex OS's hrtimer system, and the host
      timer interrupt handling cost, replacing them with a little math
      (for guest->host TSC and host TSC->preemption timer conversion)
      and a cheaper VMexit.  This benefits latency for isolated pCPUs.
      
      [A word about performance... Yunhong reported a 30% reduction in average
       latency from cyclictest.  I made a similar test with tscdeadline_latency
       from kvm-unit-tests, and measured
      
       - ~20 clock cycles loss (out of ~3200, so less than 1% but still
         statistically significant) in the worst case where the test halts
         just after programming the TSC deadline timer
      
       - ~800 clock cycles gain (25% reduction in latency) in the best case
         where the test busy waits.
      
       I removed the VMX bits from Yunhong's patch, to concentrate them in the
       next patch - Paolo]
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ce7a058a
    • Y
      kvm: lapic: separate start_sw_tscdeadline from start_apic_timer · 53f9eedf
      Yunhong Jiang 提交于
      The function to start the tsc deadline timer virtualization will be used
      also by the pre_block hook when we use the preemption timer; change it
      to a separate function. No logic changes.
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      53f9eedf
  8. 19 5月, 2016 4 次提交
  9. 05 4月, 2016 1 次提交
    • L
      kvm: x86: make lapic hrtimer pinned · 61abdbe0
      Luiz Capitulino 提交于
      When a vCPU runs on a nohz_full core, the hrtimer used by
      the lapic emulation code can be migrated to another core.
      When this happens, it's possible to observe milisecond
      latency when delivering timer IRQs to KVM guests.
      
      The huge latency is mainly due to the fact that
      apic_timer_fn() expects to run during a kvm exit. It
      sets KVM_REQ_PENDING_TIMER and let it be handled on kvm
      entry. However, if the timer fires on a different core,
      we have to wait until the next kvm exit for the guest
      to see KVM_REQ_PENDING_TIMER set.
      
      This problem became visible after commit 9642d18e. This
      commit changed the timer migration code to always attempt
      to migrate timers away from nohz_full cores. While it's
      discussable if this is correct/desirable (I don't think
      it is), it's clear that the lapic emulation code has
      a requirement on firing the hrtimer in the same core
      where it was started. This is achieved by making the
      hrtimer pinned.
      
      Lastly, note that KVM has code to migrate timers when a
      vCPU is scheduled to run in different core. However, this
      forced migration may fail. When this happens, we can have
      the same problem. If we want 100% correctness, we'll have
      to modify apic_timer_fn() to cause a kvm exit when it runs
      on a different core than the vCPU. Not sure if this is
      possible.
      
      Here's a reproducer for the issue being fixed:
      
       1. Set all cores but core0 to be nohz_full cores
       2. Start a guest with a single vCPU
       3. Trace apic_timer_fn() and kvm_inject_apic_timer_irqs()
      
      You'll see that apic_timer_fn() will run in core0 while
      kvm_inject_apic_timer_irqs() runs in a different core. If
      you get both on core0, try running a program that takes 100%
      of the CPU and pin it to core0 to force the vCPU out.
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      61abdbe0
  10. 03 3月, 2016 2 次提交
  11. 25 2月, 2016 1 次提交
    • M
      KVM: Use simple waitqueue for vcpu->wq · 8577370f
      Marcelo Tosatti 提交于
      The problem:
      
      On -rt, an emulated LAPIC timer instances has the following path:
      
      1) hard interrupt
      2) ksoftirqd is scheduled
      3) ksoftirqd wakes up vcpu thread
      4) vcpu thread is scheduled
      
      This extra context switch introduces unnecessary latency in the
      LAPIC path for a KVM guest.
      
      The solution:
      
      Allow waking up vcpu thread from hardirq context,
      thus avoiding the need for ksoftirqd to be scheduled.
      
      Normal waitqueues make use of spinlocks, which on -RT
      are sleepable locks. Therefore, waking up a waitqueue
      waiter involves locking a sleeping lock, which
      is not allowed from hard interrupt context.
      
      cyclictest command line:
      
      This patch reduces the average latency in my tests from 14us to 11us.
      
      Daniel writes:
      Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
      benchmark on mainline. The test was run 1000 times on
      tip/sched/core 4.4.0-rc8-01134-g0905f04e:
      
        ./x86-run x86/tscdeadline_latency.flat -cpu host
      
      with idle=poll.
      
      The test seems not to deliver really stable numbers though most of
      them are smaller. Paolo write:
      
      "Anything above ~10000 cycles means that the host went to C1 or
      lower---the number means more or less nothing in that case.
      
      The mean shows an improvement indeed."
      
      Before:
      
                     min             max         mean           std
      count  1000.000000     1000.000000  1000.000000   1000.000000
      mean   5162.596000  2019270.084000  5824.491541  20681.645558
      std      75.431231   622607.723969    89.575700   6492.272062
      min    4466.000000    23928.000000  5537.926500    585.864966
      25%    5163.000000  16132529.750000  5790.132275  16683.745433
      50%    5175.000000  2281919.000000  5834.654000  23151.990026
      75%    5190.000000  2382865.750000  5861.412950  24148.206168
      max    5228.000000  4175158.000000  6254.827300  46481.048691
      
      After
                     min            max         mean           std
      count  1000.000000     1000.00000  1000.000000   1000.000000
      mean   5143.511000  2076886.10300  5813.312474  21207.357565
      std      77.668322   610413.09583    86.541500   6331.915127
      min    4427.000000    25103.00000  5529.756600    559.187707
      25%    5148.000000  1691272.75000  5784.889825  17473.518244
      50%    5160.000000  2308328.50000  5832.025000  23464.837068
      75%    5172.000000  2393037.75000  5853.177675  24223.969976
      max    5222.000000  3922458.00000  6186.720500  42520.379830
      
      [Patch was originaly based on the swait implementation found in the -rt
       tree. Daniel ported it to mainline's version and gathered the
       benchmark numbers for tscdeadline_latency test.]
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      8577370f
  12. 17 2月, 2016 1 次提交
  13. 09 2月, 2016 5 次提交
  14. 26 11月, 2015 3 次提交
    • A
      kvm/x86: Hyper-V synthetic interrupt controller · 5c919412
      Andrey Smetanin 提交于
      SynIC (synthetic interrupt controller) is a lapic extension,
      which is controlled via MSRs and maintains for each vCPU
       - 16 synthetic interrupt "lines" (SINT's); each can be configured to
         trigger a specific interrupt vector optionally with auto-EOI
         semantics
       - a message page in the guest memory with 16 256-byte per-SINT message
         slots
       - an event flag page in the guest memory with 16 2048-bit per-SINT
         event flag areas
      
      The host triggers a SINT whenever it delivers a new message to the
      corresponding slot or flips an event flag bit in the corresponding area.
      The guest informs the host that it can try delivering a message by
      explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
      MSR.
      
      The userspace (qemu) triggers interrupts and receives EOM notifications
      via irqfd with resampler; for that, a GSI is allocated for each
      configured SINT, and irq_routing api is extended to support GSI-SINT
      mapping.
      
      Changes v4:
      * added activation of SynIC by vcpu KVM_ENABLE_CAP
      * added per SynIC active flag
      * added deactivation of APICv upon SynIC activation
      
      Changes v3:
      * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
      docs
      
      Changes v2:
      * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
      * add Hyper-V SynIC vectors into EOI exit bitmap
      * Hyper-V SyniIC SINT msr write logic simplified
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5c919412
    • A
      kvm/x86: per-vcpu apicv deactivation support · d62caabb
      Andrey Smetanin 提交于
      The decision on whether to use hardware APIC virtualization used to be
      taken globally, based on the availability of the feature in the CPU
      and the value of a module parameter.
      
      However, under certain circumstances we want to control it on per-vcpu
      basis.  In particular, when the userspace activates HyperV synthetic
      interrupt controller (SynIC), APICv has to be disabled as it's
      incompatible with SynIC auto-EOI behavior.
      
      To achieve that, introduce 'apicv_active' flag on struct
      kvm_vcpu_arch, and kvm_vcpu_deactivate_apicv() function to turn APICv
      off.  The flag is initialized based on the module parameter and CPU
      capability, and consulted whenever an APICv-specific action is
      performed.
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d62caabb
    • A
      kvm/x86: split ioapic-handled and EOI exit bitmaps · 6308630b
      Andrey Smetanin 提交于
      The function to determine if the vector is handled by ioapic used to
      rely on the fact that only ioapic-handled vectors were set up to
      cause vmexits when virtual apic was in use.
      
      We're going to break this assumption when introducing Hyper-V
      synthetic interrupts: they may need to cause vmexits too.
      
      To achieve that, introduce a new bitmap dedicated specifically for
      ioapic-handled vectors, and populate EOI exit bitmap from it for now.
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6308630b
  15. 10 11月, 2015 1 次提交
  16. 04 11月, 2015 1 次提交
  17. 14 10月, 2015 1 次提交
  18. 01 10月, 2015 2 次提交
    • F
      KVM: Define a new interface kvm_intr_is_single_vcpu() · 8feb4a04
      Feng Wu 提交于
      This patch defines a new interface kvm_intr_is_single_vcpu(),
      which can returns whether the interrupt is for single-CPU or not.
      
      It is used by VT-d PI, since now we only support single-CPU
      interrupts, For lowest-priority interrupts, if user configures
      it via /proc/irq or uses irqbalance to make it single-CPU, we
      can use PI to deliver the interrupts to it. Full functionality
      of lowest-priority support will be added later.
      Signed-off-by: NFeng Wu <feng.wu@intel.com>
      Reviewed-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8feb4a04
    • S
      KVM: x86: Add EOI exit bitmap inference · b053b2ae
      Steve Rutherford 提交于
      In order to support a userspace IOAPIC interacting with an in kernel
      APIC, the EOI exit bitmaps need to be configurable.
      
      If the IOAPIC is in userspace (i.e. the irqchip has been split), the
      EOI exit bitmaps will be set whenever the GSI Routes are configured.
      In particular, for the low MSI routes are reservable for userspace
      IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
      destination vector of the route will be set for the destination VCPU.
      
      The intention is for the userspace IOAPICs to use the reservable MSI
      routes to inject interrupts into the guest.
      
      This is a slight abuse of the notion of an MSI Route, given that MSIs
      classically bypass the IOAPIC. It might be worthwhile to add an
      additional route type to improve clarity.
      
      Compile tested for Intel x86.
      Signed-off-by: NSteve Rutherford <srutherford@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b053b2ae