1. 16 Jan 2018, 17 commits
    • KVM: nVMX: remove unnecessary vmwrite from L2->L1 vmexit · 07f36616
      Committed by Paolo Bonzini
      The POSTED_INTR_NV field is constant (though it differs between the vmcs01
      and the vmcs02), so there is no need to reload it on vmexit to L1.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: initialize more non-shadowed fields in prepare_vmcs02_full · 25a2e4fe
      Committed by Paolo Bonzini
      These fields are also simple copies of the data in the vmcs12 struct.
      For some of them, prepare_vmcs02 was skipping the copy when the field
      was unused.  In prepare_vmcs02_full, we always copy them as long as the
      field exists on the host, because the corresponding execution control
      might be one of the shadowed fields (a sketch of this copy pattern
      follows this entry).

      Optimization opportunities remain for MSRs that, depending on the
      entry/exit controls, have to be copied from either the vmcs01 or
      the vmcs12: EFER (whose value is partly stored in the entry controls
      too), PAT, DEBUGCTL (and also DR7).  Before moving these three and
      the entry/exit controls to prepare_vmcs02_full, KVM would have to set
      dirty_vmcs12 on writes to the L1 MSRs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
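      A minimal sketch of the unconditional-copy pattern described above. All
      names here (vmcs12_sketch, vmcs_write64_sketch, host_has_pat_field) are
      invented stand-ins, not the kernel's actual code; the field encodings
      follow the SDM:

          #include <stdbool.h>
          #include <stdint.h>

          #define GUEST_IA32_DEBUGCTL 0x2802  /* VMCS field encodings per SDM */
          #define GUEST_IA32_PAT      0x2804

          struct vmcs12_sketch {
              uint64_t guest_ia32_pat;
              uint64_t guest_ia32_debugctl;
          };

          static bool host_has_pat_field = true;  /* assumed capability probe */

          static void vmcs_write64_sketch(unsigned long field, uint64_t value)
          {
              /* would execute VMWRITE on real hardware */
          }

          static void prepare_vmcs02_full_sketch(struct vmcs12_sketch *vmcs12)
          {
              /* Copy even when the controlling execution control is clear:
               * the control may itself be shadowed and change without
               * marking the vmcs12 dirty. */
              if (host_has_pat_field)
                  vmcs_write64_sketch(GUEST_IA32_PAT, vmcs12->guest_ia32_pat);
              vmcs_write64_sketch(GUEST_IA32_DEBUGCTL,
                                  vmcs12->guest_ia32_debugctl);
          }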
    • KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full · 8665c3f9
      Committed by Paolo Bonzini
      This part is kept separate for ease of review, because git prefers to move
      prepare_vmcs02 below the initial long sequence of vmcs_write* operations.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: track dirty state of non-shadowed VMCS fields · 74a497fa
      Committed by Paolo Bonzini
      VMCS12 fields that are not handled through the shadow VMCS are rarely
      written, and thus they are also almost constant in the vmcs02.  We can
      therefore optimize prepare_vmcs02 by skipping all the work for
      non-shadowed fields in the common case.

      This patch introduces the (pretty simple) tracking infrastructure (see
      the sketch after this entry); the next patches will move work to
      prepare_vmcs02_full and save a few hundred clock cycles per VMRESUME
      on a Haswell Xeon E5 system:
      
      	                                before  after
      	cpuid                           14159   13869
      	vmcall                          15290   14951
      	inl_from_kernel                 17703   17447
      	outl_to_kernel                  16011   14692
      	self_ipi_sti_nop                16763   15825
      	self_ipi_tpr_sti_nop            17341   15935
      	wr_tsc_adjust_msr               14510   14264
      	rd_tsc_adjust_msr               15018   14311
      	mmio-wildcard-eventfd:pci-mem   16381   14947
      	mmio-datamatch-eventfd:pci-mem  18620   17858
      	portio-wildcard-eventfd:pci-io  15121   14769
      	portio-datamatch-eventfd:pci-io 15761   14831
      
      (average savings 748, stdev 460).
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
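      A minimal sketch of the dirty-tracking idea, with invented names (the
      commit itself calls the flag dirty_vmcs12):

          #include <stdbool.h>

          struct nested_state_sketch {
              bool dirty_vmcs12;  /* set on writes to non-shadowed fields */
          };

          /* Called from the VMWRITE handler for vmcs12 fields. */
          static void mark_vmcs12_write_sketch(struct nested_state_sketch *n,
                                               bool field_is_shadowed)
          {
              if (!field_is_shadowed)
                  n->dirty_vmcs12 = true;  /* vmcs02's copy is now stale */
          }

          /* Called on every nested vmentry. */
          static void prepare_vmcs02_sketch(struct nested_state_sketch *n)
          {
              if (n->dirty_vmcs12) {
                  /* slow path: rewrite all rarely-changing fields
                   * (prepare_vmcs02_full in the following patches) */
                  n->dirty_vmcs12 = false;
              }
              /* common case: only refresh frequently-written fields */
          }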
    • KVM: VMX: split list of shadowed VMCS fields to a separate file · c9e9deae
      Committed by Paolo Bonzini
      Prepare for multiple inclusions of the list.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: vmx: Reduce size of vmcs_field_to_offset_table · 58e9ffae
      Committed by Jim Mattson
      The vmcs_field_to_offset_table was a rather sparse table of short
      integers with a maximum index of 0x6c16, amounting to 55342 bytes. Now
      that we are considering support for multiple VMCS12 formats, it would
      be unfortunate to replicate that large, sparse table. Rotating the
      field encoding (as a 16-bit integer) left by 6 reduces the table to
      5926 bytes (see the sketch after this entry).
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
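      A sketch of the compression trick: a VMCS field encoding keeps its
      width/type bits high and its small per-group index low, so most raw
      encodings are unused; rotating the 16-bit encoding left by 6 packs the
      used values into a much denser index range (names invented):

          #include <stdint.h>

          static inline uint16_t rol16_sketch(uint16_t v, unsigned int n)
          {
              return (uint16_t)((v << n) | (v >> (16 - n)));
          }

          /* Before: table indexed by the raw encoding, max index 0x6c16.
           * After:  table indexed by the rotated encoding, far smaller. */
          static inline unsigned int field_to_index_sketch(uint16_t encoding)
          {
              return rol16_sketch(encoding, 6);
          }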
    • kvm: vmx: Change vmcs_field_type to vmcs_field_width · d37f4267
      Committed by Jim Mattson
      Per the SDM, "[VMCS] Fields are grouped by width (16-bit, 32-bit,
      etc.) and type (guest-state, host-state, etc.)." Previously, the width
      was indicated by vmcs_field_type. To avoid confusion when we start
      dealing with both field width and field type, rename vmcs_field_type
      to vmcs_field_width.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
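      For reference, the SDM places the width in bits 14:13 of the field
      encoding; a hedged sketch of extracting it (enum values follow the SDM
      grouping, the names are invented):

          enum vmcs_field_width_sketch {
              WIDTH_U16     = 0,
              WIDTH_U64     = 1,
              WIDTH_U32     = 2,
              WIDTH_NATURAL = 3,  /* natural-width field */
          };

          static inline int field_width_sketch(unsigned long field)
          {
              return (field >> 13) & 0x3;  /* bits 14:13 of the encoding */
          }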
    • kvm: vmx: Introduce VMCS12_MAX_FIELD_INDEX · 5b15706d
      Committed by Jim Mattson
      This is the highest index value used in any supported VMCS12 field
      encoding. It is used to populate the IA32_VMX_VMCS_ENUM MSR.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
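      A sketch of how such a constant would feed the MSR. Per the SDM, the
      highest field index is reported in bits 9:1 of IA32_VMX_VMCS_ENUM; the
      index value below is illustrative only, not the one KVM uses:

          #include <stdint.h>

          #define VMCS12_MAX_FIELD_INDEX_SKETCH 0x17  /* assumed value */

          static uint64_t vmx_vmcs_enum_sketch(void)
          {
              /* the field index occupies bits 9:1 of the MSR value */
              return (uint64_t)VMCS12_MAX_FIELD_INDEX_SKETCH << 1;
          }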
    • KVM: VMX: optimize shadow VMCS copying · 44900ba6
      Committed by Paolo Bonzini
      Because all fields can be read/written with a single vmread/vmwrite on
      64-bit kernels, the switch statements in copy_vmcs12_to_shadow and
      copy_shadow_to_vmcs12 are unnecessary.

      What this patch does is copy the two halves of 64-bit fields
      separately on 32-bit kernels, keeping all the complicated #ifdef-ery
      in init_vmcs_shadow_fields (see the sketch after this entry).  The
      disadvantage is that 64-bit fields have to be listed separately in
      shadow_read_only/read_write_fields, but those are few and we can
      validate the arrays when building the VMREAD and VMWRITE bitmaps.
      This saves a few hundred clock cycles per nested vmexit.

      However, there is still a "switch" in vmcs_read_any and vmcs_write_any.
      So, while at it, this patch reorders the fields by type, hoping that
      the branch predictor appreciates it.

      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
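      A hedged sketch of the 32-bit fallback: the high half of a 64-bit
      field has its own encoding (the base encoding plus 1), so the copy is
      done as two 32-bit accesses (all names invented stand-ins):

          #include <stdint.h>

          static uint32_t shadow_read32_sketch(unsigned long field)
          {
              return 0;  /* stands in for VMREAD from the shadow VMCS */
          }

          static void vmcs12_store32_sketch(unsigned long field, uint32_t v)
          {
              /* stands in for storing into the vmcs12 structure */
          }

          static void copy_shadow_field64_sketch(unsigned long field)
          {
              vmcs12_store32_sketch(field, shadow_read32_sketch(field));
              /* field + 1 addresses the high 32 bits of the same field */
              vmcs12_store32_sketch(field + 1, shadow_read32_sketch(field + 1));
          }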
    • KVM: vmx: shadow more fields that are read/written on every vmexit · c5d167b2
      Committed by Paolo Bonzini
      Compared to when VMCS shadowing was added to KVM, we are reading/writing
      a few more fields: the PML index, the interrupt status and the preemption
      timer value.  The first two are there because we are exposing more features
      to nested guests; the preemption timer is simply because we have grown
      a new optimization.  Adding them to the shadow VMCS field lists reduces
      the cost of a vmexit by about 1000 clock cycles for each field that exists
      on bare metal.

      On the other hand, the guest BNDCFGS and TSC offset are not written on
      fast paths, so remove them.
      Suggested-by: Jim Mattson <jmattson@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2 · 6b697711
      Committed by Liran Alon
      Consider the following scenario:
      1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI
      to CPU B via the virtual posted-interrupt mechanism.
      2. CPU B is currently executing the L2 guest.
      3. vmx_deliver_nested_posted_interrupt() calls
      kvm_vcpu_trigger_posted_interrupt(), which notes that
      vcpu->mode == IN_GUEST_MODE.
      4. Assume that before CPU A sends the physical POSTED_INTR_NESTED_VECTOR
      IPI, CPU B exits from L2 to L0 during event-delivery
      (valid IDT-vectoring-info).
      5. CPU A now sends the physical IPI. The IPI is received in the host and
      its handler (smp_kvm_posted_intr_nested_ipi()) does nothing.
      6. Assume that before CPU A sets pi_pending=true and KVM_REQ_EVENT,
      CPU B continues to run in L0 and reaches vcpu_enter_guest(). As
      KVM_REQ_EVENT is not set yet, vcpu_enter_guest() continues and resumes
      the L2 guest.
      7. At this point, CPU A sets pi_pending=true and KVM_REQ_EVENT, but
      it's too late! CPU B has already entered L2, and KVM_REQ_EVENT will
      only be consumed at the next L2 entry!

      Another scenario to consider:
      1. CPU A calls vmx_deliver_nested_posted_interrupt() to send an IPI
      to CPU B via the virtual posted-interrupt mechanism.
      2. Assume that before CPU A calls kvm_vcpu_trigger_posted_interrupt(),
      CPU B is at L0 and is about to resume into L2. Further assume that it is
      in vcpu_enter_guest(), past the check for KVM_REQ_EVENT.
      3. At this point, CPU A calls kvm_vcpu_trigger_posted_interrupt(), which
      notes that vcpu->mode != IN_GUEST_MODE, therefore does nothing and
      returns false. CPU A then sets pi_pending=true and KVM_REQ_EVENT.
      4. Now CPU B continues and resumes into the L2 guest without processing
      the posted interrupt until the next L2 entry!

      To fix both issues, we just need to change
      vmx_deliver_nested_posted_interrupt() to set pi_pending=true and
      KVM_REQ_EVENT before calling kvm_vcpu_trigger_posted_interrupt()
      (see the ordering sketch after this entry).

      This fixes the first scenario by changing step (6): CPU B now notes
      that KVM_REQ_EVENT is set and pi_pending=true, and therefore processes
      the nested posted interrupt.

      It fixes the second scenario in one of two ways:
      1. If kvm_vcpu_trigger_posted_interrupt() is called after CPU B has
      changed vcpu->mode to IN_GUEST_MODE, the physical IPI is sent and is
      received when the CPU resumes into L2.
      2. If kvm_vcpu_trigger_posted_interrupt() is called before CPU B has
      changed vcpu->mode to IN_GUEST_MODE, then after CPU B changes
      vcpu->mode it calls kvm_request_pending(), which returns true and
      therefore forces another round of vcpu_enter_guest(), which notes that
      KVM_REQ_EVENT is set and pi_pending=true, and therefore processes the
      nested posted interrupt.
      
      Cc: stable@vger.kernel.org
      Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing")
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      [Add kvm_vcpu_kick to also handle the case where L1 doesn't intercept L2 HLT
       and L2 executes HLT instruction. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
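      A sketch of the resulting order of operations, with invented names;
      the real code gets its synchronization from vcpu->mode and the request
      machinery, which the bool fields below merely stand in for:

          #include <stdbool.h>

          struct vcpu_sketch {
              bool pi_pending;  /* nested posted interrupt outstanding */
              bool req_event;   /* stands in for KVM_REQ_EVENT         */
          };

          static bool try_send_pi_ipi_sketch(struct vcpu_sketch *v)
          {
              /* stands in for kvm_vcpu_trigger_posted_interrupt(); returns
               * false when the target is not in IN_GUEST_MODE */
              return false;
          }

          static void vcpu_kick_sketch(struct vcpu_sketch *v) { }

          static void deliver_nested_pi_sketch(struct vcpu_sketch *v)
          {
              v->pi_pending = true;  /* 1: publish state first, so a racing */
              v->req_event  = true;  /*    vmentry on the target sees it    */
              if (!try_send_pi_ipi_sketch(v))
                  vcpu_kick_sketch(v);  /* 2: also wakes a halted L2 vCPU */
          }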
    • KVM: nVMX: Fix injection to L2 when L1 doesn't intercept external interrupts · 851c1a18
      Committed by Liran Alon
      Before each vmentry to the guest, vcpu_enter_guest() calls
      sync_pir_to_irr(), which calls vmx_hwapic_irr_update() to update RVI.
      Currently, vmx_hwapic_irr_update() contains a tweak for the case where
      it is called while the CPU is running L2 and L1 doesn't intercept
      external interrupts. In that case, the code injects the interrupt
      directly into L2 instead of updating RVI.

      Besides being hacky (one wouldn't expect a function that updates RVI
      to also inject interrupts), the code doesn't handle this case
      correctly. It has several issues:
      1. When kvm_queue_interrupt() is called, it is simply passed max_irr,
      which represents the highest IRR currently pending in the L1 LAPIC.
      This is problematic because the interrupt is injected into the guest
      while its bit is still set in the LAPIC IRR, instead of being cleared
      from the IRR and set in the ISR.
      2. The code doesn't check whether the LAPIC PPR allows accepting an
      interrupt of max_irr priority. It only checks that interrupts are
      enabled in the guest, via vmx_interrupt_allowed().

      To fix the above issues (see also the sketch after this entry):
      1. Simplify vmx_hwapic_irr_update() to just update RVI.
      Note that this shouldn't happen when the CPU is running L2
      (see the comment in the code).
      2. Since vmx_hwapic_irr_update() now only handles L1
      virtual-interrupt delivery, make inject_pending_event() the
      one responsible for injecting the interrupt directly into L2.
      Therefore, change kvm_cpu_has_injectable_intr() to check the L1
      LAPIC when the CPU is running L2.
      3. Change vmx_sync_pir_to_irr() to set KVM_REQ_EVENT when L1
      has a pending injectable interrupt.

      Fixes: 963fee16 ("KVM: nVMX: Fix virtual interrupt delivery injection")
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
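      An abstract sketch of the division of labor after the fix. All names
      are invented, and several KVM code paths are compressed into stubs;
      this illustrates the shape of the change, not its implementation:

          #include <stdbool.h>

          struct vcpu_sketch2 {
              bool in_l2;
              bool l1_intercepts_ext_intr;
          };

          /* After the fix: only maintains RVI for L1's own
           * virtual-interrupt delivery. */
          static void hwapic_irr_update_sketch(struct vcpu_sketch2 *v,
                                               int max_irr)
          {
              /* write max_irr into RVI, nothing else */
          }

          /* inject_pending_event() now owns direct injection into L2 when
           * L1 does not intercept external interrupts; the vector is taken
           * from L1's LAPIC (IRR bit cleared, ISR set, PPR respected). */
          static void inject_pending_event_sketch(struct vcpu_sketch2 *v)
          {
              if (v->in_l2 && !v->l1_intercepts_ext_intr) {
                  /* get-and-ack the highest injectable vector from the L1
                   * LAPIC, then inject it into L2 */
              }
          }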
    • KVM: nVMX: Re-evaluate L1 pending events when running L2 and L1 got a posted interrupt · f27a85c4
      Committed by Liran Alon
      If a posted interrupt is delivered to a CPU while it is in the host
      (outside the guest), then posted-interrupt delivery is completed by
      calling sync_pir_to_irr() at vmentry, after interrupts are disabled.

      sync_pir_to_irr() checks the ON bit of vmx->pi_desc.control and, if
      set, syncs vmx->pi_desc.pir into the IRR and then updates RVI to
      ensure virtual-interrupt delivery dispatches the interrupt to the guest.

      However, it is possible for L1 to receive a posted interrupt while
      the CPU runs in the host and is about to enter L2. In this case, the
      call to sync_pir_to_irr() will indeed update the L1 APIC IRR, but
      vcpu_enter_guest() will then just resume into the L2 guest without
      re-evaluating whether it should exit from L2 to L1 as a result of
      this new pending L1 event.

      To address this case, if sync_pir_to_irr() finds a new injectable L1
      interrupt while the CPU is running L2, we force an exit from
      GUEST_MODE (see the sketch after this entry). This results in another
      iteration of the vcpu_run() loop, which calls kvm_vcpu_running(),
      which calls check_nested_events(), which handles the pending L1 event
      properly.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
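      A sketch of the check added to the PIR sync path. Names are invented;
      "leave GUEST_MODE" stands in for the mode change that forces another
      pass through vcpu_run():

          #include <stdbool.h>

          struct vcpu_sketch3 {
              bool in_guest_mode;  /* running (or about to run) L2 */
          };

          static bool sync_pir_to_irr_sketch(struct vcpu_sketch3 *v)
          {
              /* copy PIR bits into L1's IRR and update RVI; return whether
               * a new injectable L1 interrupt showed up */
              return true;
          }

          static void vmentry_path_sketch(struct vcpu_sketch3 *v)
          {
              bool new_l1_intr = sync_pir_to_irr_sketch(v);

              if (new_l1_intr && v->in_guest_mode) {
                  /* leave GUEST_MODE so vcpu_run() loops again and
                   * check_nested_events() can decide on an L2->L1 exit */
              }
          }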
    • KVM: x86: Change __kvm_apic_update_irr() to also return if max IRR updated · e7387b0e
      Committed by Liran Alon
      This commit doesn't change semantics.
      It is a preparation for the following commits.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: Fix bug of injecting L2 exception into L1 · 5c7d4f9a
      Committed by Liran Alon
      kvm_clear_exception_queue() should clear the pending exception.
      This includes exceptions which were only marked pending but not yet
      injected, because exception.pending is used for both L1 and L2 to
      determine whether an exception should be raised to the guest.
      Note that an exception which is pending but not yet injected will
      be raised again once the guest is resumed.

      Consider the following scenario:
      1) L0 KVM with ignore_msrs=false.
      2) L1 prepares vmcs12 with the following:
          a) No intercepts on MSRs (MSR_BITMAP exists and is filled with zeros).
          b) No intercept for #GP.
          c) The vmx-preemption-timer is configured.
      3) L1 enters L2.
      4) L2 reads an unhandled MSR that exists in MSR_BITMAP
      (such as 0x1fff).

      The L2 RDMSR is then handled as follows:
      1) L2 exits to L0 on RDMSR and calls handle_rdmsr().
      2) handle_rdmsr() calls kvm_inject_gp(), which sets
      KVM_REQ_EVENT, exception.pending=true and exception.injected=false.
      3) vcpu_enter_guest() consumes KVM_REQ_EVENT and calls
      inject_pending_event(), which calls vmx_check_nested_events(),
      which sees that exception.pending=true but
      nested_vmx_check_exception() returns 0 and therefore does nothing at
      this point. However, let's assume it later sees that the
      vmx-preemption-timer expired and therefore exits from L2 to L1 by
      calling nested_vmx_vmexit().
      4) nested_vmx_vmexit() calls prepare_vmcs12(),
      which calls vmcs12_save_pending_event(), but that does nothing as
      exception.injected is false. prepare_vmcs12() also calls
      kvm_clear_exception_queue(), which does nothing as
      exception.injected is already false.
      5) We now return from vmx_check_nested_events() with 0 while still
      having exception.pending=true!
      6) Therefore inject_pending_event() continues,
      and we inject the L2 exception into L1!...

      This commit fixes the above issue by changing step (4):
      kvm_clear_exception_queue() now also clears exception.pending (see
      the sketch after this entry).

      Fixes: 664f8e26 ("KVM: X86: Fix loss of exception which has not yet been injected")
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
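      The fix itself is tiny; a sketch follows, where the field names come
      from the commit text and the struct is an invented stand-in:

          #include <stdbool.h>

          struct exception_queue_sketch {
              bool pending;   /* exception decided on, not yet injected */
              bool injected;  /* exception already injected into the guest */
          };

          static void kvm_clear_exception_queue_sketch(
                  struct exception_queue_sketch *e)
          {
              e->pending  = false;  /* the fix: also drop pending-but-not-
                                     * yet-injected exceptions on L2->L1 exit */
              e->injected = false;
          }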
    • kvm/vmx: Use local vmx variable in vmx_get_msr() · a6cb099a
      Committed by Borislav Petkov
      ... just like in vmx_set_msr().

      No functional change.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: X86: introduce invalidate_gpa argument to tlb flush · c2ba05cc
      Committed by Wanpeng Li
      Introduce a new bool invalidate_gpa argument to kvm_x86_ops->tlb_flush;
      it will be used by later patches to flush only the guest TLB.

      For VMX, this will use INVVPID instead of INVEPT, which invalidates
      combined mappings while keeping guest-physical mappings (see the
      sketch after this entry).

      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
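      A sketch of the new hook shape and the VMX-side choice it enables.
      Function names are invented; enable_vpid_sketch mirrors VMX's real
      enable_vpid module parameter:

          #include <stdbool.h>

          static bool enable_vpid_sketch = true;

          static void invept_sketch(void)
          {
              /* flushes guest-physical and combined mappings */
          }

          static void invvpid_sketch(void)
          {
              /* flushes combined mappings only */
          }

          /* kvm_x86_ops->tlb_flush gains a bool invalidate_gpa argument. */
          static void vmx_flush_tlb_sketch(bool invalidate_gpa)
          {
              if (invalidate_gpa || !enable_vpid_sketch)
                  invept_sketch();
              else
                  invvpid_sketch();  /* guest-physical mappings survive */
          }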
  2. 12 Jan 2018, 1 commit
  3. 14 Dec 2017, 8 commits
  4. 06 Dec 2017, 2 commits
  5. 28 Nov 2017, 2 commits
    • KVM: VMX: Fix vmx->nested freeing when no SMI handler · b7455825
      Committed by Wanpeng Li
      Reported by syzkaller:
      
         ------------[ cut here ]------------
         WARNING: CPU: 5 PID: 2939 at arch/x86/kvm/vmx.c:3844 free_loaded_vmcs+0x77/0x80 [kvm_intel]
         CPU: 5 PID: 2939 Comm: repro Not tainted 4.14.0+ #26
         RIP: 0010:free_loaded_vmcs+0x77/0x80 [kvm_intel]
         Call Trace:
          vmx_free_vcpu+0xda/0x130 [kvm_intel]
          kvm_arch_destroy_vm+0x192/0x290 [kvm]
          kvm_put_kvm+0x262/0x560 [kvm]
          kvm_vm_release+0x2c/0x30 [kvm]
          __fput+0x190/0x370
          task_work_run+0xa1/0xd0
          do_exit+0x4d2/0x13e0
          do_group_exit+0x89/0x140
          get_signal+0x318/0xb80
          do_signal+0x8c/0xb40
          exit_to_usermode_loop+0xe4/0x140
          syscall_return_slowpath+0x206/0x230
          entry_SYSCALL_64_fastpath+0x98/0x9a
      
      The syzkaller testcase executes the VMXON/VMLAUNCH instructions, so the
      vmx->nested state is populated; it also issues the KVM_SMI ioctl. However,
      the testcase is just a simple C program and is not launched under
      firmware such as SeaBIOS, which would implement an SMI handler.
      Commit 05cade71 (KVM: nSVM: fix SMI injection in guest mode) leaves
      guest mode and sets nested.vmxon to false for the duration of SMM,
      per SDM 34.14.1 ("leave VMX operation" upon entering SMM). We can't
      allocate/free the vmx->nested state each time we enter/exit SMM, since
      that would add overhead. So vmx_pre_enter_smm() marks nested.vmxon
      false even though the vmx->nested state is still populated, expecting
      em_rsm() to set nested.vmxon back to true. However, the SMI
      handler/RSM never executes in this scenario, since there is nothing
      like SeaBIOS involved. free_nested() then fails to free the
      vmx->nested state because vmx->nested.vmxon is false, which results
      in the above warning.

      This patch fixes it by also considering the no-SMI-handler case.
      Luckily, vmx_pre_enter_smm() records the value of vmx->nested.vmxon
      in vmx->nested.smm.vmxon, so we can take advantage of that and free
      the vmx->nested state when L1 goes down (see the sketch after this
      entry).
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Fixes: 05cade71 (KVM: nSVM: fix SMI injection in guest mode)
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
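      A sketch of the teardown check after the fix, with invented struct and
      function names standing in for the code around free_nested() in vmx.c:

          #include <stdbool.h>

          struct nested_sketch {
              bool vmxon;      /* cleared for the duration of SMM */
              bool smm_vmxon;  /* snapshot taken in vmx_pre_enter_smm */
          };

          static void free_nested_state_sketch(struct nested_sketch *n)
          {
              /* frees vmcs02, the cached vmcs12, and friends */
          }

          static void vcpu_teardown_sketch(struct nested_sketch *n)
          {
              /* Treat "vmxon was set when we entered SMM" like vmxon, so
               * the state is freed even if RSM never ran. */
              if (n->vmxon || n->smm_vmxon)
                  free_nested_state_sketch(n);
          }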
    • KVM: VMX: Fix rflags cache during vCPU reset · c37c2873
      Committed by Wanpeng Li
      Reported by syzkaller:
      
         *** Guest State ***
         CR0: actual=0x0000000080010031, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
         CR4: actual=0x0000000000002061, shadow=0x0000000000000000, gh_mask=ffffffffffffe8f1
         CR3 = 0x000000002081e000
         RSP = 0x000000000000fffa  RIP = 0x0000000000000000
         RFLAGS=0x00023000         DR7 = 0x00000000000000
                ^^^^^^^^^^
         ------------[ cut here ]------------
         WARNING: CPU: 6 PID: 24431 at /home/kernel/linux/arch/x86/kvm//x86.c:7302 kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
         CPU: 6 PID: 24431 Comm: reprotest Tainted: G        W  OE   4.14.0+ #26
         RIP: 0010:kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
         RSP: 0018:ffff880291d179e0 EFLAGS: 00010202
         Call Trace:
          kvm_vcpu_ioctl+0x479/0x880 [kvm]
          do_vfs_ioctl+0x142/0x9a0
          SyS_ioctl+0x74/0x80
          entry_SYSCALL_64_fastpath+0x23/0x9a
      
      The failed vmentry is triggered by the following beautified testcase:
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <string.h>
          #include <stdint.h>
          #include <linux/kvm.h>
          #include <fcntl.h>
          #include <sys/ioctl.h>
      
          long r[5];
          int main()
          {
              struct kvm_debugregs dr = { 0 };
      
              r[2] = open("/dev/kvm", O_RDONLY);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
              r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
              struct kvm_guest_debug debug = {
                      .control = 0xf0403,
                      .arch = {
                              .debugreg[6] = 0x2,
                              .debugreg[7] = 0x2
                      }
              };
              ioctl(r[4], KVM_SET_GUEST_DEBUG, &debug);
              ioctl(r[4], KVM_RUN, 0);
          }
      
      The testcase tries to set up processor-specific debug registers and
      configure the vCPU to handle guest debug events through
      KVM_SET_GUEST_DEBUG.  The KVM_SET_GUEST_DEBUG ioctl gets and sets
      rflags in order to set the TF bit if single-stepping is requested.
      During vCPU reset, all register caches are marked available and the
      GUEST_RFLAGS VMCS field is reset to 0x2; the cached rflags value
      itself, however, is not reset. vmx_get_rflags() therefore returns the
      stale cached value, which is 0 after boot, because the cache is
      marked available. Vmentry fails if reserved bit 1 of rflags is 0.

      This patch fixes it by resetting both the GUEST_RFLAGS VMCS field and
      its cache to 0x2 during vCPU reset (see the sketch after this entry).
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
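      A sketch of the fix: reset rflags through the setter so the VMCS field
      and the register cache cannot disagree. Names are invented; bit 1 of
      RFLAGS is the architecturally reserved always-one bit:

          #include <stdint.h>

          #define X86_EFLAGS_FIXED_SKETCH 0x2  /* reserved bit 1, must be 1 */

          struct vcpu_rflags_sketch {
              uint64_t cached_rflags;
              int      cache_avail;
          };

          static void vmcs_write_rflags_sketch(uint64_t val)
          {
              /* stands in for VMWRITE to GUEST_RFLAGS */
          }

          static void vmx_set_rflags_sketch(struct vcpu_rflags_sketch *v,
                                            uint64_t val)
          {
              v->cached_rflags = val;        /* refresh the cache...          */
              v->cache_avail   = 1;
              vmcs_write_rflags_sketch(val); /* ...and the GUEST_RFLAGS field */
          }

          static void vcpu_reset_sketch(struct vcpu_rflags_sketch *v)
          {
              vmx_set_rflags_sketch(v, X86_EFLAGS_FIXED_SKETCH);
          }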
  6. 17 Nov 2017, 9 commits
  7. 03 Nov 2017, 1 commit