1. 11 Apr 2016 (2 commits)
    • kvm: x86: do not leak guest xcr0 into host interrupt handlers · fc5b7f3b
      By David Matlack
      An interrupt handler that uses the fpu can kill a KVM VM, if it runs
      under the following conditions:
       - the guest's xcr0 register is loaded on the cpu
       - the guest's fpu context is not loaded
       - the host is using eagerfpu
      
      Note that the guest's xcr0 register and fpu context are not loaded as
      part of the atomic world switch into "guest mode". They are loaded by
      KVM while the cpu is still in "host mode".
      
      Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
      interrupt handler will look something like this:
      
      if (irq_fpu_usable()) {
              kernel_fpu_begin();
      
              [... code that uses the fpu ...]
      
              kernel_fpu_end();
      }
      
      As long as the guest's fpu is not loaded and the host is using eager
      fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
      returns true). The interrupt handler proceeds to use the fpu with
      the guest's xcr0 live.
      
      kernel_fpu_begin() saves the current fpu context. If this uses
      XSAVE[OPT], it may leave the xsave area in an undesirable state.
      According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
      if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
      xcr0[i] == 0 following an XSAVE.
      
      kernel_fpu_end() restores the fpu context. Now if any bit i in
      XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
      fault is trapped and SIGSEGV is delivered to the current process.
      
      Only pre-4.2 kernels appear to be vulnerable to this sequence of
      events. Commit 653f52c3 ("kvm,x86: load guest FPU context more eagerly")
      from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
      
      This patch fixes the bug by keeping the host's xcr0 loaded outside
      of the interrupts-disabled region where KVM switches into guest mode.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: David Matlack <dmatlack@google.com>
      [Move load after goto cancel_injection. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: fix permission_fault() · 7a98205d
      By Xiao Guangrong
      kvm-unit-tests complained that the PFEC is not set properly, e.g.:
      test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
      expected 5
      Dump mapping: address: 0x123400000000
      ------L4: 3e95007
      ------L3: 3e96007
      ------L2: 2000083
      
      The cause is that the PFEC returned to the guest is copied from the
      PFEC triggered by the shadow page table.
      
      This patch fixes it and cleans up the logic for updating the error code.
      Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      [Do not assume pfec.p=1. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 05 Apr 2016 (1 commit)
    • kvm: x86: make lapic hrtimer pinned · 61abdbe0
      By Luiz Capitulino
      When a vCPU runs on a nohz_full core, the hrtimer used by
      the lapic emulation code can be migrated to another core.
      When this happens, it's possible to observe millisecond
      latency when delivering timer IRQs to KVM guests.
      
      The huge latency is mainly due to the fact that
      apic_timer_fn() expects to run during a kvm exit. It
      sets KVM_REQ_PENDING_TIMER and lets it be handled on kvm
      entry. However, if the timer fires on a different core,
      we have to wait until the next kvm exit for the guest
      to see KVM_REQ_PENDING_TIMER set.
      
      This problem became visible after commit 9642d18e. This
      commit changed the timer migration code to always attempt
      to migrate timers away from nohz_full cores. While it's
      debatable whether this is correct/desirable (I don't think
      it is), it's clear that the lapic emulation code has
      a requirement on firing the hrtimer in the same core
      where it was started. This is achieved by making the
      hrtimer pinned.
      
      Lastly, note that KVM has code to migrate timers when a
      vCPU is scheduled to run on a different core. However, this
      forced migration may fail. When this happens, we can have
      the same problem. If we want 100% correctness, we'll have
      to modify apic_timer_fn() to cause a kvm exit when it runs
      on a different core than the vCPU. Not sure if this is
      possible.
      
      Here's a reproducer for the issue being fixed:
      
       1. Set all cores but core0 to be nohz_full cores
       2. Start a guest with a single vCPU
       3. Trace apic_timer_fn() and kvm_inject_apic_timer_irqs()
      
      You'll see that apic_timer_fn() will run in core0 while
      kvm_inject_apic_timer_irqs() runs in a different core. If
      you get both on core0, try running a program that takes 100%
      of the CPU and pin it to core0 to force the vCPU out.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 01 Apr 2016 (3 commits)
  4. 23 Mar 2016 (1 commit)
    • KVM: page_track: fix access to NULL slot · a6adb106
      By Paolo Bonzini
      This happens when doing the reboot test from virt-tests:
      
      [  131.833653] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  131.842461] IP: [<ffffffffa0950087>] kvm_page_track_is_active+0x17/0x60 [kvm]
      [  131.850500] PGD 0
      [  131.852763] Oops: 0000 [#1] SMP
      [  132.007188] task: ffff880075fbc500 ti: ffff880850a3c000 task.ti: ffff880850a3c000
      [  132.138891] Call Trace:
      [  132.141639]  [<ffffffffa092bd11>] page_fault_handle_page_track+0x31/0x40 [kvm]
      [  132.149732]  [<ffffffffa093380f>] paging64_page_fault+0xff/0x910 [kvm]
      [  132.172159]  [<ffffffffa092c734>] kvm_mmu_page_fault+0x64/0x110 [kvm]
      [  132.179372]  [<ffffffffa06743c2>] handle_exception+0x1b2/0x430 [kvm_intel]
      [  132.187072]  [<ffffffffa067a301>] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
      ...
      
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: 3d0c27ad
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 22 Mar 2016 (15 commits)
  6. 10 Mar 2016 (2 commits)
    • KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 · 5f0b8199
      By Paolo Bonzini
      KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
      CR0.WP=1.  These pages' SPTEs flip continuously between two states:
      U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
      and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
      When guest EFER has the NX bit cleared, the reserved bit check thinks
      that the latter state is invalid; teach it that the smep_andnot_wp case
      will also use the NX bit of SPTEs.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: c258b62b
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo · 844a5fe2
      By Paolo Bonzini
      Yes, all of these are needed. :) This is admittedly a bit odd, but
      kvm-unit-tests access.flat tests this if you run it with "-cpu host"
      and of course ept=0.
      
      KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
      specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
      when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
      When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
      restarts execution.  This will still cause a user write to fault, while
      supervisor writes will succeed.  User reads will fault spuriously now,
      and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
      will be enabled and supervisor writes disabled, going back to the
      original situation where supervisor writes fault spuriously.
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0.  If the guest has not enabled NX, the result is a continuous
      stream of page faults due to the NX bit being reserved.
      
      The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
      switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
      control, so they do not use user-return notifiers for EFER---if they did,
      EFER.NX would be forced to the same value as the host).
      
      There is another bug in the reserved bit check, which I've split to a
      separate patch for easier application to stable kernels.
      
      Cc: stable@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: f6577a5f
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 09 Mar 2016 (2 commits)
  8. 08 Mar 2016 (10 commits)
  9. 05 Mar 2016 (1 commit)
  10. 04 Mar 2016 (3 commits)
    • KVM: MMU: check kvm_mmu_pages and mmu_page_path indices · e23d3fef
      By Xiao Guangrong
      Give a special invalid index to the root of the walk, so that we
      can check the consistency of kvm_mmu_pages and mmu_page_path.
      Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      [Extracted from a bigger patch proposed by Guangrong. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: Fix ubsan warnings · 0a47cd85
      By Paolo Bonzini
      kvm_mmu_pages_init is doing some really yucky stuff.  It is setting
      up a sentinel for mmu_page_clear_parents; however, because of a) the
      way levels are numbered starting from 1 and b) the way mmu_page_path
      sizes its arrays with PT64_ROOT_LEVEL-1 elements, the access can be
      out of bounds.  This is harmless because the code overwrites up to the
      first two elements of parents->idx and these are initialized, and
      because the sentinel is not needed in this case---mmu_page_clear_parents
      exits anyway when it gets to the end of the array.  However ubsan
      complains, and everyone else should too.
      
      This fix does three things.  First it makes the mmu_page_path arrays
      PT64_ROOT_LEVEL elements in size, so that we can write to them without
      checking the level in advance.  Second it disintegrates kvm_mmu_pages_init
      between mmu_unsync_walk (to reset the struct kvm_mmu_pages) and
      for_each_sp (to place the NULL sentinel at the end of the current path).
      This is okay because the mmu_page_path is only used in
      mmu_pages_clear_parents; mmu_pages_clear_parents itself is called within
      a for_each_sp iterator, and hence always after a call to mmu_pages_next.
      Third it changes mmu_pages_clear_parents to just use the sentinel to
      stop iteration, without checking the bounds on level.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: cleanup handle_abnormal_pfn · 798e88b3
      By Paolo Bonzini
      The goto and the temporary variable are unnecessary; just use return
      statements.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>