1. 15 11月, 2019 1 次提交
  2. 26 9月, 2019 1 次提交
  3. 24 9月, 2019 1 次提交
  4. 12 9月, 2019 2 次提交
    • L
      KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Liran Alon 提交于
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed code to latch INIT while vCPU is in SMM and process latched INIT
      when leaving SMM. It left a subtle remark in commit message that similar
      treatment should also be done while vCPU is in VMX non-root-mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for pending INIT signal
      while vCPU in guest-mode. If so, emualte vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
      but is currently left as a TODO comment to implement in the future
      because nSVM don't yet implement svm_check_nested_events().
      
      Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
      activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
      guest (See nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state will be implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Co-developed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4b9852f4
    • W
      KVM: LAPIC: Micro optimize IPI latency · 2b0911d1
      Wanpeng Li 提交于
      This patch optimizes the virtual IPI emulation sequence:
      
      write ICR2                     write ICR2
      write ICR                      read ICR2
      read ICR            ==>        send virtual IPI
      read ICR2                      write ICR
      send virtual IPI
      
      It can reduce kvm-unit-tests/vmexit.flat IPI testing latency(from sender
      send IPI to sender receive the ACK) from 3319 cycles to 3203 cycles on
      SKylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2b0911d1
  5. 14 8月, 2019 1 次提交
    • R
      kvm: x86: skip populating logical dest map if apic is not sw enabled · b14c876b
      Radim Krcmar 提交于
      recalculate_apic_map does not santize ldr and it's possible that
      multiple bits are set. In that case, a previous valid entry
      can potentially be overwritten by an invalid one.
      
      This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then
      triggering a crash to boot a kdump kernel. This is the sequence of
      events:
      1. Linux boots in bigsmp mode and enables PhysFlat, however, it still
      writes to the LDR which probably will never be used.
      2. However, when booting into kdump, the stale LDR values remain as
      they are not cleared by the guest and there isn't a apic reset.
      3. kdump boots with 1 cpu, and uses Logical Destination Mode but the
      logical map has been overwritten and points to an inactive vcpu.
      Signed-off-by: NRadim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b14c876b
  6. 05 8月, 2019 1 次提交
  7. 02 8月, 2019 1 次提交
  8. 20 7月, 2019 1 次提交
    • W
      KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Wanpeng Li 提交于
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  There is no hardware virtual timer on Intel for guest
      like ARM, so both programming timer in guest and the emulated timer fires
      incur vmexits.  This patch tries to avoid vmexit when the emulated timer
      fires, at least in dedicated instance scenario when nohz_full is enabled.
      
      In that case, the emulated timers can be offload to the nearest busy
      housekeeping cpus since APICv has been found for several years in server
      processors. The guest timer interrupt can then be injected via posted interrupts,
      which are delivered by the housekeeping cpu once the emulated timer fires.
      
      The host should tuned so that vCPUs are placed on isolated physical
      processors, and with several pCPUs surplus for busy housekeeping.
      If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
      ~3% redis performance benefit can be observed on Skylake server, and the
      number of external interrupt vmexits drops substantially.  Without patch
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )
      
      While with patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
      EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0c5f81da
  9. 18 7月, 2019 1 次提交
    • W
      KVM: LAPIC: Make lapic timer unpinned · 4d151bf3
      Wanpeng Li 提交于
      Commit 61abdbe0 ("kvm: x86: make lapic hrtimer pinned") pinned the
      lapic timer to avoid to wait until the next kvm exit for the guest to
      see KVM_REQ_PENDING_TIMER set. There is another solution to give a kick
      after setting the KVM_REQ_PENDING_TIMER bit, make lapic timer unpinned
      will be used in follow up patches.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4d151bf3
  10. 16 7月, 2019 1 次提交
  11. 06 7月, 2019 1 次提交
  12. 05 7月, 2019 2 次提交
  13. 03 7月, 2019 1 次提交
    • W
      KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC · bb34e690
      Wanpeng Li 提交于
      Thomas reported that:
      
       | Background:
       |
       |    In preparation of supporting IPI shorthands I changed the CPU offline
       |    code to software disable the local APIC instead of just masking it.
       |    That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
       |    register.
       |
       | Failure:
       |
       |    When the CPU comes back online the startup code triggers occasionally
       |    the warning in apic_pending_intr_clear(). That complains that the IRRs
       |    are not empty.
       |
       |    The offending vector is the local APIC timer vector who's IRR bit is set
       |    and stays set.
       |
       | It took me quite some time to reproduce the issue locally, but now I can
       | see what happens.
       |
       | It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
       | (and hardware support) it behaves correctly.
       |
       | Here is the series of events:
       |
       |     Guest CPU
       |
       |     goes down
       |
       |       native_cpu_disable()
       |
       | 			apic_soft_disable();
       |
       |     play_dead()
       |
       |     ....
       |
       |     startup()
       |
       |       if (apic_enabled())
       |         apic_pending_intr_clear()	<- Not taken
       |
       |      enable APIC
       |
       |         apic_pending_intr_clear()	<- Triggers warning because IRR is stale
       |
       | When this happens then the deadline timer or the regular APIC timer -
       | happens with both, has fired shortly before the APIC is disabled, but the
       | interrupt was not serviced because the guest CPU was in an interrupt
       | disabled region at that point.
       |
       | The state of the timer vector ISR/IRR bits:
       |
       |     	     	       	        ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      1
       |
       | Now one would assume that the IRR is cleared after the INIT reset, but this
       | happens only on CPU0.
       |
       | Why?
       |
       | Because our CPU0 hotplug is just for testing to make sure nothing breaks
       | and goes through an NMI wakeup vehicle because INIT would send it through
       | the boots-trap code which is not really working if that CPU was not
       | physically unplugged.
       |
       | Now looking at a real world APIC the situation in that case is:
       |
       |     	     	       	      	ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      0
       |
       | Why?
       |
       | Once the dying CPU reenables interrupts the pending interrupt gets
       | delivered as a spurious interupt and then the state is clear.
       |
       | While that CPU0 hotplug test case is surely an esoteric issue, the APIC
       | emulation is still wrong, Even if the play_dead() code would not enable
       | interrupts then the pending IRR bit would turn into an ISR .. interrupt
       | when the APIC is reenabled on startup.
      
      From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
      * Pending interrupts in the IRR and ISR registers are held and require
        masking or handling by the CPU.
      
      In Thomas's testing, hardware cpu will not respect soft disable LAPIC
      when IRR has already been set or APICv posted-interrupt is in flight,
      so we can skip soft disable APIC checking when clearing IRR and set ISR,
      continue to respect soft disable APIC when attempting to set IRR.
      Reported-by: NRong Chen <rong.a.chen@intel.com>
      Reported-by: NFeng Tang <feng.tang@intel.com>
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bb34e690
  14. 20 6月, 2019 1 次提交
    • S
      KVM: x86: Fix apic dangling pointer in vcpu · a251fb90
      Saar Amar 提交于
      The function kvm_create_lapic() attempts to allocate the apic structure
      and sets a pointer to it in the virtual processor structure. However, if
      get_zeroed_page() failed, the function frees the apic chunk, but forgets
      to set the pointer in the vcpu to NULL. It's not a security issue since
      there isn't a use of that pointer if kvm_create_lapic() returns error,
      but it's more accurate that way.
      Signed-off-by: NSaar Amar <saaramar@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a251fb90
  15. 19 6月, 2019 1 次提交
  16. 18 6月, 2019 2 次提交
  17. 05 6月, 2019 3 次提交
    • W
      KVM: LAPIC: Optimize timer latency further · b6c4bc65
      Wanpeng Li 提交于
      Advance lapic timer tries to hidden the hypervisor overhead between the
      host emulated timer fires and the guest awares the timer is fired. However,
      it just hidden the time between apic_timer_fn/handle_preemption_timer ->
      wait_lapic_expire, instead of the real position of vmentry which is
      mentioned in the orignial commit d0659d94 ("KVM: x86: add option to
      advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
      the end of wait_lapic_expire and before world switch on my haswell desktop.
      
      This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
      it takes the real overhead time between apic_timer_fn/handle_preemption_timer
      and before world switch into consideration when adaptively tuning timer
      advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
      on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b6c4bc65
    • W
      KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit · ec0671d5
      Wanpeng Li 提交于
      wait_lapic_expire() call was moved above guest_enter_irqoff() because of
      its tracepoint, which violated the RCU extended quiescent state invoked
      by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
      below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
      VM-Enter, but trace it after VM-Exit. This can help us to move
      wait_lapic_expire() just before vmentry in the later patch.
      
      [1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state")
      [2] https://patchwork.kernel.org/patch/7821111/
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      [Track whether wait_lapic_expire was called, and do not invoke the tracepoint
       if not. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ec0671d5
    • W
      KVM: LAPIC: Extract adaptive tune timer advancement logic · 84ea3aca
      Wanpeng Li 提交于
      Extract adaptive tune timer advancement logic to a single function.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      [Rename new function. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      84ea3aca
  18. 01 5月, 2019 5 次提交
  19. 19 4月, 2019 5 次提交
    • S
      KVM: lapic: Convert guest TSC to host time domain if necessary · b6aa57c6
      Sean Christopherson 提交于
      To minimize the latency of timer interrupts as observed by the guest,
      KVM adjusts the values it programs into the host timers to account for
      the host's overhead of programming and handling the timer event.  In
      the event that the adjustments are too aggressive, i.e. the timer fires
      earlier than the guest expects, KVM busy waits immediately prior to
      entering the guest.
      
      Currently, KVM manually converts the delay from nanoseconds to clock
      cycles.  But, the conversion is done in the guest's time domain, while
      the delay occurs in the host's time domain.  This is perfectly ok when
      the guest and host are using the same TSC ratio, but if the guest is
      using a different ratio then the delay may not be accurate and could
      wait too little or too long.
      
      When the guest is not using the host's ratio, convert the delay from
      guest clock cycles to host nanoseconds and use ndelay() instead of
      __delay() to provide more accurate timing.  Because converting to
      nanoseconds is relatively expensive, e.g. requires division and more
      multiplication ops, continue using __delay() directly when guest and
      host TSCs are running at the same ratio.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b6aa57c6
    • S
      KVM: lapic: Allow user to disable adaptive tuning of timer advancement · c3941d9e
      Sean Christopherson 提交于
      The introduction of adaptive tuning of lapic timer advancement did not
      allow for the scenario where userspace would want to disable adaptive
      tuning but still employ timer advancement, e.g. for testing purposes or
      to handle a use case where adaptive tuning is unable to settle on a
      suitable time.  This is epecially pertinent now that KVM places a hard
      threshold on the maximum advancment time.
      
      Rework the timer semantics to accept signed values, with a value of '-1'
      being interpreted as "use adaptive tuning with KVM's internal default",
      and any other value being used as an explicit advancement time, e.g. a
      time of '0' effectively disables advancement.
      
      Note, this does not completely restore the original behavior of
      lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
      which is necessary to support autotuning, userspace could adjust
      lapic_timer_advance_ns for *running* vCPU.  With per-vCPU tracking, the
      module params are snapshotted at vCPU creation, i.e. applying a new
      advancement effectively requires restarting a VM.
      
      Dynamically updating a running vCPU is possible, e.g. a helper could be
      added to retrieve the desired delay, choosing between the global module
      param and the per-VCPU value depending on whether or not auto-tuning is
      (globally) enabled, but introduces a great deal of complexity.  The
      wrapper itself is not complex, but understanding and documenting the
      effects of dynamically toggling auto-tuning and/or adjusting the timer
      advancement is nigh impossible since the behavior would be dependent on
      KVM's implementation as well as compiler optimizations.  In other words,
      providing stable behavior would require extremely careful consideration
      now and in the future.
      
      Given that the expected use of a manually-tuned timer advancement is to
      "tune once, run many", use the vastly simpler approach of recognizing
      changes to the module params only when creating a new vCPU.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c3941d9e
    • S
      KVM: lapic: Track lapic timer advance per vCPU · 39497d76
      Sean Christopherson 提交于
      Automatically adjusting the globally-shared timer advancement could
      corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
      the advancement value.  That could be partially fixed by using a local
      variable for the arithmetic, but it would still be susceptible to a
      race when setting timer_advance_adjust_done.
      
      And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
      correct calibration for a given vCPU may not apply to all vCPUs.
      
      Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
      effectively violated when finding a stable advancement takes an extended
      amount of timer.
      
      Opportunistically change the definition of lapic_timer_advance_ns to
      a u32 so that it matches the style of struct kvm_timer.  Explicitly
      pass the param to kvm_create_lapic() so that it doesn't have to be
      exposed to lapic.c, thus reducing the probability of unintentionally
      using the global value instead of the per-vCPU value.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      39497d76
    • S
      KVM: lapic: Disable timer advancement if adaptive tuning goes haywire · 57bf67e7
      Sean Christopherson 提交于
      To minimize the latency of timer interrupts as observed by the guest,
      KVM adjusts the values it programs into the host timers to account for
      the host's overhead of programming and handling the timer event.  Now
      that the timer advancement is automatically tuned during runtime, it's
      effectively unbounded by default, e.g. if KVM is running as L1 the
      advancement can measure in hundreds of milliseconds.
      
      Disable timer advancement if adaptive tuning yields an advancement of
      more than 5000ns, as large advancements can break reasonable assumptions
      of the guest, e.g. that a timer configured to fire after 1ms won't
      arrive on the next instruction.  Although KVM busy waits to mitigate the
      case of a timer event arriving too early, complications can arise when
      shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test
      will fail when its "host" exits on interrupts as KVM may inject the INTR
      before the guest executes STI+HLT.   Arguably the unit test is "broken"
      in the sense that delaying a timer interrupt by 1ms doesn't technically
      guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
      assumption that KVM should support.
      
      Furthermore, an unbounded advancement also effectively unbounds the time
      spent busy waiting, e.g. if the guest programs a timer with a very large
      delay.
      
      5000ns is a somewhat arbitrary threshold.  When running on bare metal,
      which is the intended use case, timer advancement is expected to be in
      the general vicinity of 1000ns.  5000ns is high enough that false
      positives are unlikely, while not being so high as to negatively affect
      the host's performance/stability.
      
      Note, a future patch will enable userspace to disable KVM's adaptive
      tuning, which will allow priveleged userspace will to specifying an
      advancement value in excess of this arbitrary threshold in order to
      satisfy an abnormal use case.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57bf67e7
    • L
      KVM: x86: Consider LAPIC TSC-Deadline timer expired if deadline too short · c09d65d9
      Liran Alon 提交于
      If guest sets MSR_IA32_TSCDEADLINE to value such that in host
      time-domain it's shorter than lapic_timer_advance_ns, we can
      reach a case that we call hrtimer_start() with expiration time set at
      the past.
      
      Because lapic_timer.timer is init with HRTIMER_MODE_ABS_PINNED, it
      is not allowed to run in softirq and therefore will never expire.
      
      To avoid such a scenario, verify that deadline expiration time is set on
      host time-domain further than (now + lapic_timer_advance_ns).
      
      A future patch can also consider adding a min_timer_deadline_ns module parameter,
      similar to min_timer_period_us to avoid races that amount of ns it takes
      to run logic could still call hrtimer_start() with expiration timer set
      at the past.
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c09d65d9
  20. 16 4月, 2019 1 次提交
  21. 21 2月, 2019 1 次提交
    • B
      kvm: x86: Add memcg accounting to KVM allocations · 254272ce
      Ben Gardon 提交于
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In x86, they include:
      	vcpu->arch.pio_data
      There allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: NBen Gardon <bgardon@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      254272ce
  22. 26 1月, 2019 1 次提交
    • G
      KVM: x86: Mark expected switch fall-throughs · b2869f28
      Gustavo A. R. Silva 提交于
      In preparation to enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough.
      Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b2869f28
  23. 15 12月, 2018 1 次提交
  24. 27 11月, 2018 2 次提交
    • Y
      KVM: x86: fix empty-body warnings · 354cb410
      Yi Wang 提交于
      We get the following warnings about empty statements when building
      with 'W=1':
      
      arch/x86/kvm/lapic.c:632:53: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1907:42: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1936:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      arch/x86/kvm/lapic.c:1975:44: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      Rework the debug helper macro to get rid of these warnings.
      Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      354cb410
    • W
      KVM: LAPIC: Fix pv ipis use-before-initialization · 38ab012f
      Wanpeng Li 提交于
      Reported by syzkaller:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
       PGD 800000040410c067 P4D 800000040410c067 PUD 40410d067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 2567 Comm: poc Tainted: G           OE     4.19.0-rc5 #16
       RIP: 0010:kvm_pv_send_ipi+0x94/0x350 [kvm]
       Call Trace:
        kvm_emulate_hypercall+0x3cc/0x700 [kvm]
        handle_vmcall+0xe/0x10 [kvm_intel]
        vmx_handle_exit+0xc1/0x11b0 [kvm_intel]
        vcpu_enter_guest+0x9fb/0x1910 [kvm]
        kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
        kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
        do_vfs_ioctl+0xa5/0x690
        ksys_ioctl+0x6d/0x80
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x83/0x6e0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The reason is that the apic map has not yet been initialized, the testcase
      triggers pv_send_ipi interface by vmcall which results in kvm->arch.apic_map
      is dereferenced. This patch fixes it by checking whether or not apic map is
      NULL and bailing out immediately if that is the case.
      
      Fixes: 4180bf1b (KVM: X86: Implement "send IPI" hypercall)
      Reported-by: NWei Wu <ww9210@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wei Wu <ww9210@gmail.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      38ab012f
  25. 29 10月, 2018 1 次提交
  26. 17 10月, 2018 1 次提交
    • V
      x86/kvm/lapic: preserve gfn_to_hva_cache len on cache reinit · a7c42bb6
      Vitaly Kuznetsov 提交于
      vcpu->arch.pv_eoi is accessible through both HV_X64_MSR_VP_ASSIST_PAGE and
      MSR_KVM_PV_EOI_EN so on migration userspace may try to restore them in any
      order. Values match, however, kvm_lapic_enable_pv_eoi() uses different
      length: for Hyper-V case it's the whole struct hv_vp_assist_page, for KVM
      native case it is 8. In case we restore KVM-native MSR last cache will
      be reinitialized with len=8 so trying to access VP assist page beyond
      8 bytes with kvm_read_guest_cached() will fail.
      
      Check if we re-initializing cache for the same address and preserve length
      in case it was greater.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a7c42bb6