1. 19 5月, 2016 9 次提交
  2. 13 5月, 2016 1 次提交
    • C
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger 提交于
      Some wakeups should not be considered a sucessful poll. For example on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable - letting all vCPUs poll all the time for
      transactional like workload, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups if they
      should be considered a good/bad wakeups in regard to polls.
      
      For s390 the implementation will fence of halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
      we prefer to not poll, unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3491caf2
  3. 12 5月, 2016 1 次提交
  4. 09 5月, 2016 1 次提交
  5. 03 5月, 2016 1 次提交
  6. 29 4月, 2016 1 次提交
    • B
      KVM: x86: fix ordering of cr0 initialization code in vmx_cpu_reset · f2463247
      Bruce Rogers 提交于
      Commit d28bc9dd reversed the order of two lines which initialize cr0,
      allowing the current (old) cr0 value to mess up vcpu initialization.
      This was observed in the checks for cr0 X86_CR0_WP bit in the context of
      kvm_mmu_reset_context(). Besides, setting vcpu->arch.cr0 after vmx_set_cr0()
      is completely redundant. Change the order back to ensure proper vcpu
      initialization.
      
      The combination of booting with ovmf firmware when guest vcpus > 1 and kvm's
      ept=N option being set results in a VM-entry failure. This patch fixes that.
      
      Fixes: d28bc9dd ("KVM: x86: INIT and reset sequences are different")
      Cc: stable@vger.kernel.org
      Signed-off-by: NBruce Rogers <brogers@suse.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      f2463247
  7. 20 4月, 2016 3 次提交
  8. 11 4月, 2016 3 次提交
    • P
      KVM: x86: mask CPUID(0xD,0x1).EAX against host value · 316314ca
      Paolo Bonzini 提交于
      This ensures that the guest doesn't see XSAVE extensions
      (e.g. xgetbv1 or xsavec) that the host lacks.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      316314ca
    • D
      kvm: x86: do not leak guest xcr0 into host interrupt handlers · fc5b7f3b
      David Matlack 提交于
      An interrupt handler that uses the fpu can kill a KVM VM, if it runs
      under the following conditions:
       - the guest's xcr0 register is loaded on the cpu
       - the guest's fpu context is not loaded
       - the host is using eagerfpu
      
      Note that the guest's xcr0 register and fpu context are not loaded as
      part of the atomic world switch into "guest mode". They are loaded by
      KVM while the cpu is still in "host mode".
      
      Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
      interrupt handler will look something like this:
      
      if (irq_fpu_usable()) {
              kernel_fpu_begin();
      
              [... code that uses the fpu ...]
      
              kernel_fpu_end();
      }
      
      As long as the guest's fpu is not loaded and the host is using eager
      fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
      returns true). The interrupt handler proceeds to use the fpu with
      the guest's xcr0 live.
      
      kernel_fpu_begin() saves the current fpu context. If this uses
      XSAVE[OPT], it may leave the xsave area in an undesirable state.
      According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
      if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
      xcr0[i] == 0 following an XSAVE.
      
      kernel_fpu_end() restores the fpu context. Now if any bit i in
      XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
      fault is trapped and SIGSEGV is delivered to the current process.
      
      Only pre-4.2 kernels appear to be vulnerable to this sequence of
      events. Commit 653f52c3 ("kvm,x86: load guest FPU context more eagerly")
      from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
      
      This patch fixes the bug by keeping the host's xcr0 loaded outside
      of the interrupts-disabled region where KVM switches into guest mode.
      
      Cc: stable@vger.kernel.org
      Suggested-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      [Move load after goto cancel_injection. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fc5b7f3b
    • X
      KVM: MMU: fix permission_fault() · 7a98205d
      Xiao Guangrong 提交于
      kvm-unit-tests complained about the PFEC is not set properly, e.g,:
      test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
      expected 5
      Dump mapping: address: 0x123400000000
      ------L4: 3e95007
      ------L3: 3e96007
      ------L2: 2000083
      
      It's caused by the reason that PFEC returned to guest is copied from the
      PFEC triggered by shadow page table
      
      This patch fixes it and makes the logic of updating errcode more clean
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      [Do not assume pfec.p=1. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a98205d
  9. 08 4月, 2016 1 次提交
  10. 07 4月, 2016 1 次提交
  11. 05 4月, 2016 1 次提交
    • L
      kvm: x86: make lapic hrtimer pinned · 61abdbe0
      Luiz Capitulino 提交于
      When a vCPU runs on a nohz_full core, the hrtimer used by
      the lapic emulation code can be migrated to another core.
      When this happens, it's possible to observe milisecond
      latency when delivering timer IRQs to KVM guests.
      
      The huge latency is mainly due to the fact that
      apic_timer_fn() expects to run during a kvm exit. It
      sets KVM_REQ_PENDING_TIMER and let it be handled on kvm
      entry. However, if the timer fires on a different core,
      we have to wait until the next kvm exit for the guest
      to see KVM_REQ_PENDING_TIMER set.
      
      This problem became visible after commit 9642d18e. This
      commit changed the timer migration code to always attempt
      to migrate timers away from nohz_full cores. While it's
      discussable if this is correct/desirable (I don't think
      it is), it's clear that the lapic emulation code has
      a requirement on firing the hrtimer in the same core
      where it was started. This is achieved by making the
      hrtimer pinned.
      
      Lastly, note that KVM has code to migrate timers when a
      vCPU is scheduled to run in different core. However, this
      forced migration may fail. When this happens, we can have
      the same problem. If we want 100% correctness, we'll have
      to modify apic_timer_fn() to cause a kvm exit when it runs
      on a different core than the vCPU. Not sure if this is
      possible.
      
      Here's a reproducer for the issue being fixed:
      
       1. Set all cores but core0 to be nohz_full cores
       2. Start a guest with a single vCPU
       3. Trace apic_timer_fn() and kvm_inject_apic_timer_irqs()
      
      You'll see that apic_timer_fn() will run in core0 while
      kvm_inject_apic_timer_irqs() runs in a different core. If
      you get both on core0, try running a program that takes 100%
      of the CPU and pin it to core0 to force the vCPU out.
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      61abdbe0
  12. 02 4月, 2016 2 次提交
  13. 01 4月, 2016 4 次提交
    • Y
      kvm: set page dirty only if page has been writable · 14f47605
      Yu Zhao 提交于
      In absence of shadow dirty mask, there is no need to set page dirty
      if page has never been writable. This is a tiny optimization but
      good to have for people who care much about dirty page tracking.
      Signed-off-by: NYu Zhao <yuzhao@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      14f47605
    • P
      KVM: x86: reduce default value of halt_poll_ns parameter · 14ebda33
      Paolo Bonzini 提交于
      Windows lets applications choose the frequency of the timer tick,
      and in Windows 10 the maximum rate was changed from 1024 Hz to
      2048 Hz.  Unfortunately, because of the way the Windows API
      works, most applications who need a higher rate than the default
      64 Hz will just do
      
         timeGetDevCaps(&tc, sizeof(tc));
         timeBeginPeriod(tc.wPeriodMin);
      
      and pick the maximum rate.  This causes very high CPU usage when
      playing media or games on Windows 10, even if the guest does not
      actually use the CPU very much, because the frequent timer tick
      causes halt_poll_ns to kick in.
      
      There is no really good solution, especially because Microsoft
      could sooner or later bump the limit to 4096 Hz, but for now
      the best we can do is lower a bit the upper limit for
      halt_poll_ns. :-(
      Reported-by: NJon Panozzo <jonp@lime-technology.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      14ebda33
    • P
      KVM: Hyper-V: do not do hypercall userspace exits if SynIC is disabled · a2b5c3c0
      Paolo Bonzini 提交于
      If SynIC is disabled, there is nothing that userspace can do to
      handle these exits; on the other hand, userspace probably will
      not know about KVM_EXIT_HYPERV_HCALL and complain about it or
      even exit.  Just prevent anything bad from happening by handling
      the hypercall in KVM and returning an "invalid hypercall" code.
      
      Fixes: 83326e43
      Cc: Andrey Smetanin <irqlevel@gmail.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a2b5c3c0
    • Y
      KVM: x86: Inject pending interrupt even if pending nmi exist · 321c5658
      Yuki Shibuya 提交于
      Non maskable interrupts (NMI) are preferred to interrupts in current
      implementation. If a NMI is pending and NMI is blocked by the result
      of nmi_allowed(), pending interrupt is not injected and
      enable_irq_window() is not executed, even if interrupts injection is
      allowed.
      
      In old kernel (e.g. 2.6.32), schedule() is often called in NMI context.
      In this case, interrupts are needed to execute iret that intends end
      of NMI. The flag of blocking new NMI is not cleared until the guest
      execute the iret, and interrupts are blocked by pending NMI. Due to
      this, iret can't be invoked in the guest, and the guest is starved
      until block is cleared by some events (e.g. canceling injection).
      
      This patch injects pending interrupts, when it's allowed, even if NMI
      is blocked. And, If an interrupts is pending after executing
      inject_pending_event(), enable_irq_window() is executed regardless of
      NMI pending counter.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NYuki Shibuya <shibuya.yk@ncos.nec.co.jp>
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      321c5658
  14. 31 3月, 2016 1 次提交
    • P
      perf/x86/amd/ibs: Fix pmu::stop() nesting · 85dc6002
      Peter Zijlstra 提交于
      Patch 5a50f529 ("perf/x86/ibs: Fix race with IBS_STARTING state")
      closed a big hole while opening another, smaller hole.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: 5a50f529 ("perf/x86/ibs: Fix race with IBS_STARTING state")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      85dc6002
  15. 29 3月, 2016 8 次提交
  16. 26 3月, 2016 2 次提交
    • S
      ACPI / processor: Request native thermal interrupt handling via _OSC · a2121167
      Srinivas Pandruvada 提交于
      There are several reports of freeze on enabling HWP (Hardware PStates)
      feature on Skylake-based systems by the Intel P-states driver. The root
      cause is identified as the HWP interrupts causing BIOS code to freeze.
      
      HWP interrupts use the thermal LVT which can be handled by Linux
      natively, but on the affected Skylake-based systems SMM will respond
      to it by default.  This is a problem for several reasons:
       - On the affected systems the SMM thermal LVT handler is broken (it
         will crash when invoked) and a BIOS update is necessary to fix it.
       - With thermal interrupt handled in SMM we lose all of the reporting
         features of the arch/x86/kernel/cpu/mcheck/therm_throt driver.
       - Some thermal drivers like x86-package-temp depend on the thermal
         threshold interrupts signaled via the thermal LVT.
       - The HWP interrupts are useful for debugging and tuning
         performance (if the kernel can handle them).
      The native handling of thermal interrupts needs to be enabled
      because of that.
      
      This requires some way to tell SMM that the OS can handle thermal
      interrupts.  That can be done by using _OSC/_PDC in processor
      scope very early during ACPI initialization.
      
      The meaning of _OSC/_PDC bit 12 in processor scope is whether or
      not the OS supports native handling of interrupts for Collaborative
      Processor Performance Control (CPPC) notifications.  Since on
      HWP-capable systems CPPC is a firmware interface to HWP, setting
      this bit effectively tells the firmware that the OS will handle
      thermal interrupts natively going forward.
      
      For details on _OSC/_PDC refer to:
      http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html
      
      To implement the _OSC/_PDC handshake as described, introduce a new
      function, acpi_early_processor_osc(), that walks the ACPI
      namespace looking for ACPI processor objects and invokes _OSC for
      them with bit 12 in the capabilities buffer set and terminates the
      namespace walk on the first success.
      
      Also modify intel_thermal_interrupt() to clear HWP status bits in
      the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents
      them from firing continuously).
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      [ rjw: Subject & changelog, function rename ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      a2121167
    • A
      mm, kasan: stackdepot implementation. Enable stackdepot for SLAB · cd11016e
      Alexander Potapenko 提交于
      Implement the stack depot and provide CONFIG_STACKDEPOT.  Stack depot
      will allow KASAN store allocation/deallocation stack traces for memory
      chunks.  The stack traces are stored in a hash table and referenced by
      handles which reside in the kasan_alloc_meta and kasan_free_meta
      structures in the allocated memory chunks.
      
      IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
      duplication.
      
      Right now stackdepot support is only enabled in SLAB allocator.  Once
      KASAN features in SLAB are on par with those in SLUB we can switch SLUB
      to stackdepot as well, thus removing the dependency on SLUB stack
      bookkeeping, which wastes a lot of memory.
      
      This patch is based on the "mm: kasan: stack depots" patch originally
      prepared by Dmitry Chernenkov.
      
      Joonsoo has said that he plans to reuse the stackdepot code for the
      mm/page_owner.c debugging facility.
      
      [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t]
      [aryabinin@virtuozzo.com: comment style fixes]
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd11016e