1. Dec 18, 2009 (1 commit)
    • x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system · 18374d89
      Committed by Suresh Siddha
      John Blackwood reported:
      > on an older Dell PowerEdge 6650 system with 8 cpus (4 are hyper-threaded),
      > and a 32-bit (x86) kernel, once you change the irq smp_affinity of an irq
      > to be less than all cpus in the system, you can never really change the
      > irq smp_affinity back to all cpus in the system (0xff) again,
      > even though no error status is returned on the "/bin/echo ff >
      > /proc/irq/[n]/smp_affinity" operation.
      >
      > This is due to the fact that BAD_APICID has the same value as
      > all cpus (0xff) on 32-bit kernels, and thus the value returned from
      > set_desc_affinity() via the cpu_mask_to_apicid_and() function is treated
      > as a failure in set_ioapic_affinity_irq_desc(), and no affinity changes
      > are made.
      
      set_desc_affinity() is already checking if the incoming cpu mask
      intersects with the cpu online mask or not. So there is no need
      for the apic op cpu_mask_to_apicid_and() to check again
      and return BAD_APICID.
      
      Remove the BAD_APICID return value from cpu_mask_to_apicid_and()
      and also fix set_desc_affinity() to return -1 instead of using BAD_APICID
      to represent error conditions (as cpu_mask_to_apicid_and() can return
      logical or physical apicid values and BAD_APICID is really meant to
      represent a bad physical apic id).
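      
      A minimal C sketch of the resulting shape, assuming this era's
      io_apic.c helpers; the dest_id out-parameter and exact names are
      paraphrased from the description above, not copied from the diff:
      
      static int
      set_desc_affinity(struct irq_desc *desc, const struct cpumask *mask,
                        unsigned int *dest_id)
      {
          struct irq_cfg *cfg = desc->chip_data;
      
          /* the only validity check; errors via return code, not BAD_APICID */
          if (!cpumask_intersects(mask, cpu_online_mask))
              return -1;
      
          if (assign_irq_vector(desc->irq, cfg, mask))
              return -1;
      
          cpumask_copy(desc->affinity, mask);
      
          /* may legitimately be 0xff (all 8 cpus) on 32-bit; not an error */
          *dest_id = apic->cpu_mask_to_apicid_and(desc->affinity, cfg->domain);
          return 0;
      }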
      Reported-by: John Blackwood <john.blackwood@ccur.com>
      Root-caused-by: John Blackwood <john.blackwood@ccur.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <1261103386.2535.409.camel@sbs-t61>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  2. Dec 17, 2009 (3 commits)
  3. Dec 16, 2009 (1 commit)
  4. Dec 15, 2009 (1 commit)
    • x86: Split swiotlb initialization into two stages · 186a2502
      Committed by FUJITA Tomonori
      Commit f4780ca0 moved swiotlb
      initialization before dma32_free_bootmem(). It was supposed to
      fix a bug introduced by commit
      75f1cdf1: we used to initialize
      SWIOTLB right after dma32_free_bootmem, so with a broken BIOS we
      wrongly stole the memory area allocated for GART earlier.
      
      However, the above commit introduced another problem, which
      likely breaks machines with a huge amount of memory. Such a box
      uses the majority of DMA32_ZONE, so there is no memory left for
      swiotlb.
      
      With this patch, the x86 IOMMU initialization sequence is as
      follows (sketched in C after the list):
      
      1. We set swiotlb to 1 in the case of (max_pfn > MAX_DMA32_PFN
         && !no_iommu). If swiotlb usage is forced by the boot option,
         we go to step 3 and finish (we don't try to detect IOMMUs).
      
      2. We call the detection functions of all the IOMMUs. The
         detection function sets x86_init.iommu.iommu_init to the IOMMU
         initialization function (so we can avoid calling the
         initialization functions of all the IOMMUs needlessly).
      
      3. We initialize swiotlb (and set dma_ops to swiotlb_dma_ops) if
         swiotlb is set to 1.
      
      4. If the IOMMU initialization function doesn't need swiotlb
         (e.g. the initialization was successful), it sets swiotlb to zero.
      
      5. If we find that swiotlb is set to zero, we free the swiotlb
         resources.
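      
      A hedged C sketch of the two stages; pci_swiotlb_init(),
      swiotlb_free() and x86_init.iommu.iommu_init match that era's tree,
      while swiotlb_forced() and detect_all_iommus() are illustrative
      stand-ins for the boot-option check and per-IOMMU detect calls:
      
      void __init pci_iommu_alloc(void)
      {
          /* step 1: decide early whether swiotlb may be needed at all */
          if (max_pfn > MAX_DMA32_PFN && !no_iommu)
              swiotlb = 1;
      
          /* step 2: detection only records x86_init.iommu.iommu_init */
          if (!swiotlb_forced())          /* illustrative helper */
              detect_all_iommus();        /* gart/calgary/vt-d/amd detects */
      
          /* step 3: grab bounce buffers while low bootmem is still free */
          if (swiotlb)
              pci_swiotlb_init();         /* dma_ops = &swiotlb_dma_ops */
      }
      
      static int __init pci_iommu_init(void)
      {
          /* step 4: a successful IOMMU init clears the swiotlb flag... */
          x86_init.iommu.iommu_init();
      
          /* step 5: ...and then the early swiotlb allocation is returned */
          if (!swiotlb)
              swiotlb_free();
          return 0;
      }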
      Reported-by: Yinghai Lu <yinghai@kernel.org>
      Reported-by: Roland Dreier <rdreier@cisco.com>
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      LKML-Reference: <20091215204729A.fujita.tomonori@lab.ntt.co.jp>
      Tested-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. Dec 13, 2009 (1 commit)
  6. Dec 12, 2009 (1 commit)
    • x86, msr: Add support for non-contiguous cpumasks · 50542251
      Committed by Borislav Petkov
      The current rd/wrmsr_on_cpus helpers assume that the supplied
      cpumasks are contiguous. However, there are machines out there
      like some K8 multinode Opterons which have a non-contiguous core
      enumeration on each node (e.g. cores 0,2 on node 0 instead of 0,1), see
      http://www.gossamer-threads.com/lists/linux/kernel/1160268.
      
      This patch fixes out-of-bounds writes (see URL above) by adding per-CPU
      msr structs which are used on the respective cores.
      
      Additionally, two helpers, msrs_{alloc,free}, are provided for use by
      the callers of the MSR accessors.
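      
      A hedged usage sketch; MSR_K8_SYSCFG is only an illustrative
      register, and the access pattern assumes msrs_alloc() hands back
      per-CPU storage as described above:
      
      #include <asm/msr.h>
      
      static int dump_syscfg(const struct cpumask *mask)
      {
          struct msr *msrs, *reg;
          int cpu;
      
          msrs = msrs_alloc();            /* one struct msr slot per CPU */
          if (!msrs)
              return -ENOMEM;
      
          /* holes in 'mask' are fine: each CPU writes only its own slot */
          rdmsr_on_cpus(mask, MSR_K8_SYSCFG, msrs);
      
          for_each_cpu(cpu, mask) {
              reg = per_cpu_ptr(msrs, cpu);
              pr_info("cpu%d SYSCFG: 0x%llx\n", cpu,
                      (unsigned long long)reg->q);
          }
      
          msrs_free(msrs);
          return 0;
      }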
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Mauro Carvalho Chehab <mchehab@redhat.com>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      LKML-Reference: <20091211171440.GD31998@aftab>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  7. Dec 11, 2009 (1 commit)
    • x86: Use find_e820() instead of hard coded trampoline address · 893f38d1
      Committed by Yinghai Lu
      Jens found the following crash/regression:
      
      [    0.000000] found SMP MP-table at [ffff8800000fdd80] fdd80
      [    0.000000] Kernel panic - not syncing: Overlapping early reservations 12-f011 MP-table mpc to 0-fff BIOS data page
      
      and
      
      [    0.000000] Kernel panic - not syncing: Overlapping early reservations 12-f011 MP-table mpc to 6000-7fff TRAMPOLINE
      
      and bisected it to b24c2a92 ("x86: Move find_smp_config()
      earlier and avoid bootmem usage").
      
      It turns out the BIOS is using the first 64k for mptable,
      without reserving it.
      
      So try to find a good range for the real-mode trampoline instead of
      hard coding it, in case some BIOS tries to use that range for
      something else (see the sketch below).
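      
      A hedged sketch of the approach, assuming this era's
      find_e820_area() and reserve_early() helpers; the call site is
      paraphrased from the description, not the diff:
      
      void __init reserve_trampoline_memory(void)
      {
          unsigned long mem;
      
          /* any free e820 range below 1MB will do; don't assume 0x6000 */
          mem = find_e820_area(0, 1 << 20, TRAMPOLINE_SIZE, PAGE_SIZE);
          if (mem == -1L)
              panic("Cannot allocate trampoline\n");
      
          trampoline_base = __va(mem);
          reserve_early(mem, mem + TRAMPOLINE_SIZE, "TRAMPOLINE");
      }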
      Reported-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Tested-by: Jens Axboe <jens.axboe@oracle.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      LKML-Reference: <4B21630A.6000308@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. Dec 10, 2009 (7 commits)
  9. Dec 09, 2009 (1 commit)
  10. Dec 06, 2009 (1 commit)
  11. Dec 05, 2009 (1 commit)
  12. Dec 04, 2009 (1 commit)
  13. Dec 03, 2009 (17 commits)
    • KVM: VMX: Fix comparison of guest efer with stale host value · d5696725
      Committed by Avi Kivity
      update_transition_efer() masks out some efer bits when deciding whether
      to switch the msr during guest entry; for example, NX is emulated using the
      mmu so we don't need to disable it, and LMA/LME are handled by the hardware.
      
      However, with shared msrs, the comparison is made against a stale value;
      at the time of the guest switch we may be running with another guest's efer.
      
      Fix by deferring the mask/compare to the actual point of guest entry.
      
      Noted by Marcelo.
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86 emulator: limit instructions to 15 bytes · eb3c79e6
      Committed by Avi Kivity
      While we are never normally passed an instruction that exceeds 15 bytes,
      smp games can cause us to attempt to interpret one, which will cause
      large latencies in non-preempt hosts.
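      
      A hedged sketch of the guard (the eip_orig field is an assumption);
      the x86 ISA caps instruction length at 15 bytes, so the decoder can
      simply refuse to fetch past that:
      
      static int check_fetch_limit(struct x86_emulate_ctxt *ctxt,
                                   unsigned long eip, unsigned int size)
      {
          /* eip_orig is assumed to hold rip at the start of decoding */
          if (eip + size - ctxt->decode.eip_orig > 15)
              return X86EMUL_UNHANDLEABLE;
          return X86EMUL_CONTINUE;
      }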
      
      Cc: stable@kernel.org
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86: Add KVM_GET/SET_VCPU_EVENTS · 3cfc3092
      Committed by Jan Kiszka
      This new IOCTL exports all previously user-invisible state related to
      exceptions, interrupts, and NMIs. Together with appropriate user space
      changes, this fixes sporadic problems with vmsave/restore, live
      migration and system reset.
      
      [avi: future-proof abi by adding a flags field]
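      
      A hedged userspace sketch of a save/restore round trip using the new
      IOCTL pair; error handling is minimal:
      
      #include <linux/kvm.h>
      #include <sys/ioctl.h>
      
      int copy_vcpu_events(int src_vcpu_fd, int dst_vcpu_fd)
      {
          struct kvm_vcpu_events events;
      
          /* pending exception/interrupt/NMI state travels with the vcpu */
          if (ioctl(src_vcpu_fd, KVM_GET_VCPU_EVENTS, &events) < 0)
              return -1;
          return ioctl(dst_vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
      }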
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86 shared msr infrastructure · 18863bdd
      Committed by Avi Kivity
      The various syscall-related MSRs are fairly expensive to switch.  Currently
      we switch them on every vcpu preemption, which is far too often:
      
      - if we're switching to a kernel thread (idle task, threaded interrupt,
        kernel-mode virtio server (vhost-net), for example) and back, then
        there's no need to switch those MSRs since kernel threasd won't
        be exiting to userspace.
      
      - if we're switching to another guest running an identical OS, most likely
        those MSRs will have the same value, so there's little point in reloading
        them.
      
      - if we're running the same OS on the guest and host, the MSRs will have
        identical values and reloading is unnecessary.
      
      This patch uses the new user return notifiers to implement last-minute
      switching, and checks the msr values to avoid unnecessary reloading.
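      
      A hedged sketch of the pattern (the real code keeps per-cpu shadow
      values and batches several syscall MSRs; names below are
      illustrative):
      
      #include <linux/user-return-notifier.h>
      
      struct shared_msr_state {
          struct user_return_notifier urn;
          u64 host_value;
          bool dirty;
      };
      
      static void restore_on_user_return(struct user_return_notifier *urn)
      {
          struct shared_msr_state *s =
              container_of(urn, struct shared_msr_state, urn);
      
          /* runs just before returning to userspace, not on every preempt */
          wrmsrl(MSR_STAR, s->host_value);
          s->dirty = false;
          user_return_notifier_unregister(urn);
      }
      
      static void set_guest_msr(struct shared_msr_state *s, u64 guest_value)
      {
          u64 cur;
      
          rdmsrl(MSR_STAR, cur);
          if (cur == guest_value)
              return;                 /* identical value: skip the reload */
      
          wrmsrl(MSR_STAR, guest_value);
          if (!s->dirty) {
              s->dirty = true;
              s->urn.on_user_return = restore_on_user_return;
              user_return_notifier_register(&s->urn);
          }
      }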
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: allow userspace to adjust kvmclock offset · afbcf7ab
      Committed by Glauber Costa
      When we migrate a kvm guest that uses pvclock between two hosts, we may
      suffer a large skew. This is because there can be significant differences
      between the monotonic clock of the hosts involved. When a new host with
      a much larger monotonic time starts running the guest, the view of time
      will be significantly impacted.
      
      The situation is much worse when we do the opposite, and migrate to a host with
      a smaller monotonic clock.
      
      This proposed ioctl will allow userspace to inform us of the monotonic
      clock value in the source host, so we can keep the time skew short and,
      more importantly, ensure the clock never goes backwards. Userspace may
      also need to retrieve the current data, since from the first migration
      onwards it won't be reflected by a simple call to clock_gettime() anymore.
      
      [marcelo: future-proof abi with a flags field]
      [jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]
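      
      A hedged userspace sketch of carrying the clock across a migration;
      struct kvm_clock_data is the ABI added here (clock, flags, padding):
      
      #include <linux/kvm.h>
      #include <sys/ioctl.h>
      
      int migrate_clock(int src_vm_fd, int dst_vm_fd)
      {
          struct kvm_clock_data data;
      
          /* source side: snapshot the current kvmclock reading */
          if (ioctl(src_vm_fd, KVM_GET_CLOCK, &data) < 0)
              return -1;
      
          /* destination side: resume from that value, never backwards */
          return ioctl(dst_vm_fd, KVM_SET_CLOCK, &data);
      }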
      Signed-off-by: Glauber Costa <glommer@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: SVM: Cleanup NMI singlestep · 6be7d306
      Committed by Jan Kiszka
      Push the NMI-related singlestep variable into vcpu_svm. It's dealing
      with an AMD-specific deficit, nothing generic for x86.
      Acked-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      
       arch/x86/include/asm/kvm_host.h |    1 -
       arch/x86/kvm/svm.c              |   12 +++++++-----
       2 files changed, 7 insertions(+), 6 deletions(-)
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: x86: Fix guest single-stepping while interruptible · 94fe45da
      Committed by Jan Kiszka
      Commit 705c5323 opened the doors of hell by unconditionally injecting
      single-step flags as long as guest_debug signaled this. This doesn't
      work when the guest branches into some interrupt or exception handler
      and triggers a vmexit with flag reloading.
      
      Fix it by saving cs:rip when user space requests single-stepping and
      restricting the trace flag injection to this guest code position.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: Xen PV-on-HVM guest support · ffde22ac
      Committed by Ed Swierk
      Support for Xen PV-on-HVM guests can be implemented almost entirely in
      userspace, except for handling one annoying MSR that maps a Xen
      hypercall blob into guest address space.
      
      A generic mechanism to delegate MSR writes to userspace seems overkill
      and risks encouraging similar MSR abuse in the future.  Thus this patch
      adds special support for the Xen HVM MSR.
      
      I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
      KVM which MSR the guest will write to, as well as the starting address
      and size of the hypercall blobs (one each for 32-bit and 64-bit) that
      userspace has loaded from files.  When the guest writes to the MSR, KVM
      copies one page of the blob from userspace to the guest.
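      
      A hedged userspace sketch; 0x40000000 is the MSR index Xen guests
      conventionally use, shown here only as an example value:
      
      #include <linux/kvm.h>
      #include <sys/ioctl.h>
      
      int enable_xen_hvm(int vm_fd, __u64 blob32, __u64 blob64)
      {
          struct kvm_xen_hvm_config cfg = {
              .flags        = 0,
              .msr          = 0x40000000, /* MSR the guest will write to */
              .blob_addr_32 = blob32,     /* userspace copy of 32-bit blob */
              .blob_size_32 = 1,          /* size in pages */
              .blob_addr_64 = blob64,
              .blob_size_64 = 1,
          };
      
          /* on a guest wrmsr, KVM copies one blob page into guest memory */
          return ioctl(vm_fd, KVM_XEN_HVM_CONFIG, &cfg);
      }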
      
      I've tested this patch with a hacked-up version of Gerd's userspace
      code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
      FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.
      
      [jan: fix i386 build warning]
      [avi: future proof abi with a flags field]
      Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: SVM: Support Pause Filter in AMD processors · 565d0998
      Committed by Mark Langsdorf
      New AMD processors (Family 0x10 models 8+) support the Pause
      Filter Feature.  This feature creates a new field in the VMCB
      called Pause Filter Count.  If Pause Filter Count is greater
      than 0 and intercepting PAUSEs is enabled, the processor will
      increment an internal counter when a PAUSE instruction occurs
      instead of intercepting.  When the internal counter reaches the
      Pause Filter Count value, a PAUSE intercept will occur.
      
      This feature can be used to detect contended spinlocks,
      especially when the lock holding VCPU is not scheduled.
      Rescheduling another VCPU prevents the VCPU seeking the
      lock from wasting its quantum by spinning idly.
      
      Experimental results show that most spinlocks are held
      for less than 1000 PAUSE cycles or more than a few
      thousand.  Default the Pause Filter Counter to 3000 to
      detect the contended spinlocks.
      
      Processor support for this feature is indicated by a CPUID
      bit.
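      
      A hedged sketch of wiring the filter up in VMCB setup; field and
      intercept names follow the SVM code of this era:
      
      static void init_pause_filter(struct vcpu_svm *svm)
      {
          struct vmcb_control_area *control = &svm->vmcb->control;
      
          if (boot_cpu_has(X86_FEATURE_PAUSEFILTER)) {
              /* intercept only after ~3000 back-to-back PAUSEs */
              control->pause_filter_count = 3000;
              control->intercept |= (1ULL << INTERCEPT_PAUSE);
          }
      }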
      
      On a 24 core system running 4 guests each with 16 VCPUs,
      this patch improved overall performance of each guest's
      32 job kernbench by approximately 3-5% when combined
       with a scheduler algorithm that caused the VCPU to
      sleep for a brief period. Further performance improvement
      may be possible with a more sophisticated yield algorithm.
      Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: VMX: Add support for Pause-Loop Exiting · 4b8d54f9
      Committed by Zhai, Edwin
      New NHM processors will support Pause-Loop Exiting by adding 2 VM-execution
      control fields:
      PLE_Gap    - upper bound on the amount of time between two successive
                   executions of PAUSE in a loop.
      PLE_Window - upper bound on the amount of time a guest is allowed to execute in
                   a PAUSE loop
      
      If the time between this execution of PAUSE and the previous one
      exceeds PLE_Gap, the processor considers this PAUSE to belong to a new
      loop. Otherwise, the processor determines the total execution time of
      this loop (since the 1st PAUSE in the loop) and triggers a VM exit if
      the total time exceeds PLE_Window.
      * Refer to SDM volume 3b, sections 21.6.13 and 22.1.3.
      
      Pause-Loop Exiting can be used to detect Lock-Holder Preemption, where
      one VP is scheduled out after taking a spinlock and other VPs spinning
      on the same lock are scheduled in, wasting CPU time.
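      
      A hedged sketch of enabling PLE during VMX setup; the control bit and
      VMCS field encodings are from the SDM, while the gap/window defaults
      shown are assumptions for illustration:
      
      #define SECONDARY_EXEC_PAUSE_LOOP_EXITING   0x00000400
      #define PLE_GAP                             0x00004020
      #define PLE_WINDOW                          0x00004022
      
      static int ple_gap    = 41;   /* assumed: ticks between loop PAUSEs */
      static int ple_window = 4096; /* assumed: ticks a loop may spin */
      
      static void vmx_enable_ple(u32 *secondary_exec_control)
      {
          if (ple_gap) {
              *secondary_exec_control |= SECONDARY_EXEC_PAUSE_LOOP_EXITING;
              vmcs_write32(PLE_GAP, ple_gap);
              vmcs_write32(PLE_WINDOW, ple_window);
          }
      }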
      
      Our tests indicate that most spinlocks are held for less than 212
      cycles. Performance tests show that with 2x LP over-commitment we can
      get a +2% perf improvement for kernel build (even more perf gain with
      more LPs).
      Signed-off-by: Zhai Edwin <edwin.zhai@intel.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: x86: Rework guest single-step flag injection and filtering · 91586a3b
      Committed by Jan Kiszka
      Push TF and RF injection and filtering on guest single-stepping into the
      vendor get/set_rflags callbacks. This makes the whole mechanism more
      robust wrt user space IOCTL order and instruction emulations.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86: Refactor guest debug IOCTL handling · 355be0b9
      Committed by Jan Kiszka
      Much of the so-far vendor-specific code for setting up guest debug can
      actually be handled by the generic code. This also fixes a minor
      deficit in the SVM part w.r.t. processing KVM_GUESTDBG_ENABLE.
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Activate Virtualization On Demand · 10474ae8
      Committed by Alexander Graf
      X86 CPUs need to have some magic happening to enable the virtualization
      extensions on them. This magic can result in unpleasant results for
      users, like blocking other VMMs from working (vmx) or using invalid TLB
      entries (svm).
      
      Currently KVM activates virtualization when the respective kernel module
      is loaded. This blocks us from autoloading KVM modules without breaking
      other VMMs.
      
      To circumvent this problem at least a bit, this patch introduces
      on-demand activation of virtualization: instead, virtualization is
      enabled on creation of the first virtual machine and disabled on
      destruction of the last one.
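      
      A hedged sketch of the refcounting this implies (the real helpers
      also handle enable failures and CPU hotplug):
      
      static int kvm_usage_count;     /* VMs currently in existence */
      
      static int hardware_enable_all(void)
      {
          spin_lock(&kvm_lock);
          if (++kvm_usage_count == 1)     /* first VM: turn VMX/SVM on */
              on_each_cpu(hardware_enable, NULL, 1);
          spin_unlock(&kvm_lock);
          return 0;
      }
      
      static void hardware_disable_all(void)
      {
          spin_lock(&kvm_lock);
          if (--kvm_usage_count == 0)     /* last VM gone: turn it off */
              on_each_cpu(hardware_disable, NULL, 1);
          spin_unlock(&kvm_lock);
      }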
      
      So using this, KVM can be easily autoloaded, while keeping other
      hypervisors usable.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Move irq ack notifier list to arch independent code · 136bdfee
      Committed by Gleb Natapov
      The mask irq notifier list is already there.
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Maintain back mapping from irqchip/pin to gsi · 3e71f88b
      Committed by Gleb Natapov
      Maintain a back mapping from irqchip/pin to gsi to speed up
      interrupt acknowledgment notifications.
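      
      A hedged sketch of the reverse map; the table shape is inferred from
      the description and the constants are assumptions:
      
      #define KVM_NR_IRQCHIPS     3   /* assumed: PIC master/slave, IOAPIC */
      #define KVM_IOAPIC_NUM_PINS 24
      
      struct kvm_irq_routing_table {
          int chip[KVM_NR_IRQCHIPS][KVM_IOAPIC_NUM_PINS]; /* pin -> gsi */
          /* forward routing entries elided */
      };
      
      static int pin_to_gsi(struct kvm_irq_routing_table *rt,
                            unsigned chip, unsigned pin)
      {
          return rt->chip[chip][pin];   /* -1 where no GSI is routed */
      }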
      
      [avi: build fix on non-x86/ia64]
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Move irq sharing information to irqchip level · 1a6e4a8c
      Committed by Gleb Natapov
      This removes the assumption that the maximum number of GSIs is smaller
      than the number of pins. Sharing is tracked at pin level, not GSI level.
      
      [avi: no PIC on ia64]
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Don't pass kvm_run arguments · 851ba692
      Committed by Avi Kivity
      They're just copies of vcpu->run, which is readily accessible.
      Signed-off-by: Avi Kivity <avi@redhat.com>
  14. Dec 02, 2009 (3 commits)