1. 24 7月, 2011 2 次提交
  2. 12 7月, 2011 13 次提交
    • G
      KVM: KVM Steal time guest/host interface · 9ddabbe7
      Glauber Costa 提交于
      To implement steal time, we need the hypervisor to pass the guest information
      about how much time was spent running other processes outside the VM.
      This is per-vcpu, and using the kvmclock structure for that is an abuse
      we decided not to make.
      
      In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
      holds the memory area address containing information about steal time
      
      This patch contains the headers for it. I am keeping it separate to facilitate
      backports to people who wants to backport the kernel part but not the
      hypervisor, or the other way around.
      Signed-off-by: NGlauber Costa <glommer@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Tested-by: NEric B Munson <emunson@mgebm.net>
      CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Anthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      9ddabbe7
    • P
      KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests · aa04b4cc
      Paul Mackerras 提交于
      This adds infrastructure which will be needed to allow book3s_hv KVM to
      run on older POWER processors, including PPC970, which don't support
      the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
      Offset (RMO) facility.  These processors require a physically
      contiguous, aligned area of memory for each guest.  When the guest does
      an access in real mode (MMU off), the address is compared against a
      limit value, and if it is lower, the address is ORed with an offset
      value (from the Real Mode Offset Register (RMOR)) and the result becomes
      the real address for the access.  The size of the RMA has to be one of
      a set of supported values, which usually includes 64MB, 128MB, 256MB
      and some larger powers of 2.
      
      Since we are unlikely to be able to allocate 64MB or more of physically
      contiguous memory after the kernel has been running for a while, we
      allocate a pool of RMAs at boot time using the bootmem allocator.  The
      size and number of the RMAs can be set using the kvm_rma_size=xx and
      kvm_rma_count=xx kernel command line options.
      
      KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
      of the pool of preallocated RMAs.  The capability value is 1 if the
      processor can use an RMA but doesn't require one (because it supports
      the VRMA facility), or 2 if the processor requires an RMA for each guest.
      
      This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
      pool and returns a file descriptor which can be used to map the RMA.  It
      also returns the size of the RMA in the argument structure.
      
      Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
      ioctl calls from userspace.  To cope with this, we now preallocate the
      kvm->arch.ram_pginfo array when the VM is created with a size sufficient
      for up to 64GB of guest memory.  Subsequently we will get rid of this
      array and use memory associated with each memslot instead.
      
      This moves most of the code that translates the user addresses into
      host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
      to kvmppc_core_prepare_memory_region.  Also, instead of having to look
      up the VMA for each page in order to check the page size, we now check
      that the pages we get are compound pages of 16MB.  However, if we are
      adding memory that is mapped to an RMA, we don't bother with calling
      get_user_pages_fast and instead just offset from the base pfn for the
      RMA.
      
      Typically the RMA gets added after vcpus are created, which makes it
      inconvenient to have the LPCR (logical partition control register) value
      in the vcpu->arch struct, since the LPCR controls whether the processor
      uses RMA or VRMA for the guest.  This moves the LPCR value into the
      kvm->arch struct and arranges for the MER (mediated external request)
      bit, which is the only bit that varies between vcpus, to be set in
      assembly code when going into the guest if there is a pending external
      interrupt request.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      aa04b4cc
    • P
      KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6
      Paul Mackerras 提交于
      This lifts the restriction that book3s_hv guests can only run one
      hardware thread per core, and allows them to use up to 4 threads
      per core on POWER7.  The host still has to run single-threaded.
      
      This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
      capability.  The return value of the ioctl querying this capability
      is the number of vcpus per virtual CPU core (vcore), currently 4.
      
      To use this, the host kernel should be booted with all threads
      active, and then all the secondary threads should be offlined.
      This will put the secondary threads into nap mode.  KVM will then
      wake them from nap mode and use them for running guest code (while
      they are still offline).  To wake the secondary threads, we send
      them an IPI using a new xics_wake_cpu() function, implemented in
      arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
      we assume that the platform has a XICS interrupt controller and
      we are using icp-native.c to drive it.  Since the woken thread will
      need to acknowledge and clear the IPI, we also export the base
      physical address of the XICS registers using kvmppc_set_xics_phys()
      for use in the low-level KVM book3s code.
      
      When a vcpu is created, it is assigned to a virtual CPU core.
      The vcore number is obtained by dividing the vcpu number by the
      number of threads per core in the host.  This number is exported
      to userspace via the KVM_CAP_PPC_SMT capability.  If qemu wishes
      to run the guest in single-threaded mode, it should make all vcpu
      numbers be multiples of the number of threads per core.
      
      We distinguish three states of a vcpu: runnable (i.e., ready to execute
      the guest), blocked (that is, idle), and busy in host.  We currently
      implement a policy that the vcore can run only when all its threads
      are runnable or blocked.  This way, if a vcpu needs to execute elsewhere
      in the kernel or in qemu, it can do so without being starved of CPU
      by the other vcpus.
      
      When a vcore starts to run, it executes in the context of one of the
      vcpu threads.  The other vcpu threads all go to sleep and stay asleep
      until something happens requiring the vcpu thread to return to qemu,
      or to wake up to run the vcore (this can happen when another vcpu
      thread goes from busy in host state to blocked).
      
      It can happen that a vcpu goes from blocked to runnable state (e.g.
      because of an interrupt), and the vcore it belongs to is already
      running.  In that case it can start to run immediately as long as
      the none of the vcpus in the vcore have started to exit the guest.
      We send the next free thread in the vcore an IPI to get it to start
      to execute the guest.  It synchronizes with the other threads via
      the vcore->entry_exit_count field to make sure that it doesn't go
      into the guest if the other vcpus are exiting by the time that it
      is ready to actually enter the guest.
      
      Note that there is no fixed relationship between the hardware thread
      number and the vcpu number.  Hardware threads are assigned to vcpus
      as they become runnable, so we will always use the lower-numbered
      hardware threads in preference to higher-numbered threads if not all
      the vcpus in the vcore are runnable, regardless of which vcpus are
      runnable.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      371fefd6
    • D
      KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode · 54738c09
      David Gibson 提交于
      This improves I/O performance for guests using the PAPR
      paravirtualization interface by making the H_PUT_TCE hcall faster, by
      implementing it in real mode.  H_PUT_TCE is used for updating virtual
      IOMMU tables, and is used both for virtual I/O and for real I/O in the
      PAPR interface.
      
      Since this moves the IOMMU tables into the kernel, we define a new
      KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.  The
      ioctl returns a file descriptor which can be used to mmap the newly
      created table.  The qemu driver models use them in the same way as
      userspace managed tables, but they can be updated directly by the
      guest with a real-mode H_PUT_TCE implementation, reducing the number
      of host/guest context switches during guest IO.
      
      There are certain circumstances where it is useful for userland qemu
      to write to the TCE table even if the kernel H_PUT_TCE path is used
      most of the time.  Specifically, allowing this will avoid awkwardness
      when we need to reset the table.  More importantly, we will in the
      future need to write the table in order to restore its state after a
      checkpoint resume or migration.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54738c09
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras 提交于
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      de56a948
    • S
      KVM: PPC: e500: enable magic page · a4cd8b23
      Scott Wood 提交于
      This is a shared page used for paravirtualization.  It is always present
      in the guest kernel's effective address space at the address indicated
      by the hypercall that enables it.
      
      The physical address specified by the hypercall is not used, as
      e500 does not have real mode.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a4cd8b23
    • A
      KVM: MMU: Adjust shadow paging to work when SMEP=1 and CR0.WP=0 · 411c588d
      Avi Kivity 提交于
      When CR0.WP=0, we sometimes map user pages as kernel pages (to allow
      the kernel to write to them).  Unfortunately this also allows the kernel
      to fetch from these pages, even if CR4.SMEP is set.
      
      Adjust for this by also setting NX on the spte in these circumstances.
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      411c588d
    • J
      KVM: Fix KVM_ASSIGN_SET_MSIX_ENTRY documentation · 58f0964e
      Jan Kiszka 提交于
      The documented behavior did not match the implemented one (which also
      never changed).
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      58f0964e
    • J
      KVM: Clarify KVM_ASSIGN_PCI_DEVICE documentation · 91e3d71d
      Jan Kiszka 提交于
      Neither host_irq nor the guest_msi struct are used anymore today.
      Tag the former, drop the latter to avoid confusion.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      91e3d71d
    • J
      KVM: Fixup documentation section numbering · 7f4382e8
      Jan Kiszka 提交于
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      7f4382e8
    • S
      KVM: Document KVM_IOEVENTFD · 55399a02
      Sasha Levin 提交于
      Document KVM_IOEVENTFD that can be used to receive
      notifications of PIO/MMIO events without triggering
      an exit.
      Signed-off-by: NSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      55399a02
    • N
      KVM: nVMX: Documentation · 823e3965
      Nadav Har'El 提交于
      This patch includes a brief introduction to the nested vmx feature in the
      Documentation/kvm directory. The document also includes a copy of the
      vmcs12 structure, as requested by Avi Kivity.
      
      [marcelo: move to Documentation/virtual/kvm]
      Signed-off-by: NNadav Har'El <nyh@il.ibm.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      823e3965
    • A
      KVM: Document KVM_GET_LAPIC, KVM_SET_LAPIC ioctl · e7677933
      Avi Kivity 提交于
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      e7677933
  3. 07 7月, 2011 2 次提交
  4. 03 7月, 2011 2 次提交
  5. 02 7月, 2011 2 次提交
  6. 22 6月, 2011 3 次提交
  7. 18 6月, 2011 1 次提交
    • S
      USB: Fix up URB error codes to reflect implementation. · a9e75863
      Sarah Sharp 提交于
      Documentation/usb/error-codes.txt mentions that urb->status can be set to
      -EXDEV, if the isochronous transfer was not fully completed.  However, in
      practice, EHCI, UHCI, and OHCI all only set -EXDEV in the individual frame
      status, never in the URB status.  Those host controller actually always
      pass in a zero status to usb_hcd_giveback_urb, and rely on the core to set
      the appropriate status value.
      
      The xHCI driver ran into issues with the uvcvideo driver when it tried to
      set -EXDEV in urb->status, because the driver refused to submit URBs, and
      the userspace camera application's video froze.
      
      Clean up the documentation to reflect the actual implementation.
      Signed-off-by: NSarah Sharp <sarah.a.sharp@linux.intel.com>
      Acked-by: NAlan Stern <stern@rowland.harvard.edu>
      a9e75863
  8. 16 6月, 2011 7 次提交
  9. 15 6月, 2011 1 次提交
    • S
      rcu: Use softirq to address performance regression · 09223371
      Shaohua Li 提交于
      Commit a26ac245(rcu: move TREE_RCU from softirq to kthread)
      introduced performance regression. In an AIM7 test, this commit degraded
      performance by about 40%.
      
      The commit runs rcu callbacks in a kthread instead of softirq. We observed
      high rate of context switch which is caused by this. Out test system has
      64 CPUs and HZ is 1000, so we saw more than 64k context switch per second
      which is caused by RCU's per-CPU kthread.  A trace showed that most of
      the time the RCU per-CPU kthread doesn't actually handle any callbacks,
      but instead just does a very small amount of work handling grace periods.
      This means that RCU's per-CPU kthreads are making the scheduler do quite
      a bit of work in order to allow a very small amount of RCU-related
      processing to be done.
      
      Alex Shi's analysis determined that this slowdown is due to lock
      contention within the scheduler.  Unfortunately, as Peter Zijlstra points
      out, the scheduler's real-time semantics require global action, which
      means that this contention is inherent in real-time scheduling.  (Yes,
      perhaps someone will come up with a workaround -- otherwise, -rt is not
      going to do well on large SMP systems -- but this patch will work around
      this issue in the meantime.  And "the meantime" might well be forever.)
      
      This patch therefore re-introduces softirq processing to RCU, but only
      for core RCU work.  RCU callbacks are still executed in kthread context,
      so that only a small amount of RCU work runs in softirq context in the
      common case.  This should minimize ksoftirqd execution, allowing us to
      skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Tested-by: N"Alex,Shi" <alex.shi@intel.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      09223371
  10. 09 6月, 2011 1 次提交
  11. 08 6月, 2011 1 次提交
    • A
      usb-storage: redo incorrect reads · 21c13a4f
      Alan Stern 提交于
      Some USB mass-storage devices have bugs that cause them not to handle
      the first READ(10) command they receive correctly.  The Corsair
      Padlock v2 returns completely bogus data for its first read (possibly
      it returns the data in encrypted form even though the device is
      supposed to be unlocked).  The Feiya SD/SDHC card reader fails to
      complete the first READ(10) command after it is plugged in or after a
      new card is inserted, returning a status code that indicates it thinks
      the command was invalid, which prevents the kernel from retrying the
      read.
      
      Since the first read of a new device or a new medium is for the
      partition sector, the kernel is unable to retrieve the device's
      partition table.  Users have to manually issue an "hdparm -z" or
      "blockdev --rereadpt" command before they can access the device.
      
      This patch (as1470) works around the problem.  It adds a new quirk
      flag, US_FL_INVALID_READ10, indicating that the first READ(10) should
      always be retried immediately, as should any failing READ(10) commands
      (provided the preceding READ(10) command succeeded, to avoid getting
      stuck in a loop).  The patch also adds appropriate unusual_devs
      entries containing the new flag.
      Signed-off-by: NAlan Stern <stern@rowland.harvard.edu>
      Tested-by: NSven Geggus <sven-usbst@geggus.net>
      Tested-by: NPaul Hartman <paul.hartman+linux@gmail.com>
      CC: Matthew Dharm <mdharm-usb@one-eyed-alien.net>
      CC: <stable@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      21c13a4f
  12. 01 6月, 2011 1 次提交
    • Y
      intel-iommu: Enable super page (2MiB, 1GiB, etc.) support · 6dd9a7c7
      Youquan Song 提交于
      There are no externally-visible changes with this. In the loop in the
      internal __domain_mapping() function, we simply detect if we are mapping:
        - size >= 2MiB, and
        - virtual address aligned to 2MiB, and
        - physical address aligned to 2MiB, and
        - on hardware that supports superpages.
      
      (and likewise for larger superpages).
      
      We automatically use a superpage for such mappings. We never have to
      worry about *breaking* superpages, since we trust that we will always
      *unmap* the same range that was mapped. So all we need to do is ensure
      that dma_pte_clear_range() will also cope with superpages.
      
      Adjust pfn_to_dma_pte() to take a superpage 'level' as an argument, so
      it can return a PTE at the appropriate level rather than always
      extending the page tables all the way down to level 1. Again, this is
      simplified by the fact that we should never encounter existing small
      pages when we're creating a mapping; any old mapping that used the same
      virtual range will have been entirely removed and its obsolete page
      tables freed.
      
      Provide an 'intel_iommu=sp_off' argument on the command line as a
      chicken bit. Not that it should ever be required.
      
      ==
      
      The original commit seen in the iommu-2.6.git was Youquan's
      implementation (and completion) of my own half-baked code which I'd
      typed into an email. Followed by half a dozen subsequent 'fixes'.
      
      I've taken the unusual step of rewriting history and collapsing the
      original commits in order to keep the main history simpler, and make
      life easier for the people who are going to have to backport this to
      older kernels. And also so I can give it a more coherent commit comment
      which (hopefully) gives a better explanation of what's going on.
      
      The original sequence of commits leading to identical code was:
      
      Youquan Song (3):
            intel-iommu: super page support
            intel-iommu: Fix superpage alignment calculation error
            intel-iommu: Fix superpage level calculation error in dma_pfn_level_pte()
      
      David Woodhouse (4):
            intel-iommu: Precalculate superpage support for dmar_domain
            intel-iommu: Fix hardware_largepage_caps()
            intel-iommu: Fix inappropriate use of superpages in __domain_mapping()
            intel-iommu: Fix phys_pfn in __domain_mapping for sglist pages
      Signed-off-by: NYouquan Song <youquan.song@intel.com>
      Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      6dd9a7c7
  13. 30 5月, 2011 2 次提交
  14. 29 5月, 2011 2 次提交
    • L
      x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param · 5d4c47e0
      Len Brown 提交于
      mwait_idle() is a C1-only idle loop intended to be more efficient
      than HLT on SMP hardware that supports it.
      
      But mwait_idle() has been replaced by the more general
      mwait_idle_with_hints(), which handles both C1 and deeper C-states.
      ACPI uses only mwait_idle_with_hints(), and never uses mwait_idle().
      
      Deprecate mwait_idle() and the "idle=mwait" cmdline param
      to simplify the x86 idle code.
      
      After this change, kernels configured with
      (!CONFIG_ACPI=n && !CONFIG_INTEL_IDLE=n) when run on hardware
      that support MWAIT will simply use HLT.  If MWAIT is desired
      on those systems, cpuidle and the cpuidle drivers above
      can be used.
      
      cc: x86@kernel.org
      cc: stable@kernel.org # .39.x
      Signed-off-by: NLen Brown <len.brown@intel.com>
      5d4c47e0
    • L
      x86 idle: deprecate "no-hlt" cmdline param · cdaab4a0
      Len Brown 提交于
      We'd rather that modern machines not check if HLT works on
      every entry into idle, for the benefit of machines that had
      marginal electricals 15-years ago.  If those machines are still running
      the upstream kernel, they can use "idle=poll".  The only difference
      will be that they'll now invoke HLT in machine_hlt().
      
      cc: x86@kernel.org # .39.x
      Signed-off-by: NLen Brown <len.brown@intel.com>
      cdaab4a0