1. 21 4月, 2015 8 次提交
    • P
      KVM: PPC: Book3S HV: Use bitmap of active threads rather than count · 7d6c40da
      Paul Mackerras 提交于
      Currently, the entry_exit_count field in the kvmppc_vcore struct
      contains two 8-bit counts, one of the threads that have started entering
      the guest, and one of the threads that have started exiting the guest.
      This changes it to an entry_exit_map field which contains two bitmaps
      of 8 bits each.  The advantage of doing this is that it gives us a
      bitmap of which threads need to be signalled when exiting the guest.
      That means that we no longer need to use the trick of setting the
      HDEC to 0 to pull the other threads out of the guest, which led in
      some cases to a spurious HDEC interrupt on the next guest entry.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7d6c40da
    • P
      KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken · 5d5b99cd
      Paul Mackerras 提交于
      We can tell when a secondary thread has finished running a guest by
      the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
      is no real need for the nap_count field in the kvmppc_vcore struct.
      This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
      pointers of the secondary threads rather than polling vc->nap_count.
      Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
      this also means that we can tell which secondary threads have got
      stuck and thus print a more informative error message.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5d5b99cd
    • P
      KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu · 25fedfca
      Paul Mackerras 提交于
      Rather than calling cond_resched() in kvmppc_run_core() before doing
      the post-processing for the vcpus that we have just run (that is,
      calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do
      that post-processing before calling cond_resched(), and that post-
      processing is moved out into its own function, post_guest_process().
      
      The reschedule point is now in kvmppc_run_vcpu() and we define a new
      vcore state, VCORE_PREEMPT, to indicate that that the vcore's runner
      task is runnable but not running.  (Doing the reschedule with the
      vcore in VCORE_INACTIVE state would be bad because there are potentially
      other vcpus waiting for the runner in kvmppc_wait_for_exec() which
      then wouldn't get woken up.)
      
      Also, we make use of the handy cond_resched_lock() function, which
      unlocks and relocks vc->lock for us around the reschedule.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      25fedfca
    • P
      KVM: PPC: Book3S HV: Minor cleanups · 1f09c3ed
      Paul Mackerras 提交于
      * Remove unused kvmppc_vcore::n_busy field.
      * Remove setting of RMOR, since it was only used on PPC970 and the
        PPC970 KVM support has been removed.
      * Don't use r1 or r2 in setting the runlatch since they are
        conventionally reserved for other things; use r0 instead.
      * Streamline the code a little and remove the ext_interrupt_to_host
        label.
      * Add some comments about register usage.
      * hcall_try_real_mode doesn't need to be global, and can't be
        called from C code anyway.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1f09c3ed
    • P
      KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update · d911f0be
      Paul Mackerras 提交于
      Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
      update (i.e. one of its 3 virtual processor areas needed to be pinned
      in memory so the host real mode code can update it on guest entry and
      exit), we would drop the vcore lock and do the update there and then.
      Future changes will make it inconvenient to drop the lock, so instead
      we now remove it from the list of runnable VCPUs and wake up its
      VCPU task.  This will have the effect that the VCPU task will exit
      kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
      re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
      to kvmppc_update_vpas() and then rejoin the vcore.
      
      The one complication is that the runner VCPU (whose VCPU task is the
      current task) might be one of the ones that gets removed from the
      runnable list.  In that case we just return from kvmppc_run_core()
      and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
      the runner if necessary.
      
      This all means that the VCORE_STARTING state is no longer used, so we
      remove it.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      d911f0be
    • P
      KVM: PPC: Book3S HV: Accumulate timing information for real-mode code · b6c295df
      Paul Mackerras 提交于
      This reads the timebase at various points in the real-mode guest
      entry/exit code and uses that to accumulate total, minimum and
      maximum time spent in those parts of the code.  Currently these
      times are accumulated per vcpu in 5 parts of the code:
      
      * rm_entry - time taken from the start of kvmppc_hv_entry() until
        just before entering the guest.
      * rm_intr - time from when we take a hypervisor interrupt in the
        guest until we either re-enter the guest or decide to exit to the
        host.  This includes time spent handling hcalls in real mode.
      * rm_exit - time from when we decide to exit the guest until the
        return from kvmppc_hv_entry().
      * guest - time spend in the guest
      * cede - time spent napping in real mode due to an H_CEDE hcall
        while other threads in the same vcore are active.
      
      These times are exposed in debugfs in a directory per vcpu that
      contains a file called "timings".  This file contains one line for
      each of the 5 timings above, with the name followed by a colon and
      4 numbers, which are the count (number of times the code has been
      executed), the total time, the minimum time, and the maximum time,
      all in nanoseconds.
      
      The overhead of the extra code amounts to about 30ns for an hcall that
      is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
      production environments may not wish to incur this overhead, the new
      code is conditional on a new config symbol,
      CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b6c295df
    • P
      KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT · e23a808b
      Paul Mackerras 提交于
      This creates a debugfs directory for each HV guest (assuming debugfs
      is enabled in the kernel config), and within that directory, a file
      by which the contents of the guest's HPT (hashed page table) can be
      read.  The directory is named vmnnnn, where nnnn is the PID of the
      process that created the guest.  The file is named "htab".  This is
      intended to help in debugging problems in the host's management
      of guest memory.
      
      The contents of the file consist of a series of lines like this:
      
        3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
      
      The first field is the index of the entry in the HPT, the second and
      third are the HPT entry, so the third entry contains the real page
      number that is mapped by the entry if the entry's valid bit is set.
      The fourth field is the guest's view of the second doubleword of the
      entry, so it contains the guest physical address.  (The format of the
      second through fourth fields are described in the Power ISA and also
      in arch/powerpc/include/asm/mmu-hash64.h.)
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e23a808b
    • A
      KVM: PPC: Book3S HV: Remove RMA-related variables from code · 31037eca
      Aneesh Kumar K.V 提交于
      We don't support real-mode areas now that 970 support is removed.
      Remove the remaining details of rma from the code.  Also rename
      rma_setup_done to hpte_setup_done to better reflect the changes.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      31037eca
  2. 24 3月, 2015 1 次提交
    • D
      powerpc: Cleanup KVM emulated load/store endian handling · d078eed3
      David Gibson 提交于
      Sometimes the KVM code on powerpc needs to emulate load or store
      instructions from the guest, which can include both normal and byte
      reversed forms.
      
      We currently (AFAICT) handle this correctly, but some variable names are
      very misleading.  In particular we use "is_bigendian" in several places to
      actually mean "is the IO the same endian as the host", but we now support
      little-endian powerpc hosts.  This also ties into the misleadingly named
      ld_le*() and st_le*() functions, which in fact always byteswap, even on
      an LE host.
      
      This patch cleans this up by renaming to more accurate "host_swabbed", and
      uses the generic swab*() functions instead of the powerpc specific and
      misleadingly named ld_le*() and st_le*() functions.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d078eed3
  3. 06 2月, 2015 1 次提交
    • P
      kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini 提交于
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters avoid that the guest sends pointless flush requests, and
      at the same time they are not slow because of the battery-backed cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle vCPUs
      is interrupted by an interrupt, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll avoids that Linux schedules the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f7819512
  4. 17 12月, 2014 4 次提交
    • S
      KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Sam Bobroff 提交于
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      90fd09f8
    • P
      KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register · 4a157d61
      Paul Mackerras 提交于
      There are two ways in which a guest instruction can be obtained from
      the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
      exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
      instruction), the offending instruction is in the HEIR register
      (Hypervisor Emulation Instruction Register).  If the exit was caused
      by a load or store to an emulated MMIO device, we load the instruction
      from the guest by turning data relocation on and loading the instruction
      with an lwz instruction.
      
      Unfortunately, in the case where the guest has opposite endianness to
      the host, these two methods give results of different endianness, but
      both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
      using guest endianness, whereas the lwz will load the instruction using
      host endianness.  The rest of the code that uses vcpu->arch.last_inst
      assumes it was loaded using host endianness.
      
      To fix this, we define a new vcpu field to store the HEIR value.  Then,
      in kvmppc_handle_exit_hv(), we transfer the value from this new field to
      vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
      differ.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4a157d61
    • P
      KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Paul Mackerras 提交于
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c17b98cf
    • P
      KVM: PPC: Book3S HV: Simplify locking around stolen time calculations · 2711e248
      Paul Mackerras 提交于
      Currently the calculations of stolen time for PPC Book3S HV guests
      uses fields in both the vcpu struct and the kvmppc_vcore struct.  The
      fields in the kvmppc_vcore struct are protected by the
      vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
      running the virtual core.  This works correctly but confuses lockdep,
      because it sees that the code takes the tbacct_lock for a vcpu in
      kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
      vcore_stolen_time(), and it thinks there is a possibility of deadlock,
      causing it to print reports like this:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
      ---------------------------------------------
      qemu-system-ppc/6188 is trying to acquire lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      
      but task is already holding lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by qemu-system-ppc/6188:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
       #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
       #2:  (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      stack backtrace:
      CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
      Call Trace:
      [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
      [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
      [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
      [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
      [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
      [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
      [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
      [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
      [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
      [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
      [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
      
      In order to make the locking easier to analyse, we change the code to
      use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
      preempt_tb fields.  This lock needs to be an irq-safe lock since it is
      used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
      functions, which are called with the scheduler rq lock held, which is
      an irq-safe lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2711e248
  5. 24 9月, 2014 2 次提交
    • T
      kvm: Add arch specific mmu notifier for page invalidation · fe71557a
      Tang Chen 提交于
      This will be used to let the guest run while the APIC access page is
      not pinned.  Because subsequent patches will fill in the function
      for x86, place the (still empty) x86 implementation in the x86.c file
      instead of adding an inline function in kvm_host.h.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fe71557a
    • A
      kvm: Fix page ageing bugs · 57128468
      Andres Lagar-Cavilla 提交于
      1. We were calling clear_flush_young_notify in unmap_one, but we are
      within an mmu notifier invalidate range scope. The spte exists no more
      (due to range_start) and the accessed bit info has already been
      propagated (due to kvm_pfn_set_accessed). Simply call
      clear_flush_young.
      
      2. We clear_flush_young on a primary MMU PMD, but this may be mapped
      as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
      This required expanding the interface of the clear_flush_young mmu
      notifier, so a lot of code has been trivially touched.
      
      3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
      the access bit by blowing the spte. This requires proper synchronizing
      with MMU notifier consumers, like every other removal of spte's does.
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57128468
  6. 22 9月, 2014 3 次提交
  7. 29 8月, 2014 3 次提交
  8. 29 7月, 2014 1 次提交
  9. 28 7月, 2014 9 次提交
    • A
      KVM: PPC: Move kvmppc_ld/st to common code · 35c4a733
      Alexander Graf 提交于
      We have enough common infrastructure now to resolve GVA->GPA mappings at
      runtime. With this we can move our book3s specific helpers to load / store
      in guest virtual address space to common code as well.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      35c4a733
    • S
      Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8 · 9678cdaa
      Stewart Smith 提交于
      The POWER8 processor has a Micro Partition Prefetch Engine, which is
      a fancy way of saying "has way to store and load contents of L2 or
      L2+MRU way of L3 cache". We initiate the storing of the log (list of
      addresses) using the logmpp instruction and start restore by writing
      to a SPR.
      
      The logmpp instruction takes parameters in a single 64bit register:
      - starting address of the table to store log of L2/L2+L3 cache contents
        - 32kb for L2
        - 128kb for L2+L3
        - Aligned relative to maximum size of the table (32kb or 128kb)
      - Log control (no-op, L2 only, L2 and L3, abort logout)
      
      We should abort any ongoing logging before initiating one.
      
      To initiate restore, we write to the MPPR SPR. The format of what to write
      to the SPR is similar to the logmpp instruction parameter:
      - starting address of the table to read from (same alignment requirements)
      - table size (no data, until end of table)
      - prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
        32 cycles)
      
      The idea behind loading and storing the contents of L2/L3 cache is to
      reduce memory latency in a system that is frequently swapping vcores on
      a physical CPU.
      
      The best case scenario for doing this is when some vcores are doing very
      cache heavy workloads. The worst case is when they have about 0 cache hits,
      so we just generate needless memory operations.
      
      This implementation just does L2 store/load. In my benchmarks this proves
      to be useful.
      
      Benchmark 1:
       - 16 core POWER8
       - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
       - No split core/SMT
       - two guests running sysbench memory test.
         sysbench --test=memory --num-threads=8 run
       - one guest running apache bench (of default HTML page)
         ab -n 490000 -c 400 http://localhost/
      
      This benchmark aims to measure performance of real world application (apache)
      where other guests are cache hot with their own workloads. The sysbench memory
      benchmark does pointer sized writes to a (small) memory buffer in a loop.
      
      In this benchmark with this patch I can see an improvement both in requests
      per second (~5%) and in mean and median response times (again, about 5%).
      The spread of minimum and maximum response times were largely unchanged.
      
      benchmark 2:
       - Same VM config as benchmark 1
       - all three guests running sysbench memory benchmark
      
      This benchmark aims to see if there is a positive or negative affect to this
      cache heavy benchmark. Although due to the nature of the benchmark (stores) we
      may not see a difference in performance, but rather hopefully an improvement
      in consistency of performance (when vcore switched in, don't have to wait
      many times for cachelines to be pulled in)
      
      The results of this benchmark are improvements in consistency of performance
      rather than performance itself. With this patch, the few outliers in duration
      go away and we get more consistent performance in each guest.
      
      benchmark 3:
       - same 3 guests and CPU configuration as benchmark 1 and 2.
       - two idle guests
       - 1 guest running STREAM benchmark
      
      This scenario also saw performance improvement with this patch. On Copy and
      Scale workloads from STREAM, I got 5-6% improvement with this patch. For
      Add and triad, it was around 10% (or more).
      
      benchmark 4:
       - same 3 guests as previous benchmarks
       - two guests running sysbench --memory, distinctly different cache heavy
         workload
       - one guest running STREAM benchmark.
      
      Similar improvements to benchmark 3.
      
      benchmark 5:
       - 1 guest, 8 VCPUs, Ubuntu 14.04
       - Host configured with split core (SMT8, subcores-per-core=4)
       - STREAM benchmark
      
      In this benchmark, we see a 10-20% performance improvement across the board
      of STREAM benchmark results with this patch.
      
      Based on preliminary investigation and microbenchmarks
      by Prerna Saxena <prerna@linux.vnet.ibm.com>
      Signed-off-by: NStewart Smith <stewart@linux.vnet.ibm.com>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9678cdaa
    • A
      KVM: PPC: Remove 440 support · b2677b8d
      Alexander Graf 提交于
      The 440 target hasn't been properly functioning for a few releases and
      before I was the only one who fixes a very serious bug that indicates to
      me that nobody used it before either.
      
      Furthermore KVM on 440 is slow to the extent of unusable.
      
      We don't have to carry along completely unused code. Remove 440 and give
      us one less thing to worry about.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b2677b8d
    • B
      kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit · 99e99d19
      Bharat Bhushan 提交于
      SPRN_SPRG is used by debug interrupt handler, so this is required for
      debug support.
      Signed-off-by: NBharat Bhushan <Bharat.Bhushan@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      99e99d19
    • A
      KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct · 1287cb3f
      Alexander Graf 提交于
      When building KVM with a lot of vcores (NR_CPUS is big), we can potentially
      get out of the ld immediate range for dereferences inside that struct.
      
      Move the array to the end of our kvm_arch struct. This fixes compilation
      issues with NR_CPUS=2048 for me.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1287cb3f
    • M
      KVM: PPC: e500: Emulate power management control SPR · debf27d6
      Mihai Caraman 提交于
      For FSL e6500 core the kernel uses power management SPR register (PWRMGTCR0)
      to enable idle power down for cores and devices by setting up the idle count
      period at boot time. With the host already controlling the power management
      configuration the guest could simply benefit from it, so emulate guest request
      as a general store.
      Signed-off-by: NMihai Caraman <mihai.caraman@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      debf27d6
    • P
      KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling · 699a0ea0
      Paul Mackerras 提交于
      This provides a way for userspace controls which sPAPR hcalls get
      handled in the kernel.  Each hcall can be individually enabled or
      disabled for in-kernel handling, except for H_RTAS.  The exception
      for H_RTAS is because userspace can already control whether
      individual RTAS functions are handled in-kernel or not via the
      KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
      H_RTAS is out of the normal sequence of hcall numbers.
      
      Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
      KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
      The args field of the struct kvm_enable_cap specifies the hcall number
      in args[0] and the enable/disable flag in args[1]; 0 means disable
      in-kernel handling (so that the hcall will always cause an exit to
      userspace) and 1 means enable.  Enabling or disabling in-kernel
      handling of an hcall is effective across the whole VM.
      
      The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
      on PowerPC is new, added by this commit.  The KVM_CAP_ENABLE_CAP_VM
      capability advertises that this ability exists.
      
      When a VM is created, an initial set of hcalls are enabled for
      in-kernel handling.  The set that is enabled is the set that have
      an in-kernel implementation at this point.  Any new hcall
      implementations from this point onwards should not be added to the
      default set without a good reason.
      
      No distinction is made between real-mode and virtual-mode hcall
      implementations; the one setting controls them both.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      699a0ea0
    • A
      KVM: PPC: BOOK3S: PR: Emulate instruction counter · 06da28e7
      Aneesh Kumar K.V 提交于
      Writing to IC is not allowed in the privileged mode.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      06da28e7
    • A
      KVM: PPC: BOOK3S: PR: Emulate virtual timebase register · 8f42ab27
      Aneesh Kumar K.V 提交于
      virtual time base register is a per VM, per cpu register that needs
      to be saved and restored on vm exit and entry. Writing to VTB is not
      allowed in the privileged mode.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [agraf: fix compile error]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8f42ab27
  10. 06 7月, 2014 1 次提交
  11. 30 5月, 2014 5 次提交
    • A
      KVM: PPC: Disable NX for old magic page using guests · f3383cf8
      Alexander Graf 提交于
      Old guests try to use the magic page, but map their trampoline code inside
      of an NX region.
      
      Since we can't fix those old kernels, try to detect whether the guest is sane
      or not. If not, just disable NX functionality in KVM so that old guests at
      least work at all. For newer guests, add a bit that we can set to keep NX
      functionality available.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f3383cf8
    • A
      KVM: PPC: Book3S PR: Expose TAR facility to guest · e14e7a1e
      Alexander Graf 提交于
      POWER8 implements a new register called TAR. This register has to be
      enabled in FSCR and then from KVM's point of view is mere storage.
      
      This patch enables the guest to use TAR.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e14e7a1e
    • A
      KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR · 616dff86
      Alexander Graf 提交于
      POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
      which contains its status message in a new register called FSCR.
      
      Handle these exits and try to emulate instructions for unhandled facilities.
      Follow-on patches enable KVM to expose specific facilities into the guest.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      616dff86
    • A
      KVM: PPC: Make shared struct aka magic page guest endian · 5deb8e7a
      Alexander Graf 提交于
      The shared (magic) page is a data structure that contains often used
      supervisor privileged SPRs accessible via memory to the user to reduce
      the number of exits we have to take to read/write them.
      
      When we actually share this structure with the guest we have to maintain
      it in guest endianness, because some of the patch tricks only work with
      native endian load/store operations.
      
      Since we only share the structure with either host or guest in little
      endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
      
      For booke, the shared struct stays big endian. For book3s_64 hv we maintain
      the struct in host native endian, since it never gets shared with the guest.
      
      For book3s_64 pr we introduce a variable that tells us which endianness the
      shared struct is in and route every access to it through helper inline
      functions that evaluate this variable.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5deb8e7a
    • A
      KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest · e5ee5422
      Aneesh Kumar K.V 提交于
      This patch make sure we inherit the LE bit correctly in different case
      so that we can run Little Endian distro in PR mode
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e5ee5422
  12. 27 1月, 2014 2 次提交