1. 21 4月, 2015 12 次提交
    • P
      KVM: PPC: Book3S HV: Minor cleanups · 1f09c3ed
      Paul Mackerras 提交于
      * Remove unused kvmppc_vcore::n_busy field.
      * Remove setting of RMOR, since it was only used on PPC970 and the
        PPC970 KVM support has been removed.
      * Don't use r1 or r2 in setting the runlatch since they are
        conventionally reserved for other things; use r0 instead.
      * Streamline the code a little and remove the ext_interrupt_to_host
        label.
      * Add some comments about register usage.
      * hcall_try_real_mode doesn't need to be global, and can't be
        called from C code anyway.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1f09c3ed
    • P
      KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update · d911f0be
      Paul Mackerras 提交于
      Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
      update (i.e. one of its 3 virtual processor areas needed to be pinned
      in memory so the host real mode code can update it on guest entry and
      exit), we would drop the vcore lock and do the update there and then.
      Future changes will make it inconvenient to drop the lock, so instead
      we now remove it from the list of runnable VCPUs and wake up its
      VCPU task.  This will have the effect that the VCPU task will exit
      kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
      re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
      to kvmppc_update_vpas() and then rejoin the vcore.
      
      The one complication is that the runner VCPU (whose VCPU task is the
      current task) might be one of the ones that gets removed from the
      runnable list.  In that case we just return from kvmppc_run_core()
      and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
      the runner if necessary.
      
      This all means that the VCORE_STARTING state is no longer used, so we
      remove it.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      d911f0be
    • P
      KVM: PPC: Book3S HV: Accumulate timing information for real-mode code · b6c295df
      Paul Mackerras 提交于
      This reads the timebase at various points in the real-mode guest
      entry/exit code and uses that to accumulate total, minimum and
      maximum time spent in those parts of the code.  Currently these
      times are accumulated per vcpu in 5 parts of the code:
      
      * rm_entry - time taken from the start of kvmppc_hv_entry() until
        just before entering the guest.
      * rm_intr - time from when we take a hypervisor interrupt in the
        guest until we either re-enter the guest or decide to exit to the
        host.  This includes time spent handling hcalls in real mode.
      * rm_exit - time from when we decide to exit the guest until the
        return from kvmppc_hv_entry().
      * guest - time spend in the guest
      * cede - time spent napping in real mode due to an H_CEDE hcall
        while other threads in the same vcore are active.
      
      These times are exposed in debugfs in a directory per vcpu that
      contains a file called "timings".  This file contains one line for
      each of the 5 timings above, with the name followed by a colon and
      4 numbers, which are the count (number of times the code has been
      executed), the total time, the minimum time, and the maximum time,
      all in nanoseconds.
      
      The overhead of the extra code amounts to about 30ns for an hcall that
      is handled in real mode (e.g. H_SET_DABR), which is about 25%.  Since
      production environments may not wish to incur this overhead, the new
      code is conditional on a new config symbol,
      CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b6c295df
    • P
      KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT · e23a808b
      Paul Mackerras 提交于
      This creates a debugfs directory for each HV guest (assuming debugfs
      is enabled in the kernel config), and within that directory, a file
      by which the contents of the guest's HPT (hashed page table) can be
      read.  The directory is named vmnnnn, where nnnn is the PID of the
      process that created the guest.  The file is named "htab".  This is
      intended to help in debugging problems in the host's management
      of guest memory.
      
      The contents of the file consist of a series of lines like this:
      
        3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
      
      The first field is the index of the entry in the HPT, the second and
      third are the HPT entry, so the third entry contains the real page
      number that is mapped by the entry if the entry's valid bit is set.
      The fourth field is the guest's view of the second doubleword of the
      entry, so it contains the guest physical address.  (The format of the
      second through fourth fields are described in the Power ISA and also
      in arch/powerpc/include/asm/mmu-hash64.h.)
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e23a808b
    • S
      KVM: PPC: Book3S HV: Add ICP real mode counters · 6e0365b7
      Suresh Warrier 提交于
      Add two counters to count how often we generate real-mode ICS resend
      and reject events. The counters provide some performance statistics
      that could be used in the future to consider if the real mode functions
      need further optimizing. The counters are displayed as part of IPC and
      ICP state provided by /sys/debug/kernel/powerpc/kvm* for each VM.
      
      Also added two counters that count (approximately) how many times we
      don't find an ICP or ICS we're looking for. These are not currently
      exposed through sysfs, but can be useful when debugging crashes.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      6e0365b7
    • S
      KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode · b0221556
      Suresh Warrier 提交于
      Interrupt-based hypercalls return H_TOO_HARD to inform KVM that it needs
      to switch to the host to complete the rest of hypercall function in
      virtual mode. This patch ports the virtual mode ICS/ICP reject and resend
      functions to be runnable in hypervisor real mode, thus avoiding the need
      to switch to the host to execute these functions in virtual mode. However,
      the hypercalls continue to return H_TOO_HARD for vcpu_wakeup and notify
      events - these events cannot be done in real mode and they will still need
      a switch to host virtual mode.
      
      There are sufficient differences between the real mode code and the
      virtual mode code for the ICS/ICP resend and reject functions that
      for now the code has been duplicated instead of sharing common code.
      In the future, we can look at creating common functions.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b0221556
    • S
      KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock · 34cb7954
      Suresh Warrier 提交于
      Replaces the ICS mutex lock with a spin lock since we will be porting
      these routines to real mode. Note that we need to disable interrupts
      before we take the lock in anticipation of the fact that on the guest
      side, we are running in the context of a hard irq and interrupts are
      disabled (EE bit off) when the lock is acquired. Again, because we
      will be acquiring the lock in hypervisor real mode, we need to use
      an arch_spinlock_t instead of a normal spinlock here as we want to
      avoid running any lockdep code (which may not be safe to execute in
      real mode).
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      34cb7954
    • S
      KVM: PPC: Book3S HV: Add guest->host real mode completion counters · 878610fe
      Suresh E. Warrier 提交于
      Add counters to track number of times we switch from guest real mode
      to host virtual mode during an interrupt-related hyper call because the
      hypercall requires actions that cannot be completed in real mode. This
      will help when making optimizations that reduce guest-host transitions.
      
      It is safe to use an ordinary increment rather than an atomic operation
      because there is one ICP per virtual CPU and kvmppc_xics_rm_complete()
      only works on the ICP for the current VCPU.
      
      The counters are displayed as part of IPC and ICP state provided by
      /sys/debug/kernel/powerpc/kvm* for each VM.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      878610fe
    • A
      KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte · a4bd6eb0
      Aneesh Kumar K.V 提交于
      This adds helper routines for locking and unlocking HPTEs, and uses
      them in the rest of the code.  We don't change any locking rules in
      this patch.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a4bd6eb0
    • A
      KVM: PPC: Book3S HV: Remove RMA-related variables from code · 31037eca
      Aneesh Kumar K.V 提交于
      We don't support real-mode areas now that 970 support is removed.
      Remove the remaining details of rma from the code.  Also rename
      rma_setup_done to hpte_setup_done to better reflect the changes.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      31037eca
    • M
      KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation. · e928e9cb
      Michael Ellerman 提交于
      Some PowerNV systems include a hardware random-number generator.
      This HWRNG is present on POWER7+ and POWER8 chips and is capable of
      generating one 64-bit random number every microsecond.  The random
      numbers are produced by sampling a set of 64 unstable high-frequency
      oscillators and are almost completely entropic.
      
      PAPR defines an H_RANDOM hypercall which guests can use to obtain one
      64-bit random sample from the HWRNG.  This adds a real-mode
      implementation of the H_RANDOM hypercall.  This hypercall was
      implemented in real mode because the latency of reading the HWRNG is
      generally small compared to the latency of a guest exit and entry for
      all the threads in the same virtual core.
      
      Userspace can detect the presence of the HWRNG and the H_RANDOM
      implementation by querying the KVM_CAP_PPC_HWRNG capability.  The
      H_RANDOM hypercall implementation will only be invoked when the guest
      does an H_RANDOM hypercall if userspace first enables the in-kernel
      H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e928e9cb
    • D
      kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM · 99342cf8
      David Gibson 提交于
      On POWER, storage caching is usually configured via the MMU - attributes
      such as cache-inhibited are stored in the TLB and the hashed page table.
      
      This makes correctly performing cache inhibited IO accesses awkward when
      the MMU is turned off (real mode).  Some CPU models provide special
      registers to control the cache attributes of real mode load and stores but
      this is not at all consistent.  This is a problem in particular for SLOF,
      the firmware used on KVM guests, which runs entirely in real mode, but
      which needs to do IO to load the kernel.
      
      To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
      and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
      a logical address (aka guest physical address).  SLOF uses these for IO.
      
      However, because these are implemented within qemu, not the host kernel,
      these bypass any IO devices emulated within KVM itself.  The simplest way
      to see this problem is to attempt to boot a KVM guest from a virtio-blk
      device with iothread / dataplane enabled.  The iothread code relies on an
      in kernel implementation of the virtio queue notification, which is not
      triggered by the IO hcalls, and so the guest will stall in SLOF unable to
      load the guest OS.
      
      This patch addresses this by providing in-kernel implementations of the
      2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
      address not handled by the KVM IO bus will cause a VM exit, hitting the
      qemu implementation as before.
      
      Note that a userspace change is also required, in order to enable these
      new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [agraf: fix compilation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      99342cf8
  2. 08 4月, 2015 1 次提交
  3. 27 3月, 2015 2 次提交
  4. 20 3月, 2015 3 次提交
    • P
      KVM: PPC: Book3S HV: Fix instruction emulation · 2bf27601
      Paul Mackerras 提交于
      Commit 4a157d61 ("KVM: PPC: Book3S HV: Fix endianness of
      instruction obtained from HEIR register") had the side effect that
      we no longer reset vcpu->arch.last_inst to -1 on guest exit in
      the cases where the instruction is not fetched from the guest.
      This means that if instruction emulation turns out to be required
      in those cases, the host will emulate the wrong instruction, since
      vcpu->arch.last_inst will contain the last instruction that was
      emulated.
      
      This fixes it by making sure that vcpu->arch.last_inst is reset
      to -1 in those cases.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2bf27601
    • P
      KVM: PPC: Book3S HV: Endian fix for accessing VPA yield count · ecb6d618
      Paul Mackerras 提交于
      The VPA (virtual processor area) is defined by PAPR and is therefore
      big-endian, so we need a be32_to_cpu when reading it in
      kvmppc_get_yield_count().  Without this, H_CONFER always fails on a
      little-endian host, causing SMP guests to waste time spinning on
      spinlocks.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ecb6d618
    • P
      KVM: PPC: Book3S HV: Fix spinlock/mutex ordering issue in kvmppc_set_lpcr() · 8f902b00
      Paul Mackerras 提交于
      Currently, kvmppc_set_lpcr() has a spinlock around the whole function,
      and inside that does mutex_lock(&kvm->lock).  It is not permitted to
      take a mutex while holding a spinlock, because the mutex_lock might
      call schedule().  In addition, this causes lockdep to warn about a
      lock ordering issue:
      
      ======================================================
      [ INFO: possible circular locking dependency detected ]
      3.18.0-kvm-04645-gdfea862-dirty #131 Not tainted
      -------------------------------------------------------
      qemu-system-ppc/8179 is trying to acquire lock:
       (&kvm->lock){+.+.+.}, at: [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv]
      
      but task is already holding lock:
       (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&(&vcore->lock)->rlock){+.+...}:
             [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570
             [<d00000000ecc7a14>] .kvmppc_vcpu_run_hv+0xc4/0xe40 [kvm_hv]
             [<d00000000eb9f5cc>] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
             [<d00000000eb9cb24>] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
             [<d00000000eb94478>] .kvm_vcpu_ioctl+0x4a8/0x7b0 [kvm]
             [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770
             [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0
             [<c000000000009264>] syscall_exit+0x0/0x98
      
      -> #0 (&kvm->lock){+.+.+.}:
             [<c0000000000ff28c>] .lock_acquire+0xcc/0x1a0
             [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570
             [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv]
             [<d00000000ecc510c>] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv]
             [<d00000000eb9f234>] .kvmppc_set_one_reg+0x44/0x330 [kvm]
             [<d00000000eb9c9dc>] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm]
             [<d00000000eb9ced4>] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm]
             [<d00000000eb940b0>] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm]
             [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770
             [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0
             [<c000000000009264>] syscall_exit+0x0/0x98
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&(&vcore->lock)->rlock);
                                     lock(&kvm->lock);
                                     lock(&(&vcore->lock)->rlock);
        lock(&kvm->lock);
      
       *** DEADLOCK ***
      
      2 locks held by qemu-system-ppc/8179:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f18>] .vcpu_load+0x28/0x90 [kvm]
       #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv]
      
      stack backtrace:
      CPU: 4 PID: 8179 Comm: qemu-system-ppc Not tainted 3.18.0-kvm-04645-gdfea862-dirty #131
      Call Trace:
      [c000001a66c0f310] [c000000000b486ac] .dump_stack+0x88/0xb4 (unreliable)
      [c000001a66c0f390] [c0000000000f8bec] .print_circular_bug+0x27c/0x3d0
      [c000001a66c0f440] [c0000000000fe9e8] .__lock_acquire+0x2028/0x2190
      [c000001a66c0f5d0] [c0000000000ff28c] .lock_acquire+0xcc/0x1a0
      [c000001a66c0f6a0] [c000000000b3c120] .mutex_lock_nested+0x80/0x570
      [c000001a66c0f7c0] [d00000000ecc1f54] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv]
      [c000001a66c0f860] [d00000000ecc510c] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv]
      [c000001a66c0f8d0] [d00000000eb9f234] .kvmppc_set_one_reg+0x44/0x330 [kvm]
      [c000001a66c0f960] [d00000000eb9c9dc] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm]
      [c000001a66c0f9f0] [d00000000eb9ced4] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm]
      [c000001a66c0faf0] [d00000000eb940b0] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm]
      [c000001a66c0fcb0] [c00000000026cbb4] .do_vfs_ioctl+0x444/0x770
      [c000001a66c0fd90] [c00000000026cfa4] .SyS_ioctl+0xc4/0xe0
      [c000001a66c0fe30] [c000000000009264] syscall_exit+0x0/0x98
      
      This fixes it by moving the mutex_lock()/mutex_unlock() pair outside
      the spin-locked region.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8f902b00
  5. 13 2月, 2015 1 次提交
  6. 06 2月, 2015 1 次提交
    • P
      kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini 提交于
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters avoid that the guest sends pointless flush requests, and
      at the same time they are not slow because of the battery-backed cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle vCPUs
      is interrupted by an interrupt, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll avoids that Linux schedules the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f7819512
  7. 23 1月, 2015 1 次提交
  8. 19 1月, 2015 1 次提交
  9. 07 1月, 2015 1 次提交
    • P
      rcu: Make SRCU optional by using CONFIG_SRCU · 83fe27ea
      Pranith Kumar 提交于
      SRCU is not necessary to be compiled by default in all cases. For tinification
      efforts not compiling SRCU unless necessary is desirable.
      
      The current patch tries to make compiling SRCU optional by introducing a new
      Kconfig option CONFIG_SRCU which is selected when any of the components making
      use of SRCU are selected.
      
      If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
      
         text    data     bss     dec     hex filename
         2007       0       0    2007     7d7 kernel/rcu/srcu.o
      
      Size of arch/powerpc/boot/zImage changes from
      
         text    data     bss     dec     hex filename
       831552   64180   23944  919676   e087c arch/powerpc/boot/zImage : before
       829504   64180   23952  917636   e0084 arch/powerpc/boot/zImage : after
      
      so the savings are about ~2000 bytes.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Josh Triplett <josh@joshtriplett.org>
      CC: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
      83fe27ea
  10. 29 12月, 2014 1 次提交
  11. 19 12月, 2014 1 次提交
  12. 18 12月, 2014 1 次提交
  13. 17 12月, 2014 9 次提交
    • S
      KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Sam Bobroff 提交于
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      90fd09f8
    • P
      KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register · 4a157d61
      Paul Mackerras 提交于
      There are two ways in which a guest instruction can be obtained from
      the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
      exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
      instruction), the offending instruction is in the HEIR register
      (Hypervisor Emulation Instruction Register).  If the exit was caused
      by a load or store to an emulated MMIO device, we load the instruction
      from the guest by turning data relocation on and loading the instruction
      with an lwz instruction.
      
      Unfortunately, in the case where the guest has opposite endianness to
      the host, these two methods give results of different endianness, but
      both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
      using guest endianness, whereas the lwz will load the instruction using
      host endianness.  The rest of the code that uses vcpu->arch.last_inst
      assumes it was loaded using host endianness.
      
      To fix this, we define a new vcpu field to store the HEIR value.  Then,
      in kvmppc_handle_exit_hv(), we transfer the value from this new field to
      vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
      differ.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4a157d61
    • P
      KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Paul Mackerras 提交于
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c17b98cf
    • S
      KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions · 3c78f78a
      Suresh E. Warrier 提交于
      This patch adds trace points in the guest entry and exit code and also
      for exceptions handled by the host in kernel mode - hypercalls and page
      faults. The new events are added to /sys/kernel/debug/tracing/events
      under a new subsystem called kvm_hv.
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      3c78f78a
    • P
      KVM: PPC: Book3S HV: Simplify locking around stolen time calculations · 2711e248
      Paul Mackerras 提交于
      Currently the calculations of stolen time for PPC Book3S HV guests
      uses fields in both the vcpu struct and the kvmppc_vcore struct.  The
      fields in the kvmppc_vcore struct are protected by the
      vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
      running the virtual core.  This works correctly but confuses lockdep,
      because it sees that the code takes the tbacct_lock for a vcpu in
      kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
      vcore_stolen_time(), and it thinks there is a possibility of deadlock,
      causing it to print reports like this:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
      ---------------------------------------------
      qemu-system-ppc/6188 is trying to acquire lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      
      but task is already holding lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by qemu-system-ppc/6188:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
       #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
       #2:  (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      stack backtrace:
      CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
      Call Trace:
      [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
      [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
      [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
      [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
      [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
      [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
      [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
      [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
      [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
      [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
      [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
      
      In order to make the locking easier to analyse, we change the code to
      use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
      preempt_tb fields.  This lock needs to be an irq-safe lock since it is
      used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
      functions, which are called with the scheduler rq lock held, which is
      an irq-safe lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2711e248
    • R
      arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function · a0499cf7
      Rickard Strandqvist 提交于
      Remove the function inst_set_field() that is not used anywhere.
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a0499cf7
    • R
      arch: powerpc: kvm: book3s_pr.c: Remove unused function · 6178839b
      Rickard Strandqvist 提交于
      Remove the function get_fpr_index() that is not used anywhere.
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      6178839b
    • R
      arch: powerpc: kvm: book3s.c: Remove some unused functions · 54ca162a
      Rickard Strandqvist 提交于
      Removes some functions that are not used anywhere:
      kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54ca162a
    • R
      arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function · 24aaaf22
      Rickard Strandqvist 提交于
      Remove the function sr_nx() that is not used anywhere.
      
      This was partially found by using a static code analysis program called cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      24aaaf22
  14. 15 12月, 2014 5 次提交
    • S
      KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked · 1bc5d59c
      Suresh E. Warrier 提交于
      The kvmppc_vcore_blocked() code does not check for the wait condition
      after putting the process on the wait queue. This means that it is
      possible for an external interrupt to become pending, but the vcpu to
      remain asleep until the next decrementer interrupt.  The fix is to
      make one last check for pending exceptions and ceded state before
      calling schedule().
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1bc5d59c
    • C
      KVM: PPC: Book3S HV: ptes are big endian · ffada016
      Cédric Le Goater 提交于
      When being restored from qemu, the kvm_get_htab_header are in native
      endian, but the ptes are big endian.
      
      This patch fixes restore on a KVM LE host. Qemu also needs a fix for
      this :
      
           http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg00008.htmlSigned-off-by: NCédric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ffada016
    • S
      KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI · 5b88cda6
      Suresh E. Warrier 提交于
      This fixes some inaccuracies in the state machine for the virtualized
      ICP when implementing the H_IPI hcall (Set_MFFR and related states):
      
      1. The old code wipes out any pending interrupts when the new MFRR is
         more favored than the CPPR but less favored than a pending
         interrupt (by always modifying xisr and the pending_pri). This can
         cause us to lose a pending external interrupt.
      
         The correct code here is to only modify the pending_pri and xisr in
         the ICP if the MFRR is equal to or more favored than the current
         pending pri (since in this case, it is guaranteed that that there
         cannot be a pending external interrupt). The code changes are
         required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.
      
      2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check
         for whether MFRR is being made less favored AND further if new MFFR
         is also less favored than the current CPPR, we check for any
         resends pending in the ICP. These checks look like they are
         designed to cover the case where if the MFRR is being made less
         favored, we opportunistically trigger a resend of any interrupts
         that had been previously rejected. Although, this is not a state
         described by PAPR, this is an action we actually need to do
         especially if the CPPR is already at 0xFF.  Because in this case,
         the resend bit will stay on until another ICP state change which
         may be a long time coming and the interrupt stays pending until
         then. The current code which checks for MFRR < CPPR is broken when
         CPPR is 0xFF since it will not get triggered in that case.
      
         Ideally, we would want to do a resend only if
      
         	prio(pending_interrupt) < mfrr && prio(pending_interrupt) < cppr
      
         where pending interrupt is the one that was rejected. But we don't
         have the priority of the pending interrupt state saved, so we
         simply trigger a resend whenever the MFRR is made less favored.
      
      3. In kvmppc_rm_h_ipi, where we save state to pass resends to the
         virtual mode, we also need to save the ICP whose need_resend we
         reset since this does not need to be my ICP (vcpu->arch.icp) as is
         incorrectly assumed by the current code. A new field rm_resend_icp
         is added to the kvmppc_icp structure for this purpose.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5b88cda6
    • P
      KVM: PPC: Book3S HV: Fix KSM memory corruption · b4a83900
      Paul Mackerras 提交于
      Testing with KSM active in the host showed occasional corruption of
      guest memory.  Typically a page that should have contained zeroes
      would contain values that look like the contents of a user process
      stack (values such as 0x0000_3fff_xxxx_xxx).
      
      Code inspection in kvmppc_h_protect revealed that there was a race
      condition with the possibility of granting write access to a page
      which is read-only in the host page tables.  The code attempts to keep
      the host mapping read-only if the host userspace PTE is read-only, but
      if that PTE had been temporarily made invalid for any reason, the
      read-only check would not trigger and the host HPTE could end up
      read-write.  Examination of the guest HPT in the failure situation
      revealed that there were indeed shared pages which should have been
      read-only that were mapped read-write.
      
      To close this race, we don't let a page go from being read-only to
      being read-write, as far as the real HPTE mapping the page is
      concerned (the guest view can go to read-write, but the actual mapping
      stays read-only).  When the guest tries to write to the page, we take
      an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a
      writable HPTE for the page.
      
      This eliminates the occasional corruption of shared pages
      that was previously seen with KSM active.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4a83900
    • M
      KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI · dee6f24c
      Mahesh Salgaonkar 提交于
      When we get an HMI (hypervisor maintenance interrupt) while in a
      guest, we see that guest enters into paused state.  The reason is, in
      kvmppc_handle_exit_hv it falls through default path and returns to
      host instead of resuming guest.  This causes guest to enter into
      paused state.  HMI is a hypervisor only interrupt and it is safe to
      resume the guest since the host has handled it already.  This patch
      adds a switch case to resume the guest.
      
      Without this patch we see guest entering into paused state with following
      console messages:
      
      [ 3003.329351] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 3003.329356]  Error detail: Timer facility experienced an error
      [ 3003.329359] 	HMER: 0840000000000000
      [ 3003.329360] 	TFMR: 4a12000980a84000
      [ 3003.329366] vcpu c0000007c35094c0 (40):
      [ 3003.329368] pc  = c0000000000c2ba0  msr = 8000000000009032  trap = e60
      [ 3003.329370] r 0 = c00000000021ddc0  r16 = 0000000000000046
      [ 3003.329372] r 1 = c00000007a02bbd0  r17 = 00003ffff27d5d98
      [ 3003.329375] r 2 = c0000000010980b8  r18 = 00001fffffc9a0b0
      [ 3003.329377] r 3 = c00000000142d6b8  r19 = c00000000142d6b8
      [ 3003.329379] r 4 = 0000000000000002  r20 = 0000000000000000
      [ 3003.329381] r 5 = c00000000524a110  r21 = 0000000000000000
      [ 3003.329383] r 6 = 0000000000000001  r22 = 0000000000000000
      [ 3003.329386] r 7 = 0000000000000000  r23 = c00000000524a110
      [ 3003.329388] r 8 = 0000000000000000  r24 = 0000000000000001
      [ 3003.329391] r 9 = 0000000000000001  r25 = c00000007c31da38
      [ 3003.329393] r10 = c0000000014280b8  r26 = 0000000000000002
      [ 3003.329395] r11 = 746f6f6c2f68656c  r27 = c00000000524a110
      [ 3003.329397] r12 = 0000000028004484  r28 = c00000007c31da38
      [ 3003.329399] r13 = c00000000fe01400  r29 = 0000000000000002
      [ 3003.329401] r14 = 0000000000000046  r30 = c000000003011e00
      [ 3003.329403] r15 = ffffffffffffffba  r31 = 0000000000000002
      [ 3003.329404] ctr = c00000000041a670  lr  = c000000000272520
      [ 3003.329405] srr0 = c00000000007e8d8 srr1 = 9000000000001002
      [ 3003.329406] sprg0 = 0000000000000000 sprg1 = c00000000fe01400
      [ 3003.329407] sprg2 = c00000000fe01400 sprg3 = 0000000000000005
      [ 3003.329408] cr = 48004482  xer = 2000000000000000  dsisr = 42000000
      [ 3003.329409] dar = 0000010015020048
      [ 3003.329410] fault dar = 0000010015020048 dsisr = 42000000
      [ 3003.329411] SLB (8 entries):
      [ 3003.329412]   ESID = c000000008000000 VSID = 40016e7779000510
      [ 3003.329413]   ESID = d000000008000001 VSID = 400142add1000510
      [ 3003.329414]   ESID = f000000008000004 VSID = 4000eb1a81000510
      [ 3003.329415]   ESID = 00001f000800000b VSID = 40004fda0a000d90
      [ 3003.329416]   ESID = 00003f000800000c VSID = 400039f536000d90
      [ 3003.329417]   ESID = 000000001800000d VSID = 0001251b35150d90
      [ 3003.329417]   ESID = 000001000800000e VSID = 4001e46090000d90
      [ 3003.329418]   ESID = d000080008000019 VSID = 40013d349c000400
      [ 3003.329419] lpcr = c048800001847001 sdr1 = 0000001b19000006 last_inst = ffffffff
      [ 3003.329421] trap=0xe60 | pc=0xc0000000000c2ba0 | msr=0x8000000000009032
      [ 3003.329524] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 3003.329526]  Error detail: Timer facility experienced an error
      [ 3003.329527] 	HMER: 0840000000000000
      [ 3003.329527] 	TFMR: 4a12000980a94000
      [ 3006.359786] Severe Hypervisor Maintenance interrupt [Recovered]
      [ 3006.359792]  Error detail: Timer facility experienced an error
      [ 3006.359795] 	HMER: 0840000000000000
      [ 3006.359797] 	TFMR: 4a12000980a84000
      
       Id    Name                           State
      ----------------------------------------------------
       2     guest2                         running
       3     guest3                         paused
       4     guest4                         running
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      dee6f24c