1. 19 11月, 2013 2 次提交
    • P
      KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call · c9438092
      Paul Mackerras 提交于
      Running a kernel with CONFIG_PROVE_RCU=y yields the following diagnostic:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      3.12.0-rc5-kvm+ #9 Not tainted
      -------------------------------
      
      include/linux/kvm_host.h:473 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      1 lock held by qemu-system-ppc/4831:
      
      stack backtrace:
      CPU: 28 PID: 4831 Comm: qemu-system-ppc Not tainted 3.12.0-rc5-kvm+ #9
      Call Trace:
      [c000000be462b2a0] [c00000000001644c] .show_stack+0x7c/0x1f0 (unreliable)
      [c000000be462b370] [c000000000ad57c0] .dump_stack+0x88/0xb4
      [c000000be462b3f0] [c0000000001315e8] .lockdep_rcu_suspicious+0x138/0x180
      [c000000be462b480] [c00000000007862c] .gfn_to_memslot+0x13c/0x170
      [c000000be462b510] [c00000000007d384] .gfn_to_hva_prot+0x24/0x90
      [c000000be462b5a0] [c00000000007d420] .kvm_read_guest_page+0x30/0xd0
      [c000000be462b630] [c00000000007d528] .kvm_read_guest+0x68/0x110
      [c000000be462b6e0] [c000000000084594] .kvmppc_rtas_hcall+0x34/0x180
      [c000000be462b7d0] [c000000000097934] .kvmppc_pseries_do_hcall+0x74/0x830
      [c000000be462b880] [c0000000000990e8] .kvmppc_vcpu_run_hv+0xff8/0x15a0
      [c000000be462b9e0] [c0000000000839cc] .kvmppc_vcpu_run+0x2c/0x40
      [c000000be462ba50] [c0000000000810b4] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
      [c000000be462bae0] [c00000000007b508] .kvm_vcpu_ioctl+0x478/0x730
      [c000000be462bca0] [c00000000025532c] .do_vfs_ioctl+0x4dc/0x7a0
      [c000000be462bd80] [c0000000002556b4] .SyS_ioctl+0xc4/0xe0
      [c000000be462be30] [c000000000009ee4] syscall_exit+0x0/0x98
      
      To fix this, we take the SRCU read lock around the kvmppc_rtas_hcall()
      call.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c9438092
    • P
      KVM: PPC: Book3S HV: Make tbacct_lock irq-safe · bf3d32e1
      Paul Mackerras 提交于
      Lockdep reported that there is a potential for deadlock because
      vcpu->arch.tbacct_lock is not irq-safe, and is sometimes taken inside
      the rq_lock (run-queue lock) in the scheduler, which is taken within
      interrupts.  The lockdep splat looks like:
      
      ======================================================
      [ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
      3.12.0-rc5-kvm+ #8 Not tainted
      ------------------------------------------------------
      qemu-system-ppc/4803 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}, at: [<c0000000000947ac>] .kvmppc_core_vcpu_put_hv+0x2c/0xa0
      
      and this task is already holding:
      (&rq->lock){-.-.-.}, at: [<c000000000ac16c0>] .__schedule+0x180/0xaa0
      which would create a new lock dependency:
      (&rq->lock){-.-.-.} -> (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
      
      but this new dependency connects a HARDIRQ-irq-safe lock:
      (&rq->lock){-.-.-.}
      ... which became HARDIRQ-irq-safe at:
       [<c00000000013797c>] .lock_acquire+0xbc/0x190
       [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
       [<c0000000000f8564>] .scheduler_tick+0x54/0x180
       [<c0000000000c2610>] .update_process_times+0x70/0xa0
       [<c00000000012cdfc>] .tick_periodic+0x3c/0xe0
       [<c00000000012cec8>] .tick_handle_periodic+0x28/0xb0
       [<c00000000001ef40>] .timer_interrupt+0x120/0x2e0
       [<c000000000002868>] decrementer_common+0x168/0x180
       [<c0000000001c7ca4>] .get_page_from_freelist+0x924/0xc10
       [<c0000000001c8e00>] .__alloc_pages_nodemask+0x200/0xba0
       [<c0000000001c9eb8>] .alloc_pages_exact_nid+0x68/0x110
       [<c000000000f4c3ec>] .page_cgroup_init+0x1e0/0x270
       [<c000000000f24480>] .start_kernel+0x3e0/0x4e4
       [<c000000000009d30>] .start_here_common+0x20/0x70
      
      to a HARDIRQ-irq-unsafe lock:
      (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
      ... which became HARDIRQ-irq-unsafe at:
      ...  [<c00000000013797c>] .lock_acquire+0xbc/0x190
       [<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
       [<c0000000000946ac>] .kvmppc_core_vcpu_load_hv+0x2c/0x100
       [<c00000000008394c>] .kvmppc_core_vcpu_load+0x2c/0x40
       [<c000000000081000>] .kvm_arch_vcpu_load+0x10/0x30
       [<c00000000007afd4>] .vcpu_load+0x64/0xd0
       [<c00000000007b0f8>] .kvm_vcpu_ioctl+0x68/0x730
       [<c00000000025530c>] .do_vfs_ioctl+0x4dc/0x7a0
       [<c000000000255694>] .SyS_ioctl+0xc4/0xe0
       [<c000000000009ee4>] syscall_exit+0x0/0x98
      
      Some users have reported this deadlock occurring in practice, though
      the reports have been primarily on 3.10.x-based kernels.
      
      This fixes the problem by making tbacct_lock be irq-safe.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      bf3d32e1
  2. 18 10月, 2013 2 次提交
  3. 17 10月, 2013 11 次提交
    • A
      kvm: powerpc: book3s: Support building HV and PR KVM as module · 2ba9f0d8
      Aneesh Kumar K.V 提交于
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [agraf: squash in compile fix]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2ba9f0d8
    • A
      kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_ops · 699cc876
      Aneesh Kumar K.V 提交于
      This help us to identify whether we are running with hypervisor mode KVM
      enabled. The change is needed so that we can have both HV and PR kvm
      enabled in the same kernel.
      
      If both HV and PR KVM are included, interrupts come in to the HV version
      of the kvmppc_interrupt code, which then jumps to the PR handler,
      renamed to kvmppc_interrupt_pr, if the guest is a PR guest.
      
      Allowing both PR and HV in the same kernel required some changes to
      kvm_dev_ioctl_check_extension(), since the values returned now can't
      be selected with #ifdefs as much as previously. We look at is_hv_enabled
      to return the right value when checking for capabilities.For capabilities that
      are only provided by HV KVM, we return the HV value only if
      is_hv_enabled is true. For capabilities provided by PR KVM but not HV,
      we return the PR value only if is_hv_enabled is false.
      
      NOTE: in later patch we replace is_hv_enabled with a static inline
      function comparing kvm_ppc_ops
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      699cc876
    • A
      kvm: powerpc: Add kvmppc_ops callback · 3a167bea
      Aneesh Kumar K.V 提交于
      This patch add a new callback kvmppc_ops. This will help us in enabling
      both HV and PR KVM together in the same kernel. The actual change to
      enable them together is done in the later patch in the series.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [agraf: squash in booke changes]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      3a167bea
    • P
      kvm: powerpc: book3s hv: Fix vcore leak · f1378b1c
      Paul Mackerras 提交于
      add kvmppc_free_vcores() to free the kvmppc_vcore structures
      that we allocate for a guest, which are currently being leaked.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f1378b1c
    • P
      KVM: PPC: Book3S HV: Don't crash host on unknown guest interrupt · f3271d4c
      Paul Mackerras 提交于
      If we come out of a guest with an interrupt that we don't know about,
      instead of crashing the host with a BUG(), we now return to userspace
      with the exit reason set to KVM_EXIT_UNKNOWN and the trap vector in
      the hw.hardware_exit_reason field of the kvm_run structure, as is done
      on x86.  Note that run->exit_reason is already set to KVM_EXIT_UNKNOWN
      at the beginning of kvmppc_handle_exit().
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      f3271d4c
    • P
      KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7 · 388cc6e1
      Paul Mackerras 提交于
      This enables us to use the Processor Compatibility Register (PCR) on
      POWER7 to put the processor into architecture 2.05 compatibility mode
      when running a guest.  In this mode the new instructions and registers
      that were introduced on POWER7 are disabled in user mode.  This
      includes all the VSX facilities plus several other instructions such
      as ldbrx, stdbrx, popcntw, popcntd, etc.
      
      To select this mode, we have a new register accessible through the
      set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
      this to zero gives the full set of capabilities of the processor.
      Setting it to one of the "logical" PVR values defined in PAPR puts
      the vcpu into the compatibility mode for the corresponding
      architecture level.  The supported values are:
      
      0x0f000002	Architecture 2.05 (POWER6)
      0x0f000003	Architecture 2.06 (POWER7)
      0x0f100003	Architecture 2.06+ (POWER7+)
      
      Since the PCR is per-core, the architecture compatibility level and
      the corresponding PCR value are stored in the struct kvmppc_vcore, and
      are therefore shared between all vcpus in a virtual core.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: squash in fix to add missing break statements and documentation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      388cc6e1
    • P
      KVM: PPC: Book3S HV: Add support for guest Program Priority Register · 4b8473c9
      Paul Mackerras 提交于
      POWER7 and later IBM server processors have a register called the
      Program Priority Register (PPR), which controls the priority of
      each hardware CPU SMT thread, and affects how fast it runs compared
      to other SMT threads.  This priority can be controlled by writing to
      the PPR or by use of a set of instructions of the form or rN,rN,rN
      which are otherwise no-ops but have been defined to set the priority
      to particular levels.
      
      This adds code to context switch the PPR when entering and exiting
      guests and to make the PPR value accessible through the SET/GET_ONE_REG
      interface.  When entering the guest, we set the PPR as late as
      possible, because if we are setting a low thread priority it will
      make the code run slowly from that point on.  Similarly, the
      first-level interrupt handlers save the PPR value in the PACA very
      early on, and set the thread priority to the medium level, so that
      the interrupt handling code runs at a reasonable speed.
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4b8473c9
    • P
      KVM: PPC: Book3S HV: Store LPCR value for each virtual core · a0144e2a
      Paul Mackerras 提交于
      This adds the ability to have a separate LPCR (Logical Partitioning
      Control Register) value relating to a guest for each virtual core,
      rather than only having a single value for the whole VM.  This
      corresponds to what real POWER hardware does, where there is a LPCR
      per CPU thread but most of the fields are required to have the same
      value on all active threads in a core.
      
      The per-virtual-core LPCR can be read and written using the
      GET/SET_ONE_REG interface.  Userspace can can only modify the
      following fields of the LPCR value:
      
      DPFD	Default prefetch depth
      ILE	Interrupt little-endian
      TC	Translation control (secondary HPT hash group search disable)
      
      We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
      contains bits relating to memory management, i.e. the Virtualized
      Partition Memory (VPM) bits and the bits relating to guest real mode.
      When this default value is updated, the update needs to be propagated
      to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
      that.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix whitespace]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a0144e2a
    • P
      KVM: PPC: Book3S HV: Implement H_CONFER · 42d7604d
      Paul Mackerras 提交于
      The H_CONFER hypercall is used when a guest vcpu is spinning on a lock
      held by another vcpu which has been preempted, and the spinning vcpu
      wishes to give its timeslice to the lock holder.  We implement this
      in the straightforward way using kvm_vcpu_yield_to().
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      42d7604d
    • P
      KVM: PPC: Book3S HV: Implement timebase offset for guests · 93b0f4dc
      Paul Mackerras 提交于
      This allows guests to have a different timebase origin from the host.
      This is needed for migration, where a guest can migrate from one host
      to another and the two hosts might have a different timebase origin.
      However, the timebase seen by the guest must not go backwards, and
      should go forwards only by a small amount corresponding to the time
      taken for the migration.
      
      Therefore this provides a new per-vcpu value accessed via the one_reg
      interface using the new KVM_REG_PPC_TB_OFFSET identifier.  This value
      defaults to 0 and is not modified by KVM.  On entering the guest, this
      value is added onto the timebase, and on exiting the guest, it is
      subtracted from the timebase.
      
      This is only supported for recent POWER hardware which has the TBU40
      (timebase upper 40 bits) register.  Writing to the TBU40 register only
      alters the upper 40 bits of the timebase, leaving the lower 24 bits
      unchanged.  This provides a way to modify the timebase for guest
      migration without disturbing the synchronization of the timebase
      registers across CPU cores.  The kernel rounds up the value given
      to a multiple of 2^24.
      
      Timebase values stored in KVM structures (struct kvm_vcpu, struct
      kvmppc_vcore, etc.) are stored as host timebase values.  The timebase
      values in the dispatch trace log need to be guest timebase values,
      however, since that is read directly by the guest.  This moves the
      setting of vcpu->arch.dec_expires on guest exit to a point after we
      have restored the host timebase so that vcpu->arch.dec_expires is a
      host timebase value.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      93b0f4dc
    • P
      KVM: PPC: Book3S HV: Save/restore SIAR and SDAR along with other PMU registers · 14941789
      Paul Mackerras 提交于
      Currently we are not saving and restoring the SIAR and SDAR registers in
      the PMU (performance monitor unit) on guest entry and exit.  The result
      is that performance monitoring tools in the guest could get false
      information about where a program was executing and what data it was
      accessing at the time of a performance monitor interrupt.  This fixes
      it by saving and restoring these registers along with the other PMU
      registers on guest entry/exit.
      
      This also provides a way for userspace to access these values for a
      vcpu via the one_reg interface.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      14941789
  4. 28 8月, 2013 1 次提交
  5. 26 8月, 2013 1 次提交
  6. 23 8月, 2013 1 次提交
  7. 14 8月, 2013 1 次提交
  8. 09 8月, 2013 1 次提交
  9. 08 7月, 2013 1 次提交
  10. 11 6月, 2013 1 次提交
  11. 01 6月, 2013 1 次提交
  12. 30 4月, 2013 1 次提交
  13. 27 4月, 2013 5 次提交
    • P
      KVM: PPC: Book3S HV: Improve real-mode handling of external interrupts · 4619ac88
      Paul Mackerras 提交于
      This streamlines our handling of external interrupts that come in
      while we're in the guest.  First, when waking up a hardware thread
      that was napping, we split off the "napping due to H_CEDE" case
      earlier, and use the code that handles an external interrupt (0x500)
      in the guest to handle that too.  Secondly, the code that handles
      those external interrupts now checks if any other thread is exiting
      to the host before bouncing an external interrupt to the guest, and
      also checks that there is actually an external interrupt pending for
      the guest before setting the LPCR MER bit (mediated external request).
      
      This also makes sure that we clear the "ceded" flag when we handle a
      wakeup from cede in real mode, and fixes a potential infinite loop
      in kvmppc_run_vcpu() which can occur if we ever end up with the ceded
      flag set but MSR[EE] off.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4619ac88
    • B
      KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM · 54695c30
      Benjamin Herrenschmidt 提交于
      Currently, we wake up a CPU by sending a host IPI with
      smp_send_reschedule() to thread 0 of that core, which will take all
      threads out of the guest, and cause them to re-evaluate their
      interrupt status on the way back in.
      
      This adds a mechanism to differentiate real host IPIs from IPIs sent
      by KVM for guest threads to poke each other, in order to target the
      guest threads precisely when possible and avoid that global switch of
      the core to host state.
      
      We then use this new facility in the in-kernel XICS code.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54695c30
    • B
      KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller · bc5ad3f3
      Benjamin Herrenschmidt 提交于
      This adds in-kernel emulation of the XICS (eXternal Interrupt
      Controller Specification) interrupt controller specified by PAPR, for
      both HV and PR KVM guests.
      
      The XICS emulation supports up to 1048560 interrupt sources.
      Interrupt source numbers below 16 are reserved; 0 is used to mean no
      interrupt and 2 is used for IPIs.  Internally these are represented in
      blocks of 1024, called ICS (interrupt controller source) entities, but
      that is not visible to userspace.
      
      Each vcpu gets one ICP (interrupt controller presentation) entity,
      used to store the per-vcpu state such as vcpu priority, pending
      interrupt state, IPI request, etc.
      
      This does not include any API or any way to connect vcpus to their
      ICP state; that will be added in later patches.
      
      This is based on an initial implementation by Michael Ellerman
      <michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and
      Paul Mackerras.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix typo, add dependency on !KVM_MPIC]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      bc5ad3f3
    • M
      KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls · 8e591cb7
      Michael Ellerman 提交于
      For pseries machine emulation, in order to move the interrupt
      controller code to the kernel, we need to intercept some RTAS
      calls in the kernel itself.  This adds an infrastructure to allow
      in-kernel handlers to be registered for RTAS services by name.
      A new ioctl, KVM_PPC_RTAS_DEFINE_TOKEN, then allows userspace to
      associate token values with those service names.  Then, when the
      guest requests an RTAS service with one of those token values, it
      will be handled by the relevant in-kernel handler rather than being
      passed up to userspace as at present.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix warning]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8e591cb7
    • P
      KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map · c35635ef
      Paul Mackerras 提交于
      At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
      done by the host to the virtual processor areas (VPAs) and dispatch
      trace logs (DTLs) registered by the guest.  This is because those
      modifications are done either in real mode or in the host kernel
      context, and in neither case does the access go through the guest's
      HPT, and thus no change (C) bit gets set in the guest's HPT.
      
      However, the changes done by the host do need to be tracked so that
      the modified pages get transferred when doing live migration.  In
      order to track these modifications, this adds a dirty flag to the
      struct representing the VPA/DTL areas, and arranges to set the flag
      when the VPA/DTL gets modified by the host.  Then, when we are
      collecting the dirty log, we also check the dirty flags for the
      VPA and DTL for each vcpu and set the relevant bit in the dirty log
      if necessary.  Doing this also means we now need to keep track of
      the guest physical address of the VPA/DTL areas.
      
      So as not to lose track of modifications to a VPA/DTL area when it gets
      unregistered, or when a new area gets registered in its place, we need
      to transfer the dirty state to the rmap chain.  This adds code to
      kvmppc_unpin_guest_page() to do that if the area was dirty.  To simplify
      that code, we now require that all VPA, DTL and SLB shadow buffer areas
      fit within a single host page.  Guests already comply with this
      requirement because pHyp requires that these areas not cross a 4k
      boundary.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c35635ef
  14. 10 4月, 2013 1 次提交
  15. 05 3月, 2013 1 次提交
  16. 14 12月, 2012 1 次提交
  17. 06 12月, 2012 3 次提交
    • P
      KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking · b4072df4
      Paul Mackerras 提交于
      Currently, if a machine check interrupt happens while we are in the
      guest, we exit the guest and call the host's machine check handler,
      which tends to cause the host to panic.  Some machine checks can be
      triggered by the guest; for example, if the guest creates two entries
      in the SLB that map the same effective address, and then accesses that
      effective address, the CPU will take a machine check interrupt.
      
      To handle this better, when a machine check happens inside the guest,
      we call a new function, kvmppc_realmode_machine_check(), while still in
      real mode before exiting the guest.  On POWER7, it handles the cases
      that the guest can trigger, either by flushing and reloading the SLB,
      or by flushing the TLB, and then it delivers the machine check interrupt
      directly to the guest without going back to the host.  On POWER7, the
      OPAL firmware patches the machine check interrupt vector so that it
      gets control first, and it leaves behind its analysis of the situation
      in a structure pointed to by the opal_mc_evt field of the paca.  The
      kvmppc_realmode_machine_check() function looks at this, and if OPAL
      reports that there was no error, or that it has handled the error, we
      also go straight back to the guest with a machine check.  We have to
      deliver a machine check to the guest since the machine check interrupt
      might have trashed valid values in SRR0/1.
      
      If the machine check is one we can't handle in real mode, and one that
      OPAL hasn't already handled, or on PPC970, we exit the guest and call
      the host's machine check handler.  We do this by jumping to the
      machine_check_fwnmi label, rather than absolute address 0x200, because
      we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
      two are equivalent because address 0x200 just contains a branch.
      
      Then, if the host machine check handler decides that the system can
      continue executing, kvmppc_handle_exit() delivers a machine check
      interrupt to the guest -- once again to let the guest know that SRR0/1
      have been modified.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix checkpatch warnings]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4072df4
    • P
      KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations · 1b400ba0
      Paul Mackerras 提交于
      When we change or remove a HPT (hashed page table) entry, we can do
      either a global TLB invalidation (tlbie) that works across the whole
      machine, or a local invalidation (tlbiel) that only affects this core.
      Currently we do local invalidations if the VM has only one vcpu or if
      the guest requests it with the H_LOCAL flag, though the guest Linux
      kernel currently doesn't ever use H_LOCAL.  Then, to cope with the
      possibility that vcpus moving around to different physical cores might
      expose stale TLB entries, there is some code in kvmppc_hv_entry to
      flush the whole TLB of entries for this VM if either this vcpu is now
      running on a different physical core from where it last ran, or if this
      physical core last ran a different vcpu.
      
      There are a number of problems on POWER7 with this as it stands:
      
      - The TLB invalidation is done per thread, whereas it only needs to be
        done per core, since the TLB is shared between the threads.
      - With the possibility of the host paging out guest pages, the use of
        H_LOCAL by an SMP guest is dangerous since the guest could possibly
        retain and use a stale TLB entry pointing to a page that had been
        removed from the guest.
      - The TLB invalidations that we do when a vcpu moves from one physical
        core to another are unnecessary in the case of an SMP guest that isn't
        using H_LOCAL.
      - The optimization of using local invalidations rather than global should
        apply to guests with one virtual core, not just one vcpu.
      
      (None of this applies on PPC970, since there we always have to
      invalidate the whole TLB when entering and leaving the guest, and we
      can't support paging out guest memory.)
      
      To fix these problems and simplify the code, we now maintain a simple
      cpumask of which cpus need to flush the TLB on entry to the guest.
      (This is indexed by cpu, though we only ever use the bits for thread
      0 of each core.)  Whenever we do a local TLB invalidation, we set the
      bits for every cpu except the bit for thread 0 of the core that we're
      currently running on.  Whenever we enter a guest, we test and clear the
      bit for our core, and flush the TLB if it was set.
      
      On initial startup of the VM, and when resetting the HPT, we set all the
      bits in the need_tlb_flush cpumask, since any core could potentially have
      stale TLB entries from the previous VM to use the same LPID, or the
      previous contents of the HPT.
      
      Then, we maintain a count of the number of online virtual cores, and use
      that when deciding whether to use a local invalidation rather than the
      number of online vcpus.  The code to make that decision is extracted out
      into a new function, global_invalidates().  For multi-core guests on
      POWER7 (i.e. when we are using mmu notifiers), we now never do local
      invalidations regardless of the H_LOCAL flag.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1b400ba0
    • P
      KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT · a2932923
      Paul Mackerras 提交于
      A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
      this fd return the contents of the HPT (hashed page table), writes
      create and/or remove entries in the HPT.  There is a new capability,
      KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
      takes an argument structure with the index of the first HPT entry to
      read out and a set of flags.  The flags indicate whether the user is
      intending to read or write the HPT, and whether to return all entries
      or only the "bolted" entries (those with the bolted bit, 0x10, set in
      the first doubleword).
      
      This is intended for use in implementing qemu's savevm/loadvm and for
      live migration.  Therefore, on reads, the first pass returns information
      about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
      end of the HPT, it returns from the read.  Subsequent reads only return
      information about HPTEs that have changed since they were last read.
      A read that finds no changed HPTEs in the HPT following where the last
      read finished will return 0 bytes.
      
      The format of the data provides a simple run-length compression of the
      invalid entries.  Each block of data starts with a header that indicates
      the index (position in the HPT, which is just an array), the number of
      valid entries starting at that index (may be zero), and the number of
      invalid entries following those valid entries.  The valid entries, 16
      bytes each, follow the header.  The invalid entries are not explicitly
      represented.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix documentation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a2932923
  18. 30 10月, 2012 5 次提交
    • P
      KVM: PPC: Book3S HV: Allow DTL to be set to address 0, length 0 · 9f8c8c78
      Paul Mackerras 提交于
      Commit 55b665b0 ("KVM: PPC: Book3S HV: Provide a way for userspace
      to get/set per-vCPU areas") includes a check on the length of the
      dispatch trace log (DTL) to make sure the buffer is at least one entry
      long.  This is appropriate when registering a buffer, but the
      interface also allows for any existing buffer to be unregistered by
      specifying a zero address.  In this case the length check is not
      appropriate.  This makes the check conditional on the address being
      non-zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9f8c8c78
    • P
      KVM: PPC: Book3S HV: Fix accounting of stolen time · c7b67670
      Paul Mackerras 提交于
      Currently the code that accounts stolen time tends to overestimate the
      stolen time, and will sometimes report more stolen time in a DTL
      (dispatch trace log) entry than has elapsed since the last DTL entry.
      This can cause guests to underflow the user or system time measured
      for some tasks, leading to ridiculous CPU percentages and total runtimes
      being reported by top and other utilities.
      
      In addition, the current code was designed for the previous policy where
      a vcore would only run when all the vcpus in it were runnable, and so
      only counted stolen time on a per-vcore basis.  Now that a vcore can
      run while some of the vcpus in it are doing other things in the kernel
      (e.g. handling a page fault), we need to count the time when a vcpu task
      is preempted while it is not running as part of a vcore as stolen also.
      
      To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
      vcpu_load/put functions to count preemption time while the vcpu is
      in that state.  Handling the transitions between the RUNNING and
      BUSY_IN_HOST states requires checking and updating two variables
      (accumulated time stolen and time last preempted), so we add a new
      spinlock, vcpu->arch.tbacct_lock.  This protects both the per-vcpu
      stolen/preempt-time variables, and the per-vcore variables while this
      vcpu is running the vcore.
      
      Finally, we now don't count time spent in userspace as stolen time.
      The task could be executing in userspace on behalf of the vcpu, or
      it could be preempted, or the vcpu could be genuinely stopped.  Since
      we have no way of dividing up the time between these cases, we don't
      count any of it as stolen.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c7b67670
    • P
      KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run · 8455d79e
      Paul Mackerras 提交于
      Currently the Book3S HV code implements a policy on multi-threaded
      processors (i.e. POWER7) that requires all of the active vcpus in a
      virtual core to be ready to run before we run the virtual core.
      However, that causes problems on reset, because reset stops all vcpus
      except vcpu 0, and can also reduce throughput since all four threads
      in a virtual core have to wait whenever any one of them hits a
      hypervisor page fault.
      
      This relaxes the policy, allowing the virtual core to run as soon as
      any vcpu in it is runnable.  With this, the KVMPPC_VCPU_STOPPED state
      and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single
      KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish
      between them.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8455d79e
    • P
      KVM: PPC: Book3S HV: Fixes for late-joining threads · 2f12f034
      Paul Mackerras 提交于
      If a thread in a virtual core becomes runnable while other threads
      in the same virtual core are already running in the guest, it is
      possible for the latecomer to join the others on the core without
      first pulling them all out of the guest.  Currently this only happens
      rarely, when a vcpu is first started.  This fixes some bugs and
      omissions in the code in this case.
      
      First, we need to check for VPA updates for the latecomer and make
      a DTL entry for it.  Secondly, if it comes along while the master
      vcpu is doing a VPA update, we don't need to do anything since the
      master will pick it up in kvmppc_run_core.  To handle this correctly
      we introduce a new vcore state, VCORE_STARTING.  Thirdly, there is
      a race because we currently clear the hardware thread's hwthread_req
      before waiting to see it get to nap.  A latecomer thread could have
      its hwthread_req cleared before it gets to test it, and therefore
      never increment the nap_count, leading to messages about wait_for_nap
      timeouts.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2f12f034
    • P
      KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock · 913d3ff9
      Paul Mackerras 提交于
      There were a few places where we were traversing the list of runnable
      threads in a virtual core, i.e. vc->runnable_threads, without holding
      the vcore spinlock.  This extends the places where we hold the vcore
      spinlock to cover everywhere that we traverse that list.
      
      Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
      this moves the call of it from kvmppc_handle_exit out to
      kvmppc_vcpu_run, where we don't hold the vcore lock.
      
      In kvmppc_vcore_blocked, we don't actually need to check whether
      all vcpus are ceded and don't have any pending exceptions, since the
      caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
      actually checking for pending exceptions, so we add that.
      
      The change of if to while in kvmppc_run_vcpu is to make sure that we
      never call kvmppc_remove_runnable() when the vcore state is RUNNING or
      EXITING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      913d3ff9