1. 11 10月, 2013 2 次提交
    • P
      powerpc: Provide for giveup_fpu/altivec to save state in alternate location · 18461960
      Paul Mackerras 提交于
      This provides a facility which is intended for use by KVM, where the
      contents of the FP/VSX and VMX (Altivec) registers can be saved away
      to somewhere other than the thread_struct when kernel code wants to
      use floating point or VMX instructions.  This is done by providing a
      pointer in the thread_struct to indicate where the state should be
      saved to.  The giveup_fpu() and giveup_altivec() functions test these
      pointers and save state to the indicated location if they are non-NULL.
      Note that the MSR_FP/VEC bits in task->thread.regs->msr are still used
      to indicate whether the CPU register state is live, even when an
      alternate save location is being used.
      
      This also provides load_fp_state() and load_vr_state() functions, which
      load up FP/VSX and VMX state from memory into the CPU registers, and
      corresponding store_fp_state() and store_vr_state() functions, which
      store FP/VSX and VMX state into memory from the CPU registers.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      18461960
    • P
      powerpc: Put FP/VSX and VR state into structures · de79f7b9
      Paul Mackerras 提交于
      This creates new 'thread_fp_state' and 'thread_vr_state' structures
      to store FP/VSX state (including FPSCR) and Altivec/VSX state
      (including VSCR), and uses them in the thread_struct.  In the
      thread_fp_state, the FPRs and VSRs are represented as u64 rather
      than double, since we rarely perform floating-point computations
      on the values, and this will enable the structures to be used
      in KVM code as well.  Similarly FPSCR is now a u64 rather than
      a structure of two 32-bit values.
      
      This takes the offsets out of the macros such as SAVE_32FPRS,
      REST_32FPRS, etc.  This enables the same macros to be used for normal
      and transactional state, enabling us to delete the transactional
      versions of the macros.   This also removes the unused do_load_up_fpu
      and do_load_up_altivec, which were in fact buggy since they didn't
      create large enough stack frames to account for the fact that
      load_up_fpu and load_up_altivec are not designed to be called from C
      and assume that their caller's stack frame is an interrupt frame.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      de79f7b9
  2. 25 9月, 2013 1 次提交
    • B
      powerpc: Remove ksp_limit on ppc64 · cbc9565e
      Benjamin Herrenschmidt 提交于
      We've been keeping that field in thread_struct for a while, it contains
      the "limit" of the current stack pointer and is meant to be used for
      detecting stack overflows.
      
      It has a few problems however:
      
       - First, it was never actually *used* on 64-bit. Set and updated but
      not actually exploited
      
       - When switching stack to/from irq and softirq stacks, it's update
      is racy unless we hard disable interrupts, which is costly. This
      is fine on 32-bit as we don't soft-disable there but not on 64-bit.
      
      Thus rather than fixing 2 in order to implement 1 in some hypothetical
      future, let's remove the code completely from 64-bit. In order to avoid
      a clutter of ifdef's, we remove the updates from C code completely
      during interrupt stack switching, and instead maintain it from the
      asm helper that is used to do the stack switching in the first place.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      cbc9565e
  3. 09 8月, 2013 1 次提交
    • M
      powerpc/tm: Fix context switching TAR, PPR and DSCR SPRs · 28e61cc4
      Michael Neuling 提交于
      If a transaction is rolled back, the Target Address Register (TAR), Processor
      Priority Register (PPR) and Data Stream Control Register (DSCR) should be
      restored to the checkpointed values before the transaction began.  Any changes
      to these SPRs inside the transaction should not be visible in the abort
      handler.
      
      Currently Linux doesn't save or restore the checkpointed TAR, PPR or DSCR.  If
      we preempt a processes inside a transaction which has modified any of these, on
      process restore, that same transaction may be aborted we but we won't see the
      checkpointed versions of these SPRs.
      
      This adds checkpointed versions of these SPRs to the thread_struct and adds the
      save/restore of these three SPRs to the treclaim/trechkpt code.
      
      Without this if any of these SPRs are modified during a transaction, users may
      incorrectly see a speculated SPR value even if the transaction is aborted.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Cc: <stable@vger.kernel.org> [v3.10]
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      28e61cc4
  4. 25 7月, 2013 1 次提交
  5. 01 7月, 2013 1 次提交
  6. 20 6月, 2013 1 次提交
    • B
      powerpc: Restore dbcr0 on user space exit · 13d543cd
      Bharat Bhushan 提交于
      On BookE (Branch taken + Single Step) is as same as Branch Taken
      on BookS and in Linux we simulate BookS behavior for BookE as well.
      When doing so, in Branch taken handling we want to set DBCR0_IC but
      we update the current->thread->dbcr0 and not DBCR0.
      
      Now on 64bit the current->thread.dbcr0 (and other debug registers)
      is synchronized ONLY on context switch flow. But after handling
      Branch taken in debug exception if we return back to user space
      without context switch then single stepping change (DBCR0_ICMP)
      does not get written in h/w DBCR0 and Instruction Complete exception
      does not happen.
      
      This fixes using ptrace reliably on BookE-PowerPC
      
      lmbench latency test (lat_syscall) Results are (they varies a little
      on each run)
      
      1) ./lat_syscall <action> /dev/shm/uImage
      
      action:	Open	read	write	stat	fstat	null
      Before:	3.8618	0.2017	0.2851	1.6789	0.2256	0.0856
      After:	3.8580	0.2017	0.2851	1.6955	0.2255	0.0856
      
      1) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
      action:	Open	read	write	stat	fstat	null
      Before:	4.1388	0.2238	0.3066	1.7106	0.2256	0.0856
      After:	4.1413	0.2236	0.3062	1.7107	0.2256	0.0856
      
      [ Slightly modified to avoid extra branch in the fast path
        on Book3S and fix build on all non-BookE 64-bit -- BenH
      ]
      Signed-off-by: NBharat Bhushan <bharat.bhushan@freescale.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      13d543cd
  7. 24 5月, 2013 1 次提交
  8. 02 5月, 2013 1 次提交
  9. 27 4月, 2013 2 次提交
    • B
      KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM · 54695c30
      Benjamin Herrenschmidt 提交于
      Currently, we wake up a CPU by sending a host IPI with
      smp_send_reschedule() to thread 0 of that core, which will take all
      threads out of the guest, and cause them to re-evaluate their
      interrupt status on the way back in.
      
      This adds a mechanism to differentiate real host IPIs from IPIs sent
      by KVM for guest threads to poke each other, in order to target the
      guest threads precisely when possible and avoid that global switch of
      the core to host state.
      
      We then use this new facility in the in-kernel XICS code.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54695c30
    • P
      KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map · c35635ef
      Paul Mackerras 提交于
      At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
      done by the host to the virtual processor areas (VPAs) and dispatch
      trace logs (DTLs) registered by the guest.  This is because those
      modifications are done either in real mode or in the host kernel
      context, and in neither case does the access go through the guest's
      HPT, and thus no change (C) bit gets set in the guest's HPT.
      
      However, the changes done by the host do need to be tracked so that
      the modified pages get transferred when doing live migration.  In
      order to track these modifications, this adds a dirty flag to the
      struct representing the VPA/DTL areas, and arranges to set the flag
      when the VPA/DTL gets modified by the host.  Then, when we are
      collecting the dirty log, we also check the dirty flags for the
      VPA and DTL for each vcpu and set the relevant bit in the dirty log
      if necessary.  Doing this also means we now need to keep track of
      the guest physical address of the VPA/DTL areas.
      
      So as not to lose track of modifications to a VPA/DTL area when it gets
      unregistered, or when a new area gets registered in its place, we need
      to transfer the dirty state to the rmap chain.  This adds code to
      kvmppc_unpin_guest_page() to do that if the area was dirty.  To simplify
      that code, we now require that all VPA, DTL and SLB shadow buffer areas
      fit within a single host page.  Guests already comply with this
      requirement because pHyp requires that these areas not cross a 4k
      boundary.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c35635ef
  10. 22 3月, 2013 1 次提交
  11. 15 2月, 2013 3 次提交
  12. 13 2月, 2013 1 次提交
  13. 08 2月, 2013 1 次提交
  14. 10 1月, 2013 1 次提交
  15. 06 12月, 2012 1 次提交
    • P
      KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations · 1b400ba0
      Paul Mackerras 提交于
      When we change or remove a HPT (hashed page table) entry, we can do
      either a global TLB invalidation (tlbie) that works across the whole
      machine, or a local invalidation (tlbiel) that only affects this core.
      Currently we do local invalidations if the VM has only one vcpu or if
      the guest requests it with the H_LOCAL flag, though the guest Linux
      kernel currently doesn't ever use H_LOCAL.  Then, to cope with the
      possibility that vcpus moving around to different physical cores might
      expose stale TLB entries, there is some code in kvmppc_hv_entry to
      flush the whole TLB of entries for this VM if either this vcpu is now
      running on a different physical core from where it last ran, or if this
      physical core last ran a different vcpu.
      
      There are a number of problems on POWER7 with this as it stands:
      
      - The TLB invalidation is done per thread, whereas it only needs to be
        done per core, since the TLB is shared between the threads.
      - With the possibility of the host paging out guest pages, the use of
        H_LOCAL by an SMP guest is dangerous since the guest could possibly
        retain and use a stale TLB entry pointing to a page that had been
        removed from the guest.
      - The TLB invalidations that we do when a vcpu moves from one physical
        core to another are unnecessary in the case of an SMP guest that isn't
        using H_LOCAL.
      - The optimization of using local invalidations rather than global should
        apply to guests with one virtual core, not just one vcpu.
      
      (None of this applies on PPC970, since there we always have to
      invalidate the whole TLB when entering and leaving the guest, and we
      can't support paging out guest memory.)
      
      To fix these problems and simplify the code, we now maintain a simple
      cpumask of which cpus need to flush the TLB on entry to the guest.
      (This is indexed by cpu, though we only ever use the bits for thread
      0 of each core.)  Whenever we do a local TLB invalidation, we set the
      bits for every cpu except the bit for thread 0 of the core that we're
      currently running on.  Whenever we enter a guest, we test and clear the
      bit for our core, and flush the TLB if it was set.
      
      On initial startup of the VM, and when resetting the HPT, we set all the
      bits in the need_tlb_flush cpumask, since any core could potentially have
      stale TLB entries from the previous VM to use the same LPID, or the
      previous contents of the HPT.
      
      Then, we maintain a count of the number of online virtual cores, and use
      that when deciding whether to use a local invalidation rather than the
      number of online vcpus.  The code to make that decision is extracted out
      into a new function, global_invalidates().  For multi-core guests on
      POWER7 (i.e. when we are using mmu notifiers), we now never do local
      invalidations regardless of the H_LOCAL flag.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1b400ba0
  16. 07 9月, 2012 1 次提交
  17. 05 9月, 2012 1 次提交
  18. 11 7月, 2012 1 次提交
    • A
      powerpc: Add VDSO version of getcpu · 18ad51dd
      Anton Blanchard 提交于
      We have a request for a fast method of getting CPU and NUMA node IDs
      from userspace. This patch implements a getcpu VDSO function,
      similar to x86.
      
      Ben suggested we use SPRG3 which is userspace readable. SPRG3 can be
      modified by a KVM guest, so we save the SPRG3 value in the paca and
      restore it when transitioning from the guest to the host.
      
      I have a glibc patch that implements sched_getcpu on top of this.
      Testing on a POWER7:
      
      baseline: 538 cycles
      vdso:      30 cycles
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      18ad51dd
  19. 30 4月, 2012 1 次提交
  20. 08 4月, 2012 4 次提交
    • P
      KVM: PPC: Book 3S: Fix compilation for !HV configs · 7657f408
      Paul Mackerras 提交于
      Commits 2f5cdd5487 ("KVM: PPC: Book3S HV: Make secondary threads more
      robust against stray IPIs") and 1c2066b0f7 ("KVM: PPC: Book3S HV: Make
      virtual processor area registration more robust") added fields to
      struct kvm_vcpu_arch inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions,
      and added lines to arch/powerpc/kernel/asm-offsets.c to generate
      assembler constants for their offsets.  Unfortunately this led to
      compile errors on Book 3S machines for configs that had KVM enabled
      but not CONFIG_KVM_BOOK3S_64_HV.  This fixes the problem by moving
      the offending lines inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      7657f408
    • P
      KVM: PPC: Book3S HV: Make virtual processor area registration more robust · 2e25aa5f
      Paul Mackerras 提交于
      The PAPR API allows three sorts of per-virtual-processor areas to be
      registered (VPA, SLB shadow buffer, and dispatch trace log), and
      furthermore, these can be registered and unregistered for another
      virtual CPU.  Currently we just update the vcpu fields pointing to
      these areas at the time of registration or unregistration.  If this
      is done on another vcpu, there is the possibility that the target vcpu
      is using those fields at the time and could end up using a bogus
      pointer and corrupting memory.
      
      This fixes the race by making the target cpu itself do the update, so
      we can be sure that the update happens at a time when the fields
      aren't being used.  Each area now has a struct kvmppc_vpa which is
      used to manage these updates.  There is also a spinlock which protects
      access to all of the kvmppc_vpa structs, other than to the pinned_addr
      fields.  (We could have just taken the spinlock when using the vpa,
      slb_shadow or dtl fields, but that would mean taking the spinlock on
      every guest entry and exit.)
      
      This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
      which is what the rest of the kernel uses.
      
      Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
      the need to initialize vcpu->arch.vpa_update_lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      2e25aa5f
    • P
      KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs · f0888f70
      Paul Mackerras 提交于
      Currently on POWER7, if we are running the guest on a core and we don't
      need all the hardware threads, we do nothing to ensure that the unused
      threads aren't executing in the kernel (other than checking that they
      are offline).  We just assume they're napping and we don't do anything
      to stop them trying to enter the kernel while the guest is running.
      This means that a stray IPI can wake up the hardware thread and it will
      then try to enter the kernel, but since the core is in guest context,
      it will execute code from the guest in hypervisor mode once it turns the
      MMU on, which tends to lead to crashes or hangs in the host.
      
      This fixes the problem by adding two new one-byte flags in the
      kvmppc_host_state structure in the PACA which are used to interlock
      between the primary thread and the unused secondary threads when entering
      the guest.  With these flags, the primary thread can ensure that the
      unused secondaries are not already in kernel mode (i.e. handling a stray
      IPI) and then indicate that they should not try to enter the kernel
      if they do get woken for any reason.  Instead they will go into KVM code,
      find that there is no vcpu to run, acknowledge and clear the IPI and go
      back to nap mode.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      f0888f70
    • S
      KVM: PPC: booke: category E.HV (GS-mode) support · d30f6e48
      Scott Wood 提交于
      Chips such as e500mc that implement category E.HV in Power ISA 2.06
      provide hardware virtualization features, including a new MSR mode for
      guest state.  The guest OS can perform many operations without trapping
      into the hypervisor, including transitions to and from guest userspace.
      
      Since we can use SRR1[GS] to reliably tell whether an exception came from
      guest state, instead of messing around with IVPR, we use DO_KVM similarly
      to book3s.
      
      Current issues include:
       - Machine checks from guest state are not routed to the host handler.
       - The guest can cause a host oops by executing an emulated instruction
         in a page that lacks read permission.  Existing e500/4xx support has
         the same problem.
      
      Includes work by Ashish Kalra <Ashish.Kalra@freescale.com>,
      Varun Sethi <Varun.Sethi@freescale.com>, and
      Liu Yu <yu.liu@freescale.com>.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      [agraf: remove pt_regs usage]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      d30f6e48
  21. 21 3月, 2012 1 次提交
  22. 09 3月, 2012 1 次提交
    • B
      powerpc: Rework lazy-interrupt handling · 7230c564
      Benjamin Herrenschmidt 提交于
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      interrupts.
      
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      
      The base idea is to replace the "hard_enabled" field with a
      "irq_happened" field in which we store a bit mask of what interrupt
      occurred while soft-disabled.
      
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      field.
      
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      
      In addition, this adds a few refinements:
      
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
      
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
      
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
      
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2:
      
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
      
      v3:
      
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
      
      v4:
      
       - Fix resend of doorbells on return from interrupt on Book3E
      
      v5:
      
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
      
      v6:
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
      
      v7:
       - Fix a bug with hard irq state tracking on native power7
      7230c564
  23. 05 3月, 2012 2 次提交
    • P
      KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Paul Mackerras 提交于
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but since
      heavily reworked.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      697d3899
    • S
      KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn · b5904972
      Scott Wood 提交于
      This allows additional registers to be accessed by the guest
      in PR-mode KVM without trapping.
      
      SPRG4-7 are readable from userspace.  On booke, KVM will sync
      these registers when it enters the guest, so that accesses from
      guest userspace will work.  The guest kernel, OTOH, must consistently
      use either the real registers or the shared area between exits.  This
      also applies to the already-paravirted SPRG3.
      
      On non-booke, it's not clear to what extent SPRG4-7 are supported
      (they're not architected for book3s, but exist on at least some classic
      chips).  They are copied in the get/set regs ioctls, but I do not see any
      non-booke emulation.  I also do not see any syncing with real registers
      (in PR-mode) including the user-readable SPRG3.  This patch should not
      make that situation any worse.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b5904972
  24. 08 12月, 2011 1 次提交
    • P
      powerpc: Provide a way for KVM to indicate that NV GPR values are lost · 2fde6d20
      Paul Mackerras 提交于
      This fixes a problem where a CPU thread coming out of nap mode can
      think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
      away in power7_idle, but in fact the values have been trashed because
      the thread was used for KVM in the mean time.  The result is that the
      thread crashes because code that called power7_idle (e.g.,
      pnv_smp_cpu_kill_self()) goes to use values in registers that have
      been trashed.
      
      The bit field in SRR1 that tells whether state was lost only reflects
      the most recent nap, which may not have been the nap instruction in
      power7_idle.  So we need an extra PACA field to indicate that state
      has been lost even if SRR1 indicates that the most recent nap didn't
      lose state.  We clear this field when saving the state in power7_idle,
      we set it to a non-zero value when we use the thread for KVM, and we
      test it in power7_wakeup_noloss.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2fde6d20
  25. 26 9月, 2011 2 次提交
    • P
      KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code · 19ccb76a
      Paul Mackerras 提交于
      With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
      core), whenever a CPU goes idle, we have to pull all the other
      hardware threads in the core out of the guest, because the H_CEDE
      hcall is handled in the kernel.  This is inefficient.
      
      This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
      in real mode.  When a guest vcpu does an H_CEDE hcall, we now only
      exit to the kernel if all the other vcpus in the same core are also
      idle.  Otherwise we mark this vcpu as napping, save state that could
      be lost in nap mode (mainly GPRs and FPRs), and execute the nap
      instruction.  When the thread wakes up, because of a decrementer or
      external interrupt, we come back in at kvm_start_guest (from the
      system reset interrupt vector), find the `napping' flag set in the
      paca, and go to the resume path.
      
      This has some other ramifications.  First, when starting a core, we
      now start all the threads, both those that are immediately runnable and
      those that are idle.  This is so that we don't have to pull all the
      threads out of the guest when an idle thread gets a decrementer interrupt
      and wants to start running.  In fact the idle threads will all start
      with the H_CEDE hcall returning; being idle they will just do another
      H_CEDE immediately and go to nap mode.
      
      This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
      These functions have been restructured to make them simpler and clearer.
      We introduce a level of indirection in the wait queue that gets woken
      when external and decrementer interrupts get generated for a vcpu, so
      that we can have the 4 vcpus in a vcore using the same wait queue.
      We need this because the 4 vcpus are being handled by one thread.
      
      Secondly, when we need to exit from the guest to the kernel, we now
      have to generate an IPI for any napping threads, because an HDEC
      interrupt doesn't wake up a napping thread.
      
      Thirdly, we now need to be able to handle virtual external interrupts
      and decrementer interrupts becoming pending while a thread is napping,
      and deliver those interrupts to the guest when the thread wakes.
      This is done in kvmppc_cede_reentry, just before fast_guest_return.
      
      Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
      and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
      from kvm_arch_vcpu_runnable.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      19ccb76a
    • P
      KVM: PPC: book3s_pr: Simplify transitions between virtual and real mode · 02143947
      Paul Mackerras 提交于
      This simplifies the way that the book3s_pr makes the transition to
      real mode when entering the guest.  We now call kvmppc_entry_trampoline
      (renamed from kvmppc_rmcall) in the base kernel using a normal function
      call instead of doing an indirect call through a pointer in the vcpu.
      If kvm is a module, the module loader takes care of generating a
      trampoline as it does for other calls to functions outside the module.
      
      kvmppc_entry_trampoline then disables interrupts and jumps to
      kvmppc_handler_trampoline_enter in real mode using an rfi[d].
      That then uses the link register as the address to return to
      (potentially in module space) when the guest exits.
      
      This also simplifies the way that we call the Linux interrupt handler
      when we exit the guest due to an external, decrementer or performance
      monitor interrupt.  Instead of turning on the MMU, then deciding that
      we need to call the Linux handler and turning the MMU back off again,
      we now go straight to the handler at the point where we would turn the
      MMU on.  The handler will then return to the virtual-mode code
      (potentially in the module).
      
      Along the way, this moves the setting and clearing of the HID5 DCBZ32
      bit into real-mode interrupts-off code, and also makes sure that
      we clear the MSR[RI] bit before loading values into SRR0/1.
      
      The net result is that we no longer need any code addresses to be
      stored in vcpu->arch.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      02143947
  26. 20 9月, 2011 1 次提交
  27. 12 7月, 2011 5 次提交
    • P
      KVM: PPC: book3s_hv: Add support for PPC970-family processors · 9e368f29
      Paul Mackerras 提交于
      This adds support for running KVM guests in supervisor mode on those
      PPC970 processors that have a usable hypervisor mode.  Unfortunately,
      Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
      1), but the YDL PowerStation does have a usable hypervisor mode.
      
      There are several differences between the PPC970 and POWER7 in how
      guests are managed.  These differences are accommodated using the
      CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
      bits.  Notably, on PPC970:
      
      * The LPCR, LPID or RMOR registers don't exist, and the functions of
        those registers are provided by bits in HID4 and one bit in HID0.
      
      * External interrupts can be directed to the hypervisor, but unlike
        POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
        SRR0/1 not HSRR0/1.
      
      * There is no virtual RMA (VRMA) mode; the guest must use an RMO
        (real mode offset) area.
      
      * The TLB entries are not tagged with the LPID, so it is necessary to
        flush the whole TLB on partition switch.  Furthermore, when switching
        partitions we have to ensure that no other CPU is executing the tlbie
        or tlbsync instructions in either the old or the new partition,
        otherwise undefined behaviour can occur.
      
      * The PMU has 8 counters (PMC registers) rather than 6.
      
      * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.
      
      * The SLB has 64 entries rather than 32.
      
      * There is no mediated external interrupt facility, so if we switch to
        a guest that has a virtual external interrupt pending but the guest
        has MSR[EE] = 0, we have to arrange to have an interrupt pending for
        it so that we can get control back once it re-enables interrupts.  We
        do that by sending ourselves an IPI with smp_send_reschedule after
        hard-disabling interrupts.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9e368f29
    • P
      KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests · aa04b4cc
      Paul Mackerras 提交于
      This adds infrastructure which will be needed to allow book3s_hv KVM to
      run on older POWER processors, including PPC970, which don't support
      the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
      Offset (RMO) facility.  These processors require a physically
      contiguous, aligned area of memory for each guest.  When the guest does
      an access in real mode (MMU off), the address is compared against a
      limit value, and if it is lower, the address is ORed with an offset
      value (from the Real Mode Offset Register (RMOR)) and the result becomes
      the real address for the access.  The size of the RMA has to be one of
      a set of supported values, which usually includes 64MB, 128MB, 256MB
      and some larger powers of 2.
      
      Since we are unlikely to be able to allocate 64MB or more of physically
      contiguous memory after the kernel has been running for a while, we
      allocate a pool of RMAs at boot time using the bootmem allocator.  The
      size and number of the RMAs can be set using the kvm_rma_size=xx and
      kvm_rma_count=xx kernel command line options.
      
      KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
      of the pool of preallocated RMAs.  The capability value is 1 if the
      processor can use an RMA but doesn't require one (because it supports
      the VRMA facility), or 2 if the processor requires an RMA for each guest.
      
      This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
      pool and returns a file descriptor which can be used to map the RMA.  It
      also returns the size of the RMA in the argument structure.
      
      Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
      ioctl calls from userspace.  To cope with this, we now preallocate the
      kvm->arch.ram_pginfo array when the VM is created with a size sufficient
      for up to 64GB of guest memory.  Subsequently we will get rid of this
      array and use memory associated with each memslot instead.
      
      This moves most of the code that translates the user addresses into
      host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
      to kvmppc_core_prepare_memory_region.  Also, instead of having to look
      up the VMA for each page in order to check the page size, we now check
      that the pages we get are compound pages of 16MB.  However, if we are
      adding memory that is mapped to an RMA, we don't bother with calling
      get_user_pages_fast and instead just offset from the base pfn for the
      RMA.
      
      Typically the RMA gets added after vcpus are created, which makes it
      inconvenient to have the LPCR (logical partition control register) value
      in the vcpu->arch struct, since the LPCR controls whether the processor
      uses RMA or VRMA for the guest.  This moves the LPCR value into the
      kvm->arch struct and arranges for the MER (mediated external request)
      bit, which is the only bit that varies between vcpus, to be set in
      assembly code when going into the guest if there is a pending external
      interrupt request.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      aa04b4cc
    • P
      KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6
      Paul Mackerras 提交于
      This lifts the restriction that book3s_hv guests can only run one
      hardware thread per core, and allows them to use up to 4 threads
      per core on POWER7.  The host still has to run single-threaded.
      
      This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
      capability.  The return value of the ioctl querying this capability
      is the number of vcpus per virtual CPU core (vcore), currently 4.
      
      To use this, the host kernel should be booted with all threads
      active, and then all the secondary threads should be offlined.
      This will put the secondary threads into nap mode.  KVM will then
      wake them from nap mode and use them for running guest code (while
      they are still offline).  To wake the secondary threads, we send
      them an IPI using a new xics_wake_cpu() function, implemented in
      arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
      we assume that the platform has a XICS interrupt controller and
      we are using icp-native.c to drive it.  Since the woken thread will
      need to acknowledge and clear the IPI, we also export the base
      physical address of the XICS registers using kvmppc_set_xics_phys()
      for use in the low-level KVM book3s code.
      
      When a vcpu is created, it is assigned to a virtual CPU core.
      The vcore number is obtained by dividing the vcpu number by the
      number of threads per core in the host.  This number is exported
      to userspace via the KVM_CAP_PPC_SMT capability.  If qemu wishes
      to run the guest in single-threaded mode, it should make all vcpu
      numbers be multiples of the number of threads per core.
      
      We distinguish three states of a vcpu: runnable (i.e., ready to execute
      the guest), blocked (that is, idle), and busy in host.  We currently
      implement a policy that the vcore can run only when all its threads
      are runnable or blocked.  This way, if a vcpu needs to execute elsewhere
      in the kernel or in qemu, it can do so without being starved of CPU
      by the other vcpus.
      
      When a vcore starts to run, it executes in the context of one of the
      vcpu threads.  The other vcpu threads all go to sleep and stay asleep
      until something happens requiring the vcpu thread to return to qemu,
      or to wake up to run the vcore (this can happen when another vcpu
      thread goes from busy in host state to blocked).
      
      It can happen that a vcpu goes from blocked to runnable state (e.g.
      because of an interrupt), and the vcore it belongs to is already
      running.  In that case it can start to run immediately as long as
      the none of the vcpus in the vcore have started to exit the guest.
      We send the next free thread in the vcore an IPI to get it to start
      to execute the guest.  It synchronizes with the other threads via
      the vcore->entry_exit_count field to make sure that it doesn't go
      into the guest if the other vcpus are exiting by the time that it
      is ready to actually enter the guest.
      
      Note that there is no fixed relationship between the hardware thread
      number and the vcpu number.  Hardware threads are assigned to vcpus
      as they become runnable, so we will always use the lower-numbered
      hardware threads in preference to higher-numbered threads if not all
      the vcpus in the vcore are runnable, regardless of which vcpus are
      runnable.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      371fefd6
    • P
      KVM: PPC: Handle some PAPR hcalls in the kernel · a8606e20
      Paul Mackerras 提交于
      This adds the infrastructure for handling PAPR hcalls in the kernel,
      either early in the guest exit path while we are still in real mode,
      or later once the MMU has been turned back on and we are in the full
      kernel context.  The advantage of handling hcalls in real mode if
      possible is that we avoid two partition switches -- and this will
      become more important when we support SMT4 guests, since a partition
      switch means we have to pull all of the threads in the core out of
      the guest.  The disadvantage is that we can only access the kernel
      linear mapping, not anything vmalloced or ioremapped, since the MMU
      is off.
      
      This also adds code to handle the following hcalls in real mode:
      
      H_ENTER       Add an HPTE to the hashed page table
      H_REMOVE      Remove an HPTE from the hashed page table
      H_READ        Read HPTEs from the hashed page table
      H_PROTECT     Change the protection bits in an HPTE
      H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
      H_SET_DABR    Set the data address breakpoint register
      
      Plus code to handle the following hcalls in the kernel:
      
      H_CEDE        Idle the vcpu until an interrupt or H_PROD hcall arrives
      H_PROD        Wake up a ceded vcpu
      H_REGISTER_VPA Register a virtual processor area (VPA)
      
      The code that runs in real mode has to be in the base kernel, not in
      the module, if KVM is compiled as a module.  The real-mode code can
      only access the kernel linear mapping, not vmalloc or ioremap space.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a8606e20
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras 提交于
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      de56a948