1. 17 10月, 2013 8 次提交
    • A
      kvm: powerpc: book3s: pr: Rename KVM_BOOK3S_PR to KVM_BOOK3S_PR_POSSIBLE · 7aa79938
      Aneesh Kumar K.V 提交于
      With later patches supporting PR kvm as a kernel module, the changes
      that has to be built into the main kernel binary to enable PR KVM module
      is now selected via KVM_BOOK3S_PR_POSSIBLE
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7aa79938
    • B
      powerpc: move debug registers in a structure · 95791988
      Bharat Bhushan 提交于
      This way we can use same data type struct with KVM and
      also help in using other debug related function.
      Signed-off-by: NBharat Bhushan <bharat.bhushan@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      95791988
    • P
      KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu · a2d56020
      Paul Mackerras 提交于
      Currently PR-style KVM keeps the volatile guest register values
      (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
      the main kvm_vcpu struct.  For 64-bit, the shadow_vcpu exists in two
      places, a kmalloc'd struct and in the PACA, and it gets copied back
      and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
      can't rely on being able to access the kmalloc'd struct.
      
      This changes the code to copy the volatile values into the shadow_vcpu
      as one of the last things done before entering the guest.  Similarly
      the values are copied back out of the shadow_vcpu to the kvm_vcpu
      immediately after exiting the guest.  We arrange for interrupts to be
      still disabled at this point so that we can't get preempted on 64-bit
      and end up copying values from the wrong PACA.
      
      This means that the accessor functions in kvm_book3s.h for these
      registers are greatly simplified, and are same between PR and HV KVM.
      In places where accesses to shadow_vcpu fields are now replaced by
      accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
      Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
      
      With this, the time to read the PVR one million times in a loop went
      from 567.7ms to 575.5ms (averages of 6 values), an increase of about
      1.4% for this worse-case test for guest entries and exits.  The
      standard deviation of the measurements is about 11ms, so the
      difference is only marginally significant statistically.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a2d56020
    • P
      KVM: PPC: Book3S HV: Support POWER6 compatibility mode on POWER7 · 388cc6e1
      Paul Mackerras 提交于
      This enables us to use the Processor Compatibility Register (PCR) on
      POWER7 to put the processor into architecture 2.05 compatibility mode
      when running a guest.  In this mode the new instructions and registers
      that were introduced on POWER7 are disabled in user mode.  This
      includes all the VSX facilities plus several other instructions such
      as ldbrx, stdbrx, popcntw, popcntd, etc.
      
      To select this mode, we have a new register accessible through the
      set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT.  Setting
      this to zero gives the full set of capabilities of the processor.
      Setting it to one of the "logical" PVR values defined in PAPR puts
      the vcpu into the compatibility mode for the corresponding
      architecture level.  The supported values are:
      
      0x0f000002	Architecture 2.05 (POWER6)
      0x0f000003	Architecture 2.06 (POWER7)
      0x0f100003	Architecture 2.06+ (POWER7+)
      
      Since the PCR is per-core, the architecture compatibility level and
      the corresponding PCR value are stored in the struct kvmppc_vcore, and
      are therefore shared between all vcpus in a virtual core.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: squash in fix to add missing break statements and documentation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      388cc6e1
    • P
      KVM: PPC: Book3S HV: Add support for guest Program Priority Register · 4b8473c9
      Paul Mackerras 提交于
      POWER7 and later IBM server processors have a register called the
      Program Priority Register (PPR), which controls the priority of
      each hardware CPU SMT thread, and affects how fast it runs compared
      to other SMT threads.  This priority can be controlled by writing to
      the PPR or by use of a set of instructions of the form or rN,rN,rN
      which are otherwise no-ops but have been defined to set the priority
      to particular levels.
      
      This adds code to context switch the PPR when entering and exiting
      guests and to make the PPR value accessible through the SET/GET_ONE_REG
      interface.  When entering the guest, we set the PPR as late as
      possible, because if we are setting a low thread priority it will
      make the code run slowly from that point on.  Similarly, the
      first-level interrupt handlers save the PPR value in the PACA very
      early on, and set the thread priority to the medium level, so that
      the interrupt handling code runs at a reasonable speed.
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4b8473c9
    • P
      KVM: PPC: Book3S HV: Store LPCR value for each virtual core · a0144e2a
      Paul Mackerras 提交于
      This adds the ability to have a separate LPCR (Logical Partitioning
      Control Register) value relating to a guest for each virtual core,
      rather than only having a single value for the whole VM.  This
      corresponds to what real POWER hardware does, where there is a LPCR
      per CPU thread but most of the fields are required to have the same
      value on all active threads in a core.
      
      The per-virtual-core LPCR can be read and written using the
      GET/SET_ONE_REG interface.  Userspace can can only modify the
      following fields of the LPCR value:
      
      DPFD	Default prefetch depth
      ILE	Interrupt little-endian
      TC	Translation control (secondary HPT hash group search disable)
      
      We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
      contains bits relating to memory management, i.e. the Virtualized
      Partition Memory (VPM) bits and the bits relating to guest real mode.
      When this default value is updated, the update needs to be propagated
      to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
      that.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix whitespace]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a0144e2a
    • P
      KVM: PPC: Book3S HV: Implement timebase offset for guests · 93b0f4dc
      Paul Mackerras 提交于
      This allows guests to have a different timebase origin from the host.
      This is needed for migration, where a guest can migrate from one host
      to another and the two hosts might have a different timebase origin.
      However, the timebase seen by the guest must not go backwards, and
      should go forwards only by a small amount corresponding to the time
      taken for the migration.
      
      Therefore this provides a new per-vcpu value accessed via the one_reg
      interface using the new KVM_REG_PPC_TB_OFFSET identifier.  This value
      defaults to 0 and is not modified by KVM.  On entering the guest, this
      value is added onto the timebase, and on exiting the guest, it is
      subtracted from the timebase.
      
      This is only supported for recent POWER hardware which has the TBU40
      (timebase upper 40 bits) register.  Writing to the TBU40 register only
      alters the upper 40 bits of the timebase, leaving the lower 24 bits
      unchanged.  This provides a way to modify the timebase for guest
      migration without disturbing the synchronization of the timebase
      registers across CPU cores.  The kernel rounds up the value given
      to a multiple of 2^24.
      
      Timebase values stored in KVM structures (struct kvm_vcpu, struct
      kvmppc_vcore, etc.) are stored as host timebase values.  The timebase
      values in the dispatch trace log need to be guest timebase values,
      however, since that is read directly by the guest.  This moves the
      setting of vcpu->arch.dec_expires on guest exit to a point after we
      have restored the host timebase so that vcpu->arch.dec_expires is a
      host timebase value.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      93b0f4dc
    • P
      KVM: PPC: Book3S HV: Save/restore SIAR and SDAR along with other PMU registers · 14941789
      Paul Mackerras 提交于
      Currently we are not saving and restoring the SIAR and SDAR registers in
      the PMU (performance monitor unit) on guest entry and exit.  The result
      is that performance monitoring tools in the guest could get false
      information about where a program was executing and what data it was
      accessing at the time of a performance monitor interrupt.  This fixes
      it by saving and restoring these registers along with the other PMU
      registers on guest entry/exit.
      
      This also provides a way for userspace to access these values for a
      vcpu via the one_reg interface.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      14941789
  2. 09 8月, 2013 1 次提交
    • M
      powerpc/tm: Fix context switching TAR, PPR and DSCR SPRs · 28e61cc4
      Michael Neuling 提交于
      If a transaction is rolled back, the Target Address Register (TAR), Processor
      Priority Register (PPR) and Data Stream Control Register (DSCR) should be
      restored to the checkpointed values before the transaction began.  Any changes
      to these SPRs inside the transaction should not be visible in the abort
      handler.
      
      Currently Linux doesn't save or restore the checkpointed TAR, PPR or DSCR.  If
      we preempt a processes inside a transaction which has modified any of these, on
      process restore, that same transaction may be aborted we but we won't see the
      checkpointed versions of these SPRs.
      
      This adds checkpointed versions of these SPRs to the thread_struct and adds the
      save/restore of these three SPRs to the treclaim/trechkpt code.
      
      Without this if any of these SPRs are modified during a transaction, users may
      incorrectly see a speculated SPR value even if the transaction is aborted.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Cc: <stable@vger.kernel.org> [v3.10]
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      28e61cc4
  3. 25 7月, 2013 1 次提交
  4. 01 7月, 2013 1 次提交
  5. 20 6月, 2013 1 次提交
    • B
      powerpc: Restore dbcr0 on user space exit · 13d543cd
      Bharat Bhushan 提交于
      On BookE (Branch taken + Single Step) is as same as Branch Taken
      on BookS and in Linux we simulate BookS behavior for BookE as well.
      When doing so, in Branch taken handling we want to set DBCR0_IC but
      we update the current->thread->dbcr0 and not DBCR0.
      
      Now on 64bit the current->thread.dbcr0 (and other debug registers)
      is synchronized ONLY on context switch flow. But after handling
      Branch taken in debug exception if we return back to user space
      without context switch then single stepping change (DBCR0_ICMP)
      does not get written in h/w DBCR0 and Instruction Complete exception
      does not happen.
      
      This fixes using ptrace reliably on BookE-PowerPC
      
      lmbench latency test (lat_syscall) Results are (they varies a little
      on each run)
      
      1) ./lat_syscall <action> /dev/shm/uImage
      
      action:	Open	read	write	stat	fstat	null
      Before:	3.8618	0.2017	0.2851	1.6789	0.2256	0.0856
      After:	3.8580	0.2017	0.2851	1.6955	0.2255	0.0856
      
      1) ./lat_syscall -P 2 -N 10 <action> /dev/shm/uImage
      action:	Open	read	write	stat	fstat	null
      Before:	4.1388	0.2238	0.3066	1.7106	0.2256	0.0856
      After:	4.1413	0.2236	0.3062	1.7107	0.2256	0.0856
      
      [ Slightly modified to avoid extra branch in the fast path
        on Book3S and fix build on all non-BookE 64-bit -- BenH
      ]
      Signed-off-by: NBharat Bhushan <bharat.bhushan@freescale.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      13d543cd
  6. 24 5月, 2013 1 次提交
  7. 02 5月, 2013 1 次提交
  8. 27 4月, 2013 2 次提交
    • B
      KVM: PPC: Book3S HV: Speed up wakeups of CPUs on HV KVM · 54695c30
      Benjamin Herrenschmidt 提交于
      Currently, we wake up a CPU by sending a host IPI with
      smp_send_reschedule() to thread 0 of that core, which will take all
      threads out of the guest, and cause them to re-evaluate their
      interrupt status on the way back in.
      
      This adds a mechanism to differentiate real host IPIs from IPIs sent
      by KVM for guest threads to poke each other, in order to target the
      guest threads precisely when possible and avoid that global switch of
      the core to host state.
      
      We then use this new facility in the in-kernel XICS code.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      54695c30
    • P
      KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map · c35635ef
      Paul Mackerras 提交于
      At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
      done by the host to the virtual processor areas (VPAs) and dispatch
      trace logs (DTLs) registered by the guest.  This is because those
      modifications are done either in real mode or in the host kernel
      context, and in neither case does the access go through the guest's
      HPT, and thus no change (C) bit gets set in the guest's HPT.
      
      However, the changes done by the host do need to be tracked so that
      the modified pages get transferred when doing live migration.  In
      order to track these modifications, this adds a dirty flag to the
      struct representing the VPA/DTL areas, and arranges to set the flag
      when the VPA/DTL gets modified by the host.  Then, when we are
      collecting the dirty log, we also check the dirty flags for the
      VPA and DTL for each vcpu and set the relevant bit in the dirty log
      if necessary.  Doing this also means we now need to keep track of
      the guest physical address of the VPA/DTL areas.
      
      So as not to lose track of modifications to a VPA/DTL area when it gets
      unregistered, or when a new area gets registered in its place, we need
      to transfer the dirty state to the rmap chain.  This adds code to
      kvmppc_unpin_guest_page() to do that if the area was dirty.  To simplify
      that code, we now require that all VPA, DTL and SLB shadow buffer areas
      fit within a single host page.  Guests already comply with this
      requirement because pHyp requires that these areas not cross a 4k
      boundary.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c35635ef
  9. 22 3月, 2013 1 次提交
  10. 15 2月, 2013 3 次提交
  11. 13 2月, 2013 1 次提交
  12. 08 2月, 2013 1 次提交
  13. 10 1月, 2013 1 次提交
  14. 06 12月, 2012 1 次提交
    • P
      KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations · 1b400ba0
      Paul Mackerras 提交于
      When we change or remove a HPT (hashed page table) entry, we can do
      either a global TLB invalidation (tlbie) that works across the whole
      machine, or a local invalidation (tlbiel) that only affects this core.
      Currently we do local invalidations if the VM has only one vcpu or if
      the guest requests it with the H_LOCAL flag, though the guest Linux
      kernel currently doesn't ever use H_LOCAL.  Then, to cope with the
      possibility that vcpus moving around to different physical cores might
      expose stale TLB entries, there is some code in kvmppc_hv_entry to
      flush the whole TLB of entries for this VM if either this vcpu is now
      running on a different physical core from where it last ran, or if this
      physical core last ran a different vcpu.
      
      There are a number of problems on POWER7 with this as it stands:
      
      - The TLB invalidation is done per thread, whereas it only needs to be
        done per core, since the TLB is shared between the threads.
      - With the possibility of the host paging out guest pages, the use of
        H_LOCAL by an SMP guest is dangerous since the guest could possibly
        retain and use a stale TLB entry pointing to a page that had been
        removed from the guest.
      - The TLB invalidations that we do when a vcpu moves from one physical
        core to another are unnecessary in the case of an SMP guest that isn't
        using H_LOCAL.
      - The optimization of using local invalidations rather than global should
        apply to guests with one virtual core, not just one vcpu.
      
      (None of this applies on PPC970, since there we always have to
      invalidate the whole TLB when entering and leaving the guest, and we
      can't support paging out guest memory.)
      
      To fix these problems and simplify the code, we now maintain a simple
      cpumask of which cpus need to flush the TLB on entry to the guest.
      (This is indexed by cpu, though we only ever use the bits for thread
      0 of each core.)  Whenever we do a local TLB invalidation, we set the
      bits for every cpu except the bit for thread 0 of the core that we're
      currently running on.  Whenever we enter a guest, we test and clear the
      bit for our core, and flush the TLB if it was set.
      
      On initial startup of the VM, and when resetting the HPT, we set all the
      bits in the need_tlb_flush cpumask, since any core could potentially have
      stale TLB entries from the previous VM to use the same LPID, or the
      previous contents of the HPT.
      
      Then, we maintain a count of the number of online virtual cores, and use
      that when deciding whether to use a local invalidation rather than the
      number of online vcpus.  The code to make that decision is extracted out
      into a new function, global_invalidates().  For multi-core guests on
      POWER7 (i.e. when we are using mmu notifiers), we now never do local
      invalidations regardless of the H_LOCAL flag.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1b400ba0
  15. 07 9月, 2012 1 次提交
  16. 05 9月, 2012 1 次提交
  17. 11 7月, 2012 1 次提交
    • A
      powerpc: Add VDSO version of getcpu · 18ad51dd
      Anton Blanchard 提交于
      We have a request for a fast method of getting CPU and NUMA node IDs
      from userspace. This patch implements a getcpu VDSO function,
      similar to x86.
      
      Ben suggested we use SPRG3 which is userspace readable. SPRG3 can be
      modified by a KVM guest, so we save the SPRG3 value in the paca and
      restore it when transitioning from the guest to the host.
      
      I have a glibc patch that implements sched_getcpu on top of this.
      Testing on a POWER7:
      
      baseline: 538 cycles
      vdso:      30 cycles
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      18ad51dd
  18. 30 4月, 2012 1 次提交
  19. 08 4月, 2012 4 次提交
    • P
      KVM: PPC: Book 3S: Fix compilation for !HV configs · 7657f408
      Paul Mackerras 提交于
      Commits 2f5cdd5487 ("KVM: PPC: Book3S HV: Make secondary threads more
      robust against stray IPIs") and 1c2066b0f7 ("KVM: PPC: Book3S HV: Make
      virtual processor area registration more robust") added fields to
      struct kvm_vcpu_arch inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions,
      and added lines to arch/powerpc/kernel/asm-offsets.c to generate
      assembler constants for their offsets.  Unfortunately this led to
      compile errors on Book 3S machines for configs that had KVM enabled
      but not CONFIG_KVM_BOOK3S_64_HV.  This fixes the problem by moving
      the offending lines inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      7657f408
    • P
      KVM: PPC: Book3S HV: Make virtual processor area registration more robust · 2e25aa5f
      Paul Mackerras 提交于
      The PAPR API allows three sorts of per-virtual-processor areas to be
      registered (VPA, SLB shadow buffer, and dispatch trace log), and
      furthermore, these can be registered and unregistered for another
      virtual CPU.  Currently we just update the vcpu fields pointing to
      these areas at the time of registration or unregistration.  If this
      is done on another vcpu, there is the possibility that the target vcpu
      is using those fields at the time and could end up using a bogus
      pointer and corrupting memory.
      
      This fixes the race by making the target cpu itself do the update, so
      we can be sure that the update happens at a time when the fields
      aren't being used.  Each area now has a struct kvmppc_vpa which is
      used to manage these updates.  There is also a spinlock which protects
      access to all of the kvmppc_vpa structs, other than to the pinned_addr
      fields.  (We could have just taken the spinlock when using the vpa,
      slb_shadow or dtl fields, but that would mean taking the spinlock on
      every guest entry and exit.)
      
      This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
      which is what the rest of the kernel uses.
      
      Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
      the need to initialize vcpu->arch.vpa_update_lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      2e25aa5f
    • P
      KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs · f0888f70
      Paul Mackerras 提交于
      Currently on POWER7, if we are running the guest on a core and we don't
      need all the hardware threads, we do nothing to ensure that the unused
      threads aren't executing in the kernel (other than checking that they
      are offline).  We just assume they're napping and we don't do anything
      to stop them trying to enter the kernel while the guest is running.
      This means that a stray IPI can wake up the hardware thread and it will
      then try to enter the kernel, but since the core is in guest context,
      it will execute code from the guest in hypervisor mode once it turns the
      MMU on, which tends to lead to crashes or hangs in the host.
      
      This fixes the problem by adding two new one-byte flags in the
      kvmppc_host_state structure in the PACA which are used to interlock
      between the primary thread and the unused secondary threads when entering
      the guest.  With these flags, the primary thread can ensure that the
      unused secondaries are not already in kernel mode (i.e. handling a stray
      IPI) and then indicate that they should not try to enter the kernel
      if they do get woken for any reason.  Instead they will go into KVM code,
      find that there is no vcpu to run, acknowledge and clear the IPI and go
      back to nap mode.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      f0888f70
    • S
      KVM: PPC: booke: category E.HV (GS-mode) support · d30f6e48
      Scott Wood 提交于
      Chips such as e500mc that implement category E.HV in Power ISA 2.06
      provide hardware virtualization features, including a new MSR mode for
      guest state.  The guest OS can perform many operations without trapping
      into the hypervisor, including transitions to and from guest userspace.
      
      Since we can use SRR1[GS] to reliably tell whether an exception came from
      guest state, instead of messing around with IVPR, we use DO_KVM similarly
      to book3s.
      
      Current issues include:
       - Machine checks from guest state are not routed to the host handler.
       - The guest can cause a host oops by executing an emulated instruction
         in a page that lacks read permission.  Existing e500/4xx support has
         the same problem.
      
      Includes work by Ashish Kalra <Ashish.Kalra@freescale.com>,
      Varun Sethi <Varun.Sethi@freescale.com>, and
      Liu Yu <yu.liu@freescale.com>.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      [agraf: remove pt_regs usage]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      d30f6e48
  20. 21 3月, 2012 1 次提交
  21. 09 3月, 2012 1 次提交
    • B
      powerpc: Rework lazy-interrupt handling · 7230c564
      Benjamin Herrenschmidt 提交于
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      interrupts.
      
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      
      The base idea is to replace the "hard_enabled" field with a
      "irq_happened" field in which we store a bit mask of what interrupt
      occurred while soft-disabled.
      
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      field.
      
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      
      In addition, this adds a few refinements:
      
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
      
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
      
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
      
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2:
      
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
      
      v3:
      
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
      
      v4:
      
       - Fix resend of doorbells on return from interrupt on Book3E
      
      v5:
      
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
      
      v6:
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
      
      v7:
       - Fix a bug with hard irq state tracking on native power7
      7230c564
  22. 05 3月, 2012 2 次提交
    • P
      KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Paul Mackerras 提交于
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but since
      heavily reworked.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      697d3899
    • S
      KVM: PPC: Paravirtualize SPRG4-7, ESR, PIR, MASn · b5904972
      Scott Wood 提交于
      This allows additional registers to be accessed by the guest
      in PR-mode KVM without trapping.
      
      SPRG4-7 are readable from userspace.  On booke, KVM will sync
      these registers when it enters the guest, so that accesses from
      guest userspace will work.  The guest kernel, OTOH, must consistently
      use either the real registers or the shared area between exits.  This
      also applies to the already-paravirted SPRG3.
      
      On non-booke, it's not clear to what extent SPRG4-7 are supported
      (they're not architected for book3s, but exist on at least some classic
      chips).  They are copied in the get/set regs ioctls, but I do not see any
      non-booke emulation.  I also do not see any syncing with real registers
      (in PR-mode) including the user-readable SPRG3.  This patch should not
      make that situation any worse.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b5904972
  23. 08 12月, 2011 1 次提交
    • P
      powerpc: Provide a way for KVM to indicate that NV GPR values are lost · 2fde6d20
      Paul Mackerras 提交于
      This fixes a problem where a CPU thread coming out of nap mode can
      think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
      away in power7_idle, but in fact the values have been trashed because
      the thread was used for KVM in the mean time.  The result is that the
      thread crashes because code that called power7_idle (e.g.,
      pnv_smp_cpu_kill_self()) goes to use values in registers that have
      been trashed.
      
      The bit field in SRR1 that tells whether state was lost only reflects
      the most recent nap, which may not have been the nap instruction in
      power7_idle.  So we need an extra PACA field to indicate that state
      has been lost even if SRR1 indicates that the most recent nap didn't
      lose state.  We clear this field when saving the state in power7_idle,
      we set it to a non-zero value when we use the thread for KVM, and we
      test it in power7_wakeup_noloss.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2fde6d20
  24. 26 9月, 2011 2 次提交
    • P
      KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code · 19ccb76a
      Paul Mackerras 提交于
      With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
      core), whenever a CPU goes idle, we have to pull all the other
      hardware threads in the core out of the guest, because the H_CEDE
      hcall is handled in the kernel.  This is inefficient.
      
      This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
      in real mode.  When a guest vcpu does an H_CEDE hcall, we now only
      exit to the kernel if all the other vcpus in the same core are also
      idle.  Otherwise we mark this vcpu as napping, save state that could
      be lost in nap mode (mainly GPRs and FPRs), and execute the nap
      instruction.  When the thread wakes up, because of a decrementer or
      external interrupt, we come back in at kvm_start_guest (from the
      system reset interrupt vector), find the `napping' flag set in the
      paca, and go to the resume path.
      
      This has some other ramifications.  First, when starting a core, we
      now start all the threads, both those that are immediately runnable and
      those that are idle.  This is so that we don't have to pull all the
      threads out of the guest when an idle thread gets a decrementer interrupt
      and wants to start running.  In fact the idle threads will all start
      with the H_CEDE hcall returning; being idle they will just do another
      H_CEDE immediately and go to nap mode.
      
      This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
      These functions have been restructured to make them simpler and clearer.
      We introduce a level of indirection in the wait queue that gets woken
      when external and decrementer interrupts get generated for a vcpu, so
      that we can have the 4 vcpus in a vcore using the same wait queue.
      We need this because the 4 vcpus are being handled by one thread.
      
      Secondly, when we need to exit from the guest to the kernel, we now
      have to generate an IPI for any napping threads, because an HDEC
      interrupt doesn't wake up a napping thread.
      
      Thirdly, we now need to be able to handle virtual external interrupts
      and decrementer interrupts becoming pending while a thread is napping,
      and deliver those interrupts to the guest when the thread wakes.
      This is done in kvmppc_cede_reentry, just before fast_guest_return.
      
      Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
      and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
      from kvm_arch_vcpu_runnable.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      19ccb76a
    • P
      KVM: PPC: book3s_pr: Simplify transitions between virtual and real mode · 02143947
      Paul Mackerras 提交于
      This simplifies the way that the book3s_pr makes the transition to
      real mode when entering the guest.  We now call kvmppc_entry_trampoline
      (renamed from kvmppc_rmcall) in the base kernel using a normal function
      call instead of doing an indirect call through a pointer in the vcpu.
      If kvm is a module, the module loader takes care of generating a
      trampoline as it does for other calls to functions outside the module.
      
      kvmppc_entry_trampoline then disables interrupts and jumps to
      kvmppc_handler_trampoline_enter in real mode using an rfi[d].
      That then uses the link register as the address to return to
      (potentially in module space) when the guest exits.
      
      This also simplifies the way that we call the Linux interrupt handler
      when we exit the guest due to an external, decrementer or performance
      monitor interrupt.  Instead of turning on the MMU, then deciding that
      we need to call the Linux handler and turning the MMU back off again,
      we now go straight to the handler at the point where we would turn the
      MMU on.  The handler will then return to the virtual-mode code
      (potentially in the module).
      
      Along the way, this moves the setting and clearing of the HID5 DCBZ32
      bit into real-mode interrupts-off code, and also makes sure that
      we clear the MSR[RI] bit before loading values into SRR0/1.
      
      The net result is that we no longer need any code addresses to be
      stored in vcpu->arch.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      02143947
  25. 20 9月, 2011 1 次提交