1. 28 4月, 2017 3 次提交
    • N
      powerpc/64s: Dedicated system reset interrupt stack · b1ee8a3d
      Nicholas Piggin 提交于
      The system reset interrupt is used for crash/debug situations, so it is
      desirable to have as little impact on the normal state of the system as
      possible.
      
      Currently it uses the current kernel stack to process the exception.
      This stores into the stack which may be involved with the crash. The
      stack pointer may be corrupted, or it may have overflowed.
      
      Avoid or minimise these problems by creating a dedicated NMI stack for
      the system reset interrupt to use.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      b1ee8a3d
    • N
      powerpc/64s: Disallow system reset vs system reset reentrancy · c4f3b52c
      Nicholas Piggin 提交于
      In preparation for using a dedicated stack for system reset interrupts,
      prevent a nested system reset from recovering, in order to simplify
      code that is called in crash/debug path. This allows a system reset
      interrupt to just use the base stack pointer.
      
      Keep an in_nmi nesting counter similarly to the in_mce counter. Consider
      the interrrupt non-recoverable if it is taken inside another system
      reset.
      
      Interrupt nesting could be allowed similarly to MCE, but system reset
      is a special case that's not for normal operation, so simplicity wins
      until there is requirement for nested system reset interrupts.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c4f3b52c
    • N
      powerpc/64s: Fix system reset vs general interrupt reentrancy · a3d96f70
      Nicholas Piggin 提交于
      The system reset interrupt can occur when MSR_EE=0, and it currently
      uses the PACA_EXGEN save area.
      
      Some PACA_EXGEN interrupts have a window where MSR_RI=1 and MSR_EE=0
      when the save area is still in use. A system reset interrupt in this
      window can lead to undetected corruption when the save area gets
      overwritten.
      
      This patch introduces PACA_EXNMI save area for system reset exceptions,
      which closes this corruption window. It's also helpful to retain the
      EXGEN state for debugging situations, even if not considering the
      recoverability aspect.
      
      This patch also moves the PACA_EXMC area down to a less frequently used
      part of the paca with the new save area.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a3d96f70
  2. 11 4月, 2017 1 次提交
    • G
      powerpc/powernv: Recover correct PACA on wakeup from a stop on P9 DD1 · 17ed4c8f
      Gautham R. Shenoy 提交于
      POWER9 DD1.0 hardware has a bug where the SPRs of a thread waking up
      from stop 0,1,2 with ESL=1 can endup being misplaced in the core. Thus
      the HSPRG0 of a thread waking up from can contain the paca pointer of
      its sibling.
      
      This patch implements a context recovery framework within threads of a
      core, by provisioning space in paca_struct for saving every sibling
      threads's paca pointers. Basically, we should be able to arrive at the
      right paca pointer from any of the thread's existing paca pointer.
      
      At bootup, during powernv idle-init, we save the paca address of every
      CPU in each one its siblings paca_struct in the slot corresponding to
      this CPU's index in the core.
      
      On wakeup from a stop, the thread will determine its index in the core
      from the TIR register and recover its PACA pointer by indexing into
      the correct slot in the provisioned space in the current PACA.
      
      Furthermore, ensure that the NVGPRs are restored from the stack on the
      way out by setting the NAPSTATELOST in paca.
      
      [Changelog written with inputs from svaidy@linux.vnet.ibm.com]
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Call it a bug]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      17ed4c8f
  3. 01 4月, 2017 1 次提交
  4. 31 3月, 2017 1 次提交
  5. 14 1月, 2017 1 次提交
  6. 09 9月, 2016 1 次提交
    • P
      powerpc: move hmi.c to arch/powerpc/kvm/ · 3f257774
      Paolo Bonzini 提交于
      hmi.c functions are unused unless sibling_subcore_state is nonzero, and
      that in turn happens only if KVM is in use.  So move the code to
      arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_HV_POSSIBLE
      rather than CONFIG_PPC_BOOK3S_64.  The sibling_subcore_state is also
      included in struct paca_struct only if KVM is supported by the kernel.
      
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: kvm-ppc@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3f257774
  7. 22 8月, 2016 1 次提交
    • P
      powerpc: move hmi.c to arch/powerpc/kvm/ · 7c379526
      Paolo Bonzini 提交于
      hmi.c functions are unused unless sibling_subcore_state is nonzero, and
      that in turn happens only if KVM is in use.  So move the code to
      arch/powerpc/kvm/, putting it under CONFIG_KVM_BOOK3S_HV_POSSIBLE
      rather than CONFIG_PPC_BOOK3S_64.  The sibling_subcore_state is also
      included in struct paca_struct only if KVM is supported by the kernel.
      
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: kvm-ppc@vger.kernel.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      7c379526
  8. 09 7月, 2016 1 次提交
  9. 20 6月, 2016 1 次提交
    • M
      KVM: PPC: Book3S HV: Fix TB corruption in guest exit path on HMI interrupt · fd7bacbc
      Mahesh Salgaonkar 提交于
      When a guest is assigned to a core it converts the host Timebase (TB)
      into guest TB by adding guest timebase offset before entering into
      guest. During guest exit it restores the guest TB to host TB. This means
      under certain conditions (Guest migration) host TB and guest TB can differ.
      
      When we get an HMI for TB related issues the opal HMI handler would
      try fixing errors and restore the correct host TB value. With no guest
      running, we don't have any issues. But with guest running on the core
      we run into TB corruption issues.
      
      If we get an HMI while in the guest, the current HMI handler invokes opal
      hmi handler before forcing guest to exit. The guest exit path subtracts
      the guest TB offset from the current TB value which may have already
      been restored with host value by opal hmi handler. This leads to incorrect
      host and guest TB values.
      
      With split-core, things become more complex. With split-core, TB also gets
      split and each subcore gets its own TB register. When a hmi handler fixes
      a TB error and restores the TB value, it affects all the TB values of
      sibling subcores on the same core. On TB errors all the thread in the core
      gets HMI. With existing code, the individual threads call opal hmi handle
      independently which can easily throw TB out of sync if we have guest
      running on subcores. Hence we will need to co-ordinate with all the
      threads before making opal hmi handler call followed by TB resync.
      
      This patch introduces a sibling subcore state structure (shared by all
      threads in the core) in paca which holds information about whether sibling
      subcores are in Guest mode or host mode. An array in_guest[] of size
      MAX_SUBCORE_PER_CORE=4 is used to maintain the state of each subcore.
      The subcore id is used as index into in_guest[] array. Only primary
      thread entering/exiting the guest is responsible to set/unset its
      designated array element.
      
      On TB error, we get HMI interrupt on every thread on the core. Upon HMI,
      this patch will now force guest to vacate the core/subcore. Primary
      thread from each subcore will then turn off its respective bit
      from the above bitmap during the guest exit path just after the
      guest->host partition switch is complete.
      
      All other threads that have just exited the guest OR were already in host
      will wait until all other subcores clears their respective bit.
      Once all the subcores turn off their respective bit, all threads will
      will make call to opal hmi handler.
      
      It is not necessary that opal hmi handler would resync the TB value for
      every HMI interrupts. It would do so only for the HMI caused due to
      TB errors. For rest, it would not touch TB value. Hence to make things
      simpler, primary thread would call TB resync explicitly once for each
      core immediately after opal hmi handler instead of subtracting guest
      offset from TB. TB resync call will restore the TB with host value.
      Thus we can be sure about the TB state.
      
      One of the primary threads exiting the guest will take up the
      responsibility of calling TB resync. It will use one of the top bits
      (bit 63) from subcore state flags bitmap to make the decision. The first
      primary thread (among the subcores) that is able to set the bit will
      have to call the TB resync. Rest all other threads will wait until TB
      resync is complete.  Once TB resync is complete all threads will then
      proceed.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      fd7bacbc
  10. 09 1月, 2016 1 次提交
  11. 27 12月, 2015 1 次提交
  12. 19 12月, 2015 1 次提交
  13. 01 4月, 2015 1 次提交
    • K
      powerpc: book3e_64: fix the align size for paca_struct · 016f8cf0
      Kevin Hao 提交于
      All the cache line size of the current book3e 64bit SoCs are 64 bytes.
      So we should use this size to align the member of paca_struct.
      This only change the paca_struct's members which are private to book3e
      CPUs, and should not have any effect to book3s ones. With this, we save
      192 bytes. Also change it to __aligned(size) since it is preferred over
      __attribute__((aligned(size))).
      
      Before:
      	/* size: 1920, cachelines: 30, members: 46 */
      	/* sum members: 1667, holes: 6, sum holes: 141 */
      	/* padding: 112 */
      
      After:
      	/* size: 1728, cachelines: 27, members: 46 */
      	/* sum members: 1667, holes: 4, sum holes: 13 */
      	/* padding: 48 */
      Signed-off-by: NKevin Hao <haokexin@gmail.com>
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      016f8cf0
  14. 15 12月, 2014 2 次提交
    • S
      powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Shreyas B. Prabhu 提交于
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet i.e core, private L2 and private L3
      is turned off. As a result it gives higher powersavings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contests of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      
      With hypervisor state restore things fall under three categories-
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distingush first thread in core and subcore.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      77b54e9f
    • S
      powernv/cpuidle: Redesign idle states management · 7cba160a
      Shreyas B. Prabhu 提交于
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      state.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only is
      suboptimal, but can cause functionality issues when subcores and kvm is
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7cba160a
  15. 02 12月, 2014 1 次提交
  16. 05 8月, 2014 1 次提交
  17. 28 7月, 2014 1 次提交
  18. 28 5月, 2014 1 次提交
    • S
      powerpc: Fix regression of per-CPU DSCR setting · 1739ea9e
      Sam bobroff 提交于
      Since commit "efcac658 powerpc: Per process DSCR + some fixes (try#4)"
      it is no longer possible to set the DSCR on a per-CPU basis.
      
      The old behaviour was to minipulate the DSCR SPR directly but this is no
      longer sufficient: the value is quickly overwritten by context switching.
      
      This patch stores the per-CPU DSCR value in a kernel variable rather than
      directly in the SPR and it is used whenever a process has not set the DSCR
      itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.
      
      Writes to the old global default (/sys/devices/system/cpu/dscr_default)
      now set all of the per-CPU values and reads return the last written value.
      
      The new per-CPU default is added to the paca_struct and is used everywhere
      outside of sysfs.c instead of the old global default.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1739ea9e
  19. 20 3月, 2014 2 次提交
    • S
      powerpc/booke64: Critical and machine check exception support · 609af38f
      Scott Wood 提交于
      Add special state saving for critical and machine check exceptions.
      
      Most of this code could be used to handle debug exceptions taken from
      kernel space, but actually doing so is outside the scope of this patch.
      
      The various critical and machine check exceptions now point to their
      real handlers, rather than hanging the kernel.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      609af38f
    • S
      powerpc/booke64: Use SPRG7 for VDSO · 9d378dfa
      Scott Wood 提交于
      Previously SPRG3 was marked for use by both VDSO and critical
      interrupts (though critical interrupts were not fully implemented).
      
      In commit 8b64a9df ("powerpc/booke64:
      Use SPRG0/3 scratch for bolted TLB miss & crit int"), Mihai Caraman
      made an attempt to resolve this conflict by restoring the VDSO value
      early in the critical interrupt, but this has some issues:
      
       - It's incompatible with EXCEPTION_COMMON which restores r13 from the
         by-then-overwritten scratch (this cost me some debugging time).
       - It forces critical exceptions to be a special case handled
         differently from even machine check and debug level exceptions.
       - It didn't occur to me that it was possible to make this work at all
         (by doing a final "ld r13, PACA_EXCRIT+EX_R13(r13)") until after
         I made (most of) this patch. :-)
      
      It might be worth investigating using a load rather than SPRG on return
      from all exceptions (except TLB misses where the scratch never leaves
      the SPRG) -- it could save a few cycles.  Until then, let's stick with
      SPRG for all exceptions.
      
      Since we cannot use SPRG4-7 for scratch without corrupting the state of
      a KVM guest, move VDSO to SPRG7 on book3e.  Since neither SPRG4-7 nor
      critical interrupts exist on book3s, SPRG3 is still used for VDSO
      there.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Cc: Mihai Caraman <mihai.caraman@freescale.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: kvm-ppc@vger.kernel.org
      9d378dfa
  20. 15 1月, 2014 1 次提交
  21. 10 1月, 2014 1 次提交
    • S
      powerpc/e6500: TLB miss handler with hardware tablewalk support · 28efc35f
      Scott Wood 提交于
      There are a few things that make the existing hw tablewalk handlers
      unsuitable for e6500:
      
       - Indirect entries go in TLB1 (though the resulting direct entries go in
         TLB0).
      
       - It has threads, but no "tlbsrx." -- so we need a spinlock and
         a normal "tlbsx".  Because we need this lock, hardware tablewalk
         is mandatory on e6500 unless we want to add spinlock+tlbsx to
         the normal bolted TLB miss handler.
      
       - TLB1 has no HES (nor next-victim hint) so we need software round robin
         (TODO: integrate this round robin data with hugetlb/KVM)
      
       - The existing tablewalk handlers map half of a page table at a time,
         because IBM hardware has a fixed 1MiB indirect page size.  e6500
         has variable size indirect entries, with a minimum of 2MiB.
         So we can't do the half-page indirect mapping, and even if we
         could it would be less efficient than mapping the full page.
      
       - Like on e5500, the linear mapping is bolted, so we don't need the
         overhead of supporting nested tlb misses.
      
      Note that hardware tablewalk does not work in rev1 of e6500.
      We do not expect to support e6500 rev1 in mainline Linux.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Cc: Mihai Caraman <mihai.caraman@freescale.com>
      28efc35f
  22. 05 12月, 2013 1 次提交
  23. 17 10月, 2013 1 次提交
  24. 14 8月, 2013 2 次提交
    • A
      powerpc: Make rwlocks endian safe · 54bb7f4b
      Anton Blanchard 提交于
      Our ppc64 spinlocks and rwlocks use a trick where a lock token and
      the paca index are placed in the lock with a single store. Since we
      are using two u16s they need adjusting for little endian.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      54bb7f4b
    • M
      powerpc: Avoid link stack corruption for MMU on exceptions · bc2e6c6a
      Michael Neuling 提交于
      When we have MMU on exceptions (POWER8) and a relocatable kernel, we
      need to branch from the initial exception vectors at 0x0 to up high
      where the kernel might be located.  Currently we do this using the link
      register.
      
      Unfortunately this corrupts the link stack and instead we should use the
      count register.  We did this for the syscall entry path in:
        6a404806 powerpc: Avoid link stack corruption in MMU on syscall entry path
      but I stupidly forgot to do the same for other exceptions.
      
      This patch changes the initial exception vectors to use the count
      register instead of the link register when we need to branch up to the
      relocated kernel.
      
      I have a dodgy userspace test which loops calling a function that reads
      the PVR (mfpvr in userspace will be emulated by the kernel via the
      program check exception).  On POWER8 and with CONFIG_RELOCATABLE=y, I
      get a ~10% performance improvement with my userspace test with this
      patch.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      bc2e6c6a
  25. 15 2月, 2013 2 次提交
  26. 10 1月, 2013 1 次提交
  27. 17 9月, 2012 1 次提交
  28. 07 9月, 2012 1 次提交
  29. 09 3月, 2012 1 次提交
    • B
      powerpc: Rework lazy-interrupt handling · 7230c564
      Benjamin Herrenschmidt 提交于
      The current implementation of lazy interrupts handling has some
      issues that this tries to address.
      
      We don't do the various workarounds we need to do when re-enabling
      interrupts in some cases such as when returning from an interrupt
      and thus we may still lose or get delayed decrementer or doorbell
      interrupts.
      
      The current scheme also makes it much harder to handle the external
      "edge" interrupts provided by some BookE processors when using the
      EPR facility (External Proxy) and the Freescale Hypervisor.
      
      Additionally, we tend to keep interrupts hard disabled in a number
      of cases, such as decrementer interrupts, external interrupts, or
      when a masked decrementer interrupt is pending. This is sub-optimal.
      
      This is an attempt at fixing it all in one go by reworking the way
      we do the lazy interrupt disabling from the ground up.
      
      The base idea is to replace the "hard_enabled" field with a
      "irq_happened" field in which we store a bit mask of what interrupt
      occurred while soft-disabled.
      
      When re-enabling, either via arch_local_irq_restore() or when returning
      from an interrupt, we can now decide what to do by testing bits in that
      field.
      
      We then implement replaying of the missed interrupts either by
      re-using the existing exception frame (in exception exit case) or via
      the creation of a new one from an assembly trampoline (in the
      arch_local_irq_enable case).
      
      This removes the need to play with the decrementer to try to create
      fake interrupts, among others.
      
      In addition, this adds a few refinements:
      
       - We no longer  hard disable decrementer interrupts that occur
      while soft-disabled. We now simply bump the decrementer back to max
      (on BookS) or leave it stopped (on BookE) and continue with hard interrupts
      enabled, which means that we'll potentially get better sample quality from
      performance monitor interrupts.
      
       - Timer, decrementer and doorbell interrupts now hard-enable
      shortly after removing the source of the interrupt, which means
      they no longer run entirely hard disabled. Again, this will improve
      perf sample quality.
      
       - On Book3E 64-bit, we now make the performance monitor interrupt
      act as an NMI like Book3S (the necessary C code for that to work
      appear to already be present in the FSL perf code, notably calling
      nmi_enter instead of irq_enter). (This also fixes a bug where BookE
      perfmon interrupts could clobber r14 ... oops)
      
       - We could make "masked" decrementer interrupts act as NMIs when doing
      timer-based perf sampling to improve the sample quality.
      
      Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      ---
      
      v2:
      
      - Add hard-enable to decrementer, timer and doorbells
      - Fix CR clobber in masked irq handling on BookE
      - Make embedded perf interrupt act as an NMI
      - Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want
        to retrigger an interrupt without preventing hard-enable
      
      v3:
      
       - Fix or vs. ori bug on Book3E
       - Fix enabling of interrupts for some exceptions on Book3E
      
      v4:
      
       - Fix resend of doorbells on return from interrupt on Book3E
      
      v5:
      
       - Rebased on top of my latest series, which involves some significant
      rework of some aspects of the patch.
      
      v6:
       - 32-bit compile fix
       - more compile fixes with various .config combos
       - factor out the asm code to soft-disable interrupts
       - remove the C wrapper around preempt_schedule_irq
      
      v7:
       - Fix a bug with hard irq state tracking on native power7
      7230c564
  30. 08 12月, 2011 1 次提交
    • P
      powerpc: Provide a way for KVM to indicate that NV GPR values are lost · 2fde6d20
      Paul Mackerras 提交于
      This fixes a problem where a CPU thread coming out of nap mode can
      think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
      away in power7_idle, but in fact the values have been trashed because
      the thread was used for KVM in the mean time.  The result is that the
      thread crashes because code that called power7_idle (e.g.,
      pnv_smp_cpu_kill_self()) goes to use values in registers that have
      been trashed.
      
      The bit field in SRR1 that tells whether state was lost only reflects
      the most recent nap, which may not have been the nap instruction in
      power7_idle.  So we need an extra PACA field to indicate that state
      has been lost even if SRR1 indicates that the most recent nap didn't
      lose state.  We clear this field when saving the state in power7_idle,
      we set it to a non-zero value when we use the thread for KVM, and we
      test it in power7_wakeup_noloss.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2fde6d20
  31. 20 9月, 2011 1 次提交
  32. 12 7月, 2011 2 次提交
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras 提交于
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      de56a948
    • P
      KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu · 3c42bf8a
      Paul Mackerras 提交于
      There are several fields in struct kvmppc_book3s_shadow_vcpu that
      temporarily store bits of host state while a guest is running,
      rather than anything relating to the particular guest or vcpu.
      This splits them out into a new kvmppc_host_state structure and
      modifies the definitions in asm-offsets.c to suit.
      
      On 32-bit, we have a kvmppc_host_state structure inside the
      kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
      to get to them both with one pointer.  On 64-bit they are separate
      fields in the PACA.  This means that on 64-bit we don't need to
      copy the kvmppc_host_state in and out on vcpu load/unload, and
      in future will mean that the book3s_hv code doesn't need a
      shadow_vcpu struct in the PACA at all.  That does mean that we
      have to be careful not to rely on any values persisting in the
      hstate field of the paca across any point where we could block
      or get preempted.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      3c42bf8a
  33. 29 6月, 2011 1 次提交
    • S
      powerpc/book3e-64: use a separate TLB handler when linear map is bolted · f67f4ef5
      Scott Wood 提交于
      On MMUs such as FSL where we can guarantee the entire linear mapping is
      bolted, we don't need to worry about linear TLB misses.  If on top of
      that we do a full table walk, we get rid of all recursive TLB faults, and
      can dispense with some state saving.  This gains a few percent on
      TLB-miss-heavy workloads, and around 50% on a benchmark that had a high
      rate of virtual page table faults under the normal handler.
      
      While touching the EX_TLB layout, remove EX_TLB_MMUCR0, EX_TLB_SRR0, and
      EX_TLB_SRR1 as they're not used.
      
      [BenH: Fixed build with 64K pages (wsp config)]
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      f67f4ef5