1. 01 12月, 2015 1 次提交
    • P
      powerpc/64: Include KVM guest test in all interrupt vectors · 31a40e2b
      Paul Mackerras 提交于
      Currently, if HV KVM is configured but PR KVM isn't, we don't include
      a test to see whether we were interrupted in KVM guest context for the
      set of interrupts which get delivered directly to the guest by hardware
      if they occur in the guest.  This includes things like program
      interrupts.
      
      However, the recent bug where userspace could set the MSR for a VCPU
      to have an illegal value in the TS field, and thus cause a TM Bad Thing
      type of program interrupt on the hrfid that enters the guest, showed that
      we can never be completely sure that these interrupts can never occur
      in the guest entry/exit code.  If one of these interrupts does happen
      and we have HV KVM configured but not PR KVM, then we end up trying to
      run the handler in the host with the MMU set to the guest MMU context,
      which generally ends badly.
      
      Thus, for robustness it is better to have the test in every interrupt
      vector, so that if some way is found to trigger some interrupt in the
      guest entry/exit path, we can handle it without immediately crashing
      the host.
      
      This means that the distinction between KVMTEST and KVMTEST_PR goes
      away.  Thus we delete KVMTEST_PR and associated macros and use KVMTEST
      everywhere that we previously used either KVMTEST_PR or KVMTEST.  It
      also means that SOFTEN_TEST_HV_201 becomes the same as SOFTEN_TEST_PR,
      so we deleted SOFTEN_TEST_HV_201 and use SOFTEN_TEST_PR instead.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      31a40e2b
  2. 02 6月, 2015 2 次提交
  3. 23 3月, 2015 1 次提交
  4. 15 12月, 2014 2 次提交
    • S
      powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Shreyas B. Prabhu 提交于
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet i.e core, private L2 and private L3
      is turned off. As a result it gives higher powersavings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contests of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      
      With hypervisor state restore things fall under three categories-
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distingush first thread in core and subcore.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      77b54e9f
    • S
      powernv/cpuidle: Redesign idle states management · 7cba160a
      Shreyas B. Prabhu 提交于
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      state.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only is
      suboptimal, but can cause functionality issues when subcores and kvm is
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7cba160a
  5. 08 12月, 2014 1 次提交
    • P
      powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0
      Paul Mackerras 提交于
      When a secondary hardware thread has finished running a KVM guest, we
      currently put that thread into nap mode using a nap instruction in
      the KVM code.  This changes the code so that instead of doing a nap
      instruction directly, we instead cause the call to power7_nap() that
      put the thread into nap mode to return.  The reason for doing this is
      to avoid having the KVM code having to know what low-power mode to
      put the thread into.
      
      In the case of a secondary thread used to run a KVM guest, the thread
      will be offline from the point of view of the host kernel, and the
      relevant power7_nap() call is the one in pnv_smp_cpu_disable().
      In this case we don't want to clear pending IPIs in the offline loop
      in that function, since that might cause us to miss the wakeup for
      the next time the thread needs to run a guest.  To tell whether or
      not to clear the interrupt, we use the SRR1 value returned from
      power7_nap(), and check if it indicates an external interrupt.  We
      arrange that the return from power7_nap() when we have finished running
      a guest returns 0, so pending interrupts don't get flushed in that
      case.
      
      Note that it is important a secondary thread that has finished
      executing in the guest, or that didn't have a guest to run, should
      not return to power7_nap's caller while the kvm_hstate.hwthread_req
      flag in the PACA is non-zero, because the return from power7_nap
      will reenable the MMU, and the MMU might still be in guest context.
      In this situation we spin at low priority in real mode waiting for
      hwthread_req to become zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      56548fc0
  6. 05 12月, 2014 1 次提交
    • A
      powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Aneesh Kumar K.V 提交于
      upatepp can get called for a nohpte fault when we find from the linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at linux pte permission bits when filling hash pte
      permission bits. We also hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a paralle update of
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start which cause a hpte invalidate and not updatepp.
      
      Performance number:
      We use randbox_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      aefa5688
  7. 02 12月, 2014 1 次提交
  8. 12 11月, 2014 1 次提交
    • S
      powerpc: Save/restore PPR for KVM hypercalls · 8b91a255
      Suresh E. Warrier 提交于
      The system call FLIH (first-level interrupt handler) at 0xc00
      unconditionally sets hardware priority to medium. For hypercalls, this
      means we lose guest OS priority. The front end (do_kvm_0x**) to the
      KVM interrupt handler always assumes that PPR priority is saved in
      PACA exception save area, so it copies this to the kvm_hstate
      structure. For hypercalls, this would be the saved priority from any
      previous exception. Eventually, the guest gets resumed with an
      incorrect priority.
      
      The fix is to save the PPR priority in PACA exception save area before
      switching HMT priorities in the FLIH so that existing code described above
      in the KVM interrupt handler can copy it from there into the VCPU's saved
      context.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [mpe: Dropped HMT_MEDIUM_PPR_DISCARD and reworded comment]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8b91a255
  9. 10 10月, 2014 1 次提交
  10. 13 8月, 2014 1 次提交
    • G
      powerpc: Fix "attempt to move .org backwards" error · 11d54904
      Guenter Roeck 提交于
      Once again, we see
      
      arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
      arch/powerpc/kernel/exceptions-64s.S:865: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:866: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:890: Error: attempt to move .org backwards
      
      when compiling ppc:allmodconfig.
      
      This time the problem has been caused by to commit 0869b6fd
      ("powerpc/book3s: Add basic infrastructure to handle HMI in Linux"),
      which adds functions hmi_exception_early and hmi_exception_after_realmode
      into a critical (size-limited) code area, even though that does not appear
      to be necessary.
      
      Move those functions to a non-critical area of the file.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      11d54904
  11. 05 8月, 2014 1 次提交
  12. 28 7月, 2014 3 次提交
  13. 12 6月, 2014 1 次提交
  14. 11 6月, 2014 2 次提交
    • M
      powerpc/book3s: Add stack overflow check in machine check handler. · e75ad93a
      Mahesh Salgaonkar 提交于
      Currently machine check handler does not check for stack overflow for
      nested machine check. If we hit another MCE while inside the machine check
      handler repeatedly from same address then we get into risk of stack
      overflow which can cause huge memory corruption. This patch limits the
      nested MCE level to 4 and panic when we cross level 4.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      e75ad93a
    • M
      powerpc/book3s: Fix machine check handling for unhandled errors · 2749a2f2
      Mahesh Salgaonkar 提交于
      Current code does not check for unhandled/unrecovered errors and return from
      interrupt if it is recoverable exception which in-turn triggers same machine
      check exception in a loop causing hypervisor to be unresponsive.
      
      This patch fixes this situation and forces hypervisor to panic for
      unhandled/unrecovered errors.
      
      This patch also fixes another issue where unrecoverable_exception routine
      was called in real mode in case of unrecoverable exception (MSR_RI = 0).
      This causes another exception vector 0x300 (data access) during system crash
      leading to confusion while debugging cause of the system crash.
      
      Also turn ME bit off while going down, so that when another MCE is hit during
      panic path, system will checkstop and hypervisor will get restarted cleanly
      by SP.
      
      With the above fixes we now throw correct console messages (see below) while
      crashing the system in case of unhandled/unrecoverable machine checks.
      
      --------------
      Severe Machine check interrupt [[Not recovered]
        Initiator: CPU
        Error type: UE [Instruction fetch]
          Effective address: 0000000030002864
      Oops: Machine check, sig: 7 [#1]
      SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
      CPU: 36 PID: 55162 Comm: bash Tainted: G           O 3.14.0mce #1
      task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
      NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
      REGS: c000000007ec3d80 TRAP: 0200   Tainted: G           O  (3.14.0mce)
      MSR: 9000000000041002 <SF,HV,ME,RI>  CR: 28222848  XER: 20000000
      CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
      GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
      GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
      GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
      GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
      GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
      GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
      GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
      GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
      NIP [0000000030002864] 0x30002864
      LR [00000000300151a4] 0x300151a4
      Call Trace:
      Instruction dump:
      XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
      XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
      ---[ end trace 7285f0beac1e29d3 ]---
      
      Sending IPI to other CPUs
      IPI complete
      OPAL V3 detected !
      --------------
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      2749a2f2
  15. 23 4月, 2014 4 次提交
  16. 09 4月, 2014 1 次提交
  17. 24 3月, 2014 1 次提交
  18. 05 3月, 2014 2 次提交
  19. 30 12月, 2013 1 次提交
    • M
      powerpc: Fix "attempt to move .org backwards" error · 4e243b79
      Mahesh Salgaonkar 提交于
      With recent machine check patch series changes, The exception vectors
      starting from 0x4300 are now overflowing with allyesconfig. Fix that by
      moving machine_check_common and machine_check_handle_early code out of
      that region to make enough room for exception vector area.
      
      Fixes this build error reportes by Stephen:
      
      arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
      arch/powerpc/kernel/exceptions-64s.S:958: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:959: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:983: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:984: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1003: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1013: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1014: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1015: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1016: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1017: Error: attempt to move .org backwards
      arch/powerpc/kernel/exceptions-64s.S:1018: Error: attempt to move .org backwards
      
      [Moved the code further down as it introduced link errors due to too long
       relative branches to the masked interrupts handlers from the exception
       prologs. Also removed the useless feature section --BenH
      ]
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Tested-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      4e243b79
  20. 05 12月, 2013 3 次提交
    • M
      powerpc/book3s: Queue up and process delayed MCE events. · b5ff4211
      Mahesh Salgaonkar 提交于
      When machine check real mode handler can not continue into host kernel
      in V mode, it returns from the interrupt and we loose MCE event which
      never gets logged. In such a situation queue up the MCE event so that
      we can log it later when we get back into host kernel with r1 pointing to
      kernel stack e.g. during syscall exit.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b5ff4211
    • M
      powerpc/book3s: Return from interrupt if coming from evil context. · 1c51089f
      Mahesh Salgaonkar 提交于
      We can get machine checks from any context. We need to make sure that
      we handle all of them correctly. If we are coming from hypervisor user-space,
      we can continue in host kernel in virtual mode to deliver the MC event.
      If we got woken up from power-saving mode then we may come in with one of
      the following state:
       a. No state loss
       b. Supervisor state loss
       c. Hypervisor state loss
      For (a) and (b), we go back to nap again. State (c) is fatal, keep spinning.
      
      For all other context which we not sure of queue up the MCE event and return
      from the interrupt.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1c51089f
    • M
      powerpc/book3s: handle machine check in Linux host. · 1e9b4507
      Mahesh Salgaonkar 提交于
      Move machine check entry point into Linux. So far we were dependent on
      firmware to decode MCE error details and handover the high level info to OS.
      
      This patch introduces early machine check routine that saves the MCE
      information (srr1, srr0, dar and dsisr) to the emergency stack. We allocate
      stack frame on emergency stack and set the r1 accordingly. This allows us to be
      prepared to take another exception without loosing context. One thing to note
      here that, if we get another machine check while ME bit is off then we risk a
      checkstop. Hence we restrict ourselves to save only MCE information and
      register saved on PACA_EXMC save are before we turn the ME bit on. We use
      paca->in_mce flag to differentiate between first entry and nested machine check
      entry which helps proper use of emergency stack. We increment paca->in_mce
      every time we enter in early machine check handler and decrement it while
      leaving. When we enter machine check early handler first time (paca->in_mce ==
      0), we are sure nobody is using MC emergency stack and allocate a stack frame
      at the start of the emergency stack. During subsequent entry (paca->in_mce >
      0), we know that r1 points inside emergency stack and we allocate separate
      stack frame accordingly. This prevents us from clobbering MCE information
      during nested machine checks.
      
      The early machine check handler changes are placed under CPU_FTR_HVMODE
      section. This makes sure that the early machine check handler will get executed
      only in hypervisor kernel.
      
      This is the code flow:
      
      		Machine Check Interrupt
      			|
      			V
      		   0x200 vector				  ME=0, IR=0, DR=0
      			|
      			V
      	+-----------------------------------------------+
      	|machine_check_pSeries_early:			| ME=0, IR=0, DR=0
      	|	Alloc frame on emergency stack		|
      	|	Save srr1, srr0, dar and dsisr on stack |
      	+-----------------------------------------------+
      			|
      		(ME=1, IR=0, DR=0, RFID)
      			|
      			V
      		machine_check_handle_early		  ME=1, IR=0, DR=0
      			|
      			V
      	+-----------------------------------------------+
      	|	machine_check_early (r3=pt_regs)	| ME=1, IR=0, DR=0
      	|	Things to do: (in next patches)		|
      	|		Flush SLB for SLB errors	|
      	|		Flush TLB for TLB errors	|
      	|		Decode and save MCE info	|
      	+-----------------------------------------------+
      			|
      	(Fall through existing exception handler routine.)
      			|
      			V
      		machine_check_pSerie			  ME=1, IR=0, DR=0
      			|
      		(ME=1, IR=1, DR=1, RFID)
      			|
      			V
      		machine_check_common			  ME=1, IR=1, DR=1
      			.
      			.
      			.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1e9b4507
  21. 17 10月, 2013 3 次提交
  22. 27 8月, 2013 2 次提交
  23. 14 8月, 2013 1 次提交
    • P
      powerpc: Fix denormalized exception handler · 630573c1
      Paul Mackerras 提交于
      The denormalized exception handler (denorm_exception_hv) has a couple
      of bugs.  If the CONFIG_PPC_DENORMALISATION option is not selected,
      or the HSRR1_DENORM bit is not set in HSRR1, we don't test whether the
      interrupt occurred within a KVM guest.  On the other hand, if the
      HSRR1_DENORM bit is set and CONFIG_PPC_DENORMALISATION is enabled,
      we corrupt the CFAR and PPR.
      
      To correct these problems, this replaces the open-coded version of
      EXCEPTION_PROLOG_1 that is there currently, and that is missing the
      saving of PPR and CFAR values to the PACA, with an instance of
      EXCEPTION_PROLOG_1.  This adds an explicit KVMTEST after testing
      whether the exception is one we can handle, and adds code to restore
      the CFAR on exit.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      630573c1
  24. 09 8月, 2013 1 次提交
  25. 01 7月, 2013 2 次提交