1. 28 Jul, 2014 30 commits
  2. 06 Jul, 2014 1 commit
  3. 12 Jun, 2014 2 commits
  4. 11 Jun, 2014 7 commits
    • M
      powerpc/book3s: Fix guest MC delivery mechanism to avoid soft lockups in guest. · 74845bc2
      Mahesh Salgaonkar committed
      Currently we forward MCEs to guest which have been recovered by guest.
      And for unhandled errors we do not deliver the MCE to guest. It looks like
      with no support of FWNMI in qemu, guest just panics whenever we deliver the
      recovered MCEs to guest. Also, the existing code used to return to host for
      unhandled errors, which was causing guest to hang with soft lockups inside
      guest and made it difficult to recover the guest instance.
      
      This patch now forwards all fatal MCEs to guest causing guest to crash/panic.
      And, for recovered errors we just go back to normal functioning of guest
      instead of returning to host. This fixes soft lockup issues in guest.
      This patch also fixes an issue where guest MCE events were not logged to
      host console.
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      74845bc2
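The delivery rule described in this commit message can be sketched as a small decision function. This is only an illustration of the stated policy (fatal errors go to the guest, recovered errors resume the guest); the enum and function names are hypothetical and do not come from the kernel source.

```c
#include <stdbool.h>

/* Hypothetical names, used only to illustrate the policy in the message. */
enum guest_mce_action {
    RESUME_GUEST,      /* recovered error: continue running the guest */
    DELIVER_TO_GUEST   /* fatal error: forward the MCE so the guest crashes/panics */
};

/* Decide what to do with a machine check taken while a guest was running. */
enum guest_mce_action guest_mce_decide(bool recovered)
{
    return recovered ? RESUME_GUEST : DELIVER_TO_GUEST;
}
```

In neither case does the sketch return to the host, which is the behavior the message identifies as the source of the soft lockups.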
    • M
      powerpc/book3s: Increment the mce counter during machine_check_early call. · e6654d5b
      Mahesh Salgaonkar committed
      We don't see the MCE counter increasing in /proc/interrupts, which gives
      the false impression that no MCE occurred even when there were MCE events.
      The machine check early handling was added for PowerKVM and we missed
      incrementing the MCE count in the early handler.
      
      We also increment MCE counters in the machine_check_exception call, but
      in most cases where we handle the error the hypervisor never reaches there
      unless it's fatal and we want to crash. Only in a fatal situation may we
      see a double increment of the MCE count. We need to fix that. But for
      now it is always good to have some count increased instead of zero.
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      e6654d5b
    • M
      powerpc/book3s: Add stack overflow check in machine check handler. · e75ad93a
      Mahesh Salgaonkar committed
      Currently the machine check handler does not check for stack overflow on
      nested machine checks. If we repeatedly hit another MCE from the same
      address while inside the machine check handler, we risk a stack
      overflow, which can cause huge memory corruption. This patch limits the
      nested MCE level to 4 and panics when we cross level 4.
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      e75ad93a
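The nesting limit can be illustrated with a simple depth counter. This is a minimal sketch in portable C, not the actual handler; the struct, the function names, and the idea of returning a flag instead of panicking are all assumptions made for illustration. Only the limit of 4 comes from the commit message.

```c
#include <stdbool.h>

#define MAX_MCE_DEPTH 4  /* nesting limit named in the commit message */

/* Hypothetical per-CPU state; the real handler tracks nesting elsewhere. */
struct cpu_mce_state {
    int depth;  /* number of machine check handlers currently active */
};

/*
 * Called on entry to the handler. Returns true if handling may proceed,
 * false if the nesting limit was crossed and the caller should panic.
 */
bool mce_enter(struct cpu_mce_state *s)
{
    s->depth++;
    return s->depth <= MAX_MCE_DEPTH;
}

/* Called on exit from the handler. */
void mce_exit(struct cpu_mce_state *s)
{
    if (s->depth > 0)
        s->depth--;
}
```

Bounding the depth bounds the stack space that nested handlers can consume, which is what prevents the overflow described above.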
    • M
      powerpc/book3s: Fix machine check handling for unhandled errors · 2749a2f2
      Mahesh Salgaonkar committed
      Current code does not check for unhandled/unrecovered errors and returns from
      interrupt if it is a recoverable exception, which in turn triggers the same
      machine check exception in a loop, causing the hypervisor to be unresponsive.
      
      This patch fixes this situation and forces hypervisor to panic for
      unhandled/unrecovered errors.
      
      This patch also fixes another issue where unrecoverable_exception routine
      was called in real mode in case of unrecoverable exception (MSR_RI = 0).
      This causes another exception vector 0x300 (data access) during system crash
      leading to confusion while debugging cause of the system crash.
      
      Also turn ME bit off while going down, so that when another MCE is hit during
      panic path, system will checkstop and hypervisor will get restarted cleanly
      by SP.
      
      With the above fixes we now throw correct console messages (see below) while
      crashing the system in case of unhandled/unrecoverable machine checks.
      
      --------------
      Severe Machine check interrupt [[Not recovered]
        Initiator: CPU
        Error type: UE [Instruction fetch]
          Effective address: 0000000030002864
      Oops: Machine check, sig: 7 [#1]
      SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
      CPU: 36 PID: 55162 Comm: bash Tainted: G           O 3.14.0mce #1
      task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
      NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
      REGS: c000000007ec3d80 TRAP: 0200   Tainted: G           O  (3.14.0mce)
      MSR: 9000000000041002 <SF,HV,ME,RI>  CR: 28222848  XER: 20000000
      CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
      GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
      GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
      GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
      GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
      GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
      GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
      GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
      GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
      NIP [0000000030002864] 0x30002864
      LR [00000000300151a4] 0x300151a4
      Call Trace:
      Instruction dump:
      XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
      XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
      ---[ end trace 7285f0beac1e29d3 ]---
      
      Sending IPI to other CPUs
      IPI complete
      OPAL V3 detected !
      --------------
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      2749a2f2
    • G
      powerpc/eeh: Dump PE location code · 357b2f3d
      Gavin Shan committed
      As Ben suggested, it's meaningful to dump the PE's location code
      for site engineers when hitting EEH errors. The patch introduces
      function eeh_pe_loc_get() to retrieve the location code from
      dev-tree so that we can output it when hitting EEH errors.
      
      If the primary PE bus is the root bus, the PHB's dev-node is tried
      prior to the root port's dev-node. Otherwise, the upstream bridge's
      dev-node of the primary PE bus is checked for the location code
      directly.
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      357b2f3d
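The lookup order described in this message can be sketched as follows. This is a standalone illustration, not the body of eeh_pe_loc_get(): the structs, the pe_loc_lookup() name, and the "N/A" fallback string are assumptions introduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node; loc_code is NULL when absent. */
struct dev_node {
    const char *loc_code;
};

/* Hypothetical view of the primary PE bus and the nodes around it. */
struct pe_bus {
    bool is_root_bus;
    const struct dev_node *phb_node;
    const struct dev_node *root_port_node;
    const struct dev_node *upstream_bridge_node;
};

static const char *node_loc(const struct dev_node *n)
{
    return n ? n->loc_code : NULL;
}

/* Root bus: PHB node first, then root port. Otherwise: upstream bridge. */
const char *pe_loc_lookup(const struct pe_bus *bus)
{
    const char *loc;

    if (bus->is_root_bus) {
        loc = node_loc(bus->phb_node);
        if (!loc)
            loc = node_loc(bus->root_port_node);
    } else {
        loc = node_loc(bus->upstream_bridge_node);
    }
    return loc ? loc : "N/A";  /* assumed fallback when no code is found */
}
```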
    • M
      powerpc/powernv: Enable POWER8 doorbell IPIs · d4e58e59
      Michael Neuling committed
      This patch enables POWER8 doorbell IPIs on powernv.
      
      Since doorbells can only IPI within a core, we test to see when we can use
      doorbells and if not we fall back to XICS.  This also enables hypervisor
      doorbells to wake us up from nap/sleep via the LPCR PECEDH bit.
      
      Based on tests by Anton, the best case IPI latency between two threads dropped
      from 894ns to 512ns.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      d4e58e59
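The doorbell-or-XICS choice can be sketched as a same-core test. This is an illustrative sketch only: it assumes CPU numbers are assigned contiguously with 8 threads per core, and the function and enum names are hypothetical rather than taken from the patch.

```c
#include <stdbool.h>

/* Assumption: 8 hardware threads per core, numbered contiguously. */
#define THREADS_PER_CORE 8

enum ipi_method { IPI_DOORBELL, IPI_XICS };

/* Two CPUs are thread siblings when they map to the same core index. */
bool same_core(int cpu_a, int cpu_b)
{
    return cpu_a / THREADS_PER_CORE == cpu_b / THREADS_PER_CORE;
}

/* Doorbells only reach threads within a core; otherwise fall back to XICS. */
enum ipi_method pick_ipi_method(int src_cpu, int dst_cpu)
{
    return same_core(src_cpu, dst_cpu) ? IPI_DOORBELL : IPI_XICS;
}
```

Under these assumptions, CPUs 0 and 7 would use a doorbell, while CPUs 7 and 8 sit on different cores and would go through XICS.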
    • G
      powerpc/powernv: Fix killed EEH event · 5c7a35e3
      Gavin Shan committed
      On the PowerNV platform, EEH errors are reported by IO accessors or by a
      poller driven by interrupt. After the PE is isolated, we won't produce an
      EEH event for the PE. The current implementation can lose an EEH event
      in this way:
      
      The interrupt handler queues one "special" event, which drives the poller.
      The EEH thread doesn't pick up the special event yet. IO accessors kick in, the
      frozen PE is marked as "isolated" and an EEH event is queued to the list.
      The EEH thread runs because of the special event and purges all existing EEH
      events. However, we never produce another EEH event for the frozen PE.
      Eventually, the PE is marked as "isolated" and we don't have an EEH event to
      recover it.
      
      The patch fixes the issue by keeping EEH events for PEs that have been
      marked as "isolated", with the help of an additional "force" argument to
      eeh_remove_event().
      Reported-by: Rolf Brudeseth <rolfb@us.ibm.com>
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      5c7a35e3
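The effect of the "force" flag can be sketched with a small array-backed queue. This is not the kernel's eeh_remove_event(); the types, the purge_events() name, and the reading that force means "remove even events for isolated PEs" are assumptions made to illustrate the fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a PE and a queued EEH event. */
struct pe {
    bool isolated;
};

struct event {
    struct pe *pe;  /* NULL models the "special" event with no PE attached */
    bool queued;
};

/*
 * Purge queued events. Without force, events whose PE is already marked
 * isolated are kept so that recovery for that PE is not lost.
 * Returns the number of events removed.
 */
size_t purge_events(struct event *events, size_t n, bool force)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        struct event *ev = &events[i];

        if (!ev->queued)
            continue;
        if (!force && ev->pe && ev->pe->isolated)
            continue;  /* keep: this PE still needs its event */
        ev->queued = false;
        removed++;
    }
    return removed;
}
```

With this rule, the purge triggered by the special event no longer discards the event queued by the IO accessor for the frozen PE.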