1. 10 3月, 2017 1 次提交
  2. 30 1月, 2017 1 次提交
    • A
      powerpc/powernv: Initialise nest mmu · 1d0761d2
      Alistair Popple 提交于
      POWER9 contains an off core mmu called the nest mmu (NMMU). This is
      used by other hardware units on the chip to translate virtual
      addresses into real addresses. The unit attempting an address
      translation provides the majority of the context required for the
      translation request except for the base address of the partition table
      (ie. the PTCR) which needs to be programmed into the NMMU.
      
      This patch adds a call to OPAL to set the PTCR for the nest mmu in
      opal_init().
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1d0761d2
  3. 23 11月, 2016 1 次提交
  4. 14 11月, 2016 1 次提交
  5. 09 8月, 2016 1 次提交
  6. 21 7月, 2016 1 次提交
  7. 08 7月, 2016 1 次提交
  8. 29 6月, 2016 1 次提交
  9. 09 2月, 2016 1 次提交
  10. 27 12月, 2015 2 次提交
    • A
      powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery() · dc3799bb
      Andrew Donnellan 提交于
      Fix off-by-one error in opal_mce_check_early_recovery() when checking
      whether the NIP falls within OPAL space.
      Signed-off-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      dc3799bb
    • R
      powerpc/powernv: Add a kmsg_dumper that flushes console output on panic · affddff6
      Russell Currey 提交于
      On BMC machines, console output is controlled by the OPAL firmware and is
      only flushed when its pollers are called.  When the kernel is in a panic
      state, it no longer calls these pollers and thus console output does not
      completely flush, causing some output from the panic to be lost.
      
      Output is only actually lost when the kernel is configured to not power off
      or reboot after panic (i.e. CONFIG_PANIC_TIMEOUT is set to 0) since OPAL
      flushes the console buffer as part of its power down routines.  Before this
      patch, however, only partial output would be printed during the timeout wait.
      
      This patch adds a new kmsg_dumper which gets called at panic time to ensure
      panic output is not lost.  It accomplishes this by calling OPAL_CONSOLE_FLUSH
      in the OPAL API, and if that is not available, the pollers are called enough
      times to (hopefully) completely flush the buffer.
      
      The flushing mechanism will only affect output printed at and before the
      kmsg_dump call in kernel/panic.c:panic().  As such, the "end Kernel panic"
      message may still be truncated as follows:
      
      >Call Trace:
      >[c000000f1f603b00] [c0000000008e9458] dump_stack+0x90/0xbc (unreliable)
      >[c000000f1f603b30] [c0000000008e7e78] panic+0xf8/0x2c4
      >[c000000f1f603bc0] [c000000000be4860] mount_block_root+0x288/0x33c
      >[c000000f1f603c80] [c000000000be4d14] prepare_namespace+0x1f4/0x254
      >[c000000f1f603d00] [c000000000be43e8] kernel_init_freeable+0x318/0x350
      >[c000000f1f603dc0] [c00000000000bd74] kernel_init+0x24/0x130
      >[c000000f1f603e30] [c0000000000095b0] ret_from_kernel_thread+0x5c/0xac
      >---[ end Kernel panic - not
      
      This functionality is implemented as a kmsg_dumper as it seems to be the
      most sensible way to introduce platform-specific functionality to the
      panic function.
      Signed-off-by: NRussell Currey <ruscur@russell.cc>
      Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      affddff6
  11. 17 12月, 2015 4 次提交
  12. 09 10月, 2015 1 次提交
    • D
      powerpc/powernv: Panic on unhandled Machine Check · f2dd80ec
      Daniel Axtens 提交于
      All unrecovered machine check errors on PowerNV should cause an
      immediate panic. There are 2 reasons that this is the right policy:
      it's not safe to continue, and we're already trying to reboot.
      
      Firstly, if we go through the recovery process and do not successfully
      recover, we can't be sure about the state of the machine, and it is
      not safe to recover and proceed.
      
      Linux knows about the following sources of Machine Check Errors:
      - Uncorrectable Errors (UE)
      - Effective - Real Address Translation (ERAT)
      - Segment Lookaside Buffer (SLB)
      - Translation Lookaside Buffer (TLB)
      - Unknown/Unrecognised
      
      In the SLB, TLB and ERAT cases, we can further categorise these as
      parity errors, multihit errors or unknown/unrecognised.
      
      We can handle SLB errors by flushing and reloading the SLB. We can
      handle TLB and ERAT multihit errors by flushing the TLB. (It appears
      we may not handle TLB and ERAT parity errors: I will investigate
      further and send a followup patch if appropriate.)
      
      This leaves us with uncorrectable errors. Uncorrectable errors are
      usually the result of ECC memory detecting an error that it cannot
      correct, but they also crop up in the context of PCI cards failing
      during DMA writes, and during CAPI error events.
      
      There are several types of UE, and there are 3 places a UE can occur:
      Skiboot, the kernel, and userspace. For Skiboot errors, we have the
      facility to make some recoverable. For userspace, we can simply kill
      (SIGBUS) the affected process. We have no meaningful way to deal with
      UEs in kernel space or in unrecoverable sections of Skiboot.
      
      Currently, these unrecovered UEs fall through to
      machine_check_expection() in traps.c, which calls die(), which OOPSes
      and sends SIGBUS to the process. This sometimes allows us to stumble
      onwards. For example we've seen UEs kill the kernel eehd and
      khugepaged. However, the process killed could have held a lock, or it
      could have been a more important process, etc: we can no longer make
      any assertions about the state of the machine. Similarly if we see a
      UE in skiboot (and again we've seen this happen), we're not in a
      position where we can make any assertions about the state of the
      machine.
      
      Likewise, for unknown or unrecognised errors, we're not able to say
      anything about the state of the machine.
      
      Therefore, if we have an unrecovered MCE, the most appropriate thing
      to do is to panic.
      
      The second reason is that since e784b649 ("powerpc/powernv: Invoke
      opal_cec_reboot2() on unrecoverable machine check errors."), we
      attempt a special OPAL reboot on an unhandled MCE. This is so the
      hardware can record error data for later debugging.
      
      The comments in that commit assert that we are heading down the panic
      path anyway. At the moment this is not always true. With UEs in kernel
      space, for instance, they are marked as recoverable by the hardware,
      so if the attempt to reboot failed (e.g. old Skiboot), we wouldn't
      panic() but would simply die() and OOPS. It doesn't make sense to be
      staggering on if we've just tried to reboot: we should panic().
      
      Explicitly panic() on unrecovered MCEs on PowerNV.
      Update the comments appropriately.
      
      This fixes some hangs following EEH events on cxlflash setups.
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Reviewed-by: NIan Munsie <imunsie@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f2dd80ec
  13. 20 8月, 2015 1 次提交
  14. 06 8月, 2015 1 次提交
  15. 18 6月, 2015 1 次提交
  16. 05 6月, 2015 3 次提交
  17. 04 6月, 2015 1 次提交
  18. 22 5月, 2015 4 次提交
  19. 11 4月, 2015 1 次提交
  20. 31 3月, 2015 1 次提交
  21. 25 3月, 2015 3 次提交
  22. 16 3月, 2015 2 次提交
  23. 28 1月, 2015 3 次提交
  24. 27 1月, 2015 1 次提交
  25. 15 12月, 2014 1 次提交
  26. 14 12月, 2014 1 次提交