1. 17 Jun, 2021 1 commit
    • ACPI: APEI: fix synchronous external aborts in user-mode · ccb5ecdc
      Committed by Xiaofei Tan
      Before commit 8fcc4ae6 ("arm64: acpi: Make apei_claim_sea()
      synchronise with APEI's irq work"), do_sea() would unconditionally
      signal the affected task from the arch code. Since that change,
      the GHES driver sends the signals.
      
      This exposes a problem: errors that the GHES driver doesn't understand
      or doesn't handle effectively are silently ignored. The errors are then
      taken again and circulate endlessly, leaving the user-space task stuck
      in this loop.
      
      Existing firmware on Kunpeng9xx systems reports cache errors with the
      'ARM Processor Error' CPER records.
      
      Do memory failure handling for ARM Processor Error Section just like
      for Memory Error Section.
      
      Fixes: 8fcc4ae6 ("arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work")
      Signed-off-by: Xiaofei Tan <tanxiaofei@huawei.com>
      Reviewed-by: James Morse <james.morse@arm.com>
      [ rjw: Subject edit ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  2. 18 Oct, 2020 1 commit
    • task_work: cleanup notification modes · 91989c70
      Committed by Jens Axboe
      A previous commit changed the notification mode from true/false to an
      int, allowing notify-no, notify-yes, or signal-notify. This was
      backwards compatible in the sense that any existing true/false user
      would translate to either 0 (no notification sent) or 1, the latter
      of which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.
      
      Clean this up properly, and define a proper enum for the notification
      mode. Now we have:
      
      - TWA_NONE. This is 0, same as before the original change, meaning no
        notification requested.
      - TWA_RESUME. This is 1, same as before the original change, meaning
        that we use TIF_NOTIFY_RESUME.
      - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
        notification.
      
      Clean up all the callers, switching their 0/1/false/true to using the
      appropriate TWA_* mode for notifications.
      
      Fixes: e91b4816 ("task_work: teach task_work_add() to do signal_wake_up()")
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
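
The three modes listed in this commit message can be sketched as a small user-space C enum. The enum and its constant names follow the message; the two helper functions (`twa_from_legacy_bool` and `twa_mode_name`) are hypothetical and exist only for illustration, they are not part of the kernel API.

```c
#include <stdbool.h>

/* Notification modes as named in the commit message. */
enum task_work_notify_mode {
	TWA_NONE,	/* 0: no notification requested */
	TWA_RESUME,	/* 1: notify via TIF_NOTIFY_RESUME */
	TWA_SIGNAL,	/* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
};

/*
 * Hypothetical helper: shows how a legacy true/false argument
 * lines up with the enum (false -> TWA_NONE, true -> TWA_RESUME).
 */
static enum task_work_notify_mode twa_from_legacy_bool(bool notify)
{
	return notify ? TWA_RESUME : TWA_NONE;
}

/* Hypothetical helper: printable name for a mode. */
static const char *twa_mode_name(enum task_work_notify_mode mode)
{
	switch (mode) {
	case TWA_NONE:
		return "TWA_NONE";
	case TWA_RESUME:
		return "TWA_RESUME";
	case TWA_SIGNAL:
		return "TWA_SIGNAL";
	}
	return "unknown";
}
```

Because the enum values match the old 0/1/2 integers, a caller that passed `true` before the cleanup behaves the same after switching to `TWA_RESUME`.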
  3. 16 Sep, 2020 1 commit
  4. 03 Jun, 2020 1 commit
  5. 20 May, 2020 1 commit
    • ACPI: APEI: Kick the memory_failure() queue for synchronous errors · 7f17b4a1
      Committed by James Morse
      memory_failure() offlines or repairs pages of memory that have been
      discovered to be corrupt. These may be detected by an external
      component, (e.g. the memory controller), and notified via an IRQ.
      In this case the work is queued, as not all of memory_failure()'s work
      can happen in IRQ context.
      
      If the error was detected as a result of user-space accessing a
      corrupt memory location the CPU may take an abort instead. On arm64
      this is a 'synchronous external abort', and on a firmware first
      system it is replayed using NOTIFY_SEA.
      
      This notification has NMI like properties, (it can interrupt
      IRQ-masked code), so the memory_failure() work is queued. If we
      return to user-space before the queued memory_failure() work is
      processed, we will take the fault again. This loop may cause platform
      firmware to exceed some threshold and reboot when Linux could have
      recovered from this error.
      
      For NMIlike notifications keep track of whether memory_failure() work
      was queued, and make task_work pending to flush out the queue.
      To save memory allocations, the task_work is allocated as part of
      the ghes_estatus_node, and free()ing it back to the pool is deferred.
      
      Signed-off-by: James Morse <james.morse@arm.com>
      Tested-by: Tyler Baicar <baicar@os.amperecomputing.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
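
The queue-then-flush idea in this commit message can be modelled in plain user-space C. This is a toy sketch under stated assumptions, not kernel code: `struct toy_task`, `toy_notify_sea()`, `toy_return_to_user()`, and `MF_QUEUE_LEN` are all hypothetical names, and the boolean flag merely stands in for a pending task_work entry.

```c
#include <stdbool.h>
#include <stddef.h>

#define MF_QUEUE_LEN 8	/* arbitrary queue size for this sketch */

/* Hypothetical per-task state for the model. */
struct toy_task {
	unsigned long pending_pfns[MF_QUEUE_LEN]; /* queued corrupt pages */
	size_t nr_pending;
	bool flush_work_pending;	/* stands in for task_work */
	size_t nr_handled;
};

/*
 * Models the NMI-like notification: it may only queue the work,
 * and it records that a flush must run before user space resumes.
 */
static bool toy_notify_sea(struct toy_task *t, unsigned long pfn)
{
	if (t->nr_pending == MF_QUEUE_LEN)
		return false;	/* queue full, nothing recorded */
	t->pending_pfns[t->nr_pending++] = pfn;
	t->flush_work_pending = true;
	return true;
}

/*
 * Models the return-to-user path: drain the queue first, so the
 * task does not touch the same corrupt location and fault again.
 */
static void toy_return_to_user(struct toy_task *t)
{
	if (!t->flush_work_pending)
		return;
	/* A real kernel would process each queued page here. */
	t->nr_handled += t->nr_pending;
	t->nr_pending = 0;
	t->flush_work_pending = false;
}
```

In this model, the fault loop described in the message corresponds to returning to user space while `nr_pending` is still non-zero; the flag guarantees the drain happens first.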
  6. 22 Mar, 2020 1 commit
  7. 13 Jan, 2020 1 commit
    • apei/ghes: Do not delay GHES polling · cea79e7e
      Committed by Bhaskar Upadhaya
      Currently, the ghes_poll_func() timer callback is registered with the
      TIMER_DEFERRABLE flag. Thus, it is run when the CPU eventually wakes
      up together with a subsequent non-deferrable timer and not at the precisely
      configured polling interval.
      
      For polling mode, the polling interval configured by firmware should not
      be exceeded according to the ACPI spec 6.3, Table 18-394. The definition
      of the polling interval is:
      
      "Indicates the poll interval in milliseconds OSPM should use to
       periodically check the error source for the presence of an error
       condition."
      
      If this interval is extended due to the timer callback deferring, error
      records can get lost, which we are observing on our ThunderX2 platforms.
      
      Therefore, remove the TIMER_DEFERRABLE flag so that the timer callback
      executes at the precise interval.
      
      Signed-off-by: Bhaskar Upadhaya <bupadhaya@marvell.com>
      [ bp: Subject & changelog ]
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
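
The effect of a deferrable timer described in this commit message can be illustrated with a one-function user-space model. This is a simplified sketch, not the kernel's timer implementation: `toy_fire_time()` and its parameters are hypothetical, and all times are in milliseconds.

```c
#include <stdbool.h>

/*
 * Hypothetical model of when a timer callback actually runs.
 *
 * expires:     configured expiry time (e.g. the polling interval)
 * deferrable:  whether the timer is allowed to wait for another wakeup
 * cpu_idle:    whether the CPU is idle when the timer expires
 * next_wakeup: expiry of the next non-deferrable timer on that CPU
 */
static unsigned long toy_fire_time(unsigned long expires, bool deferrable,
				   bool cpu_idle, unsigned long next_wakeup)
{
	/* A deferrable timer does not wake an idle CPU on its own. */
	if (deferrable && cpu_idle && next_wakeup > expires)
		return next_wakeup;

	/* Otherwise the callback runs at the configured time. */
	return expires;
}
```

With a 1000 ms polling interval on an idle CPU whose next non-deferrable timer is at 5000 ms, the model runs the deferrable callback at 5000 ms and the non-deferrable one at 1000 ms, which is the gap in which error records could be lost.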
  8. 18 Oct, 2019 1 commit
  9. 21 Aug, 2019 1 commit
  10. 05 Aug, 2019 1 commit
  11. 05 Jul, 2019 1 commit
  12. 31 May, 2019 1 commit
  13. 11 Feb, 2019 1 commit
  14. 08 Feb, 2019 19 commits
  15. 21 Dec, 2018 1 commit
  16. 12 May, 2018 1 commit
  17. 02 May, 2018 1 commit
    • ghes, EDAC: Fix ghes_edac registration · cc7f3f13
      Committed by Borislav Petkov
      Tony reported seeing
      
        "Internal error: Can't find EDAC structure"
      
      when injecting correctable errors due to the fact that ghes_edac would
      still load even if the whitelist won't hit. Drop the pr_err() in
      ghes_edac_report_mem_error() for now due to the hacky way how ghes_edac
      depends on ghes.c.
      
      While at it, make ghes_edac_register() return an error if it doesn't hit
      in the whitelist as it is the only sensible thing to do in that
      situation.
      
      Furthermore, move the call to it to happen last in ghes_probe() so that
      GHES initializing properly does not depend on ghes_edac init at all,
      as the latter only reports errors and is not required for GHES's proper
      functioning.
      
      Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
      Tested-by: Sughosh Ganu <sughosh.ganu@arm.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20180420182015.zao3olss4tvvlxki@agluck-desk
  18. 24 Jan, 2018 1 commit
  19. 05 Dec, 2017 3 commits
  20. 07 Nov, 2017 1 commit