    ACPI: APEI: Kick the memory_failure() queue for synchronous errors · 4d6a8607
    Committed by James Morse
    fix #28612342
    
    commit 7f17b4a121d0d50eca22cb1edebf0a157f3e43bf upstream
    
    memory_failure() offlines or repairs pages of memory that have been
    discovered to be corrupt. These may be detected by an external
    component (e.g. the memory controller) and notified via an IRQ.
    In this case the work is queued, as not all of memory_failure()'s work
    can happen in IRQ context.
    
    If the error was detected as a result of user-space accessing a
    corrupt memory location, the CPU may take an abort instead. On arm64
    this is a 'synchronous external abort', and on a firmware-first
    system it is replayed using NOTIFY_SEA.
    
    This notification has NMI-like properties (it can interrupt
    IRQ-masked code), so the memory_failure() work is queued. If we
    return to user-space before the queued memory_failure() work is
    processed, we will take the fault again. This loop may cause platform
    firmware to exceed some threshold and reboot when Linux could have
    recovered from this error.
    
    For NMI-like notifications, keep track of whether memory_failure() work
    was queued, and make task_work pending to flush out the queue.
    To save memory allocations, the task_work is allocated as part of
    the ghes_estatus_node, and free()ing it back to the pool is deferred.
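
The flow described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code from this commit: every name in it (queue_memory_failure, notify_sea, return_to_user, flush_on_return_to_user, and so on) is hypothetical, and a plain boolean stands in for the pending task_work. It shows the NMI-like handler recording that work was queued, and the return-to-user path draining the queue before the faulting access can be retried.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define QUEUE_MAX 8

/* Hypothetical stand-in for the deferred memory_failure() work queue. */
static unsigned long pending_pfn[QUEUE_MAX];
static size_t pending_count;
static size_t handled_count;

/* Hypothetical stand-in for a task_work callback pending on the task. */
static bool flush_on_return_to_user;

/* Queue a corrupt page for later handling; returns true if it was queued. */
static bool queue_memory_failure(unsigned long pfn)
{
    if (pending_count == QUEUE_MAX)
        return false;
    pending_pfn[pending_count++] = pfn;
    return true;
}

/* Process every queued page; in this model "handling" is just a log line. */
static void flush_memory_failure_queue(void)
{
    for (size_t i = 0; i < pending_count; i++) {
        printf("handling pfn %#lx\n", pending_pfn[i]);
        handled_count++;
    }
    pending_count = 0;
}

/*
 * Model of an NMI-like notification: the work cannot run here, so it is
 * queued, and we remember that a flush is needed before user-space resumes.
 */
static void notify_sea(unsigned long pfn)
{
    if (queue_memory_failure(pfn))
        flush_on_return_to_user = true;
}

/*
 * Model of the return-to-user path: drain the queue first, so the task does
 * not immediately touch the same corrupt location and fault again.
 */
static void return_to_user(void)
{
    if (flush_on_return_to_user) {
        flush_on_return_to_user = false;
        flush_memory_failure_queue();
    }
}
```

In this model, calling notify_sea() followed by return_to_user() leaves the queue empty before the task resumes, which is the property the commit relies on to break the fault loop.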

    Signed-off-by: James Morse <james.morse@arm.com>
    Tested-by: Tyler Baicar <baicar@os.amperecomputing.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
    Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
ghes.h 3.0 KB