    mm/hwpoison: fix error page recovered but reported "not recovered" · 046545a6
    Committed by Naoya Horiguchi
    When an uncorrected memory error is consumed there is a race between the
    CMCI from the memory controller reporting an uncorrected error with a
    UCNA signature, and the core reporting an SRAR signature machine check
    when the data is about to be consumed.
    
    If the CMCI wins that race, the page is marked poisoned when
    uc_decode_notifier() calls memory_failure() and the machine check
    processing code finds the page already poisoned.  It calls
    kill_accessing_process() to make sure a SIGBUS is sent, but that
    function returns the wrong error code.
    
    Console log looks like this:
    
      mce: Uncorrected hardware memory error in user-access at 3710b3400
      Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
      Memory failure: 0x3710b3: already hardware poisoned
      Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
      mce: Memory error not recovered
    
    kill_accessing_process() is supposed to return -EHWPOISON to tell the
    caller that SIGBUS has already been sent to the process, so that
    kill_me_maybe() doesn't have to send it again.  The current code fails
    to do this, so fix it to work as intended.  This change avoids the
    spurious "Memory error not recovered" message and skips duplicate
    SIGBUSs.
    
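    The intended return-code contract can be illustrated with a small
    userspace model.  This is only a sketch, not the kernel code: the
    model_* functions, their parameters, and the sigbus_count counter are
    hypothetical stand-ins, and the fallback value for EHWPOISON is an
    assumption for systems whose <errno.h> does not define it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#ifndef EHWPOISON
#define EHWPOISON 133 /* assumed fallback; normally provided by <errno.h> on Linux */
#endif

/*
 * Hypothetical stand-in for kill_accessing_process(): if the poisoned
 * address is found in the task's mappings, count one SIGBUS as sent and
 * return -EHWPOISON; otherwise return -EFAULT.
 */
static int model_kill_accessing_process(bool mapping_found, int *sigbus_count)
{
	assert(sigbus_count != NULL);
	if (!mapping_found)
		return -EFAULT;
	(*sigbus_count)++;	/* SIGBUS delivered here */
	return -EHWPOISON;	/* tell the caller not to send it again */
}

/*
 * Hypothetical stand-in for kill_me_maybe(): returns true when no further
 * action is needed, false when it had to log the failure and send SIGBUS
 * itself.
 */
static bool model_kill_me_maybe(int ret, int *sigbus_count)
{
	assert(sigbus_count != NULL);
	if (ret == 0 || ret == -EHWPOISON)
		return true;	/* recovered, or SIGBUS already sent */
	puts("mce: Memory error not recovered");
	(*sigbus_count)++;	/* fall back to sending SIGBUS here */
	return false;
}
```

    In this model, a return of -EHWPOISON leaves exactly one SIGBUS counted
    and prints nothing, while -EFAULT makes the caller log the failure and
    send the signal itself.
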
    [tony.luck@intel.com: reword some parts of commit message]
    
    Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev
    Fixes: a3f5d80e ("mm,hwpoison: send SIGBUS with error virutal address")
    Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Reported-by: Youquan Song <youquan.song@intel.com>
    Cc: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory-failure.c 60.4 KB