1. 15 8月, 2015 1 次提交
  2. 07 8月, 2015 4 次提交
  3. 25 6月, 2015 9 次提交
    • X
      tracing: add trace event for memory-failure · 97f0b134
      Xie XiuQi 提交于
      RAS user space tools like rasdaemon which base on trace event, could
      receive mce error event, but no memory recovery result event.  So, I want
      to add this event to make this scenario complete.
      
      This patch add a event at ras group for memory-failure.
      
      The output like below:
      #  tracer: nop
      #
      #  entries-in-buffer/entries-written: 2/2   #P:24
      #
      #                               _-----=> irqs-off
      #                              / _----=> need-resched
      #                             | / _---=> hardirq/softirq
      #                             || / _--=> preempt-depth
      #                             ||| /     delay
      #            TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
      #               | |       |   ||||       |         |
             mce-inject-13150 [001] ....   277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed
      
      [xiexiuqi@huawei.com: fix build error]
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97f0b134
    • X
      memory-failure: change type of action_result's param 3 to enum · cc3e2af4
      Xie XiuQi 提交于
      Change type of action_result's param 3 to enum for type consistency,
      and rename mf_outcome to mf_result for clearly.
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc3e2af4
    • X
      memory-failure: export page_type and action result · cc637b17
      Xie XiuQi 提交于
      Export 'outcome' and 'action_page_type' to mm.h, so we could use
      this emnus outside.
      
      This patch is preparation for adding trace events for memory-failure
      recovery action.
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc637b17
    • N
      mm/memory-failure: me_huge_page() does nothing for thp · 2491ffee
      Naoya Horiguchi 提交于
      memory_failure() is supposed not to handle thp itself, but to split it.
      But if something were wrong and page_action() were called on thp,
      me_huge_page() (action routine for hugepages) should be better to take
      no action, rather than to take wrong action prepared for hugetlb (which
      triggers BUG_ON().)
      
      This change is for potential problems, but makes sense to me because thp
      is an actively developing feature and this code path can be open in the
      future.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2491ffee
    • N
      mm: soft-offline: don't free target page in successful page migration · add05cec
      Naoya Horiguchi 提交于
      Stress testing showed that soft offline events for a process iterating
      "mmap-pagefault-munmap" loop can trigger
      VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
      
        Soft offlining page 0x70fe1 at 0x70100008d000
        Soft offlining page 0x705fb at 0x70300008d000
        page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
        flags: 0x1fffff80800000(hwpoison)
        page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
        ------------[ cut here ]------------
        kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
        invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
        CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
        RIP: free_pcppages_bulk+0x52a/0x6f0
        Call Trace:
          drain_pages_zone+0x3d/0x50
          drain_local_pages+0x1d/0x30
          on_each_cpu_mask+0x46/0x80
          drain_all_pages+0x14b/0x1e0
          soft_offline_page+0x432/0x6e0
          SyS_madvise+0x73c/0x780
          system_call_fastpath+0x12/0x17
        Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
        RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
         RSP <ffff88007a117d28>
        ---[ end trace 53926436e76d1f35 ]---
      
      When soft offline successfully migrates page, the source page is supposed
      to be freed.  But there is a race condition where a source page looks
      isolated (i.e.  the refcount is 0 and the PageHWPoison is set) but
      somewhat linked to pcplist.  Then another soft offline event calls
      drain_all_pages() and tries to free such hwpoisoned page, which is
      forbidden.
      
      This odd page state seems to happen due to the race between put_page() in
      putback_lru_page() and __pagevec_lru_add_fn().  But I don't want to play
      with tweaking drain code as done in commit 9ab3b598 "mm: hwpoison:
      drop lru_add_drain_all() in __soft_offline_page()", or to change page
      freeing code for this soft offline's purpose.
      
      Instead, let's think about the difference between hard offline and soft
      offline.  There is an interesting difference in how to isolate the in-use
      page between these, that is, hard offline marks PageHWPoison of the target
      page at first, and doesn't free it by keeping its refcount 1.  OTOH, soft
      offline tries to free the target page then marks PageHWPoison.  This
      difference might be the source of complexity and result in bugs like the
      above.  So making soft offline isolate with keeping refcount can be a
      solution for this problem.
      
      We can pass to page migration code the "reason" which shows the caller, so
      let's use this more to avoid calling putback_lru_page() when called from
      soft offline, which effectively does the isolation for soft offline.  With
      this change, target pages of soft offline never be reused without changing
      migratetype, so this patch also removes the related code.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      add05cec
    • N
      mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · ead07f6a
      Naoya Horiguchi 提交于
      memory_failure() can run in 2 different mode (specified by
      MF_COUNT_INCREASED) in page refcount perspective.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
      takes a refcount of the target page.  And if cleared, memory_failure()
      takes it in it's own.
      
      In current code, however, refcounting is done differently in each caller.
      For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  So this inconsistent
      refcounting causes refcount failure especially for thp tail pages.
      Typical user visible effects are like memory leak or
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
      
      memory_failure() might fail to split thp and in such case it returns
      without completing page isolation.  This is not good because PageHWPoison
      on the thp is still set and there's no easy way to unpoison such thps.  So
      this patch try to roll back any action to the thp in "non anonymous thp"
      case and "thp split failed" case, expecting an MCE(SRAR) generated by
      later access afterward will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ead07f6a
    • N
      mm/memory-failure: split thp earlier in memory error handling · 415c64c1
      Naoya Horiguchi 提交于
      memory_failure() doesn't handle thp itself at this time and need to split
      it before doing isolation.  Currently thp is split in the middle of
      hwpoison_user_mappings(), but there're corner cases where memory_failure()
      wrongly tries to handle thp without splitting.
      
      1) "non anonymous" thp, which is not a normal operating mode of thp,
         but a memory error could hit a thp before anon_vma is initialized.  In
         such case, split_huge_page() fails and me_huge_page() (intended for
         hugetlb) is called for thp, which triggers BUG_ON in page_hstate().
      
      2) !PageLRU case, where hwpoison_user_mappings() returns with
         SWAP_SUCCESS and the result is the same as case 1.
      
      memory_failure() can't avoid splitting, so let's split it more earlier,
      which also reduces code which are prepared for both of normal page and
      thp.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      415c64c1
    • A
      mm, hwpoison: remove obsolete "Notebook" todo list · ebb09738
      Andi Kleen 提交于
      All the items mentioned here have been either addressed, or were not
      really needed.  So just remove the comment.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebb09738
    • A
      mm, hwpoison: add comment describing when to add new cases · e0de78df
      Andi Kleen 提交于
      Here's another comment fix for hwpoison.
      
      It describes the "guiding principle" on when to add new
      memory error recovery code.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0de78df
  4. 06 5月, 2015 2 次提交
  5. 16 4月, 2015 2 次提交
    • N
      mm: hugetlb: introduce page_huge_active · bcc54222
      Naoya Horiguchi 提交于
      We are not safe from calling isolate_huge_page() on a hugepage
      concurrently, which can make the victim hugepage in invalid state and
      results in BUG_ON().
      
      The root problem of this is that we don't have any information on struct
      page (so easily accessible) about hugepages' activeness.  Note that
      hugepages' activeness means just being linked to
      hstate->hugepage_activelist, which is not the same as normal pages'
      activeness represented by PageActive flag.
      
      Normal pages are isolated by isolate_lru_page() which prechecks PageLRU
      before isolation, so let's do similarly for hugetlb with a new
      paeg_huge_active().
      
      set/clear_page_huge_active() should be called within hugetlb_lock.  But
      hugetlb_cow() and hugetlb_no_page() don't do this, being justified because
      in these functions set_page_huge_active() is called right after the
      hugepage is allocated and no other thread tries to isolate it.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
      [fengguang.wu@intel.com: set_page_huge_active() can be static]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bcc54222
    • N
      mm/memory-failure.c: define page types for action_result() in one place · 64d37a2b
      Naoya Horiguchi 提交于
      This cleanup patch moves all strings passed to action_result() into a
      singl= e array action_page_type so that a reader can easily find which
      kind of actio= n results are possible.  And this patch also fixes the
      odd lines to be printed out, like "unknown page state page" or "free
      buddy, 2nd try page".
      
      [akpm@linux-foundation.org: rename messages, per David]
      [akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU', per Andi]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Xie XiuQi" <xiexiuqi@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64d37a2b
  6. 13 2月, 2015 2 次提交
    • N
      mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() · 9ab3b598
      Naoya Horiguchi 提交于
      A race condition starts to be visible in recent mmotm, where a PG_hwpoison
      flag is set on a migration source page *before* it's back in buddy page
      poo= l.
      
      This is problematic because no page flag is supposed to be set when
      freeing (see __free_one_page().) So the user-visible effect of this race
      is that it could trigger the BUG_ON() when soft-offlining is called.
      
      The root cause is that we call lru_add_drain_all() to make sure that the
      page is in buddy, but that doesn't work because this function just
      schedule= s a work item and doesn't wait its completion.
      drain_all_pages() does drainin= g directly, so simply dropping
      lru_add_drain_all() solves this problem.
      
      Fixes: f15bdfa8 ("mm/memory-failure.c: fix memory leak in successful soft offlining")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[3.11+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ab3b598
    • V
      vmscan: per memory cgroup slab shrinkers · cb731d6c
      Vladimir Davydov 提交于
      This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
      set, it will be called per memory cgroup.  The memory cgroup to scan
      objects from is passed in shrink_control->memcg.  If the memory cgroup
      is NULL, a memcg aware shrinker is supposed to scan objects from the
      global list.  Unaware shrinkers are only called on global pressure with
      memcg=NULL.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb731d6c
  7. 14 12月, 2014 3 次提交
  8. 11 12月, 2014 2 次提交
    • V
      mm, memory_hotplug/failure: drain single zone pcplists · c0554329
      Vlastimil Babka 提交于
      Memory hotplug and failure mechanisms have several places where pcplists
      are drained so that pages are returned to the buddy allocator and can be
      e.g. prepared for offlining.  This is always done in the context of a
      single zone, we can reduce the pcplists drain to the single zone, which
      is now possible.
      
      The change should make memory offlining due to hotremove or failure
      faster and not disturbing unrelated pcplists anymore.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0554329
    • V
      mm: introduce single zone pcplists drain · 93481ff0
      Vlastimil Babka 提交于
      The functions for draining per-cpu pages back to buddy allocators
      currently always operate on all zones.  There are however several cases
      where the drain is only needed in the context of a single zone, and
      spilling other pcplists is a waste of time both due to the extra
      spilling and later refilling.
      
      This patch introduces new zone pointer parameter to drain_all_pages()
      and changes the dummy parameter of drain_local_pages() to be also a zone
      pointer.  When NULL is passed, the functions operate on all zones as
      usual.  Passing a specific zone pointer reduces the work to the single
      zone.
      
      All callers are updated to pass the NULL pointer in this patch.
      Conversion to single zone (where appropriate) is done in further
      patches.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93481ff0
  9. 22 10月, 2014 1 次提交
  10. 19 9月, 2014 1 次提交
  11. 07 8月, 2014 1 次提交
  12. 31 7月, 2014 2 次提交
  13. 24 7月, 2014 1 次提交
  14. 04 7月, 2014 1 次提交
    • C
      hwpoison: fix the handling path of the victimized page frame that belong to non-LRU · 0bc1f8b0
      Chen Yucong 提交于
      Until now, the kernel has the same policy to handle victimized page
      frames that belong to kernel-space(reserved/slab-subsystem) or
      non-LRU(unknown page state).  In other word, the result of handling
      either of these victimized page frames is (IGNORED | FAILED), and the
      return value of memory_failure() is -EBUSY.
      
      This patch is to avoid that memory_failure() returns very soon due to
      the "true" value of (!PageLRU(p)), and it also ensures that
      action_result() can report more precise information("reserved kernel",
      "kernel slab", and "unknown page state") instead of "non LRU",
      especially for memory errors which are detected by memory-scrubbing.
      
      Andi said:
      
      : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
      :
      : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
      : page:ffffea000015b400 count:3 mapcount:2097169 mapping:          (null) index:0xffff8800056d7000
      : page flags: 0x3ffff800004081(locked|slab|head)
      : ------------[ cut here ]------------
      : kernel BUG at mm/rmap.c:1495!
      :
      : I think what happened is that a LRU page turned into a slab page in
      : parallel with offlining.  memory_failure initially tests for this case,
      : but doesn't retest later after the page has been locked.
      :
      : ...
      :
      : I ran this patch in a loop over night with some stress plus
      : the mcelog test suite running in a loop. I cannot guarantee it hit it,
      : but it should have given it a good beating.
      :
      : The kernel survived with no messages, although the mcelog test suite
      : got killed at some point because it couldn't fork anymore. Probably
      : some unrelated problem.
      :
      : So the patch is ok for me for .16.
      Signed-off-by: NChen Yucong <slaoub@gmail.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NAndi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bc1f8b0
  15. 05 6月, 2014 7 次提交
    • N
      mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO) · 3ba08129
      Naoya Horiguchi 提交于
      Currently memory error handler handles action optional errors in the
      deferred manner by default.  And if a recovery aware application wants
      to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
      However, such signal can be sent only to the main thread, so it's
      problematic if the application wants to have a dedicated thread to
      handler such signals.
      
      So this patch adds dedicated thread support to memory error handler.  We
      have PF_MCE_EARLY flags for each thread separately, so with this patch
      AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
      thread.  If you want to implement a dedicated thread, you call prctl()
      to set PF_MCE_EARLY on the thread.
      
      Memory error handler collects processes to be killed, so this patch lets
      it check PF_MCE_EARLY flag on each thread in the collecting routines.
      
      No behavioral change for all non-early kill cases.
      
      Tony said:
      
      : The old behavior was crazy - someone with a multithreaded process might
      : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
      : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
      : that thread wasn't the main thread for the process.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: Kamil Iskra <iskra@mcs.anl.gov>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ba08129
    • T
      mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED · 74614de1
      Tony Luck 提交于
      When Linux sees an "action optional" machine check (where h/w has reported
      an error that is not in the current execution path) we generally do not
      want to signal a process, since most processes do not have a SIGBUS
      handler - we'd just prematurely terminate the process for a problem that
      they might never actually see.
      
      task_early_kill() decides whether to consider a process - and it checks
      whether this specific process has been marked for early signals with
      "prctl", or if the system administrator has requested early signals for
      all processes using /proc/sys/vm/memory_failure_early_kill.
      
      But for MF_ACTION_REQUIRED case we must not defer.  The error is in the
      execution path of the current thread so we must send the SIGBUS
      immediatley.
      
      Fix by passing a flag argument through collect_procs*() to
      task_early_kill() so it knows whether we can defer or must take action.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74614de1
    • T
      mm/memory-failure.c-failure: send right signal code to correct thread · a70ffcac
      Tony Luck 提交于
      When a thread in a multi-threaded application hits a machine check because
      of an uncorrectable error in memory - we want to send the SIGBUS with
      si.si_code = BUS_MCEERR_AR to that thread.  Currently we fail to do that
      if the active thread is not the primary thread in the process.
      collect_procs() just finds primary threads and this test:
      
      	if ((flags & MF_ACTION_REQUIRED) && t == current) {
      
      will see that the thread we found isn't the current thread and so send a
      si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
      thread at this time).
      
      We can fix this by checking whether "current" shares the same mm with the
      process that collect_procs() said owned the page.  If so, we send the
      SIGBUS to current (with code BUS_MCEERR_AR).
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NOtto Bruggeman <otto.g.bruggeman@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a70ffcac
    • N
      mm/memory-failure.c: move comment · 6edd6cc6
      Naoya Horiguchi 提交于
      The comment about pages under writeback is far from the relevant code, so
      let's move it to the right place.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6edd6cc6
    • D
      mm, migration: add destination page freeing callback · 68711a74
      David Rientjes 提交于
      Memory migration uses a callback defined by the caller to determine how to
      allocate destination pages.  When migration fails for a source page,
      however, it frees the destination page back to the system.
      
      This patch adds a memory migration callback defined by the caller to
      determine how to free destination pages.  If a caller, such as memory
      compaction, builds its own freelist for migration targets, this can reuse
      already freed memory instead of scanning additional memory.
      
      If the caller provides a function to handle freeing of destination pages,
      it is called when page migration fails.  If the caller passes NULL then
      freeing back to the system will be handled as usual.  This patch
      introduces no functional change.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68711a74
    • C
      mm: replace __get_cpu_var uses with this_cpu_ptr · 7c8e0181
      Christoph Lameter 提交于
      Replace places where __get_cpu_var() is used for an address calculation
      with this_cpu_ptr().
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c8e0181
    • V
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov 提交于
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them dead lock prune.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
  16. 24 5月, 2014 1 次提交