1. 05 6月, 2014 7 次提交
    • N
      mm/memory-failure.c: support use of a dedicated thread to handle SIGBUS(BUS_MCEERR_AO) · 3ba08129
      Naoya Horiguchi 提交于
      Currently memory error handler handles action optional errors in the
      deferred manner by default.  And if a recovery aware application wants
      to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
      However, such signal can be sent only to the main thread, so it's
      problematic if the application wants to have a dedicated thread to
      handler such signals.
      
      So this patch adds dedicated thread support to memory error handler.  We
      have PF_MCE_EARLY flags for each thread separately, so with this patch
      AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
      thread.  If you want to implement a dedicated thread, you call prctl()
      to set PF_MCE_EARLY on the thread.
      
      Memory error handler collects processes to be killed, so this patch lets
      it check PF_MCE_EARLY flag on each thread in the collecting routines.
      
      No behavioral change for all non-early kill cases.
      
      Tony said:
      
      : The old behavior was crazy - someone with a multithreaded process might
      : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
      : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
      : that thread wasn't the main thread for the process.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: Kamil Iskra <iskra@mcs.anl.gov>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ba08129
    • T
      mm/memory-failure.c: don't let collect_procs() skip over processes for MF_ACTION_REQUIRED · 74614de1
      Tony Luck 提交于
      When Linux sees an "action optional" machine check (where h/w has reported
      an error that is not in the current execution path) we generally do not
      want to signal a process, since most processes do not have a SIGBUS
      handler - we'd just prematurely terminate the process for a problem that
      they might never actually see.
      
      task_early_kill() decides whether to consider a process - and it checks
      whether this specific process has been marked for early signals with
      "prctl", or if the system administrator has requested early signals for
      all processes using /proc/sys/vm/memory_failure_early_kill.
      
      But for MF_ACTION_REQUIRED case we must not defer.  The error is in the
      execution path of the current thread so we must send the SIGBUS
      immediatley.
      
      Fix by passing a flag argument through collect_procs*() to
      task_early_kill() so it knows whether we can defer or must take action.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74614de1
    • T
      mm/memory-failure.c-failure: send right signal code to correct thread · a70ffcac
      Tony Luck 提交于
      When a thread in a multi-threaded application hits a machine check because
      of an uncorrectable error in memory - we want to send the SIGBUS with
      si.si_code = BUS_MCEERR_AR to that thread.  Currently we fail to do that
      if the active thread is not the primary thread in the process.
      collect_procs() just finds primary threads and this test:
      
      	if ((flags & MF_ACTION_REQUIRED) && t == current) {
      
      will see that the thread we found isn't the current thread and so send a
      si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
      thread at this time).
      
      We can fix this by checking whether "current" shares the same mm with the
      process that collect_procs() said owned the page.  If so, we send the
      SIGBUS to current (with code BUS_MCEERR_AR).
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NOtto Bruggeman <otto.g.bruggeman@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Chen Gong <gong.chen@linux.jf.intel.com>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a70ffcac
    • N
      mm/memory-failure.c: move comment · 6edd6cc6
      Naoya Horiguchi 提交于
      The comment about pages under writeback is far from the relevant code, so
      let's move it to the right place.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6edd6cc6
    • D
      mm, migration: add destination page freeing callback · 68711a74
      David Rientjes 提交于
      Memory migration uses a callback defined by the caller to determine how to
      allocate destination pages.  When migration fails for a source page,
      however, it frees the destination page back to the system.
      
      This patch adds a memory migration callback defined by the caller to
      determine how to free destination pages.  If a caller, such as memory
      compaction, builds its own freelist for migration targets, this can reuse
      already freed memory instead of scanning additional memory.
      
      If the caller provides a function to handle freeing of destination pages,
      it is called when page migration fails.  If the caller passes NULL then
      freeing back to the system will be handled as usual.  This patch
      introduces no functional change.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68711a74
    • C
      mm: replace __get_cpu_var uses with this_cpu_ptr · 7c8e0181
      Christoph Lameter 提交于
      Replace places where __get_cpu_var() is used for an address calculation
      with this_cpu_ptr().
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c8e0181
    • V
      mem-hotplug: implement get/put_online_mems · bfc8c901
      Vladimir Davydov 提交于
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
        myself, because it used an rw semaphore for get/put_online_mems,
        making them dead lock prune.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
  2. 24 5月, 2014 2 次提交
  3. 04 3月, 2014 1 次提交
    • D
      mm: close PageTail race · 668f9abb
      David Rientjes 提交于
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  4. 12 2月, 2014 1 次提交
    • T
      cgroup: introduce cgroup_ino() · b1664924
      Tejun Heo 提交于
      mm/memory-failure.c::hwpoison_filter_task() has been reaching into
      cgroup to extract the associated ino to be used as a filtering
      criterion.  This is an implementation detail which shouldn't be
      depended upon from outside cgroup proper and is about to change with
      the scheduled kernfs conversion.
      
      This patch introduces a proper interface to determine the associated
      ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
      instead of reaching directly into cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      b1664924
  5. 11 2月, 2014 1 次提交
    • N
      mm/memory-failure.c: move refcount only in !MF_COUNT_INCREASED · 8d547ff4
      Naoya Horiguchi 提交于
      mce-test detected a test failure when injecting error to a thp tail
      page.  This is because we take page refcount of the tail page in
      madvise_hwpoison() while the fix in commit a3e0f9e4
      ("mm/memory-failure.c: transfer page count from head page to tail page
      after split thp") assumes that we always take refcount on the head page.
      
      When a real memory error happens we take refcount on the head page where
      memory_failure() is called without MF_COUNT_INCREASED set, so it seems
      to me that testing memory error on thp tail page using madvise makes
      little sense.
      
      This patch cancels moving refcount in !MF_COUNT_INCREASED for valid
      testing.
      
      [akpm@linux-foundation.org: s/&&/&/]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[3.9+: a3e0f9e4]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d547ff4
  6. 24 1月, 2014 1 次提交
    • N
      mm/memory-failure.c: shift page lock from head page to tail page after thp split · 54b9dd14
      Naoya Horiguchi 提交于
      After thp split in hwpoison_user_mappings(), we hold page lock on the
      raw error page only between try_to_unmap, hence we are in danger of race
      condition.
      
      I found in the RHEL7 MCE-relay testing that we have "bad page" error
      when a memory error happens on a thp tail page used by qemu-kvm:
      
        Triggering MCE exception on CPU 10
        mce: [Hardware Error]: Machine check events logged
        MCE exception done on CPU 10
        MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
        MCE 0x38c535: dirty LRU page recovery: Recovered
        qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
        BUG: Bad page state in process qemu-kvm  pfn:38c400
        page:ffffea000e310000 count:0 mapcount:0 mapping:          (null) index:0x7ffae3c00
        page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
        Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
        CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G   M        --------------   3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
        Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
        Call Trace:
          dump_stack+0x19/0x1b
          bad_page.part.59+0xcf/0xe8
          free_pages_prepare+0x148/0x160
          free_hot_cold_page+0x31/0x140
          free_hot_cold_page_list+0x46/0xa0
          release_pages+0x1c1/0x200
          free_pages_and_swap_cache+0xad/0xd0
          tlb_flush_mmu.part.46+0x4c/0x90
          tlb_finish_mmu+0x55/0x60
          exit_mmap+0xcb/0x170
          mmput+0x67/0xf0
          vhost_dev_cleanup+0x231/0x260 [vhost_net]
          vhost_net_release+0x3f/0x90 [vhost_net]
          __fput+0xe9/0x270
          ____fput+0xe/0x10
          task_work_run+0xc4/0xe0
          do_exit+0x2bb/0xa40
          do_group_exit+0x3f/0xa0
          get_signal_to_deliver+0x1d0/0x6e0
          do_signal+0x48/0x5e0
          do_notify_resume+0x71/0xc0
          retint_signal+0x48/0x8c
      
      The reason of this bug is that a page fault happens before unlocking the
      head page at the end of memory_failure().  This strange page fault is
      trying to access to address 0x20 and I'm not sure why qemu-kvm does
      this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the
      way we catch the bad page bug/warning because we try to free a locked
      page (which was the former head page.)
      
      To fix this, this patch suggests to shift page lock from head page to
      tail page just after thp split.  SIGSEGV still happens, but it affects
      only error affected VMs, not a whole system.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>        [3.9+] # a3e0f9e4 "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54b9dd14
  7. 22 1月, 2014 2 次提交
  8. 03 1月, 2014 1 次提交
    • N
      mm/memory-failure.c: transfer page count from head page to tail page after split thp · a3e0f9e4
      Naoya Horiguchi 提交于
      Memory failures on thp tail pages cause kernel panic like below:
      
         mce: [Hardware Error]: Machine check events logged
         MCE exception done on CPU 7
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
         IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0
         PGD bae42067 PUD ba47d067 PMD 0
         Oops: 0000 [#1] SMP
        ...
         CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G   M       O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
        ...
         Call Trace:
           me_huge_page+0x3e/0x50
           memory_failure+0x4bb/0xc20
           mce_process_work+0x3e/0x70
           process_one_work+0x171/0x420
           worker_thread+0x11b/0x3a0
           ? manage_workers.isra.25+0x2b0/0x2b0
           kthread+0xe4/0x100
           ? kthread_create_on_node+0x190/0x190
           ret_from_fork+0x7c/0xb0
           ? kthread_create_on_node+0x190/0x190
        ...
         RIP   dequeue_hwpoisoned_huge_page+0x131/0x1e0
         CR2: 0000000000000058
      
      The reasoning of this problem is shown below:
       - when we have a memory error on a thp tail page, the memory error
         handler grabs a refcount of the head page to keep the thp under us.
       - Before unmapping the error page from processes, we split the thp,
         where page refcounts of both of head/tail pages don't change.
       - Then we call try_to_unmap() over the error page (which was a tail
         page before). We didn't pin the error page to handle the memory error,
         this error page is freed and removed from LRU list.
       - We never have the error page on LRU list, so the first page state
         check returns "unknown page," then we move to the second check
         with the saved page flag.
       - The saved page flag have PG_tail set, so the second page state check
         returns "hugepage."
       - We call me_huge_page() for freed error page, then we hit the above panic.
      
      The root cause is that we didn't move refcount from the head page to the
      tail page after split thp.  So this patch suggests to do this.
      
      This panic was introduced by commit 524fca1e ("HWPOISON: fix
      misjudgement of page_action() for errors on mlocked pages").  Note that we
      did have the same refcount problem before this commit, but it was just
      ignored because we had only first page state check which returned "unknown
      page." The commit changed the refcount problem from "doesn't work" to
      "kernel panic."
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3e0f9e4
  9. 19 12月, 2013 1 次提交
  10. 15 11月, 2013 1 次提交
    • S
      kfifo API type safety · 498d319b
      Stefani Seibold 提交于
      This patch enhances the type safety for the kfifo API.  It is now safe
      to put const data into a non const FIFO and the API will now generate a
      compiler warning when reading from the fifo where the destination
      address is pointing to a const variable.
      
      As a side effect the kfifo_put() does now expect the value of an element
      instead a pointer to the element.  This was suggested Russell King.  It
      make the handling of the kfifo_put easier since there is no need to
      create a helper variable for getting the address of a pointer or to pass
      integers of different sizes.
      
      IMHO the API break is okay, since there are currently only six users of
      kfifo_put().
      
      The code is also cleaner by kicking out the "if (0)" expressions.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NStefani Seibold <stefani@seibold.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      498d319b
  11. 13 11月, 2013 1 次提交
  12. 01 10月, 2013 2 次提交
  13. 12 9月, 2013 9 次提交
    • W
      mm/memory-failure.c: fix bug triggered by unpoisoning empty zero page · 3ba5eebc
      Wanpeng Li 提交于
        Injecting memory failure for page 0x19d0 at 0xb77d2000
        MCE 0x19d0: non LRU page recovery: Ignored
        MCE: Software-unpoisoned page 0x19d0
        BUG: Bad page state in process bash  pfn:019d0
        page:f3461a00 count:0 mapcount:0 mapping:  (null) index:0x0
        page flags: 0x40000404(referenced|reserved)
        Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t
        CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12
        Hardware name: LENOVO 7034DD7/        , BIOS 9HKT47AUS 01//2012
         00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18
         ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000
         f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0
        Call Trace:
          dump_stack+0x41/0x52
          bad_page+0xcf/0xeb
          free_pages_prepare+0x12a/0x140
          free_hot_cold_page+0x21/0x110
          __put_single_page+0x21/0x30
          put_page+0x25/0x40
          unpoison_memory+0x107/0x200
          hwpoison_unpoison+0x20/0x30
          simple_attr_write+0xb6/0xd0
          vfs_write+0xa0/0x1b0
          SyS_write+0x4f/0x90
          sysenter_do_call+0x12/0x22
        Disabling lock debugging due to kernel taint
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 1
      #define PAGE_SIZE	4096
      
      int main(void)
      {
      	char *mem;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	munmap(mem, PAGES_TO_TEST * PAGE_SIZE);
      
      	return 0;
      }
      
      There is one page reference count for default empty zero page,
      madvise_hwpoison add another one by get_user_pages_fast.  memory_hwpoison
      reduce one page reference count since it's a non LRU page.
      unpoison_memory release the last page reference count and free empty zero
      page to buddy system which is not correct since empty zero page has
      PG_reserved flag.  This patch fix it by don't reduce the page reference
      count under 1 against empty zero page.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ba5eebc
    • W
      mm/hwpoison: drop forward reference declarations __soft_offline_page() · 86e05773
      Wanpeng Li 提交于
      Drop forward reference declarations __soft_offline_page.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86e05773
    • W
      mm/hwpoison: don't set migration type twice to avoid holding heavily contend zone->lock · 0be35096
      Wanpeng Li 提交于
      Set pageblock migration type will hold zone->lock which is heavy contended
      in system to avoid race.  However, soft offline page will set pageblock
      migration type twice during get page if the page is in used, not hugetlbfs
      page and not on lru list.  There is unnecessary to set the pageblock
      migration type and hold heavy contended zone->lock again if the first
      round get page have already set the pageblock to right migration type.
      
      The trick here is migration type is MIGRATE_ISOLATE.  There are other two
      parts can change MIGRATE_ISOLATE except hwpoison.  One is memory hoplug,
      however, we hold lock_memory_hotplug() which avoid race.  The second is
      CMA which umovable page allocation requst can't fallback to.  So it's safe
      here.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0be35096
    • W
      mm/hwpoison: replace atomic_long_sub() with atomic_long_dec() · dd9538a5
      Wanpeng Li 提交于
      Replace atomic_long_sub() with atomic_long_dec() since the page is normal
      page instead of hugetlbfs page or thp.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd9538a5
    • W
      mm/hwpoison: fix race against poison thp · 0cea3fdc
      Wanpeng Li 提交于
      There is a race between hwpoison page and unpoison page, memory_failure
      set the page hwpoison and increase num_poisoned_pages without hold page
      lock, and one page count will be accounted against thp for
      num_poisoned_pages.  However, unpoison can occur before memory_failure
      hold page lock and split transparent hugepage, unpoison will decrease
      num_poisoned_pages by 1 << compound_order since memory_failure has not yet
      split transparent hugepage with page lock held.  That means we account one
      page for hwpoison and 1 << compound_order for unpoison.  This patch fix it
      by inserting a PageTransHuge check before doing TestClearPageHWPoison,
      unpoison failed without clearing PageHWPoison and decreasing
      num_poisoned_pages.
      
                  A                                                 	B
          	memory_failue
              TestSetPageHWPoison(p);
              if (PageHuge(p))
                  nr_pages = 1 << compound_order(hpage);
              else
                  nr_pages = 1;
              atomic_long_add(nr_pages, &num_poisoned_pages);
                                                                  unpoison_memory
      	                                                        nr_pages = 1<< compound_trans_order(page);
                                                                  if(TestClearPageHWPoison(p))
                                                                  atomic_long_sub(nr_pages, &num_poisoned_pages);
              lock page
              if (!PageHWPoison(p))
              	unlock page and return
              hwpoison_user_mappings
              if (PageTransHuge(hpage))
              	split_huge_page(hpage);
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cea3fdc
    • W
      mm/hwpoison: don't need to hold compound lock for hugetlbfs page · f9121153
      Wanpeng Li 提交于
      compound lock is introduced by commit e9da73d6("thp: compound_lock."), it
      is used to serialize put_page against __split_huge_page_refcount().  In
      addition, transparent hugepages will be splitted in hwpoison handler and
      just one subpage will be poisoned.  There is unnecessary to hold compound
      lock for hugetlbfs page.  This patch replace compound_trans_order by
      compond_order in the place where the page is hugetlbfs page.
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9121153
    • W
      mm/hwpoison: fix loss of PG_dirty for errors on mlocked pages · 841fcc58
      Wanpeng Li 提交于
      memory_failure() store the page flag of the error page before doing unmap,
      and (only) if the first check with page flags at the time decided the
      error page is unknown, it do the second check with the stored page flag
      since memory_failure() does unmapping of the error pages before doing
      page_action().  This unmapping changes the page state, especially
      page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked, so
      page_action() can't catch mlocked pages after that.
      
      However, memory_failure() can't handle memory errors on dirty mlocked
      pages correctly.  try_to_unmap_one will move the dirty bit from pte to the
      physical page, the second check lose it since it check the stored page
      flag.  This patch fix it by restore PG_dirty flag to stored page flag if
      the page is dirty.
      
      Testcase:
      
      #define _GNU_SOURCE
      #include <stdlib.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <errno.h>
      
      #define PAGES_TO_TEST 2
      #define PAGE_SIZE	4096
      
      int main(void)
      {
      	char *mem;
      	int i;
      
      	mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
      			PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0, 0);
      
      	for (i = 0; i < PAGES_TO_TEST; i++)
      		mem[i * PAGE_SIZE] = 'a';
      
      	if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
      		return -1;
      
      	return 0;
      }
      
      Before patch:
      
      [  912.839247] Injecting memory failure for page 7dfb8 at 7f6b4e37b000
      [  912.839257] MCE 0x7dfb8: clean mlocked LRU page recovery: Recovered
      [  912.845550] MCE 0x7dfb8: clean mlocked LRU page still referenced by 1 users
      [  912.852586] Injecting memory failure for page 7e6aa at 7f6b4e37c000
      [  912.852594] MCE 0x7e6aa: clean mlocked LRU page recovery: Recovered
      [  912.858936] MCE 0x7e6aa: clean mlocked LRU page still referenced by 1 users
      
      After patch:
      
      [  163.590225] Injecting memory failure for page 91bc2f at 7f9f5b0e5000
      [  163.590264] MCE 0x91bc2f: dirty mlocked LRU page recovery: Recovered
      [  163.596680] MCE 0x91bc2f: dirty mlocked LRU page still referenced by 1 users
      [  163.603831] Injecting memory failure for page 91cdd3 at 7f9f5b0e6000
      [  163.603852] MCE 0x91cdd3: dirty mlocked LRU page recovery: Recovered
      [  163.610305] MCE 0x91cdd3: dirty mlocked LRU page still referenced by 1 users
      Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      841fcc58
    • N
      hwpoison: always unset MIGRATE_ISOLATE before returning from soft_offline_page() · 0d6fdbdb
      Naoya Horiguchi 提交于
      Soft offline code expects that MIGRATE_ISOLATE is set on the target page
      only during soft offlining work.  But currenly it doesn't work as expected
      when get_any_page() fails and returns negative value.  In the result, end
      users can have unexpectedly isolated pages.  This patch just fixes it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d6fdbdb
    • N
      mm: soft-offline: use migrate_pages() instead of migrate_huge_page() · b8ec1cee
      Naoya Horiguchi 提交于
      Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
      as an argument, instead of taking a pointer to the list of hugepages to be
      migrated.  This behavior was introduced in commit 189ebff2 ("hugetlb:
      simplify migrate_huge_page()"), and was OK because until now hugepage
      migration is enabled only for soft-offlining which migrates only one
      hugepage in a single call.
      
      But the situation will change in the later patches in this series which
      enable other users of page migration to support hugepage migration.  They
      can kick migration for both of normal pages and hugepages in a single
      call, so we need to go back to original implementation which uses linked
      lists to collect the hugepages to be migrated.
      
      With this patch, soft_offline_huge_page() switches to use migrate_pages(),
      and migrate_huge_page() is not used any more.  So let's remove it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8ec1cee
  14. 11 9月, 2013 1 次提交
    • D
      shrinker: add node awareness · 0ce3d744
      Dave Chinner 提交于
      Pass the node of the current zone being reclaimed to shrink_slab(),
      allowing the shrinker control nodemask to be set appropriately for node
      aware shrinkers.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0ce3d744
  15. 27 8月, 2013 1 次提交
  16. 11 7月, 2013 1 次提交
  17. 04 7月, 2013 1 次提交
    • N
      mm/memory-failure.c: fix memory leak in successful soft offlining · f15bdfa8
      Naoya Horiguchi 提交于
      After a successful page migration by soft offlining, the source page is
      not properly freed and it's never reusable even if we unpoison it
      afterward.
      
      This is caused by the race between freeing page and setting PG_hwpoison.
      In successful soft offlining, the source page is put (and the refcount
      becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked
      to pagevec and actual freeing back to buddy is delayed.  So if
      PG_hwpoison is set for the page before freeing, the freeing does not
      functions as expected (in such case freeing aborts in
      free_pages_prepare() check.)
      
      This patch tries to make sure to free the source page before setting
      PG_hwpoison on it.  To avoid reallocating, the page keeps
      MIGRATE_ISOLATE until after setting PG_hwpoison.
      
      This patch also removes obsolete comments about "keeping elevated
      refcount" because what they say is not true.  Unlike memory_failure(),
      soft_offline_page() uses no special page isolation code, and the
      soft-offlined pages have no elevated.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f15bdfa8
  18. 30 4月, 2013 1 次提交
  19. 24 2月, 2013 5 次提交