1. 24 2月, 2013 7 次提交
  2. 12 12月, 2012 2 次提交
    • N
      mm: hwpoison: fix action_result() to print out dirty/clean · ff604cf6
      Naoya Horiguchi 提交于
      action_result() fails to print out "dirty" even if an error occurred on
      a dirty pagecache, because when we check PageDirty in action_result() it
      was cleared after page isolation even if it's dirty before error
      handling.  This can break some applications that monitor this message,
      so should be fixed.
      
      There are several callers of action_result() except page_action(), but
      either of them are not for LRU pages but for free pages or kernel pages,
      so we don't have to consider dirty or not for them.
      
      Note that PG_dirty can be set outside page locks as described in commit
      6746aff7 ("HWPOISON: shmem: call set_page_dirty() with locked
      page"), so this patch does not completely closes the race window, but
      just narrows it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Jun'ichi Nomura" <j-nomura@ce.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff604cf6
    • W
      memory-hotplug: skip HWPoisoned page when offlining pages · b023f468
      Wen Congyang 提交于
      hwpoisoned may be set when we offline a page by the sysfs interface
      /sys/devices/system/memory/soft_offline_page or
      /sys/devices/system/memory/hard_offline_page. If we don't clear
      this flag when onlining pages, this page can't be freed, and will
      not in free list. So we can't offline these pages again. So we
      should skip such page when offlining pages.
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b023f468
  3. 11 12月, 2012 2 次提交
    • I
      mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable · 4fc3f1d6
      Ingo Molnar 提交于
      rmap_walk_anon() and try_to_unmap_anon() appears to be too
      careful about locking the anon vma: while it needs protection
      against anon vma list modifications, it does not need exclusive
      access to the list itself.
      
      Transforming this exclusive lock to a read-locked rwsem removes
      a global lock from the hot path of page-migration intense
      threaded workloads which can cause pathological performance like
      this:
      
          96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                        |
                        --- perf_trace_sched_switch
                            __schedule
                            schedule
                            schedule_preempt_disabled
                            __mutex_lock_common.isra.6
                            __mutex_lock_slowpath
                            mutex_lock
                           |
                           |--50.61%-- rmap_walk
                           |          move_to_new_page
                           |          migrate_pages
                           |          migrate_misplaced_page
                           |          __do_numa_page.isra.69
                           |          handle_pte_fault
                           |          handle_mm_fault
                           |          __do_page_fault
                           |          do_page_fault
                           |          page_fault
                           |          __memset_sse2
                           |          |
                           |           --100.00%-- worker_thread
                           |                     |
                           |                      --100.00%-- start_thread
                           |
                            --49.39%-- page_lock_anon_vma
                                      try_to_unmap_anon
                                      try_to_unmap
                                      migrate_pages
                                      migrate_misplaced_page
                                      __do_numa_page.isra.69
                                      handle_pte_fault
                                      handle_mm_fault
                                      __do_page_fault
                                      do_page_fault
                                      page_fault
                                      __memset_sse2
                                      |
                                       --100.00%-- worker_thread
                                                 start_thread
      
      With this change applied the profile is now nicely flat
      and there's no anon-vma related scheduling/blocking.
      
      Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
      to make it clearer that it's an exclusive write-lock in
      that case - suggested by Rik van Riel.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      4fc3f1d6
    • M
      mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      Mel Gorman 提交于
      The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      7b2a2d4a
  4. 01 12月, 2012 1 次提交
  5. 09 10月, 2012 2 次提交
    • M
      mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Michel Lespinasse 提交于
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevent vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf181b9f
    • M
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse 提交于
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
  6. 01 8月, 2012 2 次提交
  7. 31 7月, 2012 1 次提交
  8. 12 7月, 2012 1 次提交
    • T
      x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults · 6751ed65
      Tony Luck 提交于
      In commit dad1743e ("x86/mce: Only restart instruction after machine
      check recovery if it is safe") we fixed mce_notify_process() to force a
      signal to the current process if it was not restartable (RIPV bit not
      set in MCG_STATUS). But doing it here means that the process doesn't
      get told the virtual address of the fault via siginfo_t->si_addr. This
      would prevent application level recovery from the fault.
      
      Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
      that we will provide the right information with the signal.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NBorislav Petkov <borislav.petkov@amd.com>
      Cc: stable@kernel.org    # 3.4+
      6751ed65
  9. 30 5月, 2012 1 次提交
  10. 21 5月, 2012 1 次提交
  11. 22 3月, 2012 1 次提交
  12. 13 1月, 2012 1 次提交
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman 提交于
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
  13. 04 1月, 2012 2 次提交
    • T
      HWPOISON: Add code to handle "action required" errors. · 7329bbeb
      Tony Luck 提交于
      Add new flag bit "MF_ACTION_REQUIRED" to be used by machine check
      code to force a signal with si_code = BUS_MCEERR_AR in the case
      where the error occurs in processor execution context. Pass the
      flags argument along call chain:
      	memory_failure()
      	  hwpoison_user_mappings()
      	    kill_procs()
      	      kill_proc()
      
      Drop the "_ao" suffix from kill_procs_ao() and kill_proc_ao() since
      they can now handle "action required" as well as "action optional" errors.
      Acked-by: NBorislav Petkov <bp@amd64.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      7329bbeb
    • T
      HWPOISON: Clean up memory_failure() vs. __memory_failure() · cd42f4a3
      Tony Luck 提交于
      There is only one caller of memory_failure(), all other users call
      __memory_failure() and pass in the flags argument explicitly. The
      lone user of memory_failure() will soon need to pass flags too.
      
      Add flags argument to the callsite in mce.c. Delete the old memory_failure()
      function, and then rename __memory_failure() without the leading "__".
      
      Provide clearer message when action optional memory errors are ignored.
      Acked-by: NBorislav Petkov <bp@amd64.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      cd42f4a3
  14. 01 11月, 2011 1 次提交
  15. 31 10月, 2011 1 次提交
  16. 03 8月, 2011 1 次提交
    • H
      HWPoison: add memory_failure_queue() · ea8f5fb8
      Huang Ying 提交于
      memory_failure() is the entry point for HWPoison memory error
      recovery.  It must be called in process context.  But commonly
      hardware memory errors are notified via MCE or NMI, so some delayed
      execution mechanism must be used.  In MCE handler, a work queue + ring
      buffer mechanism is used.
      
      In addition to MCE, now APEI (ACPI Platform Error Interface) GHES
      (Generic Hardware Error Source) can be used to report memory errors
      too.  To add support to APEI GHES memory recovery, a mechanism similar
      to that of MCE is implemented.  memory_failure_queue() is the new
      entry point that can be called in IRQ context.  The next step is to
      make MCE handler uses this interface too.
      Signed-off-by: NHuang Ying <ying.huang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLen Brown <len.brown@intel.com>
      ea8f5fb8
  17. 28 6月, 2011 1 次提交
  18. 16 6月, 2011 1 次提交
    • M
      mm/memory-failure.c: fix page isolated count mismatch · 5db8a73a
      Minchan Kim 提交于
      Pages isolated for migration are accounted with the vmstat counters
      NR_ISOLATE_[ANON|FILE].  Callers of migrate_pages() are expected to
      increment these counters when pages are isolated from the LRU.  Once the
      pages have been migrated, they are put back on the LRU or freed and the
      isolated count is decremented.
      
      Memory failure is not properly accounting for pages it isolates causing
      the NR_ISOLATED counters to be negative.  On SMP builds, this goes
      unnoticed as negative counters are treated as 0 due to expected per-cpu
      drift.  On UP builds, the counter is treated by too_many_isolated() as a
      large value causing processes to enter D state during page reclaim or
      compaction.  This patch accounts for pages isolated by memory failure
      correctly.
      
      [mel@csn.ul.ie: rewrote changelog]
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5db8a73a
  19. 25 5月, 2011 4 次提交
  20. 31 3月, 2011 1 次提交
  21. 23 3月, 2011 1 次提交
  22. 18 3月, 2011 1 次提交
  23. 10 3月, 2011 1 次提交
  24. 03 2月, 2011 3 次提交
    • J
      thp: fix unsuitable behavior for hwpoisoned tail page · af241a08
      Jin Dongming 提交于
      When a tail page of THP is poisoned, memory-failure will do nothing except
      setting PG_hwpoison, while the expected behavior is that the process, who
      is using the poisoned tail page, should be killed.
      
      The above problem is caused by lru check of the poisoned tail page of THP.
      Because PG_lru flag is only set on the head page of THP, the check always
      consider the poisoned tail page as NON lru page.
      
      So the lru check for the tail page of THP should be avoided, as like as
      hugetlb.
      
      This patch adds !PageTransCompound() before lru check for THP, because of
      the check (!PageHuge() && !PageTransCompound()) the whole branch could be
      optimized away at build time when both hugetlbfs and THP are set with "N"
      (or in archs not supporting either of those).
      
      [akpm@linux-foundation.org: fix unrelated typo in shake_page() comment]
      Signed-off-by: NJin Dongming <jin.dongming@np.css.fujitsu.com>
      Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af241a08
    • J
      thp: fix the wrong reported address of hwpoisoned hugepages · a6d30ddd
      Jin Dongming 提交于
      When the tail page of THP is poisoned, the head page will be poisoned too.
       And the wrong address, address of head page, will be sent with sigbus
      always.
      
      So when the poisoned page is used by Guest OS which is running on KVM,
      after the address changing(hva->gpa) by qemu, the unexpected process on
      Guest OS will be killed by sigbus.
      
      What we expected is that the process using the poisoned tail page could be
      killed on Guest OS, but not that the process using the healthy head page
      is killed.
      
      Since it is not good to poison the healthy page, avoid poisoning other
      than the page which is really poisoned.
        (While we poison all pages in a huge page in case of hugetlb,
         we can do this for THP thanks to split_huge_page().)
      
      Here we fix two parts:
        1. Isolate the poisoned page only to make sure
           the reported address is the address of poisoned page.
        2. make the poisoned page work as the poisoned regular page.
      
      [akpm@linux-foundation.org: fix spello in comment]
      Signed-off-by: NJin Dongming <jin.dongming@np.css.fujitsu.com>
      Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6d30ddd
    • J
      thp: fix splitting of hwpoisoned hugepages · efeda7a4
      Jin Dongming 提交于
      The poisoned THP is now split with split_huge_page() in
      collect_procs_anon().  If kmalloc() is failed in collect_procs(),
      split_huge_page() could not be called.  And the work after
      split_huge_page() for collecting the processes using poisoned page will
      not be done, too.  So the processes using the poisoned page could not be
      killed.
      
      The condition becomes worse when CONFIG_DEBUG_VM == "Y".  Because the
      poisoned THP could not be split, system panic will be caused by
      VM_BUG_ON(PageTransHuge(page)) in try_to_unmap().
      
      This patch does:
        1. move split_huge_page() to the place before collect_procs().
           This can be sure the failure of splitting THP is caused by itself.
        2. when splitting THP is failed, stop the operations after it.
           This can avoid unexpected system panic or non sense works.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJin Dongming <jin.dongming@np.css.fujitsu.com>
      Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      efeda7a4