1. 03 1月, 2013 3 次提交
    • M
      mm: mempolicy: Convert shared_policy mutex to spinlock · 42288fe3
      Mel Gorman 提交于
      Sasha was fuzzing with trinity and reported the following problem:
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:269
        in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
        2 locks held by trinity-main/6361:
         #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0
         #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
        Pid: 6361, comm: trinity-main Tainted: G        W
        3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
        Call Trace:
          __might_sleep+0x1c3/0x1e0
          mutex_lock_nested+0x29/0x50
          mpol_shared_policy_lookup+0x2e/0x90
          shmem_get_policy+0x2e/0x30
          get_vma_policy+0x5a/0xa0
          mpol_misplaced+0x41/0x1d0
          handle_pte_fault+0x465/0x6a0
      
      This was triggered by a different version of automatic NUMA balancing
      but in theory the current version is vunerable to the same problem.
      
      do_numa_page
        -> numa_migrate_prep
          -> mpol_misplaced
            -> get_vma_policy
              -> shmem_get_policy
      
      It's very unlikely this will happen as shared pages are not marked
      pte_numa -- see the page_mapcount() check in change_pte_range() -- but
      it is possible.
      
      To address this, this patch restores sp->lock as originally implemented
      by Kosaki Motohiro.  In the path where get_vma_policy() is called, it
      should not be calling sp_alloc() so it is not necessary to treat the PTL
      specially.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42288fe3
    • H
      mempolicy: remove arg from mpol_parse_str, mpol_to_str · a7a88b23
      Hugh Dickins 提交于
      Remove the unused argument (formerly no_context) from mpol_parse_str()
      and from mpol_to_str().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7a88b23
    • H
      tmpfs mempolicy: fix /proc/mounts corrupting memory · f2a07f40
      Hugh Dickins 提交于
      Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
      mempolicy testing.  Very nasty.  Reading /proc/mounts, /proc/pid/mounts
      or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
      in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
      pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
      worse.  "mpol=prefer" and "mpol=prefer:Node" are equally toxic.
      
      Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
      when commit e17f74af "mempolicy: don't call mpol_set_nodemask() when
      no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
      which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
      With slab poisoning, you can then rely on mpol_to_str() to set the bit
      for node 0x6b6b, probably in the next page above the caller's stack.
      
      mpol_parse_str() is only called from shmem_parse_options(): no_context
      is always true, so call it unused for now, and remove !no_context code.
      Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
      expect.  Then mpol_to_str() can ignore its no_context argument also,
      the mpol being appropriately initialized whether contextualized or not.
      Rename its no_context unused too, and let subsequent patch remove them
      (that's not needed for stable backporting, which would involve rejects).
      
      I don't understand why MPOL_LOCAL is described as a pseudo-policy:
      it's a reasonable policy which suffers from a confusing implementation
      in terms of MPOL_PREFERRED with MPOL_F_LOCAL.  I believe this would be
      much more robust if MPOL_LOCAL were recognized in switch statements
      throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
      empty) nodes mask like everyone else, instead of its preferred_node
      variant (I presume an optimization from the days before MPOL_LOCAL).
      But that would take me too long to get right and fully tested.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2a07f40
  2. 13 12月, 2012 2 次提交
  3. 12 12月, 2012 1 次提交
  4. 11 12月, 2012 11 次提交
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman 提交于
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existance
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      1a687c2e
    • M
      mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely... · e42c8ff2
      Mel Gorman 提交于
      mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
      
      Note: This two-stage filter was taken directly from the sched/numa patch
      	"sched, numa, mm: Add the scanning page fault machinery" but is
      	only a partial extraction. As the end result is not necessarily
      	recognisable, the signed-offs-by had to be removed. Will be added
      	back if requested.
      
      While it is desirable that all threads in a process run on its home
      node, this is not always possible or necessary. There may be more
      threads than exist within the node or the node might over-subscribed
      with unrelated processes.
      
      This can cause a situation whereby a page gets migrated off its home
      node because the threads clearing pte_numa were running off-node. This
      patch uses page->last_nid to build a two-stage filter before pages get
      migrated to avoid problems with short or unlikely task<->node
      relationships.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      e42c8ff2
    • M
      mm: numa: Migrate on reference policy · 5606e387
      Mel Gorman 提交于
      This is the simplest possible policy that still does something of note.
      When a pte_numa is faulted, it is moved immediately. Any replacement
      policy must at least do better than this and in all likelihood this
      policy regresses normal workloads.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      5606e387
    • M
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Mel Gorman 提交于
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
      	benefit = (remote_accesses_before - remove_access_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
      expection would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      03c5a6e1
    • M
      mm: mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now · a720094d
      Mel Gorman 提交于
      The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
      explicitly request lazy migration is a good idea but the actual
      API has not been well reviewed and once released we have to support it.
      For now this patch prevents an application using the services. This
      will need to be revisited.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      a720094d
    • M
      mm: mempolicy: Implement change_prot_numa() in terms of change_protection() · 4b10e7d5
      Mel Gorman 提交于
      This patch converts change_prot_numa() to use change_protection(). As
      pte_numa and friends check the PTE bits directly it is necessary for
      change_protection() to use pmd_mknuma(). Hence the required
      modifications to change_protection() are a little clumsy but the
      end result is that most of the numa page table helpers are just one or
      two instructions.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      4b10e7d5
    • L
      mm: mempolicy: Add MPOL_MF_LAZY · b24f53a0
      Lee Schermerhorn 提交于
      NOTE: Once again there is a lot of patch stealing and the end result
      	is sufficiently different that I had to drop the signed-offs.
      	Will re-add if the original authors are ok with that.
      
      This patch adds another mbind() flag to request "lazy migration".  The
      flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
      pages are marked PROT_NONE. The pages will be migrated in the fault
      path on "first touch", if the policy dictates at that time.
      
      "Lazy Migration" will allow testing of migrate-on-fault via mbind().
      Also allows applications to specify that only subsequently touched
      pages be migrated to obey new policy, instead of all pages in range.
      This can be useful for multi-threaded applications working on a
      large shared data area that is initialized by an initial thread
      resulting in all pages on one [or a few, if overflowed] nodes.
      After PROT_NONE, the pages in regions assigned to the worker threads
      will be automatically migrated local to the threads on 1st touch.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      b24f53a0
    • L
      mm: mempolicy: Check for misplaced page · 771fb4d8
      Lee Schermerhorn 提交于
      This patch provides a new function to test whether a page resides
      on a node that is appropriate for the mempolicy for the vma and
      address where the page is supposed to be mapped.  This involves
      looking up the node where the page belongs.  So, the function
      returns that node so that it may be used to allocated the page
      without consulting the policy again.
      
      A subsequent patch will call this function from the fault path.
      Because of this, I don't want to go ahead and allocate the page, e.g.,
      via alloc_page_vma() only to have to free it if it has the correct
      policy.  So, I just mimic the alloc_page_vma() node computation
      logic--sort of.
      
      Note:  we could use this function to implement a MPOL_MF_STRICT
      behavior when migrating pages to match mbind() mempolicy--e.g.,
      to ensure that pages in an interleaved range are reinterleaved
      rather than left where they are when they reside on any page in
      the interleave nodemask.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      [ Added MPOL_F_LAZY to trigger migrate-on-fault;
        simplified code now that we don't have to bother
        with special crap for interleaved ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      771fb4d8
    • L
      mm: mempolicy: Add MPOL_NOOP · d3a71033
      Lee Schermerhorn 提交于
      This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
      to mbind().  When the NOOP policy is used with the 'MOVE and 'LAZY
      flags, mbind() will map the pages PROT_NONE so that they will be
      migrated on the next touch.
      
      This allows an application to prepare for a new phase of operation
      where different regions of shared storage will be assigned to
      worker threads, w/o changing policy.  Note that we could just use
      "default" policy in this case.  However, this also allows an
      application to request that pages be migrated, only if necessary,
      to follow any arbitrary policy that might currently apply to a
      range of pages, without knowing the policy, or without specifying
      multiple mbind()s for ranges with different policies.
      
      [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
      Bug-Reported-by: NReported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      d3a71033
    • P
      mm: mempolicy: Make MPOL_LOCAL a real policy · 479e2802
      Peter Zijlstra 提交于
      Make MPOL_LOCAL a real and exposed policy such that applications that
      relied on the previous default behaviour can explicitly request it.
      Requested-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      479e2802
    • M
      mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      Mel Gorman 提交于
      The pgmigrate_success and pgmigrate_fail vmstat counters tells the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      7b2a2d4a
  5. 07 12月, 2012 1 次提交
    • M
      tmpfs: fix shared mempolicy leak · 18a2f371
      Mel Gorman 提交于
      This fixes a regression in 3.7-rc, which has since gone into stable.
      
      Commit 00442ad0 ("mempolicy: fix a memory corruption by refcount
      imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
      refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
      on expecting alloc_page_vma() to drop the refcount it had acquired.
      This deserves a rework: but for now fix the leak in shmem_alloc_page().
      
      Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
      the same refcounting there as in shmem_alloc_page(), delete its onstack
      mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
      those were invented to let swapin_readahead() make an unknown number of
      calls to alloc_pages_vma() with one mempolicy; but since 00442ad0,
      alloc_pages_vma() has kept refcount in balance, so now no problem.
      Reported-and-tested-by: NTommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18a2f371
  6. 17 10月, 2012 1 次提交
    • D
      mm, mempolicy: fix printing stack contents in numa_maps · 32f8516a
      David Rientjes 提交于
      When reading /proc/pid/numa_maps, it's possible to return the contents of
      the stack where the mempolicy string should be printed if the policy gets
      freed from beneath us.
      
      This happens because mpol_to_str() may return an error the
      stack-allocated buffer is then printed without ever being stored.
      
      There are two possible error conditions in mpol_to_str():
      
       - if the buffer allocated is insufficient for the string to be stored,
         and
      
       - if the mempolicy has an invalid mode.
      
      The first error condition is not triggered in any of the callers to
      mpol_to_str(): at least 50 bytes is always allocated on the stack and this
      is sufficient for the string to be written.  A future patch should convert
      this into BUILD_BUG_ON() since we know the maximum strlen possible, but
      that's not -rc material.
      
      The second error condition is possible if a race occurs in dropping a
      reference to a task's mempolicy causing it to be freed during the read().
      The slab poison value is then used for the mode and mpol_to_str() returns
      -EINVAL.
      
      This race is only possible because get_vma_policy() believes that
      mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
      does not hold mm->mmap_sem when dropping the reference or setting
      task->mempolicy to NULL: it uses task_lock(task) instead.
      
      Thus, it's required for the caller of a task mempolicy to hold
      task_lock(task) while grabbing the mempolicy and reading it.  Callers with
      a vma policy store their mempolicy earlier and can simply increment the
      reference count so it's guaranteed not to be freed.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32f8516a
  7. 09 10月, 2012 6 次提交
    • M
      mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range") · 08270807
      Minchan Kim 提交于
      Revert commit 0def08e3 because check_range can't fail in
      migrate_to_node with considering current usecases.
      
      Quote from Johannes
      
      : I think it makes sense to revert.  Not because of the semantics, but I
      : just don't see how check_range() could even fail for this callsite:
      :
      : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
      :    find_vma()
      :
      : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
      :    and so can not fail
      :
      : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
      :    continue until addr == end, so we never fail with -EIO
      
      And I added a new VM_BUG_ON for checking migrate_to_node's future usecase
      which might pass to MPOL_MF_STRICT.
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vasiliy Kulikov <segooon@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08270807
    • M
      mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma() · 00442ad0
      Mel Gorman 提交于
      Commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory barrier
      related damage v3") introduced a potential memory corruption.
      shmem_alloc_page() uses a pseudo vma and it has one significant unique
      combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.
      
      get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
      and mpol_cond_put() DOES decrease a policy ref when a policy has
      MPOL_F_SHARED.  Therefore, when a cpuset update race occurs,
      alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the
      reference count and frees the policy prematurely.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00442ad0
    • K
      mempolicy: fix refcount leak in mpol_set_shared_policy() · 63f74ca2
      KOSAKI Motohiro 提交于
      When shared_policy_replace() fails to allocate new->policy is not freed
      correctly by mpol_set_shared_policy().  The problem is that shared
      mempolicy code directly call kmem_cache_free() in multiple places where
      it is easy to make a mistake.
      
      This patch creates an sp_free wrapper function and uses it. The bug was
      introduced pre-git age (IOW, before 2.6.12-rc2).
      
      [mgorman@suse.de: Editted changelog]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63f74ca2
    • M
      mempolicy: fix a race in shared_policy_replace() · b22d127a
      Mel Gorman 提交于
      shared_policy_replace() use of sp_alloc() is unsafe.  1) sp_node cannot
      be dereferenced if sp->lock is not held and 2) another thread can modify
      sp_node between spin_unlock for allocating a new sp node and next
      spin_lock.  The bug was introduced before 2.6.12-rc2.
      
      Kosaki's original patch for this problem was to allocate an sp node and
      policy within shared_policy_replace and initialise it when the lock is
      reacquired.  I was not keen on this approach because it partially
      duplicates sp_alloc().  As the paths were sp->lock is taken are not that
      performance critical this patch converts sp->lock to sp->mutex so it can
      sleep when calling sp_alloc().
      
      [kosaki.motohiro@jp.fujitsu.com: Original patch]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b22d127a
    • K
      mempolicy: remove mempolicy sharing · 869833f2
      KOSAKI Motohiro 提交于
      Dave Jones' system call fuzz testing tool "trinity" triggered the
      following bug error with slab debugging enabled
      
          =============================================================================
          BUG numa_policy (Not tainted): Poison overwritten
          -----------------------------------------------------------------------------
      
          INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
          INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
           __slab_alloc+0x3d3/0x445
           kmem_cache_alloc+0x29d/0x2b0
           mpol_new+0xa3/0x140
           sys_mbind+0x142/0x620
           system_call_fastpath+0x16/0x1b
      
          INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
           __slab_free+0x2e/0x1de
           kmem_cache_free+0x25a/0x260
           __mpol_put+0x27/0x30
           remove_vma+0x68/0x90
           exit_mmap+0x118/0x140
           mmput+0x73/0x110
           exit_mm+0x108/0x130
           do_exit+0x162/0xb90
           do_group_exit+0x4f/0xc0
           sys_exit_group+0x17/0x20
           system_call_fastpath+0x16/0x1b
      
          INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
          INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0
      
      The problem is that the structure is being prematurely freed due to a
      reference count imbalance. In the following case mbind(addr, len) should
      replace the memory policies of both vma1 and vma2 and thus they will
      become to share the same mempolicy and the new mempolicy will have the
      MPOL_F_SHARED flag.
      
        +-------------------+-------------------+
        |     vma1          |     vma2(shmem)   |
        +-------------------+-------------------+
        |                                       |
       addr                                 addr+len
      
      alloc_pages_vma() uses get_vma_policy() and mpol_cond_put() pair for
      maintaining the mempolicy reference count.  The current rule is that
      get_vma_policy() only increments refcount for shmem VMA and
      mpol_conf_put() only decrements refcount if the policy has
      MPOL_F_SHARED.
      
      In above case, vma1 is not shmem vma and vma->policy has MPOL_F_SHARED!
      The reference count will be decreased even though was not increased
      whenever alloc_page_vma() is called.  This has been broken since commit
      [52cd3b07: mempolicy: rework mempolicy Reference Counting] in 2008.
      
      There is another serious bug with the sharing of memory policies.
      Currently, mempolicy rebind logic (it is called from cpuset rebinding)
      ignores a refcount of mempolicy and override it forcibly.  Thus, any
      mempolicy sharing may cause mempolicy corruption.  The bug was
      introduced by commit [68860ec1: cpusets: automatic numa mempolicy
      rebinding].
      
      Ideally, the shared policy handling would be rewritten to either
      properly handle COW of the policy structures or at least reference count
      MPOL_F_SHARED based exclusively on information within the policy.
      However, this patch takes the easier approach of disabling any policy
      sharing between VMAs.  Each new range allocated with sp_alloc will
      allocate a new policy, set the reference count to 1 and drop the
      reference count of the old policy.  This increases the memory footprint
      but is not expected to be a major problem as mbind() is unlikely to be
      used for fine-grained ranges.  It is also inefficient because it means
      we allocate a new policy even in cases where mbind_range() could use the
      new_policy passed to it.  However, it is more straight-forward and the
      change should be invisible to the user.
      
      [mgorman@suse.de: Edited changelog]
      Reported-by: Dave Jones <davej@redhat.com>,
      Cc: Christoph Lameter <cl@linux.com>,
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      869833f2
    • K
      revert "mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages" · 8d34694c
      KOSAKI Motohiro 提交于
      Commit 05f144a0 ("mm: mempolicy: Let vma_merge and vma_split handle
      vma->vm_policy linkages") removed vma->vm_policy updates code but it is
      the purpose of mbind_range().  Now, mbind_range() is virtually a no-op
      and while it does not allow memory corruption it is not the right fix.
      This patch is a revert.
      
      [mgorman@suse.de: Edited changelog]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d34694c
  8. 07 9月, 2012 1 次提交
  9. 21 6月, 2012 1 次提交
  10. 20 6月, 2012 1 次提交
  11. 30 5月, 2012 3 次提交
    • A
      mm: do_migrate_pages(): rename arguments · 0ce72d4f
      Andrew Morton 提交于
      s/from_nodes/from and s/to_nodes/to/.  The "_nodes" is redundant - it
      duplicates the argument's type.
      
      Done in a fit of irritation over 80-col issues :(
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <mkosaki@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ce72d4f
    • L
      mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct node · 4a5b18cc
      Larry Woodman 提交于
      While running an application that moves tasks from one cpuset to another
      I noticed that it takes much longer and moves many more pages than
      expected.
      
      The reason for this is do_migrate_pages() does its best to preserve the
      relative node differential from the first node of the cpuset because the
      application may have been written with that in mind.  If memory was
      interleaved on the nodes of the source cpuset by an application
      do_migrate_pages() will try its best to maintain that interleaving on
      the nodes of the destination cpuset.  This means copying the memory from
      all source nodes to the destination nodes even if the source and
      destination nodes overlap.
      
      This is a problem for userspace NUMA placement tools.  The amount of
      time spent doing extra memory moves cancels out some of the NUMA
      performance improvements.  Furthermore, if the number of source and
      destination nodes are to maintain the previous interleaving layout
      anyway.
      
      This patch changes do_migrate_pages() to only preserve the relative
      layout inside the program if the number of NUMA nodes in the source and
      destination mask are the same.  If the number is different, we do a much
      more efficient migration by not touching memory that is in an allowed
      node.
      
      This preserves the old behaviour for programs that want it, while
      allowing a userspace NUMA placement tool to use the new, faster
      migration.  This improves performance in our tests by up to a factor of
      7.
      
      Without this change migrating tasks from a cpuset containing nodes 0-7
      to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if
      they are in the both the source and destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 4 to 3
         Migrating 1 to 4
         Migrating 3 to 4
         Migrating 0 to 3
         Migrating 2 to 3
      
      With this change we only migrate from nodes that are not in the
      destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 2 to 3
         Migrating 1 to 4
         Migrating 0 to 3
      
      Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
      containing 3,4,5 we still do move everything so that we preserve the
      desired NUMA offsets:
      
         Migrating 4 to 5
         Migrating 3 to 4
         Migrating 2 to 3
      
      As far as performance is concerned this simple patch improves the time
      it takes to move 14, 20 and 26 large tasks from a cpuset containing
      nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
      Here are the timings with and without the patch:
      
      BEFORE PATCH -- Move times: 59, 140, 651 seconds
      ============
      
        Moving 14 tasks from nodes (0-7) to nodes (1,3)
        numad(8780) do_migrate_pages (mm=0xffff88081d414400
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 8890 moved to node(s) 1,3 in 59.2 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
        numad(8780) do_migrate_pages (mm=0xffff88081d88c700
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 8962 moved to node(s) 1,4-5 in 139.88 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
        numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 9058 moved to node(s) 1-3,5 in 651.45 seconds
      
      AFTER PATCH -- Move times: 42, 56, 93 seconds
      ===========
      
        Moving 14 tasks from nodes (0-7) to nodes (5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 33221 moved to node(s) 5,7 in 41.67 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
        numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 33289 moved to node(s) 1,3,5 in 56.3 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d924400
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds
      
      [akpm@linux-foundation.org: clean up comment layout]
      Signed-off-by: NLarry Woodman <lwoodman@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a5b18cc
    • W
      mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy() · 89c522c7
      Wang Sheng-Hui 提交于
      We have enum definition in mempolicy.h: MPOL_REBIND_ONCE.  It should
      replace the magic number 0 for step comparison in function
      mpol_rebind_policy.
      Signed-off-by: NWang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89c522c7
  12. 24 5月, 2012 1 次提交
    • M
      mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages · 05f144a0
      Mel Gorman 提交于
      Dave Jones' system call fuzz testing tool "trinity" triggered the
      following bug error with slab debugging enabled
      
          =============================================================================
          BUG numa_policy (Not tainted): Poison overwritten
          -----------------------------------------------------------------------------
      
          INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
          INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
           __slab_alloc+0x3d3/0x445
           kmem_cache_alloc+0x29d/0x2b0
           mpol_new+0xa3/0x140
           sys_mbind+0x142/0x620
           system_call_fastpath+0x16/0x1b
          INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
           __slab_free+0x2e/0x1de
           kmem_cache_free+0x25a/0x260
           __mpol_put+0x27/0x30
           remove_vma+0x68/0x90
           exit_mmap+0x118/0x140
           mmput+0x73/0x110
           exit_mm+0x108/0x130
           do_exit+0x162/0xb90
           do_group_exit+0x4f/0xc0
           sys_exit_group+0x17/0x20
           system_call_fastpath+0x16/0x1b
          INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
          INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0
      
      This implied a reference counting bug and the problem happened during
      mbind().
      
      mbind() applies a new memory policy to a range and uses mbind_range() to
      merge existing VMAs or split them as necessary.  In the event of splits,
      mpol_dup() will allocate a new struct mempolicy and maintain existing
      reference counts whose rules are documented in
      Documentation/vm/numa_memory_policy.txt .
      
      The problem occurs with shared memory policies.  The vm_op->set_policy
      increments the reference count if necessary and split_vma() and
      vma_merge() have already handled the existing reference counts.
      However, policy_vma() screws it up by replacing an existing
      vma->vm_policy with one that potentially has the wrong reference count
      leading to a premature free.  This patch removes the damage caused by
      policy_vma().
      
      With this patch applied Dave's trinity tool runs an mbind test for 5
      minutes without error.  /proc/slabinfo reported that there are no
      numa_policy or shared_policy_node objects allocated after the test
      completed and the shared memory region was deleted.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05f144a0
  13. 16 5月, 2012 1 次提交
  14. 26 4月, 2012 1 次提交
    • S
      mm: fix NULL ptr dereference in migrate_pages · f2a9ef88
      Sasha Levin 提交于
      Commit 3268c63e ("mm: fix move/migrate_pages() race on task struct") has
      added an odd construct where 'mm' is checked for being NULL, and if it is,
      it would get dereferenced anyways by mput()ing it.
      
      This would lead to the following NULL ptr deref and BUG() when calling
      migrate_pages() with a pid that has no mm struct:
      
      [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
      [25904.194235] IP: [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0
      [25904.194235] Oops: 0002 [#1] PREEMPT SMP
      [25904.194235] CPU 2
      [25904.194235] Pid: 31608, comm: trinity Tainted: G        W    3.4.0-rc2-next-20120412-sasha #69
      [25904.194235] RIP: 0010:[<ffffffff810b0de7>]  [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235] RSP: 0018:ffff880077d49e08  EFLAGS: 00010202
      [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000
      [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286
      [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001
      [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010
      [25904.194235] FS:  00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000
      [25904.194235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0
      [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000)
      [25904.194235] Stack:
      [25904.194235]  ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020
      [25904.194235]  ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000
      [25904.194235]  00000000000003ff 0000000000000000 0000000000000000 0000000000000000
      [25904.194235] Call Trace:
      [25904.194235]  [<ffffffff811b8020>] sys_migrate_pages+0x340/0x3a0
      [25904.194235]  [<ffffffff811b7d91>] ? sys_migrate_pages+0xb1/0x3a0
      [25904.194235]  [<ffffffff8266cbb9>] system_call_fastpath+0x16/0x1b
      [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 <f0> ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1
      [25904.194235] RIP  [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235]  RSP <ffff880077d49e08>
      [25904.194235] CR2: 0000000000000050
      [25904.348999] ---[ end trace a307b3ed40206b4b ]---
      Signed-off-by: NSasha Levin <levinsasha928@gmail.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2a9ef88
  15. 22 3月, 2012 3 次提交
    • M
      cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Mel Gorman 提交于
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating at large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
    • C
      mm: fix move/migrate_pages() race on task struct · 3268c63e
      Christoph Lameter 提交于
      Migration functions perform the rcu_read_unlock too early.  As a result
      the task pointed to may change from under us.  This can result in an oops,
      as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.
      
      The following patch extend the period of the rcu_read_lock until after the
      permissions checks are done.  We also take a refcount so that the task
      reference is stable when calling security check functions and performing
      cpuset node validation (which takes a mutex).
      
      The refcount is dropped before actual page migration occurs so there is no
      change to the refcounts held during page migration.
      
      Also move the determination of the mm of the task struct to immediately
      before the do_migrate*() calls so that it is clear that we switch from
      handling the task during permission checks to the mm for the actual
      migration.  Since the determination is only done once and we then no
      longer use the task_struct we can be sure that we operate on a specific
      address space that will not change from under us.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Reported-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3268c63e
    • A
      mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Andrea Arcangeli 提交于
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem hold in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
      false positive from pmd_bad() that will not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables to go away from under them, during code review it
      seems vm86 mode on 32bit kernels requires that too unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd become a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: NLarry Woodman <lwoodman@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  16. 07 3月, 2012 1 次提交
    • L
      vm: avoid using find_vma_prev() unnecessarily · 097d5910
      Linus Torvalds 提交于
      Several users of "find_vma_prev()" were not in fact interested in the
      previous vma if there was no primary vma to be found either.  And in
      those cases, we're much better off just using the regular "find_vma()",
      and then "prev" can be looked up by just checking vma->vm_prev.
      
      The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
      commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return
      prev by reference" means that it generates worse code too.
      
      Thus this "let's avoid using this inconvenient and clearly too subtle
      interface when we don't really have to" patch.
      
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      097d5910
  17. 13 1月, 2012 1 次提交
    • M
      mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Mel Gorman 提交于
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
  18. 11 1月, 2012 1 次提交