1. 20 Jun, 2012 1 commit
  2. 30 May, 2012 3 commits
    • mm: do_migrate_pages(): rename arguments · 0ce72d4f
      Authored by Andrew Morton
      s/from_nodes/from/ and s/to_nodes/to/.  The "_nodes" is redundant - it
      duplicates the argument's type.
      
      Done in a fit of irritation over 80-col issues :(
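
      After the rename, the prototype reads roughly as below (a sketch of the
      resulting signature, not a quote of the merged diff):

        int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
                             const nodemask_t *to, int flags);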
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <mkosaki@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ce72d4f
    • mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct node · 4a5b18cc
      Authored by Larry Woodman
      While running an application that moves tasks from one cpuset to another
      I noticed that it takes much longer and moves many more pages than
      expected.
      
      The reason for this is do_migrate_pages() does its best to preserve the
      relative node differential from the first node of the cpuset because the
      application may have been written with that in mind.  If memory was
      interleaved on the nodes of the source cpuset by an application
      do_migrate_pages() will try its best to maintain that interleaving on
      the nodes of the destination cpuset.  This means copying the memory from
      all source nodes to the destination nodes even if the source and
      destination nodes overlap.
      
      This is a problem for userspace NUMA placement tools.  The amount of
      time spent doing extra memory moves cancels out some of the NUMA
      performance improvements.  Furthermore, if the number of source and
      destination nodes are different, it is not possible to maintain the
      previous interleaving layout anyway.
      
      This patch changes do_migrate_pages() to only preserve the relative
      layout inside the program if the number of NUMA nodes in the source and
      destination mask are the same.  If the number is different, we do a much
      more efficient migration by not touching memory that is in an allowed
      node.
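
      In do_migrate_pages() this amounts to one extra test in the loop over the
      remaining source nodes, roughly as sketched below (a hedged sketch using
      illustrative variable names; the exact guard in the merged patch may be
      phrased slightly differently):

        for_each_node_mask(s, tmp) {
                /*
                 * Only preserve the relative node ordering when the source and
                 * destination masks contain the same number of nodes; otherwise
                 * skip pages that already sit on an allowed destination node.
                 */
                if ((nodes_weight(*from_nodes) != nodes_weight(*to_nodes)) &&
                    (node_isset(s, *to_nodes)))
                        continue;

                /* ... pick a destination node and call migrate_to_node() ... */
        }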
      
      This preserves the old behaviour for programs that want it, while
      allowing a userspace NUMA placement tool to use the new, faster
      migration.  This improves performance in our tests by up to a factor of
      7.
      
      Without this change migrating tasks from a cpuset containing nodes 0-7
      to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if
      they are in both the source and destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 4 to 3
         Migrating 1 to 4
         Migrating 3 to 4
         Migrating 0 to 3
         Migrating 2 to 3
      
      With this change we only migrate from nodes that are not in the
      destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 2 to 3
         Migrating 1 to 4
         Migrating 0 to 3
      
      Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
      containing 3,4,5 we still do move everything so that we preserve the
      desired NUMA offsets:
      
         Migrating 4 to 5
         Migrating 3 to 4
         Migrating 2 to 3
      
      As far as performance is concerned this simple patch improves the time
      it takes to move 14, 20 and 26 large tasks from a cpuset containing
      nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
      Here are the timings with and without the patch:
      
      BEFORE PATCH -- Move times: 59, 140, 651 seconds
      ============
      
        Moving 14 tasks from nodes (0-7) to nodes (1,3)
        numad(8780) do_migrate_pages (mm=0xffff88081d414400
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 8890 moved to node(s) 1,3 in 59.2 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
        numad(8780) do_migrate_pages (mm=0xffff88081d88c700
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 8962 moved to node(s) 1,4-5 in 139.88 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
        numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 9058 moved to node(s) 1-3,5 in 651.45 seconds
      
      AFTER PATCH -- Move times: 42, 56, 93 seconds
      ===========
      
        Moving 14 tasks from nodes (0-7) to nodes (5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 33221 moved to node(s) 5,7 in 41.67 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
        numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 33289 moved to node(s) 1,3,5 in 56.3 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d924400
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds
      
      [akpm@linux-foundation.org: clean up comment layout]
      Signed-off-by: Larry Woodman <lwoodman@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a5b18cc
    • mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy() · 89c522c7
      Authored by Wang Sheng-Hui
      We have enum definition in mempolicy.h: MPOL_REBIND_ONCE.  It should
      replace the magic number 0 for step comparison in function
      mpol_rebind_policy.
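
      A hedged illustration of the shape of the change (not the literal merged
      hunk):

        /* was: ... && step == 0 && ... */
        if (!mpol_store_user_nodemask(pol) && step == MPOL_REBIND_ONCE &&
            nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
                return;
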
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89c522c7
  3. 24 May, 2012 1 commit
    • mm: mempolicy: Let vma_merge and vma_split handle vma->vm_policy linkages · 05f144a0
      Authored by Mel Gorman
      Dave Jones' system call fuzz testing tool "trinity" triggered the
      following bug error with slab debugging enabled
      
          =============================================================================
          BUG numa_policy (Not tainted): Poison overwritten
          -----------------------------------------------------------------------------
      
          INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
          INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
           __slab_alloc+0x3d3/0x445
           kmem_cache_alloc+0x29d/0x2b0
           mpol_new+0xa3/0x140
           sys_mbind+0x142/0x620
           system_call_fastpath+0x16/0x1b
          INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
           __slab_free+0x2e/0x1de
           kmem_cache_free+0x25a/0x260
           __mpol_put+0x27/0x30
           remove_vma+0x68/0x90
           exit_mmap+0x118/0x140
           mmput+0x73/0x110
           exit_mm+0x108/0x130
           do_exit+0x162/0xb90
           do_group_exit+0x4f/0xc0
           sys_exit_group+0x17/0x20
           system_call_fastpath+0x16/0x1b
          INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x          (null) flags=0x20000000004080
          INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0
      
      This implied a reference counting bug and the problem happened during
      mbind().
      
      mbind() applies a new memory policy to a range and uses mbind_range() to
      merge existing VMAs or split them as necessary.  In the event of splits,
      mpol_dup() will allocate a new struct mempolicy and maintain existing
      reference counts whose rules are documented in
      Documentation/vm/numa_memory_policy.txt .
      
      The problem occurs with shared memory policies.  The vm_op->set_policy
      increments the reference count if necessary and split_vma() and
      vma_merge() have already handled the existing reference counts.
      However, policy_vma() screws it up by replacing an existing
      vma->vm_policy with one that potentially has the wrong reference count
      leading to a premature free.  This patch removes the damage caused by
      policy_vma().
      
      With this patch applied Dave's trinity tool runs an mbind test for 5
      minutes without error.  /proc/slabinfo reported that there are no
      numa_policy or shared_policy_node objects allocated after the test
      completed and the shared memory region was deleted.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Stephen Wilson <wilsons@start.ca>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05f144a0
  4. 16 May, 2012 1 commit
  5. 26 Apr, 2012 1 commit
    • mm: fix NULL ptr dereference in migrate_pages · f2a9ef88
      Authored by Sasha Levin
      Commit 3268c63e ("mm: fix move/migrate_pages() race on task struct")
      added an odd construct where 'mm' is checked for being NULL, and if it
      is, it still gets dereferenced anyway by mmput()ing it.
      
      This would lead to the following NULL ptr deref and BUG() when calling
      migrate_pages() with a pid that has no mm struct:
      
      [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
      [25904.194235] IP: [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0
      [25904.194235] Oops: 0002 [#1] PREEMPT SMP
      [25904.194235] CPU 2
      [25904.194235] Pid: 31608, comm: trinity Tainted: G        W    3.4.0-rc2-next-20120412-sasha #69
      [25904.194235] RIP: 0010:[<ffffffff810b0de7>]  [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235] RSP: 0018:ffff880077d49e08  EFLAGS: 00010202
      [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000
      [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286
      [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001
      [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010
      [25904.194235] FS:  00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000
      [25904.194235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0
      [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000)
      [25904.194235] Stack:
      [25904.194235]  ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020
      [25904.194235]  ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000
      [25904.194235]  00000000000003ff 0000000000000000 0000000000000000 0000000000000000
      [25904.194235] Call Trace:
      [25904.194235]  [<ffffffff811b8020>] sys_migrate_pages+0x340/0x3a0
      [25904.194235]  [<ffffffff811b7d91>] ? sys_migrate_pages+0xb1/0x3a0
      [25904.194235]  [<ffffffff8266cbb9>] system_call_fastpath+0x16/0x1b
      [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 <f0> ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1
      [25904.194235] RIP  [<ffffffff810b0de7>] mmput+0x27/0xf0
      [25904.194235]  RSP <ffff880077d49e08>
      [25904.194235] CR2: 0000000000000050
      [25904.348999] ---[ end trace a307b3ed40206b4b ]---
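
      The fix is simply to bail out before mmput() can ever see a NULL mm,
      roughly as below (a hedged sketch of the corrected control flow, with
      variable names abbreviated):

        mm = get_task_mm(task);
        put_task_struct(task);

        if (!mm)
                return -EINVAL;

        err = do_migrate_pages(mm, old, new,
                capable(CAP_SYS_NICE) ? MPOL_MF_MOVE_ALL : MPOL_MF_MOVE);

        mmput(mm);
        return err;
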
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2a9ef88
  6. 22 Mar, 2012 3 commits
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Authored by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
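
      On the read side the fast path then follows the usual seqcount retry
      idiom, roughly as below (a hedged sketch assuming the new per-task
      counter is called mems_allowed_seq; the merged code wraps this in
      get_mems_allowed()/put_mems_allowed() helpers):

        unsigned int cpuset_mems_cookie;

        do {
                cpuset_mems_cookie = read_seqcount_begin(&current->mems_allowed_seq);
                page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);
        } while (!page && read_seqcount_retry(&current->mems_allowed_seq,
                                              cpuset_mems_cookie));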
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
    • mm: fix move/migrate_pages() race on task struct · 3268c63e
      Authored by Christoph Lameter
      Migration functions perform the rcu_read_unlock too early.  As a result
      the task pointed to may change from under us.  This can result in an oops,
      as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.
      
      The following patch extends the period of the rcu_read_lock until after the
      permissions checks are done.  We also take a refcount so that the task
      reference is stable when calling security check functions and performing
      cpuset node validation (which takes a mutex).
      
      The refcount is dropped before actual page migration occurs so there is no
      change to the refcounts held during page migration.
      
      Also move the determination of the mm of the task struct to immediately
      before the do_migrate*() calls so that it is clear that we switch from
      handling the task during permission checks to the mm for the actual
      migration.  Since the determination is only done once and we then no
      longer use the task_struct we can be sure that we operate on a specific
      address space that will not change from under us.
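
      The resulting ordering in the migrate/move_pages syscalls is roughly the
      following (a hedged sketch of the control flow, not the literal diff):

        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        get_task_struct(task);

        /* ... uid/capability checks, still under rcu_read_lock() ... */
        rcu_read_unlock();

        /* security and cpuset checks rely on the task refcount, not RCU */
        err = security_task_movememory(task);

        /* ... */

        mm = get_task_mm(task);
        put_task_struct(task);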
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3268c63e
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Authored by Andrea Arcangeli
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem held in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
      false positive from pmd_bad() that will not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables from going away from under them; during code review it
      seems vm86 mode on 32bit kernels requires that too unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd became a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
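
      The helper that implements this, added to the generic pgtable header as
      pmd_none_or_trans_huge_or_clear_bad(), looks roughly like the sketch
      below (hedged -- the merged version may order the checks slightly
      differently):

        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                pmd_t pmdval = *pmd;    /* snapshot the pmd on the stack */
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                barrier();              /* keep the snapshot stable */
        #endif
                if (pmd_none(pmdval))
                        return 1;
                if (unlikely(pmd_bad(pmdval))) {
                        if (!pmd_trans_huge(pmdval))
                                pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }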
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  7. 07 Mar, 2012 1 commit
    • vm: avoid using find_vma_prev() unnecessarily · 097d5910
      Authored by Linus Torvalds
      Several users of "find_vma_prev()" were not in fact interested in the
      previous vma if there was no primary vma to be found either.  And in
      those cases, we're much better off just using the regular "find_vma()",
      and then "prev" can be looked up by just checking vma->vm_prev.
      
      The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
      commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return
      prev by reference" means that it generates worse code too.
      
      Thus this "let's avoid using this inconvenient and clearly too subtle
      interface when we don't really have to" patch.
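
      The replacement pattern in the converted callers is simply (a hedged
      sketch):

        vma = find_vma(mm, addr);           /* instead of find_vma_prev(mm, addr, &prev) */
        prev = vma ? vma->vm_prev : NULL;   /* recover prev from the vma list if needed */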
      
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      097d5910
  8. 13 Jan, 2012 1 commit
    • mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Authored by Mel Gorman
      This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
      mode that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
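
      The migration modes end up as a small enum, roughly (hedged; see
      include/linux/migrate_mode.h in the merged tree):

        enum migrate_mode {
                MIGRATE_ASYNC,          /* async compaction: never block */
                MIGRATE_SYNC_LIGHT,     /* sync compaction: may block, but no writeback */
                MIGRATE_SYNC,           /* hotplug etc.: fully synchronous migration */
        };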
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
  9. 11 Jan, 2012 1 commit
  10. 30 Dec, 2011 1 commit
    • mm/mempolicy.c: refix mbind_range() vma issue · e26a5114
      Authored by KOSAKI Motohiro
      commit 8aacc9f5 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is a
      slightly incorrect fix.
      
      Why?  Consider the following case.
      
      1. map 4 pages of a file at offset 0
      
         [0123]
      
      2. map 2 pages just after the first mapping of the same file but with
         page offset 2
      
         [0123][23]
      
      3. mbind() 2 pages from the first mapping at offset 2.
         mbind_range() should treat the new vma as
      
         [0123][23]
           |23|
           mbind vma
      
         but it does
      
         [0123][23]
           |01|
           mbind vma
      
         Oops.  It then makes a wrong vma merge and split ([01][0123] or similar).
      
      This patch fixes it.
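
      The fix is to compute the pgoff of the mbound subrange relative to the
      enclosing vma before calling vma_merge(), roughly (a hedged sketch of the
      mbind_range() hunk):

        pgoff = vma->vm_pgoff + ((vmstart - vma->vm_start) >> PAGE_SHIFT);
        prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
                         vma->anon_vma, vma->vm_file, pgoff, new_pol);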
      
      [testcase]
        test result - before the patch
      
      	case4: 126: test failed. expect '2,4', actual '2,2,2'
             	case5: passed
      	case6: passed
      	case7: passed
      	case8: passed
      	case_n: 246: test failed. expect '4,2', actual '1,4'
      
      	------------[ cut here ]------------
      	kernel BUG at mm/filemap.c:135!
      	invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC
      
      	(snip long bug on messages)
      
        test result - after the patch
      
      	case4: passed
             	case5: passed
      	case6: passed
      	case7: passed
      	case8: passed
      	case_n: passed
      
        source:  mbind_vma_test.c
      ============================================================
       #include <numaif.h>
       #include <numa.h>
       #include <sys/mman.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <string.h>
      
      static unsigned long pagesize;
      void* mmap_addr;
      struct bitmask *nmask;
      char buf[1024];
      FILE *file;
      char retbuf[10240] = "";
      int mapped_fd;
      
      char *rubysrc = "ruby -e '\
        pid = %d; \
        vstart = 0x%llx; \
        vend = 0x%llx; \
        s = `pmap -q #{pid}`; \
        rary = []; \
        s.each_line {|line|; \
          ary=line.split(\" \"); \
          addr = ary[0].to_i(16); \
          if(vstart <= addr && addr < vend) then \
            rary.push(ary[1].to_i()/4); \
          end; \
        }; \
        print rary.join(\",\"); \
      '";
      
      void init(void)
      {
      	void* addr;
      	char buf[128];
      
      	nmask = numa_allocate_nodemask();
      	numa_bitmask_setbit(nmask, 0);
      
      	pagesize = getpagesize();
      
      	sprintf(buf, "%s", "mbind_vma_XXXXXX");
      	mapped_fd = mkstemp(buf);
      	if (mapped_fd == -1)
      		perror("mkstemp "), exit(1);
      	unlink(buf);
      
      	if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
      		perror("lseek "), exit(1);
      	if (write(mapped_fd, "\0", 1) < 0)
      		perror("write "), exit(1);
      
      	addr = mmap(NULL, pagesize*8, PROT_NONE,
      		    MAP_SHARED, mapped_fd, 0);
      	if (addr == MAP_FAILED)
      		perror("mmap "), exit(1);
      
      	if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
      		perror("mprotect "), exit(1);
      
      	mmap_addr = addr + pagesize;
      
      	/* make page populate */
      	memset(mmap_addr, 0, pagesize*6);
      }
      
      void fin(void)
      {
      	void* addr = mmap_addr - pagesize;
      	munmap(addr, pagesize*8);
      
      	memset(buf, 0, sizeof(buf));
      	memset(retbuf, 0, sizeof(retbuf));
      }
      
      void mem_bind(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_BIND, nmask->maskp, nmask->size, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void mem_interleave(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void mem_unbind(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_DEFAULT, NULL, 0, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void Assert(char *expected, char *value, char *name, int line)
      {
      	if (strcmp(expected, value) == 0) {
      		fprintf(stderr, "%s: passed\n", name);
      		return;
      	}
      	else {
      		fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
      			name, line,
      			expected, value);
      //		exit(1);
      	}
      }
      
      /*
            AAAA
          PPPPPPNNNNNN
          might become
          PPNNNNNNNNNN
          case 4 below
      */
      void case4(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 4);
      	mem_unbind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("2,4", retbuf, "case4", __LINE__);
      
      	fin();
      }
      
      /*
             AAAA
       PPPPPPNNNNNN
       might become
       PPPPPPPPPPNN
       case 5 below
      */
      void case5(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case5", __LINE__);
      
      	fin();
      }
      
      /*
      	    AAAA
      	PPPPNNNNXXXX
      	might become
      	PPPPPPPPPPPP 6
      */
      void case6(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_bind(4, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("6", retbuf, "case6", __LINE__);
      
      	fin();
      }
      
      /*
          AAAA
      PPPPNNNNXXXX
      might become
      PPPPPPPPXXXX 7
      */
      void case7(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_interleave(4, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case7", __LINE__);
      
      	fin();
      }
      
      /*
          AAAA
      PPPPNNNNXXXX
      might become
      PPPPNNNNNNNN 8
      */
      void case8(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_interleave(4, 2);
      	mem_interleave(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("2,4", retbuf, "case8", __LINE__);
      
      	fin();
      }
      
      void case_n(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	/* make redundant mappings [0][1234][34][7] */
      	mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
      	     MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);
      
      	/* Expect to do nothing. */
      	mem_unbind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case_n", __LINE__);
      
      	fin();
      }
      
      int main(int argc, char** argv)
      {
      	case4();
      	case5();
      	case6();
      	case7();
      	case8();
      	case_n();
      
      	return 0;
      }
      =============================================================
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Caspar Zhang <caspar@casparzhang.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: <stable@vger.kernel.org>		[3.1.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26a5114
  11. 01 Nov, 2011 1 commit
  12. 31 Oct, 2011 1 commit
  13. 15 Sep, 2011 2 commits
    • mm/mempolicy.c: make copy_from_user() provably correct · 2bbff6c7
      Authored by KAMEZAWA Hiroyuki
      When compiling mm/mempolicy.c with strict user copy checks the following
      warning is shown:
      
        In file included from arch/x86/include/asm/uaccess.h:572,
                         from include/linux/uaccess.h:5,
                         from include/linux/highmem.h:7,
                         from include/linux/pagemap.h:10,
                         from include/linux/mempolicy.h:70,
                         from mm/mempolicy.c:68:
        In function `copy_from_user',
            inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
        arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
          LD      mm/built-in.o
      
      Fix this by passing correct buffer size value.
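
      Concretely, the compat path copies into an on-stack DECLARE_BITMAP() and
      must never copy more than the size of that buffer; a hedged sketch of the
      fix:

        unsigned long copy_size;

        copy_size = min_t(unsigned long, sizeof(bm), alloc_size);
        err = copy_from_user(bm, nm, copy_size);
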
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bbff6c7
    • mm/mempolicy.c: fix pgoff in mbind vma merge · 8aacc9f5
      Authored by Caspar Zhang
      commit 9d8cebd4 ("mm: fix mbind vma merge problem") didn't really
      fix the mbind vma merge problem due to a wrong pgoff value being passed to
      vma_merge(), which made vma_merge() always return NULL.
      
      Before the patch was applied, we were getting a result like:
      
        addr = 0x7fa58f00c000
        [snip]
        7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
        7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
        7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0
      
      here 7fa58f00c000->7fa58f00f000 we get 3 VMAs which were expected to be
      merged, as described in commit 9d8cebd4.
      
      Re-testing the patched kernel with the reproducer provided in commit
      9d8cebd4, we get the correct result:
      
        addr = 0x7ffa5aaa2000
        [snip]
        7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
        7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0                          [stack]
      Signed-off-by: Caspar Zhang <caspar@casparzhang.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8aacc9f5
  14. 27 Jul, 2011 1 commit
    • cpusets: randomize node rotor used in cpuset_mem_spread_node() · 778d3b0f
      Authored by Michal Hocko
      [ This patch has already been accepted as commit 0ac0c0d0 but later
        reverted (commit 35926ff5) because it introduced arch specific
        __node_random which was defined only for x86 code so it broke other
        archs.  This is a followup without any arch specific code.  Other than
        that there are no functional changes.]
      
      Some workloads that create a large number of small files tend to assign
      too many pages to node 0 (multi-node systems).  Part of the reason is
      that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
      at node 0 for newly created tasks.
      
      This patch changes the rotor to be initialized to a random node number
      of the cpuset.
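
      A hedged sketch of the idea in cpuset_mem_spread_node(): initialise the
      per-task rotor lazily to a random node of the task's mems_allowed instead
      of always starting the scan at node 0:

        if (current->cpuset_mem_spread_rotor == NUMA_NO_NODE)
                current->cpuset_mem_spread_rotor =
                        node_random(&current->mems_allowed);

        /* ... then advance the rotor through mems_allowed as before ... */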
      
      [akpm@linux-foundation.org: fix layout]
      [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
      [mhocko@suse.cz: Make it arch independent]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Menage <menage@google.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      778d3b0f
  15. 25 May, 2011 6 commits
  16. 23 Mar, 2011 1 commit
  17. 05 Mar, 2011 2 commits
  18. 01 Mar, 2011 1 commit
  19. 26 Feb, 2011 1 commit
  20. 14 Jan, 2011 5 commits
  21. 03 Dec, 2010 1 commit
  22. 29 Oct, 2010 1 commit
    • numa: fix slab_node(MPOL_BIND) · 800416f7
      Authored by Eric Dumazet
      When a node contains only HighMem memory, slab_node(MPOL_BIND)
      dereferences a NULL pointer.
      
      [ This code seems to go back all the way to commit 19770b32: "mm:
        filter based on a nodemask as well as a gfp_mask".  Which was back in
        April 2008, and it got merged into 2.6.26.  - Linus ]
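
      The MPOL_BIND branch of slab_node() picks the first zone of the policy's
      nodemask from the local zonelist, and on a HighMem-only node that lookup
      can come back empty; a hedged sketch of the defensive fix (zonelist and
      highest_zoneidx come from the surrounding slab_node() code):

        (void)first_zones_zonelist(zonelist, highest_zoneidx,
                                   &policy->v.nodes, &zone);
        return zone ? zone->node : numa_node_id();
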
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      800416f7
  23. 27 Oct, 2010 2 commits
  24. 10 Aug, 2010 1 commit