1. 10 October 2014 (4 commits)
    • mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Committed by Oleg Nesterov
      1. vma_policy_mof(task) is simply not safe unless task == current:
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller (see the sketch below).
      
      2. vma cannot be NULL; remove this check and simplify the code.
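      For reference, a minimal sketch of what the simplified helper can end up
      looking like (illustrative only, based on the surrounding mempolicy API of
      this era rather than the literal diff):

          bool vma_policy_mof(struct vm_area_struct *vma)
          {
              struct mempolicy *pol;

              if (vma->vm_ops && vma->vm_ops->get_policy) {
                  bool ret = false;

                  pol = vma->vm_ops->get_policy(vma, vma->vm_start);
                  if (pol && (pol->flags & MPOL_F_MOF))
                      ret = true;
                  mpol_cond_put(pol);
                  return ret;
              }

              /* no NULL check needed: callers always pass a valid vma */
              pol = vma->vm_policy;
              if (!pol)
                  pol = get_task_policy(current); /* always current, so no exit race */

              return pol->flags & MPOL_F_MOF;
          }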
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: sanitize the usage of get_task_policy() · 8d90274b
      Committed by Oleg Nesterov
      Cleanup + preparation. Every user of get_task_policy() calls it
      unconditionally, even if it is not going to use the result.
      
      get_task_policy() is cheap, but this still does not look clean, and the
      code is simpler if get_task_policy() is called only when the result is
      really needed.
      
      Note: I hope this is correct, but it is not clear why vma_policy_mof()
      doesn't fall back to get_task_policy() if ->get_policy() returns NULL.
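      A rough sketch of the calling pattern the cleanup moves toward (an
      illustrative get_vma_policy()-style lookup, not the exact diff):

          struct mempolicy *pol = NULL;

          if (vma) {
              if (vma->vm_ops && vma->vm_ops->get_policy)
                  pol = vma->vm_ops->get_policy(vma, addr);
              else
                  pol = vma->vm_policy;
          }

          /* fall back to the task policy only if nothing more specific was found */
          if (!pol)
              pol = get_task_policy(current);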
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: change get_task_policy() to return default_policy rather than NULL · f15ca78e
      Committed by Oleg Nesterov
      Every caller of get_task_policy() falls back to default_policy if it
      returns NULL. Change get_task_policy() to do this.
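      A minimal sketch of the new behaviour (simplified; the real function also
      special-cases the early-boot preferred-node policies):

          struct mempolicy *get_task_policy(struct task_struct *p)
          {
              struct mempolicy *pol = p->mempolicy;

              if (pol)
                  return pol;

              /* callers no longer need their own "?: &default_policy" fallback */
              return &default_policy;
          }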
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mempolicy: change alloc_pages_vma() to use mpol_cond_put() · 2386740d
      Committed by Oleg Nesterov
      Trivial cleanup. alloc_pages_vma() can use mpol_cond_put().
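      The shape of the cleanup, roughly (an illustrative before/after, not the
      literal diff):

          /* before: open-coded conditional drop of the policy reference */
          if (unlikely(mpol_needs_cond_ref(pol)))
              __mpol_put(pol);

          /* after: the existing helper expresses the same thing */
          mpol_cond_put(pol);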
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2. 25 June 2014 (1 commit)
    • cpuset,mempolicy: fix sleeping function called from invalid context · 391acf97
      Committed by Gu Zheng
      When running kernel 3.15-rc7+, the following bug occurs:
      [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
      [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
      [ 9969.441175] INFO: lockdep is turned off.
      [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
      [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
      [ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
      [ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
      [ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
      [ 9969.974071] Call Trace:
      [ 9970.003403]  [<ffffffff8162f523>] dump_stack+0x4d/0x66
      [ 9970.065074]  [<ffffffff8109995a>] __might_sleep+0xfa/0x130
      [ 9970.130743]  [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
      [ 9970.200638]  [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
      [ 9970.272610]  [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
      [ 9970.344584]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.409282]  [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
      [ 9970.471897]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.536585]  [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
      [ 9970.613763]  [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
      [ 9970.683660]  [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
      [ 9970.759795]  [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
      [ 9970.834885]  [<ffffffff8106a598>] do_fork+0xd8/0x380
      [ 9970.894375]  [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
      [ 9970.969470]  [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
      [ 9971.030011]  [<ffffffff81642009>] stub_clone+0x69/0x90
      [ 9971.091573]  [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
      
      The cause is that cpuset_mems_allowed() tries to take
      mutex_lock(&callback_mutex) under rcu_read_lock (which was held in
      __mpol_dup()). Since the access to the cpuset in cpuset_mems_allowed() is
      already done under rcu_read_lock, we can shrink the rcu_read_lock
      protection region in __mpol_dup() so that it only covers the cpuset access
      in current_cpuset_is_being_rebound(), which avoids this bug.
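      A sketch of the narrowed locking (illustrative; the point is that
      rcu_read_lock() now covers only the cpuset dereference, so the sleeping
      cpuset_mems_allowed() call made from __mpol_dup() is no longer issued in
      atomic context):

          static inline bool current_cpuset_is_being_rebound(void)
          {
              bool ret;

              rcu_read_lock();    /* protects the cpuset pointer only */
              ret = task_cs(current) == cpuset_being_rebound;
              rcu_read_unlock();

              return ret;
          }

          /*
           * __mpol_dup() correspondingly drops its own rcu_read_lock()/unlock()
           * pair around the "if (current_cpuset_is_being_rebound()) ..." block,
           * so the mutex inside cpuset_mems_allowed() is taken in sleepable
           * context.
           */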
      
      This patch is a temporary solution that just addresses the bug
      mentioned above; it cannot fix the long-standing issue of cpuset.mems
      rebinding on fork():
      
      "When the forker's task_struct is duplicated (which includes
       ->mems_allowed) and it races with an update to cpuset_being_rebound
       in update_tasks_nodemask() then the task's mems_allowed doesn't get
       updated. And the child task's mems_allowed can be wrong if the
       cpuset's nodemask changes before the child has been added to the
       cgroup's tasklist."
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable <stable@vger.kernel.org>
3. 24 June 2014 (1 commit)
4. 07 June 2014 (2 commits)
5. 05 June 2014 (4 commits)
6. 08 April 2014 (2 commits)
    • mm, mempolicy: remove per-process flag · f0432d15
      Committed by David Rientjes
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There is no significant performance degradation from checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since the check is marked unlikely().
      
      Running TCP_RR with netperf-2.4.5 through localhost on a 16-CPU machine
      with 64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource, so we should free them up whenever
      possible and make them available. We'll be using this one shortly for memcg
      OOM reserves.
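      Illustratively, the CONFIG_SLAB fast path goes from testing a task flag to
      testing the policy pointer itself (hypothetical helper name, condensed from
      the allocation path):

          static inline bool want_alternate_node_alloc(struct task_struct *tsk)
          {
              /* before: return tsk->flags & PF_MEMPOLICY; */
              return tsk->mempolicy != NULL;  /* after: test the pointer directly */
          }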
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: rename slab_node for clarity · 2a389610
      Committed by David Rientjes
      slab_node() is actually a mempolicy function, so rename it to
      mempolicy_slab_node() to make it clearer that it is used for processes with
      mempolicies.
      
      At the same time, clean up its code by saving numa_mem_id() in a local
      variable (since we require a node with memory, not just any node) and
      remove an obsolete comment that assumed the mempolicy was actually passed
      into the function.
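      An abbreviated sketch of the renamed helper (body condensed; the rename and
      the cached numa_mem_id() local are the point here):

          unsigned int mempolicy_slab_node(void)
          {
              struct mempolicy *policy;
              int node = numa_mem_id();   /* nearest node that actually has memory */

              if (in_interrupt())
                  return node;

              policy = current->mempolicy;
              if (!policy || policy->flags & MPOL_F_LOCAL)
                  return node;

              /* ... otherwise pick a node according to policy->mode ... */
              return node;
          }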
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7. 04 April 2014 (1 commit)
8. 06 March 2014 (1 commit)
9. 31 January 2014 (1 commit)
10. 30 January 2014 (2 commits)
11. 28 January 2014 (2 commits)
12. 24 January 2014 (2 commits)
13. 19 December 2013 (2 commits)
    • mm/mempolicy: fix !vma in new_vma_page() · 11c731e8
      Committed by Wanpeng Li
      The BUG_ON(!vma) assumption was introduced by commit 0bf598d8 ("mbind:
      add BUG_ON(!vma) in new_vma_page()"); however, even if
      
          address = __vma_address(page, vma);
      
      and
      
          vma->start < address < vma->end
      
      page_address_in_vma() may still return -EFAULT because of many other
      conditions in it.  As a result the while loop in new_vma_page() may end
      with vma=NULL.
      
      This patch reverts the commit and also fixes the potential NULL pointer
      dereference reported by Dan (a sketch of the fixed code follows the oops below).
      
         http://marc.info/?l=linux-mm&m=137689530323257&w=2
      
        kernel BUG at mm/mempolicy.c:1204!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU: 3 PID: 7056 Comm: trinity-child3 Not tainted 3.13.0-rc3+ #2
        task: ffff8801ca5295d0 ti: ffff88005ab20000 task.ti: ffff88005ab20000
        RIP: new_vma_page+0x70/0x90
        RSP: 0000:ffff88005ab21db0  EFLAGS: 00010246
        RAX: fffffffffffffff2 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000008040075 RSI: ffff8801c3d74600 RDI: ffffea00079a8b80
        RBP: ffff88005ab21dc8 R08: 0000000000000004 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffff2
        R13: ffffea00079a8b80 R14: 0000000000400000 R15: 0000000000400000
      
        FS:  00007ff49c6f4740(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007ff49c68f994 CR3: 000000005a205000 CR4: 00000000001407e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Stack:
         ffffea00079a8b80 ffffea00079a8bc0 ffffea00079a8ba0 ffff88005ab21e50
         ffffffff811adc7a 0000000000000000 ffff8801ca5295d0 0000000464e224f8
         0000000000000000 0000000000000002 0000000000000000 ffff88020ce75c00
        Call Trace:
          migrate_pages+0x12a/0x850
          SYSC_mbind+0x513/0x6a0
          SyS_mbind+0xe/0x10
          ia32_do_call+0x13/0x13
        Code: 85 c0 75 2f 4c 89 e1 48 89 da 31 f6 bf da 00 02 00 65 44 8b 04 25 08 f7 1c 00 e8 ec fd ff ff 5b 41 5c 41 5d 5d c3 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 4c 89 e6 48 89 df ba 01 00 00 00 e8 48
        RIP  [<ffffffff8119f200>] new_vma_page+0x70/0x90
         RSP <ffff88005ab21db0>
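      A sketch of the fixed tail of new_vma_page() (illustrative fragment; the
      point is that a NULL vma is now handled instead of tripping BUG_ON):

          if (PageHuge(page)) {
              if (vma)
                  return alloc_huge_page_noerr(vma, address, 1);
              return NULL;    /* no matching vma: fail gracefully instead of BUG */
          }
          /* if !vma, alloc_page_vma() falls back to the task or system default policy */
          return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);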
      Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: correct putback method for isolate pages if failed · b0e5fd73
      Committed by Joonsoo Kim
      queue_pages_range() isolates hugetlbfs pages and putback_lru_pages()
      can't handle these.  We should change it to putback_movable_pages().
      
      Naoya said that it is worth going into stable, because it can break the
      in-use hugepage list.
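      The substitution is essentially one call on the putback path (illustrative):

          if (!list_empty(&pagelist))
              /*
               * was: putback_lru_pages(&pagelist);
               * hugetlbfs pages are not on the LRU, so use the helper that
               * also knows how to put back movable and huge pages
               */
              putback_movable_pages(&pagelist);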
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
14. 09 December 2013 (1 commit)
15. 22 November 2013 (1 commit)
16. 15 November 2013 (1 commit)
    • mm, hugetlb: convert hugetlbfs to use split pmd lock · cb900f41
      Committed by Kirill A. Shutemov
      Hugetlb supports multiple page sizes. We use the split lock only at the PMD
      level, not at the PUD level.
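      A sketch of the lock-selection helper this conversion introduces (name and
      shape as best recalled; treat as illustrative):

          static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                                     struct mm_struct *mm, pte_t *pte)
          {
              /* PMD-sized huge pages can use the split PMD page-table lock */
              if (huge_page_size(h) == PMD_SIZE)
                  return pmd_lockptr(mm, (pmd_t *)pte);

              /* PUD-sized pages still serialize on the per-mm page_table_lock */
              return &mm->page_table_lock;
          }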
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
17. 13 November 2013 (2 commits)
18. 09 October 2013 (6 commits)
    • sched/numa: Skip some page migrations after a shared fault · de1c9ce6
      Committed by Rik van Riel
      Shared faults can lead to lots of unnecessary page migrations,
      slowing down the system, and causing private faults to hit the
      per-pgdat migration ratelimit.
      
      This patch adds sysctl numa_balancing_migrate_deferred, which specifies
      how many shared page migrations to skip unconditionally, after each page
      migration that is skipped because it is a shared fault.
      
      This reduces the number of page migrations back and forth in
      shared fault situations. It also gives a strong preference to
      the tasks that are already running where most of the memory is,
      and to moving the other tasks near that memory.
      
      Testing this with a much higher scan rate than the default
      still seems to result in fewer page migrations than before.
      
      Memory seems to be somewhat better consolidated than previously,
      with multi-instance specjbb runs on a 4 node system.
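      Roughly, the deferral is a simple per-task countdown (illustrative sketch;
      field and helper names approximate):

          /* arm the countdown when a shared-fault migration has just been skipped */
          static inline void defer_numa_migrate(struct task_struct *p)
          {
              p->numa_migrate_deferred = sysctl_numa_balancing_migrate_deferred;
          }

          /* on later faults: also skip the next N shared-fault migrations */
          static bool numa_migrate_deferred(struct task_struct *p, int last_cpupid)
          {
              if (cpupid_match_pid(p, last_cpupid))
                  return false;           /* never defer a private fault */

              if (p->numa_migrate_deferred) {
                  p->numa_migrate_deferred--;
                  return true;
              }
              return false;
          }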
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-62-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm: numa: Revert temporarily disabling of NUMA migration · 1e3646ff
      Committed by Rik van Riel
      With the scan rate code working (at least for multi-instance specjbb),
      the large hammer that is "sched: Do not migrate memory immediately after
      switching node" can be replaced with something smarter. Revert temporarily
      migration disabling and all traces of numa_migrate_seq.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-61-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm: numa: Change page last {nid,pid} into {cpu,pid} · 90572890
      Committed by Peter Zijlstra
      Change the per-page last fault tracking to use cpu,pid instead of
      nid,pid. This will allow us to look up the alternate task more
      easily. Note that even though it is the cpu that is stored in the page
      flags, the mpol_misplaced decision is still based on the node.
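      The encoding itself is just a couple of bit fields packed into the page
      flags (illustrative; mask/shift macro names approximate):

          static inline int cpu_pid_to_cpupid(int cpu, int pid)
          {
              return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) |
                     (pid & LAST__PID_MASK);
          }

          static inline int cpupid_to_nid(int cpupid)
          {
              /* placement decisions still want a node; derive it from the cpu */
              return cpu_to_node((cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK);
          }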
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
      [ Fixed build failure on 32-bit systems. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm: numa: Limit NUMA scanning to migrate-on-fault VMAs · fc314724
      Committed by Mel Gorman
      There is a 90% regression observed with a large Oracle performance test
      on a 4 node system. Profiles indicated that the overhead was due to
      contention on sp_lock when looking up shared memory policies. These
      policies do not have the appropriate flags to allow them to be
      automatically balanced so trapping faults on them is pointless. This
      patch skips VMAs that do not have MPOL_F_MOF set.
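      In the NUMA scan loop this amounts to one extra filter per VMA (simplified
      sketch; the helper's signature was later reduced to a single vma argument,
      as seen earlier in this log):

          for (vma = find_vma(mm, start); vma; vma = vma->vm_next) {
              /* skip VMAs we could not, or would not, migrate on fault anyway */
              if (!vma_migratable(vma) || !vma_policy_mof(vma))
                  continue;

              /* ... otherwise change protections to generate hinting faults ... */
          }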
      
      [riel@redhat.com: Initial patch]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-and-tested-by: Joe Mario <jmario@redhat.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Do not migrate memory immediately after switching node · 6fe6b2d6
      Committed by Rik van Riel
      The load balancer can move tasks between nodes and does not take NUMA
      locality into account. With automatic NUMA balancing this may result in the
      task's working set being migrated to the new node. However, as the fault
      buffer will still store faults from the old node, the scheduler may decide to
      reset the preferred node and migrate the task back, resulting in more
      migrations.
      
      The ideal would be for the scheduler not to migrate tasks with a heavy
      memory footprint, but this may result in nodes being overloaded. We could
      also discard the fault information on task migration, but this would still
      cause the task's entire working set to be migrated. This patch simply avoids
      migrating the memory for a short time after a task is migrated.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Set preferred NUMA node based on number of private faults · b795854b
      Committed by Mel Gorman
      Ideally it would be possible to distinguish between NUMA hinting faults that
      are private to a task and those that are shared. If treated identically
      there is a risk that shared pages bounce between nodes depending on
      the order they are referenced by tasks. Ultimately what is desirable is
      that task private pages remain local to the task while shared pages are
      interleaved between sharing tasks running on different nodes to give good
      average performance. This is further complicated by THP as even
      applications that partition their data may not be partitioning on a huge
      page boundary.
      
      To start with, this patch assumes that multi-threaded or multi-process
      applications partition their data and that, in general, the private accesses
      are more important for cpu->memory locality. Also,
      no new infrastructure is required to treat private pages properly, but
      interleaving for shared pages requires additional infrastructure.
      
      To detect private accesses, the pid of the last accessing task is required,
      but the storage requirements are high. This patch borrows heavily from
      Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
      to encode some bits from the last accessing task in the page flags as
      well as the node information. Collisions will occur but it is better than
      just depending on the node information. Node information is then used to
      determine if a page needs to migrate. The PID information is used to detect
      private/shared accesses. The preferred NUMA node is selected based on where
      the maximum number of approximately private faults were measured. Shared
      faults are not taken into consideration for a few reasons.
      
      First, if there are many tasks sharing the page then they'll all move
      towards the same node. The node will be compute overloaded and then
      scheduled away later only to bounce back again. Alternatively the shared
      tasks would just bounce around nodes because the fault information is
      effectively noise. Either way accounting for shared faults the same as
      private faults can result in lower performance overall.
      
      The second reason is based on a hypothetical workload that has a small
      number of very important, heavily accessed private pages but a large shared
      array. The shared array would dominate the number of faults and be selected
      as a preferred node even though it's the wrong decision.
      
      The third reason is that multiple threads in a process will race each
      other to fault the shared page making the fault information unreliable.
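      Conceptually, classifying a fault reduces to comparing the pid bits
      remembered in the page flags with the faulting task's pid (illustrative
      sketch; helper and mask names approximate, and collisions are expected
      because only a few pid bits are stored):

          static bool numa_fault_is_private(struct task_struct *p, int last_pid_bits)
          {
              /* a match most likely means the same task touched the page last */
              return (p->pid & LAST__PID_MASK) == last_pid_bits;
          }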
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Fix compilation error when !NUMA_BALANCING. ]
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
19. 12 September 2013 (4 commits)