1. 28 Jan, 2014 1 commit
  2. 24 Jan, 2014 1 commit
    • mm/mm_init.c: make creation of the mm_kobj happen earlier than device_initcall · da29bd36
      Committed by Paul Gortmaker
      The use of __initcall is to be eventually replaced by choosing one from
      the prioritized groupings laid out in init.h header:
      
      	pure_initcall               0
      	core_initcall               1
      	postcore_initcall           2
      	arch_initcall               3
      	subsys_initcall             4
      	fs_initcall                 5
      	device_initcall             6
      	late_initcall               7
      
      In the interim, all __initcall are mapped onto device_initcall, which as
      can be seen above, comes quite late in the ordering.
      
      Currently the mm_kobj is created with __initcall in mm_sysfs_init().
      This means that any other initcalls that want to reference the mm_kobj
      have to be device_initcall (or later); otherwise we will, for example,
      trip the BUG_ON(!kobj) in sysfs's internal_create_group().  This
      unfairly restricts those users; for example something that clearly makes
      sense to be an arch_initcall will not be able to choose that.
      
      However, upon examination, it is only this way for historical reasons
      (i.e. simply not reprioritized yet).  We see that sysfs is ready much
      earlier, in init/main.c, via:
      
       vfs_caches_init
       |_ mnt_init
          |_ sysfs_init
      
      well ahead of the processing of the prioritized calls listed above.
      
      So we can recategorize mm_sysfs_init to be a pure_initcall, which in
      turn allows any mm_kobj initcall users a wider range (1 --> 7) of
      initcall priorities to choose from.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 09 Oct, 2013 2 commits
    • mm: numa: Change page last {nid,pid} into {cpu,pid} · 90572890
      Committed by Peter Zijlstra
      Change the per page last fault tracking to use cpu,pid instead of
      nid,pid. This will allow us to try and look up the alternate task more
      easily. Note that even though it is the cpu that is stored in the page
      flags, the mpol_misplaced decision is still based on the node.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
      [ Fixed build failure on 32-bit systems. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Set preferred NUMA node based on number of private faults · b795854b
      Committed by Mel Gorman
      Ideally it would be possible to distinguish between NUMA hinting faults that
      are private to a task and those that are shared. If treated identically
      there is a risk that shared pages bounce between nodes depending on
      the order they are referenced by tasks. Ultimately what is desirable is
      that task private pages remain local to the task while shared pages are
      interleaved between sharing tasks running on different nodes to give good
      average performance. This is further complicated by THP as even
      applications that partition their data may not be partitioning on a huge
      page boundary.
      
      To start with, this patch assumes that multi-threaded or multi-process
      applications partition their data and that private accesses are
      generally more important for cpu->memory locality. Also,
      no new infrastructure is required to treat private pages properly but
      interleaving for shared pages requires additional infrastructure.
      
      To detect private accesses the pid of the last accessing task is required
      but the storage requirements are high. This patch borrows heavily from
      Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
      to encode some bits from the last accessing task in the page flags as
      well as the node information. Collisions will occur but it is better than
      just depending on the node information. Node information is then used to
      determine if a page needs to migrate. The PID information is used to detect
      private/shared accesses. The preferred NUMA node is selected based on where
      the maximum number of approximately private faults were measured. Shared
      faults are not taken into consideration for a few reasons.
      
      First, if there are many tasks sharing the page then they'll all move
      towards the same node. The node will be compute overloaded and then
      scheduled away later only to bounce back again. Alternatively the shared
      tasks would just bounce around nodes because the fault information is
      effectively noise. Either way accounting for shared faults the same as
      private faults can result in lower performance overall.
      
      The second reason is based on a hypothetical workload that has a small
      number of very important, heavily accessed private pages but a large shared
      array. The shared array would dominate the number of faults and be selected
      as a preferred node even though it's the wrong decision.
      
      The third reason is that multiple threads in a process will race each
      other to fault the shared page making the fault information unreliable.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Fix compilation error when !NUMA_BALANCING. ]
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 04 Jul, 2013 1 commit
    • mm: tune vm_committed_as percpu_counter batching size · 917d9290
      Committed by Tim Chen
      Currently the per cpu counter's batch size for memory accounting is
      configured as twice the number of cpus in the system.  However, for
      systems with very large memory, it is more appropriate to make it
      proportional to the memory size per cpu in the system.
      
      For example, for an x86_64 system with 64 cpus and 128 GB of memory, the
      batch size is only 2*64 pages (0.5 MB).  So any memory accounting
      changes of more than 0.5MB will overflow the per cpu counter into the
      global counter.  Instead, for the new scheme, the batch size is
      configured to be 0.4% of the memory/cpu = 8MB (128 GB / 64 / 256), which is
      more in line with the memory size.
      
      I've done a repeated brk test of 800KB (from will-it-scale test suite)
      with 80 concurrent processes on a 4 socket Westmere machine with a total
      of 40 cores.  Without the patch, about 80% of cpu is spent on spin-lock
      contention within the vm_committed_as counter.  With the patch, there's
      a 73x speedup on the benchmark and the lock contention drops off almost
      entirely.
      
      [akpm@linux-foundation.org: fix section mismatch]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 24 Feb, 2013 1 commit
    • mm: init: report on last-nid information stored in page->flags · a4e1b4c6
      Committed by Mel Gorman
      Answering the question "how much space remains in the page->flags" is
      time-consuming.  mminit_loglevel can help answer the question but it
      does not take last_nid information into account.  This patch corrects it
      and while there it corrects the messages related to page flag usage,
      pgshifts and node/zone id.  When applied the relevant output looks
      something like this but will depend on the kernel configuration.
      
        mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
        mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
        mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
        mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
        mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 31 Oct, 2011 1 commit
  7. 21 Aug, 2008 1 commit
  8. 06 Aug, 2008 1 commit
  9. 25 Jul, 2008 5 commits