1. 15 Feb, 2008 1 commit
  2. 14 Feb, 2008 2 commits
  3. 12 Feb, 2008 2 commits
    • K
      mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      KOSAKI Motohiro committed
      Kosaki Motohiro noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mems_allowed, for the same reason:
               the node mask contains disallowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                   set_mempolicy().  Dis-allowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that the incoming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if the incoming mask was not empty
            and the resulting mask IS empty, the user specified invalid nodes.
            Return EINVAL.
         d) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
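
The checks proposed in steps 1 and 2 above can be sketched in user-space C. This is only an illustration, not the kernel code: a 64-bit word stands in for nodemask_t, and every name ending in _sketch is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in: one bit per node instead of the kernel's nodemask_t. */
typedef uint64_t nodemask_sketch_t;

enum policy_mode_sketch { POL_DEFAULT, POL_PREFERRED, POL_BIND, POL_INTERLEAVE };

/*
 * Restrict *nodes to the allowed set, then validate it for the given mode.
 * Returns 0 on success or -EINVAL, following steps 1a, 1b and 2a-2c.
 */
static int check_policy_sketch(enum policy_mode_sketch mode,
                               nodemask_sketch_t *nodes,
                               nodemask_sketch_t allowed)
{
    int was_empty = (*nodes == 0);      /* 1a: remember the incoming state */
    int is_empty;

    if (!was_empty)
        *nodes &= allowed;              /* 1b: silently drop disallowed nodes */
    is_empty = (*nodes == 0);

    switch (mode) {
    case POL_DEFAULT:                   /* 2a: incoming mask must be empty */
        return was_empty ? 0 : -EINVAL;
    case POL_BIND:
    case POL_INTERLEAVE:                /* 2b: need at least one usable node */
        return is_empty ? -EINVAL : 0;
    case POL_PREFERRED:                 /* 2c: only invalid nodes were given */
        return (!was_empty && is_empty) ? -EINVAL : 0;
    }
    return -EINVAL;
}
```

Under this sketch, a fully populated mask passed with the interleave mode is reduced to the allowed nodes and accepted, which is the behavior "numactl --interleave=all" needs.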
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by Kosaki Motohiro, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31f1de46
    • J
      Be more robust about bad arguments in get_user_pages() · 900cf086
      Jonathan Corbet committed
      So I spent a while pounding my head against my monitor trying to figure
      out the vmsplice() vulnerability - how could a failure to check for
      *read* access turn into a root exploit? It turns out that it's a buffer
      overflow problem which is made easy by the way get_user_pages() is
      coded.
      
      In particular, "len" is a signed int, and it is only checked at the
      *end* of a do {} while() loop.  So, if it is passed in as zero, the loop
      will execute once and decrement len to -1.  At that point, the loop will
      proceed until the next invalid address is found; in the process, it will
      likely overflow the pages array passed in to get_user_pages().
      
      I think that, if get_user_pages() has been asked to grab zero pages,
      that's what it should do.  Thus this patch; it is, among other things,
      enough to block the (already fixed) root exploit and any others which
      might be lurking in similar code.  I also think that the number of pages
      should be unsigned, but changing the prototype of this function probably
      requires some more careful review.
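
The loop shape described above can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the code of get_user_pages(): the function name, the valid_limit parameter (which models the next invalid address) and the guard switch are all hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical model of the loop: "pin" len pages by writing into pages[].
 * valid_limit models the point where the next invalid address is found.
 * Returns the number of entries written to pages[].
 */
static int pin_pages_sketch(int *pages, int len, int valid_limit, int guard)
{
    int i = 0;

    if (guard && len <= 0)      /* the fix: asked for zero pages, do nothing */
        return 0;

    do {
        if (i >= valid_limit)   /* models hitting an invalid address */
            break;
        pages[i] = 1;
        i++;
        len--;
    } while (len);              /* tested only at the end: 0 has become -1 */

    return i;
}
```

With len == 0 and no guard, the body runs once, len becomes -1, and the loop keeps writing until valid_limit stops it, regardless of how large the caller's array really is. With the guard, nothing is written.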
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      900cf086
  4. 10 Feb, 2008 1 commit
  5. 09 Feb, 2008 8 commits
  6. 08 Feb, 2008 26 commits
    • I
      SLUB: fix checkpatch warnings · 3adbefee
      Ingo Molnar committed
      fix checkpatch --file mm/slub.c errors and warnings.
      
       $ q-code-quality-compare
                                            errors   lines of code   errors/KLOC
       mm/slub.c      [before]                  22            4204           5.2
       mm/slub.c      [after]                    0            4210             0
      
      no code changed:
      
          text    data     bss     dec     hex filename
         22195    8634     136   30965    78f5 slub.o.before
         22195    8634     136   30965    78f5 slub.o.after
      
         md5:
           93cdfbec2d6450622163c590e1064358  slub.o.before.asm
           93cdfbec2d6450622163c590e1064358  slub.o.after.asm
      
      [clameter: rediffed against Pekka's cleanup patch, omitted
      moves of the name of a function to the start of line]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      3adbefee
    • N
      Use non atomic unlock · a76d3546
      Nick Piggin committed
      Slub can use the non-atomic version to unlock because other flags will not
      get modified with the lock held.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a76d3546
    • C
      SLUB: Support for performance statistics · 8ff12cfc
      Christoph Lameter committed
      The statistics provided here allow the monitoring of allocator behavior but
      at the cost of some (minimal) loss of performance. Counters are placed in
      SLUB's per cpu data structure. The per cpu structure may be extended by the
      statistics to grow larger than one cacheline which will increase the cache
      footprint of SLUB.
      
      There is a compile option to enable/disable the inclusion of the runtime
      statistics, and it is off by default.
      
      The slabinfo tool is enhanced to support these statistics via the following options:
      
      -D 	Switches the line of information displayed for a slab from size
      	mode to activity mode.
      
      -A	Sorts the slabs displayed by activity. This allows the display of
      	the slabs most important to the performance of a certain load.
      
      -r	Report option will report detailed statistics on
      
      Example (tbench load):
      
      slabinfo -AD		->Shows the most active slabs
      
      Name                   Objects    Alloc     Free   %Fast
      skbuff_fclone_cache         33 111953835 111953835  99  99
      :0000192                  2666  5283688  5281047  99  99
      :0001024                   849  5247230  5246389  83  83
      vm_area_struct            1349   119642   118355  91  22
      :0004096                    15    66753    66751  98  98
      :0000064                  2067    25297    23383  98  78
      dentry                   10259    28635    18464  91  45
      :0000080                 11004    18950     8089  98  98
      :0000096                  1703    12358    10784  99  98
      :0000128                   762    10582     9875  94  18
      :0000512                   184     9807     9647  95  81
      :0002048                   479     9669     9195  83  65
      anon_vma                   777     9461     9002  99  71
      kmalloc-8                 6492     9981     5624  99  97
      :0000768                   258     7174     6931  58  15
      
      So the skbuff_fclone_cache is of highest importance for the tbench load.
      Pretty high load on the 192 sized slab. Look for the aliases
      
      slabinfo -a | grep 000192
      :0000192     <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP
      	request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili
      
      Likely skbuff_head_cache.
      
      
      Looking into the statistics of the skbuff_fclone_cache is possible through
      
      slabinfo skbuff_fclone_cache	->-r option implied if cache name is mentioned
      
      
      .... Usual output ...
      
      Slab Perf Counter       Alloc     Free %Al %Fr
      --------------------------------------------------
      Fastpath             111953360 111946981  99  99
      Slowpath                 1044     7423   0   0
      Page Alloc                272      264   0   0
      Add partial                25      325   0   0
      Remove partial             86      264   0   0
      RemoteObj/SlabFrozen      350     4832   0   0
      Total                111954404 111954404
      
      Flushes       49 Refill        0
      Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)
      
      Looks good because the fastpath is overwhelmingly taken.
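
The percentage columns in the output above are consistent with truncating integer division. A minimal sketch, assuming that is how they are derived (the helper name is hypothetical and this is not the slabinfo source):

```c
/* Integer percentage, truncated toward zero; returns 0 when total is 0. */
static unsigned long long pct_sketch(unsigned long long part,
                                     unsigned long long total)
{
    return total ? part * 100 / total : 0;
}
```

For the skbuff_fclone_cache figures, pct_sketch(111953360, 111954404) gives 99 for fastpath allocations and pct_sketch(1044, 111954404) gives 0 for the slowpath. The Deactivate line sums to 325 + 0 + 24 + 1 = 350 events, and pct_sketch(325, 350) gives 92, matching Full=325(92%).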
      
      
      skbuff_head_cache:
      
      Slab Perf Counter       Alloc     Free %Al %Fr
      --------------------------------------------------
      Fastpath              5297262  5259882  99  99
      Slowpath                 4477    39586   0   0
      Page Alloc                937      824   0   0
      Add partial                 0     2515   0   0
      Remove partial           1691      824   0   0
      RemoteObj/SlabFrozen     2621     9684   0   0
      Total                 5301739  5299468
      
      Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)
      
      
      Descriptions of the output:
      
      Total:		The total number of allocations and frees that occurred for a
      		slab
      
      Fastpath:	The number of allocations/frees that used the fastpath.
      
      Slowpath:	Other allocations
      
      Page Alloc:	Number of calls to the page allocator as a result of slowpath
      		processing
      
      Add Partial:	Number of slabs added to the partial list through free or
      		alloc (occurs during cpuslab flushes)
      
      Remove Partial:	Number of slabs removed from the partial list as a result of
      		allocations retrieving a partial slab or by a free freeing
      		the last object of a slab.
      
      RemoteObj/Froz:	RemoteObj: how many times remotely freed objects were encountered
      		when a slab was about to be deactivated. Frozen: how many times
      		free was able to skip list processing because the slab was in use
      		as the cpuslab of another processor.
      
      Flushes:	Number of times the cpuslab was flushed on request
      		(kmem_cache_shrink, may result from races in __slab_alloc)
      
      Refill:		Number of times we were able to refill the cpuslab from
      		remotely freed objects for the same slab.
      
      Deactivate:	Statistics on how slabs were deactivated. Shows how they were
      		put onto the partial list.
      
      In general fastpath is very good. Slowpath without partial list processing is
      also desirable. Any touching of partial list uses node specific locks which
      may potentially cause list lock contention.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      8ff12cfc
    • C
      SLUB: Alternate fast paths using cmpxchg_local · 1f84260c
      Christoph Lameter committed
      Provide an alternate implementation of the SLUB fast paths for alloc
      and free using cmpxchg_local. The cmpxchg_local fast path is selected
      for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
      set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
      interrupt enable/disable sequence. This is known to be true for both
      x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.
      
      Currently another requirement for the fastpath is that the kernel is
      compiled without preemption. The restriction will go away with the
      introduction of a new per cpu allocator and new per cpu operations.
      
      The advantages of a cmpxchg_local based fast path are:
      
      1. Potentially lower cycle count (30%-60% faster)
      
      2. There is no need to disable and enable interrupts on the fast path.
         Currently interrupts have to be disabled and enabled on every
         slab operation. This is likely avoiding a significant percentage
         of interrupt off / on sequences in the kernel.
      
      3. The disposal of freed slabs can occur with interrupts enabled.
      
      The alternate path is realized using #ifdef's. Several attempts to do the
      same with macros and inline functions resulted in a mess (in particular due
      to the strange way that local_interrupt_save() handles its argument and due
      to the need to define macros/functions that sometimes disable interrupts
      and sometimes do something else).
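
The idea of a compare-and-exchange fast path can be sketched in user-space C. cmpxchg_local is a kernel primitive and is not used here; C11 atomics stand in for it, and all names ending in _sketch are hypothetical. The sketch ignores preemption, per-cpu data and the ABA problem, so it only illustrates the control flow.

```c
#include <stdatomic.h>
#include <stddef.h>

/* A free object stores the pointer to the next free object in itself. */
struct object_sketch { struct object_sketch *next; };

/* Hypothetical per-cpu structure holding the head of the freelist. */
struct cpu_slab_sketch { _Atomic(struct object_sketch *) freelist; };

/* Pop the first free object without taking a lock or masking interrupts. */
static struct object_sketch *fast_alloc_sketch(struct cpu_slab_sketch *c)
{
    struct object_sketch *old = atomic_load(&c->freelist);

    while (old != NULL) {
        /* On failure, old is reloaded with the current head and we retry. */
        if (atomic_compare_exchange_weak(&c->freelist, &old, old->next))
            return old;
    }
    return NULL;    /* empty: a real allocator would enter its slow path */
}

/* Push an object back onto the freelist with the same primitive. */
static void fast_free_sketch(struct cpu_slab_sketch *c,
                             struct object_sketch *obj)
{
    struct object_sketch *old = atomic_load(&c->freelist);

    do {
        obj->next = old;
    } while (!atomic_compare_exchange_weak(&c->freelist, &old, obj));
}
```

The exchange only succeeds if the freelist head is still the value that was read, so an interruption between the read and the update is detected and the operation is retried instead of corrupting the list.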
      
      [clameter: Stripped preempt bits and disabled fastpath if preempt is enabled]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1f84260c
    • C
      SLUB: Use unique end pointer for each slab page. · 683d0baa
      Christoph Lameter committed
      We use a NULL pointer on freelists to signal that there are no more objects.
      However the NULL pointers of all slabs match in contrast to the pointers to
      the real objects which are in different ranges for different slab pages.
      
      Change the end pointer to be a pointer to the first object and set bit 0.
      Every slab will then have a different end pointer. This is necessary to ensure
      that end markers can be matched to the source slab during cmpxchg_local.
      
      Bring back the use of the mapping field by SLUB since we would otherwise have
      to call a relatively expensive function page_address() in __slab_alloc().  Use
      of the mapping field allows avoiding a call to page_address() in various other
      functions as well.
      
      There is no need to change the page_mapping() function since bit 0 is set on
      the mapping, as it is for anonymous pages.  page_mapping(slab_page) will
      therefore still return NULL although the mapping field is overloaded.
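
The end-marker encoding described above can be sketched with plain pointer arithmetic. The helper names are hypothetical, and the sketch assumes objects are at least 2-byte aligned so that bit 0 of a real object pointer is always clear.

```c
#include <stdint.h>

/* End marker for a slab: the address of its first object with bit 0 set. */
static void *slab_end_sketch(void *first_object)
{
    return (void *)((uintptr_t)first_object | (uintptr_t)1);
}

/* A freelist pointer is an end marker if bit 0 is set. */
static int is_end_sketch(const void *p)
{
    return (int)((uintptr_t)p & (uintptr_t)1);
}

/* Recover the slab's first-object address from its end marker. */
static void *slab_start_from_end_sketch(void *end)
{
    return (void *)((uintptr_t)end & ~(uintptr_t)1);
}
```

Two different slabs now end their freelists with two different values, so a compare-and-exchange on the freelist head cannot confuse the empty list of one slab with the empty list of another.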
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      683d0baa
    • C
      SLUB: Deal with annoying gcc warning on kfree() · 5bb983b0
      Christoph Lameter committed
      gcc 4.2 spits out an annoying warning if one casts a const void *
      pointer to a void * pointer. No warning is generated if the
      conversion is done through an assignment.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      5bb983b0
    • B
      Introduce flags for reserve_bootmem() · 72a7fe39
      Bernhard Walle committed
      This patchset adds a flags variable to reserve_bootmem() and uses the
      BOOTMEM_EXCLUSIVE flag in crashkernel reservation code to detect collisions
      between crashkernel area and already used memory.
      
      This patch:
      
      Change the reserve_bootmem() function to accept a new flag BOOTMEM_EXCLUSIVE.
      If that flag is set, the function returns with -EBUSY if the memory already
      has been reserved in the past.  This is to avoid conflicts.
      
      Because that code runs before SMP initialisation, there's no race condition
      inside reserve_bootmem_core().
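
The exclusive-reservation semantics can be sketched with a small page map. This is an illustration only: the real bootmem allocator uses a bitmap and its own error handling, and the constants and names ending in _SKETCH or _sketch are hypothetical.

```c
#include <errno.h>

#define BOOTMEM_DEFAULT_SKETCH   0
#define BOOTMEM_EXCLUSIVE_SKETCH 1
#define NPAGES_SKETCH            64

/* One byte per page; non-zero means the page is already reserved. */
static unsigned char reserved_sketch[NPAGES_SKETCH];

/*
 * Reserve count pages starting at page index start.
 * With the exclusive flag, fail with -EBUSY if any page is already reserved.
 */
static int reserve_sketch(unsigned int start, unsigned int count, int flags)
{
    unsigned int i;

    if (start > NPAGES_SKETCH || count > NPAGES_SKETCH - start)
        return -EINVAL;

    if (flags & BOOTMEM_EXCLUSIVE_SKETCH) {
        /* Check first so a failed call leaves the map untouched. */
        for (i = start; i < start + count; i++)
            if (reserved_sketch[i])
                return -EBUSY;
    }

    for (i = start; i < start + count; i++)
        reserved_sketch[i] = 1;
    return 0;
}
```

A caller such as the crashkernel reservation can then treat -EBUSY as a collision with memory that is already in use. No locking appears in the sketch because, as noted above, the real code runs before SMP initialisation.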
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix powerpc build]
      Signed-off-by: Bernhard Walle <bwalle@suse.de>
      Cc: <linux-arch@vger.kernel.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72a7fe39
    • B
      Memory controller remove control_type feature · 3c541e14
      Balbir Singh committed
      Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt
      that control_type might not be a good thing to implement right away.  We
      can add this flexibility at a later point when required.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c541e14
    • K
      per-zone and reclaim enhancements for memory controller: per-zone-lock for cgroup · 072c56c1
      KAMEZAWA Hiroyuki committed
      Now, lru is per-zone.
      
      Then, lru_lock can be (should be) per-zone, too.
      This patch implements per-zone lru lock.
      
      lru_lock is placed into mem_cgroup_per_zone struct.
      
      lock can be accessed by
         mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
         &mz->lru_lock
      
         or
         mz = page_cgroup_zoneinfo(page_cgroup);
         &mz->lru_lock
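
The access pattern above can be sketched in user-space C. The struct layout, the array sizes and every name ending in _sketch are assumptions for illustration; an atomic_flag spin loop stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <string.h>

#define MAX_NODES_SKETCH 2
#define MAX_ZONES_SKETCH 3

/* Hypothetical per-zone data: each zone carries its own lock and counters. */
struct per_zone_sketch {
    atomic_flag lru_lock;
    unsigned long nr_active;
    unsigned long nr_inactive;
};

struct mem_cgroup_sketch {
    struct per_zone_sketch zoneinfo[MAX_NODES_SKETCH][MAX_ZONES_SKETCH];
};

static void mem_cgroup_init_sketch(struct mem_cgroup_sketch *mem)
{
    int n, z;

    memset(mem, 0, sizeof(*mem));
    for (n = 0; n < MAX_NODES_SKETCH; n++)
        for (z = 0; z < MAX_ZONES_SKETCH; z++)
            atomic_flag_clear(&mem->zoneinfo[n][z].lru_lock);
}

/* Look up the per-zone data for a (node, zone) pair. */
static struct per_zone_sketch *zoneinfo_sketch(struct mem_cgroup_sketch *mem,
                                               int node, int zone)
{
    return &mem->zoneinfo[node][zone];
}

/* Update one zone's counter while holding only that zone's lock. */
static void add_active_sketch(struct mem_cgroup_sketch *mem,
                              int node, int zone, unsigned long n)
{
    struct per_zone_sketch *mz = zoneinfo_sketch(mem, node, zone);

    while (atomic_flag_test_and_set(&mz->lru_lock))
        ;                       /* spin until this zone's lock is free */
    mz->nr_active += n;
    atomic_flag_clear(&mz->lru_lock);
}
```

Because the lock lives in the per-zone structure, updates to different zones of the same cgroup do not contend with each other.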
      Signed-off-by: KAMEZAWA hiroyuki <kmaezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      072c56c1
    • K
      per-zone and reclaim enhancements for memory controller: per zone lru for cgroup · 1ecaab2b
      KAMEZAWA Hiroyuki committed
      This patch implements per-zone lru for memory cgroup.
      This patch makes use of mem_cgroup_per_zone struct for per zone lru.
      
      LRU can be accessed by
      
         mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
         &mz->active_list
         &mz->inactive_list
      
         or
         mz = page_cgroup_zoneinfo(page_cgroup);
         &mz->active_list
         &mz->inactive_list
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ecaab2b
    • K
      per-zone and reclaim enhancements for memory controller: modifies vmscan.c for... · 1cfb419b
      KAMEZAWA Hiroyuki committed
      per-zone and reclaim enhancements for memory controller: modifies vmscan.c to isolate global/cgroup lru activity
      
      When using memory controller, there are 2 levels of memory reclaim.
       1. zone memory reclaim because of system/zone memory shortage.
       2. memory cgroup memory reclaim because of hitting limit.
      
      These two can be distinguished by sc->mem_cgroup parameter.
      (scan_global_lru() macro)
      
      This patch tries to keep the memory cgroup reclaim routine from affecting
      system/zone memory reclaim. It inserts if (scan_global_lru()) checks and
      hooks to the memory cgroup reclaim support functions.
      
      This patch helps isolate system lru activity from group lru
      activity and shows what additional functions are necessary.
      
       * mem_cgroup_calc_mapped_ratio() ... calculate mapped ratio for cgroup.
       * mem_cgroup_reclaim_imbalance() ... calculate active/inactive balance in
                                              cgroup.
       * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages to
                                      be scanned in this priority in mem_cgroup.
      
       * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive pages
                                      to be scanned in this priority in mem_cgroup.
      
       * mem_cgroup_all_unreclaimable() ... checks whether all of the cgroup's
                                           pages are unreclaimable.
       * mem_cgroup_get_reclaim_priority() ...
       * mem_cgroup_note_reclaim_priority() ... record reclaim priority (temporal)
       * mem_cgroup_remember_reclaim_priority()
                                   .... record reclaim priority as
                                        zone->prev_priority.
                                        This value is used for calc reclaim_mapped.
      
      [akpm@linux-foundation.org: fix unused var warning]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1cfb419b
    • K
      per-zone and reclaim enhancements for memory controller: calculate the number... · cc38108e
      KAMEZAWA Hiroyuki committed
      per-zone and reclaim enhancements for memory controller: calculate the number of pages to be scanned per cgroup
      
      Define a function for calculating the number of pages to scan on each zone/LRU.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc38108e
    • K
      per-zone and reclaim enhancements for memory controller: remember reclaim priority in memory cgroup · 6c48a1d0
      KAMEZAWA Hiroyuki committed
      Functions to remember reclaim priority per cgroup (as zone->prev_priority)
      
      [akpm@linux-foundation.org: build fixes]
      [akpm@linux-foundation.org: more build fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c48a1d0
    • K
      per-zone and reclaim enhancements for memory controller: calculate... · 5932f367
      KAMEZAWA Hiroyuki committed
      per-zone and reclaim enhancements for memory controller: calculate active/inactive imbalance per cgroup
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5932f367
    • K
      per-zone and reclaim enhancements for memory controller: calculate mapped_ratio per cgroup · 58ae83db
      KAMEZAWA Hiroyuki committed
      Define a function for calculating mapped_ratio in memory cgroup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58ae83db
    • K
      per-zone and reclaim enhancements for memory controller: per-zone active inactive counter · 6d12e2d8
      KAMEZAWA Hiroyuki committed
      This patch adds per-zone status in memory cgroup.  These values are often read
      (as per-zone value) by page reclaiming.
      
      In the current design, a per-zone stat is just an unsigned long value and not
      an atomic value because it is modified only under lru_lock.  (So, atomic_ops
      are not necessary.)
      
      This patch adds ACTIVE and INACTIVE per-zone status values.
      
      For handling per-zone status, this patch adds
        struct mem_cgroup_per_zone {
      		...
        }
      and some helper functions. This will be useful to add per-zone objects
      in mem_cgroup.
      
      This patch sets the memory controller's early_init to 0 so that kmalloc()
      can be called during initialization.
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d12e2d8
    • K
      per-zone and reclaim enhancements for memory controller: nid/zid helper function for cgroup · c0149530
      KAMEZAWA Hiroyuki committed
      Add macro to get node_id and zone_id of page_cgroup.  Will be used in
      per-zone-xxx patches and others.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0149530
    • K
      per-zone and reclaim enhancements for memory controller: add scan_global_lru macro · 91a45470
      KAMEZAWA Hiroyuki committed
      This is used to detect whether a scan_control scans the global lru or a mem_cgroup lru.
      It compiles to a static value (1) when the memory controller is not configured.
      This may make the meaning obvious.
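
A macro of this kind could look roughly like the following. It is a sketch under assumptions, not the definition from the patch: the configuration symbol, the struct and the macro name are hypothetical stand-ins.

```c
#include <stddef.h>

/* Comment this out to model a kernel built without the memory controller. */
#define CONFIG_MEM_CGROUP_SKETCH 1

struct memcg_opaque_sketch;     /* hypothetical opaque cgroup type */

struct scan_control_sketch {
    /* NULL when reclaim was triggered by system/zone memory shortage. */
    struct memcg_opaque_sketch *mem_cgroup;
};

#ifdef CONFIG_MEM_CGROUP_SKETCH
/* Global reclaim if no cgroup is attached to this scan. */
#define scan_global_lru_sketch(sc) (!(sc)->mem_cgroup)
#else
/* Without the controller, every scan is a global scan. */
#define scan_global_lru_sketch(sc) (1)
#endif
```

When the second branch is selected, the compiler sees a constant, so any if (scan_global_lru_sketch(sc)) test and its cgroup-only alternative can be removed at compile time.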
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91a45470
    • K
      memory cgroup enhancements: implicit force_empty() at rmdir · df878fb0
      KAMEZAWA Hiroyuki committed
      Add pre_destroy handler for mem_cgroup and try to make mem_cgroup empty at
      rmdir().
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memory cgroup enhancements: add memory.stat file · d2ceb9b7
      KAMEZAWA Hiroyuki committed
      Expose the memory cgroup's accounting information through a memory.stat file.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memory cgroup enhancements: add status accounting function for memory cgroup · d52aa412
      KAMEZAWA Hiroyuki committed
      Add statistics accounting infrastructure for the memory controller.  All
      accounting information is stored per-cpu, so callers do not have to take a
      lock or use atomic ops.  This will be used by the memory.stat file later.
      
      CACHE includes swapcache for now. I'd like to split it into
      PAGECACHE and SWAPCACHE later.
      
      This patch adds 3 functions for accounting:
       * __mem_cgroup_stat_add() ... for the usual case.
       * __mem_cgroup_stat_add_safe() ... for calling from an irq-disabled section.
       * mem_cgroup_read_stat() ... for reading a stat value.
      It also renames PAGECACHE to CACHE (because it may include swapcache *now*).
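The per-cpu scheme described above can be modelled in user space as one counter array per CPU slot: writers touch only their own slot, and readers sum across all slots. This is a sketch under stated assumptions, not the patch's code: `NR_CPUS_SKETCH`, the `STAT_*` indices, and the function names are hypothetical, and the caller passes the CPU number explicitly where the kernel would determine it itself.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4   /* hypothetical fixed CPU count for the model */

/* Hypothetical statistic indices; RSS is an assumed second counter. */
enum stat_index_sketch { STAT_CACHE, STAT_RSS, STAT_NSTATS };

/* One private set of counters per CPU. */
struct stat_cpu_sketch { long count[STAT_NSTATS]; };

struct cgroup_stat_sketch {
    struct stat_cpu_sketch cpustat[NR_CPUS_SKETCH];
};

/* Writer: updates only the slot owned by 'cpu', so no lock or atomic op is
 * needed as long as each CPU updates only its own slot. */
static void stat_add(struct cgroup_stat_sketch *s, int cpu,
                     enum stat_index_sketch idx, long val)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    s->cpustat[cpu].count[idx] += val;
}

/* Reader: folds every per-cpu slot into one total. */
static long stat_read(const struct cgroup_stat_sketch *s,
                      enum stat_index_sketch idx)
{
    long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += s->cpustat[cpu].count[idx];
    return sum;
}
```

The trade-off is that a read is more expensive than a single shared counter and may observe a slightly stale total, which is usually acceptable for a statistics file.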
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible]
      [akpm@linux-foundation.org: uninline things]
      [akpm@linux-foundation.org: remove dead code]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memory cgroup enhancements: remember "a page is on active list of cgroup or not" · 3564c7c4
      KAMEZAWA Hiroyuki committed
      Record in page_cgroup->flags whether the page_cgroup is on the active_list.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      memcgroup: fix hang with shmem/tmpfs · 82369553
      Hugh Dickins committed
      The memcgroup regime relies upon a cgroup reclaiming pages from itself within
      add_to_page_cache: which may involve some waiting.  Whereas shmem and tmpfs
      rely upon using add_to_page_cache while holding a spinlock: when it cannot
      wait.  The consequence is that when a cgroup reaches its limit, shmem_getpage
      just hangs - unless there is outside memory pressure too, neither kswapd nor
      radix_tree_preload get it out of the retry loop.
      
      In most cases we can mem_cgroup_cache_charge the page waitably first, to
      attach the page_cgroup in advance, so add_to_page_cache will do no more than
      increment a count; then mem_cgroup_uncharge_page after (in both success and
      failure cases) to balance the books again.
      
      And where there used to be a congestion_wait for kswapd (recently made
      redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
      to go through a cycle of allocation and freeing, without accounting to any
      particular page, and without updating the statistics vector.  This brings the
      cgroup below its limit so the next try usually succeeds.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      memcgroup: tidy up mem_cgroup_charge_common · 3be91277
      Hugh Dickins committed
      Tidy up mem_cgroup_charge_common before extending it.  Adjust some comments,
      but mainly clean up its loop: I've an aversion to loops full of continues,
      then a break or a goto at the bottom.  And the is_atomic test should be on the
      __GFP_WAIT bit, not GFP_ATOMIC bits.
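The point about the is_atomic test can be illustrated with a small model. The bit values and `SK_*` names below are assumptions chosen for the sketch rather than the kernel's definitions; the only property that matters is that the "atomic" mask is a priority bit, while a separate "wait" bit says whether the caller may sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits (assumed values, not copied from the kernel). */
#define SK_GFP_WAIT   0x10u  /* caller may sleep */
#define SK_GFP_HIGH   0x20u  /* caller may dip into emergency reserves */
#define SK_GFP_IO     0x40u
#define SK_GFP_FS     0x80u
#define SK_GFP_ATOMIC (SK_GFP_HIGH)
#define SK_GFP_KERNEL (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

/* Questionable test: asks "is the high-priority bit set?", which says
 * nothing about whether the caller is allowed to sleep. */
static bool is_atomic_by_high_bits(unsigned int gfp_mask)
{
    return (gfp_mask & SK_GFP_ATOMIC) != 0;
}

/* Test on the wait bit: a caller is atomic exactly when it may not sleep. */
static bool is_atomic_by_wait_bit(unsigned int gfp_mask)
{
    return (gfp_mask & SK_GFP_WAIT) == 0;
}

/* The two tests agree on the common masks but disagree on the edge cases. */
static void sketch_selfcheck(void)
{
    assert(is_atomic_by_high_bits(SK_GFP_ATOMIC));
    assert(is_atomic_by_wait_bit(SK_GFP_ATOMIC));
    assert(!is_atomic_by_high_bits(SK_GFP_KERNEL));
    assert(!is_atomic_by_wait_bit(SK_GFP_KERNEL));
    /* No wait bit and no high bit: cannot sleep, yet the old test says "fine". */
    assert(is_atomic_by_wait_bit(0u) && !is_atomic_by_high_bits(0u));
}
```

In this model a mask with neither bit set is the dangerous case: the bit-mask test would let a non-sleeping caller enter a path that waits.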
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • B
      Memory controller use rcu_read_lock() in mem_cgroup_cache_charge() · ac44d354
      Balbir Singh committed
      Hugh Dickins noticed that we were using rcu_dereference() without
      rcu_read_lock() in the cache charging routine. This patch fixes
      the problem.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memory cgroup enhancements: remember "a page is charged as page cache" · 217bc319
      KAMEZAWA Hiroyuki committed
      Add a flag to page_cgroup to remember that "this page is charged as cache."
      "Cache" here includes both page cache and swap cache.
      This is useful for implementing precise accounting in the memory cgroup.
      TODO:
        distinguish page-cache and swap-cache
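One way to see why the flag helps with precise accounting: if the charge type is recorded at charge time, the uncharge path can decrement the matching counter without guessing. The sketch below is a user-space model with hypothetical names (`PCG_SKETCH_CACHE`, `page_cgroup_sketch`, `counters_sketch`); the separate non-cache counter is an assumption made for the example.

```c
#include <assert.h>

/* Hypothetical flag bit: "this page was charged as cache". */
#define PCG_SKETCH_CACHE 0x1u

/* Hypothetical per-page tracking record with a flags word. */
struct page_cgroup_sketch { unsigned int flags; };

/* Hypothetical per-cgroup counters; 'other' is an assumed non-cache bucket. */
struct counters_sketch { long cache; long other; };

/* Charge: remember in the flags how the page was accounted. */
static void charge(struct counters_sketch *c, struct page_cgroup_sketch *pc,
                   int as_cache)
{
    pc->flags = as_cache ? PCG_SKETCH_CACHE : 0u;
    if (as_cache)
        c->cache++;
    else
        c->other++;
}

/* Uncharge: the remembered flag selects the counter to decrement. */
static void uncharge(struct counters_sketch *c,
                     const struct page_cgroup_sketch *pc)
{
    if (pc->flags & PCG_SKETCH_CACHE)
        c->cache--;
    else
        c->other--;
    assert(c->cache >= 0 && c->other >= 0);
}
```

Without the remembered flag, the uncharge path would have to infer the charge type from the page's current state, which may have changed since the charge.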
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>