1. 11 8月, 2010 7 次提交
  2. 10 8月, 2010 3 次提交
    • K
      memcg: add mm_vmscan_memcg_isolate tracepoint · cc8e970c
      KOSAKI Motohiro 提交于
      Memcg also need to trace page isolation information as global reclaim.
      This patch does it.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc8e970c
    • D
      oom: badness heuristic rewrite · a63d83f4
      David Rientjes 提交于
      This a complete rewrite of the oom killer's badness() heuristic which is
      used to determine which task to kill in oom conditions.  The goal is to
      make it as simple and predictable as possible so the results are better
      understood and we end up killing the task which will lead to the most
      memory freeing while still respecting the fine-tuning from userspace.
      
      Instead of basing the heuristic on mm->total_vm for each task, the task's
      rss and swap space is used instead.  This is a better indication of the
      amount of memory that will be freeable if the oom killed task is chosen
      and subsequently exits.  This helps specifically in cases where KDE or
      GNOME is chosen for oom kill on desktop systems instead of a memory
      hogging task.
      
      The baseline for the heuristic is a proportion of memory that each task is
      currently using in memory plus swap compared to the amount of "allowable"
      memory.  "Allowable," in this sense, means the system-wide resources for
      unconstrained oom conditions, the set of mempolicy nodes, the mems
      attached to current's cpuset, or a memory controller's limit.  The
      proportion is given on a scale of 0 (never kill) to 1000 (always kill),
      roughly meaning that if a task has a badness() score of 500 that the task
      consumes approximately 50% of allowable memory resident in RAM or in swap
      space.
      
      The proportion is always relative to the amount of "allowable" memory and
      not the total amount of RAM systemwide so that mempolicies and cpusets may
      operate in isolation; they shall not need to know the true size of the
      machine on which they are running if they are bound to a specific set of
      nodes or mems, respectively.
      
      Root tasks are given 3% extra memory just like __vm_enough_memory()
      provides in LSMs.  In the event of two tasks consuming similar amounts of
      memory, it is generally better to save root's task.
      
      Because of the change in the badness() heuristic's baseline, it is also
      necessary to introduce a new user interface to tune it.  It's not possible
      to redefine the meaning of /proc/pid/oom_adj with a new scale since the
      ABI cannot be changed for backward compatability.  Instead, a new tunable,
      /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
      be used to polarize the heuristic such that certain tasks are never
      considered for oom kill while others may always be considered.  The value
      is added directly into the badness() score so a value of -500, for
      example, means to discount 50% of its memory consumption in comparison to
      other tasks either on the system, bound to the mempolicy, in the cpuset,
      or sharing the same memory controller.
      
      /proc/pid/oom_adj is changed so that its meaning is rescaled into the
      units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
      these per-task tunables will rescale the value of the other to an
      equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
      a bitshift on the badness score, it now shares the same linear growth as
      /proc/pid/oom_score_adj but with different granularity.  This is required
      so the ABI is not broken with userspace applications and allows oom_adj to
      be deprecated for future removal.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a63d83f4
    • K
      vmscan: kill prev_priority completely · 25edde03
      KOSAKI Motohiro 提交于
      Since 2.6.28 zone->prev_priority is unused. Then it can be removed
      safely. It reduce stack usage slightly.
      
      Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
      can be integrate again, it's useful. but four (or more) times trying
      haven't got good performance number. Thus I give up such approach.
      
      The rest of this changelog is notes on prev_priority and why it existed in
      the first place and why it might be not necessary any more. This information
      is based heavily on discussions between Andrew Morton, Rik van Riel and
      Kosaki Motohiro who is heavily quotes from.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list reducing the liklihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority prev_priority wouldn't works
      nowadays. First of all, current vmscan still a lot of UP centric code. it
      expose some weakness on some dozens CPUs machine. I think we need more and
      more improvement.
      
      The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
      and per-task-pressure a bit. example, prev_priority try to boost priority to
      other concurrent priority. but if the another task have mempolicy restriction,
      it is unnecessary, but also makes wrong big latency and exceeding reclaim.
      per-task based priority + prev_priority adjustment make the emulation of
      per-system pressure. but it have two issue 1) too rough and brutal emulation
      2) we need per-zone pressure, not per-system.
      
      Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
      2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
      but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
      system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
      prev_priority can't solve such multithreads workload issue. In other word,
      prev_priority concept assume the sysmtem don't have lots threads."
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25edde03
  3. 30 6月, 2010 1 次提交
  4. 28 5月, 2010 10 次提交
  5. 12 5月, 2010 2 次提交
    • K
      memcg: fix css_is_ancestor() RCU locking · 747388d7
      KAMEZAWA Hiroyuki 提交于
      Some callers (in memcontrol.c) calls css_is_ancestor() without
      rcu_read_lock.  Because css_is_ancestor() has to access RCU protected
      data, it should be under rcu_read_lock().
      
      This makes css_is_ancestor() itself does safe access to RCU protected
      area.  (At least, "root" can have refcnt==0 if it's not an ancestor of
      "child".  So, we need rcu_read_lock().)
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747388d7
    • K
      memcg: fix css_id() RCU locking for real · 7f0f1546
      KAMEZAWA Hiroyuki 提交于
      Commit ad4ba375 ("memcg: css_id() must be
      called under rcu_read_lock()") modifies memcontol.c for fixing RCU check
      message.  But Andrew Morton pointed out that the fix doesn't seems sane
      and it was just for hidining lockdep messages.
      
      This is a patch for do proper things.  Checking again, all places,
      accessing without rcu_read_lock, that commit fixies was intentional....
      all callers of css_id() has reference count on it.  So, it's not necessary
      to be under rcu_read_lock().
      
      Considering again, we can use rcu_dereference_check for css_id().  We know
      css->id is valid if css->refcnt > 0.  (css->id never changes and freed
      after css->refcnt going to be 0.)
      
      This patch makes use of rcu_dereference_check() in css_id/depth and remove
      unnecessary rcu-read-lock added by the commit.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f0f1546
  6. 05 5月, 2010 1 次提交
  7. 25 4月, 2010 1 次提交
  8. 07 4月, 2010 1 次提交
  9. 25 3月, 2010 2 次提交
  10. 15 3月, 2010 1 次提交
  11. 13 3月, 2010 11 次提交
    • K
      memcg: fix oom kill behavior · 867578cb
      KAMEZAWA Hiroyuki 提交于
      In current page-fault code,
      
      	handle_mm_fault()
      		-> ...
      		-> mem_cgroup_charge()
      		-> map page or handle error.
      	-> check return code.
      
      If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
      called.  But if it's caused by memcg, OOM should have been already
      invoked.
      
      Then, I added a patch: a636b327.  That
      patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
      page_fault_out_of_memory from being invoked in near future.
      
      But Nishimura-san reported that check by jiffies is not enough when the
      system is terribly heavy.
      
      This patch changes memcg's oom logic as.
       * If memcg causes OOM-kill, continue to retry.
       * remove jiffies check which is used now.
       * add memcg-oom-lock which works like perzone oom lock.
       * If current is killed(as a process), bypass charge.
      
      Something more sophisticated can be added but this pactch does
      fundamental things.
      TODO:
       - add oom notifier
       - add permemcg disable-oom-kill flag and freezer at oom.
       - more chances for wake up oom waiter (when changing memory limit etc..)
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      867578cb
    • K
      cgroups: remove events before destroying subsystem state objects · a0a4db54
      Kirill A. Shutemov 提交于
      Events should be removed after rmdir of cgroup directory, but before
      destroying subsystem state objects.  Let's take reference to cgroup
      directory dentry to do that.
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0a4db54
    • K
      memcg : share event counter rather than duplicate · d2265e6f
      KAMEZAWA Hiroyuki 提交于
      Memcg has 2 eventcountes which counts "the same" event.  Just usages are
      different from each other.  This patch tries to reduce event counter.
      
      Now logic uses "only increment, no reset" counter and masks for each
      checks.  Softlimit chesk was done per 1000 evetns.  So, the similar check
      can be done by !(new_counter & 0x3ff).  Threshold check was done per 100
      events.  So, the similar check can be done by (!new_counter & 0x7f)
      
      ALL event checks are done right after EVENT percpu counter is updated.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2265e6f
    • K
      memcg: update threshold and softlimit at commit · 430e4863
      KAMEZAWA Hiroyuki 提交于
      Presently, move_task does "batched" precharge.  Because res_counter or
      css's refcnt are not-scalable jobs for memcg, try_charge_()..  tend to be
      done in batched manner if allowed.
      
      Now, softlimit and threshold check their event counter in try_charge, but
      the charge is not a per-page event.  And event counter is not updated at
      charge().  Moreover, precharge doesn't pass "page" to try_charge() and
      softlimit tree will be never updated until uncharge() causes an event."
      
      So the best place to check the event counter is commit_charge().  This is
      per-page event by its nature.  This patch move checks to there.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      430e4863
    • K
      memcg: use generic percpu instead of private implementation · c62b1a3b
      KAMEZAWA Hiroyuki 提交于
      When per-cpu counter for memcg was implemneted, dynamic percpu allocator
      was not very good.  But now, we have good one and useful macros.  This
      patch replaces memcg's private percpu counter implementation with generic
      dynamic percpu allocator.
      
      The benefits are
      	- We can remove private implementation.
      	- The counters will be NUMA-aware. (Current one is not...)
      	- This patch makes sizeof struct mem_cgroup smaller. Then,
      	  struct mem_cgroup may be fit in page size on small config.
              - About basic performance aspects, see below.
      
       [Before]
       # size mm/memcontrol.o
         text    data     bss     dec     hex filename
        24373    2528    4132   31033    7939 mm/memcontrol.o
      
       [page-fault-throuput test on 8cpu/SMP in root cgroup]
       # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8
      
       Performance counter stats for './multi-fault-fork 8' (5 runs):
      
             45878618  page-faults                ( +-   0.110% )
            602635826  cache-misses               ( +-   0.105% )
      
         61.005373262  seconds time elapsed   ( +-   0.004% )
      
       Then cache-miss/page fault = 13.14
      
       [After]
       #size mm/memcontrol.o
         text    data     bss     dec     hex filename
        23913    2528    4132   30573    776d mm/memcontrol.o
       # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8
      
       Performance counter stats for './multi-fault-fork 8' (5 runs):
      
             48179400  page-faults                ( +-   0.271% )
            588628407  cache-misses               ( +-   0.136% )
      
         61.004615021  seconds time elapsed   ( +-   0.004% )
      
        Then cache-miss/page fault = 12.22
      
       Text size is reduced.
       This performance improvement is not big and will be invisible in real world
       applications. But this result shows this patch has some good effect even
       on (small) SMP.
      
      Here is a test program I used.
      
       1. fork() processes on each cpus.
       2. do page fault repeatedly on each process.
       3. after 60secs, kill all childredn and exit.
      
      (3 is necessary for getting stable data, this is improvement from previous one.)
      
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sched.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <signal.h>
      #include <stdlib.h>
      
      /*
       * For avoiding contention in page table lock, FAULT area is
       * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
       */
      #define FAULT_LENGTH	(2 * 1024 * 1024)
      #define PAGE_SIZE	4096
      #define MAXNUM		(128)
      
      void alarm_handler(int sig)
      {
      }
      
      void *worker(int cpu, int ppid)
      {
      	void *start, *end;
      	char *c;
      	cpu_set_t set;
      	int i;
      
      	CPU_ZERO(&set);
      	CPU_SET(cpu, &set);
      	sched_setaffinity(0, sizeof(set), &set);
      
      	start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
      			MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	if (start == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end = start + FAULT_LENGTH;
      
      	pause();
      	//fprintf(stderr, "run%d", cpu);
      	while (1) {
      		for (c = (char*)start; (void *)c < end; c += PAGE_SIZE)
      			*c = 0;
      		madvise(start, FAULT_LENGTH, MADV_DONTNEED);
      	}
      	return NULL;
      }
      
      int main(int argc, char *argv[])
      {
      	int num, i, ret, pid, status;
      	int pids[MAXNUM];
      
      	if (argc < 2)
      		return 0;
      
      	setpgid(0, 0);
      	signal(SIGALRM, alarm_handler);
      	num = atoi(argv[1]);
      	pid = getpid();
      
      	for (i = 0; i < num; ++i) {
      		ret = fork();
      		if (!ret) {
      			worker(i, pid);
      			exit(0);
      		}
      		pids[i] = ret;
      	}
      	sleep(1);
      	kill(-pid, SIGALRM);
      	sleep(60);
      	for (i = 0; i < num; i++)
      		kill(pids[i], SIGKILL);
      	for (i = 0; i < num; i++)
      		waitpid(pids[i], &status, 0);
      	return 0;
      }
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c62b1a3b
    • K
      memcg: typo in comment to mem_cgroup_print_oom_info() · 6a6135b6
      Kirill A. Shutemov 提交于
      s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a6135b6
    • K
      memcg: implement memory thresholds · 2e72b634
      Kirill A. Shutemov 提交于
      It allows to register multiple memory and memsw thresholds and gets
      notifications when it crosses.
      
      To register a threshold application need:
      - create an eventfd;
      - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
      - write string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
        cgroup.event_control.
      
      Application will be notified through eventfd when memory usage crosses
      threshold in any direction.
      
      It's applicable for root and non-root cgroup.
      
      It uses stats to track memory usage, simmilar to soft limits. It checks
      if we need to send event to userspace on every 100 page in/out. I guess
      it's good compromise between performance and accuracy of thresholds.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e72b634
    • K
      memcg: rework usage of stats by soft limit · 378ce724
      Kirill A. Shutemov 提交于
      Instead of incrementing counter on each page in/out and comparing it with
      constant, we set counter to constant, decrement counter on each page
      in/out and compare it with zero.  We want to make comparing as fast as
      possible.  On many RISC systems (probably not only RISC) comparing with
      zero is more effective than comparing with a constant, since not every
      constant can be immediate operand for compare instruction.
      
      Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
      since really it's not a generic counter.
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      378ce724
    • K
      memcg: extract mem_group_usage() from mem_cgroup_read() · 104f3928
      Kirill A. Shutemov 提交于
      Helper to get memory or mem+swap usage of the cgroup.
      Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      104f3928
    • D
      memcg: improve performance in moving swap charge · 483c30b5
      Daisuke Nishimura 提交于
      Try to reduce overheads in moving swap charge by:
      
      - Adds a new function(__mem_cgroup_put), which takes "count" as a arg and
        decrement mem->refcnt by "count".
      - Removed res_counter_uncharge, css_put, and mem_cgroup_put from the path
        of moving swap account, and consolidate all of them into mem_cgroup_clear_mc.
        We cannot do that about mc.to->refcnt.
      
      These changes reduces the overhead from 1.35sec to 0.9sec to move charges
      of 1G anonymous memory(including 500MB swap) in my test environment.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      483c30b5
    • D
      memcg: move charges of anonymous swap · 02491447
      Daisuke Nishimura 提交于
      This patch is another core part of this move-charge-at-task-migration
      feature.  It enables moving charges of anonymous swaps.
      
      To move the charge of swap, we need to exchange swap_cgroup's record.
      
      In current implementation, swap_cgroup's record is protected by:
      
        - page lock: if the entry is on swap cache.
        - swap_lock: if the entry is not on swap cache.
      
      This works well in usual swap-in/out activity.
      
      But this behavior make the feature of moving swap charge check many
      conditions to exchange swap_cgroup's record safely.
      
      So I changed modification of swap_cgroup's recored(swap_cgroup_record())
      to use xchg, and define a new function to cmpxchg swap_cgroup's record.
      
      This patch also enables moving charge of non pte_present but not uncharged
      swap caches, which can be exist on swap-out path, by getting the target
      pages via find_get_page() as do_mincore() does.
      
      [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
      [akpm@linux-foundation.org: fix typos]
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02491447