1. 25 Mar, 2010 1 commit
  2. 13 Mar, 2010 15 commits
    • K
      memcg: fix oom kill behavior · 867578cb
      KAMEZAWA Hiroyuki committed
      In current page-fault code,
      
      	handle_mm_fault()
      		-> ...
      		-> mem_cgroup_charge()
      		-> map page or handle error.
      	-> check return code.
      
      If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
      called.  But if it's caused by memcg, OOM should have been already
      invoked.
      
      Then, I added a patch: a636b327.  That
      patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
      page_fault_out_of_memory from being invoked in near future.
      
      But Nishimura-san reported that the jiffies check is not enough when the
      system is under very heavy load.
      
      This patch changes memcg's oom logic as follows:
       * If memcg causes OOM-kill, continue to retry.
       * Remove the jiffies check which is used now.
       * Add memcg-oom-lock, which works like the per-zone oom lock.
       * If current is killed (as a process), bypass charge.
      
      Something more sophisticated can be added, but this patch does the
      fundamental things.
      TODO:
       - add oom notifier
       - add per-memcg disable-oom-kill flag and freezer at oom.
       - more chances to wake up oom waiters (when changing memory limit etc.)
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      cgroups: remove events before destroying subsystem state objects · a0a4db54
      Kirill A. Shutemov committed
      Events should be removed after rmdir of the cgroup directory, but before
      destroying subsystem state objects.  Let's take a reference to the cgroup
      directory dentry to do that.
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg : share event counter rather than duplicate · d2265e6f
      KAMEZAWA Hiroyuki committed
      Memcg has 2 event counters which count "the same" event.  Only their usages
      differ.  This patch tries to reduce them to one event counter.
      
      The logic now uses an "only increment, no reset" counter and a mask for each
      check.  The softlimit check was done per 1000 events, so a similar check
      can be done by !(new_counter & 0x3ff).  The threshold check was done per 100
      events, so a similar check can be done by !(new_counter & 0x7f).
      
      All event checks are done right after the EVENT percpu counter is updated.
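A minimal sketch of the mask-based check described above, written as plain userspace C rather than kernel code. The helper names (`event_due`, `count_fires`) and the macro names are assumptions made for illustration; only the mask values 0x3ff and 0x7f come from the commit message. Because the masks are one less than a power of two, the checks fire every 1024 and every 128 events, which approximates the earlier "per 1000" and "per 100" intervals.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask values taken from the commit message; the names are hypothetical. */
#define SOFTLIMIT_EVENTS_MASK 0x3ffu /* check fires every 1024 events */
#define THRESHOLD_EVENTS_MASK 0x7fu  /* check fires every 128 events */

/* True when the low bits selected by mask are all zero. */
static bool event_due(uint64_t counter, uint64_t mask)
{
    return (counter & mask) == 0;
}

/*
 * Simulate n increments of an "only increment, no reset" counter and
 * count how many times a check with the given mask fires.
 */
static unsigned int count_fires(uint64_t n, uint64_t mask)
{
    uint64_t counter = 0;
    unsigned int fires = 0;
    uint64_t i;

    for (i = 0; i < n; i++) {
        counter++;
        if (event_due(counter, mask))
            fires++;
    }
    return fires;
}
```

Both checks read the same counter value, which is what allows a single shared counter to replace two separate ones.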
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: update threshold and softlimit at commit · 430e4863
      KAMEZAWA Hiroyuki committed
      Presently, move_task does "batched" precharge.  Because res_counter and
      css's refcnt operations do not scale well for memcg, try_charge_() etc.
      tend to be done in a batched manner if allowed.
      
      Now, softlimit and threshold check their event counter in try_charge, but
      the charge is not a per-page event, and the event counter is not updated at
      charge().  Moreover, precharge doesn't pass "page" to try_charge(), so the
      softlimit tree will never be updated until uncharge() causes an event.
      
      So the best place to check the event counter is commit_charge().  This is
      a per-page event by its nature.  This patch moves the checks there.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: use generic percpu instead of private implementation · c62b1a3b
      KAMEZAWA Hiroyuki committed
      When the per-cpu counter for memcg was implemented, the dynamic percpu
      allocator was not very good.  But now we have a good one and useful macros.
      This patch replaces memcg's private percpu counter implementation with the
      generic dynamic percpu allocator.
      
      The benefits are
      	- We can remove the private implementation.
      	- The counters will be NUMA-aware. (The current one is not...)
      	- This patch makes sizeof struct mem_cgroup smaller. Then,
      	  struct mem_cgroup may fit in page size on small configs.
      	- About basic performance aspects, see below.
      
       [Before]
       # size mm/memcontrol.o
         text    data     bss     dec     hex filename
        24373    2528    4132   31033    7939 mm/memcontrol.o
      
       [page-fault-throughput test on 8cpu/SMP in root cgroup]
       # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8
      
       Performance counter stats for './multi-fault-fork 8' (5 runs):
      
             45878618  page-faults                ( +-   0.110% )
            602635826  cache-misses               ( +-   0.105% )
      
         61.005373262  seconds time elapsed   ( +-   0.004% )
      
       Then cache-miss/page fault = 13.14
      
       [After]
       # size mm/memcontrol.o
         text    data     bss     dec     hex filename
        23913    2528    4132   30573    776d mm/memcontrol.o
       # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8
      
       Performance counter stats for './multi-fault-fork 8' (5 runs):
      
             48179400  page-faults                ( +-   0.271% )
            588628407  cache-misses               ( +-   0.136% )
      
         61.004615021  seconds time elapsed   ( +-   0.004% )
      
        Then cache-miss/page fault = 12.22
      
       Text size is reduced.
       This performance improvement is not big and will be invisible in real world
       applications. But this result shows this patch has some good effect even
       on (small) SMP.
      
      Here is the test program I used.
      
       1. fork() a process on each cpu.
       2. do page faults repeatedly in each process.
       3. after 60 secs, kill all children and exit.
      
      (3 is necessary for getting stable data; this is an improvement over the previous one.)
      
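      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sched.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <sys/wait.h>	/* waitpid() */
      #include <fcntl.h>
      #include <signal.h>
      #include <stdlib.h>
      #include <unistd.h>	/* fork(), pause(), sleep(), getpid(), setpgid() */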
      
      /*
       * For avoiding contention in page table lock, FAULT area is
       * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
       */
      #define FAULT_LENGTH	(2 * 1024 * 1024)
      #define PAGE_SIZE	4096
      #define MAXNUM		(128)
      
      void alarm_handler(int sig)
      {
      }
      
      void *worker(int cpu, int ppid)
      {
      	void *start, *end;
      	char *c;
      	cpu_set_t set;
      	int i;
      
      	CPU_ZERO(&set);
      	CPU_SET(cpu, &set);
      	sched_setaffinity(0, sizeof(set), &set);
      
      	start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
      			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	if (start == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end = (char *)start + FAULT_LENGTH;
      
      	pause();
      	//fprintf(stderr, "run%d", cpu);
      	while (1) {
      		for (c = (char*)start; (void *)c < end; c += PAGE_SIZE)
      			*c = 0;
      		madvise(start, FAULT_LENGTH, MADV_DONTNEED);
      	}
      	return NULL;
      }
      
      int main(int argc, char *argv[])
      {
      	int num, i, ret, pid, status;
      	int pids[MAXNUM];
      
      	if (argc < 2)
      		return 0;
      
      	setpgid(0, 0);
      	signal(SIGALRM, alarm_handler);
      	num = atoi(argv[1]);
      	if (num < 1 || num > MAXNUM)	/* pids[] holds at most MAXNUM entries */
      		return 1;
      	pid = getpid();
      
      	for (i = 0; i < num; ++i) {
      		ret = fork();
      		if (!ret) {
      			worker(i, pid);
      			exit(0);
      		}
      		pids[i] = ret;
      	}
      	sleep(1);
      	kill(-pid, SIGALRM);
      	sleep(60);
      	for (i = 0; i < num; i++)
      		kill(pids[i], SIGKILL);
      	for (i = 0; i < num; i++)
      		waitpid(pids[i], &status, 0);
      	return 0;
      }
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: typo in comment to mem_cgroup_print_oom_info() · 6a6135b6
      Kirill A. Shutemov committed
      s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: implement memory thresholds · 2e72b634
      Kirill A. Shutemov committed
      It allows registering multiple memory and memsw thresholds and getting
      notifications when usage crosses them.
      
      To register a threshold, an application needs to:
      - create an eventfd;
      - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
      - write a string like "<event_fd> <memory.usage_in_bytes> <threshold>" to
        cgroup.event_control, where the second field is the file descriptor of
        the opened usage file.
      
      The application will be notified through the eventfd when memory usage
      crosses the threshold in any direction.
      
      It's applicable to root and non-root cgroups.
      
      It uses stats to track memory usage, similar to soft limits. It checks
      whether we need to send an event to userspace on every 100 pages in/out.
      I guess it's a good compromise between performance and accuracy of thresholds.
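The registration steps listed above can be sketched in userspace C as follows. This is an illustration under stated assumptions, not code from the patch: the helper names `format_threshold_request` and `register_threshold` are hypothetical, and `cgroup_dir` stands for whatever path the memory cgroup is mounted at on the system. Only `eventfd()`, `open()`, `write()`, and the two file names come from the interface described in the commit message.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Build the control string "<event_fd> <usage_fd> <threshold>".
 * Returns the string length, or -1 if buf is too small.
 */
static int format_threshold_request(char *buf, size_t len, int efd,
                                    int usage_fd, unsigned long long threshold)
{
    int n = snprintf(buf, len, "%d %d %llu", efd, usage_fd, threshold);

    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Register a usage threshold (in bytes) for the cgroup at cgroup_dir
 * (hypothetical path supplied by the caller).
 * Returns the eventfd to wait on, or -1 on failure.
 */
static int register_threshold(const char *cgroup_dir,
                              unsigned long long threshold)
{
    char path[4096], req[64];
    int efd, usage_fd = -1, ctl_fd = -1, n;

    efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cgroup_dir);
    usage_fd = open(path, O_RDONLY);
    if (usage_fd < 0)
        goto fail;

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
    ctl_fd = open(path, O_WRONLY);
    if (ctl_fd < 0)
        goto fail;

    n = format_threshold_request(req, sizeof(req), efd, usage_fd, threshold);
    if (n < 0 || write(ctl_fd, req, (size_t)n) != n)
        goto fail;

    /* Assumption: only efd is needed to wait for notifications. */
    close(ctl_fd);
    close(usage_fd);
    return efd;

fail:
    if (ctl_fd >= 0)
        close(ctl_fd);
    if (usage_fd >= 0)
        close(usage_fd);
    close(efd);
    return -1;
}
```

A caller would then block in `read()` on the returned descriptor with an 8-byte buffer; each successful read indicates that usage crossed the threshold at least once since the previous read.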
      
      [akpm@linux-foundation.org: coding-style fixes]
      [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: rework usage of stats by soft limit · 378ce724
      Kirill A. Shutemov committed
      Instead of incrementing a counter on each page in/out and comparing it with
      a constant, we set the counter to the constant, decrement it on each page
      in/out and compare it with zero.  We want to make the comparison as fast as
      possible.  On many RISC systems (probably not only RISC) comparing with
      zero is more efficient than comparing with a constant, since not every
      constant can be an immediate operand for the compare instruction.
      
      Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
      since it's really not a generic counter.
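The count-down scheme described above could look roughly like this in plain C. It is a sketch, not the kernel implementation: the struct, the function names, and the reload value of 1000 are assumptions chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical reload value: how many page in/out events between checks. */
#define SKETCH_EVENTS_THRESH 1000L

struct countdown {
    long remaining; /* events left until the next check */
};

static void countdown_init(struct countdown *c)
{
    c->remaining = SKETCH_EVENTS_THRESH;
}

/*
 * Call once per page in/out.  Returns true when a check is due and
 * re-arms the counter.  The hot path only decrements and compares
 * with zero; no other constant is involved in the comparison.
 */
static bool countdown_tick(struct countdown *c)
{
    if (--c->remaining != 0)
        return false;
    c->remaining = SKETCH_EVENTS_THRESH;
    return true;
}
```

The constant is only touched when the counter is re-armed, which happens once per interval rather than on every event.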
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: extract mem_group_usage() from mem_cgroup_read() · 104f3928
      Kirill A. Shutemov committed
      Helper to get memory or mem+swap usage of the cgroup.
      Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Dan Malek <dan@embeddedalley.com>
      Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Alexander Shishkin <virtuoso@slind.org>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: improve performance in moving swap charge · 483c30b5
      Daisuke Nishimura committed
      Try to reduce overheads in moving swap charge by:
      
      - Adding a new function (__mem_cgroup_put), which takes "count" as an arg and
        decrements mem->refcnt by "count".
      - Removing res_counter_uncharge, css_put, and mem_cgroup_put from the path
        of moving swap account, and consolidating all of them into mem_cgroup_clear_mc.
        We cannot do that for mc.to->refcnt.
      
      These changes reduce the overhead from 1.35sec to 0.9sec to move charges
      of 1G anonymous memory (including 500MB swap) in my test environment.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: move charges of anonymous swap · 02491447
      Daisuke Nishimura committed
      This patch is another core part of this move-charge-at-task-migration
      feature.  It enables moving charges of anonymous swaps.
      
      To move the charge of swap, we need to exchange swap_cgroup's record.
      
      In current implementation, swap_cgroup's record is protected by:
      
        - page lock: if the entry is on swap cache.
        - swap_lock: if the entry is not on swap cache.
      
      This works well in usual swap-in/out activity.
      
      But this behavior forces the feature of moving swap charge to check many
      conditions in order to exchange swap_cgroup's record safely.
      
      So I changed the modification of swap_cgroup's record (swap_cgroup_record())
      to use xchg, and defined a new function to cmpxchg swap_cgroup's record.
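The difference between the two update styles can be illustrated with C11 atomics in userspace. This is only an analogy under stated assumptions: the kernel uses its own xchg/cmpxchg primitives, and the struct, the function names, and the "0 means failure" return convention below are hypothetical choices for the sketch.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for one swap_cgroup record: an owner id. */
struct swap_record {
    _Atomic unsigned short id;
};

/* xchg style: store new_id unconditionally and return the previous id. */
static unsigned short record_set(struct swap_record *r, unsigned short new_id)
{
    return atomic_exchange(&r->id, new_id);
}

/*
 * cmpxchg style: replace old_id with new_id only if the record still
 * holds old_id.  Returns old_id on success and 0 on failure (this
 * sketch assumes 0 is never used as a valid id).
 */
static unsigned short record_cmpxchg(struct swap_record *r,
                                     unsigned short old_id,
                                     unsigned short new_id)
{
    unsigned short expected = old_id;

    if (atomic_compare_exchange_strong(&r->id, &expected, new_id))
        return old_id;
    return 0;
}
```

With the compare-and-exchange form, a mover that loses a race against swap-in or swap-out sees a failure and can skip the entry, instead of having to check beforehand which lock protects the record.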
      
      This patch also enables moving the charge of swap caches that are not
      pte_present but not yet uncharged, which can exist on the swap-out path, by
      getting the target pages via find_get_page() as do_mincore() does.
      
      [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
      [akpm@linux-foundation.org: fix typos]
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: avoid oom during moving charge · 8033b97c
      Daisuke Nishimura committed
      This move-charge-at-task-migration feature has extra charges on
      "to" (pre-charges) and "from" (left-over charges) while moving charge.
      This means an unnecessary oom can happen.
      
      This patch tries to avoid such an oom.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: improve performance in moving charge · 854ffa8d
      Daisuke Nishimura committed
      Try to reduce overheads in moving charge by:
      
      - Instead of calling res_counter_uncharge() against the old cgroup in
        __mem_cgroup_move_account() every time, call res_counter_uncharge() once at
        the end of task migration.
      - Removed css_get(&to->css) from __mem_cgroup_move_account() because callers
        should have already called css_get(). Also removed css_put(&to->css),
        which was called by callers of move_account on success of move_account.
      - Instead of calling __mem_cgroup_try_charge(), i.e. res_counter_charge(),
        repeatedly, call res_counter_charge(PAGE_SIZE * count) in can_attach() if
        possible.
      - Instead of calling css_get()/css_put() repeatedly, make use of coalesced
        __css_get()/__css_put() if possible.
      
      These changes reduce the overhead from 1.7sec to 0.6sec to move charges
      of 1G anonymous memory in my test environment.
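The idea of charging PAGE_SIZE * count in one call, with a per-page fallback when the batch does not fit, can be sketched with a toy counter. Everything here is hypothetical (struct, function names, the `calls` field used to make the saving visible); it only mirrors the shape of the optimization, not the res_counter API.

```c
#define SKETCH_PAGE_SIZE 4096UL

/* Toy stand-in for a resource counter; calls counts charge attempts. */
struct toy_counter {
    unsigned long usage;
    unsigned long limit;
    unsigned long calls;
};

/* One charge attempt.  Returns 0 on success, -1 if it would exceed limit. */
static int toy_charge(struct toy_counter *c, unsigned long bytes)
{
    c->calls++;
    if (bytes > c->limit - c->usage)
        return -1;
    c->usage += bytes;
    return 0;
}

/* Charge one page at a time; returns the number of pages charged. */
static unsigned long precharge_per_page(struct toy_counter *c,
                                        unsigned long count)
{
    unsigned long done = 0;

    while (done < count && toy_charge(c, SKETCH_PAGE_SIZE) == 0)
        done++;
    return done;
}

/* Try one batched charge first; fall back to per-page charging. */
static unsigned long precharge_batched(struct toy_counter *c,
                                       unsigned long count)
{
    if (toy_charge(c, SKETCH_PAGE_SIZE * count) == 0)
        return count;
    return precharge_per_page(c, count);
}
```

When the batch fits, eight pages cost a single charge attempt instead of eight, which is the kind of saving the list above aims for on a contended counter.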
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: move charges of anonymous page · 4ffef5fe
      Daisuke Nishimura committed
      This patch is the core part of this move-charge-at-task-migration feature.
      It implements functions to move charges of anonymous pages mapped only by
      the target task.
      
      Implementation:
      - Define struct move_charge_struct and a variable of it (mc) to remember the
        count of pre-charges and other information.
      - At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge()
        repeatedly and count up mc.precharge.
      - At attach(), parse the page table, find a target page to be moved, and call
        mem_cgroup_move_account() on the page.
      - Cancel all precharges if mc.precharge > 0 on failure or at the end of
        task move.
      
      [akpm@linux-foundation.org: a little simplification]
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      memcg: add interface to move charge at task migration · 7dc74be0
      Daisuke Nishimura committed
      In current memcg, charges associated with a task aren't moved to the new
      cgroup at task migration.  Some users find this behavior strange.
      These patches add this feature, that is, charging to the new
      cgroup and, of course, uncharging from the old cgroup at task migration.
      
      This patch adds the "memory.move_charge_at_immigrate" file, which is a flag
      file to determine whether charges should be moved to the new cgroup at
      task migration or not, and what type of charges should be moved.  This
      patch also adds read and write handlers for the file.
      
      This patch also adds no-op handlers for this feature.  These handlers will
      be implemented in later patches.  You cannot yet write any value other
      than 0 to move_charge_at_immigrate.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 Mar, 2010 1 commit
  4. 17 Jan, 2010 1 commit
    • D
      memcg: ensure list is empty at rmdir · fce66477
      Daisuke Nishimura committed
      Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on
      success.  But this doesn't guarantee memcg's LRU is really empty, because
      there are some cases in which !PageCgroupUsed pages exist on memcg's LRU.
      
      For example:
      - Pages can be uncharged by their owner process while they are on the LRU.
      - A race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common().
      
      So there can be a case in which the usage is zero but some of the LRUs are not empty.
      
      OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with
      rmdir, accesses the mem_cgroup, so this access can cause a problem if it
      races with rmdir because the mem_cgroup might have been freed by rmdir.
      
      Actually, I saw a bug which seems to be caused by this race.
      
      	[1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
      	[1530745.950651] IP: [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80
      	[1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0
      	[1530745.950651] Oops: 0002 [#1] SMP
      	[1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map
      	[1530745.950651] CPU 3
      	[1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
      	[1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G   M       2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065]
      	[1530745.950651] RIP: 0010:[<ffffffff810fbc11>]  [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80
      	[1530745.950651] RSP: 0018:ffff8803863ddcb8  EFLAGS: 00010002
      	[1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0
      	[1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238
      	[1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643
      	[1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000
      	[1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310
      	[1530745.950651] FS:  0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000
      	[1530745.950651] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	[1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0
      	[1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	[1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      	[1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0)
      	[1530745.950651] Stack:
      	[1530745.950651]  ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a
      	[1530745.950651] <0> 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000
      	[1530745.950651] <0> 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046
      	[1530745.950651] Call Trace:
      	[1530745.950651]  [<ffffffff810c739a>] release_pages+0x142/0x1e7
      	[1530745.950651]  [<ffffffff810c778f>] ? pagevec_move_tail+0x6e/0x112
      	[1530745.950651]  [<ffffffff810c781e>] pagevec_move_tail+0xfd/0x112
      	[1530745.950651]  [<ffffffff810c78a9>] lru_add_drain+0x76/0x94
      	[1530745.950651]  [<ffffffff810dba0c>] exit_mmap+0x6e/0x145
      	[1530745.950651]  [<ffffffff8103f52d>] mmput+0x5e/0xcf
      	[1530745.950651]  [<ffffffff81043ea8>] exit_mm+0x11c/0x129
      	[1530745.950651]  [<ffffffff8108fb29>] ? audit_free+0x196/0x1c9
      	[1530745.950651]  [<ffffffff81045353>] do_exit+0x1f5/0x6b7
      	[1530745.950651]  [<ffffffff8106133f>] ? up_read+0x2b/0x2f
      	[1530745.950651]  [<ffffffff8137d187>] ? lockdep_sys_exit_thunk+0x35/0x67
      	[1530745.950651]  [<ffffffff81045898>] do_group_exit+0x83/0xb0
      	[1530745.950651]  [<ffffffff810458dc>] sys_exit_group+0x17/0x1b
      	[1530745.950651]  [<ffffffff81002c1b>] system_call_fastpath+0x16/0x1b
      	[1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 <48> ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b
      	[1530745.950651] RIP  [<ffffffff810fbc11>] mem_cgroup_del_lru_list+0x30/0x80
      	[1530745.950651]  RSP <ffff8803863ddcb8>
      	[1530745.950651] CR2: 0000000000000230
      	[1530745.950651] ---[ end trace c3419c1bb8acc34f ]---
      	[1530745.950651] Fixing recursive fault but reboot is needed!
      
      The problem here is that pages on the LRU may contain a pointer to a stale
      memcg.  To make res->usage 0, all pages on the memcg must be uncharged or
      moved to another (parent) memcg.  A moved page_cgroup has already been
      removed from the original LRU, but an uncharged page_cgroup contains a
      pointer to the memcg without the PCG_USED bit.  (This asynchronous LRU work
      is for improving performance.)  If the PCG_USED bit is not set, a
      page_cgroup will never be added to the memcg's LRU.  So pages not on the
      LRU never access a stale pointer.  Then, what we have to take care of is
      the page_cgroup _on_ the LRU list.  This patch fixes this problem by making
      mem_cgroup_force_empty() visit all LRUs before exiting its loop and
      guarantee there are no pages on its LRU.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 16 Dec, 2009 12 commits
  6. 04 Dec, 2009 1 commit
  7. 09 Nov, 2009 1 commit
  8. 02 Oct, 2009 3 commits
    • K
      memcg: reduce check for softlimit excess · ef8745c1
      KAMEZAWA Hiroyuki committed
      In the charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly,
      and it takes res_counter's spin_lock every time.
      
      This patch removes unnecessary calls to res_count_soft_limit_excess.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: some modification to softlimit under hierarchical memory reclaim. · 4e649152
      KAMEZAWA Hiroyuki committed
      This patch cleans up and fixes memcg's uncharge soft limit path.
      
      Problems:
        Now, res_counter_charge()/uncharge() handles softlimit information at
        charge/uncharge, and the softlimit check is done when the per-memcg event
        counter goes over a limit.  The per-memcg event counter is updated only
        when memory usage is over the soft limit.  Considering hierarchical memcg
        management, ancestors should be taken care of as well.
      
        Now, ancestors (hierarchy) are handled in charge() but not in uncharge().
        This is not good.
      
        Problems:
        1. memcg's event counter is incremented only when the softlimit is hit.
           That's bad.  It makes the event counter hard to reuse for other purposes.
      
        2. At uncharge, only the lowest level res_counter is handled.  This is a bug.
           Because the ancestors' event counters are not incremented, children
           should take care of them.
      
        3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
           Operations under res_counter->lock should be small.  No "if" statement
           is better.
      
      Fixes:
        * Removed the soft_limit_xx pointer and checks in charge and uncharge.
          The do-check-only-when-necessary scheme works well enough without them.
      
        * Make memcg's event counter incremented at every charge/uncharge.
          (The per-cpu area will be accessed soon anyway.)
      
        * All ancestors are checked at soft-limit-check.  This is necessary because
          an ancestor's event counter may never be modified.  Then, they should be
          checked at the same time.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      memcg: fix refcnt going negative · 26251eaf
      KAMEZAWA Hiroyuki committed
      __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
      with incremented mz->mem->css's refcnt.  Then, the caller of this function
      has to call css_put(mz->mem->css).
      
      But mz can be !NULL even if "not found", i.e. without css_get().  Because of
      this, css->refcnt will go negative.
      
      This may cause various problems.  One result is an infinite loop in
      css_tryget(), as in the following trace.
      
      INFO: RCU detected CPU 0 stall (t=10000 jiffies)
      sending NMI to all CPUs:
      NMI backtrace for cpu 0
      CPU 0:
      <snip>
      
       <<EOE>>  <IRQ>  [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
        [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
        [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
        [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
        [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
        [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
        [<ffffffff81063a26>] update_process_times+0x46/0x70
        [<ffffffff81085930>] tick_sched_timer+0x60/0x160
        [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
        [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
        [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
        [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
        [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
        [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
        <EOI>  [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
        [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
        [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
        [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
        [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
        [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
      
      The trace above shows CPU0 caught in css_tryget()'s infinite loop because
      of the bad refcnt.
      
      The fix sets mz=NULL at the top of the retry path.
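
      The pattern behind the fix can be sketched as below. This is an
      illustration under stated assumptions, not the kernel function: struct
      entry, struct pool, try_get() and pick_largest() are hypothetical, and
      a simple array stands in for the per-zone tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for mz->mem->css. */
struct entry {
    int refcnt;
    int alive;   /* 0 means a reference can no longer be taken */
};

/* Hypothetical container; the last item plays the role of the largest node. */
struct pool {
    struct entry **items;
    size_t count;
};

/* Take a reference only if the object is still alive. */
static int try_get(struct entry *e)
{
    if (!e->alive)
        return 0;
    e->refcnt++;
    return 1;
}

/*
 * Return the last usable entry with a reference held, or NULL.
 * Resetting e at the top of the retry path guarantees that "not found"
 * is reported as NULL instead of a stale pointer with no reference taken.
 */
static struct entry *pick_largest(struct pool *p)
{
    struct entry *e;

    assert(p != NULL);
retry:
    e = NULL;                      /* the point of the fix */
    if (p->count == 0)
        goto done;                 /* nothing left to pick */
    e = p->items[--p->count];      /* remove the candidate from the pool */
    if (!try_get(e))
        goto retry;                /* candidate unusable, look again */
done:
    return e;
}
```

      Without the reset, a caller that drops a reference on every non-NULL
      return would decrement a count that was never incremented.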
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26251eaf
  9. 24 Sep, 2009 5 commits
    • D
      memcg: show swap usage in stat file · 1dd3a273
      Daisuke Nishimura committed
      We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage.  Showing
      swap usage in the memory.stat file is useful for users, because they no
      longer need to calculate memsw.usage - res.usage to know swap usage.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1dd3a273
    • B
      memcg: improve resource counter scalability · 0c3e73e8
      Balbir Singh committed
      Reduce the resource counter overhead (mostly spinlock) associated with the
      root cgroup.  This is a part of the several patches to reduce mem cgroup
      overhead.  I had posted other approaches earlier (including using percpu
      counters).  Those patches will be a natural addition and will be added
      iteratively on top of these.
      
      The patch stops resource counter accounting for the root cgroup.  The data
      for display is derived from the statistics we maintain via
      mem_cgroup_charge_statistics() (which is more scalable).  What happens
      today is that we do double accounting, once using res_counter_charge() and
      once using mem_cgroup_charge_statistics().  For the root, since we don't
      implement limits any more, we don't need to track every charge via
      res_counter_charge() and check for the limit being exceeded and reclaim.
      
      The main mem->res usage_in_bytes can be derived by summing the cache and
      rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
      additional data about swapped out memory.  This patch adds a
      MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
      recursively when hierarchy is enabled.
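
      The derivation described above can be sketched as below. This is a
      minimal illustration, not the kernel code: struct stats, struct group
      and usage_pages() are hypothetical, counters are kept in pages, and the
      4096-byte page size is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096UL   /* assumed page size for this sketch */

/* Hypothetical per-group statistics, in pages. */
struct stats {
    unsigned long rss;
    unsigned long cache;
    unsigned long swapout;
};

/* Hypothetical group with an array of child pointers. */
struct group {
    struct stats st;
    struct group **children;
    size_t nr_children;
};

/*
 * Derive usage from statistics instead of a resource counter.
 * with_swap selects the memsw view (rss + cache + swapout);
 * hierarchy adds the usage of all descendants recursively.
 */
static unsigned long usage_pages(const struct group *g, int with_swap,
                                 int hierarchy)
{
    unsigned long val;
    size_t i;

    assert(g != NULL);
    val = g->st.rss + g->st.cache;
    if (with_swap)
        val += g->st.swapout;
    if (hierarchy)
        for (i = 0; i < g->nr_children; i++)
            val += usage_pages(g->children[i], with_swap, 1);
    return val;
}

/* usage_in_bytes as it would be reported to the user. */
static unsigned long usage_in_bytes(const struct group *g, int with_swap,
                                    int hierarchy)
{
    return usage_pages(g, with_swap, hierarchy) * PAGE_BYTES;
}
```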
      
      The test results I see on a 24-way machine show that
      
      1. The lock contention disappears from /proc/lock_stat
      2. The results of the test are comparable to running with
         cgroup_disable=memory.
      
      Here is a sample of my program runs
      
      Without Patch
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7192804.124144  task-clock-msecs         #     23.937 CPUs
               424691  context-switches         #      0.000 M/sec
                  267  CPU-migrations           #      0.000 M/sec
             28498113  page-faults              #      0.004 M/sec
        5826093739340  cycles                   #    809.989 M/sec
         408883496292  instructions             #      0.070 IPC
           7057079452  cache-references         #      0.981 M/sec
           3036086243  cache-misses             #      0.422 M/sec
      
        300.485365680  seconds time elapsed
      
      With cgroup_disable=memory
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7182183.546587  task-clock-msecs         #     23.915 CPUs
               425458  context-switches         #      0.000 M/sec
                  203  CPU-migrations           #      0.000 M/sec
             92545093  page-faults              #      0.013 M/sec
        6034363609986  cycles                   #    840.185 M/sec
         437204346785  instructions             #      0.072 IPC
           6636073192  cache-references         #      0.924 M/sec
           2358117732  cache-misses             #      0.328 M/sec
      
        300.320905827  seconds time elapsed
      
      With this patch applied
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7191619.223977  task-clock-msecs         #     23.955 CPUs
               422579  context-switches         #      0.000 M/sec
                   88  CPU-migrations           #      0.000 M/sec
             91946060  page-faults              #      0.013 M/sec
        5957054385619  cycles                   #    828.333 M/sec
        1058117350365  instructions             #      0.178 IPC
           9161776218  cache-references         #      1.274 M/sec
           1920494280  cache-misses             #      0.267 M/sec
      
        300.218764862  seconds time elapsed
      
      Data from Prarit (kernel compile with make -j64 on a 64
      CPU/32G machine)
      
      For a single run
      
      Without patch
      
      real    27m8.988s
      user    87m24.916s
      sys     382m6.037s
      
      With patch
      
      real    4m18.607s
      user    84m58.943s
      sys     50m52.682s
      
      With config turned off
      
      real    4m54.972s
      user    90m13.456s
      sys     50m19.711s
      
      NOTE: The data looks counterintuitive due to the increased performance
      with the patch, even over the config being turned off. We probably need
      more runs, but so far all testing has shown that the patches definitely
      help.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c3e73e8
    • B
      memory controller: soft limit reclaim on contention · 4e416953
      Balbir Singh committed
      Implement reclaim from groups over their soft limit
      
      Permit reclaim from memory cgroups on contention (via the direct reclaim
      path).
      
      Memory cgroup soft limit reclaim finds the group that exceeds its soft
      limit by the largest number of pages, reclaims pages from it, and then
      reinserts the cgroup into its correct place in the rbtree.
      
      Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
      loops in case all swap is turned off.  The code has been refactored and
      the loop check (loop < 2) has been enhanced for soft limits.  For soft
      limits, we try to do more targeted reclaim.  Instead of bailing out after
      two loops, the routine now reclaims memory proportional to the size by
      which the soft limit is exceeded.  The proportion has been empirically
      determined.
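
      The victim selection described above can be sketched as below. This is
      an illustration only: struct group and the helpers are hypothetical, a
      linear scan over an array replaces the rbtree, and the reclaim target
      is a caller-supplied number because the empirically determined
      proportion is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical group; usage and soft_limit are in pages. */
struct group {
    unsigned long usage;
    unsigned long soft_limit;
};

/* Pages by which a group exceeds its soft limit, or 0. */
static unsigned long excess(const struct group *g)
{
    return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/*
 * Pick the group over its soft limit by the largest amount, or NULL if no
 * group is over. A linear scan stands in for taking the rightmost rbtree node.
 */
static struct group *largest_excess(struct group *groups, size_t n)
{
    struct group *victim = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (excess(&groups[i]) > 0 &&
            (victim == NULL || excess(&groups[i]) > excess(victim)))
            victim = &groups[i];
    return victim;
}

/*
 * Reclaim up to target pages from the victim, but never push it below its
 * soft limit. Returns the number of pages actually reclaimed.
 */
static unsigned long reclaim_from(struct group *g, unsigned long target)
{
    unsigned long ex, got;

    assert(g != NULL);
    ex = excess(g);
    got = target < ex ? target : ex;
    g->usage -= got;
    return got;
}
```

      Because the excess is recomputed on every call, a group that has been
      reclaimed from naturally moves to its new position in the ordering,
      which is what reinsertion into the tree achieves.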
      
      [akpm@linux-foundation.org: build fix]
      [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
      [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e416953
    • B
      memory controller: soft limit refactor reclaim flags · 75822b44
      Balbir Singh committed
      Refactor mem_cgroup_hierarchical_reclaim()
      
      Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
      flags, so that new parameters don't have to be added as we make the
      reclaim routine more flexible.
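
      The idea of folding separate arguments into one flags word can be
      sketched as below. The flag names, struct reclaim_opts and
      decode_flags() are hypothetical and chosen for illustration; they are
      not claimed to match the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits replacing two separate boolean arguments. */
#define RECLAIM_NOSWAP  (1u << 0)   /* do not use swap while reclaiming */
#define RECLAIM_SHRINK  (1u << 1)   /* caller is shrinking the limit */
#define RECLAIM_KNOWN   (RECLAIM_NOSWAP | RECLAIM_SHRINK)

/* Decoded view of the flags, as a reclaim routine might use them. */
struct reclaim_opts {
    bool noswap;
    bool shrink;
};

/*
 * Decode a flags word. A new behavior needs only a new bit, so the
 * function signature and the existing callers stay unchanged.
 */
static struct reclaim_opts decode_flags(unsigned int flags)
{
    struct reclaim_opts opts;

    assert((flags & ~RECLAIM_KNOWN) == 0);   /* reject unknown bits */
    opts.noswap = (flags & RECLAIM_NOSWAP) != 0;
    opts.shrink = (flags & RECLAIM_SHRINK) != 0;
    return opts;
}
```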
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75822b44
    • B
      memory controller: soft limit organize cgroups · f64c3f54
      Balbir Singh committed
      Organize cgroups over soft limit in a RB-Tree
      
      Introduce an RB-Tree for storing memory cgroups that are over their soft
      limit.  The overall goal is to
      
      1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
         We are careful about updates: they take place only after a particular
         time interval has passed.
      2. Remove the node from the RB-Tree when the usage goes below the soft
         limit.
      
      The next set of patches will exploit the RB-Tree to get the group that is
      over its soft limit by the largest amount and reclaim from it, when we
      face memory contention.
      
      [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f64c3f54