1. 03 11月, 2011 6 次提交
  2. 01 11月, 2011 1 次提交
    • M
      mm: change isolate mode from #define to bitwise type · 4356f21d
      Minchan Kim 提交于
      Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
      macro isn't recommended as it's type-unsafe and making debugging harder as
      symbol cannot be passed throught to the debugger.
      
      Quote from Johannes
      " Hmm, it would probably be cleaner to fully convert the isolation mode
      into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
      tri-state among flags, which is a bit ugly."
      
      This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4356f21d
  3. 31 10月, 2011 1 次提交
  4. 15 9月, 2011 1 次提交
  5. 26 8月, 2011 2 次提交
  6. 10 8月, 2011 1 次提交
    • M
      Revert "memcg: get rid of percpu_charge_mutex lock" · 9f50fad6
      Michal Hocko 提交于
      This reverts commit 8521fc50.
      
      The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
      bit operations is sufficient but that is not true.  Johannes Weiner has
      reported a crash during parallel memory cgroup removal:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
        IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
        Oops: 0000 [#1] PREEMPT SMP
        Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
        RIP: 0010:[<ffffffff81083b70>]  css_is_ancestor+0x20/0x70
        RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
        Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
        Call Trace:
         [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
         [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
         [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
         [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
         [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
         [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
         [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
         [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
         [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
      
      We are crashing because we try to dereference cached memcg when we are
      checking whether we should wait for draining on the cache.  The cache is
      already cleaned up, though.
      
      There is also a theoretical chance that the cached memcg gets freed
      between we test for the FLUSHING_CACHED_CHARGE and dereference it in
      mem_cgroup_same_or_subtree:
      
              CPU0                    CPU1                         CPU2
        mem=stock->cached
        stock->cached=NULL
                                    clear_bit
                                                              test_and_set_bit
        test_bit()                    ...
        <preempted>             mem_cgroup_destroy
        use after free
      
      The percpu_charge_mutex protected from this race because sync draining
      is exclusive.
      
      It is safer to revert now and come up with a more parallel
      implementation later.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NJohannes Weiner <jweiner@redhat.com>
      Acked-by: NJohannes Weiner <jweiner@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f50fad6
  7. 04 8月, 2011 1 次提交
  8. 27 7月, 2011 10 次提交
    • M
      memcg: get rid of percpu_charge_mutex lock · 8521fc50
      Michal Hocko 提交于
      percpu_charge_mutex protects from multiple simultaneous per-cpu charge
      caches draining because we might end up having too many work items.  At
      least this was the case until commit 26fe6168 ("memcg: fix percpu
      cached charge draining frequency") when we introduced a more targeted
      draining for async mode.
      
      Now that also sync draining is targeted we can safely remove mutex
      because we will not send more work than the current number of CPUs.
      FLUSHING_CACHED_CHARGE protects from sending the same work multiple
      times and stock->nr_pages == 0 protects from pointless sending a work if
      there is obviously nothing to be done.  This is of course racy but we
      can live with it as the race window is really small (we would have to
      see FLUSHING_CACHED_CHARGE cleared while nr_pages would be still
      non-zero).
      
      The only remaining place where we can race is synchronous mode when we
      rely on FLUSHING_CACHED_CHARGE test which might have been set by other
      drainer on the same group but we should wait in that case as well.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8521fc50
    • M
      memcg: add mem_cgroup_same_or_subtree() helper · 3e92041d
      Michal Hocko 提交于
      We are checking whether a given two groups are same or at least in the
      same subtree of a hierarchy at several places.  Let's make a helper for
      it to make code easier to read.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e92041d
    • M
      memcg: unify sync and async per-cpu charge cache draining · d38144b7
      Michal Hocko 提交于
      Currently we have two ways how to drain per-CPU caches for charges.
      drain_all_stock_sync will synchronously drain all caches while
      drain_all_stock_async will asynchronously drain only those that refer to
      a given memory cgroup or its subtree in hierarchy.  Targeted async
      draining has been introduced by 26fe6168 (memcg: fix percpu cached
      charge draining frequency) to reduce the cpu workers number.
      
      sync draining is currently triggered only from mem_cgroup_force_empty
      which is triggered only by userspace (mem_cgroup_force_empty_write) or
      when a cgroup is removed (mem_cgroup_pre_destroy).  Although these are
      not usually frequent operations it still makes some sense to do targeted
      draining as well, especially if the box has many CPUs.
      
      This patch unifies both methods to use the single code (drain_all_stock)
      which relies on the original async implementation and just adds
      flush_work to wait on all caches that are still under work for the sync
      mode.  We are using FLUSHING_CACHED_CHARGE bit check to prevent from
      waiting on a work that we haven't triggered.  Please note that both sync
      and async functions are currently protected by percpu_charge_mutex so we
      cannot race with other drainers.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d38144b7
    • M
      memcg: do not try to drain per-cpu caches without pages · d1a05b69
      Michal Hocko 提交于
      drain_all_stock_async tries to optimize a work to be done on the work
      queue by excluding any work for the current CPU because it assumes that
      the context we are called from already tried to charge from that cache
      and it's failed so it must be empty already.
      
      While the assumption is correct we can optimize it even more by checking
      the current number of pages in the cache.  This will also reduce a work
      on other CPUs with an empty stock.
      
      For the current CPU we can simply call drain_local_stock rather than
      deferring it to the work queue.
      
      [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1a05b69
    • K
      memcg: add memory.vmscan_stat · 82f9d486
      KAMEZAWA Hiroyuki 提交于
      The commit log of 0ae5e89c ("memcg: count the soft_limit reclaim
      in...") says it adds scanning stats to memory.stat file.  But it doesn't
      because we considered we needed to make a concensus for such new APIs.
      
      This patch is a trial to add memory.scan_stat. This shows
        - the number of scanned pages(total, anon, file)
        - the number of rotated pages(total, anon, file)
        - the number of freed pages(total, anon, file)
        - the number of elaplsed time (including sleep/pause time)
      
        for both of direct/soft reclaim.
      
      The biggest difference with oringinal Ying's one is that this file
      can be reset by some write, as
      
        # echo 0 ...../memory.scan_stat
      
      Example of output is here. This is a result after make -j 6 kernel
      under 300M limit.
      
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
        [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
        scanned_pages_by_limit 9471864
        scanned_anon_pages_by_limit 6640629
        scanned_file_pages_by_limit 2831235
        rotated_pages_by_limit 4243974
        rotated_anon_pages_by_limit 3971968
        rotated_file_pages_by_limit 272006
        freed_pages_by_limit 2318492
        freed_anon_pages_by_limit 962052
        freed_file_pages_by_limit 1356440
        elapsed_ns_by_limit 351386416101
        scanned_pages_by_system 0
        scanned_anon_pages_by_system 0
        scanned_file_pages_by_system 0
        rotated_pages_by_system 0
        rotated_anon_pages_by_system 0
        rotated_file_pages_by_system 0
        freed_pages_by_system 0
        freed_anon_pages_by_system 0
        freed_file_pages_by_system 0
        elapsed_ns_by_system 0
        scanned_pages_by_limit_under_hierarchy 9471864
        scanned_anon_pages_by_limit_under_hierarchy 6640629
        scanned_file_pages_by_limit_under_hierarchy 2831235
        rotated_pages_by_limit_under_hierarchy 4243974
        rotated_anon_pages_by_limit_under_hierarchy 3971968
        rotated_file_pages_by_limit_under_hierarchy 272006
        freed_pages_by_limit_under_hierarchy 2318492
        freed_anon_pages_by_limit_under_hierarchy 962052
        freed_file_pages_by_limit_under_hierarchy 1356440
        elapsed_ns_by_limit_under_hierarchy 351386416101
        scanned_pages_by_system_under_hierarchy 0
        scanned_anon_pages_by_system_under_hierarchy 0
        scanned_file_pages_by_system_under_hierarchy 0
        rotated_pages_by_system_under_hierarchy 0
        rotated_anon_pages_by_system_under_hierarchy 0
        rotated_file_pages_by_system_under_hierarchy 0
        freed_pages_by_system_under_hierarchy 0
        freed_anon_pages_by_system_under_hierarchy 0
        freed_file_pages_by_system_under_hierarchy 0
        elapsed_ns_by_system_under_hierarchy 0
      
      total_xxxx is for hierarchy management.
      
      This will be useful for further memcg developments and need to be
      developped before we do some complicated rework on LRU/softlimit
      management.
      
      This patch adds a new struct memcg_scanrecord into scan_control struct.
      sc->nr_scanned at el is not designed for exporting information.  For
      example, nr_scanned is reset frequentrly and incremented +2 at scanning
      mapped pages.
      
      To avoid complexity, I added a new param in scan_control which is for
      exporting scanning score.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andrew Bresticker <abrestic@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82f9d486
    • D
      memcg: fix behavior of mem_cgroup_resize_limit() · 108b6a78
      Daisuke Nishimura 提交于
      Commit 22a668d7 ("memcg: fix behavior under memory.limit equals to
      memsw.limit") introduced "memsw_is_minimum" flag, which becomes true
      when mem_limit == memsw_limit.  The flag is checked at the beginning of
      reclaim, and "noswap" is set if the flag is true, because using swap is
      meaningless in this case.
      
      This works well in most cases, but when we try to shrink mem_limit,
      which is the same as memsw_limit now, we might fail to shrink mem_limit
      because swap doesn't used.
      
      This patch fixes this behavior by:
       - check MEM_CGROUP_RECLAIM_SHRINK at the begining of reclaim
       - If it is set, don't set "noswap" flag even if memsw_is_minimum is true.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      108b6a78
    • M
      memcg: change memcg_oom_mutex to spinlock · 1af8efe9
      Michal Hocko 提交于
      memcg_oom_mutex is used to protect memcg OOM path and eventfd interface
      for oom_control.  None of the critical sections which it protects sleep
      (eventfd_signal works from atomic context and the rest are simple linked
      list resp.  oom_lock atomic operations).
      
      Mutex is also too heavyweight for those code paths because it triggers a
      lot of scheduling.  It also makes makes convoying effects more visible
      when we have a big number of oom killing because we take the lock
      mutliple times during mem_cgroup_handle_oom so we have multiple places
      where many processes can sleep.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1af8efe9
    • M
      memcg: make oom_lock 0 and 1 based rather than counter · 79dfdacc
      Michal Hocko 提交于
      Commit 867578cb ("memcg: fix oom kill behavior") introduced a oom_lock
      counter which is incremented by mem_cgroup_oom_lock when we are about to
      handle memcg OOM situation.  mem_cgroup_handle_oom falls back to a sleep
      if oom_lock > 1 to prevent from multiple oom kills at the same time.
      The counter is then decremented by mem_cgroup_oom_unlock called from the
      same function.
      
      This works correctly but it can lead to serious starvations when we have
      many processes triggering OOM and many CPUs available for them (I have
      tested with 16 CPUs).
      
      Consider a process (call it A) which gets the oom_lock (the first one
      that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
      processes that are blocked on the mutex.  While A releases the mutex and
      calls mem_cgroup_out_of_memory others will wake up (one after another)
      and increase the counter and fall into sleep (memcg_oom_waitq).
      
      Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
      decreases oom_lock and wakes other tasks (if releasing memory by
      somebody else - e.g.  killed process - hasn't done it yet).
      
      A testcase would look like:
        Assume malloc XXX is a program allocating XXX Megabytes of memory
        which touches all allocated pages in a tight loop
        # swapoff SWAP_DEVICE
        # cgcreate -g memory:A
        # cgset -r memory.oom_control=0   A
        # cgset -r memory.limit_in_bytes= 200M
        # for i in `seq 100`
        # do
        #     cgexec -g memory:A   malloc 10 &
        # done
      
      The main problem here is that all processes still race for the mutex and
      there is no guarantee that we will get counter back to 0 for those that
      got back to mem_cgroup_handle_oom.  In the end the whole convoy
      in/decreases the counter but we do not get to 1 that would enable
      killing so nothing useful can be done.  The time is basically unbounded
      because it highly depends on scheduling and ordering on mutex (I have
      seen this taking hours...).
      
      This patch replaces the counter by a simple {un}lock semantic.  As
      mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have to
      make sure that nobody else races with us which is guaranteed by the
      memcg_oom_mutex.
      
      We have to be careful while locking subtrees because we can encounter a
      subtree which is already locked: hierarchy:
      
                A
              /   \
             B     \
            /\      \
           C  D     E
      
      B - C - D tree might be already locked.  While we want to enable locking
      E subtree because OOM situations cannot influence each other we
      definitely do not want to allow locking A.
      
      Therefore we have to refuse lock if any subtree is already locked and
      clear up the lock for all nodes that have been set up to the failure
      point.
      
      On the other hand we have to make sure that the rest of the world will
      recognize that a group is under OOM even though it doesn't have a lock.
      Therefore we have to introduce under_oom variable which is incremented
      and decremented for the whole subtree when we enter resp.  leave
      mem_cgroup_handle_oom.  under_oom, unlike oom_lock, doesn't need be
      updated under memcg_oom_mutex because its users only check a single
      group and they use atomic operations for that.
      
      This can be checked easily by the following test case:
      
        # cgcreate -g memory:A
        # cgset -r memory.use_hierarchy=1 A
        # cgset -r memory.oom_control=1   A
        # cgset -r memory.limit_in_bytes= 100M
        # cgset -r memory.memsw.limit_in_bytes= 100M
        # cgcreate -g memory:A/B
        # cgset -r memory.oom_control=1 A/B
        # cgset -r memory.limit_in_bytes=20M
        # cgset -r memory.memsw.limit_in_bytes=20M
        # cgexec -g memory:A/B malloc 30  &    #->this will be blocked by OOM of group B
        # cgexec -g memory:A   malloc 80  &    #->this will be blocked by OOM of group A
      
      While B gets oom_lock A will not get it.  Both of them go into sleep and
      wait for an external action.  We can make the limit higher for A to
      enforce waking it up
      
        # cgset -r memory.memsw.limit_in_bytes=300M A
        # cgset -r memory.limit_in_bytes=300M A
      
      malloc in A has to wake up even though it doesn't have oom_lock.
      
      Finally, the unlock path is very easy because we always unlock only the
      subtree we have locked previously while we always decrement under_oom.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79dfdacc
    • K
      memcg: consolidate memory cgroup lru stat functions · bb2a0de9
      KAMEZAWA Hiroyuki 提交于
      In mm/memcontrol.c, there are many lru stat functions as..
      
        mem_cgroup_zone_nr_lru_pages
        mem_cgroup_node_nr_file_lru_pages
        mem_cgroup_nr_file_lru_pages
        mem_cgroup_node_nr_anon_lru_pages
        mem_cgroup_nr_anon_lru_pages
        mem_cgroup_node_nr_unevictable_lru_pages
        mem_cgroup_nr_unevictable_lru_pages
        mem_cgroup_node_nr_lru_pages
        mem_cgroup_nr_lru_pages
        mem_cgroup_get_local_zonestat
      
      Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
      This seems bad. This patch consolidates all functions into
      
        mem_cgroup_zone_nr_lru_pages()
        mem_cgroup_node_nr_lru_pages()
        mem_cgroup_nr_lru_pages()
      
      For these functions, "which LRU?" information is passed by a mask.
      
      example:
        mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))
      
      And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.
      
      example:
        mem_cgroup_nr_lru_pages(mem, ALL_LRU)
      
      BTW, considering layout of NUMA memory placement of counters, this patch seems
      to be better.
      
      Now, when we gather all LRU information, we scan in following orer
          for_each_lru -> for_each_node -> for_each_zone.
      
      This means we'll touch cache lines in different node in turn.
      
      After patch, we'll scan
          for_each_node -> for_each_zone -> for_each_lru(mask)
      
      Then, we'll gather information in the same cacheline at once.
      
      [akpm@linux-foundation.org: fix warnigns, build error]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb2a0de9
    • K
      memcg: export memory cgroup's swappiness with mem_cgroup_swappiness() · 1f4c025b
      KAMEZAWA Hiroyuki 提交于
      Each memory cgroup has a 'swappiness' value which can be accessed by
      get_swappiness(memcg).  The major user is try_to_free_mem_cgroup_pages()
      and swappiness is passed by argument.  It's propagated by scan_control.
      
      get_swappiness() is a static function but some planned updates will need
      to get swappiness from files other than memcontrol.c This patch exports
      get_swappiness() as mem_cgroup_swappiness().  With this, we can remove the
      argument of swapiness from try_to_free...  and drop swappiness from
      scan_control.  only memcg uses it.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ying Han <yinghan@google.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f4c025b
  9. 09 7月, 2011 2 次提交
  10. 28 6月, 2011 1 次提交
  11. 16 6月, 2011 5 次提交
  12. 27 5月, 2011 8 次提交
    • Y
      memcg: add the pagefault count into memcg stats · 456f998e
      Ying Han 提交于
      Two new stats in per-memcg memory.stat which tracks the number of page
      faults and number of major page faults.
      
        "pgfault"
        "pgmajfault"
      
      They are different from "pgpgin"/"pgpgout" stat which count number of
      pages charged/discharged to the cgroup and have no meaning of reading/
      writing page to disk.
      
      It is valuable to track the two stats for both measuring application's
      performance as well as the efficiency of the kernel page reclaim path.
      Counting pagefaults per process is useful, but we also need the aggregated
      value since processes are monitored and controlled in cgroup basis in
      memcg.
      
      Functional test: check the total number of pgfault/pgmajfault of all
      memcgs and compare with global vmstat value:
      
        $ cat /proc/vmstat | grep fault
        pgfault 1070751
        pgmajfault 553
      
        $ cat /dev/cgroup/memory.stat | grep fault
        pgfault 1071138
        pgmajfault 553
        total_pgfault 1071142
        total_pgmajfault 553
      
        $ cat /dev/cgroup/A/memory.stat | grep fault
        pgfault 199
        pgmajfault 0
        total_pgfault 199
        total_pgmajfault 0
      
      Performance test: run page fault test(pft) wit 16 thread on faulting in
      15G anon pages in 16G container.  There is no regression noticed on the
      "flt/cpu/s"
      
      Sample output from pft:
      
        TAG pft:anon-sys-default:
          Gb  Thr CLine   User     System     Wall    flt/cpu/s fault/wsec
          15   16   1     0.67s   233.41s    14.76s   16798.546 266356.260
      
        +-------------------------------------------------------------------------+
            N           Min           Max        Median           Avg        Stddev
        x  10     16682.962     17344.027     16913.524     16928.812      166.5362
        +  10     16695.568     16923.896     16820.604     16824.652     84.816568
        No difference proven at 95.0% confidence
      
      [akpm@linux-foundation.org: fix build]
      [hughd@google.com: shmem fix]
      Signed-off-by: NYing Han <yinghan@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      456f998e
    • Y
      memcg: add memory.numastat api for numa statistics · 406eb0c9
      Ying Han 提交于
      The new API exports numa_maps per-memcg basis.  This is a piece of useful
      information where it exports per-memcg page distribution across real numa
      nodes.
      
      One of the usecases is evaluating application performance by combining
      this information w/ the cpu allocation to the application.
      
      The output of the memory.numastat tries to follow w/ simiar format of
      numa_maps like:
      
        total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
        file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
        anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
        unevictable=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
      
      And we have per-node:
      
        total = file + anon + unevictable
      
        $ cat /dev/cgroup/memory/memory.numa_stat
        total=250020 N0=87620 N1=52367 N2=45298 N3=64735
        file=225232 N0=83402 N1=46160 N2=40522 N3=55148
        anon=21053 N0=3424 N1=6207 N2=4776 N3=6646
        unevictable=3735 N0=794 N1=0 N2=0 N3=2941
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      406eb0c9
    • Y
      memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages() · 1bac180b
      Ying Han 提交于
      The caller of the function has been renamed to zone_nr_lru_pages(), and
      this is just fixing up in the memcg code.  The current name is easily to
      be mis-read as zone's total number of pages.
      Signed-off-by: NYing Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bac180b
    • J
      memcg: remove unused retry signal from reclaim · 4fd14ebf
      Johannes Weiner 提交于
      If the memcg reclaim code detects the target memcg below its limit it
      exits and returns a guaranteed non-zero value so that the charge is
      retried.
      
      Nowadays, the charge side checks the memcg limit itself and does not rely
      on this non-zero return value trick.
      
      This patch removes it.  The reclaim code will now always return the true
      number of pages it reclaimed on its own.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel<riel@redhat.com>
      Acked-by: Ying Han<yinghan@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4fd14ebf
    • Y
      memcg: reclaim memory from nodes in round-robin order · 889976db
      Ying Han 提交于
      Presently, memory cgroup's direct reclaim frees memory from the current
      node.  But this has some troubles.  Usually when a set of threads works in
      a cooperative way, they tend to operate on the same node.  So if they hit
      limits under memcg they will reclaim memory from themselves, damaging the
      active working set.
      
      For example, assume 2 node system which has Node 0 and Node 1 and a memcg
      which has 1G limit.  After some work, file cache remains and the usages
      are
      
         Node 0:  1M
         Node 1:  998M.
      
      and run an application on Node 0, it will eat its foot before freeing
      unnecessary file caches.
      
      This patch adds round-robin for NUMA and adds equal pressure to each node.
      When using cpuset's spread memory feature, this will work very well.
      
      But yes, a better algorithm is needed.
      
      [akpm@linux-foundation.org: comment editing]
      [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
      Signed-off-by: NYing Han <yinghan@google.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      889976db
    • M
      memcg: remove pointless next_mz nullification in mem_cgroup_soft_limit_reclaim() · 39cc98f1
      Michal Hocko 提交于
      next_mz is assigned to NULL if __mem_cgroup_largest_soft_limit_node
      selects the same mz.  This doesn't make much sense as we assign to the
      variable right in the next loop.
      
      Compiler will probably optimize this out but it is little bit confusing
      for the code reading.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39cc98f1
    • Y
      memcg: count the soft_limit reclaim in global background reclaim · 0ae5e89c
      Ying Han 提交于
      The global kswapd scans per-zone LRU and reclaims pages regardless of the
      cgroup. It breaks memory isolation since one cgroup can end up reclaiming
      pages from another cgroup. Instead we should rely on memcg-aware target
      reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
      memory pressure.
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU. However, the return value is ignored. This patch is the first
      step to skip shrink_zone() if soft_limit reclaim does enough work.
      
      This is part of the effort which tries to reduce reclaiming pages in global
      LRU in memcg. The per-memcg background reclaim patchset further enhances the
      per-cgroup targetting reclaim, which I should have V4 posted shortly.
      
      Try running multiple memory intensive workloads within seperate memcgs. Watch
      the counters of soft_steal in memory.stat.
      
        $ cat /dev/cgroup/A/memory.stat | grep 'soft'
        soft_steal 240000
        soft_scan 240000
        total_soft_steal 240000
        total_soft_scan 240000
      
      This patch:
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU.  However, the return value is ignored.
      
      We would like to skip shrink_zone() if soft_limit reclaim does enough
      work.  Also, we need to make the memory pressure balanced across per-memcg
      zones, like the logic vm-core.  This patch is the first step where we
      start with counting the nr_scanned and nr_reclaimed from soft_limit
      reclaim into the global scan_control.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ae5e89c
    • B
      cgroups: add per-thread subsystem callbacks · f780bdb7
      Ben Blum 提交于
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups's subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, as replaced by
      this.  All subsystems are modified for the new interface - of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
      Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
  13. 25 5月, 2011 1 次提交