1. 08 12月, 2019 2 次提交
    • Y
      alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784
      Yang Shi 提交于
      Introduce a new interface, wmark_scale_factor, which defines the
      distance between wmark_high and wmark_low.  The unit is in fractions of
      10,000. The default value of 50 means the distance between wmark_high
      and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
      is 1000, or 10% of the max limit.
      
      The distance between wmark_low and wmark_high have impact on how hard
      memcg kswapd would reclaim.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      33ef4784
    • Y
      alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Yang Shi 提交于
      Currently when memory usage exceeds memory cgroup limit, memory cgroup
      just can do sync direct reclaim.  This may incur unexpected stall on
      some applications which are sensitive to latency.  Introduce background
      async page reclaim mechanism, like what kswapd does.
      
      Define memcg memory usage water mark by introducing wmark_ratio interface,
      which is from 0 to 100 and represents percentage of max limit.  The
      wmark_high is calculated by (max * wmark_ratio / 100), the wmark_low is
      (wmark_high - wmark_high >> 8), which is an empirical value.  If wmark_ratio
      is 0, it means water mark is disabled, both wmark_low and wmark_high is max,
      which is the default value.
      
      If wmark_ratio is setup, when charging page, if usage is greater than
      wmark_high, which means the available memory of memcg is low, a work
      would be scheduled to do background page reclaim until memory usage is
      reduced to wmark_low if possible.
      
      Define a dedicated unbound workqueue for scheduling water mark reclaim
      works.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      6b2ef082
  2. 12 4月, 2018 1 次提交
  3. 31 1月, 2018 1 次提交
  4. 11 7月, 2017 1 次提交
    • D
      mm, vmpressure: pass-through notification support · b6bb9811
      David Rientjes 提交于
      By default, vmpressure events are not pass-through, i.e.  they propagate
      up through the memcg hierarchy until an event notifier is found for any
      threshold level.
      
      This presents a difficulty when a thread waiting on a read(2) for a
      vmpressure event cannot distinguish between local memory pressure and
      memory pressure in a descendant memcg, especially when that thread may
      not control the memcg hierarchy.
      
      Consider a user-controlled child memcg with a smaller limit than a
      top-level memcg controlled by the "Activity Manager" specified in
      Documentation/cgroup-v1/memory.txt.  It may register for memory pressure
      notification for descendant memcgs to make a policy decision: oom kill a
      low priority job, increase the limit, decrease other limits, etc.  If it
      registers for memory pressure notification on the top-level memcg, it
      currently cannot distinguish between memory pressure in its own memcg or
      a descendant memcg, which is user-controlled.
      
      Conversely, if a user registers for memory pressure notification on
      their own descendant memcg, the Activity Manager does not receive any
      pressure notification for that child memcg hierarchy.  Vmpressure events
      are not received for ancestor memcgs if the memcg experiencing pressure
      have notifiers registered, perhaps outside the knowledge of the thread
      waiting on read(2) at the top level.
      
      Both of these are consequences of vmpressure notification not being
      pass-through.
      
      This implements a pass-through behavior for vmpressure events.  When
      writing to control.event_control, vmpressure event handlers may
      optionally specify a mode.  There are two new modes:
      
       - "hierarchy": always propagate memory pressure events up the hierarchy
         regardless if descendant memcgs have their own notifiers registered,
         and
      
       - "local": only receive notifications when the memcg for which the
         event is registered experiences memory pressure.
      
      Of course, processes may register for one notification of "low,local",
      for example, and another for "low".
      
      If no mode is specified, the current behavior is maintained for
      backwards compatibility.
      
      See the change to Documentation/cgroup-v1/memory.txt for full
      specification.
      
      [dan.carpenter@oracle.com: free the same pointer we allocated]
        Link: http://lkml.kernel.org/r/20170613191820.GA20003@elgon.mountain
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705311421320.8946@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6bb9811
  5. 29 7月, 2016 1 次提交
  6. 15 5月, 2016 1 次提交
  7. 12 1月, 2016 1 次提交
  8. 17 11月, 2015 1 次提交
  9. 02 6月, 2015 1 次提交
    • G
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen 提交于
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: NSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c4843a75
  10. 11 4月, 2015 1 次提交
  11. 14 12月, 2014 1 次提交
  12. 11 12月, 2014 3 次提交
    • J
      mm: embed the memcg pointer directly into struct page · 1306a85a
      Johannes Weiner 提交于
      Memory cgroups used to have 5 per-page pointers.  To allow users to
      disable that amount of overhead during runtime, those pointers were
      allocated in a separate array, with a translation layer between them and
      struct page.
      
      There is now only one page pointer remaining: the memcg pointer, that
      indicates which cgroup the page is associated with when charged.  The
      complexity of runtime allocation and the runtime translation overhead is
      no longer justified to save that *potential* 0.19% of memory.  With
      CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
      after the page->private member and doesn't even increase struct page,
      and then this patch actually saves space.  Remaining users that care can
      still compile their kernels without CONFIG_MEMCG.
      
           text    data     bss     dec     hex     filename
        8828345 1725264  983040 11536649 b00909  vmlinux.old
        8827425 1725264  966656 11519345 afc571  vmlinux.new
      
      [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1306a85a
    • J
      kernel: res_counter: remove the unused API · 5b1efc02
      Johannes Weiner 提交于
      All memory accounting and limiting has been switched over to the
      lockless page counters.  Bye, res_counter!
      
      [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
      [mhocko@suse.cz: ditch the last remainings of res_counter]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b1efc02
    • J
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner 提交于
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144-threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitely requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim upto 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
  13. 07 6月, 2014 1 次提交
    • M
      vmscan: memcg: always use swappiness of the reclaimed memcg · 688eb988
      Michal Hocko 提交于
      Memory reclaim always uses swappiness of the reclaim target memcg
      (origin of the memory pressure) or vm_swappiness for global memory
      reclaim.  This behavior was consistent (except for difference between
      global and hard limit reclaim) because swappiness was enforced to be
      consistent within each memcg hierarchy.
      
      After "mm: memcontrol: remove hierarchy restrictions for swappiness and
      oom_control" each memcg can have its own swappiness independent of
      hierarchical parents, though, so the consistency guarantee is gone.
      This can lead to an unexpected behavior.  Say that a group is explicitly
      configured to not swapout by memory.swappiness=0 but its memory gets
      swapped out anyway when the memory pressure comes from its parent with a
      It is also unexpected that the knob is meaningless without setting the
      hard limit which would trigger the reclaim and enforce the swappiness.
      There are setups where the hard limit is configured higher in the
      hierarchy by an administrator and children groups are under control of
      somebody else who is interested in the swapout behavior but not
      necessarily about the memory limit.
      
      From a semantic point of view swappiness is an attribute defining anon
      vs.
       file proportional scanning of LRU which is memcg specific (unlike
      charges which are propagated up the hierarchy) so it should be applied
      to the particular memcg's LRU regardless where the memory pressure comes
      from.
      
      This patch removes vmscan_swappiness() and stores the swappiness into
      the scan_control structure.  mem_cgroup_swappiness is then used to
      provide the correct value before shrink_lruvec is called.  The global
      vm_swappiness is used for the root memcg.
      
      [hughd@google.com: oopses immediately when booted with cgroup_disable=memory]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      688eb988
  14. 05 6月, 2014 2 次提交
  15. 17 5月, 2014 1 次提交
    • M
      memcg: remove tasks/children test from mem_cgroup_force_empty() · f61c42a7
      Michal Hocko 提交于
      Tejun has correctly pointed out that tasks/children test in
      mem_cgroup_force_empty is not correct because there is no other locking
      which preserves this state throughout the rest of the function so both
      new tasks can join the group or new children groups can be added while
      somebody is writing to memory.force_empty. A new task would break
      mem_cgroup_reparent_charges expectation that all failures as described
      by mem_cgroup_force_empty_list are temporal and there is no way out.
      
      The main use case for the knob as described by
      Documentation/cgroups/memory.txt is to:
      "
        The typical use case for this interface is before calling rmdir().
        Because rmdir() moves all pages to parent, some out-of-use page caches can be
        moved to the parent. If you want to avoid that, force_empty will be useful.
      "
      
      This means that reparenting is not really required as rmdir will
      reparent pages implicitly from the safe context. If we remove it from
      mem_cgroup_force_empty then we are safe even with existing tasks because
      the number of reclaim attempts is bounded. Moreover the knob still does
      what the documentation claims (modulo reparenting which doesn't make any
      difference) and users might expect. Longterm we want to deprecate the
      whole knob and put the reparented pages to the tail of parent LRU during
      cgroup removal.
      
      tj: Removed unused variable @cgrp from mem_cgroup_force_empty()
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f61c42a7
  16. 31 12月, 2013 1 次提交
  17. 13 11月, 2013 1 次提交
    • Y
      memcg: support hierarchical memory.numa_stats · 071aee13
      Ying Han 提交于
      The memory.numa_stat file was not hierarchical.  Memory charged to the
      children was not shown in parent's numa_stat.
      
      This change adds the "hierarchical_" stats to the existing stats.  The
      new hierarchical stats include the sum of all children's values in
      addition to the value of the memcg.
      
      Tested: Create cgroup a, a/b and run workload under b.  The values of
      b are included in the "hierarchical_*" under a.
      
      $ cd /sys/fs/cgroup
      $ echo 1 > memory.use_hierarchy
      $ mkdir a a/b
      
      Run workload in a/b:
      $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &
      
      The hierarchical_ fields in parent (a) show use of workload in a/b:
      $ cat a/memory.numa_stat
      total=0 N0=0 N1=0 N2=0 N3=0
      file=0 N0=0 N1=0 N2=0 N3=0
      anon=0 N0=0 N1=0 N2=0 N3=0
      unevictable=0 N0=0 N1=0 N2=0 N3=0
      hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
      hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
      hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
      hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
      
      $ cat a/b/memory.numa_stat
      total=908 N0=552 N1=317 N2=39 N3=0
      file=850 N0=549 N1=301 N2=0 N3=0
      anon=58 N0=3 N1=16 N2=39 N3=0
      unevictable=0 N0=0 N1=0 N2=0 N3=0
      hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
      hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
      hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
      hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0
      Signed-off-by: NYing Han <yinghan@google.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      071aee13
  18. 13 9月, 2013 1 次提交
  19. 04 7月, 2013 1 次提交
  20. 24 6月, 2013 1 次提交
  21. 28 5月, 2013 1 次提交
  22. 08 5月, 2013 1 次提交
  23. 30 4月, 2013 1 次提交
    • A
      memcg: add memory.pressure_level events · 70ddf637
      Anton Vorontsov 提交于
      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shutdown unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure, the system might be making swap, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing, it is
      about to out of memory (OOM) or even the in-kernel OOM killer is on its
      way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or, ask us to
      implement the pass-through events, explaining why would you need them.)
      
      Performance wise, the memory pressure notifications feature itself is
      lightweight and does not require much of bookkeeping, in contrast to the
      rest of memcg features.  Unfortunately, as of current memcg
      implementation, pages accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
      option, so it will not require any changes on the userland side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
      Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ddf637
  24. 27 3月, 2013 1 次提交
  25. 19 12月, 2012 2 次提交
  26. 12 12月, 2012 1 次提交
  27. 17 11月, 2012 1 次提交
  28. 09 10月, 2012 1 次提交
  29. 01 8月, 2012 2 次提交
  30. 30 5月, 2012 3 次提交
  31. 13 4月, 2012 1 次提交
  32. 13 1月, 2012 1 次提交