1. 15 4月, 2011 1 次提交
  2. 24 3月, 2011 1 次提交
  3. 23 3月, 2011 2 次提交
  4. 21 1月, 2011 1 次提交
  5. 14 1月, 2011 2 次提交
  6. 11 8月, 2010 1 次提交
  7. 10 8月, 2010 2 次提交
    • D
      oom: badness heuristic rewrite · a63d83f4
      David Rientjes 提交于
      This a complete rewrite of the oom killer's badness() heuristic which is
      used to determine which task to kill in oom conditions.  The goal is to
      make it as simple and predictable as possible so the results are better
      understood and we end up killing the task which will lead to the most
      memory freeing while still respecting the fine-tuning from userspace.
      
      Instead of basing the heuristic on mm->total_vm for each task, the task's
      rss and swap space is used instead.  This is a better indication of the
      amount of memory that will be freeable if the oom killed task is chosen
      and subsequently exits.  This helps specifically in cases where KDE or
      GNOME is chosen for oom kill on desktop systems instead of a memory
      hogging task.
      
      The baseline for the heuristic is a proportion of memory that each task is
      currently using in memory plus swap compared to the amount of "allowable"
      memory.  "Allowable," in this sense, means the system-wide resources for
      unconstrained oom conditions, the set of mempolicy nodes, the mems
      attached to current's cpuset, or a memory controller's limit.  The
      proportion is given on a scale of 0 (never kill) to 1000 (always kill),
      roughly meaning that if a task has a badness() score of 500 that the task
      consumes approximately 50% of allowable memory resident in RAM or in swap
      space.
      
      The proportion is always relative to the amount of "allowable" memory and
      not the total amount of RAM systemwide so that mempolicies and cpusets may
      operate in isolation; they shall not need to know the true size of the
      machine on which they are running if they are bound to a specific set of
      nodes or mems, respectively.
      
      Root tasks are given 3% extra memory just like __vm_enough_memory()
      provides in LSMs.  In the event of two tasks consuming similar amounts of
      memory, it is generally better to save root's task.
      
      Because of the change in the badness() heuristic's baseline, it is also
      necessary to introduce a new user interface to tune it.  It's not possible
      to redefine the meaning of /proc/pid/oom_adj with a new scale since the
      ABI cannot be changed for backward compatability.  Instead, a new tunable,
      /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
      be used to polarize the heuristic such that certain tasks are never
      considered for oom kill while others may always be considered.  The value
      is added directly into the badness() score so a value of -500, for
      example, means to discount 50% of its memory consumption in comparison to
      other tasks either on the system, bound to the mempolicy, in the cpuset,
      or sharing the same memory controller.
      
      /proc/pid/oom_adj is changed so that its meaning is rescaled into the
      units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
      these per-task tunables will rescale the value of the other to an
      equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
      a bitshift on the badness score, it now shares the same linear growth as
      /proc/pid/oom_score_adj but with different granularity.  This is required
      so the ABI is not broken with userspace applications and allows oom_adj to
      be deprecated for future removal.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a63d83f4
    • K
      vmscan: kill prev_priority completely · 25edde03
      KOSAKI Motohiro 提交于
      Since 2.6.28 zone->prev_priority is unused. Then it can be removed
      safely. It reduce stack usage slightly.
      
      Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
      can be integrate again, it's useful. but four (or more) times trying
      haven't got good performance number. Thus I give up such approach.
      
      The rest of this changelog is notes on prev_priority and why it existed in
      the first place and why it might be not necessary any more. This information
      is based heavily on discussions between Andrew Morton, Rik van Riel and
      Kosaki Motohiro who is heavily quotes from.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list reducing the liklihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority prev_priority wouldn't works
      nowadays. First of all, current vmscan still a lot of UP centric code. it
      expose some weakness on some dozens CPUs machine. I think we need more and
      more improvement.
      
      The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
      and per-task-pressure a bit. example, prev_priority try to boost priority to
      other concurrent priority. but if the another task have mempolicy restriction,
      it is unnecessary, but also makes wrong big latency and exceeding reclaim.
      per-task based priority + prev_priority adjustment make the emulation of
      per-system pressure. but it have two issue 1) too rough and brutal emulation
      2) we need per-zone pressure, not per-system.
      
      Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
      2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
      but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
      system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
      prev_priority can't solve such multithreads workload issue. In other word,
      prev_priority concept assume the sysmtem don't have lots threads."
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25edde03
  8. 28 5月, 2010 1 次提交
    • A
      memcg: fix mis-accounting of file mapped racy with migration · ac39cf8c
      akpm@linux-foundation.org 提交于
      FILE_MAPPED per memcg of migrated file cache is not properly updated,
      because our hook in page_add_file_rmap() can't know to which memcg
      FILE_MAPPED should be counted.
      
      Basically, this patch is for fixing the bug but includes some big changes
      to fix up other messes.
      
      Now, at migrating mapped file, events happen in following sequence.
      
       1. allocate a new page.
       2. get memcg of an old page.
       3. charge ageinst a new page before migration. But at this point,
          no changes to new page's page_cgroup, no commit for the charge.
          (IOW, PCG_USED bit is not set.)
       4. page migration replaces radix-tree, old-page and new-page.
       5. page migration remaps the new page if the old page was mapped.
       6. Here, the new page is unlocked.
       7. memcg commits the charge for newpage, Mark the new page's page_cgroup
          as PCG_USED.
      
      Because "commit" happens after page-remap, we can count FILE_MAPPED
      at "5", because we should avoid to trust page_cgroup->mem_cgroup.
      if PCG_USED bit is unset.
      (Note: memcg's LRU removal code does that but LRU-isolation logic is used
       for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
       not on LRU or page_cgroup->mem_cgroup is NULL.)
      
      We can lose file_mapped accounting information at 5 because FILE_MAPPED
      is updated only when mapcount changes 0->1. So we should catch it.
      
      BTW, historically, above implemntation comes from migration-failure
      of anonymous page. Because we charge both of old page and new page
      with mapcount=0, we can't catch
        - the page is really freed before remap.
        - migration fails but it's freed before remap
      or .....corner cases.
      
      New migration sequence with memcg is:
      
       1. allocate a new page.
       2. mark PageCgroupMigration to the old page.
       3. charge against a new page onto the old page's memcg. (here, new page's pc
          is marked as PageCgroupUsed.)
       4. page migration replaces radix-tree, page table, etc...
       5. At remapping, new page's page_cgroup is now makrked as "USED"
          We can catch 0->1 event and FILE_MAPPED will be properly updated.
      
          And we can catch SWAPOUT event after unlock this and freeing this
          page by unmap() can be caught.
      
       7. Clear PageCgroupMigration of the old page.
      
      So, FILE_MAPPED will be correctly updated.
      
      Then, for what MIGRATION flag is ?
        Without it, at migration failure, we may have to charge old page again
        because it may be fully unmapped. "charge" means that we have to dive into
        memory reclaim or something complated. So, it's better to avoid
        charge it again. Before this patch, __commit_charge() was working for
        both of the old/new page and fixed up all. But this technique has some
        racy condtion around FILE_MAPPED and SWAPOUT etc...
        Now, the kernel use MIGRATION flag and don't uncharge old page until
        the end of migration.
      
      I hope this change will make memcg's page migration much simpler.  This
      page migration has caused several troubles.  Worth to add a flag for
      simplification.
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reported-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac39cf8c
  9. 25 5月, 2010 1 次提交
  10. 13 3月, 2010 1 次提交
  11. 16 12月, 2009 4 次提交
    • K
      memcg: make memcg's file mapped consistent with global VM · d8046582
      KAMEZAWA Hiroyuki 提交于
      In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE.  This makes
      grep difficult.  Replace memcg's MAPPED_FILE with FILE_MAPPED
      
      And in global VM, mapped shared memory is accounted into FILE_MAPPED.
      But memcg doesn't. fix it.
      Note:
        page_is_file_cache() just checks SwapBacked or not.
        So, we need to check PageAnon.
      
      Cc: Balbir Singh <balbir@in.ibm.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8046582
    • K
      memcg: coalesce uncharge during unmap/truncate · 569b846d
      KAMEZAWA Hiroyuki 提交于
      In massive parallel enviroment, res_counter can be a performance
      bottleneck.  One strong techinque to reduce lock contention is reducing
      calls by coalescing some amount of calls into one.
      
      Considering charge/uncharge chatacteristic,
      	- charge is done one by one via demand-paging.
      	- uncharge is done by
      		- in chunk at munmap, truncate, exit, execve...
      		- one by one via vmscan/paging.
      
      It seems we have a chance to coalesce uncharges for improving scalability
      at unmap/truncation.
      
      This patch is a for coalescing uncharge.  For avoiding scattering memcg's
      structure to functions under /mm, this patch adds memcg batch uncharge
      information to the task.  A reason for per-task batching is for making use
      of caller's context information.  We do batched uncharge (deleyed
      uncharge) when truncation/unmap occurs but do direct uncharge when
      uncharge is called by memory reclaim (vmscan.c).
      
      The degree of coalescing depends on callers
        - at invalidate/trucate... pagevec size
        - at unmap ....ZAP_BLOCK_SIZE
      (memory itself will be freed in this degree.)
      Then, we'll not coalescing too much.
      
      On x86-64 8cpu server, I tested overheads of memcg at page fault by
      running a program which does map/fault/unmap in a loop. Running
      a task per a cpu by taskset and see sum of the number of page faults
      in 60secs.
      
      [without memcg config]
        40156968  page-faults              #      0.085 M/sec   ( +-   0.046% )
        27.67 cache-miss/faults
      [root cgroup]
        36659599  page-faults              #      0.077 M/sec   ( +-   0.247% )
        31.58 miss/faults
      [in a child cgroup]
        18444157  page-faults              #      0.039 M/sec   ( +-   0.133% )
        69.96 miss/faults
      [child with this patch]
        27133719  page-faults              #      0.057 M/sec   ( +-   0.155% )
        47.16 miss/faults
      
      We can see some amounts of improvement.
      (root cgroup doesn't affected by this patch)
      Another patch for "charge" will follow this and above will be improved more.
      
      Changelog(since 2009/10/02):
       - renamed filed of memcg_batch (as pages to bytes, memsw to memsw_bytes)
       - some clean up and commentary/description updates.
       - added initialize code to copy_process(). (possible bug fix)
      
      Changelog(old):
       - fixed !CONFIG_MEM_CGROUP case.
       - rebased onto the latest mmotm + softlimit fix patches.
       - unified patch for callers
       - added commetns.
       - make ->do_batch as bool.
       - removed css_get() at el. We don't need it.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      569b846d
    • W
      memcg: add accessor to mem_cgroup.css · d324236b
      Wu Fengguang 提交于
      So that an outside user can free the reference count grabbed by
      try_get_mem_cgroup_from_page().
      
      CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      CC: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      d324236b
    • W
      memcg: rename and export try_get_mem_cgroup_from_page() · e42d9d5d
      Wu Fengguang 提交于
      So that the hwpoison injector can get mem_cgroup for arbitrary page
      and thus know whether it is owned by some mem_cgroup task(s).
      
      [AK: Merged with latest git tree]
      
      CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      CC: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      e42d9d5d
  12. 24 9月, 2009 1 次提交
  13. 19 6月, 2009 1 次提交
    • B
      memcg: add file-based RSS accounting · d69b042f
      Balbir Singh 提交于
      Add file RSS tracking per memory cgroup
      
      We currently don't track file RSS, the RSS we report is actually anon RSS.
       All the file mapped pages, come in through the page cache and get
      accounted there.  This patch adds support for accounting file RSS pages.
      It should
      
      1. Help improve the metrics reported by the memory resource controller
      2. Will form the basis for a future shared memory accounting heuristic
         that has been proposed by Kamezawa.
      
      Unfortunately, we cannot rename the existing "rss" keyword used in
      memory.stat to "anon_rss".  We however, add "mapped_file" data and hope to
      educate the end user through documentation.
      
      [hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.cn>
      Cc: Paul Menage <menage@google.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d69b042f
  14. 17 6月, 2009 1 次提交
    • R
      vmscan: evict use-once pages first · 56e49d21
      Rik van Riel 提交于
      When the file LRU lists are dominated by streaming IO pages, evict those
      pages first, before considering evicting other pages.
      
      This should be safe from deadlocks or performance problems
      because only three things can happen to an inactive file page:
      
      1) referenced twice and promoted to the active list
      2) evicted by the pageout code
      3) under IO, after which it will get evicted or promoted
      
      The pages freed in this way can either be reused for streaming IO, or
      allocated for something else.  If the pages are used for streaming IO,
      this pageout pattern continues.  Otherwise, we will fall back to the
      normal pageout pattern.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NElladan <elladan@eskimo.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56e49d21
  15. 03 5月, 2009 1 次提交
    • D
      memcg: fix mem_cgroup_shrink_usage() · ae3abae6
      Daisuke Nishimura 提交于
      Current mem_cgroup_shrink_usage() has two problems.
      
      1. It doesn't call mem_cgroup_out_of_memory and doesn't update
         last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.
      
      2. Considering hierarchy, shrinking has to be done from the
         mem_over_limit, not from the memcg which the page would be charged to.
      
      mem_cgroup_try_charge_swapin() does all of these things properly, so we
      use it and call cancel_charge_swapin when it succeeded.
      
      The name of "shrink_usage" is not appropriate for this behavior, so we
      change it too.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.cn>
      Cc: Paul Menage <menage@google.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae3abae6
  16. 22 4月, 2009 1 次提交
  17. 03 4月, 2009 3 次提交
  18. 09 1月, 2009 15 次提交