1. 13 1月, 2012 13 次提交
    • J
      mm: make per-memcg LRU lists exclusive · 925b7673
      Johannes Weiner 提交于
      Now that all code that operated on global per-zone LRU lists is
      converted to operate on per-memory cgroup LRU lists instead, there is no
      reason to keep the double-LRU scheme around any longer.
      
      The pc->lru member is removed and page->lru is linked directly to the
      per-memory cgroup LRU lists, which removes two pointers from a
      descriptor that exists for every page frame in the system.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NYing Han <yinghan@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      925b7673
    • J
      mm: collect LRU list heads into struct lruvec · 6290df54
      Johannes Weiner 提交于
      Having a unified structure with a LRU list set for both global zones and
      per-memcg zones allows to keep that code simple which deals with LRU
      lists and does not care about the container itself.
      
      Once the per-memcg LRU lists directly link struct pages, the isolation
      function and all other list manipulations are shared between the memcg
      case and the global LRU case.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6290df54
    • J
      mm: vmscan: convert global reclaim to per-memcg LRU lists · b95a2f2d
      Johannes Weiner 提交于
      The global per-zone LRU lists are about to go away on memcg-enabled
      kernels, global reclaim must be able to find its pages on the per-memcg
      LRU lists.
      
      Since the LRU pages of a zone are distributed over all existing memory
      cgroups, a scan target for a zone is complete when all memory cgroups
      are scanned for their proportional share of a zone's memory.
      
      The forced scanning of small scan targets from kswapd is limited to
      zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
      force-scanning the LRU lists of multiple memory cgroups.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b95a2f2d
    • J
      mm: memcg: remove optimization of keeping the root_mem_cgroup LRU lists empty · ad2b8e60
      Johannes Weiner 提交于
      root_mem_cgroup, lacking a configurable limit, was never subject to
      limit reclaim, so the pages charged to it could be kept off its LRU
      lists.  They would be found on the global per-zone LRU lists upon
      physical memory pressure and it made sense to avoid uselessly linking
      them to both lists.
      
      The global per-zone LRU lists are about to go away on memcg-enabled
      kernels, with all pages being exclusively linked to their respective
      per-memcg LRU lists.  As a result, pages of the root_mem_cgroup must
      also be linked to its LRU lists again.  This is purely about the LRU
      list, root_mem_cgroup is still not charged.
      
      The overhead is temporary until the double-LRU scheme is going away
      completely.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2b8e60
    • J
      mm: move memcg hierarchy reclaim to generic reclaim code · 5660048c
      Johannes Weiner 提交于
      Memory cgroup limit reclaim and traditional global pressure reclaim will
      soon share the same code to reclaim from a hierarchical tree of memory
      cgroups.
      
      In preparation of this, move the two right next to each other in
      shrink_zone().
      
      The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
      limit reclaim function, which still does hierarchy walking on its own,
      and a limit (shrinking) reclaim function, which relies on generic
      reclaim code to walk the hierarchy.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5660048c
    • J
      mm: memcg: per-priority per-zone hierarchy scan generations · 527a5ec9
      Johannes Weiner 提交于
      Memory cgroup limit reclaim currently picks one memory cgroup out of the
      target hierarchy, remembers it as the last scanned child, and reclaims
      all zones in it with decreasing priority levels.
      
      The new hierarchy reclaim code will pick memory cgroups from the same
      hierarchy concurrently from different zones and priority levels, it
      becomes necessary that hierarchy roots not only remember the last
      scanned child, but do so for each zone and priority level.
      
      Until now, we reclaimed memcgs like this:
      
          mem = mem_cgroup_iter(root)
          for each priority level:
            for each zone in zonelist:
              reclaim(mem, zone)
      
      But subsequent patches will move the memcg iteration inside the loop
      over the zones:
      
          for each priority level:
            for each zone in zonelist:
              mem = mem_cgroup_iter(root)
              reclaim(mem, zone)
      
      And to keep with the original scan order - memcg -> priority -> zone -
      the last scanned memcg has to be remembered per zone and per priority
      level.
      
      Furthermore, global reclaim will be switched to the hierarchy walk as
      well.  Different from limit reclaim, which can just recheck the limit
      after some reclaim progress, its target is to scan all memcgs for the
      desired zone pages, proportional to the memcg size, and so reliably
      detecting a full hierarchy round-trip will become crucial.
      
      Currently, the code relies on one reclaimer encountering the same memcg
      twice, but that is error-prone with concurrent reclaimers.  Instead, use
      a generation counter that is increased every time the child with the
      highest ID has been visited, so that reclaimers can stop when the
      generation changes.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      527a5ec9
    • J
      mm: vmscan: distinguish between memcg triggering reclaim and memcg being scanned · f16015fb
      Johannes Weiner 提交于
      Memory cgroup hierarchies are currently handled completely outside of
      the traditional reclaim code, which is invoked with a single memory
      cgroup as an argument for the whole call stack.
      
      Subsequent patches will switch this code to do hierarchical reclaim, so
      there needs to be a distinction between a) the memory cgroup that is
      triggering reclaim due to hitting its limit and b) the memory cgroup
      that is being scanned as a child of a).
      
      This patch introduces a struct mem_cgroup_zone that contains the
      combination of the memory cgroup and the zone being scanned, which is
      then passed down the stack instead of the zone argument.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f16015fb
    • J
      mm: vmscan: distinguish global reclaim from global LRU scanning · 89b5fae5
      Johannes Weiner 提交于
      The traditional zone reclaim code is scanning the per-zone LRU lists
      during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
      lists when reclaiming on behalf of a memory cgroup limit.
      
      Subsequent patches will convert the traditional reclaim code to reclaim
      exclusively from the per-memory cgroup LRU lists.  As a result, using
      the predicate for which LRU list is scanned will no longer be
      appropriate to tell global reclaim from limit reclaim.
      
      This patch adds a global_reclaim() predicate to tell direct/kswapd
      reclaim from memory cgroup limit reclaim and substitutes it in all
      places where currently scanning_global_lru() is used for that.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89b5fae5
    • J
      mm: memcg: consolidate hierarchy iteration primitives · 9f3a0d09
      Johannes Weiner 提交于
      The memcg naturalization series:
      
      Memory control groups are currently bolted onto the side of
      traditional memory management in places where better integration would
      be preferrable.  To reclaim memory, for example, memory control groups
      maintain their own LRU list and reclaim strategy aside from the global
      per-zone LRU list reclaim.  But an extra list head for each existing
      page frame is expensive and maintaining it requires additional code.
      
      This patchset disables the global per-zone LRU lists on memory cgroup
      configurations and converts all its users to operate on the per-memory
      cgroup lists instead.  As LRU pages are then exclusively on one list,
      this saves two list pointers for each page frame in the system:
      
      page_cgroup array size with 4G physical memory
      
        vanilla: allocated 31457280 bytes of page_cgroup
        patched: allocated 15728640 bytes of page_cgroup
      
      At the same time, system performance for various workloads is
      unaffected:
      
      100G sparse file cat, 4G physical memory, 10 runs, to test for code
      bloat in the traditional LRU handling and kswapd & direct reclaim
      paths, without/with the memory controller configured in
      
        vanilla: 71.603(0.207) seconds
        patched: 71.640(0.156) seconds
      
        vanilla: 79.558(0.288) seconds
        patched: 77.233(0.147) seconds
      
      100G sparse file cat in 1G memory cgroup, 10 runs, to test for code
      bloat in the traditional memory cgroup LRU handling and reclaim path
      
        vanilla: 96.844(0.281) seconds
        patched: 94.454(0.311) seconds
      
      4 unlimited memcgs running kbuild -j32 each, 4G physical memory, 500M
      swap on SSD, 10 runs, to test for regressions in kswapd & direct
      reclaim using per-memcg LRU lists with multiple memcgs and multiple
      allocators within each memcg
      
        vanilla: 717.722(1.440) seconds [ 69720.100(11600.835) majfaults ]
        patched: 714.106(2.313) seconds [ 71109.300(14886.186) majfaults ]
      
      16 unlimited memcgs running kbuild, 1900M hierarchical limit, 500M
      swap on SSD, 10 runs, to test for regressions in hierarchical memcg
      setups
      
        vanilla: 2742.058(1.992) seconds [ 26479.600(1736.737) majfaults ]
        patched: 2743.267(1.214) seconds [ 27240.700(1076.063) majfaults ]
      
      This patch:
      
      There are currently two different implementations of iterating over a
      memory cgroup hierarchy tree.
      
      Consolidate them into one worker function and base the convenience
      looping-macros on top of it.
      Signed-off-by: NJohannes Weiner <jweiner@redhat.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f3a0d09
    • K
      memcg: add mem_cgroup_replace_page_cache() to fix LRU issue · ab936cbc
      KAMEZAWA Hiroyuki 提交于
      Commit ef6a3c63 ("mm: add replace_page_cache_page() function") added a
      function replace_page_cache_page().  This function replaces a page in the
      radix-tree with a new page.  WHen doing this, memory cgroup needs to fix
      up the accounting information.  memcg need to check PCG_USED bit etc.
      
      In some(many?) cases, 'newpage' is on LRU before calling
      replace_page_cache().  So, memcg's LRU accounting information should be
      fixed, too.
      
      This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
       In that function, old pages will be unaccounted without touching
      res_counter and new page will be accounted to the memcg (of old page).
      WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
      races with LRU handling.
      
      Background:
        replace_page_cache_page() is called by FUSE code in its splice() handling.
        Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
        page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
        because rmdir() checks the whole LRU is empty and there is no account leak.
        If a page is on the other LRU than it should be, rmdir() will fail.
      
      This bug was added in March 2011, but no bug report yet.  I guess there
      are not many people who use memcg and FUSE at the same time with upstream
      kernels.
      
      The result of this bug is that admin cannot destroy a memcg because of
      account leak.  So, no panic, no deadlock.  And, even if an active cgroup
      exist, umount can succseed.  So no problem at shutdown.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab936cbc
    • H
      mm,x86,um: move CMPXCHG_DOUBLE config option · 2565409f
      Heiko Carstens 提交于
      Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
      can simply select the option if it is supported.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2565409f
    • H
      mm,x86,um: move CMPXCHG_LOCAL config option · 4156153c
      Heiko Carstens 提交于
      Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
      can simply select the option if it is supported.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4156153c
    • H
      mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL · 43570fd2
      Heiko Carstens 提交于
      While implementing cmpxchg_double() on s390 I realized that we don't set
      CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.
      
      However setting that option will increase the size of struct page by
      eight bytes on 64 bit, which we certainly do not want.  Also, it doesn't
      make sense that a present cpu feature should increase the size of struct
      page.
      
      Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and
      that it should depend on CMPXCHG_DOUBLE instead.
      
      This patch:
      
      If an architecture supports CMPXCHG_LOCAL this shouldn't result
      automatically in larger struct pages if the SLUB allocator is used.
      Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which
      can be selected if a double word aligned struct page is required.  Also
      update x86 Kconfig so that it should work as before.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43570fd2
  2. 11 1月, 2012 27 次提交