1. 20 10月, 2008 5 次提交
  2. 29 9月, 2008 1 次提交
    • B
      mm owner: fix race between swapoff and exit · 31a78f23
      Balbir Singh 提交于
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via mm_owner_changed callback(),
      if no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31a78f23
  3. 23 9月, 2008 1 次提交
    • D
      memcg: check under limit at shrink_usage · a10cebf5
      Daisuke Nishimura 提交于
      Current memory cgroup(both in mainline and -mm) doesn't account swap
      caches as memory(swap cache support is dropped temporarily now).
      
      So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that
      have been moved to swap cache.
      
      But this makes mem_cgroup_shrink_usage fail easily if most of the pages
      are anon/shmem, and then shmem_getpage returns -ENOMEM and the process
      will be killed.
      
      This patch adds res_counter_check_under_limit to avoid these cases.
      
      BTW, even if swap cache support is enabled again, if a process is moved to
      another cgroup, which has been just made, between precharge and
      shrink_usage in shmem_getpage, shrink_usage may fail just because there is
      no pages to reclaim.
      
      So this change would make sense anyway.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a10cebf5
  4. 13 8月, 2008 1 次提交
  5. 31 7月, 2008 1 次提交
  6. 26 7月, 2008 10 次提交
    • K
      memcg: limit change shrink usage · 628f4235
      KAMEZAWA Hiroyuki 提交于
      Shrinking memory usage at limit change.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      628f4235
    • L
      memcg: clean up checking of the disabled flag · cede86ac
      Li Zefan 提交于
      Those checks are unnecessary, because when the subsystem is disabled
      it can't be mounted, so those functions won't get called.
      
      The check is needed in functions which will be called in other places
      except cgroup.
      
      [hugh@veritas.com: further checking of disabled flag]
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cede86ac
    • K
      memcg: remove a redundant check · accf163e
      KAMEZAWA Hiroyuki 提交于
      Because of remove refcnt patch, it's very rare case to that
      mem_cgroup_charge_common() is called against a page which is accounted.
      
      mem_cgroup_charge_common() is called when.
       1. a page is added into file cache.
       2. an anon page is _newly_ mapped.
      
      A racy case is that a newly-swapped-in anonymous page is referred from
      prural threads in do_swap_page() at the same time.
      (a page is not Locked when mem_cgroup_charge() is called from do_swap_page.)
      
      Another case is shmem. It charges its page before calling add_to_page_cache().
      Then, mem_cgroup_charge_cache() is called twice. This case is handled in
      mem_cgroup_cache_charge(). But this check may be too hacky...
      
      Signed-off-by : KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accf163e
    • K
      memcg: add hints for branch · b76734e5
      KAMEZAWA Hiroyuki 提交于
      Showing brach direction for obvious conditions.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b76734e5
    • K
      memcg: helper function for relcaim from shmem. · c9b0ed51
      KAMEZAWA Hiroyuki 提交于
      A new call, mem_cgroup_shrink_usage() is added for shmem handling and
      relacing non-standard usage of mem_cgroup_charge/uncharge.
      
      Now, shmem calls mem_cgroup_charge() just for reclaim some pages from
      mem_cgroup.  In general, shmem is used by some process group and not for
      global resource (like file caches).  So, it's reasonable to reclaim pages
      from mem_cgroup where shmem is mainly used.
      
      [hugh@veritas.com: shmem_getpage release page sooner]
      [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b0ed51
    • K
      memcg: remove refcnt from page_cgroup · 69029cd5
      KAMEZAWA Hiroyuki 提交于
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely ppatch
       5/5 ... remove redundunt check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which was used only for
      refcnt mangement.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69029cd5
    • K
      memcg: better migration handling · e8589cc1
      KAMEZAWA Hiroyuki 提交于
      This patch changes page migration under memory controller to use a
      different algorithm.  (thanks to Christoph for new idea.)
      
      Before:
       - page_cgroup is migrated from an old page to a new page.
      After:
       - a new page is accounted , no reuse of page_cgroup.
      
      Pros:
      
       - We can avoid compliated lock depndencies and races in migration.
      
      Cons:
      
       - new param to mem_cgroup_charge_common().
      
       - mem_cgroup_getref() is added for handling ref_cnt ping-pong.
      
      This version simplifies complicated lock dependency in page migraiton
      under memory resource controller.
      
        new refcnt sequence is following.
      
      a mapped page:
        prepage_migration() ..... +1 to NEW page
        try_to_unmap()      ..... all refs to OLD page is gone.
        move_pages()        ..... +1 to NEW page if page cache.
        remap...            ..... all refs from *map* is added to NEW one.
        end_migration()     ..... -1 to New page.
      
        page's mapcount + (page_is_cache) refs are added to NEW one.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8589cc1
    • K
      memcg: avoid unnecessary initialization · 508b7be0
      KAMEZAWA Hiroyuki 提交于
      * remove over-killing initialization (in fast path)
      * makeing the condition for PAGE_CGROUP_FLAG_ACTIVE be more obvious.
      Signed-off-by: NKAMEAZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      508b7be0
    • K
      memcg: make global var read_mostly · a181b0e8
      KAMEZAWA Hiroyuki 提交于
      mem_cgroup_subsys and page_cgroup_cache should be read_mostly and
      MEM_CGROUP_RECLAIM_RETRIES can be just a fixed number.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a181b0e8
    • P
      cgroup files: convert res_counter_write() to be a cgroups write_string() handler · 856c13aa
      Paul Menage 提交于
      Currently res_counter_write() is a raw file handler even though it's
      ultimately taking a number, since in some cases it wants to
      pre-process the string when converting it to a number.
      
      This patch converts res_counter_write() from a raw file handler to a
      write_string() handler; this allows some of the boilerplate
      copying/locking/checking to be removed, and simplies the cleanup path,
      since these functions are now performed by the cgroups framework.
      
      [lizf@cn.fujitsu.com: build fix]
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      856c13aa
  7. 01 5月, 2008 1 次提交
  8. 29 4月, 2008 12 次提交
  9. 09 4月, 2008 1 次提交
  10. 05 4月, 2008 1 次提交
    • B
      memory controller: make memory resource control aware of boot options · 4077960e
      Balbir Singh 提交于
      A boot option for the memory controller was discussed on lkml.  It is a good
      idea to add it, since it saves memory for people who want to turn off the
      memory controller.
      
      By default the option is on for the following two reasons:
      
      1. It provides compatibility with the current scheme where the memory
         controller turns on if the config option is enabled
      2. It allows for wider testing of the memory controller, once the config
         option is enabled
      
      We still allow the create, destroy callbacks to succeed, since they are not
      aware of boot options.  We do not populate the directory will memory resource
      controller specific files.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4077960e
  11. 20 3月, 2008 1 次提交
  12. 05 3月, 2008 5 次提交
    • H
      memcg: fix oops on NULL lru list · fb59e9f1
      Hugh Dickins 提交于
      While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
      called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
       I couldn't see what racing tasks on other cpus were doing, but surmise that
      another must have been in mem_cgroup_charge_common on the same page, between
      its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
      which I'd almost changed to a kmalloc).
      
      Normally such a race cannot happen, the ref_cnt prevents it, the final
      uncharge cannot race with the initial charge.  But force_empty buggers the
      ref_cnt, that's what it's all about; and thereafter forced pages are
      vulnerable to races such as this (just think of a shared page also mapped into
      an mm of another mem_cgroup than that just emptied).  And remain vulnerable
      until they're freed indefinitely later.
      
      This patch just fixes the oops by moving the unlock_page_cgroups down below
      adding to and removing from the list (only possible given the previous patch);
      and while we're at it, we might as well make it an invariant that
      page->page_cgroup is always set while pc is on lru.
      
      But this behaviour of force_empty seems highly unsatisfactory to me: why have
      a ref_cnt if we always have to cope with it being violated (as in the earlier
      page migration patch).  We may prefer force_empty to move pages to an orphan
      mem_cgroup (could be the root, but better not), from which other cgroups could
      recover them; we might need to reverse the locking again; but no time now for
      such concerns.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fb59e9f1
    • H
      memcg: simplify force_empty and move_lists · 9b3c0a07
      Hirokazu Takahashi 提交于
      As for force_empty, though this may not be the main topic here,
      mem_cgroup_force_empty_list() can be implemented simpler.  It is possible to
      make the function just call mem_cgroup_uncharge_page() instead of releasing
      page_cgroups by itself.  The tip is to call get_page() before invoking
      mem_cgroup_uncharge_page(), so the page won't be released during this
      function.
      
      Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
      the page might have been reassigned to an lru of a different mem_cgroup, and
      now be emptied from that; but Hugh claims that's okay, the end state is the
      same as when it hasn't gone to another list.
      
      And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
      mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
      holding page_cgroup lock (but still has to use try_lock_page_cgroup).
      Signed-off-by: NHirokazu Takahashi <taka@valinux.co.jp>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9b3c0a07
    • H
      memcg: fix mem_cgroup_move_lists locking · 2680eed7
      Hugh Dickins 提交于
      Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
      into page freeing, I've hit it from time to time in testing on some machines,
      sometimes only after many days.  Recently found a machine which could usually
      produce it within a few hours, which got me there at last.
      
      The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
      arrangement of structures was such that you got page_cgroups from the lru list
      neatly put on to SLUB's freelist.  Kamezawa-san identified the same hole
      independently.
      
      The main problem was that it was missing the lock_page_cgroup it needs to
      safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
      couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected.  See the code for
      comments on the constraints.
      
      This patch immediately gets replaced by a simpler one from Hirokazu-san; but
      is it just foolish pride that tells me to put this one on record, in case we
      need to come back to it later?
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2680eed7
    • H
      memcg: css_put after remove_list · 6d48ff8b
      Hugh Dickins 提交于
      mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
      it, and before removing page_cgroup from one of its lru lists: isn't there a
      danger that struct mem_cgroup memory could be freed and reused before
      completing that, so corrupting something?  Never seen it, and for all I know
      there may be other constraints which make it impossible; but let's be
      defensive and reverse the ordering there.
      
      mem_cgroup_force_empty_list is safe because there's an extra css_get around
      all its works; but even so, change its ordering the same way round, to help
      get in the habit of doing it like this.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d48ff8b
    • H
      memcg: remove clear_page_cgroup and atomics · b9c565d5
      Hugh Dickins 提交于
      Remove clear_page_cgroup: it's an unhelpful helper, see for example how
      mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
      (serious races from that?  I'm not sure).
      
      Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
      atomic: it's always manipulated under lock_page_cgroup, except where
      force_empty unilaterally reset it to 0 (and how does uncharge's
      atomic_dec_and_test protect against that?).
      
      Simplify this page_cgroup locking: if you've got the lock and the pc is
      attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
      check that pc->page matches page (we're on the way to finding why sometimes it
      doesn't, but this patch doesn't fix that).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9c565d5