  1. 16 Apr 2008, 2 commits
    • memcg: fix oops in oom handling · e115f2d8
      Committed by Li Zefan
      When I used a test program to fork a large number of processes and immediately
      move them to a cgroup whose memory limit is low enough to trigger the oom
      killer, I got this oops:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000808
      IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
      PGD 4c95f067 PUD 4406c067 PMD 0
      Oops: 0002 [1] SMP
      CPU 2
      Modules linked in:
      
      Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5
      RIP: 0010:[<ffffffff8045c47f>]  [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
      RSP: 0018:ffff8100448c7c30  EFLAGS: 00010002
      RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3
      RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808
      RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900
      R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140
      R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200
      FS:  00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0)
      Stack:  ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0
       ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0
       ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0
      Call Trace:
       [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9
       [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2
       [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67
       [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202
       [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f
       [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe
       [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af
       [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202
       [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4
       [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429
       [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364
       [<ffffffff80210471>] ? sys_mmap+0xe5/0x111
       [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1
      
      Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00
      RIP  [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
       RSP <ffff8100448c7c30>
      CR2: 0000000000000808
      ---[ end trace c3702fa668021ea4 ]---
      
      It's reproducible on an x86_64 box, but doesn't happen on x86_32.
      
      This is because tsk->sighand is not guarded by RCU, so we have to
      hold tasklist_lock, just as out_of_memory() does.
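      
      A minimal sketch of the shape of the fix, assuming the kill still happens in
      mem_cgroup_out_of_memory() as in the trace above (not the literal patch):
      
          void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
          {
                  /*
                   * tasklist_lock keeps tsk->sighand from disappearing while
                   * the victim is being signalled, mirroring what
                   * out_of_memory() does.
                   */
                  read_lock(&tasklist_lock);
                  /* ... select a victim task in this cgroup and oom-kill it ... */
                  read_unlock(&tasklist_lock);
          }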
      Signed-off-by: Li Zefan <lizf@cn.fujitsu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e115f2d8
    • mm: sparsemem memory_present() fix · bead9a3a
      Committed by Ingo Molnar
      Fix memory corruption and crash on 32-bit x86 systems.
      
      If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of
      RAM, then we call memory_present() with a start/end that goes outside
      the scope of MAX_PHYSMEM_BITS.
      
      That causes this loop to happily walk over the limit of the sparse
      memory section map:
      
          for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
                      unsigned long section = pfn_to_section_nr(pfn);
                      struct mem_section *ms;
      
                      sparse_index_init(section, nid);
                      set_section_nid(section, nid);
      
                      ms = __nr_to_section(section);
                      if (!ms->section_mem_map)
                              ms->section_mem_map = sparse_encode_early_nid(nid) |
      			                                SECTION_MARKED_PRESENT;
      
      'ms' will be out of bounds and we'll corrupt a small amount of memory by
      encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.
      
      The corruption might happen when encoding a non-zero node ID, or due to
      SECTION_MARKED_PRESENT, which is 0x1:
      
      	mmzone.h:#define	SECTION_MARKED_PRESENT	(1UL<<0)
      
      The fix is to sanity check anything the architecture passes to
      sparsemem.
      
      This bug seems to be rather old (as old as sparsemem support itself),
      but the exact incarnation depended on random details like configs, which
      made this bug more prominent in v2.6.25-to-be.
      
      An additional enhancement might be to print a warning about ignored or
      trimmed memory ranges.
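      
      A hedged sketch of such a check at the top of memory_present(), using the
      usual sparsemem constants (not necessarily the exact patch):
      
          void __init memory_present(int nid, unsigned long start, unsigned long end)
          {
                  unsigned long max_arch_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);
                  unsigned long pfn;
      
                  /* Trim or ignore ranges beyond what MAX_PHYSMEM_BITS can address. */
                  if (unlikely(start >= max_arch_pfn))
                          return;
                  if (unlikely(end > max_arch_pfn))
                          end = max_arch_pfn;
      
                  for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
                          /* ... existing loop body quoted above ... */
                  }
          }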
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Tested-by: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Yinghai Lu <Yinghai.Lu@sun.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bead9a3a
  2. 09 Apr 2008, 1 commit
  3. 05 Apr 2008, 1 commit
    • memory controller: make memory resource control aware of boot options · 4077960e
      Committed by Balbir Singh
      A boot option for the memory controller was discussed on lkml.  It is a good
      idea to add it, since it saves memory for people who want to turn off the
      memory controller.
      
      By default the option is on for the following two reasons:
      
      1. It provides compatibility with the current scheme where the memory
         controller turns on if the config option is enabled
      2. It allows for wider testing of the memory controller, once the config
         option is enabled
      
      We still allow the create and destroy callbacks to succeed, since they are not
      aware of boot options.  We do not populate the directory with memory resource
      controller specific files.
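      
      As an illustration of how such a switch is typically wired up, here is a
      hedged sketch using __setup(); the option name and flag below are
      hypothetical, not necessarily what this patch uses:
      
          /* Hypothetical flag: default on, so existing configs keep the controller. */
          static int mem_cgroup_enable __read_mostly = 1;
      
          static int __init mem_cgroup_boot_setup(char *str)
          {
                  /* e.g. a hypothetical "memcgroup=0" on the command line turns it off */
                  get_option(&str, &mem_cgroup_enable);
                  return 1;
          }
          __setup("memcgroup=", mem_cgroup_boot_setup);
      
      The populate callback can then return early without creating the control files
      when the flag is clear, while create and destroy keep succeeding as described
      above.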
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4077960e
  4. 02 Apr 2008, 1 commit
  5. 31 Mar 2008, 1 commit
  6. 28 Mar 2008, 1 commit
  7. 27 Mar 2008, 4 commits
    • hugetlb: fix potential livelock in return_unused_surplus_hugepages() · 11320d17
      Committed by Nishanth Aravamudan
      Running the counters testcase from libhugetlbfs on 2.6.25-rc5 and
      2.6.25-rc5-mm1 results in:
      
          BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
          NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
          REGS: c000005db12cb360 TRAP: 0901   Not tainted  (2.6.25-rc5-autokern1)
          MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 48008448  XER: 20000000
          TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
          GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
          GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
          GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
          GPR12: 0000000028000442 c000000000770080
          NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
          LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
          Call Trace:
          [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
          [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
          [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
          [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
          [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
          [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
          [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
          [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
          [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
          [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
          [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
          [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
          [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
          [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
          Instruction dump:
          ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
          60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010
      
      This was tracked down to a potential livelock in
      return_unused_surplus_hugepages().  In the case where we have surplus
      pages on some node, but no free pages on the same node, we may never
      break out of the loop. To avoid this livelock, terminate the search if
      we iterate a number of times equal to the number of online nodes without
      freeing a page.
      
      Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
      the patch.
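      
      A hedged sketch of the bounded search, assuming the per-node counters
      free_huge_pages_node[] and surplus_huge_pages_node[] used by mm/hugetlb.c at
      the time (simplified):
      
          unsigned long remaining_iterations = num_online_nodes();
      
          while (nr_pages > 0) {
                  nid = next_node(nid, node_online_map);
                  if (nid == MAX_NUMNODES)
                          nid = first_node(node_online_map);
      
                  if (free_huge_pages_node[nid] && surplus_huge_pages_node[nid]) {
                          /* ... free one surplus page on this node ... */
                          nr_pages--;
                          /* forward progress was made: reset the bound */
                          remaining_iterations = num_online_nodes();
                  }
      
                  /* a full pass over all online nodes freed nothing: give up */
                  if (!--remaining_iterations)
                          break;
          }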
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11320d17
    • hugetlb: indicate surplus huge page counts in per-node meminfo · a1de0919
      Committed by Nishanth Aravamudan
      Currently we show the surplus hugetlb pool state in /proc/meminfo, but
      not in the per-node meminfo files, even though we track the information
      on a per-node basis. Printing it there can help track down dynamic pool
      bugs including the one in the follow-on patch.
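      
      A hedged sketch of the extra line in hugetlb_report_node_meminfo(), assuming
      it follows the format of the existing entries:
      
          int hugetlb_report_node_meminfo(int nid, char *buf)
          {
                  return sprintf(buf,
                          "Node %d HugePages_Total: %5u\n"
                          "Node %d HugePages_Free:  %5u\n"
                          "Node %d HugePages_Surp:  %5u\n",  /* new: surplus pool size */
                          nid, nr_huge_pages_node[nid],
                          nid, free_huge_pages_node[nid],
                          nid, surplus_huge_pages_node[nid]);
          }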
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1de0919
    • slab: fix cache_cache bootstrap in kmem_cache_init() · ec1f5eee
      Committed by Daniel Yeisley
      Commit 556a169d ("slab: fix bootstrap on
      memoryless node") introduced bootstrap-time cache_cache list3s for all nodes
      but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This
      patch fixes list_add() corruption in mm/slab.c seen on the ES7000.
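      
      A hedged sketch of the corrected indexing in kmem_cache_init() (CACHE_CACHE is
      the bootstrap index of cache_cache itself):
      
          /*
           * Each node needs its own bootstrap list3: index initkmem_list3 by
           * CACHE_CACHE + node rather than pointing every node at the single
           * CACHE_CACHE slot.
           */
          for_each_online_node(node)
                  cache_cache.nodelists[node] =
                          &initkmem_list3[CACHE_CACHE + node];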
      
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Olaf Hering <olaf@aepfle.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      ec1f5eee
    • count_partial() is not used if !SLUB_DEBUG and !CONFIG_SLABINFO · 53625b42
      Committed by Christoph Lameter
      Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO
      is defined. This patch will be reversed when slab defrag is merged since slab
      defrag requires count_partial() to determine the fragmentation status of
      slab caches.
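      
      A hedged sketch of the guard this amounts to in mm/slub.c:
      
          #if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SLABINFO)
          /* Only the debug and /proc/slabinfo code needs this helper. */
          static unsigned long count_partial(struct kmem_cache_node *n)
          {
                  unsigned long flags;
                  unsigned long x = 0;
                  struct page *page;
      
                  spin_lock_irqsave(&n->list_lock, flags);
                  list_for_each_entry(page, &n->partial, lru)
                          x += page->inuse;
                  spin_unlock_irqrestore(&n->list_lock, flags);
                  return x;
          }
          #endif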
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      53625b42
  8. 25 Mar 2008, 2 commits
  9. 20 Mar 2008, 7 commits
  10. 19 Mar 2008, 1 commit
  11. 18 Mar 2008, 1 commit
    • slub page alloc fallback: Enable interrupts for GFP_WAIT. · caeab084
      Committed by Christoph Lameter
      The fallback path needs to enable interrupts, as is done for the other
      page allocator calls.  This was not necessary with the alternate fast
      path, since irq enable/disable was handled in the slow path.  The regular
      fastpath handles irq enable/disable around calls to the slow path, so we
      need to restore the proper status before calling the page allocator from
      the slowpath.
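      
      A hedged sketch of the pattern being restored around the fallback call into
      the page allocator (placement within the slow path simplified):
      
          if (gfpflags & __GFP_WAIT)
                  local_irq_enable();     /* the page allocator may sleep */
      
          page = new_slab(s, gfpflags, node);     /* fallback allocation */
      
          if (gfpflags & __GFP_WAIT)
                  local_irq_disable();    /* restore what the slow path expects */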
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      caeab084
  12. 11 Mar 2008, 3 commits
    • iov_iter_advance() fix · f7009264
      Committed by Nick Piggin
      iov_iter_advance() skips over zero-length iovecs, however it does not properly
      terminate at the end of the iovec array.  Fix this by checking against
      i->count before we skip a zero-length iov.
      
      The bug was reproduced with a test program that continually creates random
      iovs for writev.  The fix was verified with the same program, which also
      checked that the correct data was contained in the file after each writev.
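      
      A hedged sketch of the multi-segment advance with the i->count check
      (simplified from the mm/filemap.c helper):
      
          const struct iovec *iov = i->iov;
          size_t base = i->iov_offset;
      
          /*
           * The i->count test keeps us from walking past the last iovec while
           * skipping over zero-length segments.
           */
          while (bytes || unlikely(i->count && !iov->iov_len)) {
                  size_t copy = min(bytes, iov->iov_len - base);
      
                  bytes -= copy;
                  base += copy;
                  if (iov->iov_len == base) {
                          iov++;
                          base = 0;
                  }
          }
          i->iov = iov;
          i->iov_offset = base;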
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Tested-by: "Kevin Coffman" <kwc@citi.umich.edu>
      Cc: "Alexey Dobriyan" <adobriyan@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7009264
    • hugetlb: correct page count for surplus huge pages · 2668db91
      Committed by Adam Litke
      Free pages in the hugetlb pool are free and as such have a reference count of
      zero.  Regular allocations into the pool from the buddy are "freed" into the
      pool which results in their page_count dropping to zero.  However, surplus
      pages can be directly utilized by the caller without first being freed to the
      pool.  Therefore, a call to put_page_testzero() is in order so that such a
      page will be handed to the caller with a correct count.
      
      This has not affected end users because the bad page count is reset before the
      page is handed off.  However, under CONFIG_DEBUG_VM this triggers a BUG when
      the page count is validated.
      
      Thanks go to Mel for first spotting this issue and providing an initial fix.
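      
      A hedged sketch of where the extra drop goes when a surplus page comes
      straight from the buddy allocator (allocation flags abridged):
      
          page = alloc_pages(htlb_alloc_mask | __GFP_COMP, HUGETLB_PAGE_ORDER);
          if (page) {
                  /*
                   * The buddy allocator hands the page over with a count of one;
                   * pool pages are handed out with a count of zero, so drop the
                   * buddy reference before giving the page to the caller.
                   */
                  put_page_testzero(page);
                  VM_BUG_ON(page_count(page));
                  /* ... account it as a surplus huge page ... */
          }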
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2668db91
    • mempolicy: fix reference counting bugs · 69682d85
      Committed by Lee Schermerhorn
      Address 3 known bugs in the current memory policy reference counting method.
      I have a series of patches to rework the reference counting to reduce overhead
      in the allocation path.  However, that series will require testing in -mm once
      I repost it.
      
      1) alloc_page_vma() does not release the extra reference taken for
         vma/shared mempolicy when the mode == MPOL_INTERLEAVE.  This can result in
         leaking mempolicy structures.  This is probably occurring, but not being
         noticed.
      
         Fix:  add the conditional release of the reference.
      
      2) huge_zonelist() unconditionally releases a reference on the mempolicy when
         mode == MPOL_INTERLEAVE.  This can result in decrementing the reference
         count for system default policy [should have no ill effect] or premature
         freeing of task policy.  If this occurred, the next allocation using task
         mempolicy would use the freed structure and probably BUG out.
      
         Fix:  add the necessary check to the release.
      
      3) The current reference counting method assumes that vma 'get_policy()'
         methods automatically add an extra reference to a non-NULL returned
         mempolicy.  This is true for shmem_get_policy() used by tmpfs mappings,
         including regular page shm segments.  However, SHM_HUGETLB shm's, backed by
         hugetlbfs, just use the vma policy without the extra reference.  This
         results in freeing of the vma policy on the first allocation, with reuse of
         the freed mempolicy structure on subsequent allocations.
      
         Fix: Rather than add another condition to the conditional reference
         release, which occurs in the allocation path, just add a reference when
         returning the vma policy in shm_get_policy() to match the assumptions; see
         the sketch below.
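      
      A hedged sketch of the item-3 fix in ipc/shm.c, assuming the existing
      shm_file_data() wrapper:
      
          static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
                                                  unsigned long addr)
          {
                  struct shm_file_data *sfd = shm_file_data(vma->vm_file);
                  struct mempolicy *pol = NULL;
      
                  if (sfd->vm_ops->get_policy)
                          pol = sfd->vm_ops->get_policy(vma, addr);
                  else if (vma->vm_policy) {
                          pol = vma->vm_policy;
                          mpol_get(pol);  /* callers expect a reference on non-NULL */
                  }
                  return pol;
          }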
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <eric.whitney@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69682d85
  13. 10 Mar 2008, 1 commit
  14. 07 Mar 2008, 5 commits
  15. 05 Mar 2008, 9 commits
    • hugetlb: fix pool shrinking while in restricted cpuset · 348e1e04
      Committed by Nishanth Aravamudan
      Adam Litke noticed that currently we grow the hugepage pool independent of any
      cpuset the running process may be in, but when shrinking the pool, the cpuset
      is checked.  This leads to inconsistency when shrinking the pool in a
      restricted cpuset -- an administrator may have been able to grow the pool on a
      node restricted by a containing cpuset, but they cannot shrink it there.
      
      There are two options: either prevent growing of the pool outside of the
      cpuset or allow shrinking outside of the cpuset.  From previous discussions
      on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
      should not be restricted by cpusets.  So allow shrinking the pool by removing
      pages from nodes outside of current's cpuset.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Irwin <wli@holomorphy.com>
      Cc: Lee Schermerhorn <Lee.Schermerhonr@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      348e1e04
    • hugetlb: close a difficult to trigger reservation race · ac09b3a1
      Committed by Adam Litke
      A hugetlb reservation may be inadequately backed in the event of racing
      allocations and frees when utilizing surplus huge pages.  Consider the
      following series of events in processes A and B:
      
       A) Allocates some surplus pages to satisfy a reservation
       B) Frees some huge pages
       A) A notices the extra free pages and drops hugetlb_lock to free some of
          its surplus pages back to the buddy allocator.
       B) Allocates some huge pages
       A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()
      
      Avoid this by committing the reservation after pages have been allocated but
      before dropping the lock to free excess pages.  For parity, release the
      reservation in return_unused_surplus_pages().
      
      This patch also corrects the cpuset_mems_nr() error path in
      hugetlb_acct_memory().  If the cpuset check fails, uncommit the
      reservation, but also be sure to return any surplus huge pages that may
      have been allocated to back the failed reservation.
      
      Thanks to Andy Whitcroft for discovering this.
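      
      A hedged sketch of the reordering in the gather path (details elided):
      
          spin_lock(&hugetlb_lock);
          /* ... allocate surplus pages until the reservation can be met ... */
      
          resv_huge_pages += delta;       /* commit while still holding the lock */
          ret = 0;
      
          spin_unlock(&hugetlb_lock);
          /* ... only now return any excess surplus pages to the buddy allocator ... */
          spin_lock(&hugetlb_lock);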
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac09b3a1
    • memcg: fix oops on NULL lru list · fb59e9f1
      Committed by Hugh Dickins
      While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
      called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
       I couldn't see what racing tasks on other cpus were doing, but surmise that
      another must have been in mem_cgroup_charge_common on the same page, between
      its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
      which I'd almost changed to a kmalloc).
      
      Normally such a race cannot happen, the ref_cnt prevents it, the final
      uncharge cannot race with the initial charge.  But force_empty buggers the
      ref_cnt, that's what it's all about; and thereafter forced pages are
      vulnerable to races such as this (just think of a shared page also mapped into
      an mm of another mem_cgroup than that just emptied).  And remain vulnerable
      until they're freed indefinitely later.
      
      This patch just fixes the oops by moving the unlock_page_cgroups down below
      adding to and removing from the list (only possible given the previous patch);
      and while we're at it, we might as well make it an invariant that
      page->page_cgroup is always set while pc is on lru.
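      
      A hedged sketch of the resulting order in the uncharge path (setup of pc and
      mz elided):
      
          lock_page_cgroup(page);
          pc = page_get_page_cgroup(page);
          /* ... refcount checks ... */
      
          spin_lock_irqsave(&mz->lru_lock, flags);
          __mem_cgroup_remove_list(pc);
          spin_unlock_irqrestore(&mz->lru_lock, flags);
      
          page_assign_page_cgroup(page, NULL);
          unlock_page_cgroup(page);       /* released only after the list update */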
      
      But this behaviour of force_empty seems highly unsatisfactory to me: why have
      a ref_cnt if we always have to cope with it being violated (as in the earlier
      page migration patch).  We may prefer force_empty to move pages to an orphan
      mem_cgroup (could be the root, but better not), from which other cgroups could
      recover them; we might need to reverse the locking again; but no time now for
      such concerns.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb59e9f1
    • memcg: simplify force_empty and move_lists · 9b3c0a07
      Committed by Hirokazu Takahashi
      As for force_empty, though this may not be the main topic here,
      mem_cgroup_force_empty_list() can be implemented more simply.  It is possible to
      make the function just call mem_cgroup_uncharge_page() instead of releasing
      page_cgroups by itself.  The tip is to call get_page() before invoking
      mem_cgroup_uncharge_page(), so the page won't be released during this
      function.
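      
      A hedged sketch of that simplified loop (locking around the lru list shown,
      other details trimmed):
      
          while (!list_empty(list)) {
                  pc = list_entry(list->prev, struct page_cgroup, lru);
                  page = pc->page;
                  get_page(page);         /* pin the page across the uncharge */
                  spin_unlock_irqrestore(&mz->lru_lock, flags);
                  mem_cgroup_uncharge_page(page); /* drops the charge and list entry */
                  put_page(page);
                  spin_lock_irqsave(&mz->lru_lock, flags);
          }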
      
      Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
      the page might have been reassigned to an lru of a different mem_cgroup, and
      now be emptied from that; but Hugh claims that's okay, the end state is the
      same as when it hasn't gone to another list.
      
      And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
      mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
      holding page_cgroup lock (but still has to use try_lock_page_cgroup).
      Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b3c0a07
    • memcg: fix mem_cgroup_move_lists locking · 2680eed7
      Committed by Hugh Dickins
      Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
      into page freeing, I've hit it from time to time in testing on some machines,
      sometimes only after many days.  Recently found a machine which could usually
      produce it within a few hours, which got me there at last.
      
      The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
      arrangement of structures was such that you got page_cgroups from the lru list
      neatly put on to SLUB's freelist.  Kamezawa-san identified the same hole
      independently.
      
      The main problem was that it was missing the lock_page_cgroup it needs to
      safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
      couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected.  See the code for
      comments on the constraints.
      
      This patch immediately gets replaced by a simpler one from Hirokazu-san; but
      is it just foolish pride that tells me to put this one on record, in case we
      need to come back to it later?
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2680eed7
    • memcg: css_put after remove_list · 6d48ff8b
      Committed by Hugh Dickins
      mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
      it, and before removing page_cgroup from one of its lru lists: isn't there a
      danger that struct mem_cgroup memory could be freed and reused before
      completing that, so corrupting something?  Never seen it, and for all I know
      there may be other constraints which make it impossible; but let's be
      defensive and reverse the ordering there.
      
      mem_cgroup_force_empty_list is safe because there's an extra css_get around
      all its works; but even so, change its ordering the same way round, to help
      get in the habit of doing it like this.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d48ff8b
    • memcg: remove clear_page_cgroup and atomics · b9c565d5
      Committed by Hugh Dickins
      Remove clear_page_cgroup: it's an unhelpful helper, see for example how
      mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
      (serious races from that?  I'm not sure).
      
      Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
      atomic: it's always manipulated under lock_page_cgroup, except where
      force_empty unilaterally reset it to 0 (and how does uncharge's
      atomic_dec_and_test protect against that?).
      
      Simplify this page_cgroup locking: if you've got the lock and the pc is
      attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
      check that pc->page matches page (we're on the way to finding why sometimes it
      doesn't, but this patch doesn't fix that).
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9c565d5
    • memcg: memcontrol uninlined and static · d5b69e38
      Committed by Hugh Dickins
      More cleanup to memcontrol.c, this time changing some of the code generated.
      Let the compiler decide what to inline (except for page_cgroup_locked which is
      only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc.
      was quite a waste since bit_spin_lock etc.  are inlines in a header file; made
      mem_cgroup_force_empty and mem_cgroup_write_strategy static.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5b69e38
    • memcg: memcontrol whitespace cleanups · 8869b8f6
      Committed by Hugh Dickins
      Sorry, before getting down to more important changes, I'd like to do some
      cleanup in memcontrol.c.  This patch doesn't change the code generated, but
      cleans up whitespace, moves up a double declaration, removes an unused enum,
      removes void returns, removes misleading comments, that kind of thing.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8869b8f6