1. 02 July 2011, 9 commits
  2. 28 June 2011, 7 commits
    • memcg: fix direct softlimit reclaim to be called in limit path · ac34a1a3
      Committed by KAMEZAWA Hiroyuki
      Commit d149e3b2 ("memcg: add the soft_limit reclaim in global direct
      reclaim") adds a softlimit hook to shrink_zones().  With this, soft limit
      reclaim is called as
      
         try_to_free_pages()
             do_try_to_free_pages()
                 shrink_zones()
                     mem_cgroup_soft_limit_reclaim()
      
      Direct reclaim is now aware of the memcg soft limit hint.

      But the memory cgroup's own "limit" path can also end up calling the
      soft limit shrinker:
      
         try_to_free_mem_cgroup_pages()
             do_try_to_free_pages()
                 shrink_zones()
                     mem_cgroup_soft_limit_reclaim()
      
      This causes a global reclaim whenever a memcg hits its limit.

      This is a bug: soft_limit_reclaim() should only be called when
      scanning_global_lru(sc) == true.

      The commit also adds a variable "total_scanned" for counting the pages
      scanned by soft limit reclaim, but it is not really a "total".  This
      patch removes the variable and updates sc->nr_scanned instead.  This
      affects shrink_slab()'s scan condition, but since soft limit reclaim
      scans the global LRU, I think the change makes sense.
      
      TODO: avoid too much scanning of a zone when softlimit did enough work.
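      A minimal sketch of the intended shape (illustrative only - identifiers
      are simplified and the real hook's signature may differ):

         static void shrink_zones(int priority, struct zonelist *zonelist,
                                  struct scan_control *sc)
         {
                 struct zone *zone;

                 for_each_zone_in(zone, zonelist, sc) {  /* placeholder iterator */
                         if (scanning_global_lru(sc)) {
                                 unsigned long nr_soft_scanned = 0;

                                 /* global reclaim only: the soft limit hint applies here */
                                 mem_cgroup_soft_limit_reclaim(zone, sc->order,
                                                               sc->gfp_mask,
                                                               &nr_soft_scanned);
                                 /* fold into sc->nr_scanned, no separate "total_scanned" */
                                 sc->nr_scanned += nr_soft_scanned;
                         }
                         shrink_zone(priority, zone, sc);
                 }
         }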
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Ying Han <yinghan@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac34a1a3
    • mm: fix assertion mapping->nrpages == 0 in end_writeback() · 08142579
      Committed by Jan Kara
      Under heavy memory and filesystem load, users observe the assertion
      mapping->nrpages == 0 in end_writeback() trigger.  This can be caused by
      page reclaim reclaiming the last page from a mapping in the following
      race:
      
      	CPU0				CPU1
        ...
        shrink_page_list()
          __remove_mapping()
            __delete_from_page_cache()
              radix_tree_delete()
      					evict_inode()
      					  truncate_inode_pages()
      					    truncate_inode_pages_range()
      					      pagevec_lookup() - finds nothing
      					  end_writeback()
      					    mapping->nrpages != 0 -> BUG
              page->mapping = NULL
              mapping->nrpages--
      
      Fix the problem by doing a reliable check of mapping->nrpages under
      mapping->tree_lock in end_writeback().
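      A sketch of that check (simplified; the rest of end_writeback() is
      omitted):

         void end_writeback(struct inode *inode)
         {
                 ...
                 /*
                  * Recheck under tree_lock: __delete_from_page_cache() updates
                  * nrpages with this lock held, so a racing reclaim of the last
                  * page can no longer make the assertion fire spuriously.
                  */
                 spin_lock_irq(&inode->i_data.tree_lock);
                 BUG_ON(inode->i_data.nrpages);
                 spin_unlock_irq(&inode->i_data.tree_lock);
                 ...
         }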
      
      Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out
      by Miklos Szeredi <mszeredi@suse.de>.
      
      Cc: Jay <jinshan.xiong@whamcloud.com>
      Cc: Miklos Szeredi <mszeredi@suse.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08142579
    • mm/memory-failure.c: fix spinlock vs mutex order · 9b679320
      Committed by Peter Zijlstra
      We cannot take a mutex while holding a spinlock, so flip the order and
      fix the locking documentation.
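      In other words (hypothetical lock names, purely to illustrate the rule):

         /* wrong: mutex_lock() may sleep, but a spinlocked context may not */
         spin_lock(&page_lock);
         mutex_lock(&recovery_mutex);

         /* right: take the sleeping lock first, then the spinlock inside it */
         mutex_lock(&recovery_mutex);
         spin_lock(&page_lock);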
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b679320
    • tmpfs: add shmem_read_mapping_page_gfp · d9d90e5e
      Committed by Hugh Dickins
      Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp()
      is unsuited to tmpfs, because it inserts a page into pagecache before
      calling the filesystem's ->readpage: tmpfs may have pages in swapcache
      which only it knows how to locate and switch to filecache.
      
      At present tmpfs provides a ->readpage method, and copes with this by
      copying pages; but soon we can simplify it by removing its ->readpage.
      Provide shmem_read_mapping_page_gfp() now, ready for that transition.
      
      Export shmem_read_mapping_page_gfp() and add it to list in shmem_fs.h,
      with shmem_read_mapping_page() inline for the common mapping_gfp case.
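      The inline wrapper is roughly (sketch of the shmem_fs.h addition):

         static inline struct page *shmem_read_mapping_page(
                                 struct address_space *mapping, pgoff_t index)
         {
                 return shmem_read_mapping_page_gfp(mapping, index,
                                                    mapping_gfp_mask(mapping));
         }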
      
      (shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the
      read_mapping_page functions use the mapping's ->readpage, and the
      read_cache_page functions use the supplied filler, so I think
      read_cache_page_gfp was slightly misnamed.)
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9d90e5e
    • tmpfs: take control of its truncate_range · 94c1e62d
      Committed by Hugh Dickins
      2.6.35's new truncate convention gave tmpfs the opportunity to control
      its file truncation, no longer enforced from outside by vmtruncate().
      We shall want to build upon that, to handle pagecache and swap together.
      
      Slightly redefine the ->truncate_range interface: let it now be called
      between the unmap_mapping_range()s, with the filesystem responsible for
      doing the truncate_inode_pages_range() from it - just as the filesystem
      is nowadays responsible for doing that from its ->setattr.
      
      Let's rename shmem_notify_change() to shmem_setattr().  Instead of
      calling the generic truncate_setsize(), bring that code in so we can
      call shmem_truncate_range() - which will later be updated to perform its
      own variant of truncate_inode_pages_range().
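      Schematically, the shrinking path becomes something like this (a
      simplified sketch, not the exact diff - error handling and the
      unmapping calls are elided):

         static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
         {
                 struct inode *inode = dentry->d_inode;
                 ...
                 if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size < inode->i_size) {
                         loff_t newsize = attr->ia_size;

                         /* what truncate_setsize() would have done ... */
                         i_size_write(inode, newsize);
                         /* ... but using shmem's own range truncation */
                         shmem_truncate_range(inode, newsize, (loff_t)-1);
                 }
                 ...
         }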
      
      Remove the punch_hole unmap_mapping_range() from shmem_truncate_range():
      now that the COW's unmap_mapping_range() comes after ->truncate_range,
      there is no need to call it a third time.
      
      Export shmem_truncate_range() and add it to the list in shmem_fs.h, so
      that i915_gem_object_truncate() can call it explicitly in future; get
      this patch in first, then update drm/i915 once this is available (until
      then, i915 will just be doing the truncate_inode_pages() twice).
      
      Though introduced five years ago, no other filesystem is implementing
      ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we
      expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly,
      whereupon ->truncate_range can be removed from inode_operations -
      shmem_truncate_range() will help i915 across that transition too.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94c1e62d
    • mm: move shmem prototypes to shmem_fs.h · 072441e2
      Committed by Hugh Dickins
      Before adding any more global entry points into shmem.c, gather such
      prototypes into shmem_fs.h.  Remove mm's own declarations from swap.h,
      but for now leave the ones in mm.h: because shmem_file_setup() and
      shmem_zero_setup() are called from various places, and we should not
      force other subsystems to update immediately.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      072441e2
    • mm: move vmtruncate_range to truncate.c · 5b8ba101
      Committed by Hugh Dickins
      You would expect to find vmtruncate_range() next to vmtruncate() in
      mm/truncate.c: move it there.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b8ba101
  3. 23 June 2011, 2 commits
  4. 18 June 2011, 3 commits
    • mm: avoid anon_vma_chain allocation under anon_vma lock · dd34739c
      Committed by Linus Torvalds
      Hugh Dickins points out that lockdep (correctly) spots a potential
      deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation
      of anon_vma_chain while doing anon_vma_clone().  The problem is that
      page reclaim will want to take the anon_vma lock of any anonymous pages
      that it will try to reclaim.
      
      So re-organize the code in anon_vma_clone() slightly: first do just a
      GFP_NOWAIT allocation, which will usually work fine.  But if that fails,
      let's just drop the lock and re-do the allocation, now with GFP_KERNEL.
      
      End result: not only do we avoid the locking problem, this also ends up
      getting better concurrency in case the allocation does need to block.
      Tim Chen reports that with all these anon_vma locking tweaks, we're now
      almost back up to the spinlock performance.
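      The allocation path then looks roughly like this (sketch; helper names
      are illustrative):

         avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
         if (unlikely(!avc)) {
                 /* slow path: drop the anon_vma lock so the allocation may block */
                 unlock_anon_vma_root(root);
                 root = NULL;
                 avc = anon_vma_chain_alloc(GFP_KERNEL);
                 if (!avc)
                         goto enomem_failure;
         }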
      Reported-and-tested-by: Hugh Dickins <hughd@google.com>
      Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd34739c
    • mm: avoid repeated anon_vma lock/unlock sequences in unlink_anon_vmas() · eee2acba
      Committed by Peter Zijlstra
      This matches the anon_vma_clone() case, and uses the same lock helper
      functions.  Because of the need to potentially release the anon_vma's,
      it's a bit more complex, though.
      
      We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets
      the anon_vma lock (with the helper function that only takes the lock
      once for the whole loop), and removes any entries that don't need any
      more processing.
      
      The second phase just traverses the remaining list entries (without
      holding the anon_vma lock), and does any actual freeing of the
      anon_vma's that is required.
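      Skeleton of the two phases (a heavily simplified sketch):

         void unlink_anon_vmas(struct vm_area_struct *vma)
         {
                 struct anon_vma_chain *avc, *next;
                 struct anon_vma *root = NULL;

                 /* phase 1: unlink under the shared root lock, taken only once */
                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
                         root = lock_anon_vma_root(root, avc->anon_vma);
                         list_del(&avc->same_anon_vma);
                         /* entries whose anon_vma must be freed are left on the list */
                         ...
                 }
                 unlock_anon_vma_root(root);

                 /* phase 2: no lock held - drop references and free the rest */
                 list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
                         put_anon_vma(avc->anon_vma);
                         list_del(&avc->same_vma);
                         anon_vma_chain_free(avc);
                 }
         }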
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Hugh Dickins <hughd@google.com>
      Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eee2acba
    • mm: avoid repeated anon_vma lock/unlock sequences in anon_vma_clone() · bb4aa396
      Committed by Linus Torvalds
      In anon_vma_clone() we traverse the vma->anon_vma_chain of the source
      vma, locking the anon_vma for each entry.
      
      But they are all going to have the same root entry, which means that
      we're locking and unlocking the same lock over and over again.  Which is
      expensive in locked operations, but can get _really_ expensive when that
      root entry sees any kind of lock contention.
      
      In fact, Tim Chen reports a big performance regression due to this: when
      we switched to use a mutex instead of a spinlock, the contention case
      gets much worse.
      
      So to alleviate this all, this commit creates a small helper function
      (lock_anon_vma_root()) that can be used to take the lock just once
      rather than taking and releasing it over and over again.
      
      We still have the same "take the lock and release" it behavior in the
      exit path (in unlink_anon_vmas()), but that one is a bit harder to fix
      since we're actually freeing the anon_vma entries as we go, and that
      will touch the lock too.
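      Conceptually the helper looks like this (sketch, simplified):

         static struct anon_vma *lock_anon_vma_root(struct anon_vma *root,
                                                    struct anon_vma *anon_vma)
         {
                 struct anon_vma *new_root = anon_vma->root;

                 if (new_root != root) {
                         /* every chain entry shares one root, so this is rare */
                         if (root)
                                 mutex_unlock(&root->mutex);
                         mutex_lock(&new_root->mutex);
                 }
                 return new_root;
         }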
      Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb4aa396
  5. 17 June 2011, 1 commit
  6. 16 June 2011, 18 commits
    • mm: get rid of the most spurious find_vma_prev() users · 9be34c9d
      Committed by Linus Torvalds
      We have some users of this function that date back to before the vma
      list was doubly linked, and just are silly.  These days, you can find
      the previous vma by just following the vma->vm_prev pointer.
      
      In some cases you don't need any find_vma() lookup at all, and in other
      cases you're better off with the regular "find_vma()" that uses the vma
      cache front-end lookup.
      
      Some "find_vma_prev()" users are still valid, though.  For example, in
      the case of a stack that grows up, it can be the case that we don't find
      any 'vma' at all (because we're looking up an address that is past the
      last vma), and that the stack that we want to grow is the 'prev' vma.
      
      But that kind of special case aside, we generally should prefer to use
      'find_vma()'.
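      For the silly cases the conversion is simply (illustrative):

         /* before: an extra lookup just to learn the predecessor */
         vma = find_vma_prev(mm, addr, &prev);

         /* after: the list is doubly linked, so ask the vma itself */
         vma = find_vma(mm, addr);
         if (vma)
                 prev = vma->vm_prev;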
      
      Noticed due to a totally unrelated POWER memory corruption bug that just
      happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
      using that function here?".
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9be34c9d
    • ksm: fix NULL pointer dereference in scan_get_next_rmap_item() · 2b472611
      Committed by Hugh Dickins
      Andrea Righi reported a case where an exiting task can race against
      ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily
      triggering a NULL pointer dereference in ksmd.
      
      ksm_scan.mm_slot == &ksm_mm_head with only one registered mm
      
      CPU 1 (__ksm_exit)		CPU 2 (scan_get_next_rmap_item)
       				list_empty() is false
      lock				slot == &ksm_mm_head
      list_del(slot->mm_list)
      (list now empty)
      unlock
      				lock
      				slot = list_entry(slot->mm_list.next)
      				(list is empty, so slot is still ksm_mm_head)
      				unlock
      				slot->mm == NULL ... Oops
      
      Close this race by revalidating that the new slot is not simply the list
      head again.
      
      Andrea's test case:
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <sys/mman.h>
      
      #define BUFSIZE getpagesize()
      
      int main(int argc, char **argv)
      {
      	void *ptr;
      
      	if (posix_memalign(&ptr, getpagesize(), BUFSIZE) != 0) {
      		perror("posix_memalign");
      		exit(1);
      	}
      	if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
      		perror("madvise");
      		exit(1);
      	}
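      	/* deliberate NULL dereference: the exiting task races against ksmd's scan */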
      	*(char *)NULL = 0;
      
      	return 0;
      }
      Reported-by: Andrea Righi <andrea@betterlinux.com>
      Tested-by: Andrea Righi <andrea@betterlinux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b472611
    • mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2 · f9e35b3b
      Committed by Mel Gorman
      Asynchronous compaction is used when promoting to huge pages.  This is
      all very nice, but if a number of processes are compacting memory, a
      large number of pages can be isolated.  An "asynchronous" process can
      stall for long periods of time as a result, with one user reporting that
      firefox can stall for tens of seconds.  This patch aborts asynchronous
      compaction if too many pages are isolated, as it is better to fail a
      hugepage promotion than to stall a process.
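      Sketched, the check looks roughly like this (illustrative, not the
      literal diff):

         if (too_many_isolated(zone)) {
                 if (!cc->sync)                  /* asynchronous (THP) compaction */
                         return COMPACT_PARTIAL; /* abort rather than stall */

                 congestion_wait(BLK_RW_ASYNC, HZ/10);   /* sync callers may wait */
         }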
      
      [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
      Reported-and-tested-by: Ury Stankevich <urykhy@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9e35b3b
    • mm: vmscan: do not use page_count without a page pin · d179e84b
      Committed by Andrea Arcangeli
      It is unsafe to run page_count during the physical pfn scan because
      compound_head could trip on a dangling pointer when reading
      page->first_page if the compound page is being freed by another CPU.
      
      [mgorman@suse.de: split out patch]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d179e84b
    • mm: compaction: ensure that the compaction free scanner does not move to the next zone · 7454f4ba
      Committed by Mel Gorman
      Compaction works with two scanners, a migration and a free scanner.  When
      the scanners cross over, migration within the zone is complete.  The
      location of each scanner is recorded on each cycle to avoid excessive
      scanning.

      When a zone is small and mostly reserved, it is very easy for the
      migration scanner to be close to the end of the zone.  Then the following
      situation can occur:
      
        o migration scanner isolates some pages near the end of the zone
        o free scanner starts at the end of the zone but finds that the
          migration scanner is already there
        o free scanner gets reinitialised for the next cycle as
          cc->migrate_pfn + pageblock_nr_pages
          moving the free scanner into the next zone
        o migration scanner moves into the next zone
      
      When this happens, NR_ISOLATED accounting goes haywire because some of the
      accounting happens against the wrong zone.  One zone's counter remains
      positive while the other goes negative, even though the overall global
      count is accurate.  This was reported on X86-32 with !SMP because !SMP
      allows the negative counters to be visible; the bug should theoretically
      be possible elsewhere as well.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7454f4ba
    • compaction: checks correct fragmentation index · a582a738
      Committed by Shaohua Li
      fragmentation_index() returns -1000 when the allocation might succeed.
      This doesn't match the comment and code in compaction_suitable().  I
      thought compaction_suitable() should return COMPACT_PARTIAL in the -1000
      case, because the allocation could then succeed depending on watermarks.

      The impact of this is that compaction starts and compact_finished() is
      called, which rechecks the watermarks and the free lists.  It should
      produce the same result, in that compaction does not really proceed, but
      it is more expensive.
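      A sketch of the intended compaction_suitable() logic (simplified):

         fragindex = fragmentation_index(zone, order);
         if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                 return COMPACT_SKIPPED;         /* compaction is unlikely to help */

         if (fragindex == -1000 &&
             zone_watermark_ok(zone, order, watermark, 0, 0))
                 return COMPACT_PARTIAL;         /* allocation should already succeed */

         return COMPACT_CONTINUE;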
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a582a738
    • mm/memory-failure.c: fix page isolated count mismatch · 5db8a73a
      Committed by Minchan Kim
      Pages isolated for migration are accounted with the vmstat counters
      NR_ISOLATE_[ANON|FILE].  Callers of migrate_pages() are expected to
      increment these counters when pages are isolated from the LRU.  Once the
      pages have been migrated, they are put back on the LRU or freed and the
      isolated count is decremented.
      
      Memory failure is not properly accounting for pages it isolates causing
      the NR_ISOLATED counters to be negative.  On SMP builds, this goes
      unnoticed as negative counters are treated as 0 due to expected per-cpu
      drift.  On UP builds, the counter is treated by too_many_isolated() as a
      large value causing processes to enter D state during page reclaim or
      compaction.  This patch accounts for pages isolated by memory failure
      correctly.
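      The accounting follows the usual pattern around LRU isolation (sketch,
      not the literal diff):

         /* account the page as isolated before handing it to migrate_pages() */
         if (isolate_lru_page(p) == 0) {
                 inc_zone_page_state(p, NR_ISOLATED_ANON +
                                        page_is_file_cache(p));
                 list_add(&p->lru, &pagelist);
                 /* ... migrate_pages() or putback_lru_pages() will decrement it */
         }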
      
      [mel@csn.ul.ie: rewrote changelog]
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5db8a73a
    • memcg: avoid percpu cached charge draining at softlimit · fbc29a25
      Committed by KAMEZAWA Hiroyuki
      Based on Michal Hocko's comment.
      
      We do not drain per-cpu cached charges during soft limit reclaim because
      background reclaim doesn't care about charges: it tries to free some
      memory, and dropping cached charges would not free any.

      Cached charges might only influence the selection of the biggest soft
      limit offender, but since the call is made only after that selection has
      already happened, it makes no difference.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fbc29a25
    • memcg: fix percpu cached charge draining frequency · 26fe6168
      Committed by KAMEZAWA Hiroyuki
      For performance, the memory cgroup caches some "charge" from res_counter
      in a per-cpu cache.  This works well, but because it is a cache it needs
      to be flushed in some cases.  Typical cases are

         1. when someone hits the limit.

         2. when rmdir() is called and the charges need to drop to 0.

      But case "1" has a problem.

      Recently, on large SMP machines, we have seen many kworker runs caused by
      flushing the memcg cache.  The bad part of the implementation is that the
      drain code is called even on a cpu whose cache holds charges for a memcg
      unrelated to the one that hit its limit.
      
      This patch:
              A) checks whether the percpu cache contains useful data,
              B) checks that no other asynchronous percpu drain is running,
              C) does not schedule the drain callback on the local cpu.
      
      (*) This patch avoids changing the calling conditions for the hard limit.
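      A hypothetical sketch of checks A) through C) (the names are invented and
      do not match the actual memcg code):

         static void drain_all_cached_charges_async(struct mem_cgroup *memcg)
         {
                 int cpu;

                 if (!mutex_trylock(&drain_mutex))
                         return;                 /* B) a drain is already running */

                 drain_local_cache();            /* C) local cpu: drain directly, no work item */

                 for_each_online_cpu(cpu) {
                         struct charge_cache *cache = &per_cpu(charge_cache, cpu);

                         if (!cache->nr_pages || cache->owner != memcg)
                                 continue;       /* A) nothing useful cached here */
                         if (cpu == smp_processor_id())
                                 continue;       /* already handled synchronously */
                         schedule_work_on(cpu, &cache->drain_work);
                 }
                 mutex_unlock(&drain_mutex);
         }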
      
      When I run "cat 1Gfile > /dev/null" in a memcg with a 300M limit,
      
      [Before]
      13767 kamezawa  20   0 98.6m  424  416 D 10.0  0.0   0:00.61 cat
         58 root      20   0     0    0    0 S  0.6  0.0   0:00.09 kworker/2:1
         60 root      20   0     0    0    0 S  0.6  0.0   0:00.08 kworker/4:1
          4 root      20   0     0    0    0 S  0.3  0.0   0:00.02 kworker/0:0
         57 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/1:1
         61 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/5:1
         62 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/6:1
         63 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/7:1
      
      [After]
       2676 root      20   0 98.6m  416  416 D  9.3  0.0   0:00.87 cat
       2626 kamezawa  20   0 15192 1312  920 R  0.3  0.0   0:00.28 top
          1 root      20   0 19384 1496 1204 S  0.0  0.0   0:00.66 init
          2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
          3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
          4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0
      
      [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26fe6168
    • memcg: fix wrong check of noswap with softlimit · 7ae534d0
      Committed by KAMEZAWA Hiroyuki
      Hierarchical reclaim doesn't swap out if the memsw and resource limits are
      the same (memsw_is_minimum == true), because we would hit the mem+swap
      limit anyway (during hard limit reclaim).

      For the soft limit we shouldn't consider memsw_is_minimum at all, because
      it doesn't make much sense there: either the soft limit is below the hard
      limit, in which case we cannot hit the mem+swap limit, or direct reclaim
      takes precedence.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ae534d0
    • memcg: fix init_page_cgroup nid with sparsemem · 37573e8c
      Committed by KAMEZAWA Hiroyuki
      Commit 21a3c964 ("memcg: allocate memory cgroup structures in local
      nodes") made page_cgroup allocation NUMA-aware.  But that caused a
      problem: https://bugzilla.kernel.org/show_bug.cgi?id=36192.

      The problem was getting a NID from an invalid struct page, one that was
      not initialized because its pfn is out-of-node, i.e. outside
      [node_start_pfn, node_end_pfn).

      Currently, with sparsemem, page_cgroup_init() scans pfns from 0 to
      max_pfn.  But this may scan a pfn which is not on any node, and can
      access a memmap which is not initialized.

      This makes page_cgroup_init() for SPARSEMEM node-aware and removes the
      code that got the nid from page->flags.  (Then, a valid NID is always
      used.)
      
      [akpm@linux-foundation.org: try to fix up comments]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37573e8c
    • mm: memory.numa_stat: fix file permission · 89577127
      Committed by KAMEZAWA Hiroyuki
      Commit 406eb0c9 ("memcg: add memory.numastat api for numa
      statistics") adds memory.numa_stat file for memory cgroup.  But the file
      permissions are wrong.
      
        [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
        ---------- 1 root root 0 Jun  9 18:36 /cgroup/memory/A/memory.numa_stat
      
      This patch fixes the permissions:
      
        [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
        -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89577127
    • mm: fix negative commitlimit when gigantic hugepages are allocated · b0320c7b
      Committed by Rafael Aquini
      When 1GB hugepages are allocated on a system, free(1) reports less
      available memory than what really is installed in the box.  Also, if the
      total size of hugepages allocated on a system is over half of the total
      memory size, CommitLimit becomes a negative number.
      
      The problem is that gigantic hugepages (order > MAX_ORDER) can only be
      allocated at boot with bootmem, so their frames are not accounted to
      'totalram_pages'.  However, they are accounted to hugetlb_total_pages().
      
      What happens to turn CommitLimit into a negative number is this
      calculation, in fs/proc/meminfo.c:
      
              allowed = ((totalram_pages - hugetlb_total_pages())
                      * sysctl_overcommit_ratio / 100) + total_swap_pages;
      
      A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
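      As a concrete (made-up) example: boot a 16 GB machine with ten 1 GB
      gigantic hugepages.  totalram_pages then reports only about 6 GB, while
      hugetlb_total_pages() reports 10 GB, so with the default
      sysctl_overcommit_ratio of 50:

              allowed = (6 GB - 10 GB) * 50 / 100 + total_swap_pages

      which works out to total_swap_pages minus 2 GB and, without that much
      swap, goes negative - hence the negative CommitLimit in /proc/meminfo.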
      
      Also, every vm statistic which depends on 'totalram_pages' will render
      confusing values, as if the system were 'missing' some part of its memory.
      
      Impact of this bug:

      When gigantic hugepages are allocated and sysctl_overcommit_memory ==
      OVERCOMMIT_NEVER, __vm_enough_memory() goes through the mentioned
      'allowed' calculation and might end up mistakenly returning -ENOMEM,
      thus forcing the system to start reclaiming pages earlier than usual,
      which could have a detrimental impact on overall system performance,
      depending on the workload.
      
      Besides the aforementioned scenario, I can only think of this causing
      annoyances with memory reports from /proc/meminfo and free(1).
      
      [akpm@linux-foundation.org: standardize comment layout]
      Reported-by: Russ Anderson <rja@sgi.com>
      Signed-off-by: Rafael Aquini <aquini@linux.com>
      Acked-by: Russ Anderson <rja@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0320c7b
    • mm/memory_hotplug.c: fix building of node hotplug zonelist · 959ecc48
      Committed by KAMEZAWA Hiroyuki
      During memory hotplug we refresh zonelists when we online a page in a new
      zone.  This means that the node's zonelist is not initialized until pages
      are onlined.  So, for example, the "nid" passed by the MEM_GOING_ONLINE
      notifier will point to a NODE_DATA(nid) which has no zone fallback list.
      Moreover, if we hot-add cpu-only nodes, alloc_pages() will do no fallback.

      This patch builds a zonelist when a new pgdat becomes available.

      Note: in production at Fujitsu, memory is onlined before cpus, and our
            servers didn't have any memory-less nodes, so we saw no problems.

            But recent changes in MEM_GOING_ONLINE + page_cgroup will access
            the not-yet-initialized zonelist of the node.  In any case,
            memory-less nodes do exist and need some care.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      959ecc48
    • mm: compaction: fix special case -1 order checks · 3957c776
      Committed by Michal Hocko
      Commit 56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails") introduced a check for cc->order == -1 in
      compact_finished().  We should continue compacting in that case because
      the request came from userspace and there is no particular order to
      compact for.  A similar check was added by 82478fb7 ("mm: compaction:
      prevent division-by-zero during user-requested compaction") for
      compaction_suitable().

      The check is, however, done after zone_watermark_ok(), which uses order
      as the right-hand argument of a shift.  Not only is the watermark check
      pointless if we can break out without it, it also evaluates 1 << -1,
      which is not well defined (at least by the C standard).  Let's move the
      -1 check above zone_watermark_ok().
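      A sketch of the reordering in compact_finished() (simplified):

         static int compact_finished(struct zone *zone, struct compact_control *cc)
         {
                 ...
                 /*
                  * order == -1 means the whole zone is compacted on request
                  * from userspace: there is no watermark to meet, and
                  * 1 << -1 is undefined, so bail out before the check.
                  */
                 if (cc->order == -1)
                         return COMPACT_CONTINUE;

                 watermark = low_wmark_pages(zone) + (1 << cc->order);
                 if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
                         return COMPACT_CONTINUE;
                 ...
         }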
      
      [minchan.kim@gmail.com: caught compaction_suitable]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3957c776
    • mm: fix wrong kunmap_atomic() pointer · 5f1a1907
      Committed by Steven Rostedt
      Running a ktest.pl test, I hit the following bug on x86_32:
      
        ------------[ cut here ]------------
        WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
         Hardware name:
        Modules linked in:
        Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
        Call Trace:
         [<c04450da>] warn_slowpath_common+0x7c/0x91
         [<c042f5df>] ? __kunmap_atomic+0x64/0xc1
        [<c042f5df>] ? __kunmap_atomic+0x64/0xc1
         [<c0445111>] warn_slowpath_null+0x22/0x24
         [<c042f5df>] __kunmap_atomic+0x64/0xc1
         [<c04d4a22>] unmap_vmas+0x43a/0x4e0
         [<c04d9065>] exit_mmap+0x91/0xd2
         [<c0443057>] mmput+0x43/0xad
         [<c0448358>] exit_mm+0x111/0x119
         [<c044855f>] do_exit+0x1ff/0x5fa
         [<c0454ea2>] ? set_current_blocked+0x3c/0x40
         [<c0454f24>] ? sigprocmask+0x7e/0x8e
         [<c0448b55>] do_group_exit+0x65/0x88
         [<c0448b90>] sys_exit_group+0x18/0x1c
         [<c0c3915f>] sysenter_do_call+0x12/0x38
        ---[ end trace 8055f74ea3c0eb62 ]---
      
      Running a git bisect with ktest.pl found the culprit: commit e303297e
      ("mm: extended batches for generic mmu_gather").
      
      But although this was the commit triggering the bug, it was not the one
      originally responsible for the bug.  That was commit d16dfc55 ("mm:
      mmu_gather rework").
      
      The code in zap_pte_range() has something that looks like the following:
      
      	pte =  pte_offset_map_lock(mm, pmd, addr, &ptl);
      	do {
      		[...]
      	} while (pte++, addr += PAGE_SIZE, addr != end);
      	pte_unmap_unlock(pte - 1, ptl);
      
      The pte starts off pointing at the first element in the page table
      directory that was returned by the pte_offset_map_lock().  When it's done
      with the page, pte will be pointing to anything between the next entry and
      the first entry of the next page inclusive.  By doing a pte - 1, this puts
      the pte back onto the original page, which is all that pte_unmap_unlock()
      needs.
      
      In most archs (64 bit), this is not an issue as the pte is ignored in the
      pte_unmap_unlock().  But on 32 bit archs, where things may be kmapped, it
      is essential that the pte passed to pte_unmap_unlock() resides on the same
      page that was given by pte_offset_map_lock().
      
      The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced
      a "break;" from the while loop.  This alone did not seem to easily trigger
      the bug.  But the modifications made by e303297e caused that "break;" to
      be hit on the first iteration, before the pte++.
      
      The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to
      be pointing to the previous page.  This will cause the wrong page to be
      unmapped, and also trigger the warning above.
      
      The simple solution is to just save the pointer given by
      pte_offset_map_lock() and use it in the unlock.
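      Sketched against the loop quoted above (simplified):

         pte_t *start_pte, *pte;

         start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
         pte = start_pte;
         do {
                 [...]
                 /* an early "break;" may now leave pte un-incremented */
         } while (pte++, addr += PAGE_SIZE, addr != end);
         pte_unmap_unlock(start_pte, ptl);       /* always the originally mapped page */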
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f1a1907
    • vmscan: implement swap token priority aging · d7911ef3
      Committed by KOSAKI Motohiro
      While testing the memcg-aware swap token, I observed that the swap token
      was often grabbed by an intermittently running process (e.g. init,
      auditd) which then never released it.

      Why?

      Some processes (e.g. init, auditd, audispd) wake up when another process
      is exiting.  And when an exiting process leaves the swap token without an
      owner, the token can be taken by the first process that pages in.  Thus
      such intermittently running processes often get the token.

      Currently, the swap token priority is only decreased in the page fault
      path.  So if the process goes back to sleep immediately after grabbing
      the swap token, its priority is never decreased.  That's obviously
      undesirable.

      This patch implements a very simple (and lightweight) priority aging.  It
      only affects the above corner case and doesn't change swap-tendency
      workload performance (e.g. a multi-process qsbench load).
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7911ef3
    • vmscan: implement swap token trace · 83cd81a3
      Committed by KOSAKI Motohiro
      This is useful for observing swap token activity.
      
      example output:
      
                   zsh-1845  [000]   598.962716: update_swap_token_priority:
      mm=ffff88015eaf7700 old_prio=1 new_prio=0
                memtoy-1830  [001]   602.033900: update_swap_token_priority:
      mm=ffff880037a45880 old_prio=947 new_prio=949
                memtoy-1830  [000]   602.041509: update_swap_token_priority:
      mm=ffff880037a45880 old_prio=949 new_prio=951
                memtoy-1830  [000]   602.051959: update_swap_token_priority:
      mm=ffff880037a45880 old_prio=951 new_prio=953
                memtoy-1830  [000]   602.052188: update_swap_token_priority:
      mm=ffff880037a45880 old_prio=953 new_prio=955
                memtoy-1830  [001]   602.427184: put_swap_token:
      token_mm=ffff880037a45880
                   zsh-1789  [000]   602.427281: replace_swap_token:
      old_token_mm=          (null) old_prio=0 new_token_mm=ffff88015eaf7018
      new_prio=2
                   zsh-1789  [001]   602.433456: update_swap_token_priority:
      mm=ffff88015eaf7018 old_prio=2 new_prio=4
                   zsh-1789  [000]   602.437613: update_swap_token_priority:
      mm=ffff88015eaf7018 old_prio=4 new_prio=6
                   zsh-1789  [000]   602.443924: update_swap_token_priority:
      mm=ffff88015eaf7018 old_prio=6 new_prio=8
                   zsh-1789  [000]   602.451873: update_swap_token_priority:
      mm=ffff88015eaf7018 old_prio=8 new_prio=10
                   zsh-1789  [001]   602.462639: update_swap_token_priority:
      mm=ffff88015eaf7018 old_prio=10 new_prio=12
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83cd81a3