1. 24 Feb, 2013 40 commits
    • H
      ksm: shrink 32-bit rmap_item back to 32 bytes · bc56620b
      Hugh Dickins committed
      Think of struct rmap_item as an extension of struct page (restricted to
      MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
      small, especially on 32-bit architectures of limited lowmem.
      
      Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
      making no change to its 64-byte struct rmap_item; but bloats the 32-bit
      struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
      rounds up to 40 bytes once allocated from slab.  We'd better avoid that.
      
      Hey, I only just remembered that the anon_vma pointer in struct
      rmap_item has no purpose until the rmap_item is hung from a stable tree
      node (which has its own nid field); and rmap_item's nid field has no
      purpose other than to say which tree root to pass to rb_erase() when
      unlinking from an unstable tree.
      
      Double them up in a union.  There's just one place where we set anon_vma
      early (when we already hold mmap_sem): now we must remove tree_rmap_item
      from its unstable tree there, before overwriting nid.  No need to
      spatter BUG()s around: we'd be seeing oopses if this were wrong.
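
A rough user-space sketch of the layout idea described above. The `_sketch` types are hypothetical stand-ins that only mimic the shape of the kernel structures (they are not the kernel's definitions), and the size claim assumes an ILP32 or LP64 target:

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-ins for kernel types; only pointer sizes matter here. */
struct anon_vma;
struct mm_struct;
struct stable_node;

/* Same shape as an rbtree node: one word plus two pointers. */
struct rb_node_sketch {
	unsigned long parent_color;
	struct rb_node_sketch *right;
	struct rb_node_sketch *left;
};

/* Same shape as an hlist node: two pointers. */
struct hlist_node_sketch {
	struct hlist_node_sketch *next;
	struct hlist_node_sketch **pprev;
};

struct rmap_item_sketch {
	struct rmap_item_sketch *rmap_list;
	union {
		struct anon_vma *anon_vma;	/* meaningful once in a stable tree */
		int nid;			/* meaningful while in an unstable tree */
	};
	struct mm_struct *mm;
	unsigned long address;
	unsigned int oldchecksum;
	union {
		struct rb_node_sketch node;	/* unstable tree linkage */
		struct {			/* stable tree linkage */
			struct stable_node *head;
			struct hlist_node_sketch hlist;
		};
	};
};

/* Size of the sketch: eight pointer-sized words on ILP32 and LP64 targets. */
static size_t rmap_item_sketch_size(void)
{
	return sizeof(struct rmap_item_sketch);
}
```

Because `nid` shares storage with the `anon_vma` pointer, it adds no bytes: the sketch stays at 32 bytes with 4-byte pointers and 64 bytes with 8-byte pointers.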
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc56620b
    • H
      ksm: treat unstable nid like in stable tree · b599cbdf
      Hugh Dickins committed
      An inconsistency emerged in reviewing the NUMA node changes to KSM: when
      meeting a page from the wrong NUMA node in a stable tree, we say that
      it's okay for comparisons, but not as a leaf for merging; whereas when
      meeting a page from the wrong NUMA node in an unstable tree, we bail out
      immediately.
      
      Now, it might be that a wrong NUMA node in an unstable tree is more
      likely to correlate with instability (different content, with rbnode
      now misplaced) than page migration; but even so, we are accustomed to
      instability in the unstable tree.
      
      Without strong evidence for which strategy is generally better, I'd
      rather be consistent with what's done in the stable tree: accept a page
      from the wrong NUMA node for comparison, but not as a leaf for merging.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b599cbdf
    • H
      ksm: add some comments · 8fdb3dbf
      Hugh Dickins committed
      Added slightly more detail to the Documentation of merge_across_nodes, a
      few comments in areas indicated by review, and renamed get_ksm_page()'s
      argument from "locked" to "lock_it".  No functional change.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fdb3dbf
    • G
      tmpfs: fix mempolicy object leaks · 49cd0a5c
      Greg Thelen committed
      Fix several mempolicy leaks in the tmpfs mount logic.  These leaks are
      slow - on the order of one object leaked per mount attempt.
      
      Leak 1 (umount doesn't free mpol allocated in mount):
          while true; do
              mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      Leak 2 (errors parsing remount options will leak mpol):
          mount -t tmpfs -o size=100M nodev /mnt
          while true; do
              mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
          done
          umount /mnt
      
      Leak 3 (multiple mpol per mount leak mpol):
          while true; do
              mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      This patch fixes all of the above.  I could have broken the patch into
      three pieces but it seemed easier to review as one.
      
      [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49cd0a5c
    • G
      tmpfs: fix use-after-free of mempolicy object · 5f00110f
      Greg Thelen committed
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch remounting an mpol bound tmpfs without specifying
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
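
The reference-counting rule can be illustrated with a small user-space model. Every name ending in `_sketch` is hypothetical and stands in for the kernel objects mentioned above; this is an illustration of the idea, not the patch itself:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for a reference-counted mempolicy. */
struct mpol_sketch {
	int refcnt;
};

/* Stand-in for the superblock info that owns one policy reference. */
struct sbinfo_sketch {
	struct mpol_sketch *mpol;
};

/* Drop one reference; free the object when the last one goes away. */
static void mpol_put_sketch(struct mpol_sketch *p)
{
	if (p && --p->refcnt == 0)
		free(p);
}

/* Stand-in for option parsing: allocates a policy only if mpol= was given. */
static struct mpol_sketch *parse_mpol_sketch(int mpol_given)
{
	struct mpol_sketch *p;

	if (!mpol_given)
		return NULL;
	p = malloc(sizeof(*p));
	if (p)
		p->refcnt = 1;
	return p;
}

/* Remount: keep the old policy unless a new one was actually parsed. */
static void remount_sketch(struct sbinfo_sketch *sb, int mpol_given)
{
	struct mpol_sketch *new_mpol = parse_mpol_sketch(mpol_given);

	if (new_mpol) {
		mpol_put_sketch(sb->mpol);	/* release the old reference */
		sb->mpol = new_mpol;		/* take over the initial reference */
	}
}
```

In this model a remount without `mpol=` leaves both the pointer and its reference count untouched, so the filesystem never ends up holding a freed policy.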
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f00110f
    • M
      mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages · 67d46b29
      Mel Gorman committed
      Rob van der Heij reported the following (paraphrased) on private mail.
      
      	The scenario is that I want to avoid backups to fill up the page
      	cache and purge stuff that is more likely to be used again (this is
      	with s390x Linux on z/VM, so I don't give it as much memory that
      	we don't care anymore). So I have something with LD_PRELOAD that
      	intercepts the close() call (from tar, in this case) and issues
      	a posix_fadvise() just before closing the file.
      
      	This mostly works, except for small files (less than 14 pages)
      	that remain in page cache after the fact.
      
      Unfortunately Rob has not had a chance to test this exact patch but the
      test program below should be reproducing the problem he described.
      
      The issue is the per-cpu pagevecs for LRU additions.  If the pages are
      added by one CPU but fadvise() is called on another then the pages
      remain resident as the invalidate_mapping_pages() only drains the local
      pagevecs via its call to pagevec_release().  The user-visible effect is
      that a program that uses fadvise() properly is not obeyed.
      
      A possible fix for this is to put the necessary smarts into
      invalidate_mapping_pages() to globally drain the LRU pagevecs if a
      pagevec page could not be discarded.  The downside with this is that an
      inode cache shrink would send a global IPI and memory pressure
      potentially causing global IPI storms is very undesirable.
      
      Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
      check if invalidate_mapping_pages() discarded all the requested pages.
      If a subset of pages are discarded it drains the LRU pagevecs and tries
      again.  If the second attempt fails, it assumes it is due to the pages
      being mapped, locked or dirty and does not care.  With this patch, an
      application using fadvise() correctly will be obeyed but there is a
      downside that a malicious application can force the kernel to send
      global IPIs and increase overhead.
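
The retry logic described above can be modelled in user space roughly as follows. The `_sketch` names are hypothetical: `invalidate_sketch()` stands in for invalidate_mapping_pages() and `drain_all_sketch()` for a global drain of the per-CPU LRU pagevecs; the real kernel code operates on page ranges, not counters:

```c
#include <assert.h>

#define NR_CPUS_SKETCH 2

/* Pages parked in each CPU's pending batch, and pages already on the LRU. */
struct cache_sketch {
	unsigned long pending[NR_CPUS_SKETCH];
	unsigned long on_lru;
};

/*
 * Flushes only the calling CPU's batch onto the LRU, then discards
 * whatever is on the LRU. Returns the number of pages discarded.
 */
static unsigned long invalidate_sketch(struct cache_sketch *c, int cpu)
{
	unsigned long discarded;

	c->on_lru += c->pending[cpu];
	c->pending[cpu] = 0;
	discarded = c->on_lru;
	c->on_lru = 0;
	return discarded;
}

/* Moves every CPU's pending batch onto the LRU (the expensive global step). */
static void drain_all_sketch(struct cache_sketch *c)
{
	for (int i = 0; i < NR_CPUS_SKETCH; i++) {
		c->on_lru += c->pending[i];
		c->pending[i] = 0;
	}
}

/* DONTNEED path: drain globally and retry once only if pages were missed. */
static unsigned long fadvise_dontneed_sketch(struct cache_sketch *c, int cpu,
					     unsigned long nr_requested)
{
	unsigned long count = invalidate_sketch(c, cpu);

	if (count < nr_requested) {
		drain_all_sketch(c);
		count += invalidate_sketch(c, cpu);
	}
	return count;
}
```

With four pages pending on CPU 0 and the call made from CPU 1, a single invalidation pass discards nothing, while the drain-and-retry path discards all four.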
      
      If accepted, I would like this to be considered as a -stable candidate.
      It's not an urgent issue but it's a system call that is not working as
      advertised which is weak.
      
      The following test program demonstrates the problem.  It should never
      report that pages are still resident but will without this patch.  It
      assumes that CPU 0 and 1 exist.
      
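      #define _GNU_SOURCE	/* for sched_setaffinity() and the CPU_* macros */
      #include <fcntl.h>
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/types.h>
      #include <unistd.h>

      /* Assumed value: not shown in this listing; the report mentions files under 14 pages */
      #define FILESIZE_PAGES 10

      int main(void) {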
      	int fd;
      	int pagesize = getpagesize();
      	ssize_t written = 0, expected;
      	char *buf;
      	unsigned char *vec;
      	int resident, i;
      	cpu_set_t set;
      
      	/* Prepare a buffer for writing */
      	expected = FILESIZE_PAGES * pagesize;
      	buf = malloc(expected + 1);
      	if (buf == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      	buf[expected] = 0;
      	memset(buf, 'a', expected);
      
      	/* Prepare the mincore vec */
      	vec = malloc(FILESIZE_PAGES);
      	if (vec == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Bind ourselves to CPU 0 */
      	CPU_ZERO(&set);
      	CPU_SET(0, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* open file, unlink and write buffer */
      	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR, 0600);
      	if (fd == -1) {
      		perror("open");
      		exit(EXIT_FAILURE);
      	}
      	unlink("fadvise-test-file");
      	while (written < expected) {
      		ssize_t this_write;
      		this_write = write(fd, buf + written, expected - written);
      
      		if (this_write == -1) {
      			perror("write");
      			exit(EXIT_FAILURE);
      		}
      
      		written += this_write;
      	}
      	free(buf);
      
      	/*
      	 * Force ourselves to another CPU. If fadvise only flushes the local
      	 * CPUs pagevecs then the fadvise will fail to discard all file pages
      	 */
      	CPU_ZERO(&set);
      	CPU_SET(1, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* sync and fadvise to discard the page cache */
      	fsync(fd);
      	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) != 0) {
      		perror("posix_fadvise");
      		exit(EXIT_FAILURE);
      	}
      
      	/* map the file and use mincore to see which parts of it are resident */
      	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
      	if (buf == MAP_FAILED) {
      		perror("mmap");
      		exit(EXIT_FAILURE);
      	}
      	if (mincore(buf, expected, vec) == -1) {
      		perror("mincore");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Check residency */
      	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
      		if (vec[i])
      			resident++;
      	}
      	if (resident != 0) {
      		printf("Nr unexpected pages resident: %d\n", resident);
      		exit(EXIT_FAILURE);
      	}
      
      	munmap(buf, expected);
      	close(fd);
      	free(vec);
      	exit(EXIT_SUCCESS);
      }
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Rob van der Heij <rvdheij@gmail.com>
      Tested-by: Rob van der Heij <rvdheij@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67d46b29
    • C
      mm: export mmu notifier invalidates · fa794199
      Cliff Wickman committed
      We at SGI have a need to address some very high physical address ranges
      with our GRU (global reference unit), sometimes across partitioned
      machine boundaries and sometimes with larger addresses than the cpu
      supports.  We do this with the aid of our own 'extended vma' module
      which mimics the vma.  When something (either unmap or exit) frees an
      'extended vma' we use the mmu notifiers to clean them up.
      
      We had been able to mimic the functions
      __mmu_notifier_invalidate_range_start() and
      __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
      walking the per-mm notifier list.  But with the change to a global srcu
      lock (static in mmu_notifier.c) we can no longer do that.  Our module has
      no access to that lock.
      
      So we request that these two functions be exported.
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Acked-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa794199
    • M
      mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse committed
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
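
A small sketch of how a page mask can turn per-page iteration into per-huge-page steps. The constants are assumed x86-64 values (4 KiB base pages, 512 base pages per THP), and the function names are hypothetical; the increment expression is one way to compute the number of pages left in the current huge page:

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12		/* assumed 4 KiB base pages */
#define HPAGE_PMD_NR_SKETCH 512UL	/* assumed 2 MiB THP / 4 KiB pages */

/*
 * Number of base pages to skip from 'start': 1 for a normal page
 * (page_mask == 0), or the rest of the huge page when page_mask is
 * HPAGE_PMD_NR - 1. Never more than the pages still requested.
 */
static unsigned long page_increment(unsigned long start,
				    unsigned long page_mask,
				    unsigned long nr_pages)
{
	unsigned long inc = 1 + (~(start >> PAGE_SHIFT_SKETCH) & page_mask);

	return inc > nr_pages ? nr_pages : inc;
}

/* Count loop iterations needed to cover nr_pages with a given page mask. */
static unsigned long populate_steps(unsigned long start,
				    unsigned long nr_pages,
				    unsigned long page_mask)
{
	unsigned long steps = 0;

	while (nr_pages) {
		unsigned long inc = page_increment(start, page_mask, nr_pages);

		start += inc << PAGE_SHIFT_SKETCH;
		nr_pages -= inc;
		steps++;
	}
	return steps;
}
```

Covering one aligned huge page takes a single step with the mask set, against 512 steps when the mask is 0.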
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • M
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse committed
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      #include <stdio.h>
      #include <sys/mman.h>

      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • Z
      mm: accurately document nr_free_*_pages functions with code comments · e0fb5815
      Zhang Yanfei committed
      nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
      are horribly badly named, so accurately document them with code comments
      to guard against misuse.
      
      [akpm@linux-foundation.org: tweak comments]
      Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0fb5815
    • N
      HWPOISON: change order of error_states[]'s elements · 5f4b9fc5
      Naoya Horiguchi committed
      error_states[] has two separate states "unevictable LRU page" and
      "mlocked LRU page", and the former one has the higher priority now.  But
      because of that the latter one is rarely chosen, since pages with
      PageMlocked very likely also have PG_unevictable set.  On the other hand,
      PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
      shared memory, so reversing the priority of these two states helps us
      clearly distinguish them.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f4b9fc5
    • N
      HWPOISON: fix misjudgement of page_action() for errors on mlocked pages · 524fca1e
      Naoya Horiguchi committed
      memory_failure() can't handle memory errors on mlocked pages correctly,
      because page_action() judges such errors as ones on "unknown pages"
      instead of ones on "unevictable LRU page" or "mlocked LRU page".  In
      order to determine page_state, page_action() checks page flags at the
      time of the judgement, but those page flags are not the same as the ones
      just after memory_failure() is called, because memory_failure() does
      unmapping of the error pages before doing page_action().  This unmapping
      changes the page state, especially page_remove_rmap() (called from
      try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
      mlocked pages after that.
      
      With this patch, we store the page flag of the error page before doing
      unmap, and (only) if the first check with page flags at the time decided
      the error page is unknown, we do the second check with the stored page
      flag.  This implementation doesn't change error handling for the page
      types for which the first check can determine the page state correctly.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      524fca1e
    • H
      memcg: stop warning on memcg_propagate_kmem · 6d043990
      Hugh Dickins committed
      Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
      I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
      "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
      used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d043990
    • Z
      net: change type of virtio_chan->p9_max_pages · 7293bfba
      Zhang Yanfei committed
      This member of struct virtio_chan is calculated from nr_free_buffer_pages
      so change its type to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7293bfba
    • Z
      vmscan: change type of vm_total_pages to unsigned long · b21e0b90
      Zhang Yanfei committed
      This variable is calculated from nr_free_pagecache_pages so
      change its type to unsigned long.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b21e0b90
    • Z
      fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used · 697ce9be
      Zhang Yanfei committed
      The three variables are calculated from nr_free_buffer_pages so change
      their types to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      697ce9be
    • Z
      fs/buffer.c: change type of max_buffer_heads to unsigned long · 43be594a
      Zhang Yanfei committed
      max_buffer_heads is calculated from nr_free_buffer_pages(), so change
      its type to unsigned long to avoid overflow.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43be594a
    • Z
      ia64: use %ld to print pages calculated in nr_free_buffer_pages · 6434b94a
      Zhang Yanfei committed
      Now the function nr_free_buffer_pages returns unsigned long, so use %ld
      to print its return value.
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6434b94a
    • Z
      mm: fix return type for functions nr_free_*_pages · ebec3862
      Zhang Yanfei committed
      Currently, the amount of RAM that functions nr_free_*_pages return is
      held in unsigned int.  But in machines with big memory (exceeding 16TB),
      the amount may be incorrect because of overflow, so fix it.
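
The 16TB figure follows from simple arithmetic, sketched here under the assumption of 4 KiB pages (the helper names are hypothetical): 16 TiB is 2^44 bytes, which is 2^32 pages, one more than a 32-bit unsigned value can hold.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12	/* assumed 4 KiB pages */

/* Page count held in a type wide enough for large machines. */
static uint64_t bytes_to_pages(uint64_t bytes)
{
	return bytes >> PAGE_SHIFT_SKETCH;
}

/* The same count squeezed into 32 bits: silently wraps modulo 2^32. */
static uint32_t pages_as_u32(uint64_t bytes)
{
	return (uint32_t)bytes_to_pages(bytes);
}
```

For exactly 16 TiB the truncated count wraps to 0, while the 64-bit count is 4294967296.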
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Miller <davem@davemloft.net>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebec3862
    • M
      memcg: cleanup mem_cgroup_init comment · 1081312f
      Michal Hocko committed
      We should encourage all memcg controller initialization independent of a
      specific mem_cgroup to be done here rather than exploit css_alloc
      callback and assume that nothing happens before root cgroup is created.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1081312f
    • M
      memcg: move memcg_stock initialization to mem_cgroup_init · e4777496
      Michal Hocko committed
      memcg_stock are currently initialized during the root cgroup allocation
      which is OK but it pointlessly pollutes memcg allocation code with
      something that can be called when the memcg subsystem is initialized by
      mem_cgroup_init along with other controller specific parts.
      
      This patch wraps the current memcg_stock initialization code into a
      helper and calls it from the controller subsystem initialization code.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4777496
    • M
      memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init · 8787a1df
      Michal Hocko committed
      Per-node-zone soft limit tree is currently initialized when the root
      cgroup is created which is OK but it pointlessly pollutes memcg
      allocation code with something that can be called when the memcg
      subsystem is initialized by mem_cgroup_init along with other controller
      specific parts.
      
      While we are at it, let's make mem_cgroup_soft_limit_tree_init void: it
      doesn't make much sense to report an allocation failure, because if we
      fail to allocate memory that early during boot then we are screwed
      anyway (this saves some code).
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8787a1df
    • M
      mm: use up free swap space before reaching OOM kill · 0e50ce3b
      Minchan Kim committed
      Recently, Luigi reported that there is a lot of free swap space when OOM
      happens.  It's easily reproduced on zram-over-swap, where many instances
      of memory hogs are running and laptop_mode is enabled.  He said there
      was no problem when he disabled laptop_mode.  My investigation of the
      problem found the following.

      Assumption for ease of explanation: there are no page cache pages in the
      system because they have all been reclaimed already.

      1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
      2. shrink_inactive_list isolates victim pages from the inactive anon LRU list.
      3. shrink_page_list adds them to swapcache via add_to_swap but doesn't
         page them out because sc->may_writepage is 0, so the pages are rotated
         back into the inactive anon LRU list. add_to_swap made the pages dirty
         via SetPageDirty.
      4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases
         the priority and retries reclaim at the higher priority.
      5. shrink_inactive_list tries to isolate victim pages from the inactive
         anon LRU list but fails, because it isolates in ISOLATE_CLEAN mode and
         the list is full of pages dirtied in step 3, so it returns without any
         reclaim progress.
      6. do_try_to_free_pages doesn't set may_writepage because total_scanned
         is zero: sc->nr_scanned is increased by shrink_page_list, but
         shrink_page_list isn't called in step 5 for lack of isolated pages.

      The above loop continues until OOM happens.
      
      The problem didn't happen before [1] was merged because the old logic's
      isolation in shrink_inactive_list succeeded and shrink_page_list was
      called to page the pages out; it still failed to page them out because
      of may_writepage.  The important point is that sc->nr_scanned was
      increased even though we couldn't swap the pages out, so
      do_try_to_free_pages could set may_writepage.

      Since commit f80c0673 ("mm: zone_reclaim: make isolate_lru_page()
      filter-aware") was introduced, it's no longer a good idea to depend
      only on the number of scanned pages for setting may_writepage.  So this
      patch adds a new trigger point for setting may_writepage: a priority
      below DEF_PRIORITY - 2, which is used to indicate significant memory
      pressure in the VM.  That is a good fit for our purpose: it is better to
      lose power saving or clickety than to OOM kill.
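
A minimal model of the two trigger conditions, assuming DEF_PRIORITY is 12 as in mainline kernels of that era. The struct and function names are hypothetical and the writeback threshold is passed in as a plain parameter; this only illustrates why the priority check helps when nothing is scanned:

```c
#include <assert.h>

#define DEF_PRIORITY_SKETCH 12	/* assumed to match the kernel's DEF_PRIORITY */

/* Stand-in for the few scan_control fields that matter here. */
struct scan_control_sketch {
	int priority;		/* starts at DEF_PRIORITY, lower = more pressure */
	int may_writepage;	/* 0 while laptop_mode suppresses writeback */
};

/* Decide whether reclaim may start writing pages out. */
static void update_may_writepage(struct scan_control_sketch *sc,
				 unsigned long total_scanned,
				 unsigned long writeback_threshold)
{
	/* New trigger: significant pressure, regardless of scan counts. */
	if (sc->priority < DEF_PRIORITY_SKETCH - 2)
		sc->may_writepage = 1;

	/* Old trigger: only fires if pages were actually scanned. */
	if (total_scanned > writeback_threshold)
		sc->may_writepage = 1;
}
```

With `total_scanned` stuck at zero, the old trigger never fires, but once the priority drops to 9 the new check enables writeback.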
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Luigi Semenzato <semenzato@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e50ce3b
    • D
      mm: use NUMA_NO_NODE · 00ef2d2f
      David Rientjes committed
      Make a sweep through mm/ and convert code that uses -1 directly to using
      the more appropriate NUMA_NO_NODE.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00ef2d2f
    • R
      mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts · 751efd86
      Robin Holt committed
      There is a race condition between mmu_notifier_unregister() and
      __mmu_notifier_release().
      
      Assume two tasks, one calling mmu_notifier_unregister() as a result of a
      filp_close() ->flush() callout (task A), and the other calling
      mmu_notifier_release() from an mmput() (task B).
      
                      A                               B
      t1                                              srcu_read_lock()
      t2              if (!hlist_unhashed())
      t3                                              srcu_read_unlock()
      t4              srcu_read_lock()
      t5                                              hlist_del_init_rcu()
      t6                                              synchronize_srcu()
      t7              srcu_read_unlock()
      t8              hlist_del_rcu()  <--- NULL pointer deref.
      
      Additionally, the list traversal in __mmu_notifier_release() is not
      protected by the mmu_notifier_mm->hlist_lock, which can result in
      callouts to the ->release() notifier from both mmu_notifier_unregister()
      and __mmu_notifier_release().
      
      -stable suggestions:
      
      The stable trees prior to 3.7.y need commits 21a92735 and
      70400303 cherry-picked in that order prior to cherry-picking this
      commit.  The 3.7.y tree already has those two commits.
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      751efd86
    • C
      mm/memory_hotplug: use pgdat_end_pfn() instead of open coding the same. · c1f19495
      Cody P Schafer committed
      Replace the open-coded pgdat_end_pfn() with the helper function.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1f19495
    • C
      mm/memory_hotplug: use ensure_zone_is_initialized() · 64dd1b29
      Cody P Schafer committed
      Remove open coding of ensure_zone_is_initialized().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64dd1b29
    • C
      mm: add helper ensure_zone_is_initialized() · f6bbb78e
      Cody P Schafer committed
      ensure_zone_is_initialized() checks if a zone is in an empty & not
      initialized state (typically occurring after it is created in memory
      hotplugging), and, if so, calls init_currently_empty_zone() to
      initialize the zone.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6bbb78e
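The check described in this commit message can be sketched in user space as follows. This is a minimal illustration, not the kernel code: `struct zone_sketch`, its `initialized` and `init_calls` fields, and `init_currently_empty_zone_stub()` are hypothetical stand-ins, and the real helper's signature and initialization test may differ.

```c
#include <stdbool.h>

/* Simplified stand-in for struct zone; the real structure has many more fields. */
struct zone_sketch {
    unsigned long zone_start_pfn;
    unsigned long spanned_pages;
    bool initialized;   /* hypothetical flag standing in for the kernel's own test */
    int init_calls;     /* counts initializations, for illustration only */
};

/* Stub standing in for init_currently_empty_zone(); returns 0 on success. */
static int init_currently_empty_zone_stub(struct zone_sketch *zone,
                                          unsigned long start_pfn)
{
    zone->zone_start_pfn = start_pfn;
    zone->initialized = true;
    zone->init_calls++;
    return 0;
}

/* Initialize the zone only if it is both empty and not yet initialized. */
static int ensure_zone_is_initialized(struct zone_sketch *zone,
                                      unsigned long start_pfn)
{
    if (zone->spanned_pages == 0 && !zone->initialized)
        return init_currently_empty_zone_stub(zone, start_pfn);
    return 0;
}
```

Calling the helper twice on a freshly created zone runs the initialization once; the second call is a no-op.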
    • C
      mm/page_alloc: add informative debugging message in page_outside_zone_boundaries() · b5e6a5a2
      Cody P Schafer committed
      Add a debug message which prints when a page is found outside of the
      boundaries of the zone it should belong to. Format is:
      	"page $pfn outside zone [ $start_pfn - $end_pfn ]"
      
      [akpm@linux-foundation.org: s/pr_debug/pr_err/]
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5e6a5a2
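The message format quoted above can be illustrated with a small user-space sketch. The function name `report_if_outside_zone()` and the use of snprintf() into a caller-supplied buffer are assumptions made for illustration; the commit itself emits the message through the kernel's pr_err().

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: if pfn lies outside [start_pfn, end_pfn), write the
 * message in the format quoted in the commit into buf and return 1;
 * otherwise leave buf untouched and return 0.
 */
static int report_if_outside_zone(char *buf, size_t len, unsigned long pfn,
                                  unsigned long start_pfn,
                                  unsigned long end_pfn)
{
    if (pfn >= start_pfn && pfn < end_pfn)
        return 0;

    snprintf(buf, len, "page %lu outside zone [ %lu - %lu ]",
             pfn, start_pfn, end_pfn);
    return 1;
}
```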
    • C
      mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate. · da3649e1
      Cody P Schafer committed
      Add pgdat_end_pfn() and pgdat_is_empty() helpers which match the similar
      zone_*() functions.
      
      Change node_end_pfn() to be a wrapper of pgdat_end_pfn().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da3649e1
    • C
      mm/page_alloc: add a VM_BUG in __free_one_page() if the zone is uninitialized. · d29bb978
      Cody P Schafer committed
      Freeing pages to uninitialized zones is not handled by
      __free_one_page(), and should never happen when the code is correct.
      
      Ran into this while writing some code that dynamically onlines extra
      zones.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d29bb978
    • C
      mm: add zone_is_empty() and zone_is_initialized() · 2a6e3ebe
      Cody P Schafer committed
      Factoring out these 2 checks makes it clearer what we are actually
      checking for.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a6e3ebe
    • C
      mm: add & use zone_end_pfn() and zone_spans_pfn() · 108bcc96
      Cody P Schafer committed
      Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
      duplication.
      
      This also switches to using them in compaction (where an additional
      variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
      kmemleak.
      
      Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
      because I expect at some point the synchronization issues with start_pfn &
      spanned_pages will need fixing, either by actually using the seqlock or
      clever memory barrier usage.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      108bcc96
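A minimal sketch of what the two helpers named in this commit compute, assuming a zone is described by a start page frame number and a count of spanned pages. The two-field `struct zone` below is a simplified stand-in for the kernel structure, and the half-open interval is an assumption consistent with "end" meaning one past the last spanned page.

```c
#include <stdbool.h>

/* Simplified stand-in for struct zone: only the fields the helpers need. */
struct zone {
    unsigned long zone_start_pfn;   /* first page frame number in the zone */
    unsigned long spanned_pages;    /* pages spanned, including any holes */
};

/* One past the last page frame number spanned by the zone. */
static inline unsigned long zone_end_pfn(const struct zone *zone)
{
    return zone->zone_start_pfn + zone->spanned_pages;
}

/* True if pfn falls inside the half-open range [start, end). */
static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
{
    return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
}
```

Replacing each open-coded `zone_start_pfn + spanned_pages` with a call keeps the boundary convention in one place, which is the duplication the commit message refers to.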
    • C
      mm: add SECTION_IN_PAGE_FLAGS · 9127ab4f
      Cody P Schafer committed
      Instead of directly utilizing a combination of config options to determine
      this, add a macro to specifically address it.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: David Hansen <dave@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9127ab4f
    • J
      mm/mlock.c: document scary-looking stack expansion mlock chain · 4805b02e
      Johannes Weiner committed
      The fact that mlock calls get_user_pages, and get_user_pages might call
      mlock when expanding a stack looks like a potential recursion.
      
      However, mlock makes sure the requested range is already contained
      within a vma, so no stack expansion will actually happen from mlock.
      
      Should this ever change: the stack expansion mlocks only the newly
      expanded range and so will not result in recursive expansion.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4805b02e
    • J
      mm: refactor inactive_file_is_low() to use get_lru_size() · e3790144
      Johannes Weiner committed
      An inactive file list is considered low when its active counterpart is
      bigger, regardless of whether it is a global zone LRU list or a memcg
      zone LRU list.  The only difference is in how the LRU size is assessed.
      
      get_lru_size() does the right thing for both global and memcg reclaim
      situations.
      
      Get rid of inactive_file_is_low_global() and
      mem_cgroup_inactive_file_is_low() by using get_lru_size() and comparing
      the numbers in common code.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3790144
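The rule stated in this commit message (the inactive file list is low when the active file list is bigger) can be sketched in user space as below. `struct lruvec_sketch`, the reduced `enum lru_list`, and this two-argument `get_lru_size()` are simplified assumptions for illustration; the kernel versions carry more state and handle both global and memcg reclaim.

```c
#include <stdbool.h>

/* Reduced set of LRU lists; the kernel enum has more entries. */
enum lru_list { LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, NR_LRU_SKETCH };

/* Hypothetical stand-in for a per-zone or per-memcg LRU vector. */
struct lruvec_sketch {
    unsigned long size[NR_LRU_SKETCH];
};

/* Single size accessor shared by all callers, as the commit describes. */
static unsigned long get_lru_size(const struct lruvec_sketch *lruvec,
                                  enum lru_list lru)
{
    return lruvec->size[lru];
}

/* The inactive file list is low when its active counterpart is bigger. */
static bool inactive_file_is_low(const struct lruvec_sketch *lruvec)
{
    unsigned long inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
    unsigned long active = get_lru_size(lruvec, LRU_ACTIVE_FILE);

    return active > inactive;
}
```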
    • J
      mm: shmem: use new radix tree iterator · 860f2759
      Johannes Weiner committed
      In shmem_find_get_pages_and_swap(), use the faster radix tree iterator
      construct from commit 78c1d784 ("radix-tree: introduce bit-optimized
      iterator").
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      860f2759
    • H
      ksm: stop hotremove lockdep warning · ef4d43a8
      Hugh Dickins committed
      Complaints are rare, but lockdep still does not understand the way
      ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
      it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
      problem because notifier callbacks are made under down_read of
      blocking_notifier_head->rwsem (so first the mutex is taken while holding
      the rwsem, then later the rwsem is taken while still holding the mutex);
      but is not in fact a problem because mem_hotplug_mutex is held
      throughout the dance.
      
      There was an attempt to fix this with mutex_lock_nested(); but if that
      happened to fool lockdep two years ago, apparently it does so no longer.
      
      I had hoped to eradicate this issue in extending KSM page migration not
      to need the ksm_thread_mutex.  But then realized that although the page
      migration itself is safe, we do still need to lock out ksmd and other
      users of get_ksm_page() while offlining memory - at some point between
      MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may
      vanish, and get_ksm_page()'s accesses to them become a violation.
      
      So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
      MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
      checks, to achieve the same lockout without being caught by lockdep.
      This is less elegant for KSM, but it's more important to keep lockdep
      useful to other users - and I apologize for how long it took to fix.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef4d43a8
    • H
      mm: remove offlining arg to migrate_pages · 9c620e2b
      Hugh Dickins committed
      No functional change, but the only purpose of the offlining argument to
      migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
      KSM page for memory hotremove (which took ksm_thread_mutex) but not for
      other callers.  Now all cases are safe, remove the arg.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c620e2b
    • H
      ksm: enable KSM page migration · b79bc0a0
      Hugh Dickins committed
      Migration of KSM pages is now safe: remove the PageKsm restrictions from
      mempolicy.c and migrate.c.
      
      But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
      are irrelevant to KSM: it looks as if that code was preventing hotremove
      migration of KSM pages, unless they happened to be in swapcache.
      
      There is some question as to whether enforcing a NUMA mempolicy migration
      ought to migrate KSM pages, mapped into entirely unrelated processes; but
      moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
      and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
      any area where this is a worry.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b79bc0a0