1. 10 1月, 2012 1 次提交
  2. 08 1月, 2012 1 次提交
    • G
      net: fix sock_clone reference mismatch with tcp memcontrol · f3f511e1
      Glauber Costa 提交于
      Sockets can also be created through sock_clone. Because it copies
      all data in the sock structure, it also copies the memcg-related pointer,
      and all should be fine. However, since we now use reference counts in
      socket creation, we are left with some sockets that have no reference
      counts. It matters when we destroy them, since it leads to a mismatch.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Greg Thelen <gthelen@google.com>
      CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3f511e1
  3. 07 1月, 2012 1 次提交
  4. 04 1月, 2012 9 次提交
  5. 30 12月, 2011 2 次提交
    • H
      mm: hugetlb: fix non-atomic enqueue of huge page · b0365c8d
      Hillf Danton 提交于
      If a huge page is enqueued under the protection of hugetlb_lock, then the
      operation is atomic and safe.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>		[2.6.37+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0365c8d
    • K
      mm/mempolicy.c: refix mbind_range() vma issue · e26a5114
      KOSAKI Motohiro 提交于
      commit 8aacc9f5 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is the
      slightly incorrect fix.
      
      Why? Think following case.
      
      1. map 4 pages of a file at offset 0
      
         [0123]
      
      2. map 2 pages just after the first mapping of the same file but with
         page offset 2
      
         [0123][23]
      
      3. mbind() 2 pages from the first mapping at offset 2.
         mbind_range() should treat new vma is,
      
         [0123][23]
           |23|
           mbind vma
      
         but it does
      
         [0123][23]
           |01|
           mbind vma
      
         Oops. then, it makes wrong vma merge and splitting ([01][0123] or similar).
      
      This patch fixes it.
      
      [testcase]
        test result - before the patch
      
      	case4: 126: test failed. expect '2,4', actual '2,2,2'
             	case5: passed
      	case6: passed
      	case7: passed
      	case8: passed
      	case_n: 246: test failed. expect '4,2', actual '1,4'
      
      	------------[ cut here ]------------
      	kernel BUG at mm/filemap.c:135!
      	invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC
      
      	(snip long bug on messages)
      
        test result - after the patch
      
      	case4: passed
             	case5: passed
      	case6: passed
      	case7: passed
      	case8: passed
      	case_n: passed
      
        source:  mbind_vma_test.c
      ============================================================
       #include <numaif.h>
       #include <numa.h>
       #include <sys/mman.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <string.h>
      
      static unsigned long pagesize;
      void* mmap_addr;
      struct bitmask *nmask;
      char buf[1024];
      FILE *file;
      char retbuf[10240] = "";
      int mapped_fd;
      
      char *rubysrc = "ruby -e '\
        pid = %d; \
        vstart = 0x%llx; \
        vend = 0x%llx; \
        s = `pmap -q #{pid}`; \
        rary = []; \
        s.each_line {|line|; \
          ary=line.split(\" \"); \
          addr = ary[0].to_i(16); \
          if(vstart <= addr && addr < vend) then \
            rary.push(ary[1].to_i()/4); \
          end; \
        }; \
        print rary.join(\",\"); \
      '";
      
      void init(void)
      {
      	void* addr;
      	char buf[128];
      
      	nmask = numa_allocate_nodemask();
      	numa_bitmask_setbit(nmask, 0);
      
      	pagesize = getpagesize();
      
      	sprintf(buf, "%s", "mbind_vma_XXXXXX");
      	mapped_fd = mkstemp(buf);
      	if (mapped_fd == -1)
      		perror("mkstemp "), exit(1);
      	unlink(buf);
      
      	if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
      		perror("lseek "), exit(1);
      	if (write(mapped_fd, "\0", 1) < 0)
      		perror("write "), exit(1);
      
      	addr = mmap(NULL, pagesize*8, PROT_NONE,
      		    MAP_SHARED, mapped_fd, 0);
      	if (addr == MAP_FAILED)
      		perror("mmap "), exit(1);
      
      	if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
      		perror("mprotect "), exit(1);
      
      	mmap_addr = addr + pagesize;
      
      	/* make page populate */
      	memset(mmap_addr, 0, pagesize*6);
      }
      
      void fin(void)
      {
      	void* addr = mmap_addr - pagesize;
      	munmap(addr, pagesize*8);
      
      	memset(buf, 0, sizeof(buf));
      	memset(retbuf, 0, sizeof(retbuf));
      }
      
      void mem_bind(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_BIND, nmask->maskp, nmask->size, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void mem_interleave(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void mem_unbind(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_DEFAULT, NULL, 0, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void Assert(char *expected, char *value, char *name, int line)
      {
      	if (strcmp(expected, value) == 0) {
      		fprintf(stderr, "%s: passed\n", name);
      		return;
      	}
      	else {
      		fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
      			name, line,
      			expected, value);
      //		exit(1);
      	}
      }
      
      /*
            AAAA
          PPPPPPNNNNNN
          might become
          PPNNNNNNNNNN
          case 4 below
      */
      void case4(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 4);
      	mem_unbind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("2,4", retbuf, "case4", __LINE__);
      
      	fin();
      }
      
      /*
             AAAA
       PPPPPPNNNNNN
       might become
       PPPPPPPPPPNN
       case 5 below
      */
      void case5(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case5", __LINE__);
      
      	fin();
      }
      
      /*
      	    AAAA
      	PPPPNNNNXXXX
      	might become
      	PPPPPPPPPPPP 6
      */
      void case6(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_bind(4, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("6", retbuf, "case6", __LINE__);
      
      	fin();
      }
      
      /*
          AAAA
      PPPPNNNNXXXX
      might become
      PPPPPPPPXXXX 7
      */
      void case7(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_interleave(4, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case7", __LINE__);
      
      	fin();
      }
      
      /*
          AAAA
      PPPPNNNNXXXX
      might become
      PPPPNNNNNNNN 8
      */
      void case8(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_interleave(4, 2);
      	mem_interleave(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("2,4", retbuf, "case8", __LINE__);
      
      	fin();
      }
      
      void case_n(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	/* make redundunt mappings [0][1234][34][7] */
      	mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
      	     MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);
      
      	/* Expect to do nothing. */
      	mem_unbind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case_n", __LINE__);
      
      	fin();
      }
      
      int main(int argc, char** argv)
      {
      	case4();
      	case5();
      	case6();
      	case7();
      	case8();
      	case_n();
      
      	return 0;
      }
      =============================================================
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Caspar Zhang <caspar@casparzhang.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: <stable@vger.kernel.org>		[3.1.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e26a5114
  6. 23 12月, 2011 2 次提交
    • G
      Partial revert "Basic kernel memory functionality for the Memory Controller" · 65c64ce8
      Glauber Costa 提交于
      This reverts commit e5671dfa.
      
      After a follow up discussion with Michal, it was agreed it would
      be better to leave the kmem controller with just the tcp files,
      deferring the behavior of the other general memory.kmem.* files
      for a later time, when more caches are controlled. This is because
      generic kmem files are not used by tcp accounting and it is
      not clear how other slab caches would fit into the scheme.
      
      We are reverting the original commit so we can track the reference.
      Part of the patch is kept, because it was used by the later tcp
      code. Conflicts are shown in the bottom. init/Kconfig is removed from
      the revert entirely.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      CC: Kirill A. Shutemov <kirill@shutemov.name>
      CC: Paul Menage <paul@paulmenage.org>
      CC: Greg Thelen <gthelen@google.com>
      CC: Johannes Weiner <jweiner@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      
      Conflicts:
      
      	Documentation/cgroups/memory.txt
      	mm/memcontrol.c
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65c64ce8
    • C
      percpu: Remove irqsafe_cpu_xxx variants · 933393f5
      Christoph Lameter 提交于
      We simply say that regular this_cpu use must be safe regardless of
      preemption and interrupt state.  That has no material change for x86
      and s390 implementations of this_cpu operations.  However, arches that
      do not provide their own implementation for this_cpu operations will
      now get code generated that disables interrupts instead of preemption.
      
      -tj: This is part of on-going percpu API cleanup.  For detailed
           discussion of the subject, please refer to the following thread.
      
           http://thread.gmane.org/gmane.linux.kernel/1222078Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.1112221154380.11787@router.home>
      933393f5
  7. 22 12月, 2011 2 次提交
  8. 21 12月, 2011 3 次提交
  9. 16 12月, 2011 1 次提交
    • E
      percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addresses · 9f57bd4d
      Eugene Surovegin 提交于
      per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc
      case to the page boundary, which is bogus for any non-page-aligned
      address.
      
      This affects the only in-tree user of this function - sysfs handler
      for per-cpu 'crash_notes' physical address.  The trouble is that the
      crash_notes per-cpu variable is not page-aligned:
      
      crash_notes = 0xc08e8ed4
      PER-CPU OFFSET VALUES:
       CPU 0: 3711f000
       CPU 1: 37129000
       CPU 2: 37133000
       CPU 3: 3713d000
      
      So, the per-cpu addresses are:
       crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
       crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
       crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
       crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4
      
      However, /sys/devices/system/cpu/cpu*/crash_notes says:
       /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
       /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
       /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
       /sys/devices/system/cpu/cpu3/crash_notes: 36b39000
      
      As you can see, all values are rounded down to a page
      boundary. Consequently, this is where kexec sets up the NOTE segments,
      and thus where the secondary kernel is looking for them. However, when
      the first kernel crashes, it saves the notes to the unaligned
      addresses, where they are not found.
      
      Fix it by adding offset_in_page() to the translated page address.
      
      -tj: Combined Eugene's and Petr's commit messages.
      Signed-off-by: NEugene Surovegin <ebs@ebshome.net>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NPetr Tesarik <ptesarik@suse.cz>
      Cc: stable@kernel.org
      9f57bd4d
  10. 13 12月, 2011 4 次提交
  11. 09 12月, 2011 14 次提交
    • M
      mm: vmalloc: check for page allocation failure before vmlist insertion · 1368edf0
      Mel Gorman 提交于
      Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
      /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
      it is fully initialised.  Unfortunately, it did not check that
      __vmalloc_area_node() successfully populated the area.  In the event of
      allocation failure, the vmalloc area is freed but the pointer to freed
      memory is inserted into the vmlist leading to a a crash later in
      get_vmalloc_info().
      
      This patch adds a check for ____vmalloc_area_node() failure within
      __vmalloc_node_range.  It does not use "goto fail" as in the previous
      error path as a warning was already displayed by __vmalloc_area_node()
      before it called vfree in its failure path.
      
      Credit goes to Luciano Chavez for doing all the real work of identifying
      exactly where the problem was.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NLuciano Chavez <lnx1138@linux.vnet.ibm.com>
      Tested-by: NLuciano Chavez <lnx1138@linux.vnet.ibm.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>		[3.1.x+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1368edf0
    • M
      mm: Ensure that pfn_valid() is called once per pageblock when reserving pageblocks · d0215638
      Michal Hocko 提交于
      setup_zone_migrate_reserve() expects that zone->start_pfn starts at
      pageblock_nr_pages aligned pfn otherwise we could access beyond an
      existing memblock resulting in the following panic if
      CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:
      
        IP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180
        *pdpt = 0000000000000000 *pde = f000ff53f000ff53
        Oops: 0000 [#1] SMP
        Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
        EIP: 0060:[<c02d331d>] EFLAGS: 00010006 CPU: 0
        EIP is at setup_zone_migrate_reserve+0xcd/0x180
        EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
        ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
        DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
        Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
        Call Trace:
        [<c02d389c>] __setup_per_zone_wmarks+0xec/0x160
        [<c02d3a1f>] setup_per_zone_wmarks+0xf/0x20
        [<c08a771c>] init_per_zone_wmark_min+0x27/0x86
        [<c020111b>] do_one_initcall+0x2b/0x160
        [<c086639d>] kernel_init+0xbe/0x157
        [<c05cae26>] kernel_thread_helper+0x6/0xd
        Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 <2b> 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
        EIP: [<c02d331d>] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
        CR2: 00000000000012b4
      
      We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
      highstart_pfn = 0x36ffe.
      
      The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
      in a pageblock is reserved before marking it MIGRATE_RESERVE").
      
      Make sure that start_pfn is always aligned to pageblock_nr_pages to
      ensure that pfn_valid s always called at the start of each pageblock.
      Architectures with holes in pageblocks will be correctly handled by
      pfn_valid_within in pageblock_is_reserved.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NDang Bo <bdang@vmware.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Arve Hjnnevg <arve@android.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[3.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0215638
    • H
      mm/migrate.c: pair unlock_page() and lock_page() when migrating huge pages · 09761333
      Hillf Danton 提交于
      Avoid unlocking and unlocked page if we failed to lock it.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09761333
    • Y
      thp: set compound tail page _count to zero · 58a84aa9
      Youquan Song 提交于
      Commit 70b50f94 ("mm: thp: tail page refcounting fix") keeps all
      page_tail->_count zero at all times.  But the current kernel does not
      set page_tail->_count to zero if a 1GB page is utilized.  So when an
      IOMMU 1GB page is used by KVM, it wil result in a kernel oops because a
      tail page's _count does not equal zero.
      
        kernel BUG at include/linux/mm.h:386!
        invalid opcode: 0000 [#1] SMP
        Call Trace:
          gup_pud_range+0xb8/0x19d
          get_user_pages_fast+0xcb/0x192
          ? trace_hardirqs_off+0xd/0xf
          hva_to_pfn+0x119/0x2f2
          gfn_to_pfn_memslot+0x2c/0x2e
          kvm_iommu_map_pages+0xfd/0x1c1
          kvm_iommu_map_memslots+0x7c/0xbd
          kvm_iommu_map_guest+0xaa/0xbf
          kvm_vm_ioctl_assigned_device+0x2ef/0xa47
          kvm_vm_ioctl+0x36c/0x3a2
          do_vfs_ioctl+0x49e/0x4e4
          sys_ioctl+0x5a/0x7c
          system_call_fastpath+0x16/0x1b
        RIP  gup_huge_pud+0xf2/0x159
      Signed-off-by: NYouquan Song <youquan.song@intel.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58a84aa9
    • A
      thp: reduce khugepaged freezing latency · 1dfb059b
      Andrea Arcangeli 提交于
      khugepaged can sometimes cause suspend to fail, requiring that the user
      retry the suspend operation.
      
      Use wait_event_freezable_timeout() instead of
      schedule_timeout_interruptible() to avoid missing freezer wakeups.  A
      try_to_freeze() would have been needed in the khugepaged_alloc_hugepage
      tight loop too in case of the allocation failing repeatedly, and
      wait_event_freezable_timeout will provide it too.
      
      khugepaged would still freeze just fine by trying again the next minute
      but it's better if it freezes immediately.
      Reported-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: NJiri Slaby <jslaby@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dfb059b
    • K
      vmscan: use atomic-long for shrinker batching · 83aeeada
      Konstantin Khlebnikov 提交于
      Use atomic-long operations instead of looping around cmpxchg().
      
      [akpm@linux-foundation.org: massage atomic.h inclusions]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83aeeada
    • K
      vmscan: fix initial shrinker size handling · 635697c6
      Konstantin Khlebnikov 提交于
      A shrinker function can return -1, means that it cannot do anything
      without a risk of deadlock.  For example prune_super() does this if it
      cannot grab a superblock refrence, even if nr_to_scan=0.  Currently we
      interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
      according to this.  So the next time around this shrinker can cause
      really big pressure.  Let's skip such shrinkers instead.
      
      Also make total_scan signed, otherwise the check (total_scan < 0) below
      never works.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      635697c6
    • T
      memblock: Reimplement memblock allocation using reverse free area iterator · 7bd0b0f0
      Tejun Heo 提交于
      Now that all early memory information is in memblock when enabled, we
      can implement reverse free area iterator and use it to implement NUMA
      aware allocator which is then wrapped for simpler variants instead of
      the confusing and inefficient mending of information in separate NUMA
      aware allocator.
      
      Implement for_each_free_mem_range_reverse(), use it to reimplement
      memblock_find_in_range_node() which in turn is used by all allocators.
      
      The visible allocator interface is inconsistent and can probably use
      some cleanup too.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      7bd0b0f0
    • T
      memblock: Kill early_node_map[] · 0ee332c1
      Tejun Heo 提交于
      Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEBLOCK_NODE_MAP -
      there's no user of early_node_map[] left.  Kill early_node_map[] and
      replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP.  Also,
      relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
      as page_alloc.c would no longer host an alternative implementation.
      
      This change is ultimately one to one mapping and shouldn't cause any
      observable difference; however, after the recent changes, there are
      some functions which now would fit memblock.c better than page_alloc.c
      and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
      doesn't make much sense on some of them.  Further cleanups for
      functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.
      
      -v2: Fix compile bug introduced by mis-spelling
       CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
       mmzone.h.  Reported by Stephen Rothwell.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      0ee332c1
    • T
      memblock: Implement memblock_add_node() · 7fb0bc3f
      Tejun Heo 提交于
      Implement memblock_add_node() which can add a new memblock memory
      region with specific node ID.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      7fb0bc3f
    • T
      memblock: s/memblock_analyze()/memblock_allow_resize()/ and update users · 1aadc056
      Tejun Heo 提交于
      The only function of memblock_analyze() is now allowing resize of
      memblock region arrays.  Rename it to memblock_allow_resize() and
      update its users.
      
      * The following users remain the same other than renaming.
      
        arm/mm/init.c::arm_memblock_init()
        microblaze/kernel/prom.c::early_init_devtree()
        powerpc/kernel/prom.c::early_init_devtree()
        openrisc/kernel/prom.c::early_init_devtree()
        sh/mm/init.c::paging_init()
        sparc/mm/init_64.c::paging_init()
        unicore32/mm/init.c::uc32_memblock_init()
      
      * In the following users, analyze was used to update total size which
        is no longer necessary.
      
        powerpc/kernel/machine_kexec.c::reserve_crashkernel()
        powerpc/kernel/prom.c::early_init_devtree()
        powerpc/mm/init_32.c::MMU_init()
        powerpc/mm/tlb_nohash.c::__early_init_mmu()  
        powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
        powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
        sh/kernel/machine_kexec.c::reserve_crashkernel()
      
      * x86/kernel/e820.c::memblock_x86_fill() was directly setting
        memblock_can_resize before populating memblock and calling analyze
        afterwards.  Call memblock_allow_resize() before start populating.
      
      memblock_can_resize is now static inside memblock.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      1aadc056
    • T
      memblock: Track total size of regions automatically · 1440c4e2
      Tejun Heo 提交于
      Total size of memory regions was calculated by memblock_analyze()
      requiring explicitly calling the function between operations which can
      change memory regions and possible users of total size, which is
      cumbersome and fragile.
      
      This patch makes each memblock_type track total size automatically
      with minor modifications to memblock manipulation functions and remove
      requirements on calling memblock_analyze().  [__]memblock_dump_all()
      now also dumps the total size of reserved regions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      1440c4e2
    • T
      memblock: Reimplement memblock_enforce_memory_limit() using __memblock_remove() · c0ce8fef
      Tejun Heo 提交于
      With recent updates, the basic memblock operations are robust enough
      that there's no reason for memblock_enfore_memory_limit() to directly
      manipulate memblock region arrays.  Reimplement it using
      __memblock_remove().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      c0ce8fef
    • T
      memblock: Make memblock functions handle overflowing range @size · eb18f1b5
      Tejun Heo 提交于
      Allow memblock users to specify range where @base + @size overflows
      and automatically cap it at maximum.  This makes the interface more
      robust and specifying till-the-end-of-memory easier.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      eb18f1b5