1. 22 March 2012, 1 commit
    • mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Authored by Andrea Arcangeli
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem held in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and trigger a false
      positive from pmd_bad(), which does not expect to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem: khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables from going away from under them; during code
      review it seems vm86 mode on 32bit kernels requires that too, unless
      it's restricted to 1 thread per process or UP builds).  The race is
      only with the huge page faults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example, if
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped; if the page fault runs first,
      it will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be
      enough to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd had become a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd into a local on the
      stack (regardless of what we read, no actual CPU barriers are needed,
      only a compiler barrier) and make sure it is not changing under the
      code that computes its value.  Even if the real pmd is changing under
      the value we hold on the stack, we don't care.  If we actually end up
      in zap_pte_range it means the pmd was not none already and it was not
      huge, and it can't become huge from under us (khugepaged locking
      explained above).
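
      A minimal sketch of that approach, modeled on the helper this fix
      introduces (pmd_none_or_trans_huge_or_clear_bad()); it is simplified
      here, so treat it as illustrative rather than the exact patch:

      	/* Sketch: do every check against a stable local copy of the pmd. */
      	static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
      	{
      		pmd_t pmdval = *pmd;	/* read *pmd once into a local */
      #ifdef CONFIG_TRANSPARENT_HUGEPAGE
      		barrier();	/* compiler barrier: no re-reads of *pmd below */
      #endif
      		if (pmd_none(pmdval))
      			return 1;
      		if (unlikely(pmd_bad(pmdval))) {
      			/* a hugepmd is "bad" here but must not be cleared */
      			if (!pmd_trans_huge(pmdval))
      				pmd_clear_bad(pmd);
      			return 1;
      		}
      		return 0;
      	}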
      
      All we need is to enforce that there is no longer any way that, in a
      code path like the one below, pmd_trans_huge can be false but
      pmd_none_or_clear_bad can still run into a hugepmd.  The overhead of a
      barrier() is just a compiler tweak and should not be measurable (I only
      added it for THP builds).  I don't exclude that different compiler
      versions may have prevented the race too, by caching the value of *pmd
      on the stack (that hasn't been verified, but it wouldn't be impossible
      considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge and pmd_none
      are all inlines and there's no external function called in between
      pmd_trans_huge and pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges, it was reported as CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: Larry Woodman <lwoodman@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  2. 07 March 2012, 1 commit
    • vm: avoid using find_vma_prev() unnecessarily · 097d5910
      Authored by Linus Torvalds
      Several users of "find_vma_prev()" were not in fact interested in the
      previous vma if there was no primary vma to be found either.  And in
      those cases, we're much better off just using the regular "find_vma()",
      and then "prev" can be looked up by just checking vma->vm_prev.
      
      The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
      commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return
      prev by reference" means that it generates worse code too.
      
      Thus this "let's avoid using this inconvenient and clearly too subtle
      interface when we don't really have to" patch.
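
      A sketch of the replacement pattern (caller shape assumed; note it is
      only valid when the caller doesn't need "prev" for a NULL vma):

      	struct vm_area_struct *vma, *prev;

      	vma = find_vma(mm, addr);		/* first vma with vm_end > addr */
      	prev = vma ? vma->vm_prev : NULL;	/* derived from the vma list */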
      
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      097d5910
  3. 13 January 2012, 1 commit
    • mm: compaction: introduce sync-light migration for use by compaction · a6bc32b8
      Authored by Mel Gorman
      This patch adds a lightweight sync migration mode, MIGRATE_SYNC_LIGHT,
      that avoids writing back pages to backing storage.  Async compaction
      maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
      For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
      used.
      
      This avoids sync compaction stalling for an excessive length of time,
      particularly when copying files to a USB stick where there might be a
      large number of dirty pages backed by a filesystem that does not support
      ->writepages.
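
      As a sketch, the resulting mode mapping (matching the shape of enum
      migrate_mode as this series defines it; comments are paraphrased):

      	enum migrate_mode {
      		MIGRATE_ASYNC,		/* async compaction: never blocks */
      		MIGRATE_SYNC_LIGHT,	/* sync compaction: may block, no writeback */
      		MIGRATE_SYNC,		/* e.g. memory hotplug: full sync */
      	};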
      
      [aarcange@redhat.com: This patch is heavily based on Andrea's work]
      [akpm@linux-foundation.org: fix fs/nfs/write.c build]
      [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andy Isaacson <adi@hexapodia.org>
      Cc: Nai Xia <nai.xia@gmail.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6bc32b8
  4. 11 January 2012, 1 commit
  5. 30 December 2011, 1 commit
    • mm/mempolicy.c: refix mbind_range() vma issue · e26a5114
      Authored by KOSAKI Motohiro
      commit 8aacc9f5 ("mm/mempolicy.c: fix pgoff in mbind vma merge") is a
      slightly incorrect fix.

      Why?  Consider the following case.
      
      1. map 4 pages of a file at offset 0
      
         [0123]
      
      2. map 2 pages just after the first mapping of the same file but with
         page offset 2
      
         [0123][23]
      
      3. mbind() 2 pages from the first mapping at offset 2.
         mbind_range() should treat the new vma as

         [0123][23]
           |23|
           mbind vma

         but it actually treats it as

         [0123][23]
           |01|
           mbind vma

         Oops.  It then performs a wrong vma merge and split ([01][0123] or
         similar).
      
      This patch fixes it.
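
      A sketch of the corrected computation in mbind_range() (surrounding
      call shape assumed): derive pgoff from the offset of "start" within
      the vma, instead of reusing the vma's own vm_pgoff for a range that
      starts mid-vma.

      	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
      	new = vma_merge(mm, prev, start, end, vma->vm_flags, vma->anon_vma,
      			vma->vm_file, pgoff, new_pol);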
      
      [testcase]
        test result - before the patch
      
      	case4: 126: test failed. expect '2,4', actual '2,2,2'
      	case5: passed
      	case6: passed
      	case7: passed
      	case8: passed
      	case_n: 246: test failed. expect '4,2', actual '1,4'
      
      	------------[ cut here ]------------
      	kernel BUG at mm/filemap.c:135!
      	invalid opcode: 0000 [#4] SMP DEBUG_PAGEALLOC
      
      	(snip long bug on messages)
      
        test result - after the patch
      
      	case4: passed
      	case5: passed
      	case6: passed
      	case7: passed
      	case8: passed
      	case_n: passed
      
        source:  mbind_vma_test.c
      ============================================================
       #include <numaif.h>
       #include <numa.h>
       #include <sys/mman.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <stdlib.h>
       #include <string.h>
      
      static unsigned long pagesize;
      void* mmap_addr;
      struct bitmask *nmask;
      char buf[1024];
      FILE *file;
      char retbuf[10240] = "";
      int mapped_fd;
      
      char *rubysrc = "ruby -e '\
        pid = %d; \
        vstart = 0x%llx; \
        vend = 0x%llx; \
        s = `pmap -q #{pid}`; \
        rary = []; \
        s.each_line {|line|; \
          ary=line.split(\" \"); \
          addr = ary[0].to_i(16); \
          if(vstart <= addr && addr < vend) then \
            rary.push(ary[1].to_i()/4); \
          end; \
        }; \
        print rary.join(\",\"); \
      '";
      
      void init(void)
      {
      	void* addr;
      	char buf[128];
      
      	nmask = numa_allocate_nodemask();
      	numa_bitmask_setbit(nmask, 0);
      
      	pagesize = getpagesize();
      
      	sprintf(buf, "%s", "mbind_vma_XXXXXX");
      	mapped_fd = mkstemp(buf);
      	if (mapped_fd == -1)
      		perror("mkstemp "), exit(1);
      	unlink(buf);
      
      	if (lseek(mapped_fd, pagesize*8, SEEK_SET) < 0)
      		perror("lseek "), exit(1);
      	if (write(mapped_fd, "\0", 1) < 0)
      		perror("write "), exit(1);
      
      	addr = mmap(NULL, pagesize*8, PROT_NONE,
      		    MAP_SHARED, mapped_fd, 0);
      	if (addr == MAP_FAILED)
      		perror("mmap "), exit(1);
      
      	if (mprotect(addr+pagesize, pagesize*6, PROT_READ|PROT_WRITE) < 0)
      		perror("mprotect "), exit(1);
      
      	mmap_addr = addr + pagesize;
      
      	/* make page populate */
      	memset(mmap_addr, 0, pagesize*6);
      }
      
      void fin(void)
      {
      	void* addr = mmap_addr - pagesize;
      	munmap(addr, pagesize*8);
      
      	memset(buf, 0, sizeof(buf));
      	memset(retbuf, 0, sizeof(retbuf));
      }
      
      void mem_bind(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_BIND, nmask->maskp, nmask->size, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void mem_interleave(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_INTERLEAVE, nmask->maskp, nmask->size, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void mem_unbind(int index, int len)
      {
      	int err;
      
      	err = mbind(mmap_addr+pagesize*index, pagesize*len,
      		    MPOL_DEFAULT, NULL, 0, 0);
      	if (err)
      		perror("mbind "), exit(err);
      }
      
      void Assert(char *expected, char *value, char *name, int line)
      {
      	if (strcmp(expected, value) == 0) {
      		fprintf(stderr, "%s: passed\n", name);
      		return;
      	}
      	else {
      		fprintf(stderr, "%s: %d: test failed. expect '%s', actual '%s'\n",
      			name, line,
      			expected, value);
      //		exit(1);
      	}
      }
      
      /*
            AAAA
          PPPPPPNNNNNN
          might become
          PPNNNNNNNNNN
          case 4 below
      */
      void case4(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 4);
      	mem_unbind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("2,4", retbuf, "case4", __LINE__);
      
      	fin();
      }
      
      /*
             AAAA
       PPPPPPNNNNNN
       might become
       PPPPPPPPPPNN
       case 5 below
      */
      void case5(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case5", __LINE__);
      
      	fin();
      }
      
      /*
      	    AAAA
      	PPPPNNNNXXXX
      	might become
      	PPPPPPPPPPPP 6
      */
      void case6(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_bind(4, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("6", retbuf, "case6", __LINE__);
      
      	fin();
      }
      
      /*
          AAAA
      PPPPNNNNXXXX
      might become
      PPPPPPPPXXXX 7
      */
      void case7(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_interleave(4, 2);
      	mem_bind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case7", __LINE__);
      
      	fin();
      }
      
      /*
          AAAA
      PPPPNNNNXXXX
      might become
      PPPPNNNNNNNN 8
      */
      void case8(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	mem_bind(0, 2);
      	mem_interleave(4, 2);
      	mem_interleave(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("2,4", retbuf, "case8", __LINE__);
      
      	fin();
      }
      
      void case_n(void)
      {
      	init();
      	sprintf(buf, rubysrc, getpid(), mmap_addr, mmap_addr+pagesize*6);
      
      	/* make redundant mappings [0][1234][34][7] */
      	mmap(mmap_addr + pagesize*4, pagesize*2, PROT_READ|PROT_WRITE,
      	     MAP_FIXED|MAP_SHARED, mapped_fd, pagesize*3);
      
      	/* Expect to do nothing. */
      	mem_unbind(2, 2);
      
      	file = popen(buf, "r");
      	fread(retbuf, sizeof(retbuf), 1, file);
      	Assert("4,2", retbuf, "case_n", __LINE__);
      
      	fin();
      }
      
      int main(int argc, char** argv)
      {
      	case4();
      	case5();
      	case6();
      	case7();
      	case8();
      	case_n();
      
      	return 0;
      }
      =============================================================
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Caspar Zhang <caspar@casparzhang.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: <stable@vger.kernel.org>		[3.1.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26a5114
  6. 01 November 2011, 1 commit
  7. 31 October 2011, 1 commit
  8. 15 September 2011, 2 commits
    • mm/mempolicy.c: make copy_from_user() provably correct · 2bbff6c7
      Authored by KAMEZAWA Hiroyuki
      When compiling mm/mempolicy.c with user-copy checks enabled, the
      following warning is shown:
      
        In file included from arch/x86/include/asm/uaccess.h:572,
                         from include/linux/uaccess.h:5,
                         from include/linux/highmem.h:7,
                         from include/linux/pagemap.h:10,
                         from include/linux/mempolicy.h:70,
                         from mm/mempolicy.c:68:
        In function `copy_from_user',
            inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
        arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
          LD      mm/built-in.o
      
      Fix this by passing the correct buffer size value.
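
      A sketch of the fix in compat_sys_get_mempolicy() (surrounding code
      assumed): clamp the copy length to the destination bitmap so the
      checker can prove the bound.

      	unsigned long copy_size;

      	copy_size = min_t(unsigned long, sizeof(bm), alloc_size);
      	err = copy_from_user(bm, nm, copy_size);	/* provably <= sizeof(bm) */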
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bbff6c7
    • mm/mempolicy.c: fix pgoff in mbind vma merge · 8aacc9f5
      Authored by Caspar Zhang
      commit 9d8cebd4 ("mm: fix mbind vma merge problem") didn't really fix
      the mbind vma merge problem, due to a wrong pgoff value being passed to
      vma_merge(), which made vma_merge() always return NULL.
      
      Before the patch was applied, we got a result like:
      
        addr = 0x7fa58f00c000
        [snip]
        7fa58f00c000-7fa58f00d000 rw-p 00000000 00:00 0
        7fa58f00d000-7fa58f00e000 rw-p 00000000 00:00 0
        7fa58f00e000-7fa58f00f000 rw-p 00000000 00:00 0
      
      here, for 7fa58f00c000->7fa58f00f000, we get 3 VMAs which were expected
      to be merged, as described in commit 9d8cebd4.
      
      Re-testing the patched kernel with the reproducer provided in commit
      9d8cebd4, we get the correct result:
      
        addr = 0x7ffa5aaa2000
        [snip]
        7ffa5aaa2000-7ffa5aaa6000 rw-p 00000000 00:00 0
        7fffd556f000-7fffd5584000 rw-p 00000000 00:00 0                          [stack]
      Signed-off-by: Caspar Zhang <caspar@casparzhang.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8aacc9f5
  9. 27 July 2011, 1 commit
    • cpusets: randomize node rotor used in cpuset_mem_spread_node() · 778d3b0f
      Authored by Michal Hocko
      [ This patch was already accepted as commit 0ac0c0d0 but later
        reverted (commit 35926ff5) because it introduced an arch-specific
        __node_random which was defined only for x86 code, so it broke other
        archs.  This is a followup without any arch-specific code.  Other
        than that there are no functional changes. ]
      
      Some workloads that create a large number of small files tend to assign
      too many pages to node 0 (multi-node systems).  Part of the reason is
      that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
      at node 0 for newly created tasks.
      
      This patch changes the rotor to be initialized to a random node number
      of the cpuset.
      
      [akpm@linux-foundation.org: fix layout]
      [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
      [mhocko@suse.cz: Make it arch independent]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      778d3b0f
  10. 25 May 2011, 6 commits
  11. 23 March 2011, 1 commit
  12. 05 March 2011, 2 commits
  13. 01 March 2011, 1 commit
  14. 26 February 2011, 1 commit
  15. 14 January 2011, 5 commits
  16. 03 December 2010, 1 commit
  17. 29 October 2010, 1 commit
    • numa: fix slab_node(MPOL_BIND) · 800416f7
      Authored by Eric Dumazet
      When a node contains only HighMem memory, slab_node(MPOL_BIND)
      dereferences a NULL pointer.
      
      [ This code seems to go back all the way to commit 19770b32: "mm:
        filter based on a nodemask as well as a gfp_mask".  Which was back in
        April 2008, and it got merged into 2.6.26.  - Linus ]
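
      A sketch of the guard in slab_node()'s MPOL_BIND case (fallback choice
      assumed): tolerate first_zones_zonelist() finding no usable zone.

      	(void)first_zones_zonelist(zonelist, highest_zoneidx,
      				   &policy->v.nodes, &zone);
      	return zone ? zone->node : numa_node_id();	/* zone may be NULL */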
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      800416f7
  18. 27 October 2010, 2 commits
  19. 10 August 2010, 2 commits
  20. 30 June 2010, 1 commit
  21. 26 May 2010, 1 commit
  22. 25 May 2010, 6 commits
    • mm: consider the entire user address space during node migration · 6ec3a127
      Authored by Greg Thelen
      Use mm->task_size instead of TASK_SIZE to ensure that the entire user
      address space is migrated.  mm->task_size is independent of the calling
      task context, while TASK_SIZE may depend on the address space size of
      the calling process.  Usage of TASK_SIZE can lead to partial address
      space migration if the calling process was 32 bit and the migrating
      process was 64 bit.
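
      The change, sketched against the migrate_to_node() call site (exact
      shape assumed):

      	/*
      	 * mm->task_size bounds the *target* address space; TASK_SIZE
      	 * would reflect the caller's (possibly 32-bit) view.
      	 */
      	check_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
      		    flags | MPOL_MF_DISCONTIG_OK, &pagelist);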
      
      Here is the test script used on a 64-bit system with a 32-bit echo
      process:
      
        mount -t cgroup none /cgroup -o cpuset
        cd /cgroup
      
        mkdir 0
        echo 1 > 0/cpuset.cpus
        echo 0 > 0/cpuset.mems
        echo 1 > 0/cpuset.memory_migrate
      
        mkdir 1
        echo 1 > 1/cpuset.cpus
        echo 1 > 1/cpuset.mems
        echo 1 > 1/cpuset.memory_migrate
      
        echo $$ > 0/tasks
        64_bit_process &
        pid=$!
      
        echo $pid > 1/tasks   # This does not migrate all process pages without
                              # this patch.  If 64 bit echo is used or this patch is
                              # applied, then the full address space of $pid is
                              # migrated.
      
      To check memory migration, I watched:
        grep MemUsed /sys/devices/system/node/node*/meminfo
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6ec3a127
    • cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Authored by Miao Xie
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing
      all old disallowed bits later.  But along the way, the allocator may
      find that there is no node to allocate memory from.

      The reason is that cpuset rebinds the task's mempolicy and clears the
      nodes which the allocator can allocate pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes the problem by expanding the node range first (set
      newly allowed bits) and shrinking it lazily (clear newly disallowed
      bits).  We use a variable to tell the write-side task that a read-side
      task is reading the nodemask, and the write-side task clears newly
      disallowed nodes after the read-side task ends its current memory
      allocation.
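
      A usage sketch of the read-side bracket described above (helper names
      as this series introduces them; the exact allocator call is assumed):

      	get_mems_allowed();	/* writer defers clearing disallowed nodes */
      	page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);
      	put_mems_allowed();	/* reader done; the shrink may now proceed */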
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
    • mempolicy: restructure rebinding-mempolicy functions · 708c1bbc
      Authored by Miao Xie
      Nick Piggin reported that the allocator may see an empty nodemask when
      changing a cpuset's mems[1].  It happens only on kernels that do not do
      atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

      But I found that there is also a problem on kernels that can do atomic
      nodemask_t stores.  The problem is that the allocator can't find a node
      to allocate a page from when changing a cpuset's mems, even though
      there is plenty of free memory.  The reason is like this:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      I can reproduce it with the attached program by the following steps:
      
      # mkdir /dev/cpuset
      # mount -t cpuset cpuset /dev/cpuset
      # mkdir /dev/cpuset/1
      # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
      # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
      # echo $$ > /dev/cpuset/1/tasks
      # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
         <nr_tasks> = max(nr_cpus - 1, 1)
      # killall -s SIGUSR1 cpuset_mem_hog
      # ./change_mems.sh
      
      Several hours later, an oom will happen even though there is a lot of
      free memory.
      
      This patchset fixes the problem by expanding the node range first (set
      newly allowed bits) and shrinking it lazily (clear newly disallowed
      bits).  We use a variable to tell the write-side task that a read-side
      task is reading the nodemask, and the write-side task clears newly
      disallowed nodes after the read-side task ends its current memory
      allocation.
      
      This patch:
      
      In order to fix the "no node to allocate memory from" problem, when we
      want to update mempolicy and mems_allowed we expand the set of nodes
      first (set all the newly allowed bits) and shrink the set of nodes
      lazily (clear the disallowed bits).  But the mempolicy's rebind
      functions may break the expanding.

      So we restructure the mempolicy's rebind functions and split the rebind
      work into two steps, just like the update of cpuset's mems: the 1st
      step expands the set of the mempolicy's nodes, the 2nd step shrinks it.
      The two-step form is used when there is no real lock to protect the
      mempolicy on the read side.  Otherwise we can do the rebind work at
      once.
      
      In order to implement it, we define
      
      	enum mpol_rebind_step {
      		MPOL_REBIND_ONCE,
      		MPOL_REBIND_STEP1,
      		MPOL_REBIND_STEP2,
      		MPOL_REBIND_NSTEP,
      	};
      
      If the mempolicy needn't be updated in two steps, we can pass
      MPOL_REBIND_ONCE to the rebind functions.  Otherwise we can pass
      MPOL_REBIND_STEP1 to do the first step of the rebind work and
      MPOL_REBIND_STEP2 to do the second.
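
      A usage sketch (call shape assumed; mpol_rebind_task() gains the step
      argument in this series):

      	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);	/* expand */
      	/* readers may allocate against the expanded nodemask here */
      	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);	/* shrink */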
      
      Besides that, a long time may pass between these two steps, and we have
      to release the lock that protects the mempolicy and mems_allowed.  When
      we take the lock again, we must check whether the current mempolicy is
      in the middle of a rebind (the first step has been done but not the
      second), because the task may have allocated a new mempolicy while we
      didn't hold the lock.  So we define the following flag to identify it:
      
      #define MPOL_F_REBINDING (1 << 2)
      
      The new functions will be used in the next patch.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      708c1bbc
    • mempolicy: factor mpol_shared_policy_init() return paths · 15d77835
      Authored by Lee Schermerhorn
      Factor out duplicate put/frees in mpol_shared_policy_init() to a common
      return path.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15d77835
    • mempolicy: rename policy_types and cleanup initialization · 345ace9c
      Authored by Lee Schermerhorn
      Rename 'policy_types[]' to 'policy_modes[]' to better match the array
      contents.
      
      Use designated initializer syntax for policy_modes[].
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      345ace9c
    • mempolicy: lose unnecessary loop variable in mpol_parse_str() · b4652e84
      Authored by Lee Schermerhorn
      We don't really need the extra variable 'i' in mpol_parse_str().  Its
      only use is as the loop variable, after which it's assigned to 'mode'.
      Just use 'mode' directly, and lose the 'uninitialized_var()' macro.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4652e84