1. 22 3月, 2012 1 次提交
    • A
      mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Andrea Arcangeli 提交于
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem hold in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
      false positive from pmd_bad() that will not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables to go away from under them, during code review it
      seems vm86 mode on 32bit kernels requires that too unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd become a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: NLarry Woodman <lwoodman@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  2. 20 3月, 2012 1 次提交
  3. 16 3月, 2012 1 次提交
    • H
      memcg: free mem_cgroup by RCU to fix oops · 59927fb9
      Hugh Dickins 提交于
      After fixing the GPF in mem_cgroup_lru_del_list(), three times one
      machine running a similar load (moving and removing memcgs while
      swapping) has oopsed in mem_cgroup_zone_nr_lru_pages(), when retrieving
      memcg zone numbers for get_scan_count() for shrink_mem_cgroup_zone():
      this is where a struct mem_cgroup is first accessed after being chosen
      by mem_cgroup_iter().
      
      Just what protects a struct mem_cgroup from being freed, in between
      mem_cgroup_iter()'s css_get_next() and its css_tryget()? css_tryget()
      fails once css->refcnt is zero with CSS_REMOVED set in flags, yes: but
      what if that memory is freed and reused for something else, which sets
      "refcnt" non-zero? Hmm, and scope for an indefinite freeze if refcnt is
      left at zero but flags are cleared.
      
      It's tempting to move the css_tryget() into css_get_next(), to make it
      really "get" the css, but I don't think that actually solves anything:
      the same difficulty in moving from css_id found to stable css remains.
      
      But we already have rcu_read_lock() around the two, so it's easily fixed
      if __mem_cgroup_free() just uses kfree_rcu() to free mem_cgroup.
      
      However, a big struct mem_cgroup is allocated with vzalloc() instead of
      kzalloc(), and we're not allowed to vfree() at interrupt time: there
      doesn't appear to be a general vfree_rcu() to help with this, so roll
      our own using schedule_work().  The compiler decently removes
      vfree_work() and vfree_rcu() when the config doesn't need them.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59927fb9
  4. 10 3月, 2012 1 次提交
  5. 07 3月, 2012 4 次提交
    • L
      vm: avoid using find_vma_prev() unnecessarily · 097d5910
      Linus Torvalds 提交于
      Several users of "find_vma_prev()" were not in fact interested in the
      previous vma if there was no primary vma to be found either.  And in
      those cases, we're much better off just using the regular "find_vma()",
      and then "prev" can be looked up by just checking vma->vm_prev.
      
      The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
      commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return
      prev by reference" means that it generates worse code too.
      
      Thus this "let's avoid using this inconvenient and clearly too subtle
      interface when we don't really have to" patch.
      
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      097d5910
    • M
      mm: fix find_vma_prev · 83cd904d
      Mikulas Patocka 提交于
      Commit 6bd4837d ("mm: simplify find_vma_prev()") broke memory
      management on PA-RISC.
      
      After application of the patch, programs that allocate big arrays on the
      stack crash with segfault, for example, this will crash if compiled
      without optimization:
      
        int main()
        {
      	char array[200000];
      	array[199999] = 0;
      	return 0;
        }
      
      The reason is that PA-RISC has up-growing stack and the stack is usually
      the last memory area.  In the above example, a page fault happens above
      the stack.
      
      Previously, if we passed too high address to find_vma_prev, it returned
      NULL and stored the last VMA in *pprev.  After "simplify find_vma_prev"
      change, it stores NULL in *pprev.  Consequently, the stack area is not
      found and it is not expanded, as it used to be before the change.
      
      This patch restores the old behavior and makes it return the last VMA in
      *pprev if the requested address is higher than address of any other VMA.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83cd904d
    • H
      mmap: EINVAL not ENOMEM when rejecting VM_GROWS · ce8fea7a
      Hugh Dickins 提交于
      Currently error is -ENOMEM when rejecting VM_GROWSDOWN|VM_GROWSUP
      from shared anonymous: hoist the file case's -EINVAL up for both.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce8fea7a
    • H
      page_cgroup: fix horrid swap accounting regression · c09ff089
      Hugh Dickins 提交于
      Why is memcg's swap accounting so broken? Insane counts, wrong
      ownership, unfreeable structures, which later get freed and then
      accessed after free.
      
      Turns out to be a tiny a little 3.3-rc1 regression in 9fb4b7cc
      "page_cgroup: add helper function to get swap_cgroup": the helper
      function (actually named lookup_swap_cgroup()) returns an address using
      void* arithmetic, but the structure in question is a short.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NBob Liu <lliubbo@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c09ff089
  6. 06 3月, 2012 6 次提交
    • N
      memcg: fix mapcount check in move charge code for anonymous page · e6ca7b89
      Naoya Horiguchi 提交于
      Currently the charge on shared anonyous pages is supposed not to moved in
      task migration.  To implement this, we need to check that mapcount > 1,
      instread of > 2.  So this patch fixes it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6ca7b89
    • A
      mm: thp: fix BUG on mm->nr_ptes · 1c641e84
      Andrea Arcangeli 提交于
      Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
      in exit_mmap() recently.
      
      Quoting Hugh's discovery and explanation of the SMP race condition:
      
        "mm->nr_ptes had unusual locking: down_read mmap_sem plus
         page_table_lock when incrementing, down_write mmap_sem (or mm_users
         0) when decrementing; whereas THP is careful to increment and
         decrement it under page_table_lock.
      
         Now most of those paths in THP also hold mmap_sem for read or write
         (with appropriate checks on mm_users), but two do not: when
         split_huge_page() is called by hwpoison_user_mappings(), and when
         called by add_to_swap().
      
         It's conceivable that the latter case is responsible for the
         exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."
      
      The simplest way to fix it without having to alter the locking is to make
      split_huge_page() a noop in nr_ptes terms, so by counting the preallocated
      pagetables that exists for every mapped hugepage.  It was an arbitrary
      choice not to count them and either way is not wrong or right, because
      they are not used but they're still allocated.
      Reported-by: NDave Jones <davej@redhat.com>
      Reported-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.0.x, 3.1.x, 3.2.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c641e84
    • H
      memcg: fix GPF when cgroup removal races with last exit · 7512102c
      Hugh Dickins 提交于
      When moving tasks from old memcg (with move_charge_at_immigrate on new
      memcg), followed by removal of old memcg, hit General Protection Fault in
      mem_cgroup_lru_del_list() (called from release_pages called from
      free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
      exit_mmap from mmput from exit_mm from do_exit).
      
      Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
      been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
      still trying to update its stats, and take page off lru before freeing.
      
      A task, or a charge, or a page on lru: each secures a memcg against
      removal.  In this case, the last task has been moved out of the old memcg,
      and it is exiting: anonymous pages are uncharged one by one from the
      memcg, as they are zapped from its pagetables, so the charge gets down to
      0; but the pages themselves are queued in an mmu_gather for freeing.
      
      Most of those pages will be on lru (and force_empty is careful to
      lru_add_drain_all, to add pages from pagevec to lru first), but not
      necessarily all: perhaps some have been isolated for page reclaim, perhaps
      some isolated for other reasons.  So, force_empty may find no task, no
      charge and no page on lru, and let the removal proceed.
      
      There would still be no problem if these pages were immediately freed; but
      typically (and the put_page_testzero protocol demands it) they have to be
      added back to lru before they are found freeable, then removed from lru
      and freed.  We don't see the issue when adding, because the
      mem_cgroup_iter() loops keep their own reference to the memcg being
      scanned; but when it comes to mem_cgroup_lru_del_list().
      
      I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
      PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
      of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
      38c5d72f ("memcg: simplify LRU handling by new rule") mercifully
      removed those convolutions, but left this General Protection Fault.
      
      But it's surprisingly easy to restore the old behaviour: just check
      PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
      to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
      change?  just going back to how it worked before; testing, and an audit of
      uses of pc->mem_cgroup, show no problem.
      
      And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
      sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
      longer has any purpose, and we can safely revert 4e5f01c2 ("memcg:
      clear pc->mem_cgroup if necessary").
      
      Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
      is not strictly necessary: the lru_lock there, with RCU before memcg
      structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
      without that; but it seems cleaner to rely on one dependency less.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7512102c
    • H
      memcg: fix deadlock by inverting lrucare nesting · 9ce70c02
      Hugh Dickins 提交于
      We have forgotten the rules of lock nesting: the irq-safe ones must be
      taken inside the non-irq-safe ones, otherwise we are open to deadlock:
      
      CPU0                          CPU1
      ----                          ----
      lock(&(&pc->lock)->rlock);
                                    local_irq_disable();
                                    lock(&(&zone->lru_lock)->rlock);
                                    lock(&(&pc->lock)->rlock);
      <Interrupt>
      lock(&(&zone->lru_lock)->rlock);
      
      To check a different locking issue, I happened to add a spin_lock to
      memcg's bit_spin_lock in lock_page_cgroup(), and lockdep very quickly
      complained about __mem_cgroup_commit_charge_lrucare() (on CPU1 above).
      
      So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare to
      __mem_cgroup_commit_charge() instead, taking zone->lru_lock under
      lock_page_cgroup() in the lrucare case.
      
      The original was using spin_lock_irqsave, but we'd be in more trouble if
      it were ever called at interrupt time: unconditional _irq is enough.  And
      ClearPageLRU before del from lru, SetPageLRU before add to lru: no strong
      reason, but that is the ordering used consistently elsewhere.
      
      Fixes 36b62ad5 ("memcg: simplify corner case handling
      of LRU").
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ce70c02
    • A
      flush_tlb_range() needs ->page_table_lock when ->mmap_sem is not held · cd2934a3
      Al Viro 提交于
      All other callers already hold either ->mmap_sem (exclusive) or
      ->page_table_lock.  And we need it because some page table flushing
      instanced do work explicitly with ge tables.
      
      See e.g.  arch/powerpc/mm/tlb_hash32.c, flush_tlb_range() and
      flush_range() in there.  The same goes for uml, with a lot more
      extensive playing with page tables.
      
      Almost all callers are actually fine - flush_tlb_range() may have no
      need to bother playing with page tables, but it can do so safely; again,
      this caller is the sole exception - everything else either has exclusive
      ->mmap_sem on the mm in question, or mm->page_table_lock is held.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd2934a3
    • A
      835ee797
  7. 01 3月, 2012 1 次提交
    • T
      memblock: Fix size aligning of memblock_alloc_base_nid() · 847854f5
      Tejun Heo 提交于
      memblock allocator aligns @size to @align to reduce the amount
      of fragmentation.  Commit:
      
       7bd0b0f0 ("memblock: Reimplement memblock allocation using reverse free area iterator")
      
      Broke it by incorrectly relocating @size aligning to
      memblock_find_in_range_node().  As the aligned size is not
      propagated back to memblock_alloc_base_nid(), the actually
      reserved size isn't aligned.
      
      While this increases memory use for memblock reserved array,
      this shouldn't cause any critical failure; however, it seems
      that the size aligning was hiding a use-beyond-allocation bug in
      sparc64 and losing the aligning causes boot failure.
      
      The underlying problem is currently being debugged but this is a
      proper fix in itself, it's already pretty late in -rc cycle for
      boot failures and reverting the change for debugging isn't
      difficult. Restore the size aligning moving it to
      memblock_alloc_base_nid().
      Reported-by: NMeelis Roos <mroos@linux.ee>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Grant Likely <grant.likely@secretlab.ca>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120228205621.GC3252@dhcp-172-17-108-109.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
      LKML-Reference: <alpine.SOC.1.00.1202130942030.1488@math.ut.ee>
      847854f5
  8. 25 2月, 2012 3 次提交
  9. 23 2月, 2012 1 次提交
  10. 14 2月, 2012 1 次提交
  11. 09 2月, 2012 2 次提交
    • H
      mm: fix UP THP spin_is_locked BUGs · b9980cdc
      Hugh Dickins 提交于
      Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
      CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
      and so triggers some BUGs in Transparent HugePage codepaths.
      
      asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
      but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
      VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9980cdc
    • M
      mm: compaction: check for overlapping nodes during isolation for migration · dc908600
      Mel Gorman 提交于
      When isolating pages for migration, migration starts at the start of a
      zone while the free scanner starts at the end of the zone.  Migration
      avoids entering a new zone by never going beyond the free scanned.
      
      Unfortunately, in very rare cases nodes can overlap.  When this happens,
      migration isolates pages without the LRU lock held, corrupting lists
      which will trigger errors in reclaim or during page free such as in the
      following oops
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810f795c>] free_pcppages_bulk+0xcc/0x450
        PGD 1dda554067 PUD 1e1cb58067 PMD 0
        Oops: 0000 [#1] SMP
        CPU 37
        Pid: 17088, comm: memcg_process_s Tainted: G            X
        RIP: free_pcppages_bulk+0xcc/0x450
        Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
        Call Trace:
          free_hot_cold_page+0x17e/0x1f0
          __pagevec_free+0x90/0xb0
          release_pages+0x22a/0x260
          pagevec_lru_move_fn+0xf3/0x110
          putback_lru_page+0x66/0xe0
          unmap_and_move+0x156/0x180
          migrate_pages+0x9e/0x1b0
          compact_zone+0x1f3/0x2f0
          compact_zone_order+0xa2/0xe0
          try_to_compact_pages+0xdf/0x110
          __alloc_pages_direct_compact+0xee/0x1c0
          __alloc_pages_slowpath+0x370/0x830
          __alloc_pages_nodemask+0x1b1/0x1c0
          alloc_pages_vma+0x9b/0x160
          do_huge_pmd_anonymous_page+0x160/0x270
          do_page_fault+0x207/0x4c0
          page_fault+0x25/0x30
      
      The "X" in the taint flag means that external modules were loaded but but
      is unrelated to the bug triggering.  The real problem was because the PFN
      layout looks like this
      
        Zone PFN ranges:
          DMA      0x00000010 -> 0x00001000
          DMA32    0x00001000 -> 0x00100000
          Normal   0x00100000 -> 0x01e80000
        Movable zone start PFN for each node
        early_node_map[14] active PFN ranges
            0: 0x00000010 -> 0x0000009b
            0: 0x00000100 -> 0x0007a1ec
            0: 0x0007a354 -> 0x0007a379
            0: 0x0007f7ff -> 0x0007f800
            0: 0x00100000 -> 0x00680000
            1: 0x00680000 -> 0x00e80000
            0: 0x00e80000 -> 0x01080000
            1: 0x01080000 -> 0x01280000
            0: 0x01280000 -> 0x01480000
            1: 0x01480000 -> 0x01680000
            0: 0x01680000 -> 0x01880000
            1: 0x01880000 -> 0x01a80000
            0: 0x01a80000 -> 0x01c80000
            1: 0x01c80000 -> 0x01e80000
      
      The fix is straight-forward.  isolate_migratepages() has to make a
      similar check to isolate_freepage to ensure that it never isolates pages
      from a zone it does not hold the LRU lock for.
      
      This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
      and current mainline.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc908600
  12. 04 2月, 2012 5 次提交
    • M
      mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block... · 0bf380bc
      Mel Gorman 提交于
      mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
      
      When isolating for migration, migration starts at the start of a zone
      which is not necessarily pageblock aligned.  Further, it stops isolating
      when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
      not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
      an invalid PFN which can result in a crash.  This was originally reported
      against a 3.0-based kernel with the following trace in a crash dump.
      
      PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
       #0 [d72d3ad0] crash_kexec at c028cfdb
       #1 [d72d3b24] oops_end at c05c5322
       #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
       #3 [d72d3bec] bad_area at c0227fb6
       #4 [d72d3c00] do_page_fault at c05c72ec
       #5 [d72d3c80] error_code (via page_fault) at c05c47a4
          EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
          DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
          CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
       #6 [d72d3cb4] isolate_migratepages at c030b15a
       #7 [d72d3d14] zone_watermark_ok at c02d26cb
       #8 [d72d3d2c] compact_zone at c030b8de
       #9 [d72d3d68] compact_zone_order at c030bba1
      #10 [d72d3db4] try_to_compact_pages at c030bc84
      #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
      #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
      #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
      #14 [d72d3eb8] alloc_pages_vma at c030a845
      #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
      #16 [d72d3f00] handle_mm_fault at c02f36c6
      #17 [d72d3f30] do_page_fault at c05c70ed
      #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
          EAX: b71ff000  EBX: 00000001  ECX: 00001600  EDX: 00000431
          DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
          SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
          CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202
      
      It was also reported by Herbert van den Bergh against 3.1-based kernel
      with the following snippet from the console log.
      
      BUG: unable to handle kernel paging request at 01c00008
      IP: [<c0522399>] isolate_migratepages+0x119/0x390
      *pdpt = 000000002f7ce001 *pde = 0000000000000000
      
      It is expected that it also affects 3.2.x and current mainline.
      
      The problem is that pfn_valid is only called on the first PFN being
      checked and that PFN is not necessarily aligned.  Lets say we have a case
      like this
      
      H = MAX_ORDER_NR_PAGES boundary
      | = pageblock boundary
      m = cc->migrate_pfn
      f = cc->free_pfn
      o = memory hole
      
      H------|------H------|----m-Hoooooo|ooooooH-f----|------H
      
      The migrate_pfn is just below a memory hole and the free scanner is beyond
      the hole.  When isolate_migratepages started, it scans from migrate_pfn to
      migrate_pfn+pageblock_nr_pages which is now in a memory hole.  It checks
      pfn_valid() on the first PFN but then scans into the hole where there are
      not necessarily valid struct pages.
      
      This patch ensures that isolate_migratepages calls pfn_valid when
      necessary.
      Reported-by: NHerbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Tested-by: NHerbert van den Bergh <herbert.van.den.bergh@oracle.com>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bf380bc
    • S
      readahead: fix pipeline break caused by block plug · 3deaa719
      Shaohua Li 提交于
      Herbert Poetzl reported a performance regression since 2.6.39.  The test
      is a simple dd read, but with big block size.  The reason is:
      
      T1: ra (A, A+128k), (A+128k, A+256k)
      T2: lock_page for page A, submit the 256k
      T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted
      because of plug and there isn't any lock_page till we hit page A+256k
      because all pages from A to A+256k is in memory
      T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't
      submitted again.
      T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
      waitting for (A+256k, A+512k) finish.
      
      There is no request to disk in T3 and T4, so readahead pipeline breaks.
      
      We really don't need block plug for generic_file_aio_read() for buffered
      I/O.  The readahead already has plug and has fine grained control when I/O
      should be submitted.  Deleting plug for buffered I/O fixes the regression.
      
      One side effect is plug makes the request size 256k, the size is 128k
      without it.  This is because default ra size is 128k and not a reason we
      need plug here.
      
      Vivek said:
      
      : We submit some readahead IO to device request queue but because of nested
      : plug, queue never gets unplugged.  When read logic reaches a page which is
      : not in page cache, it waits for page to be read from the disk
      : (lock_page_killable()) and that time we flush the plug list.
      :
      : So effectively read ahead logic is kind of broken in parts because of
      : nested plugging.  Removing top level plug (generic_file_aio_read()) for
      : buffered reads, will allow unplugging queue earlier for readahead.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Reported-by: NHerbert Poetzl <herbert@13thfloor.at>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3deaa719
    • C
      mm/filemap_xip.c: fix race condition in xip_file_fault() · 99f02ef1
      Carsten Otte 提交于
      Fix a race condition that shows in conjunction with xip_file_fault() when
      two threads of the same user process fault on the same memory page.
      
      In this case, the race winner will install the page table entry and the
      unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
      vm_insert_mixed) which drops out at this check:
      
      	retval = -EBUSY;
      	if (!pte_none(*pte))
      		goto out_unlock;
      
      The resulting -EBUSY return value will trigger a BUG_ON() in
      xip_file_fault.
      
      This fix simply considers the fault as fixed in this case, because the
      race winner has successfully installed the pte.
      
      [akpm@linux-foundation.org: use conventional (and consistent) comment layout]
      Reported-by: NDavid Sadler <dsadler@us.ibm.com>
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Reported-by: NLouis Alex Eisner <leisner@cs.ucsd.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99f02ef1
    • A
      mm/memcontrol.c: fix warning with CONFIG_NUMA=n · 82b3f2a7
      Andrew Morton 提交于
      mm/memcontrol.c: In function 'memcg_check_events':
      mm/memcontrol.c:779: warning: unused variable 'do_numainfo'
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82b3f2a7
    • K
      mm: postpone migrated page mapping reset · 35512eca
      Konstantin Khlebnikov 提交于
      Postpone resetting page->mapping until the final remove_migration_ptes().
      Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
      work.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35512eca
  13. 03 2月, 2012 2 次提交
  14. 01 2月, 2012 1 次提交
  15. 24 1月, 2012 7 次提交
    • K
      mm: fix rss count leakage during migration · 9f9f1acd
      Konstantin Khlebnikov 提交于
      Memory migration fills a pte with a migration entry and it doesn't
      update the rss counters.  Then it replaces the migration entry with the
      new page (or the old one if migration failed).  But between these two
      passes this pte can be unmaped, or a task can fork a child and it will
      get a copy of this migration entry.  Nobody accounts for this in the rss
      counters.
      
      This patch properly adjust rss counters for migration entries in
      zap_pte_range() and copy_one_pte().  Thus we avoid extra atomic
      operations on the migration fast-path.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f9f1acd
    • H
      SHM_UNLOCK: fix Unevictable pages stranded after swap · 24513264
      Hugh Dickins 提交于
      Commit cc39c6a9 ("mm: account skipped entries to avoid looping in
      find_get_pages") correctly fixed an infinite loop; but left a problem
      that find_get_pages() on shmem would return 0 (appearing to callers to
      mean end of tree) when it meets a run of nr_pages swap entries.
      
      The only uses of find_get_pages() on shmem are via pagevec_lookup(),
      called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
      scan_mapping_unevictable_pages().  The first is already commented, and
      not worth worrying about; but the second can leave pages on the
      Unevictable list after an unusual sequence of swapping and locking.
      
      Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
      swap) instead of pagevec_lookup().
      
      But I don't want to contaminate vmscan.c with shmem internals, nor
      shmem.c with LRU locking.  So move scan_mapping_unevictable_pages() into
      shmem.c, renaming it shmem_unlock_mapping(); and rename
      check_move_unevictable_page() to check_move_unevictable_pages(), looping
      down an array of pages, oftentimes under the same lock.
      
      Leave out the "rotate unevictable list" block: that's a leftover from
      when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
      handling involved looking at pages at tail of LRU.
      
      Was there significance to the sequence first ClearPageUnevictable, then
      test page_evictable, then SetPageUnevictable here? I think not, we're
      under LRU lock, and have no barriers between those.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: <stable@vger.kernel.org> [back to 3.1 but will need respins]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      24513264
    • H
      SHM_UNLOCK: fix long unpreemptible section · 85046579
      Hugh Dickins 提交于
      scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
      evictable again once the shared memory is unlocked.  It does this with
      pagevec_lookup()s across the whole object (which might occupy most of
      memory), and takes 300ms to unlock 7GB here.  A cond_resched() every
      PAGEVEC_SIZE pages would be good.
      
      However, KOSAKI-san points out that this is called under shmem.c's
      info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
      There is no strong reason for that: we need to take these pages off the
      unevictable list soonish, but those locks are not required for it.
      
      So move the call to scan_mapping_unevictable_pages() from shmem.c's
      unlock handling up to shm.c's unlock handling.  Remove the recently
      added barrier, not needed now we have spin_unlock() before the scan.
      
      Use get_file(), with subsequent fput(), to make sure we have a reference
      to mapping throughout scan_mapping_unevictable_pages(): that's something
      that was previously guaranteed by the shm_lock().
      
      Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
      time, and we lazily discover them to be Unevictable later, so it serves
      no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
      pages still on pagevec are not marked Unevictable.
      
      The original code avoided redundant rescans by checking VM_LOCKED flag
      at its level: now avoid them by checking shp's SHM_LOCKED.
      
      The original code called scan_mapping_unevictable_pages() on a locked
      area at shm_destroy() time: perhaps we once had accounting cross-checks
      which required that, but not now, so skip the overhead and just let
      inode eviction deal with them.
      
      Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
      under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
      more as comment than to save space; comment them used for SHM_UNLOCK.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85046579
    • H
      mm/hugetlb.c: undo change to page mapcount in fault handler · 409eb8c2
      Hillf Danton 提交于
      Page mapcount should be updated only if we are sure that the page ends
      up in the page table otherwise we would leak if we couldn't COW due to
      reservations or if idx is out of bounds.
      Signed-off-by: NHillf Danton <dhillf@gmail.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      409eb8c2
    • J
      mm: memcg: update the correct soft limit tree during migration · 6568d4a9
      Johannes Weiner 提交于
      end_migration() passes the old page instead of the new page to commit
      the charge.  This page descriptor is not used for committing itself,
      though, since we also pass the (correct) page_cgroup descriptor.  But
      it's used to find the soft limit tree through the page's zone, so the
      soft limit tree of the old page's zone is updated instead of that of the
      new page's, which might get slightly out of date until the next charge
      reaches the ratelimit point.
      
      This glitch has been present since 5564e88b ("memcg: condense
      page_cgroup-to-page lookup points").
      
      This fixes a bug that I introduced in 2.6.38.  It's benign enough (to my
      knowledge) that we probably don't want this for stable.
      Reported-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6568d4a9
    • M
      mm: __count_immobile_pages(): make sure the node is online · 656a0706
      Michal Hocko 提交于
      page_zone() requires an online node otherwise we are accessing NULL
      NODE_DATA.  This is not an issue at the moment because node_zones are
      located at the structure beginning but this might change in the future
      so better be careful about that.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      656a0706
    • M
      mm: fix NULL ptr dereference in __count_immobile_pages · 687875fb
      Michal Hocko 提交于
      Fix the following NULL ptr dereference caused by
      
        cat /sys/devices/system/memory/memory0/removable
      
      Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
      RIP: __count_immobile_pages+0x4/0x100
      Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
      Call Trace:
        is_pageblock_removable_nolock+0x34/0x40
        is_mem_section_removable+0x74/0xf0
        show_mem_removable+0x41/0x70
        sysfs_read_file+0xfe/0x1c0
        vfs_read+0xc7/0x130
        sys_read+0x53/0xa0
        system_call_fastpath+0x16/0x1b
      
      We are crashing because we are trying to dereference NULL zone which
      came from pfn=0 (struct page ffffea0000000000). According to the boot
      log this page is marked reserved:
      e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
      
      and early_node_map confirms that:
      early_node_map[3] active PFN ranges
          1: 0x00000010 -> 0x0000009c
          1: 0x00000100 -> 0x000bffa3
          1: 0x00100000 -> 0x00240000
      
      The problem is that memory_present works in PAGE_SECTION_MASK aligned
      blocks so the reserved range sneaks into the the section as well.  This
      also means that free_area_init_node will not take care of those reserved
      pages and they stay uninitialized.
      
      When we try to read the removable status we walk through all available
      sections and hope that the zone is valid for all pages in the section.
      But this is not true in this case as the zone and nid are not initialized.
      
      We have only one node in this particular case and it is marked as node=1
      (rather than 0) and that made the problem visible because page_to_nid will
      return 0 and there are no zones on the node.
      
      Let's check that the zone is valid and that the given pfn falls into its
      boundaries and mark the section not removable.  This might cause some
      false positives, probably, but we do not have any sane way to find out
      whether the page is reserved by the platform or it is just not used for
      whatever other reasons.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      687875fb
  16. 23 1月, 2012 1 次提交
  17. 21 1月, 2012 2 次提交