1. 11 December 2012, 9 commits
    • A
      mm: numa: Support NUMA hinting page faults from gup/gup_fast · 0b9d7052
      By Andrea Arcangeli
      Introduce FOLL_NUMA to tell follow_page to check
      pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
      so because it always invokes handle_mm_fault and retries the
      follow_page later.
      
      KVM secondary MMU page faults will trigger the NUMA hinting page
      faults through gup_fast -> get_user_pages -> follow_page ->
      handle_mm_fault.
      
      Other follow_page callers like KSM should not use FOLL_NUMA, or they
      would fail to get the pages if they use follow_page instead of
      get_user_pages.
      
      [ This patch was picked up from the AutoNUMA tree. ]
      Originally-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ ported to this tree. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      0b9d7052
    • M
      mm: compaction: Add scanned and isolated counters for compaction · 397487db
      By Mel Gorman
      Compaction already has tracepoints to count scanned and isolated pages,
      but they require ftrace to be enabled, and if that information has to be
      written to disk then it can be disruptive. This patch adds vmstat counters
      for compaction called compact_migrate_scanned, compact_free_scanned and
      compact_isolated.
      
      With these counters, it is possible to define a basic cost model for
      compaction. This approximates how much work compaction is doing, and it
      can be compared with an oprofile showing TLB misses to see whether the
      cost of compaction is being offset by THP, for example. Minimally, a
      compaction patch can be evaluated in terms of whether it increases or
      decreases cost. The basic cost model looks like this:
      
      Fundamental unit u:	a word	sizeof(void *)
      
      Ca  = cost of struct page access = sizeof(struct page) / u
      
      Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
      Cmf = Cost migrate failure   = Ca * 2
      Ci  = Cost page isolation    = (Ca + Wi)
      	where Wi is a constant that should reflect the approximate
      	cost of the locking operation.
      
      Csm = Cost migrate scanning = Ca
      Csf = Cost free    scanning = Ca
      
      Overall cost =	(Csm * compact_migrate_scanned) +
      	      	(Csf * compact_free_scanned)    +
      	      	(Ci  * compact_isolated)	+
      		(Cmc * pgmigrate_success)	+
      		(Cmf * pgmigrate_fail)
      
      Where the values are read from /proc/vmstat.
      
      This is very basic and ignores certain costs such as the allocation cost
      to do a migrate page copy but any improvement to the model would still
      use the same vmstat counters.
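      
      As an illustration, a small userspace program can plug the counters
      into the model (a sketch: the 64-byte struct page size, the 4K page
      size and Wi = 1 below are assumptions, not values from this patch):
      
        #include <stdio.h>
        #include <string.h>
      
        int main(void)
        {
                unsigned long long migrate_scanned = 0, free_scanned = 0;
                unsigned long long isolated = 0, migrate_success = 0;
                unsigned long long migrate_fail = 0, val;
                char name[64];
                FILE *f = fopen("/proc/vmstat", "r");
      
                if (!f)
                        return 1;
                while (fscanf(f, "%63s %llu", name, &val) == 2) {
                        if (!strcmp(name, "compact_migrate_scanned"))
                                migrate_scanned = val;
                        else if (!strcmp(name, "compact_free_scanned"))
                                free_scanned = val;
                        else if (!strcmp(name, "compact_isolated"))
                                isolated = val;
                        else if (!strcmp(name, "pgmigrate_success"))
                                migrate_success = val;
                        else if (!strcmp(name, "pgmigrate_fail"))
                                migrate_fail = val;
                }
                fclose(f);
      
                {
                        const double u   = sizeof(void *);  /* fundamental unit */
                        const double Ca  = 64.0 / u;    /* assumes 64-byte struct page */
                        const double Wi  = 1.0;         /* assumed locking cost */
                        const double Cmc = (Ca + 4096.0 / u) * 2; /* assumes 4K pages */
                        const double Cmf = Ca * 2;
                        const double Ci  = Ca + Wi;
                        const double Csm = Ca, Csf = Ca;
      
                        printf("overall compaction cost: %.0f units\n",
                               Csm * migrate_scanned + Csf * free_scanned +
                               Ci  * isolated +
                               Cmc * migrate_success + Cmf * migrate_fail);
                }
                return 0;
        }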
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      397487db
    • M
      mm: migrate: Add a tracepoint for migrate_pages · 7b2a2d4a
      By Mel Gorman
      The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
      about migration activity but not the type or the reason. This patch adds
      a tracepoint to identify the type of page migration and why the page is
      being migrated.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      7b2a2d4a
    • M
      mm: compaction: Move migration fail/success stats to migrate.c · 5647bc29
      By Mel Gorman
      The compact_pages_moved and compact_pagemigrate_failed events are
      convenient for determining if compaction is active and to what
      degree migration is succeeding, but they are at the wrong level. Other
      users of migration may also want to know if migration is working
      properly and this will be particularly true for any automated
      NUMA migration. This patch moves the counters down to migration
      with the new events called pgmigrate_success and pgmigrate_fail.
      The compact_blocks_moved counter is removed because while it was
      useful for debugging initially, it's worthless now as no meaningful
      conclusions can be drawn from its value.
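      
      The accounting boils down to something like this at the end of
      migrate_pages() (a sketch; nr_succeeded/nr_failed are the tallies kept
      by the retry loop, and the event names are assumed to mirror the
      vmstat strings):
      
        count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
        count_vm_events(PGMIGRATE_FAIL, nr_failed);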
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      5647bc29
    • I
      mm: Optimize the TLB flush of sys_mprotect() and change_protection() users · 1233d588
      By Ingo Molnar
      Reuse the NUMA code's 'modified page protections' count that
      change_protection() computes, and skip the TLB flush if there are
      no changes to a range that sys_mprotect() modifies.
      
      Given that mprotect() already optimizes the same-flags case,
      I expected this optimization to trigger predominantly on
      CONFIG_NUMA_BALANCING=y kernels - but even with that feature
      disabled it triggers rather often.
      
      There are two reasons for that:
      
      1)
      
      sys_mprotect() already optimizes the same-flags case:
      
              if (newflags == oldflags) {
                      *pprev = vma;
                      return 0;
              }
      
      This test works in many cases, but it is too sharp in some
      others, where it differentiates between protection values that the
      underlying PTE format makes no distinction between, such as
      PROT_EXEC == PROT_READ on x86.
      
      2)
      
      Even where the vma flag change does necessitate a modification
      of the pagetables, there might be no pagetables yet to modify:
      they might not be instantiated yet.
      
      During a regular desktop bootup this optimization hits a couple
      of hundred times. During a Java test I measured thousands of
      hits.
      
      So this optimization improves sys_mprotect() in general, not just
      CONFIG_NUMA_BALANCING=y kernels.
      
      [ We could further increase the efficiency of this optimization if
        change_pte_range() and change_huge_pmd() were a bit smarter about
        recognizing exact-same-value protection masks - when the hardware
        can do that safely. This would probably further speed up mprotect(). ]
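      
      The shape of the optimization is roughly (a sketch; the names follow
      the change_protection() interface from the previous patch):
      
        pages = change_protection(vma, start, end, newprot, dirty_accountable);
        /* nothing was modified - e.g. same-value protections, or no
         * instantiated pagetables - so no stale TLB entry can exist */
        if (pages)
                flush_tlb_range(vma, start, end);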
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1233d588
    • P
      mm: Count the number of pages affected in change_protection() · 7da4d641
      By Peter Zijlstra
      This will be used for three kinds of purposes:
      
       - to optimize mprotect()
      
       - to speed up working set scanning for working set areas that
         have not been touched
      
       - to more accurately scan per real working set
      
      No change in functionality from this patch.
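      
      A sketch of the counting at the PTE level (the count propagates up
      through change_pmd_range()/change_pud_range() to change_protection();
      surrounding variables are assumed from the existing loop):
      
        unsigned long pages = 0;
      
        do {
                oldpte = *pte;
                if (pte_present(oldpte)) {
                        ptent = ptep_modify_prot_start(mm, addr, pte);
                        ptent = pte_modify(ptent, newprot);
                        ptep_modify_prot_commit(mm, addr, pte, ptent);
                        pages++;        /* one more entry rewritten */
                }
        } while (pte++, addr += PAGE_SIZE, addr != end);
      
        return pages;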
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7da4d641
    • M
      mm: Check if PTE is already allocated during page fault · 4fd01770
      By Mel Gorman
      With transparent hugepage support, handle_mm_fault() has to be careful
      that a normal PMD has been established before handling a PTE fault. To
      achieve this, it used __pte_alloc() directly instead of pte_alloc_map(),
      as pte_alloc_map() is unsafe to run against a huge PMD. pte_offset_map()
      is called once it is known the PMD is safe.
      
      pte_alloc_map() is smart enough to check if a PTE is already present
      before calling __pte_alloc(), but this check was lost. As a consequence,
      PTEs may be allocated unnecessarily and the page table lock taken.
      This useless PTE does get cleaned up, but it's a performance hit which
      is visible in page_test from aim9.
      
      This patch simply re-adds the check normally done by pte_alloc_map()
      to check whether the PTE needs to be allocated before taking the page
      table lock. The effect is noticeable in page_test from aim9.
      
       AIM9
                       2.6.38-vanilla 2.6.38-checkptenone
       creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
       page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
       brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
       exec_test      382.00 ( 0.00%)   456.90 (16.39%)
       fork_test       60.11 ( 0.00%)    67.79 (11.34%)
       MMTests Statistics: duration
       Total Elapsed Time (seconds)                611.90    612.22
      
      (While this affects 2.6.38, it is a performance rather than a
      functional bug and normally outside the rules for -stable. While the
      big performance differences are in a microbenchmark, the difference
      in fork and exec performance may be significant enough that -stable
      wants to consider the patch.)
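      
      A sketch of the re-added check in handle_mm_fault():
      
        /* only allocate a PTE page when no PMD entry exists yet; the
         * pmd_none() test is what pte_alloc_map() would have done */
        if (unlikely(pmd_none(*pmd)) &&
            unlikely(__pte_alloc(mm, vma, pmd, address)))
                return VM_FAULT_OOM;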
      Reported-by: Raz Ben Yehuda <raziebe@gmail.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      [ Picked this up from the AutoNUMA tree to help
        it upstream and to allow apples-to-apples
        performance comparisons. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4fd01770
    • R
      mm: Only flush the TLB when clearing an accessible pte · 8d1acce4
      By Rik van Riel
      If ptep_clear_flush() is called to clear a page table entry that is
      not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table
      entry, there is no need to flush the TLB on remote CPUs.
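      
      A sketch of the resulting generic helper (pte_accessible() is the
      test; a _PAGE_PROTNONE pte fails it, so no IPI is sent):
      
        pte_t ptep_clear_flush(struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep)
        {
                pte_t pte;
      
                pte = ptep_get_and_clear(vma->vm_mm, address, ptep);
                if (pte_accessible(pte))
                        flush_tlb_page(vma, address);
                return pte;
        }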
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d1acce4
    • R
      mm,generic: only flush the local TLB in ptep_set_access_flags · cef23d9d
      By Rik van Riel
      The function ptep_set_access_flags is only ever used to upgrade
      access permissions to a page. That means the only negative side
      effect of not flushing remote TLBs is that other CPUs may incur
      spurious page faults, if they happen to access the same address,
      and still have a PTE with the old permissions cached in their
      TLB.
      
      Having another CPU occasionally incur a spurious page fault is
      cheaper than always paying the cost of a remote TLB flush, so
      replace the remote TLB flush with a purely local one.
      
      This should be safe on every architecture that correctly
      implements flush_tlb_fix_spurious_fault() to actually invalidate
      the local TLB entry that caused a page fault, as well as on
      architectures where the hardware invalidates TLB entries that
      cause page faults.
      
      In the unlikely event that you are hitting what appears to be
      an infinite loop of page faults, and 'git bisect' took you to
      this changeset, your architecture needs to implement
      flush_tlb_fix_spurious_fault to actually flush the TLB entry.
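      
      A sketch of the generic helper after this change:
      
        int ptep_set_access_flags(struct vm_area_struct *vma,
                                  unsigned long address, pte_t *ptep,
                                  pte_t entry, int dirty)
        {
                int changed = !pte_same(*ptep, entry);
      
                if (changed) {
                        set_pte_at(vma->vm_mm, address, ptep, entry);
                        /* local flush only; a remote CPU takes a spurious
                         * fault and fixes up its own TLB */
                        flush_tlb_fix_spurious_fault(vma, address);
                }
                return changed;
        }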
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      cef23d9d
  2. 17 November 2012, 10 commits
    • A
      revert "mm: fix-up zone present pages" · 5576646f
      By Andrew Morton
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix an issue when calculating zone->present_pages,
      but it caused a regression on 32-bit systems with HIGHMEM.  With that
      change, reset_zone_present_pages() resets all zone->present_pages to
      zero, and fixup_zone_present_pages() is called to recalculate
      zone->present_pages when the boot allocator frees core memory pages into
      the buddy allocator.  Because highmem pages are not freed by the bootmem
      allocator, all highmem zones' present_pages become zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5576646f
    • H
      tmpfs: change final i_blocks BUG to WARNING · 0f3c42f5
      By Hugh Dickins
      Under a particular load on one machine, I have hit shmem_evict_inode()'s
      BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
      race between swapout and eviction.
      
      It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(),
      and the lack of coherent locking between mapping's nrpages and shmem's
      swapped count.  There's a window in shmem_writepage(), between lowering
      nrpages in shmem_delete_from_page_cache() and then raising swapped
      count, when the freed count appears to be +1 when it should be 0, and
      then the asymmetry stops it from being corrected with -1 before hitting
      the BUG.
      
      One answer is coherent locking: using tree_lock throughout, without
      info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
      used_blocks makes that messier than expected.  Another answer may be a
      further effort to eliminate the weird shmem_recalc_inode() altogether,
      but previous attempts at that failed.
      
      So far undecided, but for now change the BUG_ON to WARN_ON: in usual
      circumstances it remains a useful consistency check.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f3c42f5
    • H
      tmpfs: fix shmem_getpage_gfp() VM_BUG_ON · 215c02bc
      By Hugh Dickins
      Fuzzing with trinity hit the "impossible" VM_BUG_ON(error) (which Fedora
      has converted to WARNING) in shmem_getpage_gfp():
      
        WARNING: at mm/shmem.c:1151 shmem_getpage_gfp+0xa5c/0xa70()
        Pid: 29795, comm: trinity-child4 Not tainted 3.7.0-rc2+ #49
        Call Trace:
          warn_slowpath_common+0x7f/0xc0
          warn_slowpath_null+0x1a/0x20
          shmem_getpage_gfp+0xa5c/0xa70
          shmem_fault+0x4f/0xa0
          __do_fault+0x71/0x5c0
          handle_pte_fault+0x97/0xae0
          handle_mm_fault+0x289/0x350
          __do_page_fault+0x18e/0x530
          do_page_fault+0x2b/0x50
          page_fault+0x28/0x30
          tracesys+0xe1/0xe6
      
      Thanks to Johannes for pointing to truncation: free_swap_and_cache()
      only does a trylock on the page, so the page lock we've held since
      before confirming swap is not enough to protect against truncation.
      
      What cleanup is needed in this case? Just delete_from_swap_cache(),
      which takes care of the memcg uncharge.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      215c02bc
    • W
      mm: highmem: don't treat PKMAP_ADDR(LAST_PKMAP) as a highmem address · 498c2280
      By Will Deacon
      kmap_to_page returns the corresponding struct page for a virtual address
      of an arbitrary mapping.  This works by checking whether the address
      falls in the pkmap region and using the pkmap page tables instead of the
      linear mapping if appropriate.
      
      Unfortunately, the bounds checking means that PKMAP_ADDR(LAST_PKMAP) is
      incorrectly treated as a highmem address and we can end up walking off
      the end of pkmap_page_table and subsequently passing junk to pte_page.
      
      This patch fixes the bounds check to stay within the pkmap tables.
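      
      The fix amounts to making the upper bound exclusive; a sketch of the
      pkmap branch in kmap_to_page():
      
        if (addr >= PKMAP_ADDR(0) && addr < PKMAP_ADDR(LAST_PKMAP)) {
                int i = PKMAP_NR(addr);        /* index stays in range */
                return pte_page(pkmap_page_table[i]);
        }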
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      498c2280
    • M
      mm: revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures" · 96710098
      By Mel Gorman
      Jiri Slaby reported the following:
      
      	(It's an effective revert of "mm: vmscan: scale number of pages
      	reclaimed by reclaim/compaction based on failures".) Given kswapd
      	had hours of runtime in ps/top output yesterday in the morning
      	and after the revert it's now 2 minutes in sum for the last 24h,
      	I would say, it's gone.
      
      The intention of the patch in question was to compensate for the loss of
      lumpy reclaim.  Part of the reason lumpy reclaim worked is because it
      aggressively reclaimed pages and this patch was meant to be a sane
      compromise.
      
      When compaction fails, it gets deferred and both compaction and
      reclaim/compaction are deferred to avoid excessive reclaim.  However, since
      commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up
      each time and continues reclaiming which was not taken into account when
      the patch was developed.
      
      Attempts to address the problem ended up just changing the shape of the
      problem instead of fixing it.  The release window gets closer and while
      a THP allocation failing is not a major problem, kswapd chewing up a lot
      of CPU is.
      
      This patch reverts commit 83fde0f2 ("mm: vmscan: scale number of
      pages reclaimed by reclaim/compaction based on failures") and will be
      revisited in the future.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96710098
    • X
      swapfile: fix name leak in swapoff · f58b59c1
      By Xiaotian Feng
      There's a name leak introduced by commit 91a27b2a ("vfs: define
      struct filename and have getname() return it").  Add the missing
      putname.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f58b59c1
    • H
      memcg: fix hotplugged memory zone oops · bea8c150
      By Hugh Dickins
      When MEMCG is configured on (even when it's disabled by a boot option),
      then when adding or removing a page to/from its lru list, the zone
      pointer used for stats updates is nowadays taken from the struct lruvec.
      (On many configurations, calculating the zone from the page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
      
      Ah, there was one exception.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding the lruvec.
      Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against such
      an oops (the lru lists in danger could only be empty), but we're better
      proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
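      
      A sketch of the check-and-set as it would sit at the end of
      mem_cgroup_zone_lruvec() and mem_cgroup_page_lruvec(), both of which
      already know the right zone (variable names assumed from context):
      
        lruvec = &mz->lruvec;
        /* a node can be onlined after this memcg was created, in which
         * case lruvec->zone was never set up: fix it up lazily here */
        if (unlikely(lruvec->zone != zone))
                lruvec->zone = zone;
        return lruvec;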
      Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bea8c150
    • M
      memcg: oom: fix totalpages calculation for memory.swappiness==0 · 9a5a8f19
      By Michal Hocko
      oom_badness() takes a totalpages argument which says how many pages are
      available and it uses it as a base for the score calculation.  The value
      is calculated by mem_cgroup_get_limit which considers both limit and
      total_swap_pages (resp.  memsw portion of it).
      
      This is usually correct but since fe35004f ("mm: avoid swapping out
      with swappiness==0") we do not swap when swappiness is 0 which means
      that we cannot really use up all the totalpages pages.  This in turn
      confuses the oom score calculation if the memcg limit is much smaller
      than the available swap, because the used memory (capped by the limit)
      is negligible compared to totalpages, so the resulting score is too small
      if adj != 0 (typically a task with CAP_SYS_ADMIN or a non-zero
      oom_score_adj).  A wrong process might be selected as a result.
      
      The problem can be worked around by checking mem_cgroup_swappiness==0
      and not considering swap at all in such a case.
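      
      A sketch of the workaround in mem_cgroup_get_limit() (simplified;
      the memsw capping is omitted here):
      
        limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
        /* do not consider swap space if we cannot swap due to swappiness */
        if (mem_cgroup_swappiness(memcg))
                limit += total_swap_pages << PAGE_SHIFT;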
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a5a8f19
    • D
      mm: fix build warning for uninitialized value · 1756954c
      By David Rientjes
      do_wp_page() sets mmun_called if mmun_start and mmun_end were
      initialized and, if so, may call mmu_notifier_invalidate_range_end()
      with these values.  This doesn't prevent gcc from emitting a build
      warning though:
      
        mm/memory.c: In function `do_wp_page':
        mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
        mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function
      
      It's much easier to initialize the variables to impossible values and
      do a simple comparison to determine whether they were initialized,
      which removes the bool entirely.
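      
      A sketch of the approach in do_wp_page():
      
        unsigned long mmun_start = 0;   /* impossible value for the range */
        unsigned long mmun_end = 0;     /* ditto */
      
        /* ... the fault path may set mmun_start/mmun_end ... */
      
        if (mmun_end > mmun_start)
                mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);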
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1756954c
    • M
      mm: add anon_vma_lock to validate_mm() · 63c3b902
      By Michel Lespinasse
      Iterating over the vma->anon_vma_chain without anon_vma_lock may cause
      a NULL ptr deref in anon_vma_interval_tree_verify(), because a node in
      the chain might have been removed.
      
        BUG: unable to handle kernel paging request at fffffffffffffff0
        IP: [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0
        PGD 4e28067 PUD 4e29067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU 0
        Pid: 9050, comm: trinity-child64 Tainted: G        W    3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77
        RIP: 0010: anon_vma_interval_tree_verify+0xc/0xa0
        Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000)
        Call Trace:
          validate_mm+0x58/0x1e0
          vma_adjust+0x635/0x6b0
          __split_vma.isra.22+0x161/0x220
          split_vma+0x24/0x30
          sys_madvise+0x5da/0x7b0
          tracesys+0xe1/0xe6
        RIP  anon_vma_interval_tree_verify+0xc/0xa0
        CR2: fffffffffffffff0
      
      Figured out by Bob Liu.
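      
      The fix takes the lock around the verification walk in validate_mm()
      (a sketch):
      
        if (vma->anon_vma) {
                struct anon_vma_chain *avc;
      
                vma_lock_anon_vma(vma);
                list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
                        anon_vma_interval_tree_verify(avc);
                vma_unlock_anon_vma(vma);
        }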
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63c3b902
  3. 09 November 2012, 1 commit
  4. 26 October 2012, 4 commits
    • D
      mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant · 6b187d02
      By David Rientjes
      Commit 957f822a ("mm, numa: reclaim from all nodes within reclaim
      distance") caused zone_reclaim_mode to be set for all systems where two
      nodes are within RECLAIM_DISTANCE of each other.  This is the opposite
      of what we actually want: zone_reclaim_mode should be set if two nodes
      are sufficiently distant.
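      
      The fix is essentially flipping the comparison; a sketch of the check
      in the zonelist-building code (variable names assumed):
      
        /* only enable zone reclaim when some node is further away
         * than RECLAIM_DISTANCE */
        if (node_distance(local_node, node) > RECLAIM_DISTANCE)
                zone_reclaim_mode = 1;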
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Julian Wollrath <jwollrath@web.de>
      Tested-by: Julian Wollrath <jwollrath@web.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Patrik Kullman <patrik.kullman@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b187d02
    • G
      mm/mmu_notifier: allocate mmu_notifier in advance · 35cfa2b0
      By Gavin Shan
      When the mmu_notifier is allocated with GFP_KERNEL, swapping can start
      if available memory is tight.  Eventually, that can lead to a deadlock
      while the swap daemon swaps out anonymous pages.  It was caused by
      commit e0f3c3f7 ("mm/mmu_notifier: init notifier if necessary").
      
        =================================
        [ INFO: inconsistent lock state ]
        3.7.0-rc1+ #518 Not tainted
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
        {RECLAIM_FS-ON-W} state was registered at:
           mark_held_locks+0x86/0x150
           lockdep_trace_alloc+0x67/0xc0
           kmem_cache_alloc_trace+0x33/0x230
           do_mmu_notifier_register+0x87/0x180
           mmu_notifier_register+0x13/0x20
           kvm_dev_ioctl+0x428/0x510
           do_vfs_ioctl+0x98/0x570
           sys_ioctl+0x91/0xb0
           system_call_fastpath+0x16/0x1b
        irq event stamp: 825
        hardirqs last  enabled at (825): _raw_spin_unlock_irq+0x30/0x60
        hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
        softirqs last  enabled at (0): copy_process+0x630/0x17c0
        softirqs last disabled at (0): (null)
        ...
      
      Simply back out the above commit, which was a small performance
      optimization.
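      
      A sketch of the restored ordering in do_mmu_notifier_register():
      
        /* allocate up front, before any locks are taken, so the
         * GFP_KERNEL allocation may safely enter direct reclaim */
        mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
        if (unlikely(!mmu_notifier_mm))
                return -ENOMEM;
      
        down_write(&mm->mmap_sem);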
      Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
      Reported-by: Andrea Righi <andrea@betterlinux.com>
      Tested-by: Andrea Righi <andrea@betterlinux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35cfa2b0
    • B
      mm/page_alloc.c:alloc_contig_range(): return early for err path · 86a595f9
      By Bob Liu
      If start_isolate_page_range() fails, unset_migratetype_isolate() has
      already been done inside it, so the caller can simply return early.
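      
      So the error path reduces to (a sketch):
      
        ret = start_isolate_page_range(pfn_max_align_down(start),
                                       pfn_max_align_up(end), migratetype);
        if (ret)
                return ret;     /* isolation was already undone internally */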
      Signed-off-by: Bob Liu <lliubbo@gmail.com>
      Cc: Ni zhan Chen <nizhan.chen@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86a595f9
    • J
      mm: fix XFS oops due to dirty pages without buffers on s390 · ef5d437f
      By Jan Kara
      On s390, any write to a page (even from the kernel itself) sets the
      architecture-specific page dirty bit.  Thus when a page is written to
      via a buffered write, the HW dirty bit gets set, and when we later map
      and unmap the page, page_remove_rmap() finds the dirty bit and calls
      set_page_dirty().
      
      Dirtying of a page which shouldn't be dirty can cause all sorts of
      problems to filesystems.  The bug we observed in practice is that
      buffers from the page get freed, so when the page gets later marked as
      dirty and writeback writes it, XFS crashes due to an assertion
      BUG_ON(!PagePrivate(page)) in page_buffers() called from
      xfs_count_page_state().
      
      A similar problem can also happen when a zero_user_segment() call from
      xfs_vm_writepage() (or block_write_full_page() for that matter) sets
      the hardware dirty bit during writeback: later the buffers get freed,
      and then the page is unmapped.
      
      Fix the issue by ignoring the s390 HW dirty bit for page cache pages
      of mappings with mapping_cap_account_dirty().  This is safe because,
      for such mappings, when a page gets marked writeable in the PTE it is
      also marked dirty in do_wp_page() or do_page_fault().  When the dirty
      bit is cleared by clear_page_dirty_for_io(), the page gets
      writeprotected in page_mkclean().  So a pagecache page is writeable
      if and only if it is dirty.
      
      Thanks to Hugh Dickins for pointing out mapping has to have
      mapping_cap_account_dirty() for things to work and proposing a cleaned
      up variant of the patch.
      
      The patch has survived about two hours of running fsx-linux on tmpfs
      while heavily swapping, and several days of running on our build
      machines, where the original problem was triggered.
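      
      A sketch of the resulting test in page_remove_rmap() (the exact
      condition should be checked against the commit; 'anon' and 'mapping'
      are assumed from the surrounding function):
      
        /* transfer the s390 storage-key dirty bit only for mappings
         * that do not do their own dirty accounting */
        if ((!anon || PageSwapCache(page)) &&
            (!mapping || !mapping_cap_account_dirty(mapping)) &&
            page_test_and_clear_dirty(page_to_pfn(page), 1))
                set_page_dirty(page);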
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>		[3.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef5d437f
  5. 25 October 2012, 1 commit
  6. 23 October 2012, 1 commit
  7. 20 October 2012, 2 commits
  8. 17 October 2012, 1 commit
    • D
      mm, mempolicy: fix printing stack contents in numa_maps · 32f8516a
      By David Rientjes
      When reading /proc/pid/numa_maps, it's possible to return the contents of
      the stack where the mempolicy string should be printed if the policy gets
      freed from beneath us.
      
      This happens because mpol_to_str() may return an error; the
      stack-allocated buffer is then printed without ever having been written.
      
      There are two possible error conditions in mpol_to_str():
      
       - if the buffer allocated is insufficient for the string to be stored,
         and
      
       - if the mempolicy has an invalid mode.
      
      The first error condition is not triggered in any of the callers to
      mpol_to_str(): at least 50 bytes is always allocated on the stack and this
      is sufficient for the string to be written.  A future patch should convert
      this into BUILD_BUG_ON() since we know the maximum strlen possible, but
      that's not -rc material.
      
      The second error condition is possible if a race occurs in dropping a
      reference to a task's mempolicy causing it to be freed during the read().
      The slab poison value is then used for the mode and mpol_to_str() returns
      -EINVAL.
      
      This race is only possible because get_vma_policy() believes that
      mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
      does not hold mm->mmap_sem when dropping the reference or setting
      task->mempolicy to NULL: it uses task_lock(task) instead.
      
      Thus, it's required for the caller of a task mempolicy to hold
      task_lock(task) while grabbing the mempolicy and reading it.  Callers with
      a vma policy store their mempolicy earlier and can simply increment the
      reference count so it's guaranteed not to be freed.
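      
      So reading a task mempolicy safely looks like this (a sketch):
      
        struct mempolicy *pol;
      
        task_lock(task);
        pol = task->mempolicy;
        mpol_get(pol);          /* take a reference before dropping the lock */
        task_unlock(task);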
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32f8516a
  9. 15 October 2012, 1 commit
  10. 13 October 2012, 2 commits
    • J
      vfs: make path_openat take a struct filename pointer · 669abf4e
      By Jeff Layton
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
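      
      A sketch of the resulting wrapper (assuming the new variant is the
      file_open_name() helper this patch adds):
      
        struct file *filp_open(const char *filename, int flags, umode_t mode)
        {
                struct filename name = {.name = filename};
      
                return file_open_name(&name, flags, mode);
        }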
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      669abf4e
    • J
      vfs: define struct filename and have getname() return it · 91a27b2a
      By Jeff Layton
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
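      
      For now the struct is just:
      
        struct filename {
                const char              *name;  /* pointer to actual string */
                const __user char       *uptr;  /* original userland pointer */
        };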
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      91a27b2a
  11. 10 October 2012, 3 commits
    • J
      mm, slab: release slab_mutex earlier in kmem_cache_destroy() · 210ed9de
      By Jiri Kosina
      Commit 1331e7a1 ("rcu: Remove _rcu_barrier() dependency on
      __stop_machine()") introduced a slab_mutex -> cpu_hotplug.lock
      dependency through the kmem_cache_destroy() -> rcu_barrier() ->
      _rcu_barrier() -> get_online_cpus() chain.
      
      Lockdep thinks that this might actually result in an ABBA deadlock,
      and reports it as below:
      
      === [ cut here ] ===
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.6.0-rc5-00004-g0d8ee37e #143 Not tainted
       -------------------------------------------------------
       kworker/u:2/40 is trying to acquire lock:
        (rcu_sched_state.barrier_mutex){+.+...}, at: [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0
      
       but task is already holding lock:
        (slab_mutex){+.+.+.}, at: [<ffffffff81176e15>] kmem_cache_destroy+0x45/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #2 (slab_mutex){+.+.+.}:
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff81558cb5>] cpuup_callback+0x2f/0xbe
              [<ffffffff81564b83>] notifier_call_chain+0x93/0x140
              [<ffffffff81076f89>] __raw_notifier_call_chain+0x9/0x10
              [<ffffffff8155719d>] _cpu_up+0xba/0x14e
              [<ffffffff815572ed>] cpu_up+0xbc/0x117
              [<ffffffff81ae05e3>] smp_init+0x6b/0x9f
              [<ffffffff81ac47d6>] kernel_init+0x147/0x1dc
              [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff81049197>] get_online_cpus+0x37/0x50
              [<ffffffff810f21bb>] _rcu_barrier+0xbb/0x1e0
              [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20
              [<ffffffff810f2309>] rcu_barrier+0x9/0x10
              [<ffffffff8118c129>] deactivate_locked_super+0x49/0x90
              [<ffffffff8118cc01>] deactivate_super+0x61/0x70
              [<ffffffff811aaaa7>] mntput_no_expire+0x127/0x180
              [<ffffffff811ab49e>] sys_umount+0x6e/0xd0
              [<ffffffff81569979>] system_call_fastpath+0x16/0x1b
      
       -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
              [<ffffffff810adb4e>] check_prev_add+0x3de/0x440
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0
              [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20
              [<ffffffff810f2309>] rcu_barrier+0x9/0x10
              [<ffffffff81176ea1>] kmem_cache_destroy+0xd1/0xe0
              [<ffffffffa04c3154>] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
              [<ffffffffa04c31aa>] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
              [<ffffffffa04c42ce>] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
              [<ffffffff81454b79>] ops_exit_list+0x39/0x60
              [<ffffffff814551ab>] cleanup_net+0xfb/0x1b0
              [<ffffffff8106917b>] process_one_work+0x26b/0x4c0
              [<ffffffff81069f3e>] worker_thread+0x12e/0x320
              [<ffffffff8106f73e>] kthread+0x9e/0xb0
              [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10
      
       other info that might help us debug this:
      
       Chain exists of:
         rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(slab_mutex);
                                      lock(cpu_hotplug.lock);
                                      lock(slab_mutex);
         lock(rcu_sched_state.barrier_mutex);
      
        *** DEADLOCK ***
      === [ cut here ] ===
      
      This is actually a false positive. Lockdep has no way of knowing
      that the ABBA can never actually happen, because of the special
      semantics of cpu_hotplug.refcount and its handling in
      cpu_hotplug_begin(): the mutual exclusion there is not achieved
      through a mutex, but through cpu_hotplug.refcount.
      
      The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
      until everyone who called get_online_cpus() will call put_online_cpus()"
      semantics is totally invisible to lockdep.
      
      This patch therefore moves the unlock of slab_mutex so that
      rcu_barrier() is called with it unlocked. This has two advantages:
      
      - it slightly reduces the hold time of slab_mutex; as it's used to
        protect the cachep list, it's not necessary to hold it over the
        kmem_cache_free() call any more
      - it silences the lockdep false positive warning, as it avoids lockdep
        ever learning about the slab_mutex -> cpu_hotplug.lock dependency
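      
      A sketch of the reordering in kmem_cache_destroy():
      
        mutex_lock(&slab_mutex);
        list_del(&cachep->list);        /* cache is now unreachable */
        mutex_unlock(&slab_mutex);      /* drop before sleeping below */
      
        if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
                rcu_barrier();          /* no slab_mutex held here any more */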
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      210ed9de
    • H
      tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking · 35c2a7f4
      By Hugh Dickins
      Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
      	u64 inum = fid->raw[2];
      which is unhelpfully reported as at the end of shmem_alloc_inode():
      
      BUG: unable to handle kernel paging request at ffff880061cd3000
      IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Call Trace:
       [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
       [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
       [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
       [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
      
      Right, tmpfs is being stupid to access fid->raw[2] before validating that
      fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
      fall at the end of a page, and the next page not be present.
      
      But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
      careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
      could oops in the same way: add the missing fh_len checks to those.
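      
      The tmpfs check, as a sketch of what each of these handlers needs:
      
        /* in shmem_fh_to_dentry(): reject handles too short to contain
         * the fields we are about to read */
        if (fh_len < 3)
                return NULL;
      
        inum = fid->raw[2];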
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      35c2a7f4
    • A
      mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN · baaf1dd4
      By Arnd Bergmann
      The definition of ARCH_SLAB_MINALIGN is architecture-dependent
      and can be of type size_t or int. Comparing that value
      with ARCH_KMALLOC_MINALIGN can cause harmless warnings on
      platforms where the types differ. Since both are always
      small positive integers, using the size_t type to compare
      them is safe and gets rid of the warning.
      
      Without this patch, building ARM collie_defconfig results in:
      
      mm/slob.c: In function '__kmalloc_node':
      mm/slob.c:431:152: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      mm/slob.c: In function 'kfree':
      mm/slob.c:484:153: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      mm/slob.c: In function 'ksize':
      mm/slob.c:503:153: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      baaf1dd4
  12. 09 October 2012, 5 commits