1. 15 March 2011 (1 commit)
  2. 14 March 2011 (2 commits)
  3. 05 March 2011 (3 commits)
  4. 26 February 2011 (5 commits)
    • memcg: more mem_cgroup_uncharge() batching · e5598f8b
      Authored by Hugh Dickins
      It seems odd that truncate_inode_pages_range(), called not only when
      truncating but also when evicting inodes, has mem_cgroup_uncharge_start()
      and mem_cgroup_uncharge_end() batching in its second loop to clear up a few
      leftovers, but not in its first loop that does almost all the work: add
      them there too.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
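      A hedged sketch of the batching pattern this commit adds to the first loop
      of truncate_inode_pages_range() (the loop body is schematic; only the
      placement of the start/end calls is the point):

      	mem_cgroup_uncharge_start();	/* open one uncharge batch around the loop */
      	for (i = 0; i < pagevec_count(&pvec); i++) {
      		struct page *page = pvec.pages[i];

      		/* ... lock, truncate and release the page; each memcg
      		 *     uncharge joins the open batch ... */
      	}
      	mem_cgroup_uncharge_end();	/* flush the batched uncharges in one go */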
    • thp: fix interleaving for transparent hugepages · 8eac563c
      Authored by Andi Kleen
      The THP code didn't pass the correct interleaving shift to the memory
      policy code.  Fix this here by adjusting for the order.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
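      A hedged sketch of the adjustment the commit describes in the huge page
      allocation path (interleave_nid() and alloc_page_interleave() are mempolicy
      helpers of that era; the surrounding code is illustrative, not the literal
      diff):

      	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
      		unsigned nid;

      		/* was: interleave_nid(pol, vma, addr, PAGE_SHIFT), which
      		 * ignored the allocation order and skewed the round-robin
      		 * node choice for order-9 THP pages */
      		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
      		mpol_cond_put(pol);
      		return alloc_page_interleave(gfp, order, nid);
      	}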
    • mm: fix dubious code in __count_immobile_pages() · 29723fcc
      Authored by Namhyung Kim
      When pfn_valid_within() failed, 'iter' was incremented twice.
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
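      A hedged sketch of the loop shape behind the bug (reconstructed from the
      one-line description above, not the literal diff):

      	for (iter = 0; iter < pageblock_nr_pages; iter++) {
      		unsigned long check = pfn + iter;

      		if (!pfn_valid_within(check)) {
      			iter++;		/* bug: the loop header increments again,
      					 * so the next pfn is silently skipped */
      			continue;	/* fix: continue without the manual iter++ */
      		}
      		/* ... otherwise inspect the page at 'check' ... */
      	}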
    • mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT · 2876592f
      Authored by Mel Gorman
      should_continue_reclaim() for reclaim/compaction allows scanning to
      continue even if pages are not being reclaimed until the full list is
      scanned.  In terms of allocation success, this makes sense but potentially
      it introduces unwanted latency for high-order allocations such as
      transparent hugepages and network jumbo frames that would prefer to fail
      the allocation attempt and fallback to order-0 pages.  Worse, there is a
      potential that the full LRU scan will clear all the young bits, distort
      page aging information and potentially push pages into swap that would
      have otherwise remained resident.
      
      This patch will stop reclaim/compaction if no pages were reclaimed in the
      last SWAP_CLUSTER_MAX pages that were considered.  For allocations such as
      hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
      LRU list may still be scanned.
      
      Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
      is not set so the following avoids the gfp_mask being examined:
      
              if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                      return false;
      
      A tool was developed based on ftrace that tracked the latency of
      high-order allocations while transparent hugepage support was enabled and
      three benchmarks were run.  The "fix-infinite" figures are 2.6.38-rc4 with
      Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
      applied.
      
        STREAM Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            10298           10229
        1 :: Min             0.4560          0.4640
        1 :: Mean            1.0589          1.0183
        1 :: Max            14.5990         11.7510
        1 :: Stddev          0.5208          0.4719
        2 :: Count                2               1
        2 :: Min             1.8610          3.7240
        2 :: Mean            3.4325          3.7240
        2 :: Max             5.0040          3.7240
        2 :: Stddev          1.5715          0.0000
        9 :: Count           111696          111694
        9 :: Min             0.5230          0.4110
        9 :: Mean           10.5831         10.5718
        9 :: Max            38.4480         43.2900
        9 :: Stddev          1.1147          1.1325
      
      Mean time for order-1 allocations is reduced.  order-2 looks increased but
      with so few allocations, it's not particularly significant.  THP mean
      allocation latency is also reduced.  That said, allocation time varies so
      significantly that the reductions are within noise.
      
      Max allocation time is reduced by a significant amount for low-order
      allocations but increased for THP allocations, which presumably are now
      breaking before reclaim has done enough work.
      
        SysBench Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            15745           15677
        1 :: Min             0.4250          0.4550
        1 :: Mean            1.1023          1.0810
        1 :: Max            14.4590         10.8220
        1 :: Stddev          0.5117          0.5100
        2 :: Count                1               1
        2 :: Min             3.0040          2.1530
        2 :: Mean            3.0040          2.1530
        2 :: Max             3.0040          2.1530
        2 :: Stddev          0.0000          0.0000
        9 :: Count             2017            1931
        9 :: Min             0.4980          0.7480
        9 :: Mean           10.4717         10.3840
        9 :: Max            24.9460         26.2500
        9 :: Stddev          1.1726          1.1966
      
      Again, mean time for order-1 allocations is reduced while order-2
      allocations are too few to draw conclusions from.  The mean time for THP
      allocations is also slightly reduced, albeit the reductions are within
      variance.
      
      Once again, our maximum allocation time is significantly reduced for
      low-order allocations and slightly increased for THP allocations.
      
        Anon stream mmap reference Highorder Allocation Latency Statistics
        1 :: Count             1376            1790
        1 :: Min             0.4940          0.5010
        1 :: Mean            1.0289          0.9732
        1 :: Max             6.2670          4.2540
        1 :: Stddev          0.4142          0.2785
        2 :: Count                1               -
        2 :: Min             1.9060               -
        2 :: Mean            1.9060               -
        2 :: Max             1.9060               -
        2 :: Stddev          0.0000               -
        9 :: Count            11266           11257
        9 :: Min             0.4990          0.4940
        9 :: Mean        27250.4669      24256.1919
        9 :: Max      11439211.0000    6008885.0000
        9 :: Stddev     226427.4624     186298.1430
      
      This benchmark creates one thread per CPU which references an amount of
      anonymous memory 1.5 times the size of physical RAM.  This pounds swap
      quite heavily and is intended to exercise THP a bit.
      
      Mean allocation time for order-1 is reduced as before.  It's also reduced
      for THP allocations but the variations here are pretty massive due to
      swap.  As before, maximum allocation times are significantly reduced.
      
      Overall, the patch reduces the mean and maximum allocation latencies for
      the smaller high-order allocations.  This was with Slab configured so it
      would be expected to be more significant with Slub which uses these size
      allocations more aggressively.
      
      The mean allocation times for THP allocations are also slightly reduced.
      The maximum latency was slightly increased as predicted by the comments
      due to reclaim/compaction breaking early.  However, workloads care more
      about the latency of lower-order allocations than THP so it's an
      acceptable trade-off.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
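      A hedged reconstruction of the check this commit adds to
      should_continue_reclaim(), placed after the RECLAIM_MODE_COMPACTION test
      quoted above (the exact comments and surrounding code are illustrative):

      	if (sc->gfp_mask & __GFP_REPEAT) {
      		/* __GFP_REPEAT callers such as hugetlbfs have few fallback
      		 * options, so they may still scan the full LRU list */
      		if (!nr_reclaimed && !nr_scanned)
      			return false;
      	} else {
      		/* everyone else gives up if the last SWAP_CLUSTER_MAX pages
      		 * considered produced no reclaimed pages at all */
      		if (!nr_reclaimed)
      			return false;
      	}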
    • mm: grab rcu read lock in move_pages() · a879bf58
      Authored by Greg Thelen
      The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
      prevent free_pid() from reclaiming the pid.
      
      Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
      with:
      
        CONFIG_LOCKUP_DETECTOR=y
        CONFIG_PREEMPT=y
        CONFIG_LOCKDEP=y
        CONFIG_PROVE_LOCKING=y
        CONFIG_PROVE_RCU=y
      
      Previously, migrate_pages() went through a similar transformation
      replacing usage of tasklist_lock with rcu read lock:
      
        commit 55cfaa3c
        Author: Zeng Zhaoming <zengzm.kernel@gmail.com>
        Date:   Thu Dec 2 14:31:13 2010 -0800
      
            mm/mempolicy.c: add rcu read lock to protect pid structure
      
        commit 1e50df39
        Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
        Date:   Thu Jan 13 15:46:14 2011 -0800
      
            mempolicy: remove tasklist_lock from migrate_pages
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Zeng Zhaoming <zengzm.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
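      A hedged sketch of the task lookup in sys_move_pages() with the rcu read
      lock this commit adds (the error handling and the get_task_mm() call follow
      the usual pattern of that era and are illustrative):

      	/* Find the mm_struct */
      	rcu_read_lock();
      	task = pid ? find_task_by_vpid(pid) : current;
      	if (!task) {
      		rcu_read_unlock();
      		return -ESRCH;
      	}
      	mm = get_task_mm(task);	/* take a reference before leaving the rcu section */
      	rcu_read_unlock();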
  5. 25 February 2011 (1 commit)
  6. 24 February 2011 (2 commits)
    • mm: fix possible cause of a page_mapped BUG · a3e8cc64
      Authored by Hugh Dickins
      Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
      a hole with madvise(,, MADV_REMOVE).  That path is under mutex, and
      cannot be explained by lack of serialization in unmap_mapping_range().
      
      Reviewing the code, I found one place where vm_truncate_count handling
      should have been updated, when I switched at the last minute from one
      way of managing the restart_addr to another: mremap move changes the
      virtual addresses, so it ought to adjust the restart_addr.
      
      But rather than exporting the notion of restart_addr from memory.c, or
      converting to restart_pgoff throughout, simply reset vm_truncate_count
      to 0 to force a rescan if mremap move races with preempted truncation.
      
      We have no confirmation that this fixes Robert's BUG,
      but it is a fix that's worth making anyway.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
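      The gist of the fix, as described above, is a one-line reset; a schematic
      (the exact placement inside mm/mremap.c's pte-moving path, and which vma is
      reset, are assumptions here):

      	/* mremap move changed this vma's virtual addresses, so any restart
      	 * address cached by a preempted truncation is stale: force a rescan */
      	vma->vm_truncate_count = 0;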
    • mm: prevent concurrent unmap_mapping_range() on the same inode · 2aa15890
      Authored by Miklos Szeredi
      Michael Leun reported that running parallel opens on a fuse filesystem
      can trigger a "kernel BUG at mm/truncate.c:475"
      
      Gurudas Pai reported the same bug on NFS.
      
      The reason is, unmap_mapping_range() is not prepared for more than
      one concurrent invocation per inode.  For example:
      
        thread1: going through a big range, stops in the middle of a vma and
           stores the restart address in vm_truncate_count.
      
        thread2: comes in with a small (e.g. single page) unmap request on
           the same vma, somewhere before restart_address, finds that the
           vma was already unmapped up to the restart address and happily
           returns without doing anything.
      
      Another scenario would be two big unmap requests, both having to
      restart the unmapping and each one setting vm_truncate_count to its
      own value.  This could go on forever without any of them being able to
      finish.
      
      Truncate and hole punching already serialize with i_mutex.  Other
      callers of unmap_mapping_range() do not, and it's difficult to get
      i_mutex protection for all callers.  In particular ->d_revalidate(),
      which calls invalidate_inode_pages2_range() in fuse, may be called
      with or without i_mutex.
      
      This patch adds a new mutex to 'struct address_space' to prevent
      running multiple concurrent unmap_mapping_range() on the same mapping.
      
      [ We'll hopefully get rid of all this with the upcoming mm
        preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
        lockbreak" patch in particular.  But that is for 2.6.39 ]
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reported-by: Michael Leun <lkml20101129@newton.leun.net>
      Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
      Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
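      A hedged sketch of the serialization the commit describes (the field name
      unmap_mutex is an assumption; only the lock-around-the-whole-call shape is
      taken from the text above):

      	/* struct address_space gains:  struct mutex unmap_mutex; */

      	void unmap_mapping_range(struct address_space *mapping,
      				 loff_t const holebegin, loff_t const holelen,
      				 int even_cows)
      	{
      		mutex_lock(&mapping->unmap_mutex);	/* one unmapper per mapping at a time */
      		/* ... existing restart loop over mapping->i_mmap,
      		 *     driven by vm_truncate_count ... */
      		mutex_unlock(&mapping->unmap_mutex);
      	}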
  7. 16 February 2011 (1 commit)
  8. 12 February 2011 (5 commits)
    • memcg: fix leak of accounting at failure path of hugepage collapsing · 678ff896
      Authored by KAMEZAWA Hiroyuki
      mem_cgroup_uncharge_page() should be called in all failure cases after
      mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page()
      
       [ 4209.076861] BUG: Bad page state in process khugepaged  pfn:1e9800
       [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping:          (null) index:0x2800
       [ 4209.078674] page flags: 0x40000000004000(head)
       [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
       [ 4209.082177] (/A)
       [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
       [ 4209.083412] Call Trace:
       [ 4209.083678]  [<ffffffff810f4454>] ? bad_page+0xe4/0x140
       [ 4209.084240]  [<ffffffff810f53e6>] ? free_pages_prepare+0xd6/0x120
       [ 4209.084837]  [<ffffffff8155621d>] ? rwsem_down_failed_common+0xbd/0x150
       [ 4209.085509]  [<ffffffff810f5462>] ? __free_pages_ok+0x32/0xe0
       [ 4209.086110]  [<ffffffff810f552b>] ? free_compound_page+0x1b/0x20
       [ 4209.086699]  [<ffffffff810fad6c>] ? __put_compound_page+0x1c/0x30
       [ 4209.087333]  [<ffffffff810fae1d>] ? put_compound_page+0x4d/0x200
       [ 4209.087935]  [<ffffffff810fb015>] ? put_page+0x45/0x50
       [ 4209.097361]  [<ffffffff8113f779>] ? khugepaged+0x9e9/0x1430
       [ 4209.098364]  [<ffffffff8107c870>] ? autoremove_wake_function+0x0/0x40
       [ 4209.099121]  [<ffffffff8113ed90>] ? khugepaged+0x0/0x1430
       [ 4209.099780]  [<ffffffff8107c236>] ? kthread+0x96/0xa0
       [ 4209.100452]  [<ffffffff8100dda4>] ? kernel_thread_helper+0x4/0x10
       [ 4209.101214]  [<ffffffff8107c1a0>] ? kthread+0x0/0xa0
       [ 4209.101842]  [<ffffffff8100dda0>] ? kernel_thread_helper+0x0/0x10
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
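      A hedged sketch of the pairing the fix enforces in collapse_huge_page()
      (the failure condition and the goto label are placeholders; only the
      charge/uncharge pairing is taken from the commit text):

      	/* new_page was already charged to the memcg earlier in this function */
      	if (some_failure_check) {			/* placeholder for the re-checks done under mmap_sem */
      		mem_cgroup_uncharge_page(new_page);	/* the uncharge this fix adds */
      		put_page(new_page);
      		goto out;
      	}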
    • vmscan: fix zone shrinking exit when scan work is done · f0fdc5e8
      Authored by Johannes Weiner
      Commit 3e7d3449 ("mm: vmscan: reclaim order-0 and use compaction
      instead of lumpy reclaim") introduced an indefinite loop in
      shrink_zone().
      
      It meant to break out of this loop when no pages had been reclaimed and
      not a single page was even scanned.  The way it would detect the latter
      is by taking a snapshot of sc->nr_scanned at the beginning of the
      function and comparing it against the new sc->nr_scanned after the scan
      loop.  But it would re-iterate without updating that snapshot, looping
      forever if sc->nr_scanned changed at least once since shrink_zone() was
      invoked.
      
      This is not the only condition that exits the loop, but the remaining ones
      require other processes to change the zone state, since the stuck reclaimer
      obviously cannot do so anymore.
      
      This is only happening for higher-order allocations, where reclaim is
      run back to back with compaction.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Kent Overstreet <kent.overstreet@gmail.com>
      Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
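      A hedged sketch of the loop shape the commit describes (heavily condensed;
      the real shrink_zone() is larger):

      	restart:
      		nr_reclaimed = 0;
      		nr_scanned = sc->nr_scanned;	/* fix: snapshot refreshed on every pass;
      						 * the bug took it once, above the label */
      		/* ... scan the LRU lists, accumulating nr_reclaimed ... */
      		sc->nr_reclaimed += nr_reclaimed;

      		if (should_continue_reclaim(zone, nr_reclaimed,
      					    sc->nr_scanned - nr_scanned, sc))
      			goto restart;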
    • mlock: do not munlock pages in __do_fault() · 419d8c96
      Authored by Michel Lespinasse
      If the page is going to be written to, __do_fault() needs to break COW.
      
      However, the old page (before breaking COW) was never mapped into the
      current pte (__do_fault() is only called when the pte is not present), so
      vmscan can't have marked the old page as PageMlocked due to being mapped in
      __do_fault()'s VMA.  Therefore, __do_fault() does not need to worry about
      clearing PageMlocked() on the old page.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mlock: fix race when munlocking pages in do_wp_page() · e15f8c01
      Authored by Michel Lespinasse
      vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
      set the PageMlocked bit on these pages, transferring them onto the
      unevictable list.  When do_wp_page() breaks COW within a VM_LOCKED vma,
      it may need to clear PageMlocked on the old page and set it on the new
      page instead.
      
      This change fixes an issue where do_wp_page() was clearing PageMlocked
      on the old page while the pte was still pointing to it (as well as
      rmap).  Therefore, we were not protected against vmscan immediately
      transferring the old page back onto the unevictable list.  This could
      cause pages to get stranded there forever.
      
      I propose to move the corresponding code to the end of do_wp_page(),
      after the pte (and rmap) have been pointed to the new page.
      Additionally, we can use munlock_vma_page() instead of
      clear_page_mlock(), so that the old page stays mlocked if there are
      still other VM_LOCKED vmas mapping it.
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
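      A hedged sketch of the ordering the commit proposes in do_wp_page() (close
      to, but not literally, the resulting code):

      	/* only after the pte and rmap have been switched over to the new page */
      	if ((vma->vm_flags & VM_LOCKED) && old_page) {
      		lock_page(old_page);		/* munlock_vma_page() expects a locked page */
      		munlock_vma_page(old_page);	/* leaves the page mlocked if another
      						 * VM_LOCKED vma still maps it */
      		unlock_page(old_page);
      	}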
    • memblock: don't adjust size in memblock_find_base() · e6d2e2b2
      Authored by Yinghai Lu
      While applying a patch that uses memblock to find the aperture for 64-bit
      x86, Ingo found a system with 1G of RAM and force_iommu that reported:
      
      > No AGP bridge found
      > Node 0: aperture @ 38000000 size 32 MB
      > Aperture pointing to e820 RAM. Ignoring.
      > Your BIOS doesn't leave a aperture memory hole
      > Please enable the IOMMU option in the BIOS setup
      > This costs you 64 MB of RAM
      > Cannot allocate aperture memory hole (0,65536K)
      
      the corresponding code:
      
      	addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
      	if (addr == MEMBLOCK_ERROR || addr + aper_size > 0xffffffff) {
      		printk(KERN_ERR
      			"Cannot allocate aperture memory hole (%lx,%uK)\n",
      				addr, aper_size>>10);
      		return 0;
      	}
      	memblock_x86_reserve_range(addr, addr + aper_size, "aperture64")
      
      fails because the memblock core code aligns the size to 512M, which can
      make the size far too big.
      
      So don't align the size in that case.
      
      The other caller, __memblock_alloc_base(), already aligns the size before
      calling this function.
      
      BTW, x86 does not use __memblock_alloc_base...
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: David Miller <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
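      A hedged sketch of the change as the commit text describes it (the helper
      name memblock_align_up() and its placement are assumptions reconstructed
      from that era's mm/memblock.c):

      	/* memblock_find_base(), before the fix: */
      	size = memblock_align_up(size, align);	/* rounded a 32 MB aperture request
      						 * up to the 512 MB alignment, making
      						 * it impossible to satisfy below 4G */

      	/* after the fix the size is passed through unmodified; only candidate
      	 * base addresses are aligned while searching downward */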
  9. 03 February 2011 (11 commits)
  10. 02 February 2011 (1 commit)
  11. 28 January 2011 (2 commits)
  12. 26 January 2011 (6 commits)