1. 23 Mar 2011, 5 commits
    • oom: skip zombies when iterating tasklist · 30e2b41f
      Authored by Andrey Vagin
      We shouldn't defer oom killing if a thread has already detached its ->mm
      and still has TIF_MEMDIE set.  Memory needs to be freed, so kill other
      threads that pin the same ->mm or find another task to kill.
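
      A minimal sketch of the idea; victim, victim_mm, and the locking
      context are illustrative here, not the exact upstream code:

          struct task_struct *p;

          /* Runs under tasklist protection in the real oom killer. */
          for_each_process(p) {
                  /* Kill any other thread group still pinning ->mm. */
                  if (p->mm == victim_mm && !same_thread_group(p, victim))
                          force_sig(SIGKILL, p); /* 2.6.38-era signature */
          }
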
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: prevent unnecessary oom kills or kernel panics · 3a5dda7a
      Authored by David Rientjes
      This patch prevents unnecessary oom kills or kernel panics by reverting
      two commits:
      
      	495789a5 (oom: make oom_score to per-process value)
      	cef1d352 (oom: multi threaded process coredump don't make deadlock)
      
      First, 495789a5 (oom: make oom_score to per-process value) ignores the
      fact that all threads in a thread group do not necessarily exit at the
      same time.
      
      It is imperative that select_bad_process() detect threads that are in the
      exit path, specifically those with PF_EXITING set, to prevent needlessly
      killing additional tasks.  If a process is oom killed and the thread group
      leader exits, select_bad_process() cannot detect the other threads that
      are PF_EXITING by iterating over only processes.  Thus, it currently
      chooses another task unnecessarily for oom kill or panics the machine when
      nothing else is eligible.
      
      By iterating over threads instead, it is possible to detect threads that
      are exiting and nominate them for oom kill so they get access to memory
      reserves.
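
      A condensed sketch of that detection (the early-return convention
      mirrors select_bad_process() but is simplified):

          struct task_struct *g, *p;

          do_each_thread(g, p) {
                  if (p->flags & PF_EXITING) {
                          /* Exiting thread will free memory soon: give it
                           * reserves and report that no new kill is needed. */
                          set_tsk_thread_flag(p, TIF_MEMDIE);
                          return ERR_PTR(-1UL);
                  }
          } while_each_thread(g, p);
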
      
      Second, cef1d352 (oom: multi threaded process coredump don't make
      deadlock) erroneously avoids making the oom killer a no-op when an
      eligible thread other than current is found to be exiting.  We want to
      detect this situation so that we may allow that exiting thread time to
      exit and free its memory; if it is able to exit on its own, that should
      free memory so current is no longer oom.  If it is not able to exit on its
      own, the oom killer will nominate it for oom kill which, in this case,
      only means it will get access to memory reserves.
      
      Without this change, it is easy for the oom killer to unnecessarily target
      tasks when all threads of a victim don't exit before the thread group
      leader or, in the worst case, panic the machine.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: <stable@kernel.org>		[2.6.38.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: swap: unlock swapfile inode mutex before closing file on bad swapfiles · 52c50567
      Authored by Mel Gorman
      If an administrator tries to swapon a file backed by NFS, the inode mutex
      is taken (as it is for any swapfile), but the file is later identified as
      a bad swapfile due to the lack of bmap, and cleanup begins. During
      cleanup, an attempt is made to close the file while inode->i_mutex is
      still held. Closing an NFS file syncs it, which tries to acquire the
      inode mutex, leading to deadlock. If lockdep is enabled, the following
      appears on the console:
      
          =============================================
          [ INFO: possible recursive locking detected ]
          2.6.38-rc8-autobuild #1
          ---------------------------------------------
          swapon/2192 is trying to acquire lock:
           (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c
      
          but task is already holding lock:
           (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
      
          other info that might help us debug this:
          1 lock held by swapon/2192:
           #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7
      
          stack backtrace:
          Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
          Call Trace:
              __lock_acquire+0x2eb/0x1623
              find_get_pages_tag+0x14a/0x174
              pagevec_lookup_tag+0x25/0x2e
              vfs_fsync_range+0x47/0x7c
              lock_acquire+0xd3/0x100
              vfs_fsync_range+0x47/0x7c
              nfs_flush_one+0x0/0xdf [nfs]
              mutex_lock_nested+0x40/0x2b1
              vfs_fsync_range+0x47/0x7c
              vfs_fsync_range+0x47/0x7c
              vfs_fsync+0x1c/0x1e
              nfs_file_flush+0x64/0x69 [nfs]
              filp_close+0x43/0x72
              sys_swapon+0xa39/0xae7
              sysret_check+0x2e/0x69
              system_call_fastpath+0x16/0x1b
      
      This patch releases the mutex if it is held before calling filp_close(),
      so swapon fails as expected without deadlock when the swapfile is backed
      by NFS.  If accepted for 2.6.39, it should also be considered a -stable
      candidate for 2.6.38 and 2.6.37.
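
      The shape of the fix, roughly (a did_down-style flag tracking whether
      the mutex was taken; names follow sys_swapon but the context is
      condensed):

          bad_swap_2:
                  if (did_down) {
                          /* Drop i_mutex before filp_close(): closing an
                           * NFS file syncs it, which would retake the
                           * mutex and deadlock. */
                          mutex_unlock(&inode->i_mutex);
                          did_down = 0;
                  }
                  filp_close(swap_file, NULL);
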
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: <stable@kernel.org>		[2.6.37+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: Add statistics for this_cmpxchg_double failures · 4fdccdfb
      Authored by Christoph Lameter
      Add some statistics for debugging.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: Add missing irq restore for the OOM path · 2fd66c51
      Authored by Christoph Lameter
      OOM path is missing the irq restore in the CONFIG_CMPXCHG_LOCAL case.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  2. 21 Mar 2011, 1 commit
  3. 18 Mar 2011, 4 commits
  4. 15 Mar 2011, 2 commits
  5. 14 Mar 2011, 3 commits
  6. 12 Mar 2011, 3 commits
  7. 11 Mar 2011, 2 commits
    • Lockless (and preemptless) fastpaths for slub · 8a5ec0ba
      Authored by Christoph Lameter
      Use the this_cpu_cmpxchg_double functionality to implement a lockless
      allocation algorithm on arches that support fast this_cpu_ops.
      
      Each of the per cpu pointers is paired with a transaction id that ensures
      that updates of the per cpu information can only occur in sequence on
      a certain cpu.
      
      A transaction id is a "long" integer comprised of an event number and the
      cpu number. The event number is incremented for every change to the per
      cpu state. The cmpxchg instruction can therefore verify, for an update,
      that nothing interfered, that we are updating the per cpu structure of
      the processor where we picked up the information, and that we are still
      on that processor when we apply the update.
      
      This results in a significant decrease of the overhead in the fastpaths. It
      also makes it easy to adopt the fast path for realtime kernels since this
      is lockless and does not require the use of the current per cpu area
      over the critical section. It is only important that the per cpu area is
      current at the beginning of the critical section and at the end.
      
      So there is no need even to disable preemption.
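
      A sketch of the tid scheme (identifiers follow slub's style but are
      illustrative here):

          /* tid: event counter in the high bits, cpu number in the low bits. */
          #define TID_STEP roundup_pow_of_two(CONFIG_NR_CPUS)

          static inline unsigned long next_tid(unsigned long tid)
          {
                  return tid + TID_STEP;  /* bump the event, keep the cpu bits */
          }

          /* Allocation fastpath: redo the reads if either the freelist
           * changed or we migrated to another cpu in between. */
          do {
                  tid = c->tid;
                  object = c->freelist;
          } while (!this_cpu_cmpxchg_double(
                          s->cpu_slab->freelist, s->cpu_slab->tid,
                          object, tid,
                          get_freepointer(s, object), next_tid(tid)));
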
      
      Test results show that the fastpath cycle count is reduced by up to ~ 40%
      (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
      adds a few cycles.
      
      Sadly this does nothing for the slowpath, which is where the main
      performance issues with slub are, but the best case performance rises
      significantly.  (For that, see the more complex slub patches that
      require cmpxchg_double.)
      
      Kmalloc: alloc/free test
      
      Before:
      
      10000 times kmalloc(8)/kfree -> 134 cycles
      10000 times kmalloc(16)/kfree -> 152 cycles
      10000 times kmalloc(32)/kfree -> 144 cycles
      10000 times kmalloc(64)/kfree -> 142 cycles
      10000 times kmalloc(128)/kfree -> 142 cycles
      10000 times kmalloc(256)/kfree -> 132 cycles
      10000 times kmalloc(512)/kfree -> 132 cycles
      10000 times kmalloc(1024)/kfree -> 135 cycles
      10000 times kmalloc(2048)/kfree -> 135 cycles
      10000 times kmalloc(4096)/kfree -> 135 cycles
      10000 times kmalloc(8192)/kfree -> 144 cycles
      10000 times kmalloc(16384)/kfree -> 754 cycles
      
      After:
      
      10000 times kmalloc(8)/kfree -> 78 cycles
      10000 times kmalloc(16)/kfree -> 78 cycles
      10000 times kmalloc(32)/kfree -> 82 cycles
      10000 times kmalloc(64)/kfree -> 88 cycles
      10000 times kmalloc(128)/kfree -> 79 cycles
      10000 times kmalloc(256)/kfree -> 79 cycles
      10000 times kmalloc(512)/kfree -> 85 cycles
      10000 times kmalloc(1024)/kfree -> 82 cycles
      10000 times kmalloc(2048)/kfree -> 82 cycles
      10000 times kmalloc(4096)/kfree -> 85 cycles
      10000 times kmalloc(8192)/kfree -> 82 cycles
      10000 times kmalloc(16384)/kfree -> 706 cycles
      
      Kmalloc: Repeatedly allocate then free test
      
      Before:
      
      10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles
      10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles
      10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles
      10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles
      10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles
      10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles
      10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles
      10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles
      10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles
      10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles
      10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles
      10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles
      
      After:
      
      10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles
      10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles
      10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles
      10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles
      10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles
      10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles
      10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles
      10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles
      10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles
      10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles
      10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles
      10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slub: Get rid of slab_free_hook_irq() · d3f661d6
      Authored by Christoph Lameter
      The following patch will make the fastpaths lockless and will no longer
      require interrupts to be disabled. Calling the free hook with irq disabled
      will no longer be possible.
      
      Move the slab_free_hook_irq() logic into slab_free_hook().  Only disable
      interrupts if features are selected that require callbacks with
      interrupts off, and re-enable them after the calls have been made.
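
      Condensed, the merged hook looks roughly like this (the exact debug
      calls depend on the configured options):

          static inline void slab_free_hook(struct kmem_cache *s, void *x)
          {
                  kmemleak_free_recursive(x, s->flags);
          #if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
                  {
                          unsigned long flags;

                          /* These debug callbacks still expect irqs off, so
                           * disable locally instead of relying on the (now
                           * lockless) caller. */
                          local_irq_save(flags);
                          kmemcheck_slab_free(s, x, s->objsize);
                          debug_check_no_locks_freed(x, s->objsize);
                          local_irq_restore(flags);
                  }
          #endif
          }
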
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  8. 05 Mar 2011, 3 commits
  9. 01 Mar 2011, 1 commit
  10. 27 Feb 2011, 1 commit
  11. 26 Feb 2011, 6 commits
    • mm: Move early_node_map[] reverse scan helpers under HAVE_MEMBLOCK · cc289894
      Authored by Yinghai Lu
      Heiko found that a recent memblock change triggers these warnings on s390:
      
        mm/page_alloc.c:3623:22: warning: 'last_active_region_index_in_nid' defined but not used
        mm/page_alloc.c:3638:22: warning: 'previous_active_region_index_in_nid' defined but not used
      
      Move those two functions under HAVE_MEMBLOCK with their only user,
      find_memory_core_early().
      
      -tj: Minor updates to description.
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • memcg: more mem_cgroup_uncharge() batching · e5598f8b
      Authored by Hugh Dickins
      It seems odd that truncate_inode_pages_range(), called not only when
      truncating but also when evicting inodes, has mem_cgroup_uncharge_start()
      and _end() batching in its second loop, which clears up a few leftovers,
      but not in its first loop, which does almost all the work: add them there
      too.
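
      The change amounts to bracketing the first loop the same way the
      second already is (condensed sketch of the pagevec loop):

          mem_cgroup_uncharge_start();
          for (i = 0; i < pagevec_count(&pvec); i++) {
                  struct page *page = pvec.pages[i];
                  /* ... index checks and trylock as before ... */
                  truncate_inode_page(mapping, page);
                  unlock_page(page);
          }
          mem_cgroup_uncharge_end();
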
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: fix interleaving for transparent hugepages · 8eac563c
      Authored by Andi Kleen
      The THP code didn't pass the correct interleaving shift to the memory
      policy code.  Fix this here by adjusting for the order.
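
      In effect (sketch; the surrounding alloc_pages_vma() context is
      elided), the interleave node is computed with the order folded into
      the shift, so interleaving advances per huge page rather than per
      base page:

          nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
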
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix dubious code in __count_immobile_pages() · 29723fcc
      Authored by Namhyung Kim
      When pfn_valid_within() failed, 'iter' was incremented twice.
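
      The problem in miniature (illustrative loop; the real code is in
      __count_immobile_pages()):

          for (iter = 0; iter < pageblock_nr_pages; iter++) {
                  unsigned long check = pfn + iter;

                  if (!pfn_valid_within(check)) {
                          iter++;   /* bug: the for() also advances iter,
                                     * so each invalid pfn skipped its
                                     * successor too; the fix drops this */
                          continue;
                  }
                  /* ... examine the page ... */
          }
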
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT · 2876592f
      Authored by Mel Gorman
      should_continue_reclaim() for reclaim/compaction allows scanning to
      continue even if pages are not being reclaimed until the full list is
      scanned.  In terms of allocation success, this makes sense but potentially
      it introduces unwanted latency for high-order allocations such as
      transparent hugepages and network jumbo frames that would prefer to fail
      the allocation attempt and fallback to order-0 pages.  Worse, there is a
      potential that the full LRU scan will clear all the young bits, distort
      page aging information and potentially push pages into swap that would
      have otherwise remained resident.
      
      This patch will stop reclaim/compaction if no pages were reclaimed in the
      last SWAP_CLUSTER_MAX pages that were considered.  For allocations such as
      hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
      LRU list may still be scanned.
      
      Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
      is not set so the following avoids the gfp_mask being examined:
      
              if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                      return false;
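
      The added bail-out looks roughly like this (paraphrased from
      should_continue_reclaim(); nr_reclaimed and nr_scanned are the counts
      since reclaim/compaction started):

          if (sc->gfp_mask & __GFP_REPEAT) {
                  /* Caller really wants to succeed: only stop once a
                   * full LRU scan reclaims and scans nothing. */
                  if (!nr_reclaimed && !nr_scanned)
                          return false;
          } else {
                  /* Caller can fall back to order-0: stop as soon as a
                   * SWAP_CLUSTER_MAX batch reclaims nothing. */
                  if (!nr_reclaimed)
                          return false;
          }
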
      
      A tool was developed based on ftrace that tracked the latency of
      high-order allocations while transparent hugepage support was enabled and
      three benchmarks were run.  The "fix-infinite" figures are 2.6.38-rc4 with
      Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
      applied.
      
        STREAM Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            10298           10229
        1 :: Min             0.4560          0.4640
        1 :: Mean            1.0589          1.0183
        1 :: Max            14.5990         11.7510
        1 :: Stddev          0.5208          0.4719
        2 :: Count                2               1
        2 :: Min             1.8610          3.7240
        2 :: Mean            3.4325          3.7240
        2 :: Max             5.0040          3.7240
        2 :: Stddev          1.5715          0.0000
        9 :: Count           111696          111694
        9 :: Min             0.5230          0.4110
        9 :: Mean           10.5831         10.5718
        9 :: Max            38.4480         43.2900
        9 :: Stddev          1.1147          1.1325
      
      Mean time for order-1 allocations is reduced.  order-2 looks increased but
      with so few allocations, it's not particularly significant.  THP mean
      allocation latency is also reduced.  That said, allocation time varies so
      significantly that the reductions are within noise.
      
      Max allocation time is reduced by a significant amount for low-order
      allocations but increased for THP allocations, which presumably are now
      breaking off before reclaim has done enough work.
      
        SysBench Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            15745           15677
        1 :: Min             0.4250          0.4550
        1 :: Mean            1.1023          1.0810
        1 :: Max            14.4590         10.8220
        1 :: Stddev          0.5117          0.5100
        2 :: Count                1               1
        2 :: Min             3.0040          2.1530
        2 :: Mean            3.0040          2.1530
        2 :: Max             3.0040          2.1530
        2 :: Stddev          0.0000          0.0000
        9 :: Count             2017            1931
        9 :: Min             0.4980          0.7480
        9 :: Mean           10.4717         10.3840
        9 :: Max            24.9460         26.2500
        9 :: Stddev          1.1726          1.1966
      
      Again, mean time for order-1 allocations is reduced while order-2
      allocations are too few to draw conclusions from.  The mean time for THP
      allocations is also slightly reduced, albeit the reductions are within
      variance.
      
      Once again, our maximum allocation time is significantly reduced for
      low-order allocations and slightly increased for THP allocations.
      
        Anon stream mmap reference Highorder Allocation Latency Statistics
        1 :: Count             1376            1790
        1 :: Min             0.4940          0.5010
        1 :: Mean            1.0289          0.9732
        1 :: Max             6.2670          4.2540
        1 :: Stddev          0.4142          0.2785
        2 :: Count                1               -
        2 :: Min             1.9060               -
        2 :: Mean            1.9060               -
        2 :: Max             1.9060               -
        2 :: Stddev          0.0000               -
        9 :: Count            11266           11257
        9 :: Min             0.4990          0.4940
        9 :: Mean        27250.4669      24256.1919
        9 :: Max      11439211.0000    6008885.0000
        9 :: Stddev     226427.4624     186298.1430
      
      This benchmark creates one thread per CPU which references an amount of
      anonymous memory 1.5 times the size of physical RAM.  This pounds swap
      quite heavily and is intended to exercise THP a bit.
      
      Mean allocation time for order-1 is reduced as before.  It's also reduced
      for THP allocations but the variations here are pretty massive due to
      swap.  As before, maximum allocation times are significantly reduced.
      
      Overall, the patch reduces the mean and maximum allocation latencies for
      the smaller high-order allocations.  This was with Slab configured, so
      it would be expected to be more significant with Slub, which uses these
      allocation sizes more aggressively.
      
      The mean allocation times for THP allocations are also slightly reduced.
      The maximum latency was slightly increased as predicted by the comments
      due to reclaim/compaction breaking early.  However, workloads care more
      about the latency of lower-order allocations than THP so it's an
      acceptable trade-off.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: grab rcu read lock in move_pages() · a879bf58
      Authored by Greg Thelen
      The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
      prevent free_pid() from reclaiming the pid.
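
      The fix wraps the task lookup and pinning in an rcu read-side
      critical section, roughly:

          rcu_read_lock();
          task = pid ? find_task_by_vpid(pid) : current;
          if (!task) {
                  rcu_read_unlock();
                  return -ESRCH;
          }
          mm = get_task_mm(task);  /* pin the mm before dropping rcu */
          rcu_read_unlock();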
      
      Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
      with:
      
        CONFIG_LOCKUP_DETECTOR=y
        CONFIG_PREEMPT=y
        CONFIG_LOCKDEP=y
        CONFIG_PROVE_LOCKING=y
        CONFIG_PROVE_RCU=y
      
      Previously, migrate_pages() went through a similar transformation
      replacing usage of tasklist_lock with rcu read lock:
      
        commit 55cfaa3c
        Author: Zeng Zhaoming <zengzm.kernel@gmail.com>
        Date:   Thu Dec 2 14:31:13 2010 -0800
      
            mm/mempolicy.c: add rcu read lock to protect pid structure
      
        commit 1e50df39
        Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
        Date:   Thu Jan 13 15:46:14 2011 -0800
      
            mempolicy: remove tasklist_lock from migrate_pages
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Zeng Zhaoming <zengzm.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 25 Feb 2011, 1 commit
  13. 24 Feb 2011, 5 commits
    • bootmem: Move __alloc_memory_core_early() to nobootmem.c · 8bc1f91e
      Authored by Yinghai Lu
      Now that bootmem.c and nobootmem.c are separate, there's no reason to
      define __alloc_memory_core_early(), which is used only by nobootmem,
      inside #ifdef in page_alloc.c.  Move it to nobootmem.c and make it
      static.
      
      This patch doesn't introduce any behavior change.
      
      -tj: Updated commit description.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • bootmem: Move contig_page_data definition to bootmem.c/nobootmem.c · e782ab42
      Authored by Yinghai Lu
      Now that bootmem.c and nobootmem.c are separate, it's cleaner to
      define contig_page_data in each file than in page_alloc.c with #ifdef.
      Move it.
      
      This patch doesn't introduce any behavior change.
      
      -v2: According to Andrew, fixed the struct layout.
      -tj: Updated commit description.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • bootmem: Separate out CONFIG_NO_BOOTMEM code into nobootmem.c · 09325873
      Authored by Yinghai Lu
      mm/bootmem.c contained code paths for both the bootmem and no-bootmem
      configurations.  They implement about the same set of APIs in different
      ways, and as a result bootmem.c contained a massive amount of #ifdef
      CONFIG_NO_BOOTMEM.
      
      Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c.  As the
      common part is relatively small, duplicate them in nobootmem.c instead
      of creating a common file or ifdef'ing in bootmem.c.
      
      The following are duplicated.
      
      * {min|max}_low_pfn, max_pfn, saved_max_pfn
      * free_bootmem_late()
      * ___alloc_bootmem()
      * __alloc_bootmem_low()
      
      The following are applicable only to nobootmem and are moved verbatim.
      
      * __free_pages_memory()
      * free_all_memory_core_early()
      
      The following are not applicable to nobootmem and are omitted in
      nobootmem.c.
      
      * reserve_bootmem_node()
      * reserve_bootmem()
      
      The remaining functions have their bodies split according to
      CONFIG_NO_BOOTMEM.
      
      The Makefile is updated so that either bootmem.c or nobootmem.c is
      built, according to CONFIG_NO_BOOTMEM.
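
      Plausibly, that selection reduces to something like:

          ifdef CONFIG_NO_BOOTMEM
                  obj-y += nobootmem.o
          else
                  obj-y += bootmem.o
          endif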
      
      This patch doesn't introduce any behavior change.
      
      -tj: Rewrote commit description.
      Suggested-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • mm: fix possible cause of a page_mapped BUG · a3e8cc64
      Authored by Hugh Dickins
      Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
      a hole with madvise(,, MADV_REMOVE).  That path is under mutex, and
      cannot be explained by lack of serialization in unmap_mapping_range().
      
      Reviewing the code, I found one place where vm_truncate_count handling
      should have been updated, when I switched at the last minute from one
      way of managing the restart_addr to another: mremap move changes the
      virtual addresses, so it ought to adjust the restart_addr.
      
      But rather than exporting the notion of restart_addr from memory.c, or
      converting to restart_pgoff throughout, simply reset vm_truncate_count
      to 0 to force a rescan if mremap move races with preempted truncation.
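
      The fix itself is tiny; in the mremap move path (sketch, context
      elided, field name as in 2.6.38):

          /* Addresses moved, so any saved truncation restart point is
           * stale: force unmap_mapping_range() to rescan this vma. */
          new_vma->vm_truncate_count = 0;
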
      
      We have no confirmation that this fixes Robert's BUG,
      but it is a fix that's worth making anyway.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: prevent concurrent unmap_mapping_range() on the same inode · 2aa15890
      Authored by Miklos Szeredi
      Michael Leun reported that running parallel opens on a fuse filesystem
      can trigger a "kernel BUG at mm/truncate.c:475".
      
      Gurudas Pai reported the same bug on NFS.
      
      The reason is that unmap_mapping_range() is not prepared for more than
      one concurrent invocation per inode.  For example:
      
        thread1: going through a big range, stops in the middle of a vma and
           stores the restart address in vm_truncate_count.
      
        thread2: comes in with a small (e.g. single page) unmap request on
           the same vma, somewhere before restart_address, finds that the
           vma was already unmapped up to the restart address and happily
           returns without doing anything.
      
      Another scenario would be two big unmap requests, both having to
      restart the unmapping and each one setting vm_truncate_count to its
      own value.  This could go on forever without any of them being able to
      finish.
      
      Truncate and hole punching already serialize with i_mutex.  Other
      callers of unmap_mapping_range() do not, and it's difficult to get
      i_mutex protection for all callers.  In particular ->d_revalidate(),
      which calls invalidate_inode_pages2_range() in fuse, may be called
      with or without i_mutex.
      
      This patch adds a new mutex to 'struct address_space' to prevent
      running multiple concurrent unmap_mapping_range() on the same mapping.
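
      Condensed, the change is a mutex in struct address_space that
      serializes the whole operation (field placement and initialization
      elided):

          struct address_space {
                  /* ... */
                  struct mutex unmap_mutex; /* serialize unmap_mapping_range() */
          };

          void unmap_mapping_range(struct address_space *mapping,
                                   loff_t const holebegin,
                                   loff_t const holelen, int even_cows)
          {
                  mutex_lock(&mapping->unmap_mutex);
                  /* ... walk and unmap affected vmas, with restarts ... */
                  mutex_unlock(&mapping->unmap_mutex);
          }
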
      
      [ We'll hopefully get rid of all this with the upcoming mm
        preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex
        lockbreak" patch in particular.  But that is for 2.6.39 ]
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Reported-by: Michael Leun <lkml20101129@newton.leun.net>
      Reported-by: Gurudas Pai <gurudas.pai@oracle.com>
      Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 23 Feb 2011, 1 commit
  15. 17 Feb 2011, 1 commit
  16. 16 Feb 2011, 1 commit