1. 25 5月, 2010 5 次提交
    • M
      mm: remove return value of putback_lru_pages() · e13861d8
      Minchan Kim 提交于
      putback_lru_page() never can fail.  So it doesn't matter count of "the
      number of pages put back".
      
      In addition, users of this functions don't use return value.
      
      Let's remove unnecessary code.
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e13861d8
    • H
      shmem: remove redundant code · 4b50dc26
      Huang Shijie 提交于
      prep_new_page() will call set_page_private(page, 0) to initialise the
      page, so the code is redundant.
      Signed-off-by: NHuang Shijie <shijie8@gmail.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b50dc26
    • Y
      sparsemem: on no vmemmap path put mem_map on node high too · e48e67e0
      Yinghai Lu 提交于
      We need to put mem_map high when virtual memmap is not used.
      
      before this patch
      free mem pfn range on first node:
      [    0.000000]  19 - 1f
      [    0.000000]  28 40 - 80 95
      [    0.000000]  702 740 - 1000 1000
      [    0.000000]  347c - 347e
      [    0.000000]  34e7 3500 - 3b80 3b8b
      [    0.000000]  73b8b 73bc0 - 73c00 73c00
      [    0.000000]  73ddd - 73e00
      [    0.000000]  73fdd - 74000
      [    0.000000]  741dd - 74200
      [    0.000000]  743dd - 74400
      [    0.000000]  745dd - 74600
      [    0.000000]  747dd - 74800
      [    0.000000]  749dd - 74a00
      [    0.000000]  74bdd - 74c00
      [    0.000000]  74ddd - 74e00
      [    0.000000]  74fdd - 75000
      [    0.000000]  751dd - 75200
      [    0.000000]  753dd - 75400
      [    0.000000]  755dd - 75600
      [    0.000000]  757dd - 75800
      [    0.000000]  759dd - 75a00
      [    0.000000]  79bdd 79c00 - 7d540 7d550
      [    0.000000]  7f745 - 7f750
      [    0.000000]  10000b 100040 - 2080000 2080000
      so only 79c00 - 7d540 are major free block under 4g...
      
      after this patch, we will get
      [    0.000000]  19 - 1f
      [    0.000000]  28 40 - 80 95
      [    0.000000]  702 740 - 1000 1000
      [    0.000000]  347c - 347e
      [    0.000000]  34e7 3500 - 3600 3600
      [    0.000000]  37dd - 3800
      [    0.000000]  39dd - 3a00
      [    0.000000]  3bdd - 3c00
      [    0.000000]  3ddd - 3e00
      [    0.000000]  3fdd - 4000
      [    0.000000]  41dd - 4200
      [    0.000000]  43dd - 4400
      [    0.000000]  45dd - 4600
      [    0.000000]  47dd - 4800
      [    0.000000]  49dd - 4a00
      [    0.000000]  4bdd - 4c00
      [    0.000000]  4ddd - 4e00
      [    0.000000]  4fdd - 5000
      [    0.000000]  51dd - 5200
      [    0.000000]  53dd - 5400
      [    0.000000]  95dd 9600 - 7d540 7d550
      [    0.000000]  7f745 - 7f750
      [    0.000000]  17000b 170040 - 2080000 2080000
      we will have 9600 - 7d540 for major free block...
      
      sparse-vmemmap path already used __alloc_bootmem_node_high()
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e48e67e0
    • C
      page allocator: reduce fragmentation in buddy allocator by adding buddies that... · 6dda9d55
      Corrado Zoccolo 提交于
      page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists
      
      In order to reduce fragmentation, this patch classifies freed pages in two
      groups according to their probability of being part of a high order merge.
       Pages belonging to a compound whose next-highest buddy is free are more
      likely to be part of a high order merge in the near future, so they will
      be added at the tail of the freelist.  The remaining pages are put at the
      front of the freelist.
      
      In this way, the pages that are more likely to cause a big merge are kept
      free longer.  Consequently there is a tendency to aggregate the
      long-living allocations on a subset of the compounds, reducing the
      fragmentation.
      
      This heuristic was tested on three machines, x86, x86-64 and ppc64 with
      3GB of RAM in each machine.  The tests were kernbench, netperf, sysbench
      and STREAM for performance and a high-order stress test for huge page
      allocations.
      
      KernBench X86
      Elapsed mean     374.77 ( 0.00%)   375.10 (-0.09%)
      User    mean     649.53 ( 0.00%)   650.44 (-0.14%)
      System  mean      54.75 ( 0.00%)    54.18 ( 1.05%)
      CPU     mean     187.75 ( 0.00%)   187.25 ( 0.27%)
      
      KernBench X86-64
      Elapsed mean      94.45 ( 0.00%)    94.01 ( 0.47%)
      User    mean     323.27 ( 0.00%)   322.66 ( 0.19%)
      System  mean      36.71 ( 0.00%)    36.50 ( 0.57%)
      CPU     mean     380.75 ( 0.00%)   381.75 (-0.26%)
      
      KernBench PPC64
      Elapsed mean     173.45 ( 0.00%)   173.74 (-0.17%)
      User    mean     587.99 ( 0.00%)   587.95 ( 0.01%)
      System  mean      60.60 ( 0.00%)    60.57 ( 0.05%)
      CPU     mean     373.50 ( 0.00%)   372.75 ( 0.20%)
      
      Nothing notable for kernbench.
      
      NetPerf UDP X86
            64    42.68 ( 0.00%)     42.77 ( 0.21%)
           128    85.62 ( 0.00%)     85.32 (-0.35%)
           256   170.01 ( 0.00%)    168.76 (-0.74%)
          1024   655.68 ( 0.00%)    652.33 (-0.51%)
          2048  1262.39 ( 0.00%)   1248.61 (-1.10%)
          3312  1958.41 ( 0.00%)   1944.61 (-0.71%)
          4096  2345.63 ( 0.00%)   2318.83 (-1.16%)
          8192  4132.90 ( 0.00%)   4089.50 (-1.06%)
         16384  6770.88 ( 0.00%)   6642.05 (-1.94%)*
      
      NetPerf UDP X86-64
            64   148.82 ( 0.00%)    154.92 ( 3.94%)
           128   298.96 ( 0.00%)    312.95 ( 4.47%)
           256   583.67 ( 0.00%)    626.39 ( 6.82%)
          1024  2293.18 ( 0.00%)   2371.10 ( 3.29%)
          2048  4274.16 ( 0.00%)   4396.83 ( 2.79%)
          3312  6356.94 ( 0.00%)   6571.35 ( 3.26%)
          4096  7422.68 ( 0.00%)   7635.42 ( 2.79%)*
          8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
         16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
                   1.64%             2.73%
      
      NetPerf UDP PPC64
            64    49.98 ( 0.00%)     50.25 ( 0.54%)
           128    98.66 ( 0.00%)    100.95 ( 2.27%)
           256   197.33 ( 0.00%)    191.03 (-3.30%)
          1024   761.98 ( 0.00%)    785.07 ( 2.94%)
          2048  1493.50 ( 0.00%)   1510.85 ( 1.15%)
          3312  2303.95 ( 0.00%)   2271.72 (-1.42%)
          4096  2774.56 ( 0.00%)   2773.06 (-0.05%)
          8192  4918.31 ( 0.00%)   4793.59 (-2.60%)
         16384  7497.98 ( 0.00%)   7749.52 ( 3.25%)
      
      The tests are run to have confidence limits within 1%.  Results marked
      with a * were not confident although in this case, it's only outside by
      small amounts.  Even with some results that were not confident, the
      netperf UDP results were generally positive.
      
      NetPerf TCP X86
            64   652.25 ( 0.00%)*   648.12 (-0.64%)*
                  23.80%            22.82%
           128  1229.98 ( 0.00%)*  1220.56 (-0.77%)*
                  21.03%            18.90%
           256  2105.88 ( 0.00%)   1872.03 (-12.49%)*
                   1.00%            16.46%
          1024  3476.46 ( 0.00%)*  3548.28 ( 2.02%)*
                  13.37%            11.39%
          2048  4023.44 ( 0.00%)*  4231.45 ( 4.92%)*
                   9.76%            12.48%
          3312  4348.88 ( 0.00%)*  4396.96 ( 1.09%)*
                   6.49%             8.75%
          4096  4726.56 ( 0.00%)*  4877.71 ( 3.10%)*
                   9.85%             8.50%
          8192  4732.28 ( 0.00%)*  5777.77 (18.10%)*
                   9.13%            13.04%
         16384  5543.05 ( 0.00%)*  5906.24 ( 6.15%)*
                   7.73%             8.68%
      
      NETPERF TCP X86-64
                  netperf-tcp-vanilla-netperf       netperf-tcp
                         tcp-vanilla     pgalloc-delay
            64  1895.87 ( 0.00%)*  1775.07 (-6.81%)*
                   5.79%             4.78%
           128  3571.03 ( 0.00%)*  3342.20 (-6.85%)*
                   3.68%             6.06%
           256  5097.21 ( 0.00%)*  4859.43 (-4.89%)*
                   3.02%             2.10%
          1024  8919.10 ( 0.00%)*  8892.49 (-0.30%)*
                   5.89%             6.55%
          2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
                   7.08%             7.44%
          3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
                   6.87%             7.33%
          4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
                   6.86%             8.18%
          8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
                   7.49%             5.55%
         16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
                   7.36%             6.49%
      
      NETPERF TCP PPC64
                  netperf-tcp-vanilla-netperf       netperf-tcp
                         tcp-vanilla     pgalloc-delay
            64   594.17 ( 0.00%)    596.04 ( 0.31%)*
                   1.00%             2.29%
           128  1064.87 ( 0.00%)*  1074.77 ( 0.92%)*
                   1.30%             1.40%
           256  1852.46 ( 0.00%)*  1856.95 ( 0.24%)
                   1.25%             1.00%
          1024  3839.46 ( 0.00%)*  3813.05 (-0.69%)
                   1.02%             1.00%
          2048  4885.04 ( 0.00%)*  4881.97 (-0.06%)*
                   1.15%             1.04%
          3312  5506.90 ( 0.00%)   5459.72 (-0.86%)
          4096  6449.19 ( 0.00%)   6345.46 (-1.63%)
          8192  7501.17 ( 0.00%)   7508.79 ( 0.10%)
         16384  9618.65 ( 0.00%)   9490.10 (-1.35%)
      
      There was a distinct lack of confidence in the X86* figures so I included
      what the devation was where the results were not confident.  Many of the
      results, whether gains or losses were within the standard deviation so no
      solid conclusion can be reached on performance impact.  Looking at the
      figures, only the X86-64 ones look suspicious with a few losses that were
      outside the noise.  However, the results were so unstable that without
      knowing why they vary so much, a solid conclusion cannot be reached.
      
      SYSBENCH X86
                    sysbench-vanilla     pgalloc-delay
                 1  7722.85 ( 0.00%)  7756.79 ( 0.44%)
                 2 14901.11 ( 0.00%) 13683.44 (-8.90%)
                 3 15171.71 ( 0.00%) 14888.25 (-1.90%)
                 4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
                 5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
                 6 14870.33 ( 0.00%) 14845.57 (-0.17%)
                 7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
                 8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
      
      SYSBENCH X86-64
                 1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
                 2 34276.39 ( 0.00%) 34251.00 (-0.07%)
                 3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
                 4 66667.10 ( 0.00%) 66174.69 (-0.74%)
                 5 66003.91 ( 0.00%) 65685.25 (-0.49%)
                 6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
                 7 64933.16 ( 0.00%) 64379.23 (-0.86%)
                 8 63353.30 ( 0.00%) 63281.22 (-0.11%)
                 9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
                10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
                11 62092.81 ( 0.00%) 61787.75 (-0.49%)
                12 61330.11 ( 0.00%) 61036.34 (-0.48%)
                13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
                14 62304.48 ( 0.00%) 62064.90 (-0.39%)
                15 63296.48 ( 0.00%) 62875.16 (-0.67%)
                16 63951.76 ( 0.00%) 63769.09 (-0.29%)
      
      SYSBENCH PPC64
                                   -sysbench-pgalloc-delay-sysbench
                    sysbench-vanilla     pgalloc-delay
                 1  7645.08 ( 0.00%)  7467.43 (-2.38%)
                 2 14856.67 ( 0.00%) 14558.73 (-2.05%)
                 3 21952.31 ( 0.00%) 21683.64 (-1.24%)
                 4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
                 5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
                 6 27477.10 ( 0.00%) 27337.45 (-0.51%)
                 7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
                 8 26642.91 ( 0.00%) 25274.33 (-5.41%)
                 9 25137.27 ( 0.00%) 24810.06 (-1.32%)
                10 24451.99 ( 0.00%) 24275.85 (-0.73%)
                11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
                12 24234.81 ( 0.00%) 23640.89 (-2.51%)
                13 24577.75 ( 0.00%) 24433.50 (-0.59%)
                14 25640.19 ( 0.00%) 25116.52 (-2.08%)
                15 26188.84 ( 0.00%) 26181.36 (-0.03%)
                16 26782.37 ( 0.00%) 26255.99 (-2.00%)
      
      Again, there is little to conclude here.  While there are a few losses,
      the results vary by +/- 8% in some cases.  They are the results of most
      concern as there are some large losses but it's also within the variance
      typically seen between kernel releases.
      
      The STREAM results varied so little and are so verbose that I didn't
      include them here.
      
      The final test stressed how many huge pages can be allocated.  The
      absolute number of huge pages allocated are the same with or without the
      page.  However, the "unusability free space index" which is a measure of
      external fragmentation was slightly lower (lower is better) throughout the
      lifetime of the system.  I also measured the latency of how long it took
      to successfully allocate a huge page.  The latency was slightly lower and
      on X86 and PPC64, more huge pages were allocated almost immediately from
      the free lists.  The improvement is slight but there.
      
      [mel@csn.ul.ie: Tested, reworked for less branches]
      [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dda9d55
    • K
      tmpfs: insert tmpfs cache pages to inactive list at first · e9d6c157
      KOSAKI Motohiro 提交于
      Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer.
      This is regression of caused by commit 9ff473b9 ("vmscan: evict
      streaming IO first").  Wow, It is 2 years old patch!
      
      Currently, tmpfs file cache is inserted active list at first.  This means
      that the insertion doesn't only increase numbers of pages in anon LRU, but
      it also reduces anon scanning ratio.  Therefore, vmscan will get totally
      confused.  It scans almost only file LRU even though the system has plenty
      unused tmpfs pages.
      
      Historically, lru_cache_add_active_anon() was used for two reasons.
      1) Intend to priotize shmem page rather than regular file cache.
      2) Intend to avoid reclaim priority inversion of used once pages.
      
      But we've lost both motivation because (1) Now we have separate anon and
      file LRU list.  then, to insert active list doesn't help such priotize.
      (2) In past, one pte access bit will cause page activation.  then to
      insert inactive list with pte access bit mean higher priority than to
      insert active list.  Its priority inversion may lead to uninteded lru
      chun.  but it was already solved by commit 64574746 (vmscan: detect
      mapped file pages used only once).  (Thanks Hannes, you are great!)
      
      Thus, now we can use lru_cache_add_anon() instead.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: NShaohua Li <shaohua.li@intel.com>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9d6c157
  2. 22 5月, 2010 9 次提交
  3. 20 5月, 2010 3 次提交
  4. 19 5月, 2010 2 次提交
  5. 17 5月, 2010 1 次提交
    • J
      writeback: fix WB_SYNC_NONE writeback from umount · e913fc82
      Jens Axboe 提交于
      When umount calls sync_filesystem(), we first do a WB_SYNC_NONE
      writeback to kick off writeback of pending dirty inodes, then follow
      that up with a WB_SYNC_ALL to wait for it. Since umount already holds
      the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all
      writeback happens as WB_SYNC_ALL. This can greatly slow down umount,
      since WB_SYNC_ALL writeback is a data integrity operation and thus
      a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems
      it's a lot slower.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      e913fc82
  6. 12 5月, 2010 4 次提交
    • K
      memcg: fix css_is_ancestor() RCU locking · 747388d7
      KAMEZAWA Hiroyuki 提交于
      Some callers (in memcontrol.c) calls css_is_ancestor() without
      rcu_read_lock.  Because css_is_ancestor() has to access RCU protected
      data, it should be under rcu_read_lock().
      
      This makes css_is_ancestor() itself does safe access to RCU protected
      area.  (At least, "root" can have refcnt==0 if it's not an ancestor of
      "child".  So, we need rcu_read_lock().)
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747388d7
    • K
      memcg: fix css_id() RCU locking for real · 7f0f1546
      KAMEZAWA Hiroyuki 提交于
      Commit ad4ba375 ("memcg: css_id() must be
      called under rcu_read_lock()") modifies memcontol.c for fixing RCU check
      message.  But Andrew Morton pointed out that the fix doesn't seems sane
      and it was just for hidining lockdep messages.
      
      This is a patch for do proper things.  Checking again, all places,
      accessing without rcu_read_lock, that commit fixies was intentional....
      all callers of css_id() has reference count on it.  So, it's not necessary
      to be under rcu_read_lock().
      
      Considering again, we can use rcu_dereference_check for css_id().  We know
      css->id is valid if css->refcnt > 0.  (css->id never changes and freed
      after css->refcnt going to be 0.)
      
      This patch makes use of rcu_dereference_check() in css_id/depth and remove
      unnecessary rcu-read-lock added by the commit.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f0f1546
    • N
      rmap: remove anon_vma check in page_address_in_vma() · ab941e0f
      Naoya Horiguchi 提交于
      Currently page_address_in_vma() compares vma->anon_vma and
      page_anon_vma(page) for parameter check, but in 2.6.34 a vma can have
      multiple anon_vmas with anon_vma_chain, so current check does not work.
      (For anonymous page shared by multiple processes, some verified (page,vma)
      pairs return -EFAULT wrongly.)
      
      We can go to checking all anon_vmas in the "same_vma" chain, but it needs
      to meet lock requirement.  Instead, we can remove anon_vma check safely
      because page_address_in_vma() assumes that page and vma are already
      checked to belong to the identical process.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab941e0f
    • M
      hugetlbfs: kill applications that use MAP_NORESERVE with SIGBUS instead of OOM-killer · 4a6018f7
      Mel Gorman 提交于
      Ordinarily, application using hugetlbfs will create mappings with
      reserves.  For shared mappings, these pages are reserved before mmap()
      returns success and for private mappings, the caller process is guaranteed
      and a child process that cannot get the pages gets killed with sigbus.
      
      An application that uses MAP_NORESERVE gets no reservations and mmap()
      will always succeed at the risk the page will not be available at fault
      time.  This might be used for example on very large sparse mappings where
      the developer is confident the necessary huge pages exist to satisfy all
      faults even though the whole mapping cannot be backed by huge pages.
      Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the
      fault handler which proceeds to trigger the OOM-killer.  This is
      unhelpful.
      
      Even without hugetlbfs mounted, a user using mmap() can trivially trigger
      the OOM-killer because VM_FAULT_OOM is returned (will provide example
      program if desired - it's a whopping 24 lines long).  It could be
      considered a DOS available to an unprivileged user.
      
      This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE
      where huge pages were not available with SIGBUS instead of triggering the
      OOM killer.
      
      This change affects hugetlb_cow() as well.  I feel there is a failure case
      in there, but I didn't create one.  It would need a fairly specific target
      in terms of the faulting application and the hugepage pool size.  The
      hugetlb_no_page() path is much easier to hit but both might as well be
      closed.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a6018f7
  7. 06 5月, 2010 1 次提交
  8. 05 5月, 2010 1 次提交
  9. 01 5月, 2010 5 次提交
    • T
      percpu: implement kernel memory based chunk allocation · b0c9778b
      Tejun Heo 提交于
      Implement an alternate percpu chunk management based on kernel memeory
      for nommu SMP architectures.  Instead of mapping into vmalloc area,
      chunks are allocated as a contiguous kernel memory using
      alloc_pages().  As such, percpu allocator on nommu will have the
      following restrictions.
      
      * It can't fill chunks on-demand page-by-page.  It has to allocate
        each chunk fully upfront.
      
      * It can't support sparse chunk for NUMA configurations.  SMP w/o mmu
        is crazy enough.  Let's hope no one does NUMA w/o mmu.  :-P
      
      * If chunk size isn't power-of-two multiple of PAGE_SIZE, the
        unaligned amount will be wasted on each chunk.  So, archs which use
        this better align chunk size.
      
      For instructions on how to use this, read the comment on top of
      mm/percpu-km.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Cc: Graff Yang <graff.yang@gmail.com>
      Cc: Sonic Zhang <sonic.adi@gmail.com>
      b0c9778b
    • T
      percpu: move vmalloc based chunk management into percpu-vm.c · 9f645532
      Tejun Heo 提交于
      Separate out and move chunk management (creation/desctruction and
      [de]population) code into percpu-vm.c which is included by percpu.c
      and compiled together.  The interface for chunk management is defined
      as follows.
      
       * pcpu_populate_chunk		- populate the specified range of a chunk
       * pcpu_depopulate_chunk	- depopulate the specified range of a chunk
       * pcpu_create_chunk		- create a new chunk
       * pcpu_destroy_chunk		- destroy a chunk, always preceded by full depop
       * pcpu_addr_to_page		- translate address to physical address
       * pcpu_verify_alloc_info	- check alloc_info is acceptable during init
      
      Other than wrapping vmalloc_to_page() inside pcpu_addr_to_page() and
      dummy pcpu_verify_alloc_info() implementation, this patch only moves
      code around.  This separation is to allow alternate chunk management
      implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Cc: Graff Yang <graff.yang@gmail.com>
      Cc: Sonic Zhang <sonic.adi@gmail.com>
      9f645532
    • T
      percpu: misc preparations for nommu support · 88999a89
      Tejun Heo 提交于
      Make the following misc preparations for percpu nommu support.
      
      * Remove refernces to vmalloc in common comments as nommu percpu won't
        use it.
      
      * Rename chunk->vms to chunk->data and make it void *.  Its use is
        determined by chunk management implementation.
      
      * Relocate utility functions and add __maybe_unused to functions which
        might not be used by different chunk management implementations.
      
      This patch doesn't cause any functional change.  This is to allow
      alternate chunk management implementation for percpu nommu support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Cc: Graff Yang <graff.yang@gmail.com>
      Cc: Sonic Zhang <sonic.adi@gmail.com>
      88999a89
    • T
      percpu: reorganize chunk creation and destruction · 6081089f
      Tejun Heo 提交于
      Reorganize alloc/free_pcpu_chunk() such that chunk struct alloc/free
      live in pcpu_alloc/free_chunk() and the rest in
      pcpu_create/destroy_chunk().  While at it, add missing error handling
      for chunk->map allocation failure.
      
      This is to allow alternate chunk management implementation for percpu
      nommu support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Cc: Graff Yang <graff.yang@gmail.com>
      Cc: Sonic Zhang <sonic.adi@gmail.com>
      6081089f
    • T
      percpu: factor out pcpu_addr_in_first/reserved_chunk() and update per_cpu_ptr_to_phys() · 020ec653
      Tejun Heo 提交于
      Factor out pcpu_addr_in_first/reserved_chunk() from
      pcpu_chunk_addr_search() and use it to update per_cpu_ptr_to_phys()
      such that it handles first chunk differently from the rest.
      
      This patch doesn't cause any functional change and is to prepare for
      percpu nommu support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDavid Howells <dhowells@redhat.com>
      Cc: Graff Yang <graff.yang@gmail.com>
      Cc: Sonic Zhang <sonic.adi@gmail.com>
      020ec653
  10. 29 4月, 2010 1 次提交
  11. 27 4月, 2010 1 次提交
  12. 25 4月, 2010 5 次提交
  13. 22 4月, 2010 1 次提交
  14. 20 4月, 2010 1 次提交
    • R
      rmap: add exclusively owned pages to the newest anon_vma · e8a03feb
      Rik van Riel 提交于
      The recent anon_vma fixes cause many anonymous pages to end up
      in the parent process anon_vma, even when the page is exclusively
      owned by the current process.
      
      Adding exclusively owned anonymous pages to the top anon_vma
      reduces rmap scanning overhead, especially in workloads with
      forking servers.
      
      This patch adds a parameter to __page_set_anon_rmap that can
      be used to indicate whether or not the added page is exclusively
      owned by the current process.
      
      Pages added through page_add_new_anon_rmap are exclusively
      owned by the current process, and can be added to the top
      anon_vma.
      
      Pages added through page_add_anon_rmap can be either shared
      or exclusively owned, so we do the conservative thing and
      add it to the oldest anon_vma.
      
      A next step would be to add the exclusive parameter to
      page_add_anon_rmap, to be used from functions where we do
      know for sure whether a page is exclusively owned.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Lightly-tested-by: NBorislav Petkov <bp@alien8.de>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      [ Edited to look nicer  - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8a03feb
新手
引导
客服 返回
顶部