1. 04 Jul, 2013 22 commits
    • M
      mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority · d43006d5
      Mel Gorman committed
      Currently kswapd queues dirty pages for writeback if scanning at an
      elevated priority but the priority kswapd scans at is not related to the
      number of unqueued dirty pages encountered.  Since commit "mm: vmscan:
      Flatten kswapd priority loop", the priority is related to the size of
      the LRU and the zone watermark which is no indication as to whether
      kswapd should write pages or not.

      This patch tracks if an excessive number of unqueued dirty pages are
      being encountered at the end of the LRU.  If so, it indicates that dirty
      pages are being recycled before flusher threads can clean them and flags
      the zone so that kswapd will start writing pages until the zone is
      balanced.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d43006d5
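
The tracking described in this changelog can be sketched in userspace C as follows. This is only an illustration, not the kernel code: `page_sketch`, `zone_sketch`, `note_tail_batch()` and `kswapd_may_writepage()` are hypothetical names, and the "every page in the batch is unqueued dirty" threshold is an assumption about what counts as "excessive". The flag models the ZONE_TAIL_LRU_DIRTY name mentioned in the series changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a page at the tail of the LRU. */
struct page_sketch {
    bool dirty;          /* page holds data not yet written back */
    bool queued_for_io;  /* a flusher thread already queued it for writeback */
};

/* Hypothetical stand-in for a zone; the flag models ZONE_TAIL_LRU_DIRTY. */
struct zone_sketch {
    bool tail_lru_dirty;
};

/* Count pages in the batch that are dirty but not yet queued for IO. */
static size_t count_unqueued_dirty(const struct page_sketch *batch, size_t n)
{
    size_t count = 0;

    assert(batch != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (batch[i].dirty && !batch[i].queued_for_io)
            count++;
    return count;
}

/*
 * Flag the zone when the whole batch taken from the LRU tail was unqueued
 * dirty. Treating "all pages" as the threshold is an assumption.
 */
static void note_tail_batch(struct zone_sketch *zone,
                            const struct page_sketch *batch, size_t nr_taken)
{
    if (nr_taken > 0 && count_unqueued_dirty(batch, nr_taken) == nr_taken)
        zone->tail_lru_dirty = true;
}

/* In this sketch, kswapd writes pages only while the zone is flagged. */
static bool kswapd_may_writepage(const struct zone_sketch *zone)
{
    return zone->tail_lru_dirty;
}
```

The point of the design is that the decision to write depends on what was actually seen at the tail of the LRU, not on the scanning priority.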
    • M
      mm: vmscan: do not allow kswapd to scan at maximum priority · 9aa41348
      Mel Gorman committed
      Page reclaim at priority 0 will scan the entire LRU as priority 0 is
      considered to be a near OOM condition.  Kswapd can reach priority 0
      quite easily if it is encountering a large number of pages it cannot
      reclaim such as pages under writeback.  When this happens, kswapd
      reclaims very aggressively even though there may be no real risk of
      allocation failure or OOM.

      This patch prevents kswapd reaching priority 0 and trying to reclaim the
      world.  Direct reclaimers will still reach priority 0 in the event of an
      OOM situation.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9aa41348
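
A rough sketch of why priority 0 means "scan everything" and what the floor changes, assuming the scan size is the LRU size shifted right by the priority. The helper names (`scan_target()`, `lowest_priority()`, `max_scan()`) are hypothetical; `DEF_PRIORITY` is the kernel's starting priority of 12.

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12  /* kernel's default starting scan priority */

/* Pages to scan at a given priority: the LRU size shifted by the priority. */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
    assert(priority >= 0 && priority <= DEF_PRIORITY);
    return lru_size >> priority;
}

/* After this patch kswapd stops at priority 1; direct reclaim may reach 0. */
static int lowest_priority(bool is_kswapd)
{
    return is_kswapd ? 1 : 0;
}

/* Largest single-pass scan a reclaimer can reach when it keeps failing. */
static unsigned long max_scan(unsigned long lru_size, bool is_kswapd)
{
    unsigned long target = 0;

    for (int prio = DEF_PRIORITY; prio >= lowest_priority(is_kswapd); prio--)
        target = scan_target(lru_size, prio);
    return target;
}
```

Under these assumptions kswapd tops out at half the LRU per pass, while a direct reclaimer facing OOM can still scan all of it.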
    • M
      mm: vmscan: decide whether to compact the pgdat based on reclaim progress · 2ab44f43
      Mel Gorman committed
      In the past, kswapd made a decision on whether to compact memory after
      the pgdat was considered balanced.  This more or less worked but it is
      late to make such a decision and does not fit well now that kswapd makes
      a decision whether to exit the zone scanning loop depending on reclaim
      progress.

      This patch will compact a pgdat if at least the requested number of
      pages were reclaimed from unbalanced zones for a given priority.  If any
      zone is currently balanced, kswapd will not call compaction as it is
      expected the necessary pages are already available.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ab44f43
    • M
      mm: vmscan: flatten kswapd priority loop · b8e83b94
      Mel Gorman committed
      kswapd stops raising the scanning priority when at least
      SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered
      balanced.  It then rechecks if it needs to restart at DEF_PRIORITY and
      whether high-order reclaim needs to be reset.  This is not wrong per se
      but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY
      may require several restarts before it has scanned enough pages to meet
      the high watermark even at 100% efficiency.  This patch irons out the
      logic a bit by controlling when priority is raised and removing the
      "goto loop_again".

      This patch has kswapd raise the scanning priority until it is scanning
      enough pages that it could meet the high watermark in one shrink of the
      LRU lists if it is able to reclaim at 100% efficiency.  It will not
      raise the scanning priority higher unless it is failing to reclaim any
      pages.

      To avoid infinite looping for high-order allocation requests kswapd will
      not reclaim for high-order allocations when it has reclaimed at least
      twice the number of pages as the allocation request.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8e83b94
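
The high-order cutoff in the last paragraph of this changelog is simple arithmetic: an order-N request is for 2^N pages, so "twice the request" is `2UL << N`. A minimal sketch, with `effective_order()` as a hypothetical helper name rather than a kernel function:

```c
#include <assert.h>

/*
 * Return the order kswapd should keep reclaiming for. Once at least twice
 * the requested number of pages (2 * 2^order) has been reclaimed, fall back
 * to order-0 reclaim so a high-order request cannot loop forever.
 */
static unsigned int effective_order(unsigned int order,
                                    unsigned long nr_reclaimed)
{
    assert(order < 32);  /* keep the shift well defined in this sketch */
    if (order > 0 && nr_reclaimed >= (2UL << order))
        return 0;
    return order;
}
```

For an order-9 request (512 pages), the fallback happens once 1024 pages have been reclaimed.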
    • M
      mm: vmscan: obey proportional scanning requirements for kswapd · e82e0561
      Mel Gorman committed
      Simplistically, the anon and file LRU lists are scanned proportionally
      depending on the value of vm.swappiness although there are other factors
      taken into account by get_scan_count().  The patch "mm: vmscan: Limit
      the number of pages kswapd reclaims" limits the number of pages kswapd
      reclaims but it breaks this proportional scanning and may evenly shrink
      anon/file LRUs regardless of vm.swappiness.

      This patch preserves the proportional scanning and reclaim.  It does
      mean that kswapd will reclaim more than requested but the number of
      pages will be related to the high watermark.

      [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
      [kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
      [hannes@cmpxchg.org: Account for already scanned pages properly]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e82e0561
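
To make "scanned proportionally depending on vm.swappiness" concrete, here is a deliberately simplified sketch of the ratio the patch aims to preserve. It assumes the anon:file weighting is swappiness : (200 - swappiness) and ignores every other factor that get_scan_count() takes into account; `scan_split` and `split_scan_budget()` are hypothetical names.

```c
#include <assert.h>

/* How a scan budget is divided between the two LRU types. */
struct scan_split {
    unsigned long anon;
    unsigned long file;
};

/*
 * Split a scan budget between anon and file pages. The weighting
 * swappiness : (200 - swappiness) is a simplifying assumption.
 */
static struct scan_split split_scan_budget(unsigned long budget,
                                           unsigned int swappiness)
{
    struct scan_split split;

    assert(swappiness <= 100);
    split.anon = budget * swappiness / 200;
    split.file = budget - split.anon;
    return split;
}
```

With a budget of 1000 pages and swappiness 60 this gives 300 anon and 700 file pages. An even 500/500 split, which is what the changelog says the earlier patch could cause, would ignore the tunable.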
    • M
      mm: vmscan: limit the number of pages kswapd reclaims at each priority · 75485363
      Mel Gorman committed
      This series does not fix all the current known problems with reclaim but
      it addresses one important swapping bug when there is background IO.
      
      Changelog since V3
       - Drop the slab shrink changes in light of Glauber's series;
         discussions highlighted that there were a number of potential
         problems with the patch.					(mel)
       - Rebased to 3.10-rc1
      
      Changelog since V2
       - Preserve ratio properly for proportional scanning		(kamezawa)
      
      Changelog since V1
       - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY			(andi)
       - Reformat comment in shrink_page_list				(andi)
       - Clarify some comments					(dhillf)
       - Rework how the proportional scanning is preserved
       - Add PageReclaim check before kswapd starts writeback
       - Reset sc.nr_reclaimed on every full zone scan
      
      Kswapd and page reclaim behaviour has been screwy in one way or the
      other for a long time.  Very broadly speaking it worked in the far past
      because machines were limited in memory so it did not have that many
      pages to scan and it stalled in congestion_wait() frequently to prevent it
      going completely nuts.  In recent times it has behaved very
      unsatisfactorily with some of the problems compounded by the removal of
      stall logic and the introduction of transparent hugepage support with
      high-order reclaims.
      
      There are many variations of bugs that are rooted in this area.  One
      example is reports of large copy operations or backups causing the
      machine to grind to a halt or applications pushed to swap.  Sometimes in
      low memory situations a large percentage of memory suddenly gets
      reclaimed.  In other cases an application starts and kswapd hits 100%
      CPU usage for prolonged periods of time and so on.  There is now talk of
      introducing features like an extra free kbytes tunable to work around
      aspects of the problem instead of trying to deal with it.  It's
      compounded by the problem that it can be very workload and machine
      specific.
      
      This series aims at addressing some of the worst of these problems
      without attempting to fundamentally alter how page reclaim works.

      Patches 1-2 limit the number of pages kswapd reclaims while still obeying
      	the anon/file proportion of the LRUs it should be scanning.
      
      Patches 3-4 control how and when kswapd raises its scanning priority and
      	deletes the scanning restart logic which is tricky to follow.
      
      Patch 5 notes that it is too easy for kswapd to reach priority 0 when
      	scanning and then reclaim the world. Down with that sort of thing.
      
      Patch 6 notes that kswapd starts writeback based on scanning priority which
      	is not necessarily related to dirty pages. It will have kswapd
      	writeback pages if a number of unqueued dirty pages have been
      	recently encountered at the tail of the LRU.
      
      Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
      	to reduce LRU churn and the likelihood that it'll reclaim young
      	clean pages or push applications to swap. It will cause kswapd
      	to block on IO if it detects that pages being reclaimed under
      	writeback are recycling through the LRU before the IO completes.
      
      Patches 8-9 are cosmetic but balance_pgdat() is easier to follow after they
      	are applied.
      
      This was tested using memcached+memcachetest while some background IO
      was in progress, as implemented by the parallel IO tests in MM Tests.
      
      memcachetest benchmarks how many operations/second memcached can service
      and it is run multiple times.  It starts with no background IO and then
      re-runs the test with larger amounts of IO in the background to roughly
      simulate a large copy in progress.  The expectation is that the IO
      should have little or no impact on memcachetest which is running
      entirely in memory.
      
                                              3.10.0-rc1                  3.10.0-rc1
                                                 vanilla            lessdisrupt-v4
      Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
      Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
      Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
      Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
      Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
      Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
      Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
      Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
      Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
      Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
      Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
      Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
      Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
      Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)
      
      Note how the vanilla kernel's performance collapses when there is enough
      IO taking place in the background.  This drop in performance is part of
      what users complain of when they start backups.  Note how the swapin and
      major fault figures indicate that processes were being pushed to swap
      prematurely.  With the series applied, there is no noticeable performance
      drop and while there is still some swap activity, it's tiny.
      
      20 iterations of this test were run in total and averaged.  Every 5
      iterations, additional IO was generated in the background using dd to
      measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
      subblocks refer to the amount of IO going on in the background at each
      iteration.  So memcachetest-2385M is reporting how many
      transactions/second memcachetest recorded on average over 5 iterations
      while there was 2385M of IO going on in the background.  There are six
      blocks of information reported here:
      
      memcachetest is the transactions/second reported by memcachetest. In
      	the vanilla kernel note that performance drops from around
      	22K/sec to just under 4K/second when there is 2385M of IO going
      	on in the background. This is one type of performance collapse
      	users complain about if a large cp or backup starts in the
      	background
      
      io-duration refers to how long it takes for the background IO to
      	complete. It shows that with the patched kernel the IO
      	completes faster while not interfering with the memcache
      	workload.
      
      swaptotal is the total amount of swap traffic. With the patched kernel,
      	the total amount of swapping is much reduced although it is
      	still not zero.
      
      swapin in this case is an indication as to whether we are swap thrashing.
      	The closer the swapin/swapout ratio is to 1, the worse the
      	thrashing is.  Note with the patched kernel that there is no swapin
      	activity indicating that all the pages swapped were really inactive
      	unused pages.
      
      minorfaults are just minor faults. An increased number of minor faults
      	can indicate that page reclaim is unmapping the pages but not
      	swapping them out before they are faulted back in. With the
      	patched kernel, there is only a small change in minor faults
      
      majorfaults are just major faults in the target workload and a high
      	number can indicate that a workload is being prematurely
      	swapped. With the patched kernel, major faults are much reduced. As
      	there are no swapins recorded, the workload is not being swapped. The
      	likely explanation for the remaining major faults is that libraries or
      	configuration files used by the workload during startup get paged out
      	by the background IO.
      
      Overall with the series applied, there is no noticeable performance drop
      due to background IO and while there is still some swap activity, it's
      tiny and the lack of swapins implies that the swapped pages were inactive
      and unused.
      
                                  3.10.0-rc1  3.10.0-rc1
                                     vanilla lessdisrupt-v4
      Page Ins                       1234608      101892
      Page Outs                     12446272    11810468
      Swap Ins                        283406           0
      Swap Outs                       698469       27882
      Direct pages scanned                 0      136480
      Kswapd pages scanned           6266537     5369364
      Kswapd pages reclaimed         1088989      930832
      Direct pages reclaimed               0      120901
      Kswapd efficiency                  17%         17%
      Kswapd velocity               5398.371    4635.115
      Direct efficiency                 100%         88%
      Direct velocity                  0.000     117.817
      Percentage direct scans             0%          2%
      Page writes by reclaim         1655843     4009929
      Page writes file                957374     3982047
      Page writes anon                698469       27882
      Page reclaim immediate            5245        1745
      Page rescued immediate               0           0
      Slabs scanned                    33664       25216
      Direct inode steals                  0           0
      Kswapd inode steals              19409         778
      Kswapd skipped wait                  0           0
      THP fault alloc                     35          30
      THP collapse alloc                 472         401
      THP splits                          27          22
      THP fault fallback                   0           0
      THP collapse fail                    0           1
      Compaction stalls                    0           4
      Compaction success                   0           0
      Compaction failures                  0           4
      Page migrate success                 0           0
      Page migrate failure                 0           0
      Compaction pages isolated            0           0
      Compaction migrate scanned           0           0
      Compaction free scanned              0           0
      Compaction cost                      0           0
      NUMA PTE updates                     0           0
      NUMA hint faults                     0           0
      NUMA hint local faults               0           0
      NUMA pages migrated                  0           0
      AutoNUMA cost                        0           0
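
The derived rows above can be reproduced from the raw counters. The following sketch assumes that "efficiency" is pages reclaimed as a truncated percentage of pages scanned, with 100% reported when nothing was scanned, and that the thrashing indicator is the swapin/swapout ratio described earlier. Both helper names are hypothetical and do not come from the reporting scripts.

```c
#include <assert.h>

/*
 * Reclaim efficiency: reclaimed pages as a whole-number percentage of
 * scanned pages. Reporting 100 when nothing was scanned is an assumption
 * made to match the "Direct efficiency" row for the vanilla kernel.
 */
static unsigned int efficiency_pct(unsigned long long reclaimed,
                                   unsigned long long scanned)
{
    assert(reclaimed <= scanned);
    if (scanned == 0)
        return 100;
    return (unsigned int)(reclaimed * 100 / scanned);
}

/* Swapin/swapout ratio; the closer to 1, the worse the thrashing. */
static double swap_ratio(unsigned long long swap_ins,
                         unsigned long long swap_outs)
{
    if (swap_outs == 0)
        return 0.0;
    return (double)swap_ins / (double)swap_outs;
}
```

Fed with the table's counters, `efficiency_pct(1088989, 6266537)` and `efficiency_pct(930832, 5369364)` both give 17, and `efficiency_pct(120901, 136480)` gives 88, matching the rows above. The vanilla swap ratio is 283406/698469, roughly 0.4, against 0 for the patched kernel.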
      
      Unfortunately, note that there is a small amount of direct reclaim due to
      kswapd no longer reclaiming the world.  ftrace indicates that the direct
      reclaim stalls are mostly harmless with the vast bulk of the stalls
      incurred by dd
      
           23 tclsh-3367
           38 memcachetest-13733
           49 memcachetest-12443
           57 tee-3368
         1541 dd-13826
         1981 dd-12539
      
      A consequence of the direct reclaim for dd is that the processes for the
      IO workload may show a higher system CPU usage.  There is also a risk that
      kswapd not reclaiming the world may mean that it stays awake balancing
      zones, does not stall on the appropriate events and continually scans
      pages it cannot reclaim consuming CPU.  This will be visible as continued
      high CPU usage but in my own tests I only saw a single spike lasting less
      than a second and I did not observe any problems related to reclaim while
      running the series on my desktop.
      
      This patch:
      
      The number of pages kswapd can reclaim is bound by the number of pages it
      scans which is related to the size of the zone and the scanning priority.
      In many cases the priority remains low because it's reset every
      SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
      number of pages it cannot reclaim, it will raise the priority and
      potentially discard a large percentage of the zone as sc->nr_to_reclaim is
      ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
      percentage of memory is suddenly freed.  It would be bad enough if this
      was just unused memory but because of how anon/file pages are balanced it
      is possible that applications get pushed to swap unnecessarily.
      
      This patch limits the number of pages kswapd will reclaim to the high
      watermark.  Reclaim will still overshoot due to it not being a hard limit
      as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
      prevents kswapd reclaiming the world at higher priorities.  The number of
      pages it reclaims is not adjusted for high-order allocations as kswapd
      will reclaim excessively if it is to balance zones for high-order
      allocations.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Cc: dormando <dormando@rydia.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75485363
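
The limit described under "This patch" can be sketched as two small helpers. This is an illustration under stated assumptions: `kswapd_reclaim_target()` and `may_stop_reclaim()` are hypothetical names, and using SWAP_CLUSTER_MAX (32 in the kernel) as a floor for tiny watermarks is an assumption. The DEF_PRIORITY exception follows the changelog's remark that shrink_lruvec() ignores sc.nr_to_reclaim at that priority.

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_CLUSTER_MAX 32UL  /* kernel's reclaim batch size */
#define DEF_PRIORITY 12        /* kernel's default starting scan priority */

/*
 * Reclaim target for kswapd: the zone's high watermark instead of
 * ULONG_MAX. The SWAP_CLUSTER_MAX floor is an assumption.
 */
static unsigned long kswapd_reclaim_target(unsigned long high_wmark_pages)
{
    return high_wmark_pages > SWAP_CLUSTER_MAX ? high_wmark_pages
                                               : SWAP_CLUSTER_MAX;
}

/*
 * The target is not a hard limit: at DEF_PRIORITY it is ignored, so the
 * first pass may overshoot. It only stops reclaim at raised priorities.
 */
static bool may_stop_reclaim(unsigned long nr_reclaimed,
                             unsigned long nr_to_reclaim, int priority)
{
    assert(priority >= 0 && priority <= DEF_PRIORITY);
    return priority < DEF_PRIORITY && nr_reclaimed >= nr_to_reclaim;
}
```

With a target of ULONG_MAX the second condition can never be true, which is how the reclaim "spike" described above arises.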
    • C
      mm/page_alloc: don't re-init pageset in zone_pcp_update() · 169f6c19
      Cody P Schafer committed
      When memory hotplug is triggered, we call pageset_init() on
      per-cpu-pagesets which both contain pages and are in use, causing both the
      leakage of those pages and (potentially) bad behaviour if a page is
      allocated from a pageset while it is being cleared.

      Avoid this by factoring out pageset_set_high_and_batch() (which contains
      all needed logic to set a pageset's ->high and ->batch irrespective of
      system state) from zone_pageset_init() and using the new
      pageset_set_high_and_batch() instead of zone_pageset_init() in
      zone_pcp_update().
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      169f6c19
    • C
      mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch() · 3664033c
      Cody P Schafer committed
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3664033c
    • C
      mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init() · 737af4c0
      Cody P Schafer committed
      Previously, zone_pcp_update() called pageset_set_batch() directly,
      essentially assuming that percpu_pagelist_fraction == 0.

      Correct this by calling zone_pageset_init(), which chooses the
      appropriate ->batch and ->high calculations.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      737af4c0
    • C
      mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset() · 56cef2b8
      Cody P Schafer committed
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56cef2b8
    • C
      mm/page_alloc: relocate comment to be directly above code it refers to. · dd1895e2
      Cody P Schafer committed
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd1895e2
    • C
      mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch() · 88c90dbc
      Cody P Schafer committed
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88c90dbc
    • C
      mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high · 22a7f12b
      Cody P Schafer committed
      Simply moves calculation of the new 'high' value outside the
      for_each_possible_cpu() loop, as it does not depend on the cpu.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22a7f12b
    • C
      mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine() · 0a647f38
      Cody P Schafer committed
      zone_pcp_update()'s goal is to adjust the ->high and ->batch members of a
      percpu pageset based on a zone's ->managed_pages.  We don't need to drain
      the entire percpu pageset just to modify these fields.

      This lets us avoid calling setup_pageset() (and the draining required to
      call it) and instead allows simply setting the fields' values (with some
      attention paid to memory barriers to prevent the relationship between
      ->batch and ->high from being thrown off).

      This does change the behavior of zone_pcp_update() as the percpu pagesets
      will not be drained when zone_pcp_update() is called (they will end up
      being shrunk, not completely drained, later when a 0-order page is freed
      in free_hot_cold_page()).
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a647f38
    • C
      mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE · 998d39cb
      Cody P Schafer committed
      pcp->batch could change at any point, so avoid relying on it being a
      stable value.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      998d39cb
    • C
      mm/page_alloc: insert memory barriers to allow async update of pcp batch and high · 8d7a8fa9
      Cody P Schafer committed
      Introduce pageset_update() to perform a safe transition from one set of
      pcp->{batch,high} to a new set using memory barriers.

      This ensures that batch is always set to a safe value (1) prior to
      updating high, and ensures that high is fully updated before setting the
      real value of batch.  It avoids ->batch ever rising above ->high.

      Suggested by Gilad Ben-Yossef in these threads:

      	https://lkml.org/lkml/2013/4/9/23
      	https://lkml.org/lkml/2013/4/10/49

      Also reproduces his proposed comment.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d7a8fa9
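
The three-step ordering described in this changelog can be modelled in userspace with C11 atomics. This is a sketch of the idea rather than the kernel's pageset_update(): `pcp_sketch` and `pageset_update_sketch()` are hypothetical names, and a release fence is used here as a stand-in for the kernel's write barrier. Only the writer side is shown; readers would need their own ordering to benefit from it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-cpu pageset fields being updated. */
struct pcp_sketch {
    atomic_int high;   /* free pages beyond this are returned in bulk */
    atomic_int batch;  /* chunk size for bulk add/remove */
};

/*
 * Move to a new (high, batch) pair without ever publishing a batch that is
 * larger than the high value a concurrent observer could see with it.
 */
static void pageset_update_sketch(struct pcp_sketch *pcp, int high, int batch)
{
    assert(batch >= 1 && batch <= high);

    /* 1. Drop batch to 1, which is safe with any value of high. */
    atomic_store_explicit(&pcp->batch, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    /* 2. Install the new high while batch is still the safe value. */
    atomic_store_explicit(&pcp->high, high, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    /* 3. Only now publish the real batch. */
    atomic_store_explicit(&pcp->batch, batch, memory_order_relaxed);
}
```

Each fence keeps the store before it from being reordered after the store that follows, which is the property the changelog relies on to keep ->batch from rising above ->high during the transition.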
    • C
      mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high · c8e251fa
      Cody P Schafer committed
      Because we are going to rely upon a careful transition between old and new
      ->high and ->batch values using memory barriers and will remove
      stop_machine(), we need to prevent multiple updaters from interweaving
      their memory writes.

      Add a simple mutex to protect both update loops.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8e251fa
    • C
      mm/page_alloc: factor out setting of pcp->high and pcp->batch · 4008bab7
      Cody P Schafer 提交于
      "Problems" with the current code:
      
      1: there is a lack of synchronization in setting ->high and ->batch in
         percpu_pagelist_fraction_sysctl_handler()
      
      2: stop_machine() in zone_pcp_update() is unnecissary.
      
      3: zone_pcp_update() does not consider the case where
         percpu_pagelist_fraction is non-zero
      
      To fix:
      
      1: add memory barriers, a safe ->batch value, an update side mutex when
         updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
         that expect a stable value.
      
      2: avoid draining pages in zone_pcp_update(), rely upon the memory
         barriers added to fix #1
      
      3: factor out quite a few functions, and then call the appropriate one.
      
      Note that it results in a change to the behavior of zone_pcp_update(),
      which is used by memory_hotplug.  I'm rather certain that I've diserned
      (and preserved) the essential behavior (changing ->high and ->batch), and
      only eliminated unneeded actions (draining the per cpu pages), but this
      may not be the case.
      
      Further note that the draining of pages that previously took place in
      zone_pcp_update() occurred after repeated draining when attempting to
      offline a page, and after the offline has "succeeded".  It appears that
      the draining was added to zone_pcp_update() to avoid refactoring
      setup_pageset() into 2 functions.
      
      This patch:
      
      Creates pageset_set_batch() for use in setup_pageset().
      pageset_set_batch() imitates the functionality of
      setup_pagelist_highmark(), but uses the boot time
      (percpu_pagelist_fraction == 0) calculations for determining ->high based
      on ->batch.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4008bab7
    • L
      mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT · d6e93217
      Libin committed
      The (*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
      as an inline function, vma_pages(), in linux/mm.h, so use it.
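      
      A minimal user-space sketch of the arithmetic that the helper wraps.
      It assumes 4 KiB pages (PAGE_SHIFT of 12); struct vma_model and
      vma_pages_model() are hypothetical stand-ins, not the kernel's
      definitions:

```c
/* Assumption: 4 KiB pages; the real value is architecture dependent. */
#define PAGE_SHIFT 12

/* Hypothetical stand-in for the two fields of struct vm_area_struct used here. */
struct vma_model {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* first byte after the mapping */
};

/* Number of pages spanned by the mapping: (vm_end - vm_start) >> PAGE_SHIFT. */
static inline unsigned long vma_pages_model(const struct vma_model *vma)
{
	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}
```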
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6e93217
    • M
      mm: remove compressed copy from zram in-memory · b430e9d1
      Minchan Kim committed
      The swap subsystem frees swap slots lazily, expecting that the page
      will be swapped out again, so that an unnecessary write can be avoided.
      
      But the problem with in-memory swap (e.g. zram) is that it consumes
      memory space until the vm_swap_full() condition (i.e. half of all swap
      space is in use) is met.  That can be bad if we use multiple swap
      devices (a small in-memory swap and a big storage swap) or in-memory
      swap alone.
      
      This patch makes the swap subsystem free the swap slot as soon as the
      swap read completes and marks the swapcache page dirty, so the page
      must be written out to the swap device before it can be reclaimed.
      That means we never lose it.
      
      I tested this patch with kernel compile workload.
      
      1. before
      
         compile time : 9882.42
         zram max wasted space by fragmentation: 13471881 byte
         memory space consumed by zram: 174227456 byte
         the number of slot free notify: 206684
      
      2. after
      
         compile time : 9653.90
         zram max wasted space by fragmentation: 11805932 byte
         memory space consumed by zram: 154001408 byte
         the number of slot free notify: 426972
      
      [akpm@linux-foundation.org: tweak comment text]
      [artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()]
      [akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b430e9d1
    • D
      mm, memcg: don't take task_lock in task_in_mem_cgroup · ffbdccf5
      David Rientjes committed
      For processes that have detached their mm's, task_in_mem_cgroup()
      unnecessarily takes task_lock() when rcu_read_lock() is all that is
      necessary to call mem_cgroup_from_task().
      
      While we're here, switch task_in_mem_cgroup() to return bool.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffbdccf5
    • P
      mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov committed
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs")
        2. Wait some time.
        3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is.  Thus, after this, when the task tries to modify a
      page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note that although all the task's address space is marked as r/o after
      the soft-dirty bits are cleared, the #PF-s that occur after that are
      processed fast.  This is because the pages are still mapped to physical
      memory, so all the kernel does is find this fact out and put the
      writable, dirty and soft-dirty bits back on the PTE.
      
      Another thing to note is that when mremap moves PTEs they are marked
      with soft-dirty as well, since from the user perspective mremap modifies
      the virtual memory at mremap's new address.
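      
      Step 3 of the tracking procedure comes down to testing one bit of a
      64-bit entry.  The sketch below shows only that bit test and the
      lookup arithmetic, assuming one 64-bit entry per virtual page; both
      helper names are hypothetical, and reading the entries from /proc is
      left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position taken from the changelog above (the 55'th bit of an entry). */
#define PM_SOFT_DIRTY_BIT 55

/* Hypothetical helper: true if a 64-bit entry has the soft-dirty bit set. */
static inline bool pagemap_entry_soft_dirty(uint64_t entry)
{
	return (entry >> PM_SOFT_DIRTY_BIT) & 1;
}

/* Hypothetical helper: byte offset of the entry that describes vaddr. */
static inline uint64_t pagemap_entry_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * sizeof(uint64_t);
}
```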
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f8975ec
  2. 03 Jul, 2013 1 commit
    • J
      vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu committed
      The file systems that support SEEK_DATA/SEEK_HOLE (btrfs/ext4/ocfs2/tmpfs)
      each end up duplicating the logic in lseek_execute() that updates the
      current file offset to the desired offset if it is valid.  ceph does
      something similar in ceph_llseek().
      
      To reduce the duplication, this patch makes lseek_execute() publicly
      accessible so that it can be called directly from the underlying file
      systems.
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
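      
      A rough user-space sketch of what such a "set the position if the
      offset is valid" helper might look like.  struct file_model, its
      fields and setpos_model() are hypothetical; the exact checks and the
      f_version reset are assumptions, not a copy of the kernel function:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the parts of struct file that matter here. */
struct file_model {
	long long f_pos;		/* current file offset */
	unsigned long long f_version;	/* reset whenever the offset moves */
	bool unsigned_offsets;		/* file accepts offsets with the top bit set */
};

/*
 * Validate offset and store it as the new position.
 * Returns the new offset, or -EINVAL if it is out of range.
 */
static long long setpos_model(struct file_model *file, long long offset,
			      long long maxsize)
{
	if (offset < 0 && !file->unsigned_offsets)
		return -EINVAL;
	if (offset > maxsize)
		return -EINVAL;

	if (offset != file->f_pos) {
		file->f_pos = offset;
		file->f_version = 0;
	}
	return offset;
}
```

      A file system's SEEK_DATA/SEEK_HOLE code would compute the target
      offset itself and then hand it to a helper of this shape instead of
      open-coding the checks.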
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      46a1c2c7
  3. 29 Jun, 2013 1 commit
  4. 26 Jun, 2013 1 commit
    • Z
      futex: Take hugepages into account when generating futex_key · 13d60f4b
      Zhang Yi committed
      The futex_keys of process shared futexes are generated from the page
      offset, the mapping host and the mapping index of the futex user space
      address. This should result in a unique identifier for each futex.
      
      However, this is not true when futexes are located in different subpages
      of a hugepage. The reason is that the mapping index for all those
      futexes evaluates to the index of the base page of the hugetlbfs
      mapping. So a futex at offset 0 of the hugepage mapping and another
      one at offset PAGE_SIZE of the same hugepage mapping have identical
      futex_keys. This happens because the futex code blindly uses
      page->index.
      
      Steps to reproduce the bug:
      
      1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
         and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
         mapping.
      
         The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
         PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
         their keys solely depend on the user space address.
      
      2. Lock mutex1 and mutex2
      
      3. Create thread1 and in the thread function lock mutex1, which
         results in thread1 blocking on the locked mutex1.
      
      4. Create thread2 and in the thread function lock mutex2, which
         results in thread2 blocking on the locked mutex2.
      
      5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
         still blocks on mutex2 because the futex_key points to mutex1.
      
      To solve this issue we need to take the normal page index of the page
      which contains the futex into account, if the futex is in a hugetlbfs
      mapping. In other words, we calculate the normal page mapping index of
      the subpage in the hugetlbfs mapping.
      
      Mappings which are not based on hugetlbfs are not affected and still
      use page->index.
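      
      The index calculation described above can be sketched as follows.
      basepage_index_model() is a hypothetical helper; it assumes the head
      page's index is counted in hugepage units and that a hugepage holds
      2^order normal pages:

```c
/*
 * Hypothetical model: index of a subpage in units of normal pages.
 * head_index is the hugepage's index in the mapping (in hugepage units),
 * order is log2 of the number of normal pages per hugepage, and
 * subpage_nr is the position of the subpage inside the hugepage.
 */
static inline unsigned long basepage_index_model(unsigned long head_index,
						 unsigned int order,
						 unsigned long subpage_nr)
{
	return (head_index << order) + subpage_nr;
}
```

      With this, a futex at offset 0 and one at offset PAGE_SIZE of the
      same hugepage get different indexes (subpage_nr 0 and 1), so their
      keys no longer collide.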
      
      Thanks to Mel Gorman who provided a patch for adding proper evaluation
      functions to the hugetlbfs code to avoid exposing hugetlbfs specific
      details to the futex code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
      Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
      Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
      Cc: 'Peter Zijlstra' <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      13d60f4b
  5. 14 Jun, 2013 2 commits
  6. 13 Jun, 2013 7 commits
    • S
      slab: prevent warnings when allocating with __GFP_NOWARN · 907985f4
      Sasha Levin committed
      Sasha Levin noticed that the warning introduced by commit 6286ae97
      ("slab: Return NULL for oversized allocations") is being triggered:
      
        WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0()
        can: request_module (can-proto-4) failed.
        mpoa: proc_mpc_write: could not parse ''
        Modules linked in:
        CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W    3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2
         0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000
         ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0
         0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78
        Call Trace:
         [<ffffffff83ff4041>] dump_stack+0x4e/0x82
         [<ffffffff8111fe12>] warn_slowpath_common+0x82/0xb0
         [<ffffffff8111fe55>] warn_slowpath_null+0x15/0x20
         [<ffffffff81243dcf>] kmalloc_slab+0x2f/0xb0
         [<ffffffff81278d54>] __kmalloc+0x24/0x4b0
         [<ffffffff8196ffe3>] ? security_capable+0x13/0x20
         [<ffffffff812a26b7>] ? pipe_fcntl+0x107/0x210
         [<ffffffff812a26b7>] pipe_fcntl+0x107/0x210
         [<ffffffff812b7ea0>] ? fget_raw_light+0x130/0x3f0
         [<ffffffff812aa5fb>] SyS_fcntl+0x60b/0x6a0
         [<ffffffff8403ca98>] tracesys+0xe1/0xe6
      
      Andrew Morton writes:
      
        __GFP_NOWARN is frequently used by kernel code to probe for "how big
        an allocation can I get".  That's a bit lame, but it's used on slow
        paths and is pretty simple.
      
      However, SLAB should still spew a warning when a big allocation happens
      if the __GFP_NOWARN flag is _not_ set, in order to expose kernel bugs.
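      
      The intended behaviour can be sketched in user space as below.  The
      flag value, the size limit and oversized_alloc_model() are all
      hypothetical; the sketch only models the decision "fail, and warn
      only if the caller did not ask for silence":

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values; the kernel's real constants differ by configuration. */
#define MODEL_GFP_NOWARN	0x200u
#define MODEL_KMALLOC_MAX_SIZE	((size_t)1 << 22)

/*
 * Returns true if the request is oversized and must fail with NULL.
 * *warn is set when a diagnostic should be printed, i.e. only when the
 * caller did not pass the no-warn flag.
 */
static bool oversized_alloc_model(size_t size, unsigned int flags, bool *warn)
{
	*warn = false;
	if (size <= MODEL_KMALLOC_MAX_SIZE)
		return false;
	*warn = !(flags & MODEL_GFP_NOWARN);
	return true;
}
```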
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      [ penberg@kernel.org: improve changelog ]
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      907985f4
    • J
      mm: memcontrol: fix lockless reclaim hierarchy iterator · 89dc991f
      Johannes Weiner committed
      The lockless reclaim hierarchy iterator currently has a misplaced
      barrier that can lead to use-after-free crashes.
      
      The reclaim hierarchy iterator consists of a sequence count and a
      position pointer that are read and written locklessly, with memory
      barriers enforcing ordering.
      
      The write side sets the position pointer first, then updates the
      sequence count to "publish" the new position.  Likewise, the read side
      must read the sequence count first, then the position.  If the sequence
      count is up to date, it's guaranteed that the position is up to date as
      well:
      
        writer:                         reader:
        iter->position = position       if iter->sequence == expected:
        smp_wmb()                           smp_rmb()
        iter->sequence = sequence           position = iter->position
      
      However, the read side barrier is currently misplaced, which can lead to
      dereferencing stale position pointers that no longer point to valid
      memory.  Fix this.
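      
      The writer/reader ordering in the diagram can be modelled in user
      space with C11 atomics.  This is only an analogy: the fences stand in
      for smp_wmb()/smp_rmb(), and struct iter_model, iter_publish() and
      iter_load() are hypothetical names:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of the iterator: a position published by a sequence count. */
struct iter_model {
	void *_Atomic position;
	atomic_uint sequence;
};

/* Writer: store the position first, then publish it via the sequence count. */
static void iter_publish(struct iter_model *it, void *pos, unsigned int seq)
{
	atomic_store_explicit(&it->position, pos, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&it->sequence, seq, memory_order_relaxed);
}

/* Reader: check the sequence count first, and only then read the position. */
static void *iter_load(struct iter_model *it, unsigned int expected)
{
	if (atomic_load_explicit(&it->sequence, memory_order_relaxed) != expected)
		return NULL;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */
	return atomic_load_explicit(&it->position, memory_order_relaxed);
}
```

      The point of the fix is the placement of the read-side barrier: it
      has to sit between the sequence check and the position load, as in
      iter_load() above.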
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: <stable@kernel.org>		[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89dc991f
    • A
      frontswap: fix incorrect zeroing and allocation size for frontswap_map · 7b57976d
      Akinobu Mita committed
      The bitmap accessed by bitops must have enough size to hold the required
      numbers of bits rounded up to a multiple of BITS_PER_LONG.  And the
      bitmap must not be zeroed by memset() if the number of bits cleared is
      not a multiple of BITS_PER_LONG.
      
      This fixes incorrect zeroing and allocation size for frontswap_map.  The
      incorrect zeroing part doesn't cause any problem because frontswap_map
      is freed just after zeroing.  But the wrongly calculated allocation size
      may cause problems.
      
      For 32bit systems, the allocation size of frontswap_map is about twice
      as large as the required size.  For 64bit systems, the allocation size
      is smaller than required if the number of bits is not a multiple of
      BITS_PER_LONG.
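      
      The sizing rule from the first paragraph, rounding the bit count up to
      whole longs, can be written out as below.  The MODEL_ macros are
      user-space equivalents of BITS_PER_LONG and BITS_TO_LONGS(), and
      bitmap_bytes_model() is a hypothetical helper:

```c
#include <limits.h>
#include <stddef.h>

/* User-space equivalents of the kernel's BITS_PER_LONG and BITS_TO_LONGS(). */
#define MODEL_BITS_PER_LONG	(sizeof(long) * CHAR_BIT)
#define MODEL_BITS_TO_LONGS(n)	(((n) + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* Bytes needed for a bitmap of nbits bits that is accessed with bitops. */
static inline size_t bitmap_bytes_model(size_t nbits)
{
	return MODEL_BITS_TO_LONGS(nbits) * sizeof(long);
}
```

      For instance, where long is 64 bits wide, a 100-bit map needs two
      longs (16 bytes), whereas a plain 100 / 8 would give only 12 bytes.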
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b57976d
    • N
      mm: migration: add migrate_entry_wait_huge() · 30dad309
      Naoya Horiguchi committed
      When a page fault occurs on an address backed by a hugepage that is
      under migration, the kernel can't wait correctly and busy-loops on the
      hugepage fault until the migration finishes.  As a result, users who try
      to kick hugepage migration (via soft offlining, for example) occasionally
      experience long delays or soft lockups.
      
      This is because pte_offset_map_lock() can't get a correct migration entry
      or a correct page table lock for a hugepage.  This patch introduces
      migration_entry_wait_huge() to solve this.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[2.6.35+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30dad309
    • T
      mm/page_alloc.c: fix watermark check in __zone_watermark_ok() · 026b0814
      Tomasz Stanislawski committed
      The watermark check consists of two sub-checks.  The first one is:
      
      	if (free_pages <= min + lowmem_reserve)
      		return false;
      
      The check ensures that there is a minimal amount of RAM in the zone.  If
      CMA is used then free_pages is reduced by the number of free pages
      in CMA prior to the above-mentioned check.
      
      	if (!(alloc_flags & ALLOC_CMA))
      		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
      
      This prevents the zone from being drained of pages available for
      non-movable allocations.
      
      The second check prevents the zone from getting too fragmented.
      
      	for (o = 0; o < order; o++) {
      		free_pages -= z->free_area[o].nr_free << o;
      		min >>= 1;
      		if (free_pages <= min)
      			return false;
      	}
      
      The field z->free_area[o].nr_free is equal to the number of free pages
      including free CMA pages.  Therefore the CMA pages are subtracted twice.
      This may cause a false positive fail of __zone_watermark_ok() if the CMA
      area gets strongly fragmented.  In such a case there are many 0-order
      free pages located in CMA.  Those pages are subtracted twice therefore
      they will quickly drain free_pages during the check against
      fragmentation.  The test fails even though there are many free non-cma
      pages in the zone.
      
      This patch fixes this issue by subtracting CMA pages only for the purpose
      of the (free_pages <= min + lowmem_reserve) check.
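      
      A cut-down user-space model of the check with the fix applied: the
      free CMA pages are discounted in the first sub-check only, and the
      fragmentation loop works on the undiscounted count.  struct
      zone_model, MODEL_MAX_ORDER and watermark_ok_model() are hypothetical,
      and other adjustments made by the real function are left out:

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* assumption: number of free-list orders */

/* Hypothetical, cut-down model of a zone's free-page accounting. */
struct zone_model {
	long free_pages;		/* all free pages, CMA included */
	long free_cma_pages;		/* free pages inside CMA areas */
	long nr_free[MODEL_MAX_ORDER];	/* free blocks per order, CMA included */
};

/*
 * Watermark check with CMA pages discounted only in the first sub-check.
 * use_cma says whether the allocation may take pages from CMA areas.
 */
static bool watermark_ok_model(const struct zone_model *z, int order,
			       long min, long lowmem_reserve, bool use_cma)
{
	long free_pages = z->free_pages;
	long free_cma = use_cma ? 0 : z->free_cma_pages;
	int o;

	if (free_pages - free_cma <= min + lowmem_reserve)
		return false;

	for (o = 0; o < order; o++) {
		/* Blocks of this order are too small for the request. */
		free_pages -= z->nr_free[o] << o;
		/* Require fewer free pages at each higher order. */
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```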
      
      Laura said:
      
        We were observing allocation failures of higher order pages (order 5 =
        128K typically) under tight memory conditions resulting in driver
        failure.  The output from the page allocation failure showed plenty of
        free pages of the appropriate order/type/zone and mostly CMA pages in
        the lower orders.
      
        For full disclosure, we still observed some page allocation failures
        even after applying the patch but the number was drastically reduced and
        those failures were attributed to fragmentation/other system issues.
      Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Tested-by: Laura Abbott <lauraa@codeaurora.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      026b0814
    • R
      swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion · cbab0e4e
      Rafael Aquini committed
      read_swap_cache_async() can race against get_swap_page(), and stumble
      across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
      into the swapcache yet.
      
      This swap_map state is expected to be transitory, but the current
      placement of discard at scan_swap_map() inserts a wait for I/O
      completion, which makes the thread at read_swap_cache_async() loop
      around its -EEXIST case while the other end at get_swap_page() is
      scheduled away at scan_swap_map().  This can leave the system deadlocked
      if the I/O completion happens to be waiting on the CPU waitqueue where
      read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
      
      This patch introduces a cond_resched() call so that the aforementioned
      read_swap_cache_async() busy loop yields the CPU when necessary,
      thus avoiding the subtle race window.
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbab0e4e
    • A
      memcg: don't initialize kmem-cache destroying work for root caches · f101a946
      Andrey Vagin committed
      struct memcg_cache_params has a union.  Different parts of this union
      are used for root and non-root caches.  A part with destroying work is
      used only for non-root caches.
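      
      Why initializing the wrong union member is harmful can be shown with
      a simplified model.  struct params_model and params_init_model() are
      hypothetical and much smaller than the real structure; the only point
      is that both union members share the same storage:

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a params struct with a union. */
struct params_model {
	bool is_root_cache;
	union {
		void *memcg_caches[2];		/* used by root caches only */
		struct {
			void *memcg;
			long destroy_work;	/* used by non-root caches only */
		} child;
	} u;
};

/* Initialize the destroying-work member only when it is the live union member. */
static void params_init_model(struct params_model *p, bool is_root)
{
	p->is_root_cache = is_root;
	if (!is_root)
		p->u.child.destroy_work = 0;
}
```

      Without the is_root test, the store to destroy_work would overwrite
      part of memcg_caches[] for a root cache.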
      
        BUG: unable to handle kernel paging request at 0000000fffffffe0
        IP: kmem_cache_alloc+0x41/0x1f0
        Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
        CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: kmem_cache_alloc+0x41/0x1f0
        Call Trace:
         getname_flags.part.34+0x30/0x140
         getname+0x38/0x60
         do_sys_open+0xc5/0x1e0
         SyS_open+0x22/0x30
         system_call_fastpath+0x16/0x1b
        Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
        RIP  [<ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: <stable@vger.kernel.org>	[3.9.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f101a946
  7. 06 Jun, 2013 1 commit
    • P
      arch, mm: Remove tlb_fast_mode() · 29eb7782
      Peter Zijlstra committed
      Since the introduction of preemptible mmu_gather, TLB fast mode has been
      broken. TLB fast mode relies on there being absolutely no concurrency;
      it frees pages first and invalidates TLBs later.
      
      However now we can get concurrency and stuff goes *bang*.
      
      This patch removes all tlb_fast_mode() code; this was found to be a
      better option than trying to patch the hole by entangling tlb
      invalidation with the scheduler.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Reported-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29eb7782
  8. 04 Jun, 2013 1 commit
  9. 28 May, 2013 3 commits
    • M
      mm, sched: Allow uaccess in atomic with pagefault_disable() · 662bbcb2
      Michael S. Tsirkin committed
      This changes might_fault() so that it does not
      trigger a false positive diagnostic for e.g. the following
      sequence:
      
      	spin_lock_irqsave()
      	pagefault_disable()
      	copy_to_user()
      	pagefault_enable()
      	spin_unlock_irqrestore()
      
      In particular vhost wants to do this, to call
      socket ops from under a lock.
      
      There are 3 cases to consider:
      
       - CONFIG_PROVE_LOCKING - might_fault is non-inline
         so it's easy to move the in_atomic test to fix
         up the false positive warning.
      
       - CONFIG_DEBUG_ATOMIC_SLEEP - might_fault
         is currently inline, but we are calling a
         non-inline __might_sleep anyway,
         so let's use the non-inline version of might_fault
         that does the right thing.
      
       - !CONFIG_DEBUG_ATOMIC_SLEEP && !CONFIG_PROVE_LOCKING
         __might_sleep is a nop so might_fault is a nop.
      
      Make this explicit.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1369577426-26721-11-git-send-email-mst@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      662bbcb2
    • M
      mm, sched: Drop voluntary schedule from might_fault() · 114276ac
      Michael S. Tsirkin committed
      might_fault() is called from functions like copy_to_user()
      which most callers expect to be very fast, like a couple of
      instructions.
      
      So functions like memcpy_toiovec() call them many times in a loop.
      
      But might_fault() calls might_sleep() and with CONFIG_PREEMPT_VOLUNTARY
      this results in a function call.
      
      Let's not do this - just call __might_sleep() that produces
      a diagnostic for sleep within atomic, but drop
      might_preempt().
      
      Here's a test sending traffic between the VM and the host,
      host is built with CONFIG_PREEMPT_VOLUNTARY:
      
       before:
      	incoming: 7122.77   Mb/s
      	outgoing: 8480.37   Mb/s
      
       after:
      	incoming: 8619.24   Mb/s
      	outgoing: 9455.42   Mb/s
      
      As a side effect, this fixes an issue pointed
      out by Ingo: might_fault might schedule differently
      depending on PROVE_LOCKING. Now there's no
      preemption point in either case, so it's consistent.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1369577426-26721-10-git-send-email-mst@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      114276ac
    • L
      mm: teach truncate_inode_pages_range() to handle non page aligned ranges · 5a720394
      Lukas Czerner committed
      This commit changes truncate_inode_pages_range() so it can handle non
      page aligned regions of the truncate. Currently we can hit BUG_ON when
      the end of the range is not page aligned, but we can handle unaligned
      start of the range.
      
      Being able to handle non page aligned regions of the page can help file
      system punch_hole implementations and save some work, because once we're
      holding the page we might as well deal with it right away.
      
      In previous commits we've changed the ->invalidatepage() prototype to
      accept a 'length' argument to be able to specify the range to invalidate.
      Now we can use that new ability in truncate_inode_pages_range().
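      
      How an unaligned byte range splits into whole pages plus a partial
      head and tail can be sketched as below.  The sketch assumes 4 KiB
      pages and an inclusive end offset; struct trunc_split and
      split_range_model() are hypothetical, and the "truncate to end of
      file" case is not modelled:

```c
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define MODEL_PAGE_SIZE  (UINT64_C(1) << MODEL_PAGE_SHIFT)

/* How a byte range [lstart, lend] splits into whole and partial pages. */
struct trunc_split {
	uint64_t first_full;	/* index of the first page removed entirely */
	uint64_t end_full;	/* index one past the last page removed entirely */
	uint64_t partial_start;	/* offset in the head page where zeroing starts, 0 if aligned */
	uint64_t partial_end;	/* number of bytes to zero in the tail page, 0 if aligned */
};

/* lend is inclusive: it is the offset of the last byte in the range. */
static struct trunc_split split_range_model(uint64_t lstart, uint64_t lend)
{
	struct trunc_split s;

	s.partial_start = lstart & (MODEL_PAGE_SIZE - 1);
	s.partial_end = (lend + 1) & (MODEL_PAGE_SIZE - 1);
	s.first_full = (lstart + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
	s.end_full = (lend + 1) >> MODEL_PAGE_SHIFT;
	return s;
}
```

      For the range [100, 9999] this gives one whole page (index 1), a
      head page zeroed from offset 100 and a tail page with its first 1808
      bytes zeroed.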
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      5a720394
  10. 25 May, 2013 1 commit
    • C
      mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas · a9ff785e
      Cliff Wickman committed
      A panic can be caused by simply cat'ing /proc/<pid>/smaps while an
      application has a VM_PFNMAP range.  It happened in-house when a
      benchmarker was trying to decipher the memory layout of his program.
      
      /proc/<pid>/smaps and similar walks through a user page table should not
      be looking at VM_PFNMAP areas.
      
      Certain tests in walk_page_range() (specifically split_huge_page_pmd())
      assume that all the mapped PFN's are backed with page structures.  And
      this is not usually true for VM_PFNMAP areas.  This can result in panics
      on kernel page faults when attempting to address those page structures.
      
      There are a half dozen callers of walk_page_range() that walk through a
      task's entire page table (as N. Horiguchi pointed out).  So rather than
      change all of them, this patch changes just walk_page_range() to ignore
      VM_PFNMAP areas.
      
      The logic of hugetlb_vma() is moved back into walk_page_range(), as we
      want to test any vma in the range.
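      
      The effect of the change can be pictured with a toy walker that
      skips flagged areas.  struct vma_node, the flag value and
      count_walkable() are hypothetical and exist only for this sketch:

```c
#include <stddef.h>

#define MODEL_VM_PFNMAP 0x400ul	/* hypothetical flag value for this sketch */

/* Hypothetical stand-in for a linked list of memory areas. */
struct vma_node {
	unsigned long vm_flags;
	struct vma_node *next;
};

/* Count the areas a walker would visit, skipping any marked VM_PFNMAP. */
static size_t count_walkable(const struct vma_node *vma)
{
	size_t n = 0;

	for (; vma; vma = vma->next) {
		if (vma->vm_flags & MODEL_VM_PFNMAP)
			continue;	/* PFNs here may have no page structure */
		n++;
	}
	return n;
}
```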
      
      VM_PFNMAP areas are used by:
      - graphics memory manager   gpu/drm/drm_gem.c
      - global reference unit     sgi-gru/grufile.c
      - sgi special memory        char/mspec.c
      - and probably several out-of-tree modules
      
      [akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub]
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9ff785e