- 04 7月, 2013 22 次提交
-
-
由 Mel Gorman 提交于
Currently kswapd queues dirty pages for writeback if scanning at an elevated priority but the priority kswapd scans at is not related to the number of unqueued dirty encountered. Since commit "mm: vmscan: Flatten kswapd priority loop", the priority is related to the size of the LRU and the zone watermark which is no indication as to whether kswapd should write pages or not. This patch tracks if an excessive number of unqueued dirty pages are being encountered at the end of the LRU. If so, it indicates that dirty pages are being recycled before flusher threads can clean them and flags the zone so that kswapd will start writing pages until the zone is balanced. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: NZlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Page reclaim at priority 0 will scan the entire LRU as priority 0 is considered to be a near OOM condition. Kswapd can reach priority 0 quite easily if it is encountering a large number of pages it cannot reclaim such as pages under writeback. When this happens, kswapd reclaims very aggressively even though there may be no real risk of allocation failure or OOM. This patch prevents kswapd reaching priority 0 and trying to reclaim the world. Direct reclaimers will still reach priority 0 in the event of an OOM situation. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: NZlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
In the past, kswapd makes a decision on whether to compact memory after the pgdat was considered balanced. This more or less worked but it is late to make such a decision and does not fit well now that kswapd makes a decision whether to exit the zone scanning loop depending on reclaim progress. This patch will compact a pgdat if at least the requested number of pages were reclaimed from unbalanced zones for a given priority. If any zone is currently balanced, kswapd will not call compaction as it is expected the necessary pages are already available. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: NZlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX pages have been reclaimed or the pgdat is considered balanced. It then rechecks if it needs to restart at DEF_PRIORITY and whether high-order reclaim needs to be reset. This is not wrong per-se but it is confusing to follow and forcing kswapd to stay at DEF_PRIORITY may require several restarts before it has scanned enough pages to meet the high watermark even at 100% efficiency. This patch irons out the logic a bit by controlling when priority is raised and removing the "goto loop_again". This patch has kswapd raise the scanning priority until it is scanning enough pages that it could meet the high watermark in one shrink of the LRU lists if it is able to reclaim at 100% efficiency. It will not raise the scanning prioirty higher unless it is failing to reclaim any pages. To avoid infinite looping for high-order allocation requests kswapd will not reclaim for high-order allocations when it has reclaimed at least twice the number of pages as the allocation request. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: NZlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Simplistically, the anon and file LRU lists are scanned proportionally depending on the value of vm.swappiness although there are other factors taken into account by get_scan_count(). The patch "mm: vmscan: Limit the number of pages kswapd reclaims" limits the number of pages kswapd reclaims but it breaks this proportional scanning and may evenly shrink anon/file LRUs regardless of vm.swappiness. This patch preserves the proportional scanning and reclaim. It does mean that kswapd will reclaim more than requested but the number of pages will be related to the high watermark. [mhocko@suse.cz: Correct proportional reclaim for memcg and simplify] [kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target] [hannes@cmpxchg.org: Account for already scanned pages properly] Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: NZlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This series does not fix all the current known problems with reclaim but it addresses one important swapping bug when there is background IO. Changelog since V3 - Drop the slab shrink changes in light of Glaubers series and discussions highlighted that there were a number of potential problems with the patch. (mel) - Rebased to 3.10-rc1 Changelog since V2 - Preserve ratio properly for proportional scanning (kamezawa) Changelog since V1 - Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi) - Reformat comment in shrink_page_list (andi) - Clarify some comments (dhillf) - Rework how the proportional scanning is preserved - Add PageReclaim check before kswapd starts writeback - Reset sc.nr_reclaimed on every full zone scan Kswapd and page reclaim behaviour has been screwy in one way or the other for a long time. Very broadly speaking it worked in the far past because machines were limited in memory so it did not have that many pages to scan and it stalled congestion_wait() frequently to prevent it going completely nuts. In recent times it has behaved very unsatisfactorily with some of the problems compounded by the removal of stall logic and the introduction of transparent hugepage support with high-order reclaims. There are many variations of bugs that are rooted in this area. One example is reports of a large copy operations or backup causing the machine to grind to a halt or applications pushed to swap. Sometimes in low memory situations a large percentage of memory suddenly gets reclaimed. In other cases an application starts and kswapd hits 100% CPU usage for prolonged periods of time and so on. There is now talk of introducing features like an extra free kbytes tunable to work around aspects of the problem instead of trying to deal with it. It's compounded by the problem that it can be very workload and machine specific. This series aims at addressing some of the worst of these problems without attempting to fundmentally alter how page reclaim works. Patches 1-2 limits the number of pages kswapd reclaims while still obeying the anon/file proportion of the LRUs it should be scanning. Patches 3-4 control how and when kswapd raises its scanning priority and deletes the scanning restart logic which is tricky to follow. Patch 5 notes that it is too easy for kswapd to reach priority 0 when scanning and then reclaim the world. Down with that sort of thing. Patch 6 notes that kswapd starts writeback based on scanning priority which is not necessarily related to dirty pages. It will have kswapd writeback pages if a number of unqueued dirty pages have been recently encountered at the tail of the LRU. Patch 7 notes that sometimes kswapd should stall waiting on IO to complete to reduce LRU churn and the likelihood that it'll reclaim young clean pages or push applications to swap. It will cause kswapd to block on IO if it detects that pages being reclaimed under writeback are recycling through the LRU before the IO completes. Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they are applied. This was tested using memcached+memcachetest while some background IO was in progress as implemented by the parallel IO tests implement in MM Tests. memcachetest benchmarks how many operations/second memcached can service and it is run multiple times. It starts with no background IO and then re-runs the test with larger amounts of IO in the background to roughly simulate a large copy in progress. The expectation is that the IO should have little or no impact on memcachetest which is running entirely in memory. 3.10.0-rc1 3.10.0-rc1 vanilla lessdisrupt-v4 Ops memcachetest-0M 22155.00 ( 0.00%) 22180.00 ( 0.11%) Ops memcachetest-715M 22720.00 ( 0.00%) 22355.00 ( -1.61%) Ops memcachetest-2385M 3939.00 ( 0.00%) 23450.00 (495.33%) Ops memcachetest-4055M 3628.00 ( 0.00%) 24341.00 (570.92%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) Ops io-duration-2385M 118.00 ( 0.00%) 21.00 ( 82.20%) Ops io-duration-4055M 162.00 ( 0.00%) 36.00 ( 77.78%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-715M 140134.00 ( 0.00%) 18.00 ( 99.99%) Ops swaptotal-2385M 392438.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-4055M 449037.00 ( 0.00%) 27864.00 ( 93.79%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-715M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-2385M 148031.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-4055M 135109.00 ( 0.00%) 0.00 ( 0.00%) Ops minorfaults-0M 1529984.00 ( 0.00%) 1530235.00 ( -0.02%) Ops minorfaults-715M 1794168.00 ( 0.00%) 1613750.00 ( 10.06%) Ops minorfaults-2385M 1739813.00 ( 0.00%) 1609396.00 ( 7.50%) Ops minorfaults-4055M 1754460.00 ( 0.00%) 1614810.00 ( 7.96%) Ops majorfaults-0M 0.00 ( 0.00%) 0.00 ( 0.00%) Ops majorfaults-715M 185.00 ( 0.00%) 180.00 ( 2.70%) Ops majorfaults-2385M 24472.00 ( 0.00%) 101.00 ( 99.59%) Ops majorfaults-4055M 22302.00 ( 0.00%) 229.00 ( 98.97%) Note how the vanilla kernels performance collapses when there is enough IO taking place in the background. This drop in performance is part of what users complain of when they start backups. Note how the swapin and major fault figures indicate that processes were being pushed to swap prematurely. With the series applied, there is no noticable performance drop and while there is still some swap activity, it's tiny. 20 iterations of this test were run in total and averaged. Every 5 iterations, additional IO was generated in the background using dd to measure how the workload was impacted. The 0M, 715M, 2385M and 4055M subblock refer to the amount of IO going on in the background at each iteration. So memcachetest-2385M is reporting how many transactions/second memcachetest recorded on average over 5 iterations while there was 2385M of IO going on in the ground. There are six blocks of information reported here memcachetest is the transactions/second reported by memcachetest. In the vanilla kernel note that performance drops from around 22K/sec to just under 4K/second when there is 2385M of IO going on in the background. This is one type of performance collapse users complain about if a large cp or backup starts in the background io-duration refers to how long it takes for the background IO to complete. It's showing that with the patched kernel that the IO completes faster while not interfering with the memcache workload swaptotal is the total amount of swap traffic. With the patched kernel, the total amount of swapping is much reduced although it is still not zero. swapin in this case is an indication as to whether we are swap trashing. The closer the swapin/swapout ratio is to 1, the worse the trashing is. Note with the patched kernel that there is no swapin activity indicating that all the pages swapped were really inactive unused pages. minorfaults are just minor faults. An increased number of minor faults can indicate that page reclaim is unmapping the pages but not swapping them out before they are faulted back in. With the patched kernel, there is only a small change in minor faults majorfaults are just major faults in the target workload and a high number can indicate that a workload is being prematurely swapped. With the patched kernel, major faults are much reduced. As there are no swapin's recorded so it's not being swapped. The likely explanation is that that libraries or configuration files used by the workload during startup get paged out by the background IO. Overall with the series applied, there is no noticable performance drop due to background IO and while there is still some swap activity, it's tiny and the lack of swapins imply that the swapped pages were inactive and unused. 3.10.0-rc1 3.10.0-rc1 vanilla lessdisrupt-v4 Page Ins 1234608 101892 Page Outs 12446272 11810468 Swap Ins 283406 0 Swap Outs 698469 27882 Direct pages scanned 0 136480 Kswapd pages scanned 6266537 5369364 Kswapd pages reclaimed 1088989 930832 Direct pages reclaimed 0 120901 Kswapd efficiency 17% 17% Kswapd velocity 5398.371 4635.115 Direct efficiency 100% 88% Direct velocity 0.000 117.817 Percentage direct scans 0% 2% Page writes by reclaim 1655843 4009929 Page writes file 957374 3982047 Page writes anon 698469 27882 Page reclaim immediate 5245 1745 Page rescued immediate 0 0 Slabs scanned 33664 25216 Direct inode steals 0 0 Kswapd inode steals 19409 778 Kswapd skipped wait 0 0 THP fault alloc 35 30 THP collapse alloc 472 401 THP splits 27 22 THP fault fallback 0 0 THP collapse fail 0 1 Compaction stalls 0 4 Compaction success 0 0 Compaction failures 0 4 Page migrate success 0 0 Page migrate failure 0 0 Compaction pages isolated 0 0 Compaction migrate scanned 0 0 Compaction free scanned 0 0 Compaction cost 0 0 NUMA PTE updates 0 0 NUMA hint faults 0 0 NUMA hint local faults 0 0 NUMA pages migrated 0 0 AutoNUMA cost 0 0 Unfortunately, note that there is a small amount of direct reclaim due to kswapd no longer reclaiming the world. ftrace indicates that the direct reclaim stalls are mostly harmless with the vast bulk of the stalls incurred by dd 23 tclsh-3367 38 memcachetest-13733 49 memcachetest-12443 57 tee-3368 1541 dd-13826 1981 dd-12539 A consequence of the direct reclaim for dd is that the processes for the IO workload may show a higher system CPU usage. There is also a risk that kswapd not reclaiming the world may mean that it stays awake balancing zones, does not stall on the appropriate events and continually scans pages it cannot reclaim consuming CPU. This will be visible as continued high CPU usage but in my own tests I only saw a single spike lasting less than a second and I did not observe any problems related to reclaim while running the series on my desktop. This patch: The number of pages kswapd can reclaim is bound by the number of pages it scans which is related to the size of the zone and the scanning priority. In many cases the priority remains low because it's reset every SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large number of pages it cannot reclaim, it will raise the priority and potentially discard a large percentage of the zone as sc->nr_to_reclaim is ULONG_MAX. The user-visible effect is a reclaim "spike" where a large percentage of memory is suddenly freed. It would be bad enough if this was just unused memory but because of how anon/file pages are balanced it is possible that applications get pushed to swap unnecessarily. This patch limits the number of pages kswapd will reclaim to the high watermark. Reclaim will still overshoot due to it not being a hard limit as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it prevents kswapd reclaiming the world at higher priorities. The number of pages it reclaims is not adjusted for high-order allocations as kswapd will reclaim excessively if it is to balance zones for high-order allocations. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Tested-by: NZlatko Calusic <zcalusic@bitsync.net> Cc: dormando <dormando@rydia.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
When memory hotplug is triggered, we call pageset_init() on per-cpu-pagesets which both contain pages and are in use, causing both the leakage of those pages and (potentially) bad behaviour if a page is allocated from a pageset while it is being cleared. Avoid this by factoring out pageset_set_high_and_batch() (which contains all needed logic too set a pageset's ->high and ->batch inrespective of system state) from zone_pageset_init() and using the new pageset_set_high_and_batch() instead of zone_pageset_init() in zone_pcp_update(). Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Previously, zone_pcp_update() called pageset_set_batch() directly, essentially assuming that percpu_pagelist_fraction == 0. Correct this by calling zone_pageset_init(), which chooses the appropriate ->batch and ->high calculations. Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Simply moves calculation of the new 'high' value outside the for_each_possible_cpu() loop, as it does not depend on the cpu. Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a percpu pageset based on a zone's ->managed_pages. We don't need to drain the entire percpu pageset just to modify these fields. This lets us avoid calling setup_pageset() (and the draining required to call it) and instead allows simply setting the fields' values (with some attention paid to memory barriers to prevent the relationship between ->batch and ->high from being thrown off). This does change the behavior of zone_pcp_update() as the percpu pagesets will not be drained when zone_pcp_update() is called (they will end up being shrunk, not completely drained, later when a 0-order page is freed in free_hot_cold_page()). Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
pcp->batch could change at any point, avoid relying on it being a stable value. Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Introduce pageset_update() to perform a safe transision from one set of pcp->{batch,high} to a new set using memory barriers. This ensures that batch is always set to a safe value (1) prior to updating high, and ensure that high is fully updated before setting the real value of batch. It avoids ->batch ever rising above ->high. Suggested by Gilad Ben-Yossef in these threads: https://lkml.org/lkml/2013/4/9/23 https://lkml.org/lkml/2013/4/10/49 Also reproduces his proposed comment. Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: NGilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
Because we are going to rely upon a careful transision between old and new ->high and ->batch values using memory barriers and will remove stop_machine(), we need to prevent multiple updaters from interweaving their memory writes. Add a simple mutex to protect both update loops. Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cody P Schafer 提交于
"Problems" with the current code: 1: there is a lack of synchronization in setting ->high and ->batch in percpu_pagelist_fraction_sysctl_handler() 2: stop_machine() in zone_pcp_update() is unnecissary. 3: zone_pcp_update() does not consider the case where percpu_pagelist_fraction is non-zero To fix: 1: add memory barriers, a safe ->batch value, an update side mutex when updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users that expect a stable value. 2: avoid draining pages in zone_pcp_update(), rely upon the memory barriers added to fix #1 3: factor out quite a few functions, and then call the appropriate one. Note that it results in a change to the behavior of zone_pcp_update(), which is used by memory_hotplug. I'm rather certain that I've diserned (and preserved) the essential behavior (changing ->high and ->batch), and only eliminated unneeded actions (draining the per cpu pages), but this may not be the case. Further note that the draining of pages that previously took place in zone_pcp_update() occured after repeated draining when attempting to offline a page, and after the offline has "succeeded". It appears that the draining was added to zone_pcp_update() to avoid refactoring setup_pageset() into 2 funtions. This patch: Creates pageset_set_batch() for use in setup_pageset(). pageset_set_batch() imitates the functionality of setup_pagelist_highmark(), but uses the boot time (percpu_pagelist_fraction == 0) calculations for determining ->high based on ->batch. Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Libin 提交于
(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented as a inline funcion vma_pages() in linux/mm.h, so using it. Signed-off-by: NLibin <huawei.libin@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Swap subsystem does lazy swap slot free with expecting the page would be swapped out again so we can avoid unnecessary write. But the problem in in-memory swap(ex, zram) is that it consumes memory space until vm_swap_full(ie, used half of all of swap device) condition meet. It could be bad if we use multiple swap device, small in-memory swap and big storage swap or in-memory swap alone. This patch makes swap subsystem free swap slot as soon as swap-read is completed and make the swapcache page dirty so the page should be written out the swap device to reclaim it. It means we never lose it. I tested this patch with kernel compile workload. 1. before compile time : 9882.42 zram max wasted space by fragmentation: 13471881 byte memory space consumed by zram: 174227456 byte the number of slot free notify: 206684 2. after compile time : 9653.90 zram max wasted space by fragmentation: 11805932 byte memory space consumed by zram: 154001408 byte the number of slot free notify: 426972 [akpm@linux-foundation.org: tweak comment text] [artem.savkov@gmail.com: fix BUG due to non-swapcache pages in end_swap_bio_read()] [akpm@linux-foundation.org: invert unlikely() test, augment comment, 80-col cleanup] Signed-off-by: NDan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NArtem Savkov <artem.savkov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Konrad Rzeszutek Wilk <konrad@darnok.org> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
For processes that have detached their mm's, task_in_mem_cgroup() unnecessarily takes task_lock() when rcu_read_lock() is all that is necessary to call mem_cgroup_from_task(). While we're here, switch task_in_mem_cgroup() to return bool. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pavel Emelyanov 提交于
The soft-dirty is a bit on a PTE which helps to track which pages a task writes to. In order to do this tracking one should 1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs) 2. Wait some time. 3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries) To do this tracking, the writable bit is cleared from PTEs when the soft-dirty bit is. Thus, after this, when the task tries to modify a page at some virtual address the #PF occurs and the kernel sets the soft-dirty bit on the respective PTE. Note, that although all the task's address space is marked as r/o after the soft-dirty bits clear, the #PF-s that occur after that are processed fast. This is so, since the pages are still mapped to physical memory, and thus all the kernel does is finds this fact out and puts back writable, dirty and soft-dirty bits on the PTE. Another thing to note, is that when mremap moves PTEs they are marked with soft-dirty as well, since from the user perspective mremap modifies the virtual memory at mremap's new address. Signed-off-by: NPavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 7月, 2013 1 次提交
-
-
由 Jie Liu 提交于
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: NJie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 29 6月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 26 6月, 2013 1 次提交
-
-
由 Zhang Yi 提交于
The futex_keys of process shared futexes are generated from the page offset, the mapping host and the mapping index of the futex user space address. This should result in an unique identifier for each futex. Though this is not true when futexes are located in different subpages of an hugepage. The reason is, that the mapping index for all those futexes evaluates to the index of the base page of the hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and another one at offset PAGE_SIZE of the same hugepage mapping have identical futex_keys. This happens because the futex code blindly uses page->index. Steps to reproduce the bug: 1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0 and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs mapping. The mutexes must be initialized as PTHREAD_PROCESS_SHARED because PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as their keys solely depend on the user space address. 2. Lock mutex1 and mutex2 3. Create thread1 and in the thread function lock mutex1, which results in thread1 blocking on the locked mutex1. 4. Create thread2 and in the thread function lock mutex2, which results in thread2 blocking on the locked mutex2. 5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2 still blocks on mutex2 because the futex_key points to mutex1. To solve this issue we need to take the normal page index of the page which contains the futex into account, if the futex is in an hugetlbfs mapping. In other words, we calculate the normal page mapping index of the subpage in the hugetlbfs mapping. Mappings which are not based on hugetlbfs are not affected and still use page->index. Thanks to Mel Gorman who provided a patch for adding proper evaluation functions to the hugetlbfs code to avoid exposing hugetlbfs specific details to the futex code. [ tglx: Massaged changelog ] Signed-off-by: NZhang Yi <zhang.yi20@zte.com.cn> Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn> Tested-by: NMa Chenggong <ma.chenggong@zte.com.cn> Reviewed-by: N'Mel Gorman' <mgorman@suse.de> Acked-by: N'Darren Hart' <dvhart@linux.intel.com> Cc: 'Peter Zijlstra' <peterz@infradead.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 14 6月, 2013 2 次提交
-
-
由 Steve Capper 提交于
The huge_pte_alloc, huge_pte_offset and follow_huge_p[mu]d functions in x86/mm/hugetlbpage.c do not rely on any architecture specific knowledge other than the fact that pmds and puds can be treated as huge ptes. To allow other architectures to use this code (and reduce the need for code duplication), this patch copies these functions into mm, replaces the use of pud_large with pud_huge and provides a config flag to activate them: CONFIG_ARCH_WANT_GENERAL_HUGETLB If CONFIG_ARCH_WANT_HUGE_PMD_SHARE is also active then the huge_pmd_share code will be called by huge_pte_alloc (othewise we call pmd_alloc and skip the sharing code). Signed-off-by: NSteve Capper <steve.capper@linaro.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Steve Capper 提交于
Under x86, multiple puds can be made to reference the same bank of huge pmds provided that they represent a full PUD_SIZE of shared huge memory that is aligned to a PUD_SIZE boundary. The code to share pmds does not require any architecture specific knowledge other than the fact that pmds can be indexed, thus can be beneficial to some other architectures. This patch copies the huge pmd sharing (and unsharing) logic from x86/ to mm/ and introduces a new config option to activate it: CONFIG_ARCH_WANTS_HUGE_PMD_SHARE Signed-off-by: NSteve Capper <steve.capper@linaro.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 13 6月, 2013 7 次提交
-
-
由 Sasha Levin 提交于
Sasha Levin noticed that the warning introduced by commit 6286ae97 ("slab: Return NULL for oversized allocations) is being triggered: WARNING: CPU: 15 PID: 21519 at mm/slab_common.c:376 kmalloc_slab+0x2f/0xb0() can: request_module (can-proto-4) failed. mpoa: proc_mpc_write: could not parse '' Modules linked in: CPU: 15 PID: 21519 Comm: trinity-child15 Tainted: G W 3.10.0-rc4-next-20130607-sasha-00011-gcd78395-dirty #2 0000000000000009 ffff880020a95e30 ffffffff83ff4041 0000000000000000 ffff880020a95e68 ffffffff8111fe12 fffffffffffffff0 00000000000082d0 0000000000080000 0000000000080000 0000000001400000 ffff880020a95e78 Call Trace: [<ffffffff83ff4041>] dump_stack+0x4e/0x82 [<ffffffff8111fe12>] warn_slowpath_common+0x82/0xb0 [<ffffffff8111fe55>] warn_slowpath_null+0x15/0x20 [<ffffffff81243dcf>] kmalloc_slab+0x2f/0xb0 [<ffffffff81278d54>] __kmalloc+0x24/0x4b0 [<ffffffff8196ffe3>] ? security_capable+0x13/0x20 [<ffffffff812a26b7>] ? pipe_fcntl+0x107/0x210 [<ffffffff812a26b7>] pipe_fcntl+0x107/0x210 [<ffffffff812b7ea0>] ? fget_raw_light+0x130/0x3f0 [<ffffffff812aa5fb>] SyS_fcntl+0x60b/0x6a0 [<ffffffff8403ca98>] tracesys+0xe1/0xe6 Andrew Morton writes: __GFP_NOWARN is frequently used by kernel code to probe for "how big an allocation can I get". That's a bit lame, but it's used on slow paths and is pretty simple. However, SLAB would still spew a warning when a big allocation happens if the __GFP_NOWARN flag is _not_ set to expose kernel bugs. Signed-off-by: NSasha Levin <sasha.levin@oracle.com> [ penberg@kernel.org: improve changelog ] Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Johannes Weiner 提交于
The lockless reclaim hierarchy iterator currently has a misplaced barrier that can lead to use-after-free crashes. The reclaim hierarchy iterator consist of a sequence count and a position pointer that are read and written locklessly, with memory barriers enforcing ordering. The write side sets the position pointer first, then updates the sequence count to "publish" the new position. Likewise, the read side must read the sequence count first, then the position. If the sequence count is up to date, it's guaranteed that the position is up to date as well: writer: reader: iter->position = position if iter->sequence == expected: smp_wmb() smp_rmb() iter->sequence = sequence position = iter->position However, the read side barrier is currently misplaced, which can lead to dereferencing stale position pointers that no longer point to valid memory. Fix this. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NTejun Heo <tj@kernel.org> Reviewed-by: NTejun Heo <tj@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Glauber Costa <glommer@parallels.com> Cc: <stable@kernel.org> [3.10+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Akinobu Mita 提交于
The bitmap accessed by bitops must have enough size to hold the required numbers of bits rounded up to a multiple of BITS_PER_LONG. And the bitmap must not be zeroed by memset() if the number of bits cleared is not a multiple of BITS_PER_LONG. This fixes incorrect zeroing and allocation size for frontswap_map. The incorrect zeroing part doesn't cause any problem because frontswap_map is freed just after zeroing. But the wrongly calculated allocation size may cause the problem. For 32bit systems, the allocation size of frontswap_map is about twice as large as required size. For 64bit systems, the allocation size is smaller than requeired if the number of bits is not a multiple of BITS_PER_LONG. Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
When we have a page fault for the address which is backed by a hugepage under migration, the kernel can't wait correctly and do busy looping on hugepage fault until the migration finishes. As a result, users who try to kick hugepage migration (via soft offlining, for example) occasionally experience long delay or soft lockup. This is because pte_offset_map_lock() can't get a correct migration entry or a correct page table lock for hugepage. This patch introduces migration_entry_wait_huge() to solve this. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [2.6.35+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tomasz Stanislawski 提交于
The watermark check consists of two sub-checks. The first one is: if (free_pages <= min + lowmem_reserve) return false; The check assures that there is minimal amount of RAM in the zone. If CMA is used then the free_pages is reduced by the number of free pages in CMA prior to the over-mentioned check. if (!(alloc_flags & ALLOC_CMA)) free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES); This prevents the zone from being drained from pages available for non-movable allocations. The second check prevents the zone from getting too fragmented. for (o = 0; o < order; o++) { free_pages -= z->free_area[o].nr_free << o; min >>= 1; if (free_pages <= min) return false; } The field z->free_area[o].nr_free is equal to the number of free pages including free CMA pages. Therefore the CMA pages are subtracted twice. This may cause a false positive fail of __zone_watermark_ok() if the CMA area gets strongly fragmented. In such a case there are many 0-order free pages located in CMA. Those pages are subtracted twice therefore they will quickly drain free_pages during the check against fragmentation. The test fails even though there are many free non-cma pages in the zone. This patch fixes this issue by subtracting CMA pages only for a purpose of (free_pages <= min + lowmem_reserve) check. Laura said: We were observing allocation failures of higher order pages (order 5 = 128K typically) under tight memory conditions resulting in driver failure. The output from the page allocation failure showed plenty of free pages of the appropriate order/type/zone and mostly CMA pages in the lower orders. For full disclosure, we still observed some page allocation failures even after applying the patch but the number was drastically reduced and those failures were attributed to fragmentation/other system issues. Signed-off-by: NTomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Tested-by: NLaura Abbott <lauraa@codeaurora.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: NMinchan Kim <minchan@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rafael Aquini 提交于
read_swap_cache_async() can race against get_swap_page(), and stumble across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought into the swapcache yet. This transient swap_map state is expected to be transitory, but the actual placement of discard at scan_swap_map() inserts a wait for I/O completion thus making the thread at read_swap_cache_async() to loop around its -EEXIST case, while the other end at get_swap_page() is scheduled away at scan_swap_map(). This can leave the system deadlocked if the I/O completion happens to be waiting on the CPU waitqueue where read_swap_cache_async() is busy looping and !CONFIG_PREEMPT. This patch introduces a cond_resched() call to make the aforementioned read_swap_cache_async() busy loop condition to bail out when necessary, thus avoiding the subtle race window. Signed-off-by: NRafael Aquini <aquini@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Vagin 提交于
struct memcg_cache_params has a union. Different parts of this union are used for root and non-root caches. A part with destroying work is used only for non-root caches. BUG: unable to handle kernel paging request at 0000000fffffffe0 IP: kmem_cache_alloc+0x41/0x1f0 Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G D 3.10.0-rc1+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: kmem_cache_alloc+0x41/0x1f0 Call Trace: getname_flags.part.34+0x30/0x140 getname+0x38/0x60 do_sys_open+0xc5/0x1e0 SyS_open+0x22/0x30 system_call_fastpath+0x16/0x1b Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49 RIP [<ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0 Signed-off-by: NAndrey Vagin <avagin@openvz.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Glauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Li Zefan <lizefan@huawei.com> Cc: <stable@vger.kernel.org> [3.9.x] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 6月, 2013 1 次提交
-
-
由 Peter Zijlstra 提交于
Since the introduction of preemptible mmu_gather TLB fast mode has been broken. TLB fast mode relies on there being absolutely no concurrency; it frees pages first and invalidates TLBs later. However now we can get concurrency and stuff goes *bang*. This patch removes all tlb_fast_mode() code; it was found the better option vs trying to patch the hole by entangling tlb invalidation with the scheduler. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Reported-by: NMax Filippov <jcmvbkbc@gmail.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 6月, 2013 1 次提交
-
-
由 Stephen Rothwell 提交于
Ever since commit 45f035ab ("CONFIG_HOTPLUG should be always on"), it has been basically impossible to build a kernel with CONFIG_HOTPLUG turned off. Remove all the remaining references to it. Cc: Russell King <linux@arm.linux.org.uk> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Acked-by: NMauro Carvalho Chehab <mchehab@redhat.com> Acked-by: NHans Verkuil <hans.verkuil@cisco.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 28 5月, 2013 3 次提交
-
-
由 Michael S. Tsirkin 提交于
This changes might_fault() so that it does not trigger a false positive diagnostic for e.g. the following sequence: spin_lock_irqsave() pagefault_disable() copy_to_user() pagefault_enable() spin_unlock_irqrestore() In particular vhost wants to do this, to call socket ops from under a lock. There are 3 cases to consider: - CONFIG_PROVE_LOCKING - might_fault is non-inline so it's easy to move the in_atomic test to fix up the false positive warning. - CONFIG_DEBUG_ATOMIC_SLEEP - might_fault is currently inline, but we are calling a non-inline __might_sleep anyway, so let's use the non-line version of might_fault that does the right thing. - !CONFIG_DEBUG_ATOMIC_SLEEP && !CONFIG_PROVE_LOCKING __might_sleep is a nop so might_fault is a nop. Make this explicit. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1369577426-26721-11-git-send-email-mst@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Michael S. Tsirkin 提交于
might_fault() is called from functions like copy_to_user() which most callers expect to be very fast, like a couple of instructions. So functions like memcpy_toiovec() call them many times in a loop. But might_fault() calls might_sleep() and with CONFIG_PREEMPT_VOLUNTARY this results in a function call. Let's not do this - just call __might_sleep() that produces a diagnostic for sleep within atomic, but drop might_preempt(). Here's a test sending traffic between the VM and the host, host is built with CONFIG_PREEMPT_VOLUNTARY: before: incoming: 7122.77 Mb/s outgoing: 8480.37 Mb/s after: incoming: 8619.24 Mb/s outgoing: 9455.42 Mb/s As a side effect, this fixes an issue pointed out by Ingo: might_fault might schedule differently depending on PROVE_LOCKING. Now there's no preemption point in both cases, so it's consistent. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1369577426-26721-10-git-send-email-mst@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Lukas Czerner 提交于
This commit changes truncate_inode_pages_range() so it can handle non page aligned regions of the truncate. Currently we can hit BUG_ON when the end of the range is not page aligned, but we can handle unaligned start of the range. Being able to handle non page aligned regions of the page can help file system punch_hole implementations and save some work, because once we're holding the page we might as well deal with it right away. In previous commits we've changed ->invalidatepage() prototype to accept 'length' argument to be able to specify range to invalidate. No we can use that new ability in truncate_inode_pages_range(). Signed-off-by: NLukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 25 5月, 2013 1 次提交
-
-
由 Cliff Wickman 提交于
A panic can be caused by simply cat'ing /proc/<pid>/smaps while an application has a VM_PFNMAP range. It happened in-house when a benchmarker was trying to decipher the memory layout of his program. /proc/<pid>/smaps and similar walks through a user page table should not be looking at VM_PFNMAP areas. Certain tests in walk_page_range() (specifically split_huge_page_pmd()) assume that all the mapped PFN's are backed with page structures. And this is not usually true for VM_PFNMAP areas. This can result in panics on kernel page faults when attempting to address those page structures. There are a half dozen callers of walk_page_range() that walk through a task's entire page table (as N. Horiguchi pointed out). So rather than change all of them, this patch changes just walk_page_range() to ignore VM_PFNMAP areas. The logic of hugetlb_vma() is moved back into walk_page_range(), as we want to test any vma in the range. VM_PFNMAP areas are used by: - graphics memory manager gpu/drm/drm_gem.c - global reference unit sgi-gru/grufile.c - sgi special memory char/mspec.c - and probably several out-of-tree modules [akpm@linux-foundation.org: remove now-unused hugetlb_vma() stub] Signed-off-by: NCliff Wickman <cpw@sgi.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Sterba <dsterba@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-