- 01 Nov, 2011 26 commits
-
-
Committed by Mel Gorman
Testing from the XFS folk revealed that there is still too much I/O from the end of the LRU in kswapd. Previously it was considered acceptable by VM people for a small number of pages to be written back from reclaim, with testing generally showing about 0.3% of pages reclaimed were written back (higher if memory was low). That writing back a small number of pages is ok has been heavily disputed for quite some time and Dave Chinner explained it well:

    It doesn't have to be a very high number to be a problem. IO is orders of magnitude slower than the CPU time it takes to flush a page, so the cost of making a bad flush decision is very high. And single page writeback from the LRU is almost always a bad flush decision.

To complicate matters, filesystems respond very differently to requests from reclaim according to Christoph Hellwig:

    xfs tries to write it back if the requester is kswapd
    ext4 ignores the request if it's a delayed allocation
    btrfs ignores the request

As a result, each filesystem has different performance characteristics when under memory pressure and there are many pages being dirtied. In some cases, the request is ignored entirely so the VM cannot depend on the IO being dispatched.

The objective of this series is to reduce writing of filesystem-backed pages from reclaim, play nicely with writeback that is already in progress and throttle reclaim appropriately when writeback pages are encountered. The assumption is that the flushers will always write pages faster than if reclaim issues the IO.

A secondary goal is to avoid the problem whereby direct reclaim splices two potentially deep call stacks together.

There is a potential new problem as reclaim has less control over how long before a page in a particular zone or container is cleaned and direct reclaimers depend on kswapd or flusher threads to do the necessary work. However, as filesystems sometimes ignore direct reclaim requests already, it is not expected to be a serious issue.
Patch 1 disables writeback of filesystem pages from direct reclaim entirely. Anonymous pages are still written.

Patch 2 removes dead code in lumpy reclaim as it is no longer able to synchronously write pages. This hurts lumpy reclaim but there is an expectation that compaction is used for hugepage allocations these days and lumpy reclaim's days are numbered.

Patches 3-4 add warnings to XFS and ext4 if called from direct reclaim. With patch 1, this "never happens" and is intended to catch regressions in this logic in the future.

Patch 5 disables writeback of filesystem pages from kswapd unless the priority is raised to the point where kswapd is considered to be in trouble.

Patch 6 throttles reclaimers if too many dirty pages are being encountered and the zones or backing devices are congested.

Patch 7 invalidates dirty pages found at the end of the LRU so they are reclaimed quickly after being written back rather than waiting for a reclaimer to find them.

I consider this series to be orthogonal to the writeback work but it is worth noting that the writeback work affects the viability of patch 8 in particular.

I tested this on ext4 and xfs using fs_mark, a simple writeback test based on dd and a micro benchmark that does a streaming write to a large mapping (exercises use-once LRU logic) followed by streaming writes to a mix of anonymous and file-backed mappings. The command line for fs_mark when booted with 512M looked something like

    ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

The number of files was adjusted depending on the amount of available memory so that the files created was about 3xRAM. For multiple threads, the -d switch is specified multiple times.

The test machine is x86-64 with an older generation of AMD processor with 4 cores. The underlying storage was 4 disks configured as RAID-0 as this was the best configuration of storage I had available. Swap is on a separate disk.
Dirty ratio was tuned to 40% instead of the default of 20%.

Testing was run with and without monitors to both verify that the patches were operating as expected and that any performance gain was real and not due to interference from monitors.

Here is a summary of results based on testing XFS.

512M1P-xfs     Files/s mean                 32.69 ( 0.00%)  34.44 ( 5.08%)
512M1P-xfs     Elapsed Time fsmark          51.41           48.29
512M1P-xfs     Elapsed Time simple-wb       114.09          108.61
512M1P-xfs     Elapsed Time mmap-strm       113.46          109.34
512M1P-xfs     Kswapd efficiency fsmark     62%             63%
512M1P-xfs     Kswapd efficiency simple-wb  56%             61%
512M1P-xfs     Kswapd efficiency mmap-strm  44%             42%
512M-xfs       Files/s mean                 30.78 ( 0.00%)  35.94 (14.36%)
512M-xfs       Elapsed Time fsmark          56.08           48.90
512M-xfs       Elapsed Time simple-wb       112.22          98.13
512M-xfs       Elapsed Time mmap-strm       219.15          196.67
512M-xfs       Kswapd efficiency fsmark     54%             56%
512M-xfs       Kswapd efficiency simple-wb  54%             55%
512M-xfs       Kswapd efficiency mmap-strm  45%             44%
512M-4X-xfs    Files/s mean                 30.31 ( 0.00%)  33.33 ( 9.06%)
512M-4X-xfs    Elapsed Time fsmark          63.26           55.88
512M-4X-xfs    Elapsed Time simple-wb       100.90          90.25
512M-4X-xfs    Elapsed Time mmap-strm       261.73          255.38
512M-4X-xfs    Kswapd efficiency fsmark     49%             50%
512M-4X-xfs    Kswapd efficiency simple-wb  54%             56%
512M-4X-xfs    Kswapd efficiency mmap-strm  37%             36%
512M-16X-xfs   Files/s mean                 60.89 ( 0.00%)  65.22 ( 6.64%)
512M-16X-xfs   Elapsed Time fsmark          67.47           58.25
512M-16X-xfs   Elapsed Time simple-wb       103.22          90.89
512M-16X-xfs   Elapsed Time mmap-strm       237.09          198.82
512M-16X-xfs   Kswapd efficiency fsmark     45%             46%
512M-16X-xfs   Kswapd efficiency simple-wb  53%             55%
512M-16X-xfs   Kswapd efficiency mmap-strm  33%             33%

Up until 512-4X, the FSmark improvements were statistically significant. For the 4X and 16X tests the results were within standard deviations but just barely. The time to completion for all tests is improved which is an important result. In general, kswapd efficiency is not affected by skipping dirty pages.
1024M1P-xfs    Files/s mean                 39.09 ( 0.00%)  41.15 ( 5.01%)
1024M1P-xfs    Elapsed Time fsmark          84.14           80.41
1024M1P-xfs    Elapsed Time simple-wb       210.77          184.78
1024M1P-xfs    Elapsed Time mmap-strm       162.00          160.34
1024M1P-xfs    Kswapd efficiency fsmark     69%             75%
1024M1P-xfs    Kswapd efficiency simple-wb  71%             77%
1024M1P-xfs    Kswapd efficiency mmap-strm  43%             44%
1024M-xfs      Files/s mean                 35.45 ( 0.00%)  37.00 ( 4.19%)
1024M-xfs      Elapsed Time fsmark          94.59           91.00
1024M-xfs      Elapsed Time simple-wb       229.84          195.08
1024M-xfs      Elapsed Time mmap-strm       405.38          440.29
1024M-xfs      Kswapd efficiency fsmark     79%             71%
1024M-xfs      Kswapd efficiency simple-wb  74%             74%
1024M-xfs      Kswapd efficiency mmap-strm  39%             42%
1024M-4X-xfs   Files/s mean                 32.63 ( 0.00%)  35.05 ( 6.90%)
1024M-4X-xfs   Elapsed Time fsmark          103.33          97.74
1024M-4X-xfs   Elapsed Time simple-wb       204.48          178.57
1024M-4X-xfs   Elapsed Time mmap-strm       528.38          511.88
1024M-4X-xfs   Kswapd efficiency fsmark     81%             70%
1024M-4X-xfs   Kswapd efficiency simple-wb  73%             72%
1024M-4X-xfs   Kswapd efficiency mmap-strm  39%             38%
1024M-16X-xfs  Files/s mean                 42.65 ( 0.00%)  42.97 ( 0.74%)
1024M-16X-xfs  Elapsed Time fsmark          103.11          99.11
1024M-16X-xfs  Elapsed Time simple-wb       200.83          178.24
1024M-16X-xfs  Elapsed Time mmap-strm       397.35          459.82
1024M-16X-xfs  Kswapd efficiency fsmark     84%             69%
1024M-16X-xfs  Kswapd efficiency simple-wb  74%             73%
1024M-16X-xfs  Kswapd efficiency mmap-strm  39%             40%

All FSMark tests up to 16X had statistically significant improvements. For the most part, tests are completing faster with the exception of the streaming writes to a mixture of anonymous and file-backed mappings which were slower in two cases.

In the cases where the mmap-strm tests were slower, there was more swapping due to dirty pages being skipped. The number of additional pages swapped is almost identical to the fewer number of pages written from reclaim. In other words, roughly the same number of pages were reclaimed but swapping was slower.
As the test is a bit unrealistic and stresses memory heavily, the small shift is acceptable.

4608M1P-xfs    Files/s mean                 29.75 ( 0.00%)  30.96 ( 3.91%)
4608M1P-xfs    Elapsed Time fsmark          512.01          492.15
4608M1P-xfs    Elapsed Time simple-wb       618.18          566.24
4608M1P-xfs    Elapsed Time mmap-strm       488.05          465.07
4608M1P-xfs    Kswapd efficiency fsmark     93%             86%
4608M1P-xfs    Kswapd efficiency simple-wb  88%             84%
4608M1P-xfs    Kswapd efficiency mmap-strm  46%             45%
4608M-xfs      Files/s mean                 27.60 ( 0.00%)  28.85 ( 4.33%)
4608M-xfs      Elapsed Time fsmark          555.96          532.34
4608M-xfs      Elapsed Time simple-wb       659.72          571.85
4608M-xfs      Elapsed Time mmap-strm       1082.57         1146.38
4608M-xfs      Kswapd efficiency fsmark     89%             91%
4608M-xfs      Kswapd efficiency simple-wb  88%             82%
4608M-xfs      Kswapd efficiency mmap-strm  48%             46%
4608M-4X-xfs   Files/s mean                 26.00 ( 0.00%)  27.47 ( 5.35%)
4608M-4X-xfs   Elapsed Time fsmark          592.91          564.00
4608M-4X-xfs   Elapsed Time simple-wb       616.65          575.07
4608M-4X-xfs   Elapsed Time mmap-strm       1773.02         1631.53
4608M-4X-xfs   Kswapd efficiency fsmark     90%             94%
4608M-4X-xfs   Kswapd efficiency simple-wb  87%             82%
4608M-4X-xfs   Kswapd efficiency mmap-strm  43%             43%
4608M-16X-xfs  Files/s mean                 26.07 ( 0.00%)  26.42 ( 1.32%)
4608M-16X-xfs  Elapsed Time fsmark          602.69          585.78
4608M-16X-xfs  Elapsed Time simple-wb       606.60          573.81
4608M-16X-xfs  Elapsed Time mmap-strm       1549.75         1441.86
4608M-16X-xfs  Kswapd efficiency fsmark     98%             98%
4608M-16X-xfs  Kswapd efficiency simple-wb  88%             82%
4608M-16X-xfs  Kswapd efficiency mmap-strm  44%             42%

Unlike the other tests, the fsmark results are not statistically significant but the min and max times are both improved and for the most part, tests completed faster.

There are other indications that this is an improvement as well. For example, in the vast majority of cases, there were fewer pages scanned by direct reclaim implying in many cases that stalls due to direct reclaim are reduced.
KSwapd is scanning more due to skipping dirty pages which is unfortunate but the CPU usage is still acceptable.

In an earlier set of tests, I used blktrace and in almost all cases throughput throughout the entire test was higher. However, I ended up discarding those results as recording blktrace data was too heavy for my liking.

On a laptop, I plugged in a USB stick and ran a similar set of tests using it as backing storage. A desktop environment was running and for the entire duration of the tests, firefox and gnome terminal were launching and exiting to vaguely simulate a user.

1024M-xfs      Files/s mean                 0.41 ( 0.00%)   0.44 ( 6.82%)
1024M-xfs      Elapsed Time fsmark          2053.52         1641.03
1024M-xfs      Elapsed Time simple-wb       1229.53         768.05
1024M-xfs      Elapsed Time mmap-strm       4126.44         4597.03
1024M-xfs      Kswapd efficiency fsmark     84%             85%
1024M-xfs      Kswapd efficiency simple-wb  92%             81%
1024M-xfs      Kswapd efficiency mmap-strm  60%             51%
1024M-xfs      Avg wait ms fsmark           5404.53         4473.87
1024M-xfs      Avg wait ms simple-wb        2541.35         1453.54
1024M-xfs      Avg wait ms mmap-strm        3400.25         3852.53

The mmap-strm results were hurt because firefox launching had a tendency to push the test out of memory. On the positive side, firefox launched marginally faster with the patches applied. Time to completion for many tests was faster but more importantly, the "Avg wait" time as measured by iostat was far lower implying the system would be more responsive. It was also the case that "Avg wait ms" on the root filesystem was lower. I tested it manually and while the system felt slightly more responsive while copying data to a USB stick, it was marginal enough that it could be my imagination.

This patch: do not writeback filesystem pages in direct reclaim.

When kswapd is failing to keep zones above the min watermark, a process will enter direct reclaim in the same manner kswapd does. If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage.

This causes two problems.
First, it can result in very deep call stacks, particularly if the target storage or filesystem are complex. Some filesystems ignore write requests from direct reclaim as a result. The second is that a single-page flush is inefficient in terms of IO. While there is an expectation that the elevator will merge requests, this does not always happen. Quoting Christoph Hellwig:

    The elevator has a relatively small window it can operate on, and can never fix up a bad large scale writeback pattern.

This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd. Anonymous pages are still written to swap as there is not the equivalent of a flusher thread for anonymous pages. If the dirty pages cannot be written back, they are placed back on the LRU lists. There is now a direct dependency on dirty page balancing to prevent too many pages in the system being dirtied which would prevent reclaim making forward progress.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
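The rule this patch describes (only kswapd may write back dirty filesystem pages, anonymous pages are still written, and everything else goes back on the LRU) can be illustrated with a small userspace sketch. This is not the kernel code: the struct, the enum and the function name reclaim_decision() are hypothetical and exist only to make the decision table explicit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a page seen by reclaim. */
struct page_info {
    bool dirty;        /* page has unwritten changes */
    bool file_backed;  /* true: filesystem page, false: anonymous page */
};

/* Hypothetical outcomes of looking at one page during reclaim. */
enum reclaim_action {
    RECLAIM_PAGE,  /* clean page, can be freed */
    WRITE_PAGE,    /* reclaim issues the write itself */
    KEEP_ON_LRU    /* leave the page for the flusher threads */
};

/*
 * Decide what to do with a page. current_is_kswapd stands in for the
 * "is current kswapd?" check mentioned in the commit message.
 */
enum reclaim_action reclaim_decision(const struct page_info *page,
                                     bool current_is_kswapd)
{
    assert(page != NULL);

    if (!page->dirty)
        return RECLAIM_PAGE;

    /* Direct reclaim no longer writes back filesystem pages. */
    if (page->file_backed && !current_is_kswapd)
        return KEEP_ON_LRU;

    /* kswapd, or an anonymous page headed for swap. */
    return WRITE_PAGE;
}
```

In this sketch a direct reclaimer that meets a dirty file page simply moves on, which is why the commit notes the new dependency on dirty page balancing.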
-
Committed by Christoph Lameter
Add comments to explain the page statistics field in the mm_struct.

[akpm@linux-foundation.org: add missing ;]
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christoph Lameter
Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list.

I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: once when the page was mlocked and once when the Infiniband layer increased the refcount because it needed to pin the RDMA memory.

This patch introduces a separate counter for pinned pages and accounts them separately.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Mike Marciniszyn <infinipath@qlogic.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
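The double accounting described above can be shown with a toy model. This is a sketch under assumptions: the struct below is not mm_struct, the field name pinned_vm and both helper names are chosen here for illustration, and the commit message itself only says that a separate counter is introduced.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the two per-mm counters; not the kernel's mm_struct. */
struct mm_counters {
    unsigned long locked_vm;  /* pages accounted as mlocked */
    unsigned long pinned_vm;  /* pages accounted as pinned (assumed name) */
};

/* Account pages that were mlocked by the process. */
void account_mlock(struct mm_counters *mm, unsigned long npages)
{
    assert(mm != NULL);
    mm->locked_vm += npages;
}

/* Account pages pinned by a driver, e.g. for RDMA. */
void account_pin(struct mm_counters *mm, unsigned long npages)
{
    assert(mm != NULL);
    /* Before the patch this would have been added to locked_vm as well. */
    mm->pinned_vm += npages;
}
```

With separate counters, mlocking 10 pages and then pinning the same 10 pages leaves locked_vm at 10 rather than 20, so it can no longer exceed the virtual size of the process for this reason.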
-
Committed by Johannes Weiner
The nr_force_scan[] tuple holds the effective scan numbers for anon and file pages in case the situation called for a forced scan and the regularly calculated scan numbers turned out zero.

However, the effective scan number can always be assumed to be SWAP_CLUSTER_MAX right before the division into anon and file. The numerators and denominator are properly set up for all cases, be it force scan for just file, just anon, or both, to do the right thing.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Dave Jones
When we get a bad_page bug report, it's useful to see what modules the user had loaded.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Robert P. J. Day
Add the leading word "tmpfs" to the Kconfig string to make it blindingly obvious that this selection refers to tmpfs.

Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Rientjes
test_set_oom_score_adj() was introduced in 72788c38 ("oom: replace PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag.

Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification.

To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
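The race and its fix can be sketched in userspace with C11 atomics. This is only an analogy for the compare-and-swap semantics named in the commit message: the helper names test_set_adj() and compare_swap_adj() are hypothetical, and the sketch makes no claim about how the kernel function is implemented or locked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MAX 1000

/* Set *adj to new_val and return the value it held before. */
int test_set_adj(_Atomic int *adj, int new_val)
{
    assert(adj != NULL);
    return atomic_exchange(adj, new_val);
}

/*
 * Store new_val only if *adj still holds old_val. Returns true when the
 * swap happened and false when someone else changed the value meanwhile.
 */
bool compare_swap_adj(_Atomic int *adj, int old_val, int new_val)
{
    assert(adj != NULL);
    return atomic_compare_exchange_strong(adj, &old_val, new_val);
}
```

A caller would elevate the value with test_set_adj(&adj, OOM_SCORE_ADJ_MAX), remember the returned previous value, and later call compare_swap_adj(&adj, OOM_SCORE_ADJ_MAX, previous). If userspace wrote a different value in between, the swap fails and the userspace value is left alone.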
-
Committed by David Rientjes
This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow.

The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since:

 - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and

 - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Rientjes
After selecting a task to kill, the oom killer iterates all processes and kills all other threads that share the same mm_struct in different thread groups. It would not otherwise be helpful to kill a thread if its memory would not be subsequently freed.

A kernel thread, however, may assume a user thread's mm by using use_mm(). This is only temporary and should not result in sending a SIGKILL to that kthread.

This patch ensures that only user threads and not kthreads are sent a SIGKILL if they share the same mm_struct as the oom killed task.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Rientjes
If a thread has been oom killed and is frozen, thaw it before returning to the page allocator. Otherwise, it can stay frozen indefinitely and no memory will be freed.

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Johannes Weiner
Looks like someone got distracted after adding the comment characters.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Shaohua Li
Per-task block plug can reduce block queue lock contention and increase request merging. Currently page reclaim doesn't support it. I originally thought page reclaim didn't need it, because the kswapd thread count is limited and file cache writes are mostly done by the flusher.

When I tested a workload with heavy swap on a 4-node machine, each CPU was doing direct page reclaim and swap. This causes block queue lock contention. In my test, without the patch below, the CPU utilization is about 2% ~ 7%. With the patch, the CPU utilization is about 1% ~ 3%. Disk throughput isn't changed. This should improve normal kswapd writes and file cache writes too (by increasing request merging, for example), but the effect might not be as obvious, for the reasons explained above.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hugh Dickins
radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was unsafe and removed in 2.6.34, but the pointless saw_unset_tag was left behind.

Remove it now, and return 0 as soon as we see an unset tag - we already rely upon the root tag to be correct, returning 0 immediately if it's not set.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Minchan Kim
unmap_and_move() is one big messy function. Clean it up.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Minchan Kim
In the __zone_reclaim case, we don't want to shrink mapped pages. Nonetheless, we isolate mapped pages and re-add them to the head of the LRU. This is unnecessary CPU overhead and causes LRU churning.

Of course, a page might be mapped when we isolate it but no longer be mapped by the time we try to migrate it, so it could have been migrated. But the race is rare and even if it happens, it's no big deal.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Minchan Kim
In async mode, compaction doesn't migrate dirty or writeback pages. So, it's meaningless to pick such a page and re-add it to the LRU list.

Of course, a page might be dirty or under writeback when we isolate it in compaction but be clean by the time we try to migrate it, so it could have been migrated. But that is very unlikely as the isolate and migration cycle is much faster than writeout.

So, this patch reduces CPU overhead and prevents unnecessary LRU churning.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Minchan Kim
Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type. Normally, macros aren't recommended as they are type-unsafe and make debugging harder because the symbols cannot be passed through to the debugger.

Quote from Johannes:

    Hmm, it would probably be cleaner to fully convert the isolation mode into independent flags. INACTIVE, ACTIVE, BOTH is currently a tri-state among flags, which is a bit ugly.

This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
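The move from a tri-state to independent flags that Johannes suggests can be sketched as follows. This is an illustration, not the patch: the numeric flag values and the helper should_isolate() are assumptions made for the example, and the kernel's sparse annotations are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Bitwise mode type; each flag is independent of the others. */
typedef unsigned int isolate_mode_t;

#define ISOLATE_INACTIVE ((isolate_mode_t)0x1)  /* isolate inactive pages (assumed value) */
#define ISOLATE_ACTIVE   ((isolate_mode_t)0x2)  /* isolate active pages (assumed value) */

/*
 * Hypothetical helper: should a page with the given active state be
 * isolated under this mode? The old "BOTH" state is just both bits set.
 */
bool should_isolate(isolate_mode_t mode, bool page_is_active)
{
    /* Reject flag bits this sketch does not know about. */
    assert((mode & ~(ISOLATE_INACTIVE | ISOLATE_ACTIVE)) == 0);

    if (page_is_active)
        return (mode & ISOLATE_ACTIVE) != 0;
    return (mode & ISOLATE_INACTIVE) != 0;
}
```

Because the flags are independent bits, further conditions can later be added as new bits without introducing more combined states.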
-
Committed by Minchan Kim
acct_isolated of compaction uses page_lru_base_type which returns only the base type of the LRU list, so it never returns LRU_ACTIVE_ANON or LRU_ACTIVE_FILE. In addition, cc->nr_[anon|file] is used only in acct_isolated so it doesn't need fields in compact_control.

This patch removes the fields from compact_control and clarifies the function of acct_isolated, which counts the number of anon|file pages isolated.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christopher Yeoh
The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with using it:

  - Does not allow for specifying iovecs for both src and dest, assuming preadv or pwritev was implemented either the area read from or written to would need to be contiguous.

  - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but it's not clear exactly what race this restriction is stopping (reason appears to have been lost)

  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other

  - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below)

  - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily)

As mentioned previously use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively if the pipe is not drained then you block. Which requires some wrapping to do non blocking on the send side or polling on the receive. In all to all communication it requires ordering otherwise you can deadlock.
And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying.

There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination.

Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI), this interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes.

There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc, other architectures should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
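As a rough illustration of the single-copy read described in this commit, the sketch below wraps process_vm_readv() as it is exposed by current glibc (declared in <sys/uio.h> when _GNU_SOURCE is defined). The wrapper in glibc may differ in detail from the proposed man page linked above, and the helper names read_remote() and self_copy_demo() are hypothetical. The demo reads from the calling process itself so that it needs no second process; on some systems the call may still be refused by security policy.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* glibc declares process_vm_readv() only with this */
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes from address remote in process pid
 * into local in the calling process. Returns the number of bytes copied,
 * or -1 with errno set.
 */
ssize_t read_remote(pid_t pid, void *local, const void *remote, size_t len)
{
    assert(local != NULL && remote != NULL);

    struct iovec local_iov  = { .iov_base = local,          .iov_len = len };
    struct iovec remote_iov = { .iov_base = (void *)remote, .iov_len = len };

    /* One local and one remote iovec; the flags argument is 0. */
    return process_vm_readv(pid, &local_iov, 1, &remote_iov, 1, 0);
}

/* Demo: read a buffer out of our own address space; returns 0 on success. */
int self_copy_demo(void)
{
    const char src[] = "hello";
    char dst[sizeof(src)] = { 0 };

    ssize_t n = read_remote(getpid(), dst, src, sizeof(src));
    if (n != (ssize_t)sizeof(src))
        return -1;
    return memcmp(dst, src, sizeof(src)) == 0 ? 0 : -1;
}
```

In an MPI-style setup the source process would send its buffer address and length to the destination over an existing channel, and the destination would pass them to read_remote() together with the source's pid.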
-
Committed by Wanlong Gao
Fix the wrong use of schedule_hrtimeout_range_clock() in wq_sleep(), although it is harmless for the syscall mq_timed* now.

It was introduced by 9ca7d8e6 ("mqueue: Convert message queue timeout to use hrtimers").

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Carsten Emde <C.Emde@osadl.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrew Morton
The display of the "huge" tag was accidentally removed in 29ea2f69 ("mm: use walk_page_range() instead of custom page table walking code").

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Tested-by: Stephen Hemminger <shemminger@vyatta.com>
Reviewed-by: Stephen Wilson <wilsons@start.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrew Morton
x86_64 allnoconfig:

  In file included from arch/x86/kernel/pci-dma.c:3:
  include/linux/dmar.h:248: warning: 'struct acpi_dmar_header' declared inside parameter list
  include/linux/dmar.h:248: warning: its scope is only this definition or declaration, which is probably not what you want

Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
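This kind of warning appears when a struct tag is first seen inside a function prototype's parameter list. One common remedy is a forward declaration ahead of the prototype, sketched below. Whether this commit uses that remedy is an assumption, and `dmar_entry_len` and the struct layout are hypothetical rather than the contents of include/linux/dmar.h.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: gives the tag file scope before any prototype uses it. */
struct acpi_dmar_header;

/* Without the declaration above, the tag would be scoped to this prototype only. */
static int dmar_entry_len(const struct acpi_dmar_header *hdr);

/* Hypothetical layout, for illustration only. */
struct acpi_dmar_header {
    unsigned short type;
    unsigned short length;
};

/* Hypothetical helper: returns the length field of an entry. */
static int dmar_entry_len(const struct acpi_dmar_header *hdr)
{
    assert(hdr != NULL);
    return hdr->length;
}
```

Because the tag is declared at file scope first, the prototype and the later definition refer to the same type and the compiler has nothing to warn about.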
-
Committed by Clemens Ladisch
Commit 5fd75a78 (dma-mapping: remove unnecessary sync_single_range_* in dma_map_ops) unified not only the dma_map_ops but also the corresponding debug_dma_sync_* calls. This led to spurious WARN()ings like the following because the DMA debug code was no longer able to detect the DMA buffer base address without the separate offset parameter:

  WARNING: at lib/dma-debug.c:911 check_sync+0xce/0x446()
  firewire_ohci 0000:04:00.0: DMA-API: device driver tries to sync DMA memory it has not allocated [device address=0x00000000cedaa400] [size=1024 bytes]
  Call Trace:
   ...
   [<ffffffff811326a5>] check_sync+0xce/0x446
   [<ffffffff81132ad9>] debug_dma_sync_single_for_device+0x39/0x3b
   [<ffffffffa01d6e6a>] ohci_queue_iso+0x4f3/0x77d [firewire_ohci]
   ...

To fix this, unshare the sync_single_* and sync_single_range_* implementations so that we are able to call the correct debug_dma_sync_* functions.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Committed by Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (27 commits)
  vlan: allow nested vlan_do_receive()
  ipv6: fix route lookup in addrconf_prefix_rcv()
  bonding: eliminate bond_close race conditions
  qlcnic: fix beacon and LED test.
  qlcnic: Updated License file
  qlcnic: updated reset sequence
  qlcnic: reset loopback mode if promiscous mode setting fails.
  qlcnic: skip IDC ack check in fw reset path.
  i825xx: Fix incorrect dependency for BVME6000_NET
  ipv6: fix route error binding peer in func icmp6_dst_alloc
  ipv6: fix error propagation in ip6_ufo_append_data()
  stmmac: update normal descriptor structure (v2)
  stmmac: fix NULL pointer dereference in capabilities fixup (v2)
  stmmac: fix a bug while checking the HW cap reg (v2)
  be2net: Changing MAC Address of a VF was broken.
  be2net: Refactored be_cmds.c file.
  bnx2x: update driver version to 1.70.30-0
  bnx2x: use FW 7.0.29.0
  bnx2x: Enable changing speed when port type is PORT_DA
  bnx2x: Fix 54618se LED behavior
  ...
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Committed by Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc64: Fix masking and shifting in VIS fpcmp emulation.
  sparc32: Correct the return value of memcpy.
  sparc32: Remove uses of %g7 in memcpy implementation.
  sparc32: Remove non-kernel code from memcpy implementation.
-
git://neil.brown.name/md
Committed by Linus Torvalds
* 'for-linus' of git://neil.brown.name/md:
  md/raid10: Fix bug when activating a hot-spare.
-
- 31 Oct, 2011 7 commits
-
-
Committed by David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by NeilBrown
This is a fairly serious bug in RAID10. When a RAID10 array is degraded and a hot-spare is activated, the spare does not take up the empty slot, but rather replaces the first working device. This is likely to make the array non-functional. It would normally be possible to recover the data, but that would need care and is not guaranteed. This bug was introduced in commit 2bb77736, which first appeared in 3.1.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
-
git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging
Committed by Linus Torvalds
* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  i2c: Functions for byte-swapped smbus_write/read_word_data
  i2c-algo-pca: Return standard fault codes
  i2c-algo-bit: Return standard fault codes
  i2c-algo-bit: Be verbose on bus testing failure
  i2c-algo-bit: Let user test buses without failing
  i2c/scx200_acb: Fix section mismatch warning in scx200_pci_drv
  i2c: I2C_ELEKTOR should depend on HAS_IOPORT
-
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Committed by Linus Torvalds
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
  iommu/core: Remove global iommu_ops and register_iommu
  iommu/msm: Use bus_set_iommu instead of register_iommu
  iommu/omap: Use bus_set_iommu instead of register_iommu
  iommu/vt-d: Use bus_set_iommu instead of register_iommu
  iommu/amd: Use bus_set_iommu instead of register_iommu
  iommu/core: Use bus->iommu_ops in the iommu-api
  iommu/core: Convert iommu_found to iommu_present
  iommu/core: Add bus_type parameter to iommu_domain_alloc
  Driver core: Add iommu_ops to bus_type
  iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
  iommu/amd: Fix wrong shift direction
  iommu/omap: always provide iommu debug code
  iommu/core: let drivers know if an iommu fault handler isn't installed
  iommu/core: export iommu_set_fault_handler()
  iommu/omap: Fix build error with !IOMMU_SUPPORT
  iommu/omap: Migrate to the generic fault report mechanism
  iommu/core: Add fault reporting mechanism
  iommu/core: Use PAGE_SIZE instead of hard-coded value
  iommu/core: use the existing IS_ALIGNED macro
  iommu/msm: ->unmap() should return order of unmapped page
  ...

Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config options" just happened to touch lines next to each other.
-
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Committed by Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp:
  amd64_edac: Cleanup return type of amd64_determine_edac_cap()
  amd64_edac: Add a fix for Erratum 505
  EDAC, MCE, AMD: Simplify NB MCE decoder interface
  EDAC, MCE, AMD: Drop local coreid reporting
  EDAC, MCE, AMD: Print valid addr when reporting an error
  EDAC, MCE, AMD: Print CPU number when reporting the error
-
git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm
Committed by Linus Torvalds
* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm: (75 commits)
  KVM: SVM: Keep intercepting task switching with NPT enabled
  KVM: s390: implement sigp external call
  KVM: s390: fix register setting
  KVM: s390: fix return value of kvm_arch_init_vm
  KVM: s390: check cpu_id prior to using it
  KVM: emulate lapic tsc deadline timer for guest
  x86: TSC deadline definitions
  KVM: Fix simultaneous NMIs
  KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode
  KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode
  KVM: x86 emulator: streamline decode of segment registers
  KVM: x86 emulator: simplify OpMem64 decode
  KVM: x86 emulator: switch src decode to decode_operand()
  KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack
  KVM: x86 emulator: switch OpImmUByte decode to decode_imm()
  KVM: x86 emulator: free up some flag bits near src, dst
  KVM: x86 emulator: switch src2 to generic decode_operand()
  KVM: x86 emulator: expand decode flags to 64 bits
  KVM: x86 emulator: split dst decode to a generic decode_operand()
  KVM: x86 emulator: move memop, memopp into emulation context
  ...
-
git://github.com/schandinat/linux-2.6
Committed by Linus Torvalds
* 'fbdev-next' of git://github.com/schandinat/linux-2.6: (270 commits)
  video: platinumfb: Add __devexit_p at necessary place
  drivers/video: fsl-diu-fb: merge diu_pool into fsl_diu_data
  drivers/video: fsl-diu-fb: merge diu_hw into fsl_diu_data
  drivers/video: fsl-diu-fb: only DIU modes 0 and 1 are supported
  drivers/video: fsl-diu-fb: remove unused panel operating mode support
  drivers/video: fsl-diu-fb: use an enum for the AOI index
  drivers/video: fsl-diu-fb: add several new video modes
  drivers/video: fsl-diu-fb: remove broken screen blanking support
  drivers/video: fsl-diu-fb: move some definitions out of the header file
  drivers/video: fsl-diu-fb: fix some ioctls
  video: da8xx-fb: Increased resolution configuration of revised LCDC IP
  OMAPDSS: picodlp: add missing #include <linux/module.h>
  fb: fix au1100fb bitrot.
  mx3fb: fix NULL pointer dereference in screen blanking.
  video: irq: Remove IRQF_DISABLED
  smscufx: change edid data to u8 instead of char
  OMAPDSS: DISPC: zorder support for DSS overlays
  OMAPDSS: DISPC: VIDEO3 pipeline support
  OMAPDSS/OMAP_VOUT: Fix incorrect OMAP3-alpha compatibility setting
  video/omap: fix build dependencies
  ...

Fix up conflicts in:
 - drivers/staging/xgifb/XGI_main_26.c
   Changes to XGIfb_pan_var()
 - drivers/video/omap/{lcd_apollon.c,lcd_ldp.c,lcd_overo.c}
   Removed (or in the case of apollon.c, merged into the generic DSS panel in drivers/video/omap2/displays/panel-generic-dpi.c)
-
- 30 Oct, 2011 7 commits
-
-
Committed by Jonathan Cameron
Byte-swapped variants of smbus_write/read_word_data have been reimplemented at least 17 times, not counting error-mangling cases where they could also be used.

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
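The pattern being factored out is a 16-bit byte swap wrapped around a word transfer. A minimal user-space sketch follows, assuming a hypothetical `read_word_fn` callback in place of the real SMBus transfer. `swap16`, `read_word_swapped` and the `fake_read_*` functions are illustrative names, not the kernel's API.

```c
#include <assert.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit word. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical transfer callback: returns a word in 0..0xffff,
 * or a negative errno-style value on failure. */
typedef int32_t (*read_word_fn)(uint8_t command);

/* Read a word and byte-swap it, passing errors through unchanged. */
static int32_t read_word_swapped(read_word_fn read_word, uint8_t command)
{
    int32_t value;

    assert(read_word);
    value = read_word(command);
    if (value < 0)
        return value;   /* do not mangle the error code */
    return swap16((uint16_t)value);
}

/* Hypothetical fake transfers used to exercise the helper. */
static int32_t fake_read_ok(uint8_t command)  { (void)command; return 0x1234; }
static int32_t fake_read_err(uint8_t command) { (void)command; return -5; }
```

Checking for a negative return before swapping matters: swapping unconditionally would turn an error code into a plausible-looking data word, which is presumably the kind of error mangling the message refers to.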
-
Committed by Jean Delvare
Adjust i2c-algo-pca to return fault codes compliant with Documentation/i2c/fault-codes, rather than the undocumented and vague -EREMOTEIO.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Wolfram Sang <w.sang@pengutronix.de>
-
Committed by Jean Delvare
Adjust i2c-algo-bit to return fault codes compliant with Documentation/i2c/fault-codes, rather than the undocumented and vague -EREMOTEIO.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
-
Committed by Jean Delvare
If bus testing fails due to the bus being seen as busy, it might be helpful for developers to know which line is unexpectedly low.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Alex Deucher <alexdeucher@gmail.com>
-
Committed by Jean Delvare
Always failing to register I2C buses when the line testing fails is a little harsh. While such a failure is definitely a bug in the driver that exposes the affected I2C bus, things may still work fine if the missing initialization steps are done later, before the I2C bus is used. So it seems a better debugging tool to just report the test failure by default. I introduce bit_test=2 if anyone really misses the original behavior of bit_test=1.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Alex Deucher <alexdeucher@gmail.com>
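The resulting policy can be sketched as a small decision function. This is a hypothetical model of the behavior described above, not the driver's code: `register_result` is an invented name, the choice of -ENODEV as the failure code is an assumption, and so is the treatment of bit_test=0 as "no test".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: outcome of bus registration given the bit_test
 * setting and whether the line test failed. Returns 0 or a negative errno. */
static int register_result(int bit_test, int line_test_failed)
{
    assert(bit_test >= 0);

    if (bit_test == 0)
        return 0;           /* assumed: no test is run at all */
    if (!line_test_failed)
        return 0;           /* test passed, register normally */
    if (bit_test >= 2)
        return -ENODEV;     /* strict mode: refuse to register (assumed code) */
    return 0;               /* bit_test=1: report the failure, register anyway */
}
```

With this model, a failing line test blocks registration only when the user opts in with bit_test=2, which matches the intent of treating the test as a debugging aid by default.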
-
Committed by Harvey Yang
WARNING: drivers/i2c/busses/built-in.o(.data+0x47c8): Section mismatch in reference from the variable scx200_pci_drv to the function .devinit.text:scx200_probe()
The variable scx200_pci_drv references the function __devinit scx200_probe()
If the reference is valid then annotate the variable with __init* or __refdata (see linux/init.h) or name the variable: *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console

Signed-off-by: Harvey Yang <harvey.huawei.yang@gmail.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
-
Committed by Geert Uytterhoeven
On m68k, I get:

  drivers/i2c/busses/i2c-elektor.c: In function ‘pcf_isa_init’:
  drivers/i2c/busses/i2c-elektor.c:153: error: implicit declaration of function ‘ioport_map’
  drivers/i2c/busses/i2c-elektor.c:153: warning: assignment makes pointer from integer without a cast
  drivers/i2c/busses/i2c-elektor.c: In function ‘elektor_probe’:
  drivers/i2c/busses/i2c-elektor.c:287: error: implicit declaration of function ‘ioport_unmap’

Since commit 82ed223c ("iomap: make IOPORT/PCI mapping functions conditional"), ioport_map() is only available on platforms that set HAS_IOPORT.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jean Delvare <khali@linux-fr.org>
-