- 22 Sep, 2010 1 commit
-
-
Committed by Jan Kara
Properly initialize this backing dev info so that writeback code does not barf when getting to it e.g. via sb->s_bdi. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
-
- 21 Sep, 2010 1 commit
-
-
Committed by Hugh Dickins
Commit 4969c119 ("mm: fix swapin race condition") is now agreed to be incomplete. There's a race, not very much less likely than the original race envisaged, in which it is further necessary to check that the swapcache page's swap has not changed. Here's the reasoning: cast in terms of reuse_swap_page(), but probably could be reformulated to rely on try_to_free_swap() instead, or on swapoff+swapon. A, faults into do_swap_page(): does page1 = lookup_swap_cache(swap1) and comes through the lock_page(page1). B, a racing thread of the same process, faults on the same address: does page1 = lookup_swap_cache(swap1) and now waits in lock_page(page1), but for whatever reason is unlucky not to get the lock any time soon. A carries on through do_swap_page(), a write fault, but cannot reuse the swap page1 (another reference to swap1). Unlocks the page1 (but B doesn't get it yet), does COW in do_wp_page(), page2 now in that pte. C, perhaps the parent of A+B, comes in and write faults the same swap page1 into its mm, reuse_swap_page() succeeds this time, swap1 is freed. kswapd comes in after some time (B still unlucky) and swaps out some pages from A+B and C: it allocates the original swap1 to page2 in A+B, and some other swap2 to the original page1 now in C. But does not immediately free page1 (actually it couldn't: B holds a reference), leaving it in swap cache for now. B at last gets the lock on page1, hooray! Is PageSwapCache(page1)? Yes. Is pte_same(*page_table, orig_pte)? Yes, because page2 has now been given the swap1 which page1 used to have. So B proceeds to insert page1 into A+B's page_table, though its content now belongs to C, quite different from what A wrote there. B ought to have checked that page1's swap was still swap1. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
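The missing check can be illustrated with a small user-space model. This is only a sketch of the state B observes once it finally holds the page lock, not kernel code: the `Page` class, its fields, and `can_map_after_lock` are hypothetical names, and the pte comparison merely stands in for pte_same().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    in_swap_cache: bool
    swap_entry: Optional[int]  # swap slot currently backing this page (hypothetical field)


def can_map_after_lock(page, fault_entry, pte_entry, check_page_swap=True):
    """Model of the checks thread B makes after acquiring the page lock."""
    if not page.in_swap_cache:
        return False
    # The additional check described in the commit message:
    # the page's swap must still be the swap B faulted on.
    if check_page_swap and page.swap_entry != fault_entry:
        return False
    # Stands in for pte_same(*page_table, orig_pte).
    return pte_entry == fault_entry


# State when B gets the lock: page1 now carries swap2 (it belongs to C),
# while the pte holds swap1 again because page2 was swapped out to swap1.
page1 = Page(in_swap_cache=True, swap_entry=2)
without_check = can_map_after_lock(page1, fault_entry=1, pte_entry=1, check_page_swap=False)
with_check = can_map_after_lock(page1, fault_entry=1, pte_entry=1)
```

Without the extra comparison the model lets B map page1; with it, B backs off.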
-
- 10 Sep, 2010 14 commits
-
-
Committed by Mel Gorman
When under significant memory pressure, a process enters direct reclaim and immediately afterwards tries to allocate a page. If it fails and no further progress is made, it's possible the system will go OOM. However, on systems with large amounts of memory, it's possible that a significant number of pages are on per-cpu lists and inaccessible to the calling process. This leads to a process entering direct reclaim more often than it should, increasing the pressure on the system and compounding the problem. This patch notes that if direct reclaim is making progress but allocations are still failing, the system is already under heavy pressure. In this case, it drains the per-cpu lists and tries the allocation a second time before continuing. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
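The retry described above can be sketched as a toy model. Everything here is an assumption made for illustration: the function name, the list-based free list, and the per-CPU lists are hypothetical and only mimic the control flow (allocate, and if that fails while reclaim made progress, drain once and try again).

```python
def allocate_after_reclaim(free_list, pcp_lists, reclaim_made_progress):
    """Try to take a page; drain the per-CPU lists once and retry if reclaim progressed."""
    drained = False
    while True:
        if free_list:
            return free_list.pop()
        if reclaim_made_progress and not drained:
            # Pages parked on per-CPU lists become visible to the allocator.
            for pcp in pcp_lists:
                free_list.extend(pcp)
                pcp.clear()
            drained = True
            continue
        return None  # caller carries on with further reclaim or OOM handling


free = []
pcp = [[1], [2, 3]]
page = allocate_after_reclaim(free, pcp, reclaim_made_progress=True)
no_page = allocate_after_reclaim([], [[4]], reclaim_made_progress=False)
```

The drain happens at most once per call, so the model cannot loop forever when the per-CPU lists are empty too.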
-
Committed by Christoph Lameter
mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake. Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is cheaper than scanning a number of lists. To avoid synchronization overhead, counter deltas are maintained on a per-cpu basis and drained both periodically and when the delta is above a threshold. On large CPU systems, the difference between the estimated and real value of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the number of real free pages in the buddy allocator, the VM can allocate pages below the min watermark, at worst reducing the real number of pages to zero. Even if the OOM killer kills some victim for freeing memory, it may not free memory if the exit path requires a new page, resulting in livelock. This patch introduces a zone_page_state_snapshot() function (courtesy of Christoph) that takes a slightly more accurate view of an arbitrary vmstat counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid the watermark being accidentally broken. The estimate is not perfect and may result in cache line bounces but is expected to be lighter than the IPI calls necessary to continually drain the per-cpu counters while kswapd is awake. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
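A rough model of the idea, assuming the snapshot simply folds the not-yet-drained per-CPU deltas into the global counter. The function names mirror the ones in the message, but the signatures and the sample numbers are hypothetical, and this is not the kernel implementation.

```python
def zone_page_state(global_count):
    """Cheap read: only the global counter, ignoring pending per-CPU deltas."""
    return max(global_count, 0)


def zone_page_state_snapshot(global_count, per_cpu_deltas):
    """More accurate read: global counter plus every CPU's pending delta."""
    return max(global_count + sum(per_cpu_deltas), 0)


# Hypothetical numbers: 8 CPUs, each holding an undrained delta of -120 pages.
estimate = zone_page_state(1000)
snapshot = zone_page_state_snapshot(1000, [-120] * 8)
```

With many CPUs, the cheap estimate (1000) can sit far above the snapshot (40), which is how a watermark check could pass when it should not.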
-
Committed by Mel Gorman
When allocating a page, the system uses NR_FREE_PAGES counters to determine if watermarks would remain intact after the allocation was made. This check is made without interrupts disabled or the zone lock held and so is race-prone by nature. Unfortunately, when pages are being freed in batch, the counters are updated before the pages are added on the list. During this window, the counters are misleading as the pages do not exist yet. When under significant pressure on systems with large numbers of CPUs, it's possible for processes to make progress even though they should have been stalled. This is particularly problematic if a number of the processes are using GFP_ATOMIC as the min watermark can be accidentally breached and in extreme cases, the system can livelock. This patch updates the counters after the pages have been added to the list. This makes the allocator more cautious with respect to preserving the watermarks and mitigates livelock possibilities. [akpm@linux-foundation.org: avoid modifying incoming args] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
refresh_zone_stat_thresholds() calculates a parameter based on the number of online cpus. It's called at cpu offlining but needs to be called at onlining, too. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hugh Dickins
Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs show a shift since I last tested in December: in part because of firmware updates, in part because of the necessary move from barriers to awaiting completion at the block layer. While discard at swapon still shows as slightly beneficial on both, discarding 1MB swap cluster when allocating is now disadvantageous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV). Surrender: discard as presently implemented is more hindrance than help for swap; but might prove useful on other devices, or with improvements. So continue to do the discard at swapon, but make discard while swapping conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using only the lower 16 bits of int flags). We can add a --discard or -d to swapon(8), and a "discard" to swap in /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nigel Cunningham <nigel@tuxonice.net> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Christoph Hellwig
The swap code already uses synchronous discards, no need to add I/O barriers. This fixes the worst of the terrible slowdown in swap allocation for hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely eliminate that regression. [tj@kernel.org: superfluous newlines removed] Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hugh Dickins
Move the hibernation check from scan_swap_map() into try_to_free_swap(): to catch not only the common case when hibernation's allocation itself triggers swap reuse, but also the less likely case when concurrent page reclaim (shrink_page_list) might happen to try_to_free_swap from a page. Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop reclaim from going to swap: check that to prevent swap reuse too. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ondrej Zary <linux@rainbow-software.org> Cc: Andrea Gelmini <andrea.gelmini@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nigel Cunningham <nigel@tuxonice.net> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hugh Dickins
Please revert 2.6.36-rc commit d2997b10 "hibernation: freeze swap at hibernation". It complicated matters by adding a second swap allocation path, just for hibernation; without in any way fixing the issue that it was intended to address - page reclaim after fixing the hibernation image might free swap from a page already imaged as swapcache, letting its swap be reallocated to store a different page of the image: resulting in data corruption if the imaged page were freed as clean then swapped back in. Pages freed to si->swap_map were still in danger of being reallocated by the alternative allocation path. I guess it inadvertently fixed slow SSD swap allocation for hibernation, as reported by Nigel Cunningham: by missing out the discards that occur on the usual swap allocation path; but that was unintentional, and needs a separate fix. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Ondrej Zary <linux@rainbow-software.org> Cc: Andrea Gelmini <andrea.gelmini@gmail.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nigel Cunningham <nigel@tuxonice.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Gary King
I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM enabled on 2.6.35 (plus some patches targeted at 2.6.36 to perform cache maintenance lazily), and the root cause appears to be that the mm bouncing code is calling flush_dcache_page before it copies the bounce buffer into the bio. The bounced page needs to be flushed after data is copied into it, to ensure that architecture implementations can synchronize instruction and data caches if necessary. Signed-off-by: Gary King <gking@nvidia.com> Cc: Tejun Heo <tj@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
next_active_pageblock() is for finding the next _used_ freeblock. It skips several blocks when it finds there is a chunk of free pages larger than a pageblock. But it has 2 bugs. 1. We have no lock, so page_order(page) - pageblock_order can be negative. 2. pageblocks_stride += is wrong; it should skip page_order(p) of pages. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Minchan Kim
Iram reported that compaction's too_many_isolated() loops forever. (http://www.spinics.net/lists/linux-mm/msg08123.html) The meminfo at the time showed that inactive anon was zero, because the system had seen no memory pressure until then. While all anon pages were on the active lru, compaction could select pages from the active lru as well as the inactive lru. That is different from vmscan's isolation, so we have had two versions of too_many_isolated. While compaction can isolate pages from both active and inactive, the current implementation of too_many_isolated only considers inactive. That caused Iram's problem. This patch handles active and inactive fairly, because we cannot predict from where, or how many, pages compaction will isolate. This patch replaces (nr_isolated > nr_inactive) with nr_isolated > (nr_active + nr_inactive) / 2. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reported-by: Iram Shahzad <iram.shahzad@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
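The two conditions quoted in the message can be compared side by side. This is a minimal sketch: the function names and the sample counts are made up for illustration, and `//` stands in for C integer division.

```python
def too_many_isolated_old(nr_isolated, nr_active, nr_inactive):
    """Condition before the patch: only the inactive list is considered."""
    return nr_isolated > nr_inactive


def too_many_isolated_new(nr_isolated, nr_active, nr_inactive):
    """Condition after the patch: active and inactive are weighed together."""
    return nr_isolated > (nr_active + nr_inactive) // 2


# Hypothetical counts for the reported situation: no memory pressure yet,
# so every anon page is still on the active list.
old = too_many_isolated_old(nr_isolated=32, nr_active=1000, nr_inactive=0)
new = too_many_isolated_new(nr_isolated=32, nr_active=1000, nr_inactive=0)
```

With an empty inactive list the old condition is true for any nonzero number of isolated pages, which matches the reported endless wait; the new one only trips once isolation exceeds half of all LRU pages.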
-
Committed by Andrea Arcangeli
COMPACTION enables MIGRATION, but MIGRATION spawns a warning if numa or memhotplug aren't selected. However MIGRATION doesn't depend on them. I guess it's just trying to be strict doing a double check on who's enabling it, but it doesn't know that compaction also enables MIGRATION. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrea Arcangeli
The pte_same check is reliable only if the swap entry remains pinned (by the page lock on swapcache). We also have to ensure the swapcache isn't removed before we take the lock, as try_to_free_swap won't care about the page pin. One of the possible impacts of this patch is that a KSM-shared page can point to the anon_vma of another process, which could exit before the page is freed. This can leave a page with a pointer to a recycled anon_vma object, or worse, a pointer to something that is no longer an anon_vma. [riel@redhat.com: changelog help] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Stefan Bader
So it can be used by all that need to check for that. Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 29 Aug, 2010 1 commit
-
-
Committed by Hugh Dickins
After several hours, kbuild tests hang with anon_vma_prepare() spinning on a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y (which makes this very much more likely, but it could happen without). The ever-subtle page_lock_anon_vma() now needs a further twist: since anon_vma_prepare() and anon_vma_fork() are liable to change the ->root of a reused anon_vma structure at any moment, page_lock_anon_vma() needs to check page_mapped() again before succeeding, otherwise page_unlock_anon_vma() might address a different root->lock. Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 27 Aug, 2010 3 commits
-
-
Committed by Namhyung Kim
When pcpu_build_alloc_info() searches for the best_upa value, it ignores the current value if the number of wasted units exceeds 1/3 of the total number of cpus. But the comment on the code says that it will ignore it if wastage is over 25%. Modify the comment. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
-
Committed by Huang Shijie
The original code did not free the old map. This patch fixes it. tj: use @old as memcpy source instead of @chunk->map, and indentation and description update. Signed-off-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org
-
Committed by Artem Bityutskiy
This patch fixes the following issue:
  INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000
   ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
   ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
   00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
  Call Trace:
   [<ffffffff813bc747>] schedule_timeout+0x34/0xf1
   [<ffffffff813bc530>] ? wait_for_common+0x3f/0x130
   [<ffffffff8106b50b>] ? trace_hardirqs_on+0xd/0xf
   [<ffffffff813bc5c3>] wait_for_common+0xd2/0x130
   [<ffffffff8104159c>] ? default_wake_function+0x0/0xf
   [<ffffffff813beaa0>] ? _raw_spin_unlock+0x26/0x2a
   [<ffffffff813bc6bb>] wait_for_completion+0x18/0x1a
   [<ffffffff81101a03>] sync_inodes_sb+0xca/0x1bc
   [<ffffffff811056a6>] __sync_filesystem+0x47/0x7e
   [<ffffffff81105798>] sync_filesystem+0x47/0x4b
   [<ffffffff810e7ffd>] generic_shutdown_super+0x22/0xd2
   [<ffffffff810e80f8>] kill_anon_super+0x11/0x4f
   [<ffffffffa00d06d7>] nfs4_kill_super+0x3f/0x72 [nfs]
   [<ffffffff810e7b68>] deactivate_locked_super+0x21/0x41
   [<ffffffff810e7fd6>] deactivate_super+0x40/0x45
   [<ffffffff810fc66c>] mntput_no_expire+0xb8/0xed
   [<ffffffff810fc73b>] release_mounts+0x9a/0xb0
   [<ffffffff810fc7bb>] put_mnt_ns+0x6a/0x7b
   [<ffffffffa00d0fb2>] nfs_follow_remote_path+0x19a/0x296 [nfs]
   [<ffffffffa00d11ca>] nfs4_try_mount+0x75/0xaf [nfs]
   [<ffffffffa00d1790>] nfs4_get_sb+0x276/0x2ff [nfs]
   [<ffffffff810e7dba>] vfs_kern_mount+0xb8/0x196
   [<ffffffff810e7ef6>] do_kern_mount+0x48/0xe8
   [<ffffffff810fdf68>] do_mount+0x771/0x7e8
   [<ffffffff810fe062>] sys_mount+0x83/0xbd
   [<ffffffff810089c2>] system_call_fastpath+0x16/0x1b
The reason for this hang was a race condition: when the flusher thread is forking a bdi thread, we use 'kthread_run()', so we run it _before_ we make it visible in 'bdi->wb.task'. The bdi thread runs, does all the work, and goes to sleep. 'bdi->wb.task' is still NULL. And this is a dangerous time window. If at this time someone queues a work for this bdi, he does not see the bdi thread and wakes up the forker thread instead! But the forker has already forked this bdi thread, and just did not make it visible yet! The result is that we lose the wake up event for this bdi thread and the NFS4 code waits forever. To fix the problem, we should use 'kthread_create()' for creating bdi threads, then make them visible in 'bdi->wb.task', and only after this wake them up. This is exactly what this patch does. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
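The ordering of the fix (create, publish, then wake) can be mimicked in user space with Python threads. This is only an analogy under stated assumptions: `Bdi`, `bdi_thread_fn`, and `fork_bdi_thread` are hypothetical names, a not-yet-started `threading.Thread` stands in for a task returned by kthread_create(), and `start()` stands in for the wake-up.

```python
import threading


class Bdi:
    def __init__(self):
        self.task = None            # stands in for bdi->wb.task
        self.seen_by_thread = []    # what the new thread observed on startup


def bdi_thread_fn(bdi):
    # Record whether the thread was already visible when it began running.
    bdi.seen_by_thread.append(bdi.task is not None)


def fork_bdi_thread(bdi):
    # 1. Create the thread without running it (like kthread_create()).
    t = threading.Thread(target=bdi_thread_fn, args=(bdi,))
    # 2. Publish it so that anyone queueing work can find it.
    bdi.task = t
    # 3. Only now let it run (like waking the created task).
    t.start()
    return t


bdi = Bdi()
fork_bdi_thread(bdi).join()
```

Because the thread is published before it can run, there is no window in which it has executed while `bdi.task` is still empty.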
-
- 25 Aug, 2010 1 commit
-
-
Committed by Luck, Tony
pa-risc and ia64 have stacks that grow upwards. Check that they do not run into other mappings. By making VM_GROWSUP 0x0 on architectures that do not ever use it, we can avoid some unpleasant #ifdefs in check_stack_guard_page(). Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 Aug, 2010 1 commit
-
-
Committed by Dave Chinner
I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have been. Enabling writeback tracing showed:
  flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
  flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
  flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
  flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
  flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
Writeback is not terminating in background writeback if ->writepage is returning with wbc->nr_to_write == 0, resulting in sub-optimal single page writeback on XFS. Fix the write_cache_pages loop to terminate correctly when this situation occurs and so prevent this sub-optimal background writeback pattern. This improves sustained sequential buffered write performance from around 250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM. Cc: <stable@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
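The termination problem can be modeled in a few lines. This is a sketch under assumptions, not the kernel loop: the `fixed=False` branch is a guess at the pre-patch logic based on the description above, `clustering_writepage` is a hypothetical ->writepage that uses up the whole remaining budget by itself, and the sync-mode condition is left out.

```python
def clustering_writepage(nr_to_write):
    """Hypothetical ->writepage that consumes the remaining budget in one call."""
    return min(nr_to_write, 0)


def write_cache_pages(dirty_pages, nr_to_write, writepage, fixed=True):
    """Return (number of ->writepage calls, remaining nr_to_write)."""
    calls = 0
    for _ in range(dirty_pages):
        nr_to_write = writepage(nr_to_write)
        calls += 1
        if fixed:
            # Terminate as soon as the budget is used up, even if
            # ->writepage already drove it to zero or below.
            nr_to_write -= 1
            if nr_to_write <= 0:
                break
        else:
            # Assumed old logic: the exit test is only reached while the
            # budget is still positive, so a budget of 0 never terminates.
            if nr_to_write > 0:
                nr_to_write -= 1
                if nr_to_write == 0:
                    break
    return calls, nr_to_write


old_calls, _ = write_cache_pages(100, 1024, clustering_writepage, fixed=False)
new_calls, _ = write_cache_pages(100, 1024, clustering_writepage, fixed=True)
```

In the model the unfixed loop keeps issuing one call per dirty page after the budget is gone, while the fixed loop stops after the first call.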
-
- 23 Aug, 2010 1 commit
-
-
Committed by Michael Rubin
This allows code outside of the mm core to safely manipulate page state and not worry about the other accounting. Not using these routines means that some code will lose track of the accounting and we get bugs. This has happened once already. Signed-off-by: Michael Rubin <mrubin@google.com> Signed-off-by: Sage Weil <sage@newdream.net>
-
- 21 Aug, 2010 7 commits
-
-
Committed by Linus Torvalds
Like the mlock() change previously, this makes the stack guard check code use vma->vm_prev to see what the mapping below the current stack is, rather than have to look it up with find_vma(). Also, accept an abutting stack segment, since that happens naturally if you split the stack with mlock or mprotect. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Linus Torvalds
If we've split the stack vma, only the lowest one has the guard page. Now that we have a doubly linked list of vma's, checking this is trivial. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Linus Torvalds
It's a really simple list, and several of the users want to go backwards in it to find the previous vma. So rather than have to look up the previous entry with 'find_vma_prev()' or something similar, just make it doubly linked instead. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
dump_tasks() needs to hold the RCU read lock around its access of the target task's UID. To this end it should use task_uid() as it only needs that one thing from the creds. The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the target process replacing its credentials on another CPU. So this patch changes it to call rcu_read_lock() explicitly.
  ===================================================
  [ INFO: suspicious rcu_dereference_check() usage. ]
  ---------------------------------------------------
  mm/oom_kill.c:410 invoked rcu_dereference_check() without protection!
  other info that might help us debug this:
  rcu_scheduler_active = 1, debug_locks = 1
  4 locks held by kworker/1:2/651:
   #0: (events){+.+.+.}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0
   #1: (moom_work){+.+...}, at: [<ffffffff8106aae7>] process_one_work+0x137/0x4a0
   #2: (tasklist_lock){.+.+..}, at: [<ffffffff810fafd4>] out_of_memory+0x164/0x3f0
   #3: (&(&p->alloc_lock)->rlock){+.+...}, at: [<ffffffff810fa48e>] find_lock_task_mm+0x2e/0x70
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Commit 0aad4b31 ("oom: fold __out_of_memory into out_of_memory") introduced a tasklist_lock leak. It then caused the following obvious danger warnings and panic.
  ================================================
  [ BUG: lock held when returning to user space! ]
  ------------------------------------------------
  rsyslogd/1422 is leaving the kernel with locks still held!
  1 lock held by rsyslogd/1422:
   #0: (tasklist_lock){.+.+.+}, at: [<ffffffff810faf64>] out_of_memory+0x164/0x3f0
  BUG: scheduling while atomic: rsyslogd/1422/0x00000002
  INFO: lockdep is turned off.
This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Commit b940fd70 ("oom: remove unnecessary code and cleanup") added an unnecessary NULL pointer dereference. Remove it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jan Kara
When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows the index and the tree traversal code goes astray, reading memory until it hits unreadable memory. Check for overflow and exit in that case. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
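The wrap-around can be reproduced with plain integers by masking to 64 bits, which is how an unsigned long behaves on a 64-bit machine. This is a sketch: `scan_indices` and its parameters are hypothetical and only show the kind of overflow check the message asks for, not the radix tree code itself.

```python
ULONG_MAX = (1 << 64) - 1  # ~0UL on a 64-bit machine


def scan_indices(start, step, max_index=ULONG_MAX):
    """Walk indices from start in increments of step, stopping on overflow."""
    visited = []
    index = start
    while index <= max_index:
        visited.append(index)
        nxt = (index + step) & ULONG_MAX  # unsigned long arithmetic wraps
        if nxt < index:
            # Overflow: the index wrapped past ~0UL, so exit instead of
            # continuing the scan from a bogus small index.
            break
        index = nxt
    return visited


visited = scan_indices(ULONG_MAX - 64, 64)
```

Without the `nxt < index` test the loop condition `index <= max_index` is always true when the maximum index is ~0UL, so the scan would never end.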
-
- 18 Aug, 2010 1 commit
-
-
Committed by Hugh Dickins
list_add() corruption messages reported from shmem_fill_super()'s recently introduced percpu_counter_init(): shmem_put_super() needs to remember to percpu_counter_destroy(). And also check error from percpu_counter_init(). Reported-bisected-and-tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 Aug, 2010 1 commit
-
-
Committed by Linus Torvalds
This commit makes the stack guard page somewhat less visible to user space. It does this by:
 - not showing the guard page in /proc/<pid>/maps. It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it.
 - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place.
It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depend on the exact details of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be> Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 15 Aug, 2010 2 commits
-
-
Committed by Randy Dunlap
Remove leading /** from non-kernel-doc function comments to prevent kernel-doc warnings. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Linus Torvalds
We do in fact need to unmap the page table _before_ doing the whole stack guard page logic, because if it is needed (mainly 32-bit x86 with PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it will do a kmap_atomic/kunmap_atomic. And those kmaps will create an atomic region that we cannot do allocations in. However, the whole stack expand code will need to do anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an atomic region. Now, a better model might actually be to do the anon_vma_prepare() when _creating_ a VM_GROWSDOWN segment, and not have to worry about any of this at page fault time. But in the meantime, this is the straightforward fix for the issue. See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details. Reported-by: Wylda <wylda@volny.cz> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Mike Pagano <mpagano@gentoo.org> Reported-by: François Valenduc <francois.valenduc@tvcablenet.be> Tested-by: Ed Tomlinson <edt@aei.ca> Cc: Pekka Enberg <penberg@kernel.org> Cc: Greg KH <gregkh@suse.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 14 Aug, 2010 2 commits
-
-
Committed by David Howells
Remove an extraneous no_printk() in mm/nommu.c that got missed when the function got generalised from several things that used it in commit 12fdff3f ("Add a dummy printk function for the maintenance of unused printks"). Without this, the following error is observed:
  mm/nommu.c:41: error: conflicting types for 'no_printk'
  include/linux/kernel.h:314: error: previous definition of 'no_printk' was here
Reported-by: Michal Simek <monstr@monstr.eu> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Linus Torvalds
.. which didn't show up in my tests because it's a no-op on x86-64 and most other architectures. But we enter the function with the last-level page table mapped, and should unmap it at exit. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Aug, 2010 1 commit
-
-
Committed by Linus Torvalds
This is a rather minimally invasive patch to solve the problem of the user stack growing into a memory mapped area below it. Whenever we fill the first page of the stack segment, expand the segment down by one page. Now, admittedly some odd application might _want_ the stack to grow down into the preceding memory mapping, and so we may at some point need to make this a process tunable (some people might also want to have more than a single page of guarding), but let's try the minimal approach first. Tested with trivial application that maps a single page just below the stack, and then starts recursing. Without this, we will get a SIGSEGV _after_ the stack has smashed the mapping. With this patch, we'll get a nice SIGBUS just as the stack touches the page just above the mapping. Requested-by: Keith Packard <keithp@keithp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
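The behaviour described above can be approximated with a small address-arithmetic model. It is a sketch under assumptions: the function signature, the 4096-byte page size, and the addresses are made up, and the real logic works on vma structures rather than bare integers.

```python
PAGE_SIZE = 4096  # assumed page size for this example


def check_stack_guard_page(vm_start, fault_addr, lower_mapping_end):
    """Return (new vm_start, result) for a fault inside a grows-down stack."""
    address = fault_addr & ~(PAGE_SIZE - 1)  # page containing the fault
    if address != vm_start:
        return vm_start, "ok"                # not the first page: nothing to do
    # The first (lowest) page of the stack is being filled: grow down by one page.
    new_start = vm_start - PAGE_SIZE
    if new_start < lower_mapping_end:
        return vm_start, "SIGBUS"            # would run into the mapping below
    return new_start, "ok"


# Hypothetical layout: a mapping ends at 0x70000000, and the stack currently
# starts one page above it.
lower_end = 0x70000000
start, first = check_stack_guard_page(0x70001000, 0x70001008, lower_end)
start, second = check_stack_guard_page(start, 0x70000010, lower_end)
```

In this model the first fault grows the stack down to the page directly above the mapping, and touching that page then fails instead of letting the stack overwrite the mapping.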
-
- 12 Aug, 2010 3 commits
-
-
Committed by Wu Fengguang
Document global_dirty_limits() and bdi_dirty_limit(). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Wu Fengguang
Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(), so that the latter can be avoided when under global dirty background threshold (which is the normal state for most systems). Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Wu Fengguang
Reducing the number of times balance_dirty_pages calls global_page_state reduces the cache references and so improves write performance on a variety of workloads. 'perf stats' of simple fio write tests shows the reduction in cache access. The test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2 with 3Gb memory (dirty_threshold approx 600 Mb), running each test 10 times, dropping the fastest & slowest values, then taking the average & standard deviation:
              average (s.d.) in millions (10^6)
  2.6.31-rc8  648.6 (14.6)
  +patch      620.1 (16.5)
This reduction is achieved by dropping clip_bdi_dirty_limit, as it rereads the counters to apply the dirty_threshold, and moving this check up into balance_dirty_pages where it has already read the counters. Also, rearranging the for loop to contain only one copy of the limit tests allows the pdflush test after the loop to use the local copies of the counters rather than rereading them. In the common case with no throttling it now calls global_page_state 5 fewer times and bdi_stat 2 fewer. Fengguang: This patch slightly changes behavior by replacing clip_bdi_dirty_limit() with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh) to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly accurate we don't need to clip routinely. A simple dirty limit check would be enough. The check is necessary because, in principle we should throttle everything calling balance_dirty_pages() when we're over the total limit, as said by Peter. We now set and clear dirty_exceeded not only based on bdi dirty limits, but also on the global dirty limit. The global limit check is added in place of clip_bdi_dirty_limit() for safety and not intended as a behavior change. The bdi limits should be tight enough to keep all dirty pages under the global limit most of the time; occasional small exceeding should be OK though. The change makes the logic more obvious: the global limit is the ultimate goal and shall be always imposed. We may now start background writeback work based on outdated conditions. That's safe because the bdi flush thread will (and has to) double check the states. It reduces overall overheads because the test based on old states still has a good chance to be right. [akpm@linux-foundation.org] fix uninitialized dirty_exceeded Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
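The combined check described by Fengguang can be written out as a small predicate. This is a sketch: the global comparison is the one quoted in the message, while the per-bdi comparison, the function signature, and the sample numbers are assumptions made for illustration.

```python
def dirty_exceeded(bdi_nr_reclaimable, bdi_nr_writeback, bdi_thresh,
                   nr_reclaimable, nr_writeback, dirty_thresh):
    """True when either the per-bdi limit or the global dirty limit is reached."""
    over_bdi = bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh
    # Explicit global check that replaces clip_bdi_dirty_limit().
    over_global = nr_reclaimable + nr_writeback >= dirty_thresh
    return over_bdi or over_global


# Hypothetical page counts: this bdi is under its own limit,
# but the system as a whole is over the global limit.
bdi_only_ok = dirty_exceeded(100, 50, 400, 900, 200, 1000)
all_ok = dirty_exceeded(100, 50, 400, 500, 200, 1000)
```

The first call shows the case the message cares about: a device within its own share is still throttled once the global limit is exceeded.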
-