提交 · b6a84016bd2598e35ead635147fa53619982648d · openanolis / cloud-kernel

23 3月, 2011 40 次提交

mm: NUMA aware alloc_thread_info_node() · b6a84016

由 Eric Dumazet 提交于 3月 22, 2011

Add a node parameter to alloc_thread_info(), and change its name to
alloc_thread_info_node()

This change is needed to allow NUMA aware kthread_create_on_cpu()
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NRusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b6a84016

mm: NUMA aware alloc_task_struct_node() · 504f52b5

由 Eric Dumazet 提交于 3月 22, 2011

All kthreads being created from a single helper task, they all use memory
from a single node for their kernel stack and task struct.

This patch suite creates kthread_create_on_cpu(), adding a 'cpu' parameter
to parameters already used by kthread_create().

This parameter serves in allocating memory for the new kthread on its
memory node if available.

Users of this new function are : ksoftirqd, kworker, migration, pktgend...

This patch:

Add a node parameter to alloc_task_struct(), and change its name to
alloc_task_struct_node()

This change is needed to allow NUMA aware kthread_create_on_cpu()
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Reviewed-by: NAndi Kleen <ak@linux.intel.com>
Acked-by: NRusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

504f52b5

mm/compaction: check migrate_pages's return value instead of list_empty() · 9d502c1c

由 Minchan Kim 提交于 3月 22, 2011

Many migrate_page's caller check return value instead of list_empy by
cf608ac1 ("mm: compaction: fix COMPACTPAGEFAILED counting").  This patch
makes compaction's migrate_pages consistent with others.  This patch
should not change old behavior.
Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9d502c1c

mm: compaction: prevent kswapd compacting memory to reduce CPU usage · d527caf2

由 Andrea Arcangeli 提交于 3月 22, 2011

This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
order > 0") due to reports stating that kswapd CPU usage was higher and
IRQs were being disabled more frequently.  This was reported at
http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

Without this patch applied, CPU usage by kswapd hovers around the 20% mark
according to the tester (Arthur Marsh:
http://www.spinics.net/linux/fedora/alsa-user/msg09899.html).  With this
patch applied, it's around 2%.

The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
triggered by high-order allocations hitting the low watermark for their
order and waking kswapd on kernels with CONFIG_COMPACTION set.  The most
common trigger for this is network cards configured for jumbo frames but
it's also possible it'll be triggered by fork-heavy workloads (order-1)
and some wireless cards which depend on order-1 allocations.

The symptoms for the user will be high CPU usage by kswapd in low-memory
situations which could be confused with another writeback problem.  While
a patch like 5a03b051 may be reintroduced in the future, this patch plays
it safe for now and reverts it.

[mel@csn.ul.ie: Beefed up the changelog]
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Signed-off-by: NMel Gorman <mel@csn.ul.ie>
Reported-by: NArthur Marsh <arthur.marsh@internode.on.net>
Tested-by: NArthur Marsh <arthur.marsh@internode.on.net>
Cc: <stable@kernel.org>		[2.6.38.1]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d527caf2

mm: vmap area cache · 89699605

由 Nick Piggin 提交于 3月 22, 2011

Provide a free area cache for the vmalloc virtual address allocator, based
on the algorithm used by the user virtual memory allocator.

This reduces the number of rbtree operations and linear traversals over
the vmap extents in order to find a free area, by starting off at the last
point that a free area was found.

The free area cache is reset if areas are freed behind it, or if we are
searching for a smaller area or alignment than last time.  So allocation
patterns are not changed (verified by corner-case and random test cases in
userspace testing).

This solves a regression caused by lazy vunmap TLB purging introduced in
db64fe02 (mm: rewrite vmap layer).  That patch will leave extents in the
vmap allocator after they are vunmapped, and until a significant number
accumulate that can be flushed in a single batch.  So in a workload that
vmalloc/vfree frequently, a chain of extents will build up from
VMALLOC_START address, which have to be iterated over each time (giving an
O(n) type of behaviour).

After this patch, the search will start from where it left off, giving
closer to an amortized O(1).

This is verified to solve regressions reported Steven in GFS2, and Avi in
KVM.

Hugh's update:

: I tried out the recent mmotm, and on one machine was fortunate to hit
: the BUG_ON(first->va_start < addr) which seems to have been stalling
: your vmap area cache patch ever since May.

: I can get you addresses etc, I did dump a few out; but once I stared
: at them, it was easier just to look at the code: and I cannot see how
: you would be so sure that first->va_start < addr, once you've done
: that addr = ALIGN(max(...), align) above, if align is over 0x1000
: (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).

: I originally got around it by just changing the
: 		if (first->va_start < addr) {
: to
: 		while (first->va_start < addr) {
: without thinking about it any further; but that seemed unsatisfactory,
: why would we want to loop here when we've got another very similar
: loop just below it?

: I am never going to admit how long I've spent trying to grasp your
: "while (n)" rbtree loop just above this, the one with the peculiar
: 		if (!first && tmp->va_start < addr + size)
: in.  That's unfamiliar to me, I'm guessing it's designed to save a
: subsequent rb_next() in a few circumstances (at risk of then setting
: a wrong cached_hole_size?); but they did appear few to me, and I didn't
: feel I could sign off something with that in when I don't grasp it,
: and it seems responsible for extra code and mistaken BUG_ON below it.

: I've reverted to the familiar rbtree loop that find_vma() does (but
: with va_end >= addr as you had, to respect the additional guard page):
: and then (given that cached_hole_size starts out 0) I don't see the
: need for any complications below it.  If you do want to keep that loop
: as you had it, please add a comment to explain what it's trying to do,
: and where addr is relative to first when you emerge from it.

: Aren't your tests "size <= cached_hole_size" and
: "addr + size > first->va_start" forgetting the guard page we want
: before the next area?  I've changed those.

: I have not changed your many "addr + size - 1 < addr" overflow tests,
: but have since come to wonder, shouldn't they be "addr + size < addr"
: tests - won't the vend checks go wrong if addr + size is 0?

: I have added a few comments - Wolfgang Wander's 2.6.13 description of
: 1363c3cd Avoiding mmap fragmentation
: helped me a lot, perhaps a pointer to that would be good too.  And I found
: it easier to understand when I renamed cached_start slightly and moved the
: overflow label down.

: This patch would go after your mm-vmap-area-cache.patch in mmotm.
: Trivially, nobody is going to get that BUG_ON with this patch, and it
: appears to work fine on my machines; but I have not given it anything like
: the testing you did on your original, and may have broken all the
: performance you were aiming for.  Please take a look and test it out
: integrate with yours if you're satisfied - thanks.

[akpm@linux-foundation.org: add locking comment]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Signed-off-by: NHugh Dickins <hughd@google.com>
Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
Reported-and-tested-by: NSteven Whitehouse <swhiteho@redhat.com>
Reported-and-tested-by: NAvi Kivity <avi@redhat.com>
Tested-by: N"Barry J. Marson" <bmarson@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

89699605

pwm_backlight: add check_fb() hook · ef0a5e80

由 Robert Morell 提交于 3月 22, 2011

In systems with multiple framebuffer devices, one of the devices might be
blanked while another is unblanked.  In order for the backlight blanking
logic to know whether to turn off the backlight for a particular
framebuffer's blanking notification, it needs to be able to check if a
given framebuffer device corresponds to the backlight.

This plumbs the check_fb hook from core backlight through the
pwm_backlight helper to allow platform code to plug in a check_fb hook.
Signed-off-by: NRobert Morell <rmorell@nvidia.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Arun Murthy <arun.murthy@stericsson.com>
Cc: Linus Walleij <linus.walleij@stericsson.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ef0a5e80

drivers/video/backlight/jornada720_*.c: make needlessly global symbols static · 0508e04e

由 Axel Lin 提交于 3月 22, 2011

The following symbols are needlessly defined global: jornada_bl_init,
jornada_bl_exit, jornada_lcd_init, jornada_lcd_exit.

Make them static.
Signed-off-by: NAxel Lin <axel.lin@gmail.com>
Acked-by: NKristoffer Ericson <kristoffer.ericson@gmail.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0508e04e

backlight: apple_bl depends on ACPI · b372412e

由 Randy Dunlap 提交于 3月 22, 2011

apple_bl uses ACPI interfaces (data & code), so it should depend on ACPI.

  drivers/video/backlight/apple_bl.c:142: warning: 'struct acpi_device' declared inside parameter list
  drivers/video/backlight/apple_bl.c:142: warning: its scope is only this definition or declaration, which is probably not what you want
  drivers/video/backlight/apple_bl.c:201: warning: 'struct acpi_device' declared inside parameter list
  drivers/video/backlight/apple_bl.c:215: error: variable 'apple_bl_driver' has initializer but incomplete type
  drivers/video/backlight/apple_bl.c:216: error: unknown field 'name' specified in initializer
  ...
Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
Acked-by: NMatthew Garrett <mjg@redhat.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b372412e

mbp_nvidia_bl: rename to apple_bl · 39b3dee7

由 Matthew Garrett 提交于 3月 22, 2011

It works on hardware other than Macbook Pros, and it works on GPUs other
than Nvidia.  It should even work on iMacs, so change the name to match
reality more precisely and include an alias so existing users don't get
confused.
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Acked-by: NRichard Purdie <richard.purdie@linuxfoundation.org>
Cc: Mourad De Clerck <mourad@aquazul.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

39b3dee7

mbp_nvidia_bl: check that the backlight control functions · 99fd28e1

由 Matthew Garrett 提交于 3月 22, 2011

The SMI-based backlight control functionality may fail to work if the
system is running under EFI rather than BIOS.  Check that the hardware
responds as expected, and exit if it doesn't.
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Acked-by: NRichard Purdie <richard.purdie@linuxfoundation.org>
Cc: Mourad De Clerck <mourad@aquazul.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

99fd28e1

mbp_nvidia_bl: remove DMI dependency · 23a9847f

由 Matthew Garrett 提交于 3月 22, 2011

This driver only has to deal with two different classes of hardware, but
right now it needs new DMI entries for every new machine. It turns out
that there's an ACPI device that uniquely identifies Apples with backlights,
so this patch reworks the driver into an ACPI one, identifies the hardware
by checking the PCI vendor of the root bridge and strips out all the DMI
code. It also changes the config text to clarify that it works on devices
other than Macbook Pros and GPUs other than nvidia.
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Acked-by: NRichard Purdie <richard.purdie@linuxfoundation.org>
Cc: Mourad De Clerck <mourad@aquazul.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

23a9847f

acpi: tie ACPI backlight devices to PCI devices if possible · 9661e92c

由 Matthew Garrett 提交于 3月 22, 2011

Dual-GPU machines may provide more than one ACPI backlight interface.  Tie
the backlight device to the GPU in order to allow userspace to identify
the correct interface.
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Airlie <airlied@linux.ie>
Cc: Alex Deucher <alexdeucher@gmail.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Tested-by: NSedat Dilek <sedat.dilek@googlemail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9661e92c

nouveau: change the backlight parent device to the connector, not the PCI dev · 7eae3efa

由 Matthew Garrett 提交于 3月 22, 2011

We may eventually end up with per-connector backlights, especially with
ddcci devices.  Make sure that the parent node for the backlight device is
the connector rather than the PCI device.
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Airlie <airlied@linux.ie>
Cc: Alex Deucher <alexdeucher@gmail.com>
Acked-by: NBen Skeggs <bskeggs@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Tested-by: NSedat Dilek <sedat.dilek@googlemail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7eae3efa

radeon: expose backlight class device for legacy LVDS encoder · 63ec0119

由 Michel Dänzer 提交于 3月 22, 2011

Allows e.g. power management daemons to control the backlight level. Inspired
by the corresponding code in radeonfb.

[mjg@redhat.com: updated to add backlight type and make the connector the parent device]
Signed-off-by: NMichel Dänzer <michel@daenzer.net>
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Airlie <airlied@linux.ie>
Acked-by: NAlex Deucher <alexdeucher@gmail.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Tested-by: NSedat Dilek <sedat.dilek@googlemail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

63ec0119

backlight: add backlight type · bb7ca747

由 Matthew Garrett 提交于 3月 22, 2011

There may be multiple ways of controlling the backlight on a given
machine.  Allow drivers to expose the type of interface they are
providing, making it possible for userspace to make appropriate policy
decisions.
Signed-off-by: NMatthew Garrett <mjg@redhat.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Airlie <airlied@linux.ie>
Cc: Alex Deucher <alexdeucher@gmail.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bb7ca747

drivers/leds/leds-lp5523.c: world-writable engine* sysfs files · ccd7510f

由 Vasiliy Kulikov 提交于 3月 22, 2011

Don't allow everybody to change LED settings.
Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ccd7510f

drivers/leds/leds-lp5521.c: world-writable sysfs engine* files · 67d1da79

由 Vasiliy Kulikov 提交于 3月 22, 2011

Don't allow everybody to change LED settings.
Signed-off-by: NVasiliy Kulikov <segoon@openwall.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

67d1da79

drivers/vidfeo/backlight: ld9040 amoled driver support · 1baf0eb3

由 Donghwa Lee 提交于 3月 22, 2011

Add a ld9040 amoled panel driver.
Signed-off-by: NDonghwa Lee <dh09.lee@samsung.com>
Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: NInki Dae <inki.dae@samsung.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1baf0eb3

leds: make *struct gpio_led_platform_data.leds const · 9517f925

由 Uwe Kleine-König 提交于 3月 22, 2011

And fix a typo.
Signed-off-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9517f925

leds: add driver for LM3530 ALS · b1e6b706

由 Shreshtha Kumar Sahu 提交于 3月 22, 2011

Simple backlight driver for National Semiconductor LM3530.  Presently only
manual mode is supported, PWM and ALS support to be added.
Signed-off-by: NShreshtha Kumar Sahu <shreshthakumar.sahu@stericsson.com>
Cc: Linus Walleij <linus.walleij@stericsson.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b1e6b706

leds: convert bd2802 driver to dev_pm_ops · 551ea738

由 Mark Brown 提交于 3月 22, 2011

There is a move to deprecate bus-specific PM operations and move to using
dev_pm_ops instead in order to reduce the amount of boilerplate code in
buses and facilitiate updates to the PM core.  Do this move for the bs2802
driver.

[akpm@linux-foundation.org: fix warnings]
Signed-off-by: NMark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Kim Kyuwon <q1.kim@samsung.com>
Cc: Kim Kyuwon <chammoru@gmail.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

551ea738

cgroups: if you list_empty() a head then don't list_del() it · 8d258797

由 Phil Carmody 提交于 3月 22, 2011

list_del() leaves poison in the prev and next pointers.  The next
list_empty() will compare those poisons, and say the list isn't empty.
Any list operations that assume the node is on a list because of such a
check will be fooled into dereferencing poison.  One needs to INIT the
node after the del, and fortunately there's already a wrapper for that -
list_del_init().

Some of the dels are followed by deallocations, so can be ignored, and one
can be merged with an add to make a move.  Apart from that, I erred on the
side of caution in making nodes list_empty()-queriable.
Signed-off-by: NPhil Carmody <ext-phil.2.carmody@nokia.com>
Reviewed-by: NPaul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8d258797

oom: avoid deferring oom killer if exiting task is being traced · edd45544

由 David Rientjes 提交于 3月 22, 2011

The oom killer naturally defers killing anything if it finds an eligible
task that is already exiting and has yet to detach its ->mm.  This avoids
unnecessarily killing tasks when one is already in the exit path and may
free enough memory that the oom killer is no longer needed.  This is
detected by PF_EXITING since threads that have already detached its ->mm
are no longer considered at all.

The problem with always deferring when a thread is PF_EXITING, however, is
that it may never actually exit when being traced, specifically if another
task is tracing it with PTRACE_O_TRACEEXIT.  The oom killer does not want
to defer in this case since there is no guarantee that thread will ever
exit without intervention.

This patch will now only defer the oom killer when a thread is PF_EXITING
and no ptracer has stopped its progress in the exit path.  It also ensures
that a child is sacrificed for the chosen parent only if it has a
different ->mm as the comment implies: this ensures that the thread group
leader is always targeted appropriately.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Reported-by: NOleg Nesterov <oleg@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

edd45544

oom: skip zombies when iterating tasklist · 30e2b41f

由 Andrey Vagin 提交于 3月 22, 2011

We shouldn't defer oom killing if a thread has already detached its ->mm
and still has TIF_MEMDIE set.  Memory needs to be freed, so find kill
other threads that pin the same ->mm or find another task to kill.
Signed-off-by: NAndrey Vagin <avagin@openvz.org>
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

30e2b41f

oom: prevent unnecessary oom kills or kernel panics · 3a5dda7a

由 David Rientjes 提交于 3月 22, 2011

This patch prevents unnecessary oom kills or kernel panics by reverting
two commits:

	495789a5 (oom: make oom_score to per-process value)
	cef1d352 (oom: multi threaded process coredump don't make deadlock)

First, 495789a5 (oom: make oom_score to per-process value) ignores the
fact that all threads in a thread group do not necessarily exit at the
same time.

It is imperative that select_bad_process() detect threads that are in the
exit path, specifically those with PF_EXITING set, to prevent needlessly
killing additional tasks.  If a process is oom killed and the thread group
leader exits, select_bad_process() cannot detect the other threads that
are PF_EXITING by iterating over only processes.  Thus, it currently
chooses another task unnecessarily for oom kill or panics the machine when
nothing else is eligible.

By iterating over threads instead, it is possible to detect threads that
are exiting and nominate them for oom kill so they get access to memory
reserves.

Second, cef1d352 (oom: multi threaded process coredump don't make
deadlock) erroneously avoids making the oom killer a no-op when an
eligible thread other than current isfound to be exiting.  We want to
detect this situation so that we may allow that exiting thread time to
exit and free its memory; if it is able to exit on its own, that should
free memory so current is no loner oom.  If it is not able to exit on its
own, the oom killer will nominate it for oom kill which, in this case,
only means it will get access to memory reserves.

Without this change, it is easy for the oom killer to unnecessarily target
tasks when all threads of a victim don't exit before the thread group
leader or, in the worst case, panic the machine.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: <stable@kernel.org>		[2.6.38.x]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3a5dda7a

mm: swap: unlock swapfile inode mutex before closing file on bad swapfiles · 52c50567

由 Mel Gorman 提交于 3月 22, 2011

If an administrator tries to swapon a file backed by NFS, the inode mutex is
taken (as it is for any swapfile) but later identified to be a bad swapfile
due to the lack of bmap and tries to cleanup. During cleanup, an attempt is
made to close the file but with inode->i_mutex still held. Closing an NFS
file syncs it which tries to acquire the inode mutex leading to deadlock. If
lockdep is enabled the following appears on the console;

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.38-rc8-autobuild #1
    ---------------------------------------------
    swapon/2192 is trying to acquire lock:
     (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c

    but task is already holding lock:
     (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    other info that might help us debug this:
    1 lock held by swapon/2192:
     #0:  (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7

    stack backtrace:
    Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1
    Call Trace:
        __lock_acquire+0x2eb/0x1623
        find_get_pages_tag+0x14a/0x174
        pagevec_lookup_tag+0x25/0x2e
        vfs_fsync_range+0x47/0x7c
        lock_acquire+0xd3/0x100
        vfs_fsync_range+0x47/0x7c
        nfs_flush_one+0x0/0xdf [nfs]
        mutex_lock_nested+0x40/0x2b1
        vfs_fsync_range+0x47/0x7c
        vfs_fsync_range+0x47/0x7c
        vfs_fsync+0x1c/0x1e
        nfs_file_flush+0x64/0x69 [nfs]
        filp_close+0x43/0x72
        sys_swapon+0xa39/0xae7
        sysret_check+0x2e/0x69
        system_call_fastpath+0x16/0x1b

This patch releases the mutex if its held before calling filep_close()
so swapon fails as expected without deadlock when the swapfile is backed
by NFS.  If accepted for 2.6.39, it should also be considered a -stable
candidate for 2.6.38 and 2.6.37.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Acked-by: NHugh Dickins <hughd@google.com>
Cc: <stable@kernel.org>		[2.6.37+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52c50567

include/asm-generic/unistd.h: fix syncfs syscall number · c7a1fcd8

由 Andrew Morton 提交于 3月 22, 2011

syncfs() is duplicating name_to_handle_at() due to a merging mistake.

Cc: Sage Weil <sage@newdream.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c7a1fcd8

Merge branch 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 · 01ba8251

由 Linus Torvalds 提交于 3月 22, 2011

* 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: Add statistics for this_cmpxchg_double failures
  slub: Add missing irq restore for the OOM path

01ba8251

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs · ab70a1d7

由 Linus Torvalds 提交于 3月 22, 2011

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  [net/9p]: Introduce basic flow-control for VirtIO transport.
  9p: use the updated offset given by generic_write_checks
  [net/9p] Don't re-pin pages on retrying virtqueue_add_buf().
  [net/9p] Set the condition just before waking up.
  [net/9p] unconditional wake_up to proc waiting for space on VirtIO ring
  fs/9p: Add v9fs_dentry2v9ses
  fs/9p: Attach writeback_fid on first open with WR flag
  fs/9p: Open writeback fid in O_SYNC mode
  fs/9p: Use truncate_setsize instead of vmtruncate
  net/9p: Fix compile warning
  net/9p: Convert the in the 9p rpc call path to GFP_NOFS
  fs/9p: Fix race in initializing writeback fid

ab70a1d7

Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client · 0adfc56c

由 Linus Torvalds 提交于 3月 22, 2011

* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: use watch/notify for changes in rbd header
  libceph: add lingering request and watch/notify event framework
  rbd: update email address in Documentation
  ceph: rename dentry_release -> d_release, fix comment
  ceph: add request to the tail of unsafe write list
  ceph: remove request from unsafe list if it is canceled/timed out
  ceph: move readahead default to fs/ceph from libceph
  ceph: add ino32 mount option
  ceph: update common header files
  ceph: remove debugfs debug cruft
  libceph: fix osd request queuing on osdmap updates
  ceph: preserve I_COMPLETE across rename
  libceph: Fix base64-decoding when input ends in newline.

0adfc56c

tty: stop using "delayed_work" in the tty layer · f23eb2b2

由 Linus Torvalds 提交于 3月 22, 2011

Using delayed-work for tty flip buffers ends up causing us to wait for
the next tick to complete some actions.  That's usually not all that
noticeable, but for certain latency-critical workloads it ends up being
totally unacceptable.

As an extreme case of this, passing a token back-and-forth over a pty
will take two ticks per iteration, so even just a thousand iterations
will take 8 seconds assuming a common 250Hz configuration.

Avoiding the whole delayed work issue brings that ping-pong test-case
down to 0.009s on my machine.

In more practical terms, this latency has been a performance problem for
things like dive computer simulators (simulating the serial interface
using the ptys) and for other environments (Alan mentions a CP/M emulator).
Reported-by: NJef Driesen <jefdriesen@telenet.be>
Acked-by: NGreg KH <gregkh@suse.de>
Acked-by: NAlan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f23eb2b2

[net/9p]: Introduce basic flow-control for VirtIO transport. · 68da9ba4

由 Venkateswararao Jujjuri (JV) 提交于 3月 18, 2011

Recent zerocopy work in the 9P VirtIO transport maps and pins
user buffers into kernel memory for the server to work on them.
Since the user process can initiate this kind of pinning with a simple
read/write call, thousands of IO threads initiated by the user process can
hog the system resources and could result into denial of service.

This patch introduces flow control to avoid that extreme scenario.

The ceiling limit to avoid denial of service attacks is set to relatively
high (nr_free_pagecache_pages()/4) so that it won't interfere with
regular usage, but can step in extreme cases to limit the total system
hang. Since we don't have a global structure to accommodate this variable,
I choose the virtio_chan as the home for this.
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Reviewed-by: NBadari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

68da9ba4

9p: use the updated offset given by generic_write_checks · aaf0ef1d

由 M. Mohan Kumar 提交于 3月 16, 2011

Without this fix, even if a file is opened in O_APPEND mode, data will be
written at current file position instead of end of file.
Signed-off-by: NM. Mohan Kumar <mohan@in.ibm.com>
Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

aaf0ef1d

V
[net/9p] Don't re-pin pages on retrying virtqueue_add_buf(). · 316ad550
由 Venkateswararao Jujjuri (JV) 提交于 3月 14, 2011
```
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>
```
316ad550

[net/9p] Set the condition just before waking up. · a01a9840

由 Venkateswararao Jujjuri (JV) 提交于 3月 14, 2011

Given that the sprious wake-ups are common, we need to move the
condition setting right next to the wake_up().  After setting the condition
to req->status = REQ_STATUS_RCVD, sprious wakeups may cause the
virtqueue back on the free list for someone else to use.
This may result in kernel panic while relasing the pinned pages
in p9_release_req_pages().

Also rearranged the while loop in req_done() for better redability.
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

a01a9840

[net/9p] unconditional wake_up to proc waiting for space on VirtIO ring · 53bda3e5

由 Venkateswararao Jujjuri (JV) 提交于 3月 08, 2011

Process may wait to get space on VirtIO ring to send a transaction to
VirtFS server. Current code just does a conditional wake_up() which
means only one process will be woken up even if multiple processes
are waiting.

This fix makes the wake_up unconditional. Hence we won't have any
processes waiting for-ever.
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

53bda3e5

fs/9p: Add v9fs_dentry2v9ses · 42869c8a

由 Aneesh Kumar K.V 提交于 3月 08, 2011

Add the new static inline and use the same
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

42869c8a

fs/9p: Attach writeback_fid on first open with WR flag · 7add697a

由 Aneesh Kumar K.V 提交于 3月 08, 2011

We don't need writeback fid if we are only doing O_RDONLY open
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

7add697a

fs/9p: Open writeback fid in O_SYNC mode · ea59bb75

由 Aneesh Kumar K.V 提交于 3月 08, 2011

Older version of protocol don't support tsyncfs operation.
So for them force a O_SYNC flag on the server
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

ea59bb75

fs/9p: Use truncate_setsize instead of vmtruncate · 059c138b

由 Aneesh Kumar K.V 提交于 3月 08, 2011

convert vmtruncate usage to truncate_setsize. We also writeback
all dirty pages before doing 9p operations and on success call truncate_setsize.
This ensure that we continue sanely on failed truncate on the server. The
disadvantage is that we are now going to write back the content that get
thrown away later as a part of truncate.
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NVenkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: NEric Van Hensbergen <ericvh@gmail.com>

059c138b

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功