- 03 Nov, 2011 40 commits
-
-
Committed by Andrew Bresticker
While back-porting Johannes Weiner's patch "mm: memcg-aware global reclaim" for an internal effort, we noticed a significant performance regression during page-reclaim heavy workloads due to high contention of the ss->id_lock. This lock protects the idr map, and serializes calls to idr_get_next() in css_get_next() (which is used during the memcg hierarchy walk).

Since idr_get_next() is just doing a lookup, we need only serialize it with respect to idr_remove()/idr_get_new(). By making the ss->id_lock a rwlock, contention is greatly reduced and performance improves.

Tested: cat a 256m file from a ramdisk in a 128m container 50 times on each core (one file + container per core) in parallel on a NUMA machine. Result is the time for the test to complete in 1 of the containers. Both kernels included Johannes' memcg-aware global reclaim patches.

    Before rwlock patch: 1710.778s
    After rwlock patch: 152.227s

Signed-off-by: Andrew Bresticker <abrestic@google.com>
Cc: Paul Menage <menage@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jeff Moyer
In testing aio on a fast storage device, I found that the context lock takes up a fair amount of cpu time in the I/O submission path. The reason is that we take it for every I/O submitted (see __aio_get_req). Since we know how many I/Os are passed to io_submit, we can preallocate the kiocbs in batches, reducing the number of times we take and release the lock.

In my testing, I was able to reduce the amount of time spent in _raw_spin_lock_irq by .56% (average of 3 runs). The command I used to test this was:

    aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>

I also tested the patch with various numbers of events passed to io_submit, and I ran the xfstests aio group of tests to ensure I didn't break anything.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Daniel Ehrenberg <dehrenberg@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
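The batching idea can be sketched outside the kernel as follows. The types and helpers here (`struct req`, `struct ctx`, `req_get_batch`, `req_put_batch`) are hypothetical and are not the kernel's kiocb or kioctx; the `lock_count` field exists only to make the saving visible. Allocation happens without the lock, and one lock round trip then accounts for the whole batch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical request object, linked into a simple list. */
struct req {
    struct req *next;
};

/* Hypothetical submission context. */
struct ctx {
    pthread_mutex_t lock;
    unsigned long lock_count;  /* how often the lock was taken */
    unsigned int reqs_active;  /* accounting protected by the lock */
};

static void ctx_init(struct ctx *c)
{
    pthread_mutex_init(&c->lock, NULL);
    c->lock_count = 0;
    c->reqs_active = 0;
}

/* Allocate up to nr requests and account for them under a single lock
 * acquisition. Returns the list head and stores the count in *got. */
static struct req *req_get_batch(struct ctx *c, unsigned int nr,
                                 unsigned int *got)
{
    struct req *head = NULL;
    unsigned int n;

    /* The allocations themselves need no lock. */
    for (n = 0; n < nr; n++) {
        struct req *r = malloc(sizeof(*r));
        if (r == NULL)
            break;
        r->next = head;
        head = r;
    }

    /* One lock round trip covers all n requests. */
    pthread_mutex_lock(&c->lock);
    c->lock_count++;
    c->reqs_active += n;
    pthread_mutex_unlock(&c->lock);

    *got = n;
    return head;
}

/* Free a batch and drop its accounting, again with one lock acquisition. */
static void req_put_batch(struct ctx *c, struct req *head)
{
    unsigned int n = 0;

    while (head != NULL) {
        struct req *next = head->next;
        free(head);
        head = next;
        n++;
    }

    pthread_mutex_lock(&c->lock);
    c->lock_count++;
    assert(c->reqs_active >= n);
    c->reqs_active -= n;
    pthread_mutex_unlock(&c->lock);
}
```

Submitting 32 requests this way takes the lock once instead of 32 times, which is the effect the message describes for io_submit.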
-
Committed by Rakib Mullick
Fix typo in code comment.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Rakib Mullick
In vmballoon_reserve_page(), flags is passed in by the calling function (vmballoon_inflate here), so we can determine can_sleep outside the loop.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Acked-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jan Weitzel
Interrupting w1_delay() in w1_read_bit() results in missing the low level on the w1 line and receiving "1" instead of "0". Add local_irq_save()/local_irq_restore() around the critical section.

Signed-off-by: Jan Weitzel <j.weitzel@phytec.de>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Florian Faber
When using multiple masters, w1_int.c would use the .init_name from w1.c for all entities, which fails when the corresponding sysfs entry is created. This patch uses the unique name previously generated.

    WARNING: at fs/sysfs/dir.c:451 sysfs_add_one+0x48/0x64()
    sysfs: cannot create duplicate filename '/devices/w1 bus master'
    Modules linked in:
    Call trace:
    [<9001a604>] warn_slowpath_common+0x34/0x44
    [<9001a64c>] warn_slowpath_fmt+0x14/0x18
    [<90078020>] sysfs_add_one+0x48/0x64
    [<900784ec>] create_dir+0x40/0x68
    [<9007857a>] sysfs_create_dir+0x66/0x78
    [<900c1a8a>] kobject_add_internal+0x6e/0x104
    [<900c1bc0>] kobject_add_varg+0x20/0x2c
    [<900c1c1c>] kobject_add+0x30/0x3c
    [<900dbd66>] device_add+0x6a/0x378
    [<900dbb4a>] device_initialize+0x12/0x48
    [<900dc080>] device_register+0xc/0x10
    [<900f99be>] w1_add_master_device+0x162/0x274
    [<90008e7a>] w1_gpio_probe+0x66/0xb4
    [<9000030c>] kernel_init+0x0/0xe8
    [<900dde54>] platform_drv_probe+0xc/0xe
    [<9000030c>] kernel_init+0x0/0xe8
    [<900dd4f8>] driver_probe_device+0x6c/0xdc
    [<900dd5fc>] __driver_attach+0x34/0x48
    [<900dcce8>] bus_for_each_dev+0x2c/0x48
    [<900dd5c8>] __driver_attach+0x0/0x48
    [<900dd38c>] driver_attach+0x10/0x14
    [<900dd16a>] bus_add_driver+0x6a/0x18c
    [<900dd768>] driver_register+0x60/0xb8
    [<90011594>] __initcall_w1_therm_init6+0x0/0x4
    [<90008e00>] w1_gpio_init+0x0/0x14
    [<9000030c>] kernel_init+0x0/0xe8
    [<900ddf48>] platform_driver_register+0x30/0x38
    [<90011594>] __initcall_w1_therm_init6+0x0/0x4
    [<90008e00>] w1_gpio_init+0x0/0x14
    [<9000030c>] kernel_init+0x0/0xe8
    [<900ddf5e>] platform_driver_probe+0xe/0x3c
    [<90008e0c>] w1_gpio_init+0xc/0x14
    [<90011594>] __initcall_w1_therm_init6+0x0/0x4
    [<90008e00>] w1_gpio_init+0x0/0x14
    [<900126d4>] do_one_initcall+0x34/0x130
    [<90000372>] kernel_init+0x66/0xe8
    [<90011594>] __initcall_w1_therm_init6+0x0/0x4
    [<9001ca3e>] do_exit+0x0/0x3a6
    [<9000030c>] kernel_init+0x0/0xe8
    [<9001ca3e>] do_exit+0x0/0x3a6
    ---[ end trace 5a9233884fead918 ]---
    kobject_add_internal failed for w1 bus master with -EEXIST, don't try to register things with the same name in the same directory.

Signed-off-by: Florian Faber <faber@faberman.de>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Clifton Barnes
Fixes the deadlock when inserting and removing the ds2780.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Clifton Barnes
Adds a nolock function to the w1 interface to avoid locking the mutex if needed.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Clifton Barnes
Simply creates one point to call the w1 interface.

Signed-off-by: Clifton Barnes <cabarnes@indesign-llc.com>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jonathan Cameron
Straightforward. As an aside, the ida_init calls are not needed as far as I can see (DEFINE_IDA does the same already).

Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk>
Cc: Evgeniy Polyakov <zbr@ioremap.net>
Acked-by: Clifton Barnes <cabarnes@indesign-llc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Heiko Carstens
Add "depends on GENERIC_HARDIRQS" to avoid compile breakage on s390:

    drivers/built-in.o: In function `pps_gpio_remove':
    linux-next/drivers/pps/clients/pps-gpio.c:189: undefined reference to `free_irq'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Nuss <jamesnuss@nanometrics.ca>
Cc: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by James Nuss
This client driver allows you to use a GPIO pin as a source for PPS signals. Platform data [1] are used to specify the GPIO pin number, label, assert event edge type, and whether clear events are captured.

This driver is based on the work by Ricardo Martins who submitted an initial implementation [2] of a PPS IRQ client driver to the linuxpps mailing-list on Dec 3 2010.

[1] include/linux/pps-gpio.h
[2] http://ml.enneenne.com/pipermail/linuxpps/2010-December/004155.html

[akpm@linux-foundation.org: remove unneeded cast of void*]
Signed-off-by: James Nuss <jamesnuss@nanometrics.ca>
Cc: Ricardo Martins <rasm@fe.up.pt>
Acked-by: Rodolfo Giometti <giometti@linux.it>
Signed-off-by: Ricardo Martins <rasm@fe.up.pt>
Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by James Nuss
A default echo function has been provided so it is no longer an error when you specify PPS_ECHOASSERT or PPS_ECHOCLEAR without an explicit echo function. This allows some code re-use and also makes it easier to write client drivers since the default echo function does not normally need to change.

Signed-off-by: James Nuss <jamesnuss@nanometrics.ca>
Reviewed-by: Ben Gardiner <bengardiner@nanometrics.ca>
Acked-by: Rodolfo Giometti <giometti@linux.it>
Cc: Ricardo Martins <rasm@fe.up.pt>
Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su>
Cc: Igor Plyatov <plyatov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Andrew Morton
Lots of driver code does a dma_alloc_coherent() and then zeroes out the memory with a memset. Make it easy for them.

Cc: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by WANG Cong
When I tried to send a patch to remove it, Andi told me we still need to keep compatibility with old libc, so we can't remove this completely. So just make it default to n and remove the doc from feature-removal-schedule.txt.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Lucas De Marchi
Adding support for poll() in sysctl fs allows userspace to receive notifications of changes in sysctl entries. This adds infrastructure to allow files in sysctl fs to be pollable and implements it for hostname and domainname.

[akpm@linux-foundation.org: s/declare/define/ for definitions]
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: Greg KH <gregkh@suse.de>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Alexandre Bounine
Update rapidio.txt to reflect changes from recent patch. See http://marc.info/?l=linux-kernel&m=131285620113589&w=2 for details.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Liu Gang <Gang.Liu@freescale.com>
Cc: Micha Nelissen <micha@neli.hopto.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Alexandre Bounine
Modify Ethernet address macros to be compatible with BE/LE platforms.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Chul Kim <chul.kim@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: <stable@kernel.org> [2.6.39+]
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Alexandre Bounine
The "goto cleanup" path can dereference "rswitch" when it is NULL.

Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Chul Kim <chul.kim@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Alexandre Bounine
Add RapidIO mport driver for IDT TSI721 PCI Express-to-SRIO bridge device. The driver provides the full set of callback functions defined for mport devices in the RapidIO subsystem. It is also compatible with the current version of the RIONET driver (Ethernet over RapidIO messaging services).

This patch is applicable to kernel versions starting from 2.6.39.

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Chul Kim <chul.kim@idt.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Liu Gang
arch/powerpc/sysdev/fsl_rio.c: release rapidio port I/O region resource if port failed to initialize

The "struct rio_mport" contains a member of master port I/O memory resource structure "struct resource iores". This resource will be read from the device tree and be used for rapidio R/W transaction memory space. Rapidio requests the port I/O memory resource under the root resource "iomem_resource".

    struct rio_mport *port;
    port = kzalloc(sizeof(struct rio_mport), GFP_KERNEL);
    request_resource(&iomem_resource, &port->iores);

When the port fails to initialize, the allocated "rio_mport" structure memory is freed, and the port I/O memory resource structure pointer "&port->iores" becomes invalid. If anything else then requests a resource under "iomem_resource", the stale "&port->iores" node may be accessed in the child resource list, and this will cause the system to crash.

So the requested port I/O memory resource should be released before freeing the allocated "rio_mport" structure.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
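The ordering problem described in this message can be sketched with a plain linked list. Everything here is hypothetical: `root_children` stands in for the child list of "iomem_resource", `res_request()`/`res_release()` stand in for request_resource()/release_resource(), and `port_setup()` is an invented probe routine. The point is only that a node embedded in a structure must be unlinked before that structure is freed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource node, linked into its parent's child list. */
struct res {
    struct res *next;
};

/* Hypothetical stand-in for the child list of iomem_resource. */
static struct res *root_children;

static void res_request(struct res *r)
{
    r->next = root_children;
    root_children = r;
}

static void res_release(struct res *r)
{
    struct res **pp = &root_children;

    while (*pp != NULL) {
        if (*pp == r) {
            *pp = r->next;
            r->next = NULL;
            return;
        }
        pp = &(*pp)->next;
    }
}

/* The resource node lives inside the port structure. */
struct port {
    struct res iores;
};

/* Hypothetical setup routine; init_fails simulates a failed port init. */
static struct port *port_setup(int init_fails)
{
    struct port *p = calloc(1, sizeof(*p));

    if (p == NULL)
        return NULL;
    res_request(&p->iores);

    if (init_fails) {
        /* Unlink first: freeing p while &p->iores is still on the
         * list would leave a dangling pointer for the next walker. */
        res_release(&p->iores);
        free(p);
        return NULL;
    }
    return p;
}
```

Without the `res_release()` call on the error path, the next `res_request()` or list walk would touch freed memory, which matches the crash scenario the message describes.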
-
Committed by Liu Gang
The discovered bit in the PGCCSR register indicates whether the device has been discovered by the system host. In Rapidio systems, some agent devices can also be master devices. They can issue requests into the system.

Signed-off-by: Liu Gang <Gang.Liu@freescale.com>
Acked-by: Alexandre Bounine <alexandre.bounine@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Will Drewry
Expand root=PARTUUID=UUID syntax to support selecting a root partition by integer offset from a known, unique partition. This approach provides similar properties to specifying a device and partition number, but using the UUID as the unique path prior to evaluating the offset.

For example,

    root=PARTUUID=99DE9194-FC15-4223-9192-FC243948F88B/PARTNROFF=1

selects the partition with UUID 99DE.. and then selects the next partition.

This change is motivated by a particular use case in Chromium OS where the bootloader can easily determine what partition it is on (by UUID) but doesn't perform general partition table walking.

That said, support for this model provides a direct mechanism for the user to modify the root partition to boot without specifically needing to extract each UUID or update the bootloader explicitly when the root partition UUID is changed (if it is recreated to be larger, for instance). Pinning to a /boot-style partition UUID allows arbitrary root partition reconfiguration/modifications with slightly less ambiguity than just [dev][partition] and less stringency than the specific root partition UUID.

[sfr@canb.auug.org.au: fix init sections warning]
Signed-off-by: Will Drewry <wad@chromium.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
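To make the syntax concrete, here is a small user-space parser for the value that follows `root=`. It is a sketch under assumptions, not the kernel's parsing code: `parse_partuuid()` is a hypothetical helper, it assumes the 36-character textual UUID form shown in the example, and it treats a missing `/PARTNROFF=` suffix as offset 0.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define UUID_STR_LEN 36

/* Hypothetical parser: splits "PARTUUID=<uuid>[/PARTNROFF=<n>]" into the
 * UUID string and the integer partition offset.
 * Returns 0 on success, -1 if the spec is malformed. */
static int parse_partuuid(const char *spec, char uuid[UUID_STR_LEN + 1],
                          int *offset)
{
    static const char prefix[] = "PARTUUID=";
    const char *p;
    char trailing;

    *offset = 0;
    if (strncmp(spec, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    p = spec + sizeof(prefix) - 1;

    if (strlen(p) < UUID_STR_LEN)
        return -1;
    memcpy(uuid, p, UUID_STR_LEN);
    uuid[UUID_STR_LEN] = '\0';
    p += UUID_STR_LEN;

    if (*p == '\0')
        return 0;  /* no offset suffix: use the named partition itself */

    /* The extra %c conversion detects junk after the number. */
    if (sscanf(p, "/PARTNROFF=%d%c", offset, &trailing) != 1)
        return -1;
    return 0;
}
```

For the example in the message, the parser yields the UUID `99DE9194-FC15-4223-9192-FC243948F88B` and offset 1; a caller would then look up that partition and step one partition number forward.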
-
Committed by Manfred Spraul
For the sysvsem undo, each task struct contains a sysv_sem structure with a pointer to the undo information. This pointer is only necessary if sysvipc is enabled, so the pointer can be made conditional on CONFIG_SYSVIPC.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Manfred Spraul
include/linux/sem.h contains several structures that are only used within ipc/sem.c. The patch moves them into ipc/sem.c, as there is no need to expose the structures to the whole kernel.

No functional changes, only whitespace cleanups and 80-char per line fixes.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Manfred Spraul
semtimedop() does not handle spurious wakeups; it returns -EINTR to user space. Most other schedule() users would just loop and not return to user space. The patch adds such a loop to semtimedop().

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
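The "loop instead of returning" pattern has a well-known user-space counterpart in POSIX condition variables, where a wakeup alone does not prove the awaited event happened. The sketch below shows that analogue, not the semtimedop() code; `struct waiter` and its helpers are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical waiter; "done" plays the role of the completion status. */
struct waiter {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int done;
};

static void waiter_init(struct waiter *w)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->cond, NULL);
    w->done = 0;
}

/* Called by whoever completes the operation. */
static void waiter_complete(struct waiter *w)
{
    pthread_mutex_lock(&w->lock);
    w->done = 1;
    pthread_cond_signal(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Sleeps until the operation has really completed. A wakeup that finds
 * done still clear is treated as spurious and the thread sleeps again.
 * Returns the number of wakeups it went through. */
static int waiter_wait(struct waiter *w)
{
    int wakeups = 0;

    pthread_mutex_lock(&w->lock);
    while (!w->done) {
        pthread_cond_wait(&w->cond, &w->lock);
        wakeups++;
    }
    pthread_mutex_unlock(&w->lock);
    return wakeups;
}
```

The key design choice is that the return path depends on the state (`done`), never on the mere fact of having been woken.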
-
Committed by Manfred Spraul
sys_semtimedop() may return -EIDRM although the semaphore operation completed successfully:

    thread 1:                                  thread 2:
    semtimedop(), sleeps
                                               semop():
                                               * acquires sem_lock()
    semtimedop() woken up due to timeout
    sem_lock() loops
                                               * notices that thread 1 could be completed.
                                               * performs the operations that thread 1 is sleeping on.
                                               * marks the semaphore operation as IN_WAKEUP
                                               * drops sem_lock(), does wakeup, sets return code to 0
    * thread delayed due to interrupt, whatever
                                               * returns to user space
    * thread still delayed
                                               semctl(IPC_RMID)
                                               * acquires sem_lock()
                                               * ipc_rmid(), ipcp->deleted=1
                                               * drops sem_lock()
    * thread finally continues - but sem_lock()
      now fails due to ipcp->deleted == 1
    * returns -EIDRM instead of 0

The fix is trivial: Always use the return code in queue.status.

In the real world, the race probably doesn't matter: If the semaphore array is destroyed, the app is probably not interested in whether the last operation succeeded or was already cancelled.

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Tejun Heo
It's often convenient to be able to release resources from IRQ context. Make ida_simple_*() use irqsave/restore spin ops so that they are IRQ safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Vasiliy Kulikov
fd* files are restricted to the task's owner, and other users may not get direct access to them. But one may open any of these files and run any setuid program, keeping the file descriptors open. As there are permission checks on open(), but not on readdir() and read(), operations on the kept file descriptors will not be checked. This makes it possible to violate the procfs permission model.

Reading fdinfo/* may disclose the current fds' position and flags; reading the directory contents of fdinfo/ and fd/ may disclose the number of files opened by the target task. This information is not sensitive per se, but it can reveal some private information (like the length of a password stored in a file) under certain conditions.

Used existing (un)lock_trace functions to check for ptrace_may_access(), but instead of using the EPERM return code from it use EACCES to be consistent with the existing proc_pid_follow_link()/proc_pid_readlink() return code. If they differ, an attacker can guess what fds exist by analyzing the stat() return code. Patched handlers: stat() for fd/*, stat() and read() for fdinfo/*, readdir() and lookup() for fd/ and fdinfo/.

Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Pavel Emelyanov
On reading sysctl dirs we should return -EISDIR instead of -EINVAL.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Rientjes
{get,put}_mems_allowed() exist so that general kernel code may locklessly access a task's set of allowable nodes without having the chance that a concurrent write will cause the nodemask to be empty on configurations where MAX_NUMNODES > BITS_PER_LONG.

This could incur a significant delay, however, especially in low memory conditions because the page allocator is blocking and reclaim requires get_mems_allowed() itself. It is not atypical to see writes to cpuset.mems take over 2 seconds to complete, for example. In low memory conditions, this is problematic because it's one of the most important times to change cpuset.mems in the first place!

The only way a task's set of allowable nodes may change is through cpusets by writing to cpuset.mems and when attaching a task to a different cpuset. This is done by setting all new nodes, ensuring generic code is not reading the nodemask with get_mems_allowed() at the same time, and then clearing all the old nodes. This prevents the possibility that a reader will see an empty nodemask at the same time the writer is storing a new nodemask.

If at least one node remains unchanged, though, it's possible to simply set all new nodes and then clear all the old nodes. Changing a task's nodemask is protected by cgroup_mutex so it's guaranteed that two threads are not changing the same task's nodemask at the same time, so the nodemask is guaranteed to be stored before another thread changes it and determines whether a node remains set or not.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Paul Menage <paul@paulmenage.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
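The set-then-clear update described above can be sketched on a multi-word bitmask. This is an illustration under assumptions rather than the cpuset code: `struct nodemask` here is a hypothetical four-word mask, `update_nodemask()` is an invented helper, and memory ordering between writer and readers is deliberately not modelled. The function reports whether the slow path (waiting for readers between the two steps) would be needed.

```c
#include <assert.h>
#include <stdbool.h>

/* More than one word, as when MAX_NUMNODES > BITS_PER_LONG. */
#define MASK_WORDS 4

/* Hypothetical stand-in for a task's set of allowed nodes. */
struct nodemask {
    unsigned long bits[MASK_WORDS];
};

static bool masks_intersect(const struct nodemask *a,
                            const struct nodemask *b)
{
    for (int i = 0; i < MASK_WORDS; i++)
        if (a->bits[i] & b->bits[i])
            return true;
    return false;
}

/* Store newmask into cur in two steps. Returns true when old and new
 * masks share no node, i.e. when a real implementation would have to
 * wait for lockless readers between the steps. */
static bool update_nodemask(struct nodemask *cur,
                            const struct nodemask *newmask)
{
    bool need_sync = !masks_intersect(cur, newmask);

    /* Step 1: set all new nodes; cur now holds old | new. */
    for (int i = 0; i < MASK_WORDS; i++)
        cur->bits[i] |= newmask->bits[i];

    /* If need_sync, reader synchronization would go here (omitted). */

    /* Step 2: clear the nodes that were only in the old mask. */
    for (int i = 0; i < MASK_WORDS; i++)
        cur->bits[i] = newmask->bits[i];

    return need_sync;
}
```

When the masks overlap, the shared node stays set through both steps, so a reader scanning word by word can never observe an all-zero mask; that is why the wait can be skipped in that case.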
-
Committed by H Hartley Sweeten
warning: symbol 'swap_cgroup_ctrl' was not declared. Should it be static?

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Steven Rostedt
Various code in memcontrol.c calls this_cpu_read() on the calculations to be done from two different percpu variables, or does an open-coded read-modify-write on a single percpu variable.

Disable preemption throughout these operations so that the writes go to the correct places.

[hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Johannes Weiner
There is a potential race between a thread charging a page and another thread putting it back to the LRU list:

    charge:                         putback:
    SetPageCgroupUsed               SetPageLRU
    PageLRU && add to memcg LRU     PageCgroupUsed && add to memcg LRU

The order of setting one flag and checking the other is crucial, otherwise the charge may observe !PageLRU while the putback observes !PageCgroupUsed and the page is not linked to the memcg LRU at all.

Global memory pressure may fix this by trying to isolate and putback the page for reclaim, where that putback would link it to the memcg LRU again. Without that, the memory cgroup is undeletable due to a charge whose physical page can not be found and moved out.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Cc: Ying Han <yinghan@google.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Johannes Weiner
Reclaim decides to skip scanning an active list when the corresponding inactive list is above a certain size relative to it, so as to leave the assumed working set alone while there are still enough reclaim candidates around.

The memcg implementation of comparing those lists instead reports whether the whole memcg is low on the requested type of inactive pages, considering all nodes and zones.

This can lead to an oversized active list not being scanned because of the state of the other lists in the memcg, as well as an active list being scanned while its corresponding inactive list has enough pages.

Not only is this wrong, it's also a scalability hazard, because the global memory state over all nodes and zones has to be gathered for each memcg and zone scanned.

Make these calculations purely based on the size of the two LRU lists that are actually affected by the outcome of the decision.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
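A decision that depends only on the two affected lists might look like the sketch below. The message does not state the threshold the kernel uses, so the `ratio` parameter is an assumption made for illustration, and `struct lru_pair` and `inactive_is_low()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sizes, in pages, of one zone's inactive/active LRU pair
 * within a single memcg. */
struct lru_pair {
    unsigned long inactive;
    unsigned long active;
};

/* Decide from these two lists alone whether the inactive list is low,
 * i.e. whether the active list should be scanned. ratio is an assumed
 * tunable: the target number of active pages per inactive page. */
static bool inactive_is_low(const struct lru_pair *lru, unsigned long ratio)
{
    assert(ratio > 0);
    return lru->inactive * ratio < lru->active;
}
```

Because the inputs are just two counters that belong to the lists being scanned, nothing has to be summed across nodes and zones, which is the scalability point the message makes.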
-
Committed by Igor Mammedov
If somebody is touching data too early, it might be easier to diagnose a problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to understand why mem_cgroup_per_zone is [un|partly]initialized.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
Before calling schedule_timeout(), the task state should be changed.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Raghavendra K T
The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses "struct mem_cgroup *memcg". Rename all mem variables to memcg in the source file.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Steven Rostedt
When the cgroup base was allocated with kmalloc, it was necessary to annotate the variable with kmemleak_not_leak(). But now that it is allocated with alloc_page() (which skips kmemleak checks), the annotation causes a warning on boot up.

I was triggering this output:

    allocated 8388608 bytes of page_cgroup
    please try 'cgroup_disable=memory' option if you don't want memory cgroups
    kmemleak: Trying to color unknown object at 0xf5840000 as Grey
    Pid: 0, comm: swapper Not tainted 3.0.0-test #12
    Call Trace:
     [<c17e34e6>] ? printk+0x1d/0x1f
     [<c10e2941>] paint_ptr+0x4f/0x78
     [<c178ab57>] kmemleak_not_leak+0x58/0x7d
     [<c108ae9f>] ? __rcu_read_unlock+0x9/0x7d
     [<c1cdb462>] kmemleak_init+0x19d/0x1e9
     [<c1cbf771>] start_kernel+0x346/0x3ec
     [<c1cbf1b4>] ? loglevel+0x18/0x18
     [<c1cbf0aa>] i386_start_kernel+0xaa/0xb0

After a bit of debugging I tracked the object 0xf5840000 (and others) down to the cgroup code. The change from allocating base with kmalloc to alloc_page() has the base not calling kmemleak_alloc() which adds the pointer to the object_tree_root, but kmemleak_not_leak() adds it to the crt_early_log[] table. On kmemleak_init(), the entry is found in the early_log[] but not the object_tree_root, and this error message is displayed.

If alloc_page() fails then it falls back to vmalloc(), which still uses kmemleak_alloc(), so we still need the kmemleak_not_leak() call. The solution is to call kmemleak_alloc() directly if alloc_page() succeeds.

Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ben Blum
If a task has exited to the point it has called cgroup_exit() already, then we can't migrate it to another cgroup anymore.

This can happen when we are attaching a task to a new cgroup between the call to ->can_attach_task() on subsystems and the migration that is eventually tried in cgroup_task_migrate().

In this case cgroup_task_migrate() returns -ESRCH and we don't want to attach the task to the subsystems because the attachment to the new cgroup itself failed.

Fix this by only calling ->attach_task() on the subsystems if the cgroup migration succeeded.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-