- 16 10月, 2015 2 次提交
-
-
由 Tejun Heo 提交于
Currently, cgroup->nr_populated counts whether the cgroup has any css_sets linked to it and the number of children which has non-zero ->nr_populated. This works because a css_set's refcnt converges with the number of tasks linked to it and thus there's no css_set linked to a cgroup if it doesn't have any live tasks. To help tracking resource usage of zombie tasks, putting the ref of css_set will be separated from disassociating the task from the css_set which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks. This patch updates cgroup->nr_populated so that for the cgroup itself it counts the number of css_sets which have tasks associated with them so that empty css_sets don't skew the populated test. Signed-off-by: NTejun Heo <tj@kernel.org> -
由 Tejun Heo 提交于
cgroup_task_migrate() no longer uses @old_cgrp. Remove it. Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 26 9月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
49d1dc4b ("cgroup: implement static_key based cgroup_subsys_enabled() and cgroup_subsys_on_dfl()") converted cgroup enabled test to use static_key; however, cgroup_disable() is called before static_key subsystem itself is initialized and thus leads to the following warning when "cgroup_disable=" parameter is specified. WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:99 static_key_slow_dec+0x44/0x60() static_key_slow_dec used before call to jump_label_init ... Call Trace: [<ffffffff813b18c2>] dump_stack+0x44/0x62 [<ffffffff8108dd52>] warn_slowpath_common+0x82/0xc0 [<ffffffff8108ddec>] warn_slowpath_fmt+0x5c/0x80 [<ffffffff8119c054>] static_key_slow_dec+0x44/0x60 [<ffffffff81d826b6>] cgroup_disable+0xaf/0xd6 [<ffffffff81d5f9de>] unknown_bootoption+0x8c/0x194 [<ffffffff810b0c03>] parse_args+0x273/0x4a0 [<ffffffff81d5fd67>] start_kernel+0x205/0x4b8 ... Fix it by making cgroup_disable() to record the subsystems to disable in cgroup_disable_mask and moving the actual application to cgroup_init() which is late enough and where the enabled state is first used. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NAndrey Wagin <avagin@gmail.com> Link: http://lkml.kernel.org/g/CANaxB-yFuS4SA2znSvcKrO9L_CbHciHYW+o9bN8sZJ8eR9FxYA@mail.gmail.com Fixes: 49d1dc4b
-
- 23 9月, 2015 5 次提交
-
-
由 Tejun Heo 提交于
cgroup_update_dfl_csses() is responsible for migrating processes when controllers are enabled or disabled on the default hierarchy. As the css association changes for all the processes in the affected cgroups, this involves migrating multiple processes. Up until now, it was implemented by migrating process-by-process until the source css_sets are empty; however, this means that if a process fails to migrate after some succeed before it, the recovery is very tricky. This was considered okay as subsystems weren't allowed to reject process migration on the default hierarchy; unfortunately, enforcing this policy turned out to be problematic for certain types of resources - realtime slices for now. As such, the default hierarchy is gonna allow restricted failures during migration and to support that this patch makes cgroup_update_dfl_csses() migrate all target processes atomically rather than one-by-one. The preceding patches made subsystems ready for multi-process migration and factored out taskset operations making this almost trivial. All tasks of the target processes are put in the same taskset and the migration operations are performed once which either fails or succeeds for all. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-
由 Tejun Heo 提交于
Currently, cgroup_migreate() implements large part of the migration logic inline including building the target taskset and actually migrating them. This patch separates out the following taskset operations. CGROUP_TASKSET_INIT() : taskset initializer cgroup_taskset_add() : add a task to a taskset cgroup_taskset_migrate() : migrate a taskset to the destination cgroup This will be used to implement atomic multi-process migration in cgroup_update_dfl_csses(). This is pure reorganization which doesn't introduce any functional changes. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-
由 Tejun Heo 提交于
cgroup_migrate() has the destination cgroup as the first parameter while cgroup_task_migrate() has the destination cset as the last. Another migration function is scheduled to be added which can make the discrepancy further stand out. Let's reorder cgroup_migrate()'s parameters so that the destination cgroup is the last. This doesn't cause any functional difference. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-
由 Tejun Heo 提交于
It wasn't explicitly documented but, when a process is being migrated, cpuset and memcg depend on cgroup_taskset_first() returning the threadgroup leader; however, this approach is somewhat ghetto and would no longer work for the planned multi-process migration. This patch introduces explicit cgroup_taskset_for_each_leader() which iterates over only the threadgroup leaders and replaces cgroup_taskset_first() usages for accessing the leader with it. This prepares both memcg and cpuset for multi-process migration. This patch also updates the documentation for cgroup_taskset_for_each() to clarify the iteration rules and removes comments mentioning task ordering in tasksets. v2: A previous patch which added threadgroup leader test was dropped. Patch updated accordingly. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
If memory_migrate flag is set, cpuset migrates memory according to the destnation css's nodemask. The current implementation migrates memory whenever any thread of a process is migrated making the behavior somewhat arbitrary. Let's tie memory operations to the threadgroup leader so that memory is migrated only when the leader is migrated. While this is a behavior change, given the inherent fuziness, this change is not too likely to be noticed and allows us to clearly define who owns the memory (always the leader) and helps the planned atomic multi-process migration. Note that we're currently migrating memory in migration path proper while holding all the locks. In the long term, this should be moved out to an async work item. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-
- 19 9月, 2015 7 次提交
-
-
由 Tejun Heo 提交于
cgroup core handles creations and removals of cgroup interface files as described by cftypes. There are cases where the handle for a given file instance is necessary, for example, to generate a file modified event. Currently, this is handled by explicitly matching the callback method pointer and storing the file handle manually in cgroup_add_file(). While this simple approach works for cgroup core files, it can't for controller interface files. This patch generalizes cgroup interface file handle handling. struct cgroup_file is defined and each cftype can optionally tell cgroup core to store the file handle by setting ->file_offset. A file handle remains accessible as long as the containing css is accessible. Both "cgroup.procs" and "cgroup.events" are converted to use the new generic mechanism instead of hooking directly into cgroup_add_file(). Also, cgroup_file_notify() which takes a struct cgroup_file and generates a file modified event on it is added and replaces explicit kernfs_notify() invocations. This generalizes cgroup file handle handling and allows controllers to generate file modified notifications. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
The file creation / removal path has always been a bit icky and the planned notification update requires css during file creation. Restructure as follows. * cgroup_addrm_files() now takes both @css and @cgrp and is only called directly by other file handling functions. * cgroup_populate/clear_dir() are replaced with css_populate/clear_dir() taking @css and @cgrp_override. @cgrp_override is used only when files needs to be created on / removed from a cgroup which isn't attached to @css which happens during subsystem rebinds. Subsystem loops are moved to the callers. * cgroup_add_file() now takes both @css and @cgrp. @css isn't used yet but will be used by the planned notification update. This patch doens't cause any behavior changes. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
* Use local variables @scgrp and @dcgrp for @src_root->cgrp and @dst_root->cgrp respectively. * Use initializers to set @src_root and @css in the inner bind loop. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
After a file creation failure, cgroup_addrm_files() it didn't remove the files which had already been created. When cgroup_populate_dir() is the caller, this is fine as the caller performs cleanup; however, for other callers, this may leave unactivated dangling files behind. As kernfs directory removals are recursive, this doesn't lead to permanent memory leak but it can, for example, fail future attempts to create those files again. There's no point in keeping around this sort of subtlety and it gets in the way of planned updates to file handling. This patch makes cgroup_addrm_files() clean up after itself on failures. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
Move it upwards so that it's right below cgroup_clear_dir() and the forward declaration is unnecessary. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
cftype->mode allows controllers to give arbitrary permissions to interface knobs. Except for "cgroup.event_control", the existing uses are spurious. * Some explicitly specify S_IRUGO | S_IWUSR even though that's the default. * "cpuset.memory_pressure" specifies S_IRUGO while also setting a write callback which returns -EACCES. All it needs to do is simply not setting a write callback. "cgroup.event_control" uses cftype->mode to make the file world-writable. It's a misdesigned interface and we don't want controllers to be tweaking interface file permissions in general. This patch removes cftype->mode and all its spurious uses and implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is marked as compatibility-only. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> -
由 Tejun Heo 提交于
memcg already uses "memory.events" for event reporting and other controllers may need event reporting too. Let's standardize on "$SUBSYS.events" interface file for reporting events which don't happen too frequently and thus can share event notification. "cgroup.populated" is replaced with "populated" field in "cgroup.events" and documentation is updated accordingly. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
-
- 18 9月, 2015 3 次提交
-
-
由 Tejun Heo 提交于
cgroup_on_dfl() tests whether the cgroup's root is the default hierarchy; however, an individual controller is only interested in whether the controller is attached to the default hierarchy and never tests a cgroup which doesn't belong to the hierarchy that the controller is attached to. This patch replaces cgroup_on_dfl() tests in controllers with faster static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as the only user of cgroup_on_dfl() and the function is moved from the header file to cgroup.c. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>
-
由 Tejun Heo 提交于
Replace cgroup_subsys->disabled tests in controllers with cgroup_subsys_enabled(). cgroup_subsys_enabled() requires literal subsys name as its parameter and thus can't be used for cgroup core which iterates through controllers. For cgroup core, introduce and use cgroup_ssid_enabled() which uses slower static_key_enabled() test and can be indexed by subsys ID. This leaves cgroup_subsys->disabled unused. Removed. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>
-
由 Tejun Heo 提交于
Whether a subsys is enabled and attached to the default hierarchy seldom changes and may be tested in the hot paths. This patch implements static_key based cgroup_subsys_enabled() and cgroup_subsys_on_dfl() tests. The following patches will update the users and remove duplicate mechanisms. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-
- 17 9月, 2015 2 次提交
-
-
由 Tejun Heo 提交于
Note: This commit was originally committed as b5ba75b5 but got reverted by f9f9e7b7 due to the performance regression from the percpu_rwsem write down/up operations added to cgroup task migration path. percpu_rwsem changes which alleviate the performance issue are pending for v4.4-rc1 merge window. Re-apply. Now that threadgroup locking is made global, code paths around it can be simplified. * lock-verify-unlock-retry dancing removed from __cgroup_procs_write(). * Race protection against de_thread() removed from cgroup_update_dfl_csses(). Signed-off-by: NTejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
-
由 Tejun Heo 提交于
Note: This commit was originally committed as d59cfc09 but got reverted by 0c986253 due to the performance regression from the percpu_rwsem write down/up operations added to cgroup task migration path. percpu_rwsem changes which alleviate the performance issue are pending for v4.4-rc1 merge window. Re-apply. The cgroup side of threadgroup locking uses signal_struct->group_rwsem to synchronize against threadgroup changes. This per-process rwsem adds small overhead to thread creation, exit and exec paths, forces cgroup code paths to do lock-verify-unlock-retry dance in a couple places and makes it impossible to atomically perform operations across multiple processes. This patch replaces signal_struct->group_rwsem with a global percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader side and contained in cgroups proper. This patch converts one-to-one. This does make writer side heavier and lower the granularity; however, cgroup process migration is a fairly cold path, we do want to optimize thread operations over it and cgroup migration operations don't take enough time for the lower granularity to matter. Signed-off-by: NTejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
- 16 9月, 2015 2 次提交
-
-
由 Tejun Heo 提交于
This reverts commit d59cfc09. d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify threadgroup locking") changed how cgroup synchronizes against task fork and exits so that it uses global percpu_rwsem instead of per-process rwsem; unfortunately, the write [un]lock paths of percpu_rwsem always involve synchronize_rcu_expedited() which turned out to be too expensive. Improvements for percpu_rwsem are scheduled to be merged in the coming v4.4-rc1 merge window which alleviates this issue. For now, revert the two commits to restore per-process rwsem. They will be re-applied for the v4.4-rc1 merge window. Signed-off-by: NTejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org # v4.2+
-
由 Tejun Heo 提交于
This reverts commit b5ba75b5. d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify threadgroup locking") changed how cgroup synchronizes against task fork and exits so that it uses global percpu_rwsem instead of per-process rwsem; unfortunately, the write [un]lock paths of percpu_rwsem always involve synchronize_rcu_expedited() which turned out to be too expensive. Improvements for percpu_rwsem are scheduled to be merged in the coming v4.4-rc1 merge window which alleviates this issue. For now, revert the two commits to restore per-process rwsem. They will be re-applied for the v4.4-rc1 merge window. Signed-off-by: NTejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org # v4.2+
-
- 12 9月, 2015 1 次提交
-
-
由 Mathieu Desnoyers 提交于
Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads running on the system. It is implemented by calling synchronize_sched(). It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side. The existing applications of which I am aware that would be improved by this system call are as follows: * Through Userspace RCU library (http://urcu.so) - DNS server (Knot DNS) https://www.knot-dns.cz/ - Network sniffer (http://netsniff-ng.org/) - Distributed object storage (https://sheepdog.github.io/sheepdog/) - User-space tracing (http://lttng.org) - Network storage system (https://www.gluster.org/) - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf) - Financial software (https://lkml.org/lkml/2015/3/23/189) Those projects use RCU in userspace to increase read-side speed and scalability compared to locking. Especially in the case of RCU used by libraries, sys_membarrier can speed up the read-side by moving the bulk of the memory barrier cost to synchronize_rcu(). * Direct users of sys_membarrier - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198) Microsoft core dotnet GC developers are planning to use the mprotect() side-effect of issuing memory barriers through IPIs as a way to implement Windows FlushProcessWriteBuffers() on Linux. They are referring to sys_membarrier in their github thread, specifically stating that sys_membarrier() is what they are looking for. To explain the benefit of this scheme, let's introduce two example threads: Thread A (non-frequent, e.g. executing liburcu synchronize_rcu()) Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock()) In a scheme where all smp_mb() in thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()". Before the change, we had, for each smp_mb() pairs: Thread A Thread B previous mem accesses previous mem accesses smp_mb() smp_mb() following mem accesses following mem accesses After the change, these pairs become: Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2). 1) Non-concurrent Thread A vs Thread B accesses: Thread A Thread B prev mem accesses sys_membarrier() follow mem accesses prev mem accesses barrier() follow mem accesses In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses. 2) Concurrent Thread A vs Thread B accesses Thread A Thread B prev mem accesses prev mem accesses sys_membarrier() barrier() follow mem accesses follow mem accesses In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by synchronize_sched(). * Benchmarks On Intel Xeon E5405 (8 cores) (one thread is calling sys_membarrier, the other 7 threads are busy looping) 1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call. * User-space user of this system call: Userspace RCU library Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process threads (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such barrier anyway, because these are implied by the scheduler context switches. Results in liburcu: Operations in 10s, 6 readers, 2 writers: memory barriers in reader: 1701557485 reads, 2202847 writes signal-based scheme: 9830061167 reads, 6700 writes sys_membarrier: 9952759104 reads, 425 writes sys_membarrier (dyn. check): 7970328887 reads, 425 writes The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, sys_membarrier slightly outperforms the signal-based scheme. However, this non-expedited sys_membarrier implementation has a much slower grace period than signal and memory barrier schemes. Besides diminishing the number of wake-ups, one major advantage of the membarrier system call over the signal-based scheme is that it does not need to reserve a signal. This plays much more nicely with libraries, and with processes injected into for tracing purposes, for which we cannot expect that signals will be unused by the application. An expedited version of this system call can be added later on to speed up the grace period. Its implementation will likely depend on reading the cpu_curr()->mm without holding each CPU's rq lock. This patch adds the system call to x86 and to asm-generic. [1] http://urcu.so membarrier(2) man page: MEMBARRIER(2) Linux Programmer's Manual MEMBARRIER(2) NAME membarrier - issue memory barriers on a set of threads SYNOPSIS #include <linux/membarrier.h> int membarrier(int cmd, int flags); DESCRIPTION The cmd argument is one of the following: MEMBARRIER_CMD_QUERY Query the set of supported commands. It returns a bitmask of supported commands. MEMBARRIER_CMD_SHARED Execute a memory barrier on all threads running on the system. Upon return from system call, the caller thread is ensured that all running threads have passed through a state where all memory accesses to user-space addresses match program order between entry to and return from the system call (non-running threads are de facto in such a state). This covers threads from all pro=E2=80=90 cesses running on the system. This command returns 0. The flags argument needs to be 0. For future extensions. All memory accesses performed in program order from each targeted thread is guaranteed to be ordered with respect to sys_membarrier(). If we use the semantic "barrier()" to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pair of barrier(), sys_membarrier() and smp_mb(): The pair ordering is detailed as (O: ordered, X: not ordered): barrier() smp_mb() sys_membarrier() barrier() X X O smp_mb() X O O sys_membarrier() O O O RETURN VALUE On success, these system calls return zero. On error, -1 is returned, and errno is set appropriately. For a given command, with flags argument set to 0, this system call is guaranteed to always return the same value until reboot. ERRORS ENOSYS System call is not implemented. EINVAL Invalid arguments. Linux 2015-04-15 MEMBARRIER(2) Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nicholas Miell <nmiell@comcast.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 9月, 2015 14 次提交
-
-
由 Ilya Dryomov 提交于
The following if (val < 0) *lvalp = (unsigned long)-val; is incorrect because the compiler is free to assume -val to be positive and use a sign-extend instruction for extending the bit pattern. This is a problem if val == INT_MIN: # echo -2147483648 >/proc/sys/dev/scsi/logging_level # cat /proc/sys/dev/scsi/logging_level -18446744071562067968 Cast to unsigned long before negation - that way we first sign-extend and then negate an unsigned, which is well defined. With this: # cat /proc/sys/dev/scsi/logging_level -2147483648 Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Cc: Mikulas Patocka <mikulas@twibright.com> Cc: Robert Xiao <nneonneo@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> -
由 Baoquan He 提交于
In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and accordingly the MODULES_VADDR is changed to 0xffffffffa0000000. However, in v3.12 Kees Cook introduced kaslr to randomise the location of kernel. And the kernel text mapping addr space is enlarged from 512M to 1G. That means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr support is not compiled in and 1G when kaslr support is compiled in. Accordingly the MODULES_VADDR is changed too to be: #define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE) So when kaslr is compiled in and enabled, the kernel text mapping addr space and modules vaddr space need be adjusted. Otherwise makedumpfile will collapse since the addr for some symbols is not correct. Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in makedumpfile to help calculate MODULES_VADDR. Signed-off-by: NBaoquan He <bhe@redhat.com> Acked-by: NKees Cook <keescook@chromium.org> Acked-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> -
由 Baoquan He 提交于
People reported that crash_notes in /proc/vmcore were corrupted and this cause crash kdump failure. With code debugging and log we got the root cause. This is because percpu variable crash_notes are allocated in 2 vmalloc pages. Currently percpu is based on vmalloc by default. Vmalloc can't guarantee 2 continuous vmalloc pages are also on 2 continuous physical pages. So when 1st kernel exports the starting address and size of crash_notes through sysfs like below: /sys/devices/system/cpu/cpux/crash_notes /sys/devices/system/cpu/cpux/crash_notes_size kdump kernel use them to get the content of crash_notes. However the 2nd part may not be in the next neighbouring physical page as we expected if crash_notes are allocated accross 2 vmalloc pages. That's why nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in update_note_header_size_elf64() and cause note header merging failure or some warnings. In this patch change to call __alloc_percpu() to passed in the align value by rounding crash_notes_size up to the nearest power of two. This makes sure the crash_notes is allocated inside one physical page since sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE. Meanwhile add a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since crash_notes definitely will be in 2 pages. That need be avoided, and need be reported if it's unavoidable. [akpm@linux-foundation.org: use correct comment layout] Signed-off-by: NBaoquan He <bhe@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minfei Huang 提交于
Transforming PFN(Page Frame Number) to struct page is never failure, so we can simplify the code logic to do the image->control_page assignment directly in the loop, and remove the unnecessary conditional judgement. Signed-off-by: NMinfei Huang <mnfhuang@gmail.com> Acked-by: NDave Young <dyoung@redhat.com> Acked-by: NVivek Goyal <vgoyal@redhat.com> Cc: Simon Horman <horms@verge.net.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Young 提交于
There are two kexec load syscalls, kexec_load another and kexec_file_load. kexec_file_load has been splited as kernel/kexec_file.c. In this patch I split kexec_load syscall code to kernel/kexec.c. And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and use kexec_file_load only, or vice verse. The original requirement is from Ted Ts'o, he want kexec kernel signature being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use kexec_load syscall can bypass the checking. Vivek Goyal proposed to create a common kconfig option so user can compile in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects KEXEC_CORE so that old config files still work. Because there's general code need CONFIG_KEXEC_CORE, so I updated all the architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects KEXEC_CORE in arch Kconfig. Also updated general kernel code with to kexec_load syscall. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NDave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Young 提交于
Split kexec_file syscall related code to another file kernel/kexec_file.c so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped. Sharing variables and functions are moved to kernel/kexec_internal.h per suggestion from Vivek and Petr. [akpm@linux-foundation.org: fix bisectability] [akpm@linux-foundation.org: declare the various arch_kexec functions] [akpm@linux-foundation.org: fix build] Signed-off-by: NDave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
The UMH_WAIT_PROC handler runs in its own thread in order to make sure that waiting for the exec kernel thread completion won't block other usermodehelper queued jobs. On older workqueue implementations, worklets couldn't sleep without blocking the rest of the queue. But now the workqueue subsystem handles that. Khelper still had the older limitation due to its singlethread properties but we replaced it to system unbound workqueues. Those are affine to the current node and can block up to some number of instances. They are a good candidate to handle UMH_WAIT_PROC assuming that we have enough system unbound workers to handle lots of parallel usermodehelper jobs. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
We need to launch the usermodehelper kernel threads with the widest affinity and this is partly why we use khelper. This workqueue has unbound properties and thus a wide affinity inherited by all its children. Now khelper also has special properties that we aren't much interested in: ordered and singlethread. There is really no need about ordering as all we do is creating kernel threads. This can be done concurrently. And singlethread is a useless limitation as well. The workqueue engine already proposes generic unbound workqueues that don't share these useless properties and handle well parallel jobs. The only worrysome specific is their affinity to the node of the current CPU. It's fine for creating the usermodehelper kernel threads but those inherit this affinity for longer jobs such as requesting modules. This patch proposes to use these node affine unbound workqueues assuming that a node is sufficient to handle several parallel usermodehelper requests. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
There seem to be quite some confusions on the comments, likely due to changes that came after them. Now since it's very non obvious why we have 3 levels of asynchronous code to implement usermodehelpers, it's important to comment in detail the reason of this layout. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
Khelper is affine to all CPUs. Now since it creates the call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide affinity. As such explicitly forcing a wide affinity from those kernel threads is like a no-op. Just remove it. It's needless and it breaks CPU isolation users who rely on workqueue affinity tuning. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Frederic Weisbecker 提交于
This patchset does a bunch of cleanups and converts khelper to use system unbound workqueues. The 3 first patches should be uncontroversial. The last 2 patches are debatable. Kmod creates kernel threads that perform userspace jobs and we want those to have a large affinity in order not to contend busy CPUs. This is (partly) why we use khelper which has a wide affinity that the kernel threads it create can inherit from. Now khelper is a dedicated workqueue that has singlethread properties which we aren't interested in. Hence those two debatable changes: _ We would like to use generic workqueues. System unbound workqueues are a very good candidate but they are not wide affine, only node affine. Now probably a node is enough to perform many parallel kmod jobs. _ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC handler) to use the workqueue. It means that if the workqueue blocks, and no other worker can take pending kmod request, we can be screwed. Now if we have 512 threads, this should be enough. This patch (of 5): Underscores on function names aren't much verbose to explain the purpose of a function. And kmod has interesting such flavours. Lets rename the following functions: * __call_usermodehelper -> call_usermodehelper_exec_work * ____call_usermodehelper -> call_usermodehelper_exec_async * wait_for_helper -> call_usermodehelper_exec_sync Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 NeilBrown 提交于
If request_module() successfully runs modprobe, but modprobe exits with a non-zero status, then the return value from request_module() will be that (positive) error status. So the return from request_module can be: negative errno zero for success positive exit code. Signed-off-by: NNeilBrown <neilb@suse.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joe Perches 提交于
Commit e0e81739 ("CRED: Add some configurable debugging [try #6]") added the kdebug mechanism to this file back in 2009. The kdebug macro calls no_printk which always evaluates arguments. Most of the kdebug uses have an unnecessary call of atomic_read(&cred->usage) Make the kdebug macro do nothing by defining it with do { if (0) no_printk(...); } while (0) when not enabled. $ size kernel/cred.o* (defconfig x86-64) text data bss dec hex filename 2748 336 8 3092 c14 kernel/cred.o.new 2788 336 8 3132 c3c kernel/cred.o.old Miscellanea: o Neaten the #define kdebug macros while there Signed-off-by: NJoe Perches <joe@perches.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yongjun 提交于
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: NSteven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 9月, 2015 2 次提交
-
-
由 Alexei Starovoitov 提交于
when the verifier log is enabled the print_bpf_insn() is doing bpf_alu_string[BPF_OP(insn->code) >> 4] and bpf_jmp_string[BPF_OP(insn->code) >> 4] where BPF_OP is a 4-bit instruction opcode. Malformed insns can cause out of bounds access. Fix it by sizing arrays appropriately. The bug was found by clang address sanitizer with libfuzzer. Reported-by: NYonghong Song <yhs@plumgrid.com> Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
We may already have gotten a proper fd struct through fdget(), so whenever we return at the end of an map operation, we need to call fdput(). However, each map operation from syscall side first probes CHECK_ATTR() to verify that unused fields in the bpf_attr union are zero. In case of malformed input, we return with error, but the lookup to the map_fd was already performed at that time, so that we return without an corresponding fdput(). Fix it by performing an fdget() only right before bpf_map_get(). The fdget() invocation on maps in the verifier is not affected. Fixes: db20fd2b ("bpf: add lookup/update/delete/iterate methods to BPF maps") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 9月, 2015 1 次提交
-
-
由 Vlastimil Babka 提交于
alloc_pages_exact_node() was introduced in commit 6484eb3e ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047a ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb4 ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NRobin Holt <robinmholt@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NMichael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-