- 18 2月, 2015 1 次提交
-
-
由 Viresh Kumar 提交于
It is not possible for the clockevents core to know which modes (other than those with a corresponding feature flag) are supported by a particular implementation. And drivers are expected to handle transition to all modes elegantly, as ->set_mode() would be issued for them unconditionally. Now, adding support for a new mode complicates things a bit if we want to use the legacy ->set_mode() callback. We need to closely review all clockevents drivers to see if they would break on addition of a new mode. And after such reviews, it is found that we have to do non-trivial changes to most of the drivers [1]. Introduce mode-specific set_mode_*() callbacks, some of which the drivers may or may not implement. A missing callback would clearly convey the message that the corresponding mode isn't supported. A driver may still choose to keep supporting the legacy ->set_mode() callback, but ->set_mode() wouldn't be supporting any new modes beyond RESUME. If a driver wants to benefit from using a new mode, it would be required to migrate to the mode specific callbacks. The legacy ->set_mode() callback and the newly introduced mode-specific callbacks are mutually exclusive. Only one of them should be supported by the driver. Sanity check is done at the time of registration to distinguish between optional and required callbacks and to make error recovery and handling simpler. If the legacy ->set_mode() callback is provided, all mode specific ones would be ignored by the core but a warning is thrown if they are present. Call sites calling ->set_mode() directly are also updated to use __clockevents_set_mode() instead, as ->set_mode() may not be available anymore for few drivers. [1] https://lkml.org/lkml/2014/12/9/605 [2] https://lkml.org/lkml/2015/1/23/255 Suggested-by: Thomas Gleixner <tglx@linutronix.de> [2] Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: linaro-kernel@lists.linaro.org Cc: linaro-networking@linaro.org Link: http://lkml.kernel.org/r/792d59a40423f0acffc9bb0bec9de1341a06fa02.1423788565.git.viresh.kumar@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 1月, 2015 4 次提交
-
-
由 kbuild test robot 提交于
kernel/time/hrtimer.c:444:9: sparse: symbol '__hrtimer_get_next_event' was not declared. Should it be static? Fixes: 9bc74919 hrtimer: Prevent stale expiry time in hrtimer_interrupt() Signed-off-by: NFengguang Wu <fengguang.wu@intel.com> Cc: kbuild-all@01.org Link: http://lkml.kernel.org/r/20150123121206.GA4766@snbSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Xunlei Pang 提交于
rtc_set_ntp_time() uses timespec which is y2038-unsafe, so modify to use timespec64 which is y2038-safe, then replace rtc_time_to_tm() with rtc_time64_to_tm(). Also adjust all its call sites(only NTP uses it) accordingly. Cc: pang.xunlei <pang.xunlei@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NXunlei Pang <pang.xunlei@linaro.org> Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
-
由 John Stultz 提交于
Adds a timespec64 based getboottime64() implementation that can be used as we convert internal users of getboottime away from using timespecs. Cc: pang.xunlei <pang.xunlei@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
-
由 Nicolas Pitre 提交于
At least on ARM, do_div() is optimized to turn constant divisors into an inline multiplication by the reciprocal value at compile time. However this optimization is missed entirely whenever ktime_divns() is used and the slow out-of-line division code is used all the time. Let ktime_divns() use do_div() inline whenever the divisor is constant and small enough. This will make things like ktime_to_us() and ktime_to_ms() much faster. Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nicolas Pitre <nico@linaro.org> Acked-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NNicolas Pitre <nico@linaro.org> Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
-
- 23 1月, 2015 1 次提交
-
-
由 Thomas Gleixner 提交于
hrtimer_interrupt() has the following subtle issue: hrtimer_interrupt() lock(cpu_base); expires_next = KTIME_MAX; expire_timers(CLOCK_MONOTONIC); expires = get_next_timer(CLOCK_MONOTONIC); if (expires < expires_next) expires_next = expires; expire_timers(CLOCK_REALTIME); unlock(cpu_base); wakeup() hrtimer_start(CLOCK_MONOTONIC, newtimer); lock(cpu_base(); expires = get_next_timer(CLOCK_REALTIME); if (expires < expires_next) expires_next = expires; So because we already evaluated the next expiring timer of CLOCK_MONOTONIC we ignore that the expiry time of newtimer might be earlier than the overall next expiry time in hrtimer_interrupt(). To solve this, remove the caching of the next expiry value from hrtimer_interrupt() and reevaluate all active clock bases for the next expiry value. To avoid another code duplication, create a shared evaluation function and use it for hrtimer_get_next_event(), hrtimer_force_reprogram() and hrtimer_interrupt(). There is another subtlety in this mechanism: While hrtimer_interrupt() is running, we want to avoid to touch the hardware device because we will reprogram it anyway at the end of hrtimer_interrupt(). This works nicely for hrtimers which get rearmed via the HRTIMER_RESTART mechanism, because we drop out when the callback on that CPU is running. But that fails, if a new timer gets enqueued like in the example above. This has another implication: While hrtimer_interrupt() is running we refuse remote enqueueing of timers - see hrtimer_interrupt() and hrtimer_check_target(). hrtimer_interrupt() tries to prevent this by setting cpu_base->expires to KTIME_MAX, but that fails if a new timer gets queued. Prevent both the hardware access and the remote enqueue explicitely. We can loosen the restriction on the remote enqueue now due to reevaluation of the next expiry value, but that needs a seperate patch. Folded in a fix from Vignesh Radhakrishnan. Reported-and-tested-by: NStanislav Fomichev <stfomichev@yandex-team.ru> Based-on-patch-by: NStanislav Fomichev <stfomichev@yandex-team.ru> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: vigneshr@codeaurora.org Cc: john.stultz@linaro.org Cc: viresh.kumar@linaro.org Cc: fweisbec@gmail.com Cc: cl@linux.com Cc: stuart.w.hayes@gmail.com Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1501202049190.5526@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 17 1月, 2015 1 次提交
-
-
由 Louis Langholtz 提交于
Avoid overflow possibility. [ The overflow is purely theoretical, since this is used for memory ranges that aren't even close to using the full 64 bits, but this is the right thing to do regardless. - Linus ] Signed-off-by: NLouis Langholtz <lou_langholtz@me.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Peter Anvin <hpa@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 1月, 2015 4 次提交
-
-
由 Steven Rostedt (Red Hat) 提交于
Commit 5f893b26 "tracing: Move enabling tracepoints to just after rcu_init()" broke the enabling of system call events from the command line. The reason was that the enabling of command line trace events was moved before PID 1 started, and the syscall tracepoints require that all tasks have the TIF_SYSCALL_TRACEPOINT flag set. But the swapper task (pid 0) is not part of that. Since the swapper task is the only task that is running at this early in boot, no task gets the flag set, and the tracepoint never gets reached. Instead of setting the swapper task flag (there should be no reason to do that), re-enabled trace events again after the init thread (PID 1) has been started. It requires disabling all command line events and re-enabling them, as just enabling them again will not reset the logic to set the TIF_SYSCALL_TRACEPOINT flag, as the syscall tracepoint will be fooled into thinking that it was already set, and wont try setting it again. For this reason, we must first disable it and re-enable it. Link: http://lkml.kernel.org/r/1421188517-18312-1-git-send-email-mpe@ellerman.id.au Link: http://lkml.kernel.org/r/20150115040506.216066449@goodmis.orgReported-by: NMichael Ellerman <mpe@ellerman.id.au> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Steven Rostedt (Red Hat) 提交于
trace_init() calls init_ftrace_syscalls() and then calls trace_event_init() which also calls init_ftrace_syscalls(). It makes more sense to only call it from trace_event_init(). Calling it twice wastes memory, as it allocates the syscall events twice, and loses the first copy of it. Link: http://lkml.kernel.org/r/54AF53BD.5070303@huawei.com Link: http://lkml.kernel.org/r/20150115040505.930398632@goodmis.orgReported-by: NWang Nan <wangnan0@huawei.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Steven Rostedt (Red Hat) 提交于
Using just the filter for checking for trampolines or regs is not enough when updating the code against the records that represent all functions. Both the filter hash and the notrace hash need to be checked. To trigger this bug (using trace-cmd and perf): # perf probe -a do_fork # trace-cmd start -B foo -e probe # trace-cmd record -p function_graph -n do_fork sleep 1 The trace-cmd record at the end clears the filter before it disables function_graph tracing and then that causes the accounting of the ftrace function records to become incorrect and causes ftrace to bug. Link: http://lkml.kernel.org/r/20150114154329.358378039@goodmis.org Cc: stable@vger.kernel.org [ still need to switch old_hash_ops to old_ops_hash ] Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Steven Rostedt (Red Hat) 提交于
As the set_ftrace_filter affects both the function tracer as well as the function graph tracer, the ops that represent each have a shared ftrace_ops_hash structure. This allows both to be updated when the filter files are updated. But if function graph is enabled and the global_ops (function tracing) ops is not, then it is possible that the filter could be changed without the update happening for the function graph ops. This will cause the changes to not take place and may even cause a ftrace_bug to occur as it could mess with the trampoline accounting. The solution is to check if the ops uses the shared global_ops filter and if the ops itself is not enabled, to check if there's another ops that is enabled and also shares the global_ops filter. In that case, the modification still needs to be executed. Link: http://lkml.kernel.org/r/20150114154329.055980438@goodmis.org Cc: stable@vger.kernel.org # 3.17+ Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
- 09 1月, 2015 7 次提交
-
-
由 Chris Wilson 提交于
Currently if DEBUG_MUTEXES is enabled, the mutex->owner field is only cleared iff debug_locks is active. This exposes a race to other users of the field where the mutex->owner may be still set to a stale value, potentially upsetting mutex_spin_on_owner() among others. References: https://bugs.freedesktop.org/show_bug.cgi?id=87955Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NDavidlohr Bueso <dave@stgolabs.net> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1420540175-30204-1-git-send-email-chris@chris-wilson.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Tetsuo Handa 提交于
When alloc_fair_sched_group() in sched_create_group() fails, free_sched_group() is called, and free_fair_sched_group() is called by free_sched_group(). Since destroy_cfs_bandwidth() is called by free_fair_sched_group() without calling init_cfs_bandwidth(), RCU stall occurs at hrtimer_cancel(): INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0) Task dump for CPU 1: (fprintd) R running task 0 6249 1 0x00000088 ... Call Trace: <IRQ> [<ffffffff81094988>] sched_show_task+0xa8/0x110 [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50 [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0 [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700 [<ffffffff810cbf2b>] update_process_times+0x4b/0x80 [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50 [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70 [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0 [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50 [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230 [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70 [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60 [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30 [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0 [<ffffffff8108df46>] free_sched_group+0x16/0x40 [<ffffffff810971bb>] sched_create_group+0x4b/0x80 [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0 [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110 [<ffffffff81647729>] system_call_fastpath+0x12/0x17 Check whether init_cfs_bandwidth() was called before calling destroy_cfs_bandwidth(). Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [ Move the check into destroy_cfs_bandwidth() to aid compilability. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
The dl_runtime_exceeded() function is supposed to ckeck if a SCHED_DEADLINE task must be throttled, by checking if its current runtime is <= 0. However, it also checks if the scheduling deadline has been missed (the current time is larger than the current scheduling deadline), further decreasing the runtime if this happens. This "double accounting" is wrong: - In case of partitioned scheduling (or single CPU), this happens if task_tick_dl() has been called later than expected (due to small HZ values). In this case, the current runtime is also negative, and replenish_dl_entity() can take care of the deadline miss by recharging the current runtime to a value smaller than dl_runtime - In case of global scheduling on multiple CPUs, scheduling deadlines can be missed even if the task did not consume more runtime than expected, hence penalizing the task is wrong This patch fix this problem by throttling a SCHED_DEADLINE task only when its runtime becomes negative, and not modifying the runtime Signed-off-by: NLuca Abeni <luca.abeni@unitn.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJuri Lelli <juri.lelli@gmail.com> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
According to global EDF, tasks should be migrated between runqueues without checking if their scheduling deadlines and runtimes are valid. However, SCHED_DEADLINE currently performs such a check: a migration happens doing: deactivate_task(rq, next_task, 0); set_task_cpu(next_task, later_rq->cpu); activate_task(later_rq, next_task, 0); which ends up calling dequeue_task_dl(), setting the new CPU, and then calling enqueue_task_dl(). enqueue_task_dl() then calls enqueue_dl_entity(), which calls update_dl_entity(), which can modify scheduling deadline and runtime, breaking global EDF scheduling. As a result, some of the properties of global EDF are not respected: for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on two cores can have unbounded response times for the third task even if 30/80+40/80+120/170 = 1.5809 < 2 This can be fixed by invoking update_dl_entity() only in case of wakeup, or if this is a new SCHED_DEADLINE task. Signed-off-by: NLuca Abeni <luca.abeni@unitn.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJuri Lelli <juri.lelli@gmail.com> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yuyang Du 提交于
In effective_load, we have (long w * unsigned long tg->shares) / long W, when w is negative, it is cast to unsigned long and hence the product is insanely large. Fix this by casting tg->shares to long. Reported-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NYuyang Du <yuyang.du@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Jones <davej@redhat.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
On x86_64, at least, task_pt_regs may be only partially initialized in many contexts, so x86_64 should not use it without extra care from interrupt context, let alone NMI context. This will allow x86_64 to override the logic and will supply some scratch space to use to make a cleaner copy of user regs. Tested-by: NJiri Olsa <jolsa@kernel.org> Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: chenggang.qcg@taobao.com Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jean Pihet <jean.pihet@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/e431cd4c18c2e1c44c774f10758527fb2d1025c4.1420396372.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Oleg Nesterov 提交于
wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE and both checks can fail if we race with EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE change in between, gcc needs to reload p->exit_state after security_task_wait(). In this case ->notask_error will be wrongly cleared and do_wait() can hang forever if it was the last eligible child. Many thanks to Arne who carefully investigated the problem. Note: this bug is very old but it was pure theoretical until commit b3ab0316 ("wait: completely ignore the EXIT_DEAD tasks"). Before this commit "-O2" was probably enough to guarantee that compiler won't read ->exit_state twice. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: NArne Goedeke <el@laramies.com> Tested-by: NArne Goedeke <el@laramies.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 12月, 2014 1 次提交
-
-
由 Paul Moore 提交于
Unfortunately, while commit 4a928436 ("audit: correctly record file names with different path name types") fixed a problem where we were not recording filenames, it created a new problem by attempting to use these file names after they had been freed. This patch resolves the issue by creating a copy of the filename which the audit subsystem frees after it is done with the string. At some point it would be nice to resolve this issue with refcounts, or something similar, instead of having to allocate/copy strings, but that is almost surely beyond the scope of a -rcX patch so we'll defer that for later. On the plus side, only audit users should be impacted by the string copying. Reported-by: NToralf Foerster <toralf.foerster@gmx.de> Signed-off-by: NPaul Moore <pmoore@redhat.com>
-
- 27 12月, 2014 1 次提交
-
-
由 Johannes Berg 提交于
Netlink families can exist in multiple namespaces, and for the most part multicast subscriptions are per network namespace. Thus it only makes sense to have bind/unbind notifications per network namespace. To achieve this, pass the network namespace of a given client socket to the bind/unbind functions. Also do this in generic netlink, and there also make sure that any bind for multicast groups that only exist in init_net is rejected. This isn't really a problem if it is accepted since a client in a different namespace will never receive any notifications from such a group, but it can confuse the family if not rejected (it's also possible to silently (without telling the family) accept it, but it would also have to be ignored on unbind so families that take any kind of action on bind/unbind won't do unnecessary work for invalid clients like that. Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 12月, 2014 1 次提交
-
-
由 Richard Guy Briggs 提交于
A regression was caused by commit 780a7654: audit: Make testing for a valid loginuid explicit. (which in turn attempted to fix a regression caused by e1760bd5) When audit_krule_to_data() fills in the rules to get a listing, there was a missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID. This broke userspace by not returning the same information that was sent and expected. The rule: auditctl -a exit,never -F auid=-1 gives: auditctl -l LIST_RULES: exit,never f24=0 syscall=all when it should give: LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all Tag it so that it is reported the same way it was set. Create a new private flags audit_krule field (pflags) to store it that won't interact with the public one from the API. Cc: stable@vger.kernel.org # v3.10-rc1+ Signed-off-by: NRichard Guy Briggs <rgb@redhat.com> Signed-off-by: NPaul Moore <pmoore@redhat.com>
-
- 23 12月, 2014 2 次提交
-
-
由 Alex Thorlton 提交于
When allocating space for load_balance_mask, in sched_init, when CPUMASK_OFFSTACK is set, we've managed to spill over KMALLOC_MAX_SIZE on our 6144 core machine. The patch below breaks up the allocations so that they don't overflow the max alloc size. It also allocates the masks on the the node from which they'll most commonly be accessed, to minimize remote accesses on NUMA machines. Suggested-by: NGeorge Beshers <gbeshers@sgi.com> Signed-off-by: NAlex Thorlton <athorlton@sgi.com> Cc: George Beshers <gbeshers@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Paul Moore 提交于
There is a problem with the audit system when multiple audit records are created for the same path, each with a different path name type. The root cause of the problem is in __audit_inode() when an exact match (both the path name and path name type) is not found for a path name record; the existing code creates a new path name record, but it never sets the path name in this record, leaving it NULL. This patch corrects this problem by assigning the path name to these newly created records. There are many ways to reproduce this problem, but one of the easiest is the following (assuming auditd is running): # mkdir /root/tmp/test # touch /root/tmp/test/567 # auditctl -a always,exit -F dir=/root/tmp/test # touch /root/tmp/test/567 Afterwards, or while the commands above are running, check the audit log and pay special attention to the PATH records. A faulty kernel will display something like the following for the file creation: type=SYSCALL msg=audit(1416957442.025:93): arch=c000003e syscall=2 success=yes exit=3 ... comm="touch" exe="/usr/bin/touch" type=CWD msg=audit(1416957442.025:93): cwd="/root/tmp" type=PATH msg=audit(1416957442.025:93): item=0 name="test/" inode=401409 ... nametype=PARENT type=PATH msg=audit(1416957442.025:93): item=1 name=(null) inode=393804 ... nametype=NORMAL type=PATH msg=audit(1416957442.025:93): item=2 name=(null) inode=393804 ... nametype=NORMAL While a patched kernel will show the following: type=SYSCALL msg=audit(1416955786.566:89): arch=c000003e syscall=2 success=yes exit=3 ... comm="touch" exe="/usr/bin/touch" type=CWD msg=audit(1416955786.566:89): cwd="/root/tmp" type=PATH msg=audit(1416955786.566:89): item=0 name="test/" inode=401409 ... nametype=PARENT type=PATH msg=audit(1416955786.566:89): item=1 name="test/567" inode=393804 ... nametype=NORMAL This issue was brought up by a number of people, but special credit should go to hujianyang@huawei.com for reporting the problem along with an explanation of the problem and a patch. While the original patch did have some problems (see the archive link below), it did demonstrate the problem and helped kickstart the fix presented here. * https://lkml.org/lkml/2014/9/5/66Reported-by: Nhujianyang <hujianyang@huawei.com> Signed-off-by: NPaul Moore <pmoore@redhat.com> Acked-by: NRichard Guy Briggs <rgb@redhat.com>
-
- 20 12月, 2014 3 次提交
-
-
由 Richard Guy Briggs 提交于
Eric Paris explains: Since kauditd_send_multicast_skb() gets called in audit_log_end(), which can come from any context (aka even a sleeping context) GFP_KERNEL can't be used. Since the audit_buffer knows what context it should use, pass that down and use that. See: https://lkml.org/lkml/2014/12/16/542 BUG: sleeping function called from invalid context at mm/slab.c:2849 in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin 2 locks held by sulogin/885: #0: (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff91152e30>] prepare_bprm_creds+0x28/0x8b #1: (tty_files_lock){+.+.+.}, at: [<ffffffff9123e787>] selinux_bprm_committing_creds+0x55/0x22b CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30 Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014 ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375 ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006 0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38 Call Trace: [<ffffffff916ba529>] dump_stack+0x50/0xa8 [<ffffffff91063185>] ___might_sleep+0x1b6/0x1be [<ffffffff910632a6>] __might_sleep+0x119/0x128 [<ffffffff91140720>] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f [<ffffffff91141d81>] kmem_cache_alloc+0x43/0x1c9 [<ffffffff914e148d>] __alloc_skb+0x42/0x1a3 [<ffffffff914e2b62>] skb_copy+0x3e/0xa3 [<ffffffff910c263e>] audit_log_end+0x83/0x100 [<ffffffff9123b8d3>] ? avc_audit_pre_callback+0x103/0x103 [<ffffffff91252a73>] common_lsm_audit+0x441/0x450 [<ffffffff9123c163>] slow_avc_audit+0x63/0x67 [<ffffffff9123c42c>] avc_has_perm+0xca/0xe3 [<ffffffff9123dc2d>] inode_has_perm+0x5a/0x65 [<ffffffff9123e7ca>] selinux_bprm_committing_creds+0x98/0x22b [<ffffffff91239e64>] security_bprm_committing_creds+0xe/0x10 [<ffffffff911515e6>] install_exec_creds+0xe/0x79 [<ffffffff911974cf>] load_elf_binary+0xe36/0x10d7 [<ffffffff9115198e>] search_binary_handler+0x81/0x18c [<ffffffff91153376>] do_execveat_common.isra.31+0x4e3/0x7b7 [<ffffffff91153669>] do_execve+0x1f/0x21 [<ffffffff91153967>] SyS_execve+0x25/0x29 [<ffffffff916c61a9>] stub_execve+0x69/0xa0 Cc: stable@vger.kernel.org #v3.16-rc1 Reported-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: NRichard Guy Briggs <rgb@redhat.com> Tested-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: NPaul Moore <pmoore@redhat.com>
-
由 Paul Moore 提交于
Commit f1dc4867 ("audit: anchor all pid references in the initial pid namespace") introduced a find_vpid() call when adding/removing audit rules with PID/PPID filters; unfortunately this is problematic as find_vpid() only works if there is a task with the associated PID alive on the system. The following commands demonstrate a simple reproducer. # auditctl -D # auditctl -l # autrace /bin/true # auditctl -l This patch resolves the problem by simply using the PID provided by the user without any additional validation, e.g. no calls to check to see if the task/PID exists. Cc: stable@vger.kernel.org # 3.15 Cc: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: NPaul Moore <pmoore@redhat.com> Acked-by: NEric Paris <eparis@redhat.com> Reviewed-by: NRichard Guy Briggs <rgb@redhat.com>
-
由 Rafael J. Wysocki 提交于
Having switched over all of the users of CONFIG_PM_RUNTIME to use CONFIG_PM directly, turn the latter into a user-selectable option and drop the former entirely from the tree. Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: NUlf Hansson <ulf.hansson@linaro.org> Acked-by: NKevin Hilman <khilman@linaro.org>
-
- 19 12月, 2014 1 次提交
-
-
由 Thomas Gleixner 提交于
commit 4dbd2771 "tick: export nohz tick idle symbols for module use" was merged via the thermal tree without an explicit ack from the relevant maintainers. The exports are abused by the intel powerclamp driver which implements a fake idle state from a sched FIFO task. This causes all kinds of wreckage in the NOHZ core code which rightfully assumes that tick_nohz_idle_enter/exit() are only called from the idle task itself. Recent changes in the NOHZ core lead to a failure of the powerclamp driver and now people try to hack completely broken and backwards workarounds into the NOHZ core code. This is completely unacceptable and just papers over the real problem. There are way more subtle issues lurking around the corner. The real solution is to fix the powerclamp driver by rewriting it with a sane concept, but that's beyond the scope of this. So the only solution for now is to remove the calls into the core NOHZ code from the powerclamp trainwreck along with the exports. Fixes: d6d71ee4 "PM: Introduce Intel PowerClamp Driver" Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Pan Jacob jun <jacob.jun.pan@intel.com> Cc: LKP <lkp@01.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Zhang Rui <rui.zhang@intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 18 12月, 2014 1 次提交
-
-
由 Kees Cook 提交于
When a module_param is defined without DAC write permissions, it can still be changed at runtime and updated. Drivers using a 0444 permission may be surprised that these values can still be changed. For drivers that want to allow updates, any S_IW* flag will set the "store" function as before. Drivers without S_IW* flags will have the "store" function unset, unforcing a read-only value. Drivers that wish neither "store" nor "get" can continue to use "0" for perms to stay out of sysfs entirely. Old behavior: # cd /sys/module/snd/parameters # ls -l total 0 -r--r--r-- 1 root root 4096 Dec 11 13:55 cards_limit -r--r--r-- 1 root root 4096 Dec 11 13:55 major -r--r--r-- 1 root root 4096 Dec 11 13:55 slots # cat major 116 # echo -1 > major -bash: major: Permission denied # chmod u+w major # echo -1 > major # cat major -1 New behavior: ... # chmod u+w major # echo -1 > major -bash: echo: write error: Input/output error Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
-
- 15 12月, 2014 2 次提交
-
-
由 Steven Rostedt (Red Hat) 提交于
Add the kernel command line tp_printk option that will have tracepoints that are active sent to printk() as well as to the trace buffer. Passing "tp_printk" will activate this. To turn it off, the sysctl /proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note, this only works if the cmdline option is used. Echoing 1 into the sysctl file without the cmdline option will have no affect. Note, this is a dangerous option. Having high frequency tracepoints send their data to printk() can possibly cause a live lock. This is another reason why this is only active if the command line option is used. Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanosSuggested-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Steven Rostedt (Red Hat) 提交于
Enabling tracepoints at boot up can be very useful. The tracepoint can be initialized right after RCU has been. There's no need to wait for the early_initcall() to be called. That's too late for some things that can use tracepoints for debugging. Move the logic to enable tracepoints out of the initcalls and into init/main.c to right after rcu_init(). This also allows trace_printk() to be used early too. Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos Link: http://lkml.kernel.org/r/20141214164104.307127356@goodmis.orgReviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Suggested-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
- 14 12月, 2014 8 次提交
-
-
由 Jan Kara 提交于
There's a lot of common code in inode and mount marks handling. Factor it out to a common helper function. Signed-off-by: NJan Kara <jack@suse.cz> Cc: Eric Paris <eparis@redhat.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Riku Voipio 提交于
Following the suggestions from Andrew Morton and Stephen Rothwell, Dont expand the ARCH list in kernel/gcov/Kconfig. Instead, define a ARCH_HAS_GCOV_PROFILE_ALL bool which architectures can enable. set ARCH_HAS_GCOV_PROFILE_ALL on Architectures where it was previously allowed + ARM64 which I tested. Signed-off-by: NRiku Voipio <riku.voipio@linaro.org> Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Masanari Iida 提交于
Remove unnecessary KERN_ERR from pr_err() within kexec.c. Signed-off-by: NMasanari Iida <standby24x7@gmail.com> Acked-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Drysdale 提交于
This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: NDavid Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Current stacktrace only have the function for console output. page_owner that will be introduced in following patch needs to print the output of stacktrace into the buffer for our own output format so so new function, snprint_stack_trace(), is needed. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dave Hansen <dave@sr71.net> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Jungsoo Son <jungsoo.son@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
Both register and unregister call build_map_info() in order to create the list of mappings before installing or removing breakpoints for every mm which maps file backed memory. As such, there is no reason to hold the i_mmap_rwsem exclusively, so share it and allow concurrent readers to build the mapping data. Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [sfr@canb.auug.org.au: update fremap.c] Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: NHugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Davidlohr Bueso 提交于
Convert all open coded mutex_lock/unlock calls to the i_mmap_[lock/unlock]_write() helpers. Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: N"Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: NHugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2014 2 次提交
-
-
由 Thomas Gleixner 提交于
Since the rework of the sparse interrupt code to actually free the unused interrupt descriptors there exists a race between the /proc interfaces to the irq subsystem and the code which frees the interrupt descriptor. CPU0 CPU1 show_interrupts() desc = irq_to_desc(X); free_desc(desc) remove_from_radix_tree(); kfree(desc); raw_spinlock_irq(&desc->lock); /proc/interrupts is the only interface which can actively corrupt kernel memory via the lock access. /proc/stat can only read from freed memory. Extremly hard to trigger, but possible. The interfaces in /proc/irq/N/ are not affected by this because the removal of the proc file is serialized in procfs against concurrent readers/writers. The removal happens before the descriptor is freed. For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue as the descriptor is never freed. It's merely cleared out with the irq descriptor lock held. So any concurrent proc access will either see the old correct value or the cleared out ones. Protect the lookup and access to the irq descriptor in show_interrupts() with the sparse_irq_lock. Provide kstat_irqs_usr() which is protecting the lookup and access with sparse_irq_lock and switch /proc/stat to use it. Document the existing kstat_irqs interfaces so it's clear that the caller needs to take care about protection. The users of these interfaces are either not affected due to SPARSE_IRQ=n or already protected against removal. Fixes: 1f5a5b87 "genirq: Implement a sane sparse_irq allocator" Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
-
由 Rafael J. Wysocki 提交于
After commit b2b49ccb (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected) PM_RUNTIME is always set if PM is set, so files that are build conditionally if CONFIG_PM_RUNTIME is set may now be build if CONFIG_PM is set. Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile for this reason. Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Steven Rostedt <rostedt@goodmis.org.
-