- 20 5月, 2016 3 次提交
-
-
由 Du, Changbin 提交于
Update the return type to use bool instead of int, corresponding to change (debugobjects: make fixup functions return bool instead of int) Signed-off-by: NDu, Changbin <changbin.du@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Triplett <josh@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Deepa Dinamani 提交于
All references to timespec_add_safe() now use timespec64_add_safe(). The plan is to replace struct timespec references with struct timespec64 throughout the kernel as timespec is not y2038 safe. Drop timespec_add_safe() and use timespec64_add_safe() for all architectures. Link: http://lkml.kernel.org/r/1461947989-21926-4-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NJohn Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Deepa Dinamani 提交于
timespec64_add_safe() has been defined in time64.h for 64 bit systems. But, 32 bit systems only have an extern function prototype defined. Provide a definition for the above function. The function will be necessary as part of y2038 changes. struct timespec is not y2038 safe. All references to timespec will be replaced by struct timespec64. The function is meant to be a replacement for timespec_add_safe(). The implementation is similar to timespec_add_safe(). Link: http://lkml.kernel.org/r/1461947989-21926-2-git-send-email-deepa.kernel@gmail.comSigned-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Acked-by: NJohn Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 5月, 2016 4 次提交
-
-
由 Daniel Borkmann 提交于
This work adds a generic facility for use from eBPF JIT compilers that allows for further hardening of JIT generated images through blinding constants. In response to the original work on BPF JIT spraying published by Keegan McAllister [1], most BPF JITs were changed to make images read-only and start at a randomized offset in the page, where the rest was filled with trap instructions. We have this nowadays in x86, arm, arm64 and s390 JIT compilers. Additionally, later work also made eBPF interpreter images read only for kernels supporting DEBUG_SET_MODULE_RONX, that is, x86, arm, arm64 and s390 archs as well currently. This is done by default for mentioned JITs when JITing is enabled. Furthermore, we had a generic and configurable constant blinding facility on our todo for quite some time now to further make spraying harder, and first implementation since around netconf 2016. We found that for systems where untrusted users can load cBPF/eBPF code where JIT is enabled, start offset randomization helps a bit to make jumps into crafted payload harder, but in case where larger programs that cross page boundary are injected, we again have some part of the program opcodes at a page start offset. With improved guessing and more reliable payload injection, chances can increase to jump into such payload. Elena Reshetova recently wrote a test case for it [2, 3]. Moreover, eBPF comes with 64 bit constants, which can leave some more room for payloads. Note that for all this, additional bugs in the kernel are still required to make the jump (and of course to guess right, to not jump into a trap) and naturally the JIT must be enabled, which is disabled by default. For helping mitigation, the general idea is to provide an option bpf_jit_harden that admins can tweak along with bpf_jit_enable, so that for cases where JIT should be enabled for performance reasons, the generated image can be further hardened with blinding constants for unpriviledged users (bpf_jit_harden == 1), with trading off performance for these, but not for privileged ones. We also added the option of blinding for all users (bpf_jit_harden == 2), which is quite helpful for testing f.e. with test_bpf.ko. There are no further e.g. hardening levels of bpf_jit_harden switch intended, rationale is to have it dead simple to use as on/off. Since this functionality would need to be duplicated over and over for JIT compilers to use, which are already complex enough, we provide a generic eBPF byte-code level based blinding implementation, which is then just transparently JITed. JIT compilers need to make only a few changes to integrate this facility and can be migrated one by one. This option is for eBPF JITs and will be used in x86, arm64, s390 without too much effort, and soon ppc64 JITs, thus that native eBPF can be blinded as well as cBPF to eBPF migrations, so that both can be covered with a single implementation. The rule for JITs is that bpf_jit_blind_constants() must be called from bpf_int_jit_compile(), and in case blinding is disabled, we follow normally with JITing the passed program. In case blinding is enabled and we fail during the process of blinding itself, we must return with the interpreter. Similarly, in case the JITing process after the blinding failed, we return normally to the interpreter with the non-blinded code. Meaning, interpreter doesn't change in any way and operates on eBPF code as usual. For doing this pre-JIT blinding step, we need to make use of a helper/auxiliary register, here BPF_REG_AX. This is strictly internal to the JIT and not in any way part of the eBPF architecture. Just like in the same way as JITs internally make use of some helper registers when emitting code, only that here the helper register is one abstraction level higher in eBPF bytecode, but nevertheless in JIT phase. That helper register is needed since f.e. manually written program can issue loads to all registers of eBPF architecture. The core concept with the additional register is: blind out all 32 and 64 bit constants by converting BPF_K based instructions into a small sequence from K_VAL into ((RND ^ K_VAL) ^ RND). Therefore, this is transformed into: BPF_REG_AX := (RND ^ K_VAL), BPF_REG_AX ^= RND, and REG <OP> BPF_REG_AX, so actual operation on the target register is translated from BPF_K into BPF_X one that is operating on BPF_REG_AX's content. During rewriting phase when blinding, RND is newly generated via prandom_u32() for each processed instruction. 64 bit loads are split into two 32 bit loads to make translation and patching not too complex. Only basic thing required by JITs is to call the helper bpf_jit_blind_constants()/bpf_jit_prog_release_other() pair, and to map BPF_REG_AX into an unused register. Small bpf_jit_disasm extract from [2] when applied to x86 JIT: echo 0 > /proc/sys/net/core/bpf_jit_harden ffffffffa034f5e9 + <x>: [...] 39: mov $0xa8909090,%eax 3e: mov $0xa8909090,%eax 43: mov $0xa8ff3148,%eax 48: mov $0xa89081b4,%eax 4d: mov $0xa8900bb0,%eax 52: mov $0xa810e0c1,%eax 57: mov $0xa8908eb4,%eax 5c: mov $0xa89020b0,%eax [...] echo 1 > /proc/sys/net/core/bpf_jit_harden ffffffffa034f1e5 + <x>: [...] 39: mov $0xe1192563,%r10d 3f: xor $0x4989b5f3,%r10d 46: mov %r10d,%eax 49: mov $0xb8296d93,%r10d 4f: xor $0x10b9fd03,%r10d 56: mov %r10d,%eax 59: mov $0x8c381146,%r10d 5f: xor $0x24c7200e,%r10d 66: mov %r10d,%eax 69: mov $0xeb2a830e,%r10d 6f: xor $0x43ba02ba,%r10d 76: mov %r10d,%eax 79: mov $0xd9730af,%r10d 7f: xor $0xa5073b1f,%r10d 86: mov %r10d,%eax 89: mov $0x9a45662b,%r10d 8f: xor $0x325586ea,%r10d 96: mov %r10d,%eax [...] As can be seen, original constants that carry payload are hidden when enabled, actual operations are transformed from constant-based to register-based ones, making jumps into constants ineffective. Above extract/example uses single BPF load instruction over and over, but of course all instructions with constants are blinded. Performance wise, JIT with blinding performs a bit slower than just JIT and faster than interpreter case. This is expected, since we still get all the performance benefits from JITing and in normal use-cases not every single instruction needs to be blinded. Summing up all 296 test cases averaged over multiple runs from test_bpf.ko suite, interpreter was 55% slower than JIT only and JIT with blinding was 8% slower than JIT only. Since there are also some extremes in the test suite, I expect for ordinary workloads that the performance for the JIT with blinding case is even closer to JIT only case, f.e. nmap test case from suite has averaged timings in ns 29 (JIT), 35 (+ blinding), and 151 (interpreter). BPF test suite, seccomp test suite, eBPF sample code and various bigger networking eBPF programs have been tested with this and were running fine. For testing purposes, I also adapted interpreter and redirected blinded eBPF image to interpreter and also here all tests pass. [1] http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.html [2] https://github.com/01org/jit-spray-poc-for-ksp/ [3] http://www.openwall.com/lists/kernel-hardening/2016/05/03/5Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NElena Reshetova <elena.reshetova@intel.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Since the blinding is strictly only called from inside eBPF JITs, we need to change signatures for bpf_int_jit_compile() and bpf_prog_select_runtime() first in order to prepare that the eBPF program we're dealing with can change underneath. Hence, for call sites, we need to return the latest prog. No functional change in this patch. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Move the functionality to patch instructions out of the verifier code and into the core as the new bpf_patch_insn_single() helper will be needed later on for blinding as well. No changes in functionality. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Besides others, remove redundant comments where the code is self documenting enough, and properly indent various bpf_verifier_ops and bpf_prog_type_list declarations. Moreover, remove two exports that actually have no module user. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 5月, 2016 1 次提交
-
-
由 Peter Zijlstra 提交于
The new signal_pending exit path in __rwsem_down_write_failed_common() was fingered as breaking his kernel by Tetsuo Handa. Upon inspection it was found that there are two things wrong with it; - it forgets to remove WAITING_BIAS if it leaves the list empty, or - it forgets to wake further waiters that were blocked on the now removed waiter. Especially the first issue causes new lock attempts to block and stall indefinitely, as the code assumes that pending waiters mean there is an owner that will wake when it releases the lock. Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: NMichal Hocko <mhocko@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Waiman Long <Waiman.Long@hpe.com> Link: http://lkml.kernel.org/r/20160512115745.GP3192@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 5月, 2016 1 次提交
-
-
由 Steven Rostedt (Red Hat) 提交于
If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE then the DIV_ROUND_UP() will return zero. Here's the details: # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb tracing_entries_write() processes this and converts kb to bytes. 18014398509481980 << 10 = 18446744073709547520 and this is passed to ring_buffer_resize() as unsigned long size. size = DIV_ROUND_UP(size, BUF_PAGE_SIZE); Where DIV_ROUND_UP(a, b) is (a + b - 1)/b BUF_PAGE_SIZE is 4080 and here 18446744073709547520 + 4080 - 1 = 18446744073709551599 where 18446744073709551599 is still smaller than 2^64 2^64 - 18446744073709551599 = 17 But now 18446744073709551599 / 4080 = 4521260802379792 and size = size * 4080 = 18446744073709551360 This is checked to make sure its still greater than 2 * 4080, which it is. Then we convert to the number of buffer pages needed. nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE) but this time size is 18446744073709551360 and 2^64 - (18446744073709551360 + 4080 - 1) = -3823 Thus it overflows and the resulting number is less than 4080, which makes 3823 / 4080 = 0 an nr_pages is set to this. As we already checked against the minimum that nr_pages may be, this causes the logic to fail as well, and we crash the kernel. There's no reason to have the two DIV_ROUND_UP() (that's just result of historical code changes), clean up the code and fix this bug. Cc: stable@vger.kernel.org # 3.5+ Fixes: 83f40318 ("ring-buffer: Make removal of ring buffer pages atomic") Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
- 13 5月, 2016 2 次提交
-
-
由 Steven Rostedt (Red Hat) 提交于
The size variable to change the ring buffer in ftrace is a long. The nr_pages used to update the ring buffer based on the size is int. On 64 bit machines this can cause an overflow problem. For example, the following will cause the ring buffer to crash: # cd /sys/kernel/debug/tracing # echo 10 > buffer_size_kb # echo 8556384240 > buffer_size_kb Then you get the warning of: WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260 Which is: RB_WARN_ON(cpu_buffer, nr_removed); Note each ring buffer page holds 4080 bytes. This is because: 1) 10 causes the ring buffer to have 3 pages. (10kb requires 3 * 4080 pages to hold) 2) (2^31 / 2^10 + 1) * 4080 = 8556384240 The value written into buffer_size_kb is shifted by 10 and then passed to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE which is 4080. 8761737461760 / 4080 = 2147484672 4) nr_pages is subtracted from the current nr_pages (3) and we get: 2147484669. This value is saved in a signed integer nr_pages_to_update 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int turns into the value of -2147482627 6) As the value is a negative number, in update_pages_handler() it is negated and passed to rb_remove_pages() and 2147482627 pages will be removed, which is much larger than 3 and it causes the warning because not all the pages asked to be removed were removed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001 Cc: stable@vger.kernel.org # 2.6.28+ Fixes: 7a8e76a3 ("tracing: unified trace buffer") Reported-by: NHao Qin <QEver.cn@gmail.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Wanpeng Li 提交于
------------[ cut here ]------------ WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0 Modules linked in: CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31 Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013 0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010 0000000000000000 0000000000000000 0000000000000000 ffff881037babba8 ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046 Call Trace: dump_stack+0x89/0xd4 __warn+0xfd/0x120 warn_slowpath_null+0x1d/0x20 rebind_workers+0x1c0/0x1d0 workqueue_cpu_up_callback+0xf5/0x1d0 notifier_call_chain+0x64/0x90 ? trace_hardirqs_on_caller+0xf2/0x220 ? notify_prepare+0x80/0x80 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x35/0x50 notify_down_prepare+0x5e/0x80 ? notify_prepare+0x80/0x80 cpuhp_invoke_callback+0x73/0x330 ? __schedule+0x33e/0x8a0 cpuhp_down_callbacks+0x51/0xc0 cpuhp_thread_fun+0xc1/0xf0 smpboot_thread_fn+0x159/0x2a0 ? smpboot_create_threads+0x80/0x80 kthread+0xef/0x110 ? wait_for_completion+0xf0/0x120 ? schedule_tail+0x35/0xf0 ret_from_fork+0x22/0x50 ? __init_kthread_worker+0x70/0x70 ---[ end trace eb12ae47d2382d8f ]--- notify_down_prepare: attempt to take down CPU 0 failed This bug can be reproduced by below config w/ nohz_full= all cpus: CONFIG_BOOTPARAM_HOTPLUG_CPU0=y CONFIG_DEBUG_HOTPLUG_CPU0=y CONFIG_NO_HZ_FULL=y As Thomas pointed out: | If a down prepare callback fails, then DOWN_FAILED is invoked for all | callbacks which have successfully executed DOWN_PREPARE. | | But, workqueue has actually two notifiers. One which handles | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE. | | Now look at the priorities of those callbacks: | | CPU_PRI_WORKQUEUE_UP = 5 | CPU_PRI_WORKQUEUE_DOWN = -5 | | So the call order on DOWN_PREPARE is: | | CB 1 | CB ... | CB workqueue_up() -> Ignores DOWN_PREPARE | CB ... | CB X ---> Fails | | So we call up to CB X with DOWN_FAILED | | CB 1 | CB ... | CB workqueue_up() -> Handles DOWN_FAILED | CB ... | CB X-1 | | So the problem is that the workqueue stuff handles DOWN_FAILED in the up | callback, while it should do it in the down callback. Which is not a good idea | either because it wants to be called early on rollback... | | Brilliant stuff, isn't it? The hotplug rework will solve this problem because | the callbacks become symetric, but for the existing mess, we need some | workaround in the workqueue code. The boot CPU handles housekeeping duty(unbound timers, workqueues, timekeeping, ...) on behalf of full dynticks CPUs. It must remain online when nohz full is enabled. There is a priority set to every notifier_blocks: workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down So tick_nohz_cpu_down callback failed when down prepare cpu 0, and notifier_blocks behind tick_nohz_cpu_down will not be called any more, which leads to workers are actually not unbound. Then hotplug state machine will fallback to undo and online cpu 0 again. Workers will be rebound unconditionally even if they are not unbound and trigger the warning in this progress. This patch fix it by catching !DISASSOCIATED to avoid rebind bound workers. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org Suggested-by: NLai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
-
- 12 5月, 2016 13 次提交
-
-
由 Felipe Balbi 提交于
commit 4f41fc59 ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") added the following compile warning: kernel/cgroup.c: In function ‘cgroup_show_path’: kernel/cgroup.c:1634:15: warning: unused variable ‘ret’ [-Wunused-variable] int len = 0, ret = 0; ^ fix it. Fixes: 4f41fc59 ("cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces") Signed-off-by: NFelipe Balbi <felipe.balbi@linux.intel.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Alexander Shishkin 提交于
When the PMU driver reports a truncated AUX record, it effectively means that there is no more usable room in the event's AUX buffer (even though there may still be some room, so that perf_aux_output_begin() doesn't take action). At this point the consumer still has to be woken up and the event has to be disabled, otherwise the event will just keep spinning between perf_aux_output_begin() and perf_aux_output_end() until its context gets unscheduled. Again, for cpu-wide events this means never, so once in this condition, they will be forever losing data. Fix this by disabling the event and waking up the consumer in case of a truncated AUX record. Reported-by: NMarkus Metzger <markus.t.metzger@intel.com> Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Alexander Shishkin 提交于
When the PMU driver reports a truncated AUX record, it effectively means that there is no more usable room in the event's AUX buffer (even though there may still be some room, so that perf_aux_output_begin() doesn't take action). At this point the consumer still has to be woken up and the event has to be disabled, otherwise the event will just keep spinning between perf_aux_output_begin() and perf_aux_output_end() until its context gets unscheduled. Again, for cpu-wide events this means never, so once in this condition, they will be forever losing data. Fix this by disabling the event and waking up the consumer in case of a truncated AUX record. Reported-by: NMarkus Metzger <markus.t.metzger@intel.com> Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Thomas Gleixner 提交于
tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows us to change the representation of ->nr_cpus_allowed if required. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1462969411-17735-2-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Thomas Gleixner 提交于
Use the future-safe accessor for struct task_struct's. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1462969411-17735-1-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vik Heyndrickx 提交于
Systems show a minimal load average of 0.00, 0.01, 0.05 even when they have no load at all. Uptime and /proc/loadavg on all systems with kernels released during the last five years up until kernel version 4.6-rc5, show a 5- and 15-minute minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on idle systems, but the way the kernel calculates this value prevents it from getting lower than the mentioned values. Likewise but not as obviously noticeable, a fully loaded system with no processes waiting, shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95 (multiplied by number of cores). Once the (old) load becomes 93 or higher, it mathematically can never get lower than 93, even when the active (load) remains 0 forever. This results in the strange 0.00, 0.01, 0.05 uptime values on idle systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05. It is not correct to add a 0.5 rounding (=1024/2048) here, since the result from this function is fed back into the next iteration again, so the result of that +0.5 rounding value then gets multiplied by (2048-2037), and then rounded again, so there is a virtual "ghost" load created, next to the old and active load terms. By changing the way the internally kept value is rounded, that internal value equivalent now can reach 0.00 on idle, and 1.00 on full load. Upon increasing load, the internally kept load value is rounded up, when the load is decreasing, the load value is rounded down. The modified code was tested on nohz=off and nohz kernels. It was tested on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was tested on single, dual, and octal cores system. It was tested on virtual hosts and bare hardware. No unwanted effects have been observed, and the problems that the patch intended to fix were indeed gone. Tested-by: NDamien Wyart <damien.wyart@free.fr> Signed-off-by: NVik Heyndrickx <vik.heyndrickx@veribox.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Doug Smythies <dsmythies@telus.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 0f004f5a ("sched: Cure more NO_HZ load average woes") Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Morten Rasmussen 提交于
In calculate_imbalance() load_above_capacity currently has the unit [capacity] while it is used as being [load/capacity]. Not only is it wrong it also makes it unlikely that load_above_capacity is ever used as the subsequent code picks the smaller of load_above_capacity and the avg_load This patch ensures that load_above_capacity has the right unit [load/capacity]. Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com> [ Changed changelog to note it was in capacity unit; +rebase. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Wanpeng noted that the scale_load_down() in calculate_imbalance() was weird. I agree, it should be SCHED_CAPACITY_SCALE, since we're going to compare against busiest->group_capacity, which is in [capacity] units. Reported-by: NWanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
The following commit: 9642d18e ("nohz: Affine unpinned timers to housekeepers")' intended to affine unpinned timers to housekeepers: unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(houserkeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself) However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the intention to: unpinned timers(full dynaticks, idle) => any housekeepers(no mattter cpu topology) unpinned timers(full dynaticks, busy) => any housekeepers(no mattter cpu topology) unpinned timers(housekeepers, idle) => any busy cpus(otherwise, fallback to any housekeepers) This patch fixes it by checking if there are busy housekeepers nearby, otherwise falls to any housekeepers/itself. After the patch: unpinned timers(full dynaticks, idle) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(full dynaticks, busy) => nearest busy housekeepers(otherwise, fallback to any housekeepers) unpinned timers(housekeepers, idle) => nearest busy housekeepers(otherwise, fallback to itself) Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> [ Fixed the changelog. ] Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Fixes: 'commit 9642d18e ("nohz: Affine unpinned timers to housekeepers")' Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Pavan reported that in the presence of very light tasks (or cgroups) the placement of migrated tasks can cause severe fairness issues. The problem is that enqueue_entity() places the task before it updates time, thereby it can place the task far in the past (remember that light tasks will shoot virtual time forward at a high speed, so in relation to the pre-existing light task, we can land far in the past). This is done because update_curr() needs the current task, and we might be placing the current task. The obvious solution is to differentiate between the current and any other task; placing the current before we update time, and placing any other task after, such that !curr tasks end up at the current moment in time, and not in the past. This commit re-introduces the previously reverted commit: 3a47d512 ("sched/fair: Fix fairness issue on migration") ... which is now safe to do, after we've also fixed another underlying bug first, in: sched/fair: Prepare to fix fairness problems on migration and cleaned up other details in the migration code: sched/core: Kill sched_class::task_waking Reported-by: NPavan Kondeti <pkondeti@codeaurora.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
With sched_class::task_waking being called only when we do set_task_cpu(), we can make sched_class::migrate_task_rq() do the work and eliminate sched_class::task_waking entirely. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Pavan Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: byungchul.park@lge.com Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Mike reported that our recent attempt to fix migration problems: 3a47d512 ("sched/fair: Fix fairness issue on migration") broke interactivity and the signal starve test. We reverted that commit and now let's try it again more carefully, with some other underlying problems fixed first. One problem is that I assumed ENQUEUE_WAKING was only set when we do a cross-cpu wakeup (migration), which isn't true. This means we now destroy the vruntime history of tasks and wakeup-preemption suffers. Cure this by making my assumption true, only call sched_class::task_waking() when we do a cross-cpu wakeup. This avoids the indirect call in the case we do a local wakeup. Reported-by: NMike Galbraith <mgalbraith@suse.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Pavan Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: byungchul.park@lge.com Cc: linux-kernel@vger.kernel.org Fixes: 3a47d512 ("sched/fair: Fix fairness issue on migration") Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Since I want to make ->task_woken() conditional on the task getting migrated, we cannot use it to call record_wakee(). Move it to select_task_rq_fair(), which gets called in almost all the same conditions. The only exception is if the woken task (@p) is CPU-bound (as per the nr_cpus_allowed test in select_task_rq()). Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Pavan Kondeti <pkondeti@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: byungchul.park@lge.com Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 11 5月, 2016 2 次提交
-
-
由 Jon Hunter 提交于
In the function, setup_irq(), we don't check that the descriptor returned from irq_to_desc() is valid before we start using it. For example chip_bus_lock() called from setup_irq(), assumes that the descriptor pointer is valid and doesn't check before dereferencing it. In many other functions including setup/free_percpu_irq() we do check that the descriptor returned is not NULL and therefore add the same test to setup_irq() to ensure the descriptor returned is valid. Signed-off-by: NJon Hunter <jonathanh@nvidia.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
-
由 Ingo Molnar 提交于
Mike reported that this recent commit: 3a47d512 ("sched/fair: Fix fairness issue on migration") ... broke interactivity and the signal starvation test. We have a proper fix series in the works but ran out of time for v4.6, so revert the commit. Reported-by: NMike Galbraith <efault@gmx.de> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 10 5月, 2016 5 次提交
-
-
由 Shaohua Li 提交于
BLK_TC_NOTIFY is missed in mask_maps, so we can't print out notify or set mask with 'notify' name. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Shaohua Li 提交于
commit f4a1d08c introduces a regression. Originally for BLK_TN_MESSAGE, we add message in trace and return. The commit ignores the early return and add garbage info. Signed-off-by: NShaohua Li <shli@fb.com> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Xunlei Pang 提交于
We got this warning: WARNING: CPU: 1 PID: 2468 at kernel/sched/core.c:1161 set_task_cpu+0x1af/0x1c0 [...] Call Trace: dump_stack+0x63/0x87 __warn+0xd1/0xf0 warn_slowpath_null+0x1d/0x20 set_task_cpu+0x1af/0x1c0 push_dl_task.part.34+0xea/0x180 push_dl_tasks+0x17/0x30 __balance_callback+0x45/0x5c __sched_setscheduler+0x906/0xb90 SyS_sched_setattr+0x150/0x190 do_syscall_64+0x62/0x110 entry_SYSCALL64_slow_path+0x25/0x25 This corresponds to: WARN_ON_ONCE(p->state == TASK_RUNNING && p->sched_class == &fair_sched_class && (p->on_rq && !task_on_rq_migrating(p))) It happens because in find_lock_later_rq(), the task whose scheduling class was changed to fair class is still pushed away as if it were a deadline task ... So, check in find_lock_later_rq() after double_lock_balance(), if the scheduling class of the deadline task was changed, break and retry. Apply the same logic to RT tasks. Signed-off-by: NXunlei Pang <xlpang@redhat.com> Reviewed-by: NSteven Rostedt <rostedt@goodmis.org> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/1462767091-1215-1-git-send-email-xlpang@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Allowing unprivileged kernel profiling lets any user dump follow kernel control flow and dump kernel registers. This most likely allows trivial kASLR bypassing, and it may allow other mischief as well. (Off the top of my head, the PERF_SAMPLE_REGS_INTR output during /dev/urandom reads could be quite interesting.) Signed-off-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Serge E. Hallyn 提交于
Patch summary: When showing a cgroupfs entry in mountinfo, show the path of the mount root dentry relative to the reader's cgroup namespace root. Short explanation (courtesy of mkerrisk): If we create a new cgroup namespace, then we want both /proc/self/cgroup and /proc/self/mountinfo to show cgroup paths that are correctly virtualized with respect to the cgroup mount point. Previous to this patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo does not. Long version: When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup namespace, and then mounts a new instance of the freezer cgroup, the new mount will be rooted at /a/b. The root dentry field of the mountinfo entry will show '/a/b'. cat > /tmp/do1 << EOF mount -t cgroup -o freezer freezer /mnt grep freezer /proc/self/mountinfo EOF unshare -Gm bash /tmp/do1 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer The task's freezer cgroup entry in /proc/self/cgroup will simply show '/': grep freezer /proc/self/cgroup 9:freezer:/ If instead the same task simply bind mounts the /a/b cgroup directory, the resulting mountinfo entry will again show /a/b for the dentry root. However in this case the task will find its own cgroup at /mnt/a/b, not at /mnt: mount --bind /sys/fs/cgroup/freezer/a/b /mnt 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer In other words, there is no way for the task to know, based on what is in mountinfo, which cgroup directory is its own. Example (by mkerrisk): First, a little script to save some typing and verbiage: echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)" cat /proc/self/mountinfo | grep freezer | awk '{print "\tmountinfo:\t\t" $4 "\t" $5}' Create cgroup, place this shell into the cgroup, and look at the state of the /proc files: 2653 2653 # Our shell 14254 # cat(1) /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer Create a shell in new cgroup and mount namespaces. The act of creating a new cgroup namespace causes the process's current cgroups directories to become its cgroup root directories. (Here, I'm using my own version of the "unshare" utility, which takes the same options as the util-linux version): Look at the state of the /proc files: /proc/self/cgroup: 10:freezer:/ mountinfo: / /sys/fs/cgroup/freezer The third entry in /proc/self/cgroup (the pathname of the cgroup inside the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which is rooted at /a/b in the outer namespace. However, the info in /proc/self/mountinfo is not for this cgroup namespace, since we are seeing a duplicate of the mount from the old mount namespace, and the info there does not correspond to the new cgroup namespace. However, trying to create a new mount still doesn't show us the right information in mountinfo: # propagating to other mountns /proc/self/cgroup: 7:freezer:/ mountinfo: /a/b /mnt/freezer The act of creating a new cgroup namespace caused the process's current freezer directory, "/a/b", to become its cgroup freezer root directory. In other words, the pathname directory of the directory within the newly mounted cgroup filesystem should be "/", but mountinfo wrongly shows us "/a/b". The consequence of this is that the process in the cgroup namespace cannot correctly construct the pathname of its cgroup root directory from the information in /proc/PID/mountinfo. With this patch, the dentry root field in mountinfo is shown relative to the reader's cgroup namespace. So the same steps as above: /proc/self/cgroup: 10:freezer:/a/b mountinfo: / /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: /../.. /sys/fs/cgroup/freezer /proc/self/cgroup: 10:freezer:/ mountinfo: / /mnt/freezer cgroup.clone_children freezer.parent_freezing freezer.state tasks cgroup.procs freezer.self_freezing notify_on_release 3164 2653 # First shell that placed in this cgroup 3164 # Shell started by 'unshare' 14197 # cat(1) Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com> Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com> Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 09 5月, 2016 2 次提交
-
-
由 Davidlohr Bueso 提交于
... the comment clearly refers to wake_up_q(), and not wake_up_list(). Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1462766290-28664-1-git-send-email-dave@stgolabs.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Muhammad Falak R Wani 提交于
Remove unused variable 'ret', and directly return 0. Signed-off-by: NMuhammad Falak R Wani <falakreyaz@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1462441879-10092-1-git-send-email-falakreyaz@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 07 5月, 2016 4 次提交
-
-
由 Rafael J. Wysocki 提交于
The following commit: 34e2c555 ("cpufreq: Add mechanism for registering utilization update callbacks") overlooked the fact that update_load_avg(), where CFS invokes cpufreq utilization update callbacks, becomes an empty stub on UP kernels. In consequence, if !CONFIG_SMP, cpufreq governors are never invoked from CFS and they do not have a chance to evaluate CPU performace levels and update them often enough. Needless to say, things don't work as expected then. Fix the problem by making the !CONFIG_SMP stub of update_load_avg() invoke cpufreq update callbacks too. Reported-by: NSteve Muckle <steve.muckle@linaro.org> Tested-by: NSteve Muckle <steve.muckle@linaro.org> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: NSteve Muckle <steve.muckle@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux PM list <linux-pm@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Fixes: 34e2c555 (cpufreq: Add mechanism for registering utilization update callbacks) Link: http://lkml.kernel.org/r/6282396.VVEdgVYxO3@vostro.rjw.lanSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Alexei Starovoitov 提交于
since UNKNOWN_VALUE type is weaker than CONST_IMM we can un-teach verifier its recognition of constants in conditional branches without affecting safety. Ex: if (reg == 123) { .. here verifier was marking reg->type as CONST_IMM instead keep reg as UNKNOWN_VALUE } Two verifier states with UNKNOWN_VALUE are equivalent, whereas CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range verification and other cases. So help search pruning by marking registers as UNKNOWN_VALUE where possible instead of CONST_IMM. Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
Extended BPF carried over two instructions from classic to access packet data: LD_ABS and LD_IND. They're highly optimized in JITs, but due to their design they have to do length check for every access. When BPF is processing 20M packets per second single LD_ABS after JIT is consuming 3% cpu. Hence the need to optimize it further by amortizing the cost of 'off < skb_headlen' over multiple packet accesses. One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW with similar usage as skb_header_pointer(). The kernel part for interpreter and x64 JIT was implemented in [1], but such new insns behave like old ld_abs and abort the program with 'return 0' if access is beyond linear data. Such hidden control flow is hard to workaround plus changing JITs and rolling out new llvm is incovenient. Therefore allow cls_bpf/act_bpf program access skb->data directly: int bpf_prog(struct __sk_buff *skb) { struct iphdr *ip; if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end) /* packet too small */ return 0; ip = skb->data + ETH_HLEN; /* access IP header fields with direct loads */ if (ip->version != 4 || ip->saddr == 0x7f000001) return 1; [...] } This solution avoids introduction of new instructions. llvm stays the same and all JITs stay the same, but verifier has to work extra hard to prove safety of the above program. For XDP the direct store instructions can be allowed as well. The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check the alignment. The complex packet parsers where packet pointer is adjusted incrementally cannot be tracked for alignment, so allow byte access in such cases and misaligned access on architectures that define efficient_unaligned_access [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dwSigned-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
cleanup verifier code and prepare it for addition of "pointer to packet" logic Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 5月, 2016 3 次提交
-
-
由 Thomas Gleixner 提交于
No need for an extra notifier. We don't need to handle all these states. It's sufficient to kill the timer when the cpu dies. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.770528462@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Thomas Gleixner 提交于
No need for an extra notifier. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.693720241@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Thomas Gleixner 提交于
The scheduler can handle per cpu threads before the cpu is set to active and it does not allow user space threads on the cpu before active is set. Attaching to the scheduling domains is also not required before user space threads can be handled. Move the activation to the end of the hotplug state space. That also means that deactivation is the first action when a cpu is shut down. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20160310120025.597477199@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-