- 06 Nov, 2015 15 commits
-
-
Committed by Eric B Munson
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking.

For the example of a large file, this is the usage pattern for a large statistical language model (it probably applies to other statistical or graphical models as well). For the security example, this applies to any application transacting in data that cannot be swapped out (credit card data, medical records, etc).

This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in.

Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
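The flag interplay described in this message can be sketched as a small user-space model. This is not the kernel code: the bit values and the helper name `lock_gup_flags()` are assumptions made for illustration, and only the relationship between the flags is taken from the text above.

```c
/* Illustrative bit values only; the kernel's real definitions differ. */
#define VM_LOCKED      0x1u
#define VM_LOCKONFAULT 0x2u

#define FOLL_POPULATE  0x1u
#define FOLL_MLOCK     0x2u

/*
 * Hypothetical helper: derive the fault-in flags for a VMA from its
 * lock state, following the rules stated in the commit message.
 */
unsigned int lock_gup_flags(unsigned int vm_flags)
{
	unsigned int gup = 0;

	/* VM_LOCKONFAULT has no effect when set without VM_LOCKED. */
	if (!(vm_flags & VM_LOCKED))
		return 0;

	/* FOLL_MLOCK communicates that the VMA is locked. */
	gup |= FOLL_MLOCK;

	/* FOLL_POPULATE is not set for lock-on-fault VMAs. */
	if (!(vm_flags & VM_LOCKONFAULT))
		gup |= FOLL_POPULATE;

	return gup;
}
```

In this model a plain locked VMA yields both flags (lock and pre-fault), while a lock-on-fault VMA yields only `FOLL_MLOCK`, so pages are locked as they arrive rather than faulted in up front.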
-
Committed by Eric B Munson
With the refactored mlock code, introduce a new system call for mlock. The new call will allow the user to specify what lock states are being added. mlock2 is trivial at the moment, but a follow-on patch will add a new mlock state making it useful.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by David Rientjes
The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm.

A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME.

The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock.

Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent.

Signed-off-by: David Rientjes <rientjes@google.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
Theoretically it is possible that the watchdog timer expires right at the time when a user sets 'watchdog_thresh' to zero (note: this disables the lockup detectors). In this scenario, the is_softlockup() function - which is called by the timer - could produce a false positive. Fix this by checking the current value of 'watchdog_thresh'.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
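The false positive can be illustrated with a simplified user-space model. It assumes the soft lockup threshold is derived as twice 'watchdog_thresh' and takes timestamps as plain arguments; both are assumptions of this sketch, and the function is not the kernel's implementation.

```c
/* Stand-in for the kernel variable: seconds, 0 disables the detectors. */
int watchdog_thresh = 10;

/* Assumed derivation: the soft lockup threshold is twice watchdog_thresh. */
unsigned long softlockup_thresh(void)
{
	return (unsigned long)watchdog_thresh * 2;
}

/*
 * Returns the stall duration if a soft lockup is detected, else 0.
 * Without the first check, a threshold of zero would make every call
 * with now > touch_ts look like a lockup.
 */
unsigned long is_softlockup(unsigned long now, unsigned long touch_ts)
{
	/* Check the current value: zero means the detector is disabled. */
	if (watchdog_thresh == 0)
		return 0;

	if (now > touch_ts + softlockup_thresh())
		return now - touch_ts;

	return 0;
}
```

With `watchdog_thresh` at 10, a 30 second gap is reported and a 5 second gap is not; once the threshold is set to zero, the same 30 second gap is ignored.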
-
Committed by Ulrich Obergfell
watchdog_{park|unpark}_threads() are now called in code paths that protect themselves against CPU hotplug, so {get|put}_online_cpus() calls are redundant and can be removed.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
The handler functions for watchdog parameters in /proc/sys/kernel do not protect themselves against races with CPU hotplug. Hence, theoretically it is possible that a new watchdog thread is started on a hotplugged CPU while a parameter is being modified, and the thread could thus use a parameter value that is 'in transition'.

For example, if 'watchdog_thresh' is being set to zero (note: this disables the lockup detectors) the thread would erroneously use the value zero as the sample period.

To avoid such races and to keep the /proc handler code consistent, call

    {get|put}_online_cpus() in proc_watchdog_common()
    {get|put}_online_cpus() in proc_watchdog_thresh()
    {get|put}_online_cpus() in proc_watchdog_cpumask()

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
The lockup detector suspend/resume interface that was introduced by commit 8c073d27 ("watchdog: introduce watchdog_suspend() and watchdog_resume()") does not protect itself against races with CPU hotplug. Hence, theoretically it is possible that a new watchdog thread is started on a hotplugged CPU while the lockup detector is suspended, and the thread could thus interfere unexpectedly with the code that requested to suspend the lockup detector.

Avoid the race by calling

    get_online_cpus() in lockup_detector_suspend()
    put_online_cpus() in lockup_detector_resume()

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Don Zickus
The only way to enable a hardlockup to panic the machine is to set 'nmi_watchdog=panic' on the kernel command line.

This makes it awkward for end users and folks who want to run automated tests (like myself).

Mimic the softlockup_panic knob and create a /proc/sys/kernel/hardlockup_panic knob.

Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jiri Kosina
In many cases of hardlockup reports, it's actually not possible to know why it triggered, because the CPU that got stuck is usually waiting on a resource (with IRQs disabled) that some other CPU is holding.

IOW, we are often looking at the stacktrace of the victim and not the actual offender.

Introduce a sysctl / cmdline parameter that makes it possible to have the hardlockup detector perform an all-CPU backtrace.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
If kthread_park() returns an error, watchdog_park_threads() should not blindly 'roll back' the already parked threads to the unparked state. Instead leave it up to the callers to handle such errors appropriately in their context. For example, it is redundant to unpark the threads if the lockup detectors will soon be disabled by the callers anyway.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
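A minimal user-space sketch of the "no blind rollback" behaviour follows. `fake_park()`, the `parked[]` array, and the `fail_at` hook are hypothetical stand-ins for kthread_park() and per-CPU thread state; only the control flow is meant to mirror the description above.

```c
#define NR_THREADS 4

/* Hypothetical per-thread state: 1 = parked, 0 = running. */
int parked[NR_THREADS];

/* Test hook: index of the thread whose park attempt fails, -1 for none. */
int fail_at = -1;

/* Stand-in for kthread_park(): returns 0 or a negative error. */
int fake_park(int i)
{
	if (i == fail_at)
		return -1;
	parked[i] = 1;
	return 0;
}

/*
 * Park the threads in order. On the first error, stop and report it;
 * threads that are already parked stay parked, and the caller decides
 * how to recover in its own context.
 */
int park_threads(void)
{
	int ret = 0;

	for (int i = 0; i < NR_THREADS; i++) {
		ret = fake_park(i);
		if (ret)
			break;
	}
	return ret;
}
```

A caller that is about to disable the detectors anyway can ignore the partially parked state, which is the redundancy the message points out.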
-
Committed by Ulrich Obergfell
lockup_detector_suspend() now handles errors from watchdog_park_threads().

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
update_watchdog_all_cpus() now passes errors from watchdog_park_threads() up to functions in the call chain. This allows watchdog_enable_all_cpus() and proc_watchdog_update() to handle such errors too.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
Move watchdog_disable_all_cpus() outside of the ifdef so that it is available if CONFIG_SYSCTL is not defined. This is preparation for "watchdog: implement error handling in update_watchdog_all_cpus() and callers".

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Ulrich Obergfell
The original watchdog_park_threads() function that was introduced by commit 81a4beef ("watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads()") takes a very simple approach to handle errors returned by kthread_park(): It attempts to roll back all watchdog threads to the unparked state. However, this may be undesired behaviour from the perspective of the caller which may want to handle errors as appropriate in its specific context. Currently, there are two possible call chains:

- watchdog suspend/resume interface

    lockup_detector_suspend
      watchdog_park_threads

- write to parameters in /proc/sys/kernel

    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Instead of 'blindly' attempting to unpark the watchdog threads if a kthread_park() call fails, the new approach is to disable the lockup detectors in the above call chains. Failure becomes visible to the user as follows:

- error messages from lockup_detector_suspend() or watchdog_enable_all_cpus()
- the state that can be read from /proc/sys/kernel/watchdog_enabled
- the 'write' system call in the latter call chain returns an error

I did not experience kthread_park() failures in practice; I used some instrumentation to fake error returns from kthread_park() in order to test the patches.

This patch (of 5):

Restore the previous value of watchdog_thresh _and_ sample_period if proc_watchdog_update() returns an error. The variables must be consistent to avoid false positives of the lockup detectors.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
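The restore step in "This patch (of 5)" might look roughly like the following user-space model. The sample period formula (twice the threshold, divided by five, in nanoseconds) is an assumption of this sketch, and `fake_update()`, `write_thresh()`, and `update_should_fail` are hypothetical names standing in for proc_watchdog_update() and the /proc write handler.

```c
/* Stand-ins for the kernel variables. */
int watchdog_thresh = 10;
unsigned long long sample_period = 4000000000ULL;

/* Test hook: make the fake update step fail. */
int update_should_fail;

/* Assumed formula: (2 * thresh) seconds / 5, expressed in nanoseconds. */
void set_sample_period(void)
{
	sample_period = (unsigned long long)watchdog_thresh * 2 *
			(1000000000ULL / 5);
}

/* Stand-in for proc_watchdog_update(): returns 0 or a negative error. */
int fake_update(void)
{
	return update_should_fail ? -1 : 0;
}

/* Hypothetical write handler: apply a new threshold, undo it on error. */
int write_thresh(int new_thresh)
{
	int old_thresh = watchdog_thresh;
	int err;

	watchdog_thresh = new_thresh;
	set_sample_period();

	err = fake_update();
	if (err) {
		/* Restore both values so they stay consistent. */
		watchdog_thresh = old_thresh;
		set_sample_period();
	}
	return err;
}
```

Recomputing the period from the restored threshold, rather than restoring only one of the two variables, is what keeps them consistent after a failed write.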
-
Committed by Yaowei Bai
Make is_hardlockup return bool to improve readability due to this particular function only using either one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 04 Nov, 2015 8 commits
-
-
Committed by Paul Moore
It always returns zero and no one is checking the return value.

Signed-off-by: Paul Moore <pmoore@redhat.com>
-
Committed by Saurabh Sengar
Variable rc is not required, as it is only returned unchanged and the return value is always 0 in this function.

Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
[PM: fixed spelling errors in description]
Signed-off-by: Paul Moore <pmoore@redhat.com>
-
Committed by Scott Matheina
Signed-off-by: Scott Matheina <scott@matheina.com>
[PM: fixed subject line]
Signed-off-by: Paul Moore <pmoore@redhat.com>
-
Committed by Yaowei Bai
This patch makes audit_tree_match return bool to improve readability due to this particular function only using either one or zero as its return value.

No functional change.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
[PM: tweaked the subject line]
Signed-off-by: Paul Moore <pmoore@redhat.com>
-
Committed by Yaowei Bai
This patch makes audit_string_contains_control return bool to improve readability due to this particular function only using either one or zero as its return value.

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
[PM: tweaked subject line]
Signed-off-by: Paul Moore <pmoore@redhat.com>
-
Committed by Richard Guy Briggs
There are several reports of the kernel losing contact with auditd when it is, in fact, still running. When this happens, kernel syslogs show:

    "audit: *NO* daemon at audit_pid=<pid>"

although auditd is still running, and is apparently happy, listening on the netlink socket. The pid in the "*NO* daemon" message matches the pid of the running auditd process. Restarting auditd solves this.

The problem appears to happen randomly, and doesn't seem to be strongly correlated to the rate of audit events being logged. The problem happens fairly regularly (every few days), but has not yet been reproduced on demand.

On production kernels, BUG_ON() is a no-op, so any error will trigger this.

Commit 34eab0a7 ("audit: prevent an older auditd shutdown from orphaning a newer auditd startup") eliminates one possible cause. This isn't the case here, since the PID in the error message and the PID of the running auditd match.

The primary expected cause of error here is -ECONNREFUSED when the audit daemon goes away, when netlink_getsockbyportid() can't find the auditd portid entry in the netlink audit table (or there is no receive function). If -EPERM is returned, that situation isn't likely to be resolved in a timely fashion without administrator intervention. In both cases, reset the audit_pid. This does not rule out a race condition. SELinux is expected to return zero since this isn't an INET or INET6 socket. Other LSMs may have other return codes. Log the error code for better diagnosis in the future.

In the case of -ENOMEM, the situation could be temporary, based on local or general availability of buffers. -EAGAIN should never happen since the netlink audit (kernel) socket is set to MAX_SCHEDULE_TIMEOUT. -ERESTARTSYS and -EINTR are not expected since this kernel thread is not expected to receive signals. In these cases (or any other unexpected ones for now), report the error and re-schedule the thread, retrying up to 5 times.

v2: Removed BUG_ON().
    Moved comma in pr_*() statements.
    Removed audit_strerror() text.

Reported-by: Vipin Rathor <v.rathor@gmail.com>
Reported-by: <ctcard@hotmail.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: applied rgb's fixup patch to correct audit_log_lost() format issues]
Signed-off-by: Paul Moore <pmoore@redhat.com>
-
Committed by Linus Torvalds
This seems to be a mis-reading of how alpha memory ordering works, and is not backed up by the alpha architecture manual. The helper functions don't do anything special on any other architectures, and the arguments that support them being safe on other architectures also argue that they are safe on alpha.

Basically, the "control dependency" is between a previous read and a subsequent write that is dependent on the value read. Even if the subsequent write is actually done speculatively, there is no way that such a speculative write could be made visible to other cpu's until it has been committed, which requires validating the speculation.

Note that most weakly ordered architectures (very much including alpha) do not guarantee any ordering relationship between two loads that depend on each other on a control dependency:

    read A
    if (val == 1)
        read B

because the conditional may be predicted, and the "read B" may be speculatively moved up to before reading the value A. So we require the user to insert a smp_rmb() between the two accesses to be correct:

    read A;
    if (A == 1)
        smp_rmb()
    read B

Alpha is further special in that it can break that ordering even if the *address* of B depends on the read of A, because the cacheline that is read later may be stale unless you have a memory barrier in between the pointer read and the read of the value behind a pointer:

    read ptr
    read offset(ptr)

whereas all other weakly ordered architectures guarantee that the data dependency (as opposed to just a control dependency) will order the two accesses. As a result, alpha needs a "smp_read_barrier_depends()" in between those two reads for them to be ordered.

The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()" had was a control dependency to a subsequent *write*, however, and nobody can finalize such a subsequent write without having actually done the read. And were you to write such a value to a "stale" cacheline (the way the unordered reads came to be), that would seem to lose the write entirely.

So the things that make alpha able to re-order reads even more aggressively than other weak architectures do not seem to be relevant for a subsequent write. Alpha memory ordering may be strange, but there's no real indication that it is *that* strange.

Also, the alpha architecture reference manual very explicitly talks about the definition of "Dependence Constraints" in section 5.6.1.7, where a preceding read dominates a subsequent write.

Such a dependence constraint admittedly does not impose a BEFORE (alpha architecture term for globally visible ordering), but it does guarantee that there can be no "causal loop". I don't see how you could avoid such a loop if another cpu could see the stored value and then impact the value of the first read. Put another way: the read and the write could not be seen as being out of order wrt other cpus.

So I do not see how these "x_ctrl()" functions can currently be necessary.

I may have to eat my words at some point, but in the absence of clear proof that alpha actually needs this, or indeed even an explanation of how alpha could _possibly_ need it, I do not believe these functions are called for.

And if it turns out that alpha really _does_ need a barrier for this case, that barrier still should not be "smp_read_barrier_depends()". We'd have to make up some new speciality barrier just for alpha, along with the documentation for why it really is necessary.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul E McKenney <paulmck@us.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
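The two access patterns contrasted in this message can be written down with C11 atomics as a rough user-space analogue. This is only a sketch: the acquire fence stands in for smp_rmb(), the variable and function names are invented for illustration, and ISO C itself makes no formal ordering promise for the relaxed load-then-store case, which is a hardware-level argument in the message above.

```c
#include <stdatomic.h>

atomic_int flag;	/* plays the role of "A" */
atomic_int data;	/* plays the role of "B" */
atomic_int out;		/* target of the dependent write */

/* Writer side: publish the data, then set the flag. */
void publish(int value)
{
	atomic_store_explicit(&data, value, memory_order_relaxed);
	atomic_store_explicit(&flag, 1, memory_order_release);
}

/*
 * Load -> load across a branch. The branch alone does not order the
 * two reads, so a read barrier is needed; here an acquire fence takes
 * the place of smp_rmb().
 */
int read_if_ready(void)
{
	if (atomic_load_explicit(&flag, memory_order_relaxed) == 1) {
		atomic_thread_fence(memory_order_acquire);
		return atomic_load_explicit(&data, memory_order_relaxed);
	}
	return -1;
}

/*
 * Load -> store across a branch. This is the "control dependency to a
 * subsequent write" shape: the store depends on the value just read,
 * and no extra barrier is written between the two.
 */
void store_if_ready(void)
{
	if (atomic_load_explicit(&flag, memory_order_relaxed) == 1)
		atomic_store_explicit(&out, 1, memory_order_relaxed);
}
```

The point of the message is that the second shape does not need a dedicated helper, whereas the first always needs an explicit barrier.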
-
Committed by Daniel Borkmann
The verbose() printer dumps the verifier state to user space, so let gcc take care to check calls to verbose() for (future) errors. make with W=1 correctly suggests: function might be possible candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format].

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
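As a user-space illustration of the format attribute, the sketch below annotates a printf-style function so that the compiler checks its arguments. The `log_buf` buffer and this body of `verbose()` are assumptions for the example and do not reproduce the verifier's code.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical log buffer for this example. */
char log_buf[128];

/*
 * Argument 1 is the format string and the variadic arguments start at
 * position 2, so gcc and clang can type-check every call site.
 */
__attribute__((format(printf, 1, 2)))
void verbose(const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	vsnprintf(log_buf, sizeof(log_buf), fmt, args);
	va_end(args);
}
```

With the attribute in place, a call such as `verbose("R%d=%s", "ctx")` draws a -Wformat warning at compile time instead of misbehaving at run time.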
-
- 03 Nov, 2015 5 commits
-
-
Committed by Daniel Borkmann
This work adds support for "persistent" eBPF maps/programs. The term "persistent" is to be understood that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses the ELF object, extracts and loads maps/progs into the kernel, these file descriptors will be out of reach after the tc instance exits. So a subsequent tc invocation won't be able to access/relocate on this resource, and therefore maps cannot easily be shared, f.e. between the ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be instrumented in order to pass the created eBPF map/program file descriptors to a third party management daemon through UDS' socket passing facility. This makes it a bit complicated to deploy shared eBPF maps or programs (programs f.e. for tail calls) among various processes.

We've been brainstorming on how we could tackle this issue and various approaches have been tried out so far, which can be read up further in the below reference.

The architecture we eventually ended up with is a minimal file system that can hold map/prog objects. The file system is a per mount namespace singleton, and the default mount point is /sys/fs/bpf/. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through bpf(2) with two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor along with a pathname is being passed to bpf(2) that in turn creates (we call it eBPF object pinning) the file system nodes. Only the pathname is being passed to bpf(2) for getting a new BPF file descriptor to an existing node. The user can use that to access maps and progs later on, through bpf(2). Removal of file system nodes is being managed through normal VFS functions such as unlink(2), etc. The file system code is kept to a bare minimum and can be further extended later on.

The next step I'm working on is to add dump eBPF map/prog commands to bpf(2), so that a specification from a given file descriptor can be retrieved. This can be used by things like CRIU but also applications can inspect the meta data after calling BPF_OBJ_GET.

Big thanks also to Alexei and Hannes who significantly contributed in the design discussion that eventually let us end up with this architecture here.

Reference: https://lkml.org/lkml/2015/10/15/925
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
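From user space, the two new commands could be wrapped roughly as follows. This is a sketch that assumes a Linux system whose headers provide `<linux/bpf.h>` with BPF_OBJ_PIN/BPF_OBJ_GET; the wrapper names `pin_bpf_object()` and `get_bpf_object()` are made up for this example, and any path used with them is only illustrative.

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Pin the map/prog behind fd at pathname, which is expected to live
 * inside a mounted bpf file system. Returns 0, or -1 with errno set.
 */
int pin_bpf_object(int fd, const char *pathname)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)pathname;
	attr.bpf_fd = (__u32)fd;

	return (int)syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

/*
 * Fetch a new file descriptor for a previously pinned object. Only the
 * pathname is passed. Returns the fd, or -1 with errno set.
 */
int get_bpf_object(const char *pathname)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (__u64)(unsigned long)pathname;

	return (int)syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}
```

A loader would call `pin_bpf_object()` once after creating a map, and a later process would call `get_bpf_object()` with the same path to share it; removing the pin is a plain unlink(2) on that path.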
-
Committed by Daniel Borkmann
We currently have duplicated cleanup code in bpf_prog_put() and bpf_prog_put_rcu() cleanup paths. Back then we decided that it was not worth it to make it a common helper called by both, but with the recent addition of resource charging, we could have avoided the fix in commit ac00737f ("bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put") if we would have had only a single, common path. We can simplify it further by assigning aux->prog only once during allocation time.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Daniel Borkmann
Add a bpf_map_get() function that we're going to use later on and align/clean the remaining helpers a bit so that we have them a bit more consistent:

- __bpf_map_get() and __bpf_prog_get() that both work on the fd struct, check whether the descriptor is eBPF and return the pointer to the map/prog stored in the private data.

  Also, we can return f.file->private_data directly, the function signature is enough of a documentation already.

- bpf_map_get() and bpf_prog_get() that both work on u32 user fd, call their respective __bpf_map_get()/__bpf_prog_get() variants, and take a reference.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Daniel Borkmann
Since we're going to use anon_inode_getfd() invocations in more than just the current places, make a helper function for both, so that we only need to pass a map/prog pointer to the helper itself in order to get a fd. The new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yang Shi
When running bpf samples on rt kernel, it reports the below warning:

    BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
    in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
    Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228

    CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
    Hardware name: Freescale Layerscape 2085a RDB Board (DT)
    Call trace:
    [<ffff80000008a5b0>] dump_backtrace+0x0/0x128
    [<ffff80000008a6f8>] show_stack+0x20/0x30
    [<ffff8000007da90c>] dump_stack+0x7c/0xa0
    [<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
    [<ffff8000007e2200>] rt_spin_lock+0x28/0x40
    [<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
    [<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
    [<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
    [<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
    [<ffff80000017db58>] kprobe_perf_func+0x30/0x228
    [<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
    [<ffff8000007e399c>] kprobe_handler+0x114/0x250
    [<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
    [<ffff800000085b80>] brk_handler+0x88/0x98
    [<ffff8000000822f0>] do_debug_exception+0x50/0xb8
    Exception stack(0xffff808349687460 to 0xffff808349687580)
    7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
    7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
    74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
    74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
    74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
    7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
    7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
    7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
    7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77

Convert hashtab lock to raw lock to avoid such warning.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 30 Oct, 2015 1 commit
-
-
Committed by Davidlohr Bueso
This is really about simplifying the double xchg patterns into a single cmpxchg, with the same logic. Other than the immediate cleanup, there are some subtleties this change deals with:

(i) While the load of the old bt is fully ordered wrt everything, ie:

        old_bt = xchg(&q->blk_trace, bt);     [barrier]
        if (old_bt)
            (void) xchg(&q->blk_trace, old_bt);    [barrier]

blk_trace could still be changed between the xchg and the old_bt load. Note that this race is merely theoretical and, afaict, the window is very small, but doing everything in a single context with cmpxchg closes this potential race.

(ii) Ordering guarantees are obviously kept with cmpxchg.

(iii) Gets rid of the hacky-by-nature (void)xchg pattern.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
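The difference between the two patterns can be modelled in user space with C11 atomics. The sketch below uses atomic_exchange() and atomic_compare_exchange_strong() as stand-ins for the kernel's xchg() and cmpxchg(); the `queue_trace` slot and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct blk_trace {
	int id;
};

/* Hypothetical stand-in for q->blk_trace. */
_Atomic(struct blk_trace *) queue_trace;

/*
 * Double-exchange pattern: unconditionally swap bt in, and if the slot
 * was occupied, swap the previous pointer back. Between the two
 * exchanges the slot briefly holds bt even though the call fails.
 */
int install_xchg(struct blk_trace *bt)
{
	struct blk_trace *old = atomic_exchange(&queue_trace, bt);

	if (old) {
		(void)atomic_exchange(&queue_trace, old);
		return -1;
	}
	return 0;
}

/*
 * Single compare-and-exchange: install bt only if the slot is empty,
 * in one atomic step.
 */
int install_cmpxchg(struct blk_trace *bt)
{
	struct blk_trace *expected = NULL;

	if (atomic_compare_exchange_strong(&queue_trace, &expected, bt))
		return 0;
	return -1;
}
```

In the second form a failed attempt never modifies the slot, so other threads can never observe the transient value that the first form exposes.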
-
- 29 Oct, 2015 1 commit
-
-
Committed by Tejun Heo
css_task_iter_next() checked @it->cur_task before grabbing css_set_lock and assumed that the result won't change afterwards; however, tasks could leave the cgroup being iterated terminating the iterator before css_set_lock is acquired. If this happens, css_task_iter_next() tries to calculate the current task from NULL cg_list pointer leading to the following oops.

    BUG: unable to handle kernel paging request at fffffffffffff7d0
    IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
    ...
    CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1
    Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014
    task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000
    RIP: 0010:[<ffffffff810d5f22>]  [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
    RSP: 0018:ffff88083404fd28  EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0
    RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278
    RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0
    R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400
    R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58
    FS:  00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0
    Stack:
     0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248
     ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee
     0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000
    Call Trace:
     [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550
     [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20
     [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0
     [<ffffffff811cab76>] seq_read+0x166/0x380
     [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180
     [<ffffffff811a7398>] __vfs_read+0x18/0x50
     [<ffffffff811a745d>] vfs_read+0x8d/0x150
     [<ffffffff811a756f>] SyS_read+0x4f/0xb0
     [<ffffffff818d4772>] system_call_fastpath+0x12/0x17

Fix it by moving the termination condition check inside css_set_lock. @it->cur_task is now cleared after being put and @it->task_pos is tested for termination instead of @it->cset_pos as they indicate the same condition and @it->task_pos is what's being dereferenced.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Calvin Owens <calvinowens@fb.com>
Fixes: ed27b9f7 ("cgroup: don't hold css_set_rwsem across css task iteration")
Acked-by: Zefan Li <lizefan@huawei.com>
-
- 28 Oct, 2015 1 commit
-
-
Committed by Tycho Andersen
This patch adds support for dumping a process's (classic BPF) seccomp filters via ptrace.

PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic BPF seccomp filters. addr should be an integer which represents the ith seccomp filter (0 is the most recently installed filter). data should be a struct sock_filter * with enough room for the ith filter, or NULL, in which case the filter is not saved. The return value for this command is the number of BPF instructions the program represents, or negative in the case of errors. Command-specific errors are ENOENT, which indicates that there is no ith filter in this seccomp tree, and EMEDIUMTYPE, which indicates that the ith filter was not installed as a classic BPF filter.

A caveat with this approach is that there is no way to get explicitly at the hierarchy of seccomp filters, and users need to memcmp() filters to decide which are inherited. This means that a task which installs two of the same filter can potentially confuse users of this interface.

v2:
* make save_orig const
* check that the orig_prog exists (not necessary right now, but when it grows eBPF support it will be)
* s/n/filter_off and make it an unsigned long to match ptrace
* count "down" the tree instead of "up" when passing a filter offset
v3:
* don't take the current task's lock for inspecting its seccomp mode
* use a 0x42** constant for the ptrace command value
v4:
* don't copy to userspace while holding spinlocks
v5:
* add another condition to WARN_ON
v6:
* rebase on net-next

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
CC: Will Drewry <wad@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Serge E. Hallyn <serge.hallyn@ubuntu.com>
CC: Alexei Starovoitov <ast@kernel.org>
CC: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
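
The calling convention described in this commit message (addr is the filter index, data is either NULL or a buffer, the return value is the instruction count) suggests a two-call pattern from the tracer's side. The following is a minimal user-space sketch of that pattern, not code from the patch. The helper name dump_seccomp_filter is hypothetical, the fallback value 0x420c is the one found in the upstream uapi header, and the sketch assumes the tracee is already attached and stopped and that the kernel permits the request (it is typically restricted to privileged tracers).

```c
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Older libc headers may lack the constant; value taken from the uapi header. */
#ifndef PTRACE_SECCOMP_GET_FILTER
#define PTRACE_SECCOMP_GET_FILTER 0x420c
#endif

/*
 * Hypothetical helper: dump the index-th classic BPF seccomp filter of a
 * tracee that is already attached and stopped (index 0 is the most recently
 * installed filter). On success returns the number of instructions and
 * stores a malloc()ed array in *out (caller frees). On failure returns -1
 * with errno set by ptrace() or calloc(), and leaves *out untouched.
 */
long dump_seccomp_filter(pid_t pid, unsigned long index, struct sock_filter **out)
{
    assert(out != NULL);

    /* First call with data == NULL: only ask how many instructions exist. */
    long n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)index, NULL);
    if (n < 0)
        return -1;
    if (n == 0) {
        *out = NULL;
        return 0;
    }

    struct sock_filter *insns = calloc((size_t)n, sizeof(*insns));
    if (insns == NULL)
        return -1;

    /* Second call: let the kernel copy the instructions into our buffer. */
    if (ptrace(PTRACE_SECCOMP_GET_FILTER, pid, (void *)index, insns) < 0) {
        int saved = errno;
        free(insns);
        errno = saved;
        return -1;
    }

    *out = insns;
    return n;
}
```

A caller would typically loop over index values starting at 0 until the helper fails with ENOENT, treating EMEDIUMTYPE as "this filter was not installed as classic BPF".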
-
- 27 Oct, 2015 3 commits
-
-
Committed by Alexei Starovoitov
Exported perf symbols are GPL-only; mark eBPF helper functions used in tracing as GPL-only as well.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Alexei Starovoitov
Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count

Otherwise a buggy bpf program can cause a kernel splat.

Also fix the error path after perf_event_attrs() and remove a redundant 'extern'.

Fixes: 35578d79 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Dan Williams
Currently memremap checks if the range is "System RAM" and returns the kernel linear address. This is broken for highmem platforms where a range may be "System RAM", but is not part of the kernel linear mapping. Fall back to ioremap_cache() in these cases, to let the arch code attempt to handle it.

Note that ARM ioremap will WARN when attempting to remap RAM, and in that case the caller needs to be fixed. For this reason, existing ioremap_cache() usages for ARM are already trained to avoid attempts to remap RAM.

The impact of this bug is low for now since the pmem driver is the only user of memremap(), but this is important to fix before more conversions to memremap arrive in 4.4.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
-
- 26 Oct, 2015 1 commit
-
-
Committed by Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: hofrat@osadl.org
Link: http://lkml.kernel.org/r/1436894685-5868-1-git-send-email-Jason@zx2c4.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
- 23 Oct, 2015 2 commits
-
-
Committed by Peter Zijlstra
Luca and Wanpeng reported two missing annotations that led to false lockdep complaints. Add the missing annotations.

Reported-by: Luca Abeni <luca.abeni@unitn.it>
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: cbce1a68 ("sched,lockdep: Employ lock pinning")
Link: http://lkml.kernel.org/r/20151023095008.GY17308@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Oleg Nesterov
call_usermodehelper_exec_sync() does fork() + wait() with "unignored" SIGCHLD. What we have missed is that this worker thread can have other children previously forked by call_usermodehelper_exec_work() without UMH_WAIT_PROC. If such a child exits in between, it becomes a zombie because auto-reaping only works if SIGCHLD is ignored, and nobody can reap it (unless/until this worker thread exits too).

Change the !UMH_WAIT_PROC case to use CLONE_PARENT.

Note: this is only the first step. All PF_KTHREAD tasks, even those created by kernel_thread(), should have ->parent == kthreadd by default.

Fixes: bb304a5c ("kmod: handle UMH_WAIT_PROC from system unbound workqueue")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 22 Oct, 2015 3 commits
-
-
Committed by Geliang Tang
pstore doesn't support unregistering yet. It was marked as TODO. This patch adds some code to fix it:
1) Add functions to unregister kmsg/console/ftrace/pmsg.
2) Add a function to free the compression buffer.
3) Unmap the memory and free it.
4) Add a function to unregister the pstore filesystem.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Kees Cook <keescook@chromium.org>
[Removed __exit annotation from ramoops_remove(). Reported by Arnd Bergmann]
Signed-off-by: Tony Luck <tony.luck@intel.com>
-
Committed by Alexei Starovoitov
This helper is used to send raw data from an eBPF program into a special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event. User space needs to perf_event_open() it (either for one or all CPUs) and store the FD into a perf_event_array (similar to the bpf_perf_event_read() helper) before the eBPF program can send data into it.

Today the programs triggered by kprobe collect the data and either store it into maps or print it via bpf_trace_printk(), where the latter is a debug facility and not suitable for streaming data. This new helper replaces such bpf_trace_printk() usage and allows programs to have a dedicated channel into user space for post-processing of the raw data collected.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
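
The user-space half of the setup described in this commit message can be sketched as follows. This is an illustrative sketch and not code from the patch: the helper names init_bpf_output_attr and open_bpf_output_event are hypothetical, and the choice of PERF_SAMPLE_RAW with a sample period of 1 is an assumption about a sensible configuration. The constant PERF_COUNT_SW_BPF_OUTPUT requires kernel headers that already include this change.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe the software event that an eBPF program
 * writes into. Raw samples with a period of 1 are an assumed configuration
 * so that every record sent by the program is delivered.
 */
void init_bpf_output_attr(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;
    attr->config = PERF_COUNT_SW_BPF_OUTPUT;
    attr->sample_type = PERF_SAMPLE_RAW;
    attr->sample_period = 1;
    attr->wakeup_events = 1;
}

/*
 * Hypothetical helper: open the event on one CPU for all tasks (pid == -1).
 * Returns the perf event FD, or -1 with errno set. There is no libc wrapper
 * for perf_event_open(), so the raw syscall is used.
 */
int open_bpf_output_event(int cpu)
{
    struct perf_event_attr attr;

    init_bpf_output_attr(&attr);
    return (int)syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0UL);
}
```

The returned FD would then be stored into the perf_event_array map, keyed by CPU, before the eBPF program starts sending data. Opening a CPU-wide event usually needs elevated privileges.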
-
Committed by Alexei Starovoitov
Instead of WARN_ON in perf_event_output() on unpadded raw samples, pad them automatically.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
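
The padding arithmetic can be illustrated with a small stand-alone function. This sketch rests on an assumption not stated in the commit message: that a raw sample is stored as a 4-byte size field followed by the payload, and that the two together must occupy a multiple of 8 bytes. The function name padded_raw_size is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: given a raw payload of raw_size bytes, return the
 * payload size to record so that the 4-byte size field plus the payload is
 * a multiple of 8 bytes (assumed record layout). The difference between the
 * result and raw_size is the number of zero bytes to append.
 */
uint32_t padded_raw_size(uint32_t raw_size)
{
    /* Keep the additions below from wrapping around. */
    assert(raw_size <= UINT32_MAX - 11u);

    uint32_t total = raw_size + (uint32_t)sizeof(uint32_t);
    uint32_t rounded = (total + 7u) & ~7u;   /* round up to a multiple of 8 */

    return rounded - (uint32_t)sizeof(uint32_t);
}
```

For example, a 5-byte payload gives 9 bytes with the size field, which rounds up to 16, so the recorded payload size becomes 12 and 7 zero bytes are appended. A 4-byte payload already fits and stays at 4.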
-