- 31 3月, 2018 2 次提交
-
-
由 Andrey Ignatov 提交于
== The problem == There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers. E.g. context structure may have fields for both IPv4 and IPv6 but it doesn't make sense to read from / write to IPv6 field when attach point is somewhere in IPv4 stack. Same applies to BPF-helpers: it may make sense to call some helper from some attach point, but not from other for same prog type. == The solution == Introduce `expected_attach_type` field in in `struct bpf_attr` for `BPF_PROG_LOAD` command. If scenario described in "The problem" section is the case for some prog type, the field will be checked twice: 1) At load time prog type is checked to see if attach type for it must be known to validate program permissions correctly. Prog will be rejected with EINVAL if it's the case and `expected_attach_type` is not specified or has invalid value. 2) At attach time `attach_type` is compared with `expected_attach_type`, if prog type requires to have one, and, if they differ, attach will be rejected with EINVAL. The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` () and can be used to check context accesses and calls to helpers correspondingly. Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and Daniel Borkmann <daniel@iogearbox.net> here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2Signed-off-by: NAndrey Ignatov <rdna@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Prashant Bhole 提交于
When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized in sg_init_table() and it is verified in sg api while navigating. We hit BUG_ON when magic check is failed. In functions sg_tcp_sendpage and sg_tcp_sendmsg, the struct containing the scatterlist is already zeroed out. So to avoid extra memset, we use sg_init_marker() to initialize sg_magic. Fixed following things: - In bpf_tcp_sendpage: initialize sg using sg_init_marker - In bpf_tcp_sendmsg: Replace sg_init_table with sg_init_marker - In bpf_tcp_push: Replace memset with sg_init_table where consumed sg entry needs to be re-initialized. Signed-off-by: NPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 30 3月, 2018 2 次提交
-
-
由 John Fastabend 提交于
Add support for the BPF_F_INGRESS flag in skb redirect helper. To do this convert skb into a scatterlist and push into ingress queue. This is the same logic that is used in the sk_msg redirect helper so it should feel familiar. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 John Fastabend 提交于
Add support for the BPF_F_INGRESS flag in sk_msg redirect helper. To do this add a scatterlist ring for receiving socks to check before calling into regular recvmsg call path. Additionally, because the poll wakeup logic only checked the skb recv queue we need to add a hook in TCP stack (similar to write side) so that we have a way to wake up polling socks when a scatterlist is redirected to that sock. After this all that is needed is for the redirect helper to push the scatterlist into the psock receive queue. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 29 3月, 2018 1 次提交
-
-
由 Alexei Starovoitov 提交于
Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access kernel internal arguments of the tracepoints in their raw form. >From bpf program point of view the access to the arguments look like: struct bpf_raw_tracepoint_args { __u64 args[0]; }; int bpf_prog(struct bpf_raw_tracepoint_args *ctx) { // program can read args[N] where N depends on tracepoint // and statically verified at program load+attach time } kprobe+bpf infrastructure allows programs access function arguments. This feature allows programs access raw tracepoint arguments. Similar to proposed 'dynamic ftrace events' there are no abi guarantees to what the tracepoints arguments are and what their meaning is. The program needs to type cast args properly and use bpf_probe_read() helper to access struct fields when argument is a pointer. For every tracepoint __bpf_trace_##call function is prepared. In assembler it looks like: (gdb) disassemble __bpf_trace_xdp_exception Dump of assembler code for function __bpf_trace_xdp_exception: 0xffffffff81132080 <+0>: mov %ecx,%ecx 0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3> where TRACE_EVENT(xdp_exception, TP_PROTO(const struct net_device *dev, const struct bpf_prog *xdp, u32 act), The above assembler snippet is casting 32-bit 'act' field into 'u64' to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is. All of ~500 of __bpf_trace_*() functions are only 5-10 byte long and in total this approach adds 7k bytes to .text. This approach gives the lowest possible overhead while calling trace_xdp_exception() from kernel C code and transitioning into bpf land. Since tracepoint+bpf are used at speeds of 1M+ events per second this is valuable optimization. The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced that returns anon_inode FD of 'bpf-raw-tracepoint' object. The user space looks like: // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type prog_fd = bpf_prog_load(...); // receive anon_inode fd for given bpf_raw_tracepoint with prog attached raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd); Ctrl-C of tracing daemon or cmdline tool that uses this feature will automatically detach bpf program, unload it and unregister tracepoint probe. On the kernel side the __bpf_raw_tp_map section of pointers to tracepoint definition and to __bpf_trace_*() probe function is used to find a tracepoint with "xdp_exception" name and corresponding __bpf_trace_xdp_exception() probe function which are passed to tracepoint_probe_register() to connect probe with tracepoint. Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf tracepoint mechanisms. perf_event_open() can be used in parallel on the same tracepoint. Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted. Each with its own bpf program. The kernel will execute all tracepoint probes and all attached bpf programs. In the future bpf_raw_tracepoints can be extended with query/introspection logic. __bpf_raw_tp_map section logic was contributed by Steven Rostedt Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 28 3月, 2018 1 次提交
-
-
由 Shaohua Li 提交于
Generally we do a preload before doing idr allocation. This also help improve the allocation success rate in memory pressure. Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 26 3月, 2018 2 次提交
-
-
由 Martin KaFai Lau 提交于
The BTF (BPF Type Format) verifier needs to reuse the current BPF verifier log. Hence, it requires the following changes: (1) Expose log_write() in verifier.c for other users. Its name is renamed to bpf_verifier_vlog(). (2) The BTF verifier also needs to check 'log->level && log->ubuf && !bpf_verifier_log_full(log);' independently outside of the current log_write(). It is because the BTF verifier will do one-check before making multiple calls to btf_verifier_vlog to log the details of a type. Hence, this check is also re-factored to a new function bpf_verifier_log_needed(). Since it is re-factored, we can check it before va_start() in the current bpf_verifier_log_write() and verbose(). Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> -
由 Martin KaFai Lau 提交于
bpf_verifer_log => bpf_verifier_log Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Acked-by: NAlexei Starovoitov <ast@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 24 3月, 2018 1 次提交
-
-
由 Jiri Olsa 提交于
We use print_bpf_insn in user space (bpftool and soon perf), so it'd be nice to keep it generic and strip it off the kernel struct bpf_verifier_env argument. This argument can be safely removed, because its users can use the struct bpf_insn_cbs::private_data to pass it. By changing the argument type we can no longer have clean 'verbose' alias to 'bpf_verifier_log_write' in verifier.c. Instead we're adding the 'verbose' cb_print callback and removing the alias. This way we have new cb_print callback in place, and all the 'verbose(env, ...) calls in verifier.c will cleanly cast to 'verbose(void *, ...)' so no other change is needed. Signed-off-by: NJiri Olsa <jolsa@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 21 3月, 2018 2 次提交
-
-
由 Chenbo Feng 提交于
The current check statement in BPF syscall will do a capability check for CAP_SYS_ADMIN before checking sysctl_unprivileged_bpf_disabled. This code path will trigger unnecessary security hooks on capability checking and cause false alarms on unprivileged process trying to get CAP_SYS_ADMIN access. This can be resolved by simply switch the order of the statement and CAP_SYS_ADMIN is not required anyway if unprivileged bpf syscall is allowed. Signed-off-by: NChenbo Feng <fengc@google.com> Acked-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Yonghong Song 提交于
Commit 4bebdc7a ("bpf: add helper bpf_perf_prog_read_value") added helper bpf_perf_prog_read_value so that perf_event type program can read event counter and enabled/running time. This commit, however, introduced a bug which allows this helper for tracepoint type programs. This is incorrect as bpf_perf_prog_read_value needs to access perf_event through its bpf_perf_event_data_kern type context, which is not available for tracepoint type program. This patch fixed the issue by separating bpf_func_proto between tracepoint and perf_event type programs and removed bpf_perf_prog_read_value from tracepoint func prototype. Fixes: 4bebdc7a ("bpf: add helper bpf_perf_prog_read_value") Reported-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NYonghong Song <yhs@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 20 3月, 2018 2 次提交
-
-
由 John Fastabend 提交于
This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. In order to support this a new program type BPF_PROG_TYPE_SK_MSG is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets a sockmap is used with a new program attach type BPF_SK_MSG_VERDICT. Similar to previous sockmap usages when a sock is added to a sockmap, via a map update, if the map contains a BPF_SK_MSG_VERDICT program type attached then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in sendmsg case and page/offset in sendpage case. BPF_PROG_TYPE_SK_MSG Semantics/API: BPF_PROG_TYPE_SK_MSG supports only two return codes SK_PASS and SK_DROP. Returning SK_DROP free's the copied data in the sendmsg case and in the sendpage case leaves the data untouched. Both cases return -EACESS to the user. Returning SK_PASS will allow the msg to be sent. In the sendmsg case data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element. In the sendpage case data is not copied. We opt not to copy the data by default here, because the BPF infrastructure does not know what bytes will be needed nor when they will be needed. So copying all bytes may be wasteful. Because of this the initial start/end data pointers are (0,0). Meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default so that the page is exclusively owned by the BPF call. The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the currently being processed message. The helper msg_redirect_map() can be used to select the socket to send the data on. This is used similar to existing redirect use cases. This allows policy to redirect msgs. Pseudo code simple example: The basic logic to attach a program to a socket is as follows, // load the programs bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG, &obj, &msg_prog); // lookup the sockmap bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map"); // get fd for sockmap map_fd_msg = bpf_map__fd(bpf_map_msg); // attach program to sockmap bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0); Adding sockets to the map is done in the normal way, // Add a socket 'fd' to sockmap at location 'i' bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY); After the above any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call. For a complete example see BPF selftests or sockmap samples. Implementation notes: It seemed the simplest, to me at least, to use a refcnt to ensure psock is not lost across the sendmsg copy into the sg, the bpf program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method. Another item that will come after basic support is in place is supporting MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass down the stack. Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work. Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series shortly. Future work could improve size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was being used already in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 John Fastabend 提交于
The sockmap refcnt up until now has been wrapped in the sk_callback_lock(). So its not actually needed any locking of its own. The counter itself tracks the lifetime of the psock object. Sockets in a sockmap have a lifetime that is independent of the map they are part of. This is possible because a single socket may be in multiple maps. When this happens we can only release the psock data associated with the socket when the refcnt reaches zero. There are three possible delete sock reference decrement paths first through the normal sockmap process, the user deletes the socket from the map. Second the map is removed and all sockets in the map are removed, delete path is similar to case 1. The third case is an asyncronous socket event such as a closing the socket. The last case handles removing sockets that are no longer available. For completeness, although inc does not pose any problems in this patch series, the inc case only happens when a psock is added to a map. Next we plan to add another socket prog type to handle policy and monitoring on the TX path. When we do this however we will need to keep a reference count open across the sendmsg/sendpage call and holding the sk_callback_lock() here (on every send) seems less than ideal, also it may sleep in cases where we hit memory pressure. Instead of dealing with these issues in some clever way simply make the reference counting a refcnt_t type and do proper atomic ops. Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 15 3月, 2018 1 次提交
-
-
由 Song Liu 提交于
Currently, bpf stackmap store address for each entry in the call trace. To map these addresses to user space files, it is necessary to maintain the mapping from these virtual address to symbols in the binary. Usually, the user space profiler (such as perf) has to scan /proc/pid/maps at the beginning of profiling, and monitor mmap2() calls afterwards. Given the cost of maintaining the address map, this solution is not practical for system wide profiling that is always on. This patch tries to solve this problem with a variation of stackmap. This variation is enabled by flag BPF_F_STACK_BUILD_ID. Instead of storing addresses, the variation stores ELF file build_id + offset. Build ID is a 20-byte unique identifier for ELF files. The following command shows the Build ID of /bin/bash: [user@]$ readelf -n /bin/bash ... Build ID: XXXXXXXXXX ... With BPF_F_STACK_BUILD_ID, bpf_get_stackid() tries to parse Build ID for each entry in the call trace, and translate it into the following struct: struct bpf_stack_build_id_offset { __s32 status; unsigned char build_id[BPF_BUILD_ID_SIZE]; union { __u64 offset; __u64 ip; }; }; The search of build_id is limited to the first page of the file, and this page should be in page cache. Otherwise, we fallback to store ip for this entry (ip field in struct bpf_stack_build_id_offset). This requires the build_id to be stored in the first page. A quick survey of binary and dynamic library files in a few different systems shows that almost all binary and dynamic library files have build_id in the first page. Build_id is only meaningful for user stack. If a kernel stack is added to a stackmap with BPF_F_STACK_BUILD_ID, it will automatically fallback to only store ip (status == BPF_STACK_BUILD_ID_IP). Similarly, if build_id lookup failed for some reason, it will also fallback to store ip. User space can access struct bpf_stack_build_id_offset with bpf syscall BPF_MAP_LOOKUP_ELEM. It is necessary for user space to maintain mapping from build id to binary files. This mostly static mapping is much easier to maintain than per process address maps. Note: Stackmap with build_id only works in non-nmi context at this time. This is because we need to take mm->mmap_sem for find_vma(). If this changes, we would like to allow build_id lookup in nmi context. Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 14 3月, 2018 3 次提交
-
-
由 Josh Poimboeuf 提交于
The kbuild test robot reported the following warning on sparc64: kernel/jump_label.c: In function '__jump_label_update': kernel/jump_label.c:376:51: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] WARN_ONCE(1, "can't patch jump_label at %pS", (void *)entry->code); On sparc64, the jump_label entry->code field is of type u32, but pointers are 64-bit. Silence the warning by casting entry->code to an unsigned long before casting it to a pointer. This is also what the sparc jump label code does. Fixes: dc1dd184 ("jump_label: Warn on failed jump_label patching attempt") Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Borislav Petkov <bp@suse.de> Cc: "David S . Miller" <davem@davemloft.net> Link: https://lkml.kernel.org/r/c966fed42be6611254a62d46579ec7416548d572.1521041026.git.jpoimboe@redhat.com -
由 Stephen Hemminger 提交于
Found this by accident. There are no usages of bare cancel_work() in current kernel source. Signed-off-by: NStephen Hemminger <stephen@networkplumber.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Arvind Yadav 提交于
Never directly free @dev after calling device_register(), even if it returned an error! Always use put_device() to give up the reference initialized in this function instead. Signed-off-by: NArvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 12 3月, 2018 1 次提交
-
-
由 Masami Hiramatsu 提交于
Since the kprobe which was optimized by jump can not change the execution path, the kprobe for error-injection must not be optimized. To prohibit it, set a dummy post-handler as officially stated in Documentation/kprobes.txt. Fixes: 4b1a29a7 ("error-injection: Support fault injection framework") Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 10 3月, 2018 1 次提交
-
-
由 Kees Cook 提交于
The BUG and stack protector reports were still using a raw %p. This changes it to %pB for more meaningful output. Link: http://lkml.kernel.org/r/20180301225704.GA34198@beast Fixes: ad67b74d ("printk: hash addresses printed with %p") Signed-off-by: NKees Cook <keescook@chromium.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Richard Weinberger <richard.weinberger@gmail.com>, Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 3月, 2018 4 次提交
-
-
由 Boqun Feng 提交于
When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a HARDIRQ-safe->HARDIRQ-unsafe deadlock: ================================ WARNING: inconsistent lock state 4.16.0-rc4+ #1 Not tainted -------------------------------- inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. takes: __schedule+0xbe/0xaf0 {IN-HARDIRQ-W} state was registered at: _raw_spin_lock+0x2a/0x40 scheduler_tick+0x47/0xf0 ... other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&rq->lock); <Interrupt> lock(&rq->lock); *** DEADLOCK *** 1 lock held by rcu_torture_rea/724: rcu_torture_read_lock+0x0/0x70 stack backtrace: CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 Call Trace: lock_acquire+0x90/0x200 ? __schedule+0xbe/0xaf0 _raw_spin_lock+0x2a/0x40 ? __schedule+0xbe/0xaf0 __schedule+0xbe/0xaf0 preempt_schedule_irq+0x2f/0x60 retint_kernel+0x1b/0x2d RIP: 0010:rcu_read_unlock_special+0x0/0x680 ? rcu_torture_read_unlock+0x60/0x60 __rcu_read_unlock+0x64/0x70 rcu_torture_read_unlock+0x17/0x60 rcu_torture_reader+0x275/0x450 ? rcutorture_booster_init+0x110/0x110 ? rcu_torture_stall+0x230/0x230 ? kthread+0x10e/0x130 kthread+0x10e/0x130 ? kthread_create_worker_on_cpu+0x70/0x70 ? call_usermodehelper_exec_async+0x11a/0x150 ret_from_fork+0x3a/0x50 This happens with the following even sequence: preempt_schedule_irq(); local_irq_enable(); __schedule(): local_irq_disable(); // irq off ... rcu_note_context_switch(): rcu_note_preempt_context_switch(): rcu_read_unlock_special(): local_irq_save(flags); ... raw_spin_unlock_irqrestore(...,flags); // irq remains off rt_mutex_futex_unlock(): raw_spin_lock_irq(); ... raw_spin_unlock_irq(); // accidentally set irq on <return to __schedule()> rq_lock(): raw_spin_lock(); // acquiring rq->lock with irq on which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause deadlocks in scheduler code. This problem was introduced by commit 02a7c234 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints"). That brought the user of rt_mutex_futex_unlock() with irq off. To fix this, replace the *lock_irq() in rt_mutex_futex_unlock() with *lock_irq{save,restore}() to make it safe to call rt_mutex_futex_unlock() with irq off. Fixes: 02a7c234 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints") Signed-off-by: NBoqun Feng <boqun.feng@gmail.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com
-
由 Quentin Monnet 提交于
When pinning a file under the BPF virtual file system (traditionally /sys/fs/bpf), using a dot in the name of the location to pin at is not allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be rejected with -EPERM. This check was introduced at the same time as the BPF file system itself, with commit b2197755 ("bpf: add support for persistent maps/progs"). At this time, it was checked in a function called "bpf_dname_reserved()", which made clear that using a dot was reserved for future extensions. This function disappeared and the check was moved elsewhere with commit 0c93b7d8 ("bpf: reject invalid names right in ->lookup()"), and the meaning of the dot ban was lost. The present commit simply adds a comment in the source to explain to the reader that the usage of dots is reserved for future usage. Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Song Liu 提交于
In ctx_resched(), EVENT_FLEXIBLE should be sched_out when EVENT_PINNED is added. However, ctx_resched() calculates ctx_event_type before checking this condition. As a result, pinned events will NOT get higher priority than flexible events. The following shows this issue on an Intel CPU (where ref-cycles can only use one hardware counter). 1. First start: perf stat -C 0 -e ref-cycles -I 1000 2. Then, in the second console, run: perf stat -C 0 -e ref-cycles:D -I 1000 The second perf uses pinned events, which is expected to have higher priority. However, because it failed in ctx_resched(). It is never run. This patch fixes this by calculating ctx_event_type after re-evaluating event_type. Reported-by: NEphraim Park <ephiepark@fb.com> Signed-off-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <jolsa@redhat.com> Cc: <kernel-team@fb.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Fixes: 487f05e1 ("perf/core: Optimize event rescheduling on active contexts") Link: http://lkml.kernel.org/r/20180306055504.3283731-1-songliubraving@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org> -
由 Leon Yu 提交于
otherwise kernel can oops later in seq_release() due to dereferencing null file->private_data which is only set if seq_open() succeeds. BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: seq_release+0xc/0x30 Call Trace: close_pdeo+0x37/0xd0 proc_reg_release+0x5d/0x60 __fput+0x9d/0x1d0 ____fput+0x9/0x10 task_work_run+0x75/0x90 do_exit+0x252/0xa00 do_group_exit+0x36/0xb0 SyS_exit_group+0xf/0x10 Fixes: 516fb7f2 ("/proc/module: use the same logic as /proc/kallsyms for address exposure") Cc: Jessica Yu <jeyu@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # 4.15+ Signed-off-by: NLeon Yu <chianglungyu@gmail.com> Signed-off-by: NJessica Yu <jeyu@kernel.org>
-
- 08 3月, 2018 1 次提交
-
-
由 Teng Qin 提交于
This commit adds new field "addr" to bpf_perf_event_data which could be read and used by bpf programs attached to perf events. The value of the field is copied from bpf_perf_event_data_kern.addr and contains the address value recorded by specifying sample_type with PERF_SAMPLE_ADDR when calling perf_event_open. Signed-off-by: NTeng Qin <qinteng@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 07 3月, 2018 1 次提交
-
-
由 Oliver O'Halloran 提交于
devm_memremap_pages() was re-worked in e8d51348 "memremap: change devm_memremap_pages interface to use struct dev_pagemap" to take a caller allocated struct dev_pagemap as a function parameter. A call to devres_free() was left in the error cleanup path which results in a kernel panic if the remap fails for some reason. Remove it to fix the panic and let devm_memremap_pages() fail gracefully. Fixes: e8d51348 ("memremap: change devm_memremap_pages interface...") Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
- 03 3月, 2018 2 次提交
-
-
由 Dan Williams 提交于
The cond_resched() currently in the setup path needs to be duplicated in the teardown path. Rather than require each instance of for_each_device_pfn() to open code the same sequence, embed it in the helper. Link: https://github.com/intel/ixpdimm_sw/issues/11 Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> Fixes: 71389703 ("mm, zone_device: Replace {get, put}_zone_device_page()...") Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Matt Redfearn 提交于
Since commit afcc90f8 ("usercopy: WARN() on slab cache usercopy region violations"), MIPS systems booting with a compat root filesystem emit a warning when copying compat siginfo to userspace: WARNING: CPU: 0 PID: 953 at mm/usercopy.c:81 usercopy_warn+0x98/0xe8 Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLAB object 'task_struct' (offset 1432, size 16)! Modules linked in: CPU: 0 PID: 953 Comm: S01logging Not tainted 4.16.0-rc2 #10 Stack : ffffffff808c0000 0000000000000000 0000000000000001 65ac85163f3bdc4a 65ac85163f3bdc4a 0000000000000000 90000000ff667ab8 ffffffff808c0000 00000000000003f8 ffffffff808d0000 00000000000000d1 0000000000000000 000000000000003c 0000000000000000 ffffffff808c8ca8 ffffffff808d0000 ffffffff808d0000 ffffffff80810000 fffffc0000000000 ffffffff80785c30 0000000000000009 0000000000000051 90000000ff667eb0 90000000ff667db0 000000007fe0d938 0000000000000018 ffffffff80449958 0000000020052798 ffffffff808c0000 90000000ff664000 90000000ff667ab0 00000000100c0000 ffffffff80698810 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffffffff8010d02c 65ac85163f3bdc4a ... Call Trace: [<ffffffff8010d02c>] show_stack+0x9c/0x130 [<ffffffff80698810>] dump_stack+0x90/0xd0 [<ffffffff80137b78>] __warn+0x100/0x118 [<ffffffff80137bdc>] warn_slowpath_fmt+0x4c/0x70 [<ffffffff8021e4a8>] usercopy_warn+0x98/0xe8 [<ffffffff8021e68c>] __check_object_size+0xfc/0x250 [<ffffffff801bbfb8>] put_compat_sigset+0x30/0x88 [<ffffffff8011af24>] setup_rt_frame_n32+0xc4/0x160 [<ffffffff8010b8b4>] do_signal+0x19c/0x230 [<ffffffff8010c408>] do_notify_resume+0x60/0x78 [<ffffffff80106f50>] work_notifysig+0x10/0x18 ---[ end trace 88fffbf69147f48a ]--- Commit 5905429a ("fork: Provide usercopy whitelisting for task_struct") noted that: "While the blocked and saved_sigmask fields of task_struct are copied to userspace (via sigmask_to_save() and setup_rt_frame()), it is always copied with a static length (i.e. sizeof(sigset_t))." However, this is not true in the case of compat signals, whose sigset is copied by put_compat_sigset and receives size as an argument. At most call sites, put_compat_sigset is copying a sigset from the current task_struct. This triggers a warning when CONFIG_HARDENED_USERCOPY is active. However, by marking this function as static inline, the warning can be avoided because in all of these cases the size is constant at compile time, which is allowed. The only site where this is not the case is handling the rt_sigpending syscall, but there the copy is being made from a stack local variable so does not trigger the warning. Move put_compat_sigset to compat.h, and mark it static inline. This fixes the WARN on MIPS. Fixes: afcc90f8 ("usercopy: WARN() on slab cache usercopy region violations") Signed-off-by: NMatt Redfearn <matt.redfearn@mips.com> Acked-by: NKees Cook <keescook@chromium.org> Cc: "Dmitry V . Levin" <ldv@altlinux.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: kernel-hardening@lists.openwall.com Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/18639/Signed-off-by: NJames Hogan <jhogan@kernel.org>
-
- 01 3月, 2018 1 次提交
-
-
由 Lingutla Chandrasekhar 提交于
On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a live CPU. This happens from the control thread which initiated the unplug. If the CPU on which the control thread runs came out from a longer idle period then the base clock of that CPU might be stale because the control thread runs prior to any event which forwards the clock. In such a case the timers from the unplugged CPU are queued on the live CPU based on the stale clock which can cause large delays due to increased granularity of the outer timer wheels which are far away from base:;clock. But there is a worse problem than that. The following sequence of events illustrates it: - CPU0 timer1 is queued expires = 59969 and base->clk = 59131. The timer is queued at wheel level 2, with resulting expiry time = 60032 (due to level granularity). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU0 is hotplugged at @60009 - CPU1 exits idle and runs the control thread which migrates the timers from CPU0 timer1 is now queued in level 0 for immediate handling in the next softirq because the requested expiry time 59969 is before CPU1 base->clk 60007 - CPU1 runs code which forwards the base clock which succeeds because the next expiring timer. which was collected at idle entry time is still set to 60020. So it forwards beyond 60007 and therefore misses to expire the migrated timer1. That timer gets expired when the wheel wraps around again, which takes between 63 and 630ms depending on the HZ setting. Address both problems by invoking forward_timer_base() for the control CPUs timer base. All other places, which might run into a similar problem (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid that. [ tglx: Massaged comment and changelog ] Fixes: a683f390 ("timers: Forward the wheel clock whenever possible") Co-developed-by: NNeeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: NNeeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: NLingutla Chandrasekhar <clingutla@codeaurora.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Anna-Maria Gleixner <anna-maria@linutronix.de> Cc: linux-arm-msm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
-
- 27 2月, 2018 1 次提交
-
-
由 Petr Mladek 提交于
wake_klogd is a local variable in console_unlock(). The information is lost when the console_lock owner using the busy wait added by the commit dbdda842 ("printk: Add console owner and waiter logic to load balance console writes"). The following race is possible: CPU0 CPU1 console_unlock() for (;;) /* calling console for last message */ printk() log_store() log_next_seq++; /* see new message */ if (seen_seq != log_next_seq) { wake_klogd = true; seen_seq = log_next_seq; } console_lock_spinning_enable(); if (console_trylock_spinning()) /* spinning */ if (console_lock_spinning_disable_and_check()) { printk_safe_exit_irqrestore(flags); return; console_unlock() if (seen_seq != log_next_seq) { /* already seen */ /* nothing to do */ Result: Nobody would wakeup klogd. One solution would be to make a global variable from wake_klogd. But then we would need to manipulate it under a lock or so. This patch wakes klogd also when console_lock is passed to the spinning waiter. It looks like the right way to go. Also userspace should have a chance to see and store any "flood" of messages. Note that the very late klogd wake up was a historic solution. It made sense on single CPU systems or when sys_syslog() operations were synchronized using the big kernel lock like in v2.1.113. But it is questionable these days. Fixes: dbdda842 ("printk: Add console owner and waiter logic to load balance console writes") Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Cc: Tejun Heo <tj@kernel.org> Suggested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NPetr Mladek <pmladek@suse.com>
-
- 24 2月, 2018 1 次提交
-
-
由 Daniel Borkmann 提交于
The requirements around atomic_add() / atomic64_add() resp. their JIT implementations differ across architectures. E.g. while x86_64 seems just fine with BPF's xadd on unaligned memory, on arm64 it triggers via interpreter but also JIT the following crash: [ 830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703 [...] [ 830.916161] Internal error: Oops: 96000021 [#1] SMP [ 830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8 [ 830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017 [ 830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO) [ 831.003793] pc : __ll_sc_atomic_add+0x4/0x18 [ 831.008055] lr : ___bpf_prog_run+0x1198/0x1588 [ 831.012485] sp : ffff00001ccabc20 [ 831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00 [ 831.021087] x27: 0000000000000001 x26: 0000000000000000 [ 831.026387] x25: 000000c168d9db98 x24: 0000000000000000 [ 831.031686] x23: ffff000008203878 x22: ffff000009488000 [ 831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0 [ 831.042286] x19: ffff0000097b5080 x18: 0000000000000a03 [ 831.047585] x17: 0000000000000000 x16: 0000000000000000 [ 831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000 [ 831.058184] x13: 0000000000000000 x12: 0000000000000000 [ 831.063484] x11: 0000000000000001 x10: 0000000000000000 [ 831.068783] x9 : 0000000000000000 x8 : 0000000000000000 [ 831.074083] x7 : 0000000000000000 x6 : 000580d428000000 [ 831.079383] x5 : 0000000000000018 x4 : 0000000000000000 [ 831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001 [ 831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001 [ 831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044) [ 831.102577] Call trace: [ 831.105012] __ll_sc_atomic_add+0x4/0x18 [ 831.108923] __bpf_prog_run32+0x4c/0x70 [ 831.112748] bpf_test_run+0x78/0xf8 [ 831.116224] bpf_prog_test_run_xdp+0xb4/0x120 [ 831.120567] SyS_bpf+0x77c/0x1110 [ 831.123873] el0_svc_naked+0x30/0x34 [ 831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31) Reason for this is because memory is required to be aligned. In case of BPF, we always enforce alignment in terms of stack access, but not when accessing map values or packet data when the underlying arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set. xadd on packet data that is local to us anyway is just wrong, so forbid this case entirely. The only place where xadd makes sense in fact are map values; xadd on stack is wrong as well, but it's been around for much longer. Specifically enforce strict alignment in case of xadd, so that we handle this case generically and avoid such crashes in the first place. Fixes: 17a52670 ("bpf: verifier (add verifier core)") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 23 2月, 2018 4 次提交
-
-
由 Thomas Gleixner 提交于
At CPU hotunplug the corresponding per cpu matrix allocator is shut down and the allocated interrupt bits are discarded under the assumption that all allocated bits have been either migrated away or shut down through the managed interrupts mechanism. This is not true because interrupts which are not started up might have a vector allocated on the outgoing CPU. When the interrupt is started up later or completely shutdown and freed then the allocated vector is handed back, triggering warnings or causing accounting issues which result in suspend failures and other issues. Change the CPU hotplug mechanism of the matrix allocator so that the remaining allocations at unplug time are preserved and global accounting at hotplug is correctly readjusted to take the dormant vectors into account. Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator") Reported-by: NYuriy Vostrikov <delamonpansie@gmail.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NYuriy Vostrikov <delamonpansie@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
-
由 Yonghong Song 提交于
Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function") fixed a memory leak and removed unnecessary locks in map_free callback function. Unfortrunately, it introduced a lockdep warning. When lockdep checking is turned on, running tools/testing/selftests/bpf/test_lpm_map will have: [ 98.294321] ============================= [ 98.294807] WARNING: suspicious RCU usage [ 98.295359] 4.16.0-rc2+ #193 Not tainted [ 98.295907] ----------------------------- [ 98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage! [ 98.297657] [ 98.297657] other info that might help us debug this: [ 98.297657] [ 98.298663] [ 98.298663] rcu_scheduler_active = 2, debug_locks = 1 [ 98.299536] 2 locks held by kworker/2:1/54: [ 98.300152] #0: ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0 [ 98.301381] #1: ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0 Since actual trie tree removal happens only after no other accesses to the tree are possible, replacing rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock)) with rcu_dereference_protected(*slot, 1) fixed the issue. Fixes: 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function") Reported-by: NEric Dumazet <edumazet@google.com> Suggested-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYonghong Song <yhs@fb.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Eric Dumazet 提交于
syszbot managed to trigger RCU detected stalls in bpf_array_free_percpu() It takes time to allocate a huge percpu map, but even more time to free it. Since we run in process context, use cond_resched() to yield cpu if needed. Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Luck, Tony 提交于
Each read from a file in efivarfs results in two calls to EFI (one to get the file size, another to get the actual data). On X86 these EFI calls result in broadcast system management interrupts (SMI) which affect performance of the whole system. A malicious user can loop performing reads from efivarfs bringing the system to its knees. Linus suggested per-user rate limit to solve this. So we add a ratelimit structure to "user_struct" and initialize it for the root user for no limit. When allocating user_struct for other users we set the limit to 100 per second. This could be used for other places that want to limit the rate of some detrimental user action. In efivarfs if the limit is exceeded when reading, we take an interruptible nap for 50ms and check the rate limit again. Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 2月, 2018 4 次提交
-
-
由 Tycho Andersen 提交于
Previously if users passed a small size for the input structure size, they would get get odd behavior. It doesn't make sense to pass a structure smaller than at least filter_off size, so let's just give -EINVAL in this case. This changes userspace visible behavior, but was only introduced in commit 26500475 ("ptrace, seccomp: add support for retrieving seccomp metadata") in 4.16-rc2, so should be safe to change if merged before then. Reported-by: NEugene Syromiatnikov <esyr@redhat.com> Signed-off-by: NTycho Andersen <tycho@tycho.ws> CC: Kees Cook <keescook@chromium.org> CC: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 David Rientjes 提交于
chan->n_subbufs is set by the user and relay_create_buf() does a kmalloc() of chan->n_subbufs * sizeof(size_t *). kmalloc_slab() will generate a warning when this fails if chan->subbufs * sizeof(size_t *) > KMALLOC_MAX_SIZE. Limit chan->n_subbufs to the maximum allowed kmalloc() size. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802061216100.122576@chino.kir.corp.google.com Fixes: f6302f1b ("relay: prevent integer overflow in relay_open()") Signed-off-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
As Peter points out, Doing a CALL+RET for just the decrement is a bit silly. Fixes: d70f2a14 ("include/linux/sched/mm.h: uninline mmdrop_async(), etc") Acked-by: NPeter Zijlstra (Intel) <peterz@infraded.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tejun Heo 提交于
A domain cgroup isn't allowed to be turned threaded if its subtree is populated or domain controllers are enabled. cgroup_enable_threaded() depended on cgroup_can_be_thread_root() test to enforce this rule. A parent which has populated domain descendants or have domain controllers enabled can't become a thread root, so the above rules are enforced automatically. However, for the root cgroup which can host mixed domain and threaded children, cgroup_can_be_thread_root() doesn't check any of those conditions and thus first level cgroups ends up escaping those rules. This patch fixes the bug by adding explicit checks for those rules in cgroup_enable_threaded(). Reported-by: NMichael Kerrisk (man-pages) <mtk.manpages@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org> Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support") Cc: stable@vger.kernel.org # v4.14+
-
- 21 2月, 2018 2 次提交
-
-
由 Josh Poimboeuf 提交于
Convert init_kernel_text() to a global function and use it in a few places instead of manually comparing _sinittext and _einittext. Note that kallsyms.h has a very similar function called is_kernel_inittext(), but its end check is inclusive. I'm not sure whether that's intentional behavior, so I didn't touch it. Suggested-by: NJason Baron <jbaron@akamai.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4335d02be8d45ca7d265d2f174251d0b7ee6c5fd.1519051220.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
Currently when the jump label code encounters an address which isn't recognized by kernel_text_address(), it just silently fails. This can be dangerous because jump labels are used in a variety of places, and are generally expected to work. Convert the silent failure to a warning. This won't warn about attempted writes to tracepoints in __init code after initmem has been freed, as those are already guarded by the entry->code check. Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/de3a271c93807adb7ed48f4e946b4f9156617680.1519051220.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-