- 10 6月, 2018 1 次提交
-
-
由 Anna-Maria Gleixner 提交于
Commit a841796f ("signal: align __lock_task_sighand() irq disabling and RCU") introduced a rcu read side critical section with interrupts disabled. The changelog suggested that a better long-term fix would be "to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock". This long-term fix has been made in commit b4abf910 ("rtmutex: Make wait_lock irq safe") for a different reason. Therefore revert commit a841796f ("signal: align > __lock_task_sighand() irq disabling and RCU") as the interrupt disable dance is not longer required. The change was tested on the base of b4abf910 ("rtmutex: Make wait_lock irq safe") with a four hour run of rcutorture scenario TREE03 with lockdep enabled as suggested by Paul McKenney. Signed-off-by: NAnna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Cc: bigeasy@linutronix.de Link: https://lkml.kernel.org/r/20180525090507.22248-3-anna-maria@linutronix.de
-
- 08 6月, 2018 3 次提交
-
-
由 Tetsuo Handa 提交于
When we get a hung task it can often be valuable to see _all_ the hung tasks on the system before calling panic(). Quoting from https://syzkaller.appspot.com/text?tag=CrashReport&id=5316056503549952 ---------------------------------------- INFO: task syz-executor0:6540 blocked for more than 120 seconds. Not tainted 4.16.0+ #13 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor0 D23560 6540 4521 0x80000004 Call Trace: context_switch kernel/sched/core.c:2848 [inline] __schedule+0x8fb/0x1ef0 kernel/sched/core.c:3490 schedule+0xf5/0x430 kernel/sched/core.c:3549 schedule_preempt_disabled+0x10/0x20 kernel/sched/core.c:3607 __mutex_lock_common kernel/locking/mutex.c:833 [inline] __mutex_lock+0xb7f/0x1810 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 __blkdev_driver_ioctl block/ioctl.c:303 [inline] blkdev_ioctl+0x1759/0x1e00 block/ioctl.c:601 ioctl_by_bdev+0xa5/0x110 fs/block_dev.c:2060 isofs_get_last_session fs/isofs/inode.c:567 [inline] isofs_fill_super+0x2ba9/0x3bc0 fs/isofs/inode.c:660 mount_bdev+0x2b7/0x370 fs/super.c:1119 isofs_mount+0x34/0x40 fs/isofs/inode.c:1560 mount_fs+0x66/0x2d0 fs/super.c:1222 vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037 vfs_kern_mount fs/namespace.c:2514 [inline] do_new_mount fs/namespace.c:2517 [inline] do_mount+0xea4/0x2b90 fs/namespace.c:2847 ksys_mount+0xab/0x120 fs/namespace.c:3063 SYSC_mount fs/namespace.c:3077 [inline] SyS_mount+0x39/0x50 fs/namespace.c:3074 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 (...snipped...) Showing all locks held in the system: (...snipped...) 2 locks held by syz-executor0/6540: #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: alloc_super fs/super.c:211 [inline] #0: 00000000566d4c39 (&type->s_umount_key#49/1){+.+.}, at: sget_userns+0x3b2/0xe60 fs/super.c:502 /* down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING); */ #1: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */ (...snipped...) 3 locks held by syz-executor7/6541: #0: 0000000043ca8836 (&lo->lo_ctl_mutex/1){+.+.}, at: lo_ioctl+0x8b/0x1b70 drivers/block/loop.c:1355 /* mutex_lock_nested(&lo->lo_ctl_mutex, 1); */ #1: 000000007bf3d3f9 (&bdev->bd_mutex){+.+.}, at: blkdev_reread_part+0x1e/0x40 block/ioctl.c:192 #2: 00000000566d4c39 (&type->s_umount_key#50){.+.+}, at: __get_super.part.10+0x1d3/0x280 fs/super.c:663 /* down_read(&sb->s_umount); */ ---------------------------------------- When reporting an AB-BA deadlock like shown above, it would be nice if trace of PID=6541 is printed as well as trace of PID=6540 before calling panic(). Showing hung tasks up to /proc/sys/kernel/hung_task_warnings could delay calling panic() but normally there should not be so many hung tasks. Link: http://lkml.kernel.org/r/201804050705.BHE57833.HVFOFtSOMQJFOL@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: NDmitry Vyukov <dvyukov@google.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox 提交于
We're already using a union of many fields here, so stop abusing the _mapcount and make page_type its own field. That implies renaming some of the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring back the PG_buddy, PG_balloon and PG_kmemcg names. As suggested by Kirill, make page_type a bitmask. Because it starts out life as -1 (thanks to sharing the storage with _mapcount), setting a page flag means clearing the appropriate bit. This gives us space for probably twenty or so extra bits (depending how paranoid we want to be about _mapcount underflow). Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 6月, 2018 1 次提交
-
-
由 Kees Cook 提交于
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This patch makes the changes for kmalloc()-family (and kvmalloc()-family) uses. It was done via automatic conversion with manual review for the "CHECKME" non-standard cases noted below, using the following Coccinelle script: // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len * // sizeof *pkey_cache->table, GFP_KERNEL); @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; identifier VAR, ELEMENT; expression COUNT; @@ - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP) + alloc(struct_size(VAR, ELEMENT, COUNT), GFP) // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL); @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; identifier VAR, ELEMENT; expression COUNT; @@ - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP) + alloc(struct_size(VAR, ELEMENT, COUNT), GFP) // Same pattern, but can't trivially locate the trailing element name, // or variable name. @@ identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc"; expression GFP; expression SOMETHING, COUNT, ELEMENT; @@ - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP) + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP) Signed-off-by: NKees Cook <keescook@chromium.org>
-
- 06 6月, 2018 2 次提交
-
-
The interrupts are enabled/disabled so the interrupt handler can run with enabled interrupts while serving the interrupt and not lose other interrupts especially the timer tick. If the system runs with force-threaded interrupts then there is no need to enable the interrupts. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yisheng Xie 提交于
match_string() returns the index of an array for a matching string, which can be used to simplify the code. Link: http://lkml.kernel.org/r/1526546163-4609-1-git-send-email-xieyisheng1@huawei.comReviewed-by: NAndy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: NYisheng Xie <xieyisheng1@huawei.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 05 6月, 2018 3 次提交
-
-
由 Sergey Senozhatsky 提交于
Drop the in_nmi() check from printk_safe_flush_on_panic() and attempt to re-init (IOW unlock) locked logbuf spinlock from panic CPU regardless of its context. Otherwise, theoretically, we can deadlock on logbuf trying to flush per-CPU buffers: a) Panic CPU is running in non-NMI context b) Panic CPU sends out shutdown IPI via reboot vector c) Panic CPU fails to stop all remote CPUs d) Panic CPU sends out shutdown IPI via NMI vector One of the CPUs that we bring down via NMI vector can hold logbuf spin lock (theoretically). Link: http://lkml.kernel.org/r/20180530070350.10131-1-sergey.senozhatsky@gmail.com To: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NPetr Mladek <pmladek@suse.com>
-
由 Steven Rostedt (VMware) 提交于
An anonymous source sent me a bunch of typo fixes in the comments of ring_buffer.c file. That source did not want to be associated to this patch because they don't want to be known as "one of those" commiters (you know who you are!). They gave me permission to sign this off in my own name. Suggested-by: One-of-those-commiters@YouKnowWhoYouAre.org Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Yonghong Song 提交于
Commit bf6fa2c8 ("bpf: implement bpf_get_current_cgroup_id() helper") introduced a new helper bpf_get_current_cgroup_id(). The helper has a dependency on CONFIG_CGROUPS. When CONFIG_CGROUPS is not defined, using the helper will result the following verifier error: kernel subsystem misconfigured func bpf_get_current_cgroup_id#80 which is hard for users to interpret. Guarding the reference to bpf_get_current_cgroup_id_proto with CONFIG_CGROUPS will result in below better message: unknown func bpf_get_current_cgroup_id#80 Fixes: bf6fa2c8 ("bpf: implement bpf_get_current_cgroup_id() helper") Suggested-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NYonghong Song <yhs@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 04 6月, 2018 1 次提交
-
-
由 Yonghong Song 提交于
bpf has been used extensively for tracing. For example, bcc contains an almost full set of bpf-based tools to trace kernel and user functions/events. Most tracing tools are currently either filtered based on pid or system-wide. Containers have been used quite extensively in industry and cgroup is often used together to provide resource isolation and protection. Several processes may run inside the same container. It is often desirable to get container-level tracing results as well, e.g. syscall count, function count, I/O activity, etc. This patch implements a new helper, bpf_get_current_cgroup_id(), which will return cgroup id based on the cgroup within which the current task is running. The later patch will provide an example to show that userspace can get the same cgroup id so it could configure a filter or policy in the bpf program based on task cgroup id. The helper is currently implemented for tracing. It can be added to other program types as well when needed. Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NYonghong Song <yhs@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 03 6月, 2018 9 次提交
-
-
由 Jesper Dangaard Brouer 提交于
The XDP_REDIRECT map devmap can avoid using ndo_xdp_flush, by instead instructing ndo_xdp_xmit to flush via XDP_XMIT_FLUSH flag in appropriate places. Notice after this patch it is possible to remove ndo_xdp_flush completely, as this is the last user of ndo_xdp_flush. This is left for later patches, to keep driver changes separate. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Jesper Dangaard Brouer 提交于
This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here though underlying arch can handle it just fine. Therefore adjust this down to machine size and check and rewrite the offset for narrow access on that basis. We also need to fix corresponding pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs for tracing work on x86_32. Reported-by: NWang YanQing <udknight@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Tested-by: NWang YanQing <udknight@gmail.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
While some of the BPF map lookup helpers provide a ->map_gen_lookup() callback for inlining the map lookup altogether it is not available for every map, so the remaining ones have to call bpf_map_lookup_elem() helper which does a dispatch to map->ops->map_lookup_elem(). In times of retpolines, this will control and trap speculative execution rather than letting it do its work for the indirect call and will therefore cause a slowdown. Likewise, bpf_map_update_elem() and bpf_map_delete_elem() do not have an inlined version and need to call into their map->ops->map_update_elem() resp. map->ops->map_delete_elem() handlers. Before: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#232656 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call bpf_map_delete_elem#215008 <-- indirect call via 16: (95) exit helper After: # bpftool prog dump xlated id 1 0: (bf) r2 = r10 1: (07) r2 += -8 2: (7a) *(u64 *)(r2 +0) = 0 3: (18) r1 = map[id:1] 5: (85) call __htab_map_lookup_elem#233328 6: (15) if r0 == 0x0 goto pc+4 7: (71) r1 = *(u8 *)(r0 +35) 8: (55) if r1 != 0x0 goto pc+1 9: (72) *(u8 *)(r0 +35) = 1 10: (07) r0 += 56 11: (15) if r0 == 0x0 goto pc+4 12: (bf) r2 = r0 13: (18) r1 = map[id:1] 15: (85) call htab_lru_map_delete_elem#238240 <-- direct call 16: (95) exit In all three lookup/update/delete cases however we can use the actual address of the map callback directly if we find that there's only a single path with a map pointer leading to the helper call, meaning when the map pointer has not been poisoned from verifier side. Example code can be seen above for the delete case. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Its trivial and straight forward to expose it for scripts that can then use it along with bpftool in order to inspect an individual application's used maps and progs. Right now we dump some basic information in the fdinfo file but with the help of the map/prog id full introspection becomes possible now. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NSong Liu <songliubraving@fb.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Stating 'proprietary program' in the error is just silly since it can also be a different open source license than that which is just not compatible. Reference: https://twitter.com/majek04/status/998531268039102465Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Dan Williams 提交于
There is currently a mismatch between the resources that will trigger the e820_pmem driver to register/load and the resources that will actually be surfaced as pmem ranges. register_e820_pmem() uses walk_iomem_res_desc() which includes children and siblings. In contrast, e820_pmem_probe() only considers top level resources. For example the following resource tree results in the driver being loaded, but no resources being registered: 398000000000-39bfffffffff : PCI Bus 0000:ae 39be00000000-39bf07ffffff : PCI Bus 0000:af 39be00000000-39beffffffff : 0000:af:00.0 39be10000000-39beffffffff : Persistent Memory (legacy) Fix this up to allow definitions of "legacy" pmem ranges anywhere in system-physical address space. Not that it is a recommended or safe to define a pmem range in PCI space, but it is useful for debug / experimentation, and the restriction on being a top-level resource was arbitrary. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NDan Williams <dan.j.williams@intel.com>
-
由 Martin KaFai Lau 提交于
The t->type in BTF_KIND_FWD is not used. It must be 0. This patch ensures that and also adds a test case in test_btf.c Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Martin KaFai Lau 提交于
This patch ensures array's t->size is 0. The array size is decided by its individual elem's size and the number of elements. Hence, t->size is not used and it must be 0. A test case is added to test_btf.c Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 31 5月, 2018 5 次提交
-
-
由 Davidlohr Bueso 提交于
I cannot spell 'throttling'. Signed-off-by: NDavidlohr Bueso <dbueso@suse.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Juri Lelli 提交于
A missing clock update is causing the following warning: rq->clock_update_flags < RQCF_ACT_SKIP WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720 Call Trace: <IRQ> __hrtimer_run_queues+0x10f/0x530 hrtimer_interrupt+0xe5/0x240 smp_apic_timer_interrupt+0x79/0x2b0 apic_timer_interrupt+0xf/0x20 </IRQ> do_idle+0x203/0x280 cpu_startup_entry+0x6f/0x80 start_secondary+0x1b0/0x200 secondary_startup_64+0xa5/0xb0 hardirqs last enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360 hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0 softirqs last enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70 softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70 This happens because inactive_task_timer() calls sub_running_bw() (if TASK_DEAD and non_contending) that might trigger a schedutil update, which might access the clock. Clock is however currently updated only later in inactive_task_timer() function. Fix the problem by updating the clock right after task_rq_lock(). Reported-by: Nkernel test robot <xiaolong.ye@intel.com> Signed-off-by: NJuri Lelli <juri.lelli@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Paul Burton 提交于
select_task_rq() is used in a few paths to select the CPU upon which a thread should be run - for example it is used by try_to_wake_up() & by fork or exec balancing. As-is it allows use of any online CPU that is present in the task's cpus_allowed mask. This presents a problem because there is a period whilst CPUs are brought online where a CPU is marked online, but is not yet fully initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state < CPUHP_ONLINE. Usually we don't run any user tasks during this window, but there are corner cases where this can happen. An example observed is: - Some user task A, running on CPU X, forks to create task B. - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's task_struct::cpu field to X. - CPU X is offlined. - Task A, currently somewhere between the __set_task_cpu() in copy_process() and the call to wake_up_new_task(), is migrated to CPU Y by migrate_tasks() when CPU X is offlined. - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The scheduler is now active on CPU X, but there are no user tasks on the runqueue. - Task A runs on CPU Y & reaches wake_up_new_task(). This calls select_task_rq() with cpu=X, taken from task B's task_struct, and select_task_rq() allows CPU X to be returned. - Task A enqueues task B on CPU X's runqueue, via activate_task() & enqueue_task(). - CPU X now has a user task on its runqueue before it has reached the CPUHP_ONLINE state. In most cases, the user tasks that schedule on the newly onlined CPU have no idea that anything went wrong, but one case observed to be problematic is if the task goes on to invoke the sched_setaffinity syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state before the CPU that brought it online calls stop_machine_unpark(). This means that for a portion of the window of time between CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct cpu_stopper has its enabled field set to false. If a user thread is executed on the CPU during this window and it invokes sched_setaffinity with a CPU mask that does not include the CPU it's running on, then when __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke migration_cpu_stop() and perform the actual migration away from the CPU it will simply return -ENOENT rather than calling migration_cpu_stop(). We then return from the sched_setaffinity syscall back to the user task that is now running on a CPU which it just asked not to run on, and which is not present in its cpus_allowed mask. This patch resolves the problem by having select_task_rq() enforce that user tasks run on CPUs that are active - the same requirement that select_fallback_rq() already enforces. This should ensure that newly onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to schedule user tasks, and also implies that bringup_wait_for_ap() will have called stop_machine_unpark() which resolves the sched_setaffinity issue above. I haven't yet investigated them, but it may be of interest to review whether any of the actions performed by hotplug states between CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended effects on user tasks that might schedule before they are reached, which might widen the scope of the problem from just affecting the behaviour of sched_setaffinity. Signed-off-by: NPaul Burton <paul.burton@mips.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules for running on an online && !active CPU are stricter than just being a kthread, you need to be a per-cpu kthread. If you're not strictly per-CPU, you have better CPUs to run on and don't need the partially booted one to get your work done. The exception is to allow smpboot threads to bootstrap the CPU itself and get kernel 'services' initialized before we allow userspace on it. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 955dbdf4 ("sched: Allow migrating kthreads into online but inactive CPUs") Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Colin Ian King 提交于
The assignment dev = dev is redundant and should be removed. Detected by CoverityScan, CID#1469486 ("Evaluation order violation") Signed-off-by: NColin Ian King <colin.king@canonical.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 30 5月, 2018 2 次提交
-
-
由 Sean Young 提交于
Add support for BPF_PROG_LIRC_MODE2. This type of BPF program can call rc_keydown() to reported decoded IR scancodes, or rc_repeat() to report that the last key should be repeated. The bpf program can be attached to using the bpf(BPF_PROG_ATTACH) syscall; the target_fd must be the /dev/lircN device. Acked-by: NYonghong Song <yhs@fb.com> Signed-off-by: NSean Young <sean@mess.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Sean Young 提交于
This makes is it possible for bpf prog detach to return -ENOENT. Acked-by: NYonghong Song <yhs@fb.com> Signed-off-by: NSean Young <sean@mess.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 29 5月, 2018 11 次提交
-
-
由 Steven Rostedt (VMware) 提交于
Now that trace_marker can have triggers, including a histogram triggers, the onmatch() and onmax() access the trace event. To do so, the search routine to find the event file needs to use the raw __find_event_file() that does not filter out ftrace events. Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
As strings in trace events may not have a nul terminating character, the filter string compares use the defined string length for the field for the compares. The trace_marker records data slightly different than do normal events. It's size is zero, meaning that the string is the rest of the array, and that the string also ends with '\0'. If the size is zero, assume that the string is nul terminated and read the string in question as is. Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Allow writing to the trace_markers file initiate triggers defined in tracefs/ftrace/print/trigger file. This will allow of user space to trigger the same type of triggers (including histograms) that the trace events use. Had to create a ftrace_event_register() function that will become the trace_marker print event's reg() function. This is required because of how triggers are enabled: event_trigger_write() { event_trigger_regex_write() { trigger_process_regex() { for p in trigger_commands { p->func(); /* trigger_snapshot_cmd->func */ event_trigger_callback() { cmd_ops->reg() /* register_trigger() */ { trace_event_trigger_enable_disable() { trace_event_enable_disable() { call->class->reg(); Without the reg() function, the trigger code will call a NULL pointer and crash the system. Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Clark Williams <williams@redhat.com> Cc: Karim Yaghmour <karim.yaghmour@opersys.com> Cc: Brendan Gregg <bgregg@netflix.com> Suggested-by: NJoel Fernandes <joelaf@google.com> Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
The filter file in the ftrace internal events, like in /sys/kernel/tracing/events/ftrace/function/filter is not attached to any functionality. Do not create them as they are meaningless. In the future, if an ftrace internal event gets filter functionality, then it will need to create it directly. Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
The dynamic arrays defined for ftrace internal events, such as the buf field for trace_marker (ftrace/print) did not have brackets which makes the filter code not accept it as a string. This is not currently an issues because the filter code doesn't do anything for these events, but they will in the future, and this needs to be fixed for when it does. Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Instead of having both trace_init_tracefs() and event_trace_init() be called by fs_initcall() routines, have event_trace_init() called directly by trace_init_tracefs(). This will guarantee order of how the events are created with respect to the rest of the ftrace infrastructure. This is needed to be able to assoctiate event files with ftrace internal events, such as the trace_marker. Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
By adding the function __find_event_file() that can search for files without restrictions, such as if the event associated with the file has a reg function, or if it has the "ignore" flag set, the files that are associated to ftrace internal events (like trace_marker and function events) can be found and used. find_event_file() still returns a "filtered" file, as most callers need a valid trace event file. One created by the trace_events.h macros and not one created for parsing ftrace specific events. Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Trace event triggers can be called before or after the event has been committed. If it has been called after the commit, there's a possibility that the event no longer exists. Currently, the two post callers is the trigger to disable tracing (traceoff) and the one that will record a stack dump (stacktrace). Neither of them reference the trace event entry record, as that would lead to a race condition that could pass in corrupted data. To prevent any other users of the post data triggers from using the trace event record, pass in NULL to the post call trigger functions for the event record, as they should never need to use them in the first place. This does not fix any bug, but prevents bugs from happening by new post call trigger users. Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org> Reviewed-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Rafael J. Wysocki 提交于
The extra forward declaration of pm_qos_get_value() is redundant, so drop it. Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
由 Lee, Chun-Yi 提交于
The description of tracepoint_probe_register duplicates with tracepoint_probe_register_prio. This patch cleans up the descriptions. Link: http://lkml.kernel.org/r/20170616082643.7311-1-jlee@suse.comSigned-off-by: N"Lee, Chun-Yi" <jlee@suse.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
The snapshot trigger currently only affects the main ring buffer, even when it is used by the instances. This can be confusing as the snapshot trigger is listed in the instance. > # cd /sys/kernel/tracing > # mkdir instances/foo > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger > # echo top buffer > trace_marker > # echo foo buffer > instances/foo/trace_marker > # touch /tmp/bar > # chown rostedt /tmp/bar > # cat instances/foo/snapshot # tracer: nop # # # * Snapshot is freed * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') > # cat snapshot # tracer: nop # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | bash-1189 [000] .... 111.488323: tracing_mark_write: top buffer Not only did the snapshot occur in the top level buffer, but the instance snapshot buffer should have been allocated, and it is still free. Cc: stable@vger.kernel.org Fixes: 85f2b082 ("tracing: Add basic event trigger framework") Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 28 5月, 2018 2 次提交
-
-
由 Andrey Ignatov 提交于
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg. It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket. The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly. UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound. The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP. For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway. Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path. Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg. To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6. Signed-off-by: NAndrey Ignatov <rdna@fb.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Arnd Bergmann 提交于
The stack_map_get_build_id_offset() function is too long for gcc to track whether 'work' may or may not be initialized at the end of it, leading to a false-positive warning: kernel/bpf/stackmap.c: In function 'stack_map_get_build_id_offset': kernel/bpf/stackmap.c:334:13: error: 'work' may be used uninitialized in this function [-Werror=maybe-uninitialized] This removes the 'in_nmi_ctx' flag and uses the state of that variable itself to see if it got initialized. Fixes: bae77c5e ("bpf: enable stackmap with build_id in nmi context") Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-