- 08 6月, 2018 3 次提交
-
-
由 Yang Shi 提交于
mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Baoquan He 提交于
In commit 3b0efdfa ("mm, sl[aou]b: Extract common fields from struct kmem_cache") the variable 'obj_size' was moved above, however the related code comment is not updated accordingly. Do it here. Link: http://lkml.kernel.org/r/20180603032402.27526-1-bhe@redhat.comSigned-off-by: NBaoquan He <bhe@redhat.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f4220 ("mm: change return type to vm_fault_t") There was an existing bug inside dax_load_hole() if vm_insert_mixed had failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of VM_FAULT_OOM. With new vmf_insert_mixed() this issue is addressed. vm_insert_mixed_mkwrite has inefficiency when it returns an error value, driver has to convert it to vm_fault_t type. With new vmf_insert_mixed_mkwrite() this limitation will be addressed. Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 6月, 2018 1 次提交
-
-
由 Doron Roberts-Kedes 提交于
strp_unpause queues strp_work in order to parse any messages that arrived while the strparser was paused. However, the process invoking strp_unpause could eagerly parse a buffered message itself if it held the sock lock. __strp_unpause is an alternative to strp_pause that avoids the scheduling overhead that results when a receiving thread unpauses the strparser and waits for the next message to be delivered by the workqueue thread. This patch more than doubled the IOPS achieved in a benchmark of NBD traffic encrypted using ktls. Signed-off-by: NDoron Roberts-Kedes <doronrk@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 6月, 2018 6 次提交
-
-
由 Kees Cook 提交于
Use the overflow helpers both in existing multiplication-using inlines as well as the addition-overflow case in the core allocation routine. Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 Kees Cook 提交于
Instead of open-coded multiplication and bounds checking, use the new overflow helper. Additionally prepare for vmalloc() users to add array_size()-family helpers in the future. Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 Kees Cook 提交于
Instead of open-coded multiplication and bounds checking, use the new overflow helper. Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 Kees Cook 提交于
In preparation for replacing unchecked overflows for memory allocations, this creates helpers for the 3 most common calculations: array_size(a, b): 2-dimensional array array3_size(a, b, c): 3-dimensional array struct_size(ptr, member, n): struct followed by n-many trailing members Each of these return SIZE_MAX on overflow instead of wrapping around. (Additionally renames a variable named "array_size" to avoid future collision.) Co-developed-by: NMatthew Wilcox <mawilcox@microsoft.com> Signed-off-by: NKees Cook <keescook@chromium.org>
-
由 David Ahern 提交于
Add extack argument to reload, port_split and port_unsplit operations. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Sabrina Dubroca 提交于
commit 0bbbf0e7 ("ipmr, ip6mr: Unite creation of new mr_table") refactored ipmr_new_table, so that it now returns NULL when mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table expect an ERR_PTR. This can result in NULL deref, for example when ipmr_rules_exit calls ipmr_free_table with NULL net->ipv4.mrt in the !CONFIG_IP_MROUTE_MULTIPLE_TABLES version. This patch makes mr_table_alloc return errors, and changes ip6mr_new_table and its callers to return/expect error pointers as well. It also removes the version of mr_table_alloc defined under !CONFIG_IP_MROUTE_COMMON, since it is never used. Fixes: 0bbbf0e7 ("ipmr, ip6mr: Unite creation of new mr_table") Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 6月, 2018 21 次提交
-
-
由 Michal Kalderon 提交于
This FW contains several fixes and features. RDMA - Several modifications and fixes for Memory Windows - drop vlan and tcp timestamp from mss calculation in driver for this FW - Fix SQ completion flow when local ack timeout is infinite - Modifications in t10dif support ETH - Fix aRFS for tunneled traffic without inner IP. - Fix chip configuration which may fail under heavy traffic conditions. - Support receiving any-VNI in VXLAN and GENEVE RX classification. iSCSI / FcoE - Fix iSCSI recovery flow - Drop vlan and tcp timestamp from mss calc for fw 8.37.2.0 Misc - Several registers (split registers) won't read correctly with ethtool -d Signed-off-by: NAriel Elior <Ariel.Elior@cavium.com> Signed-off-by: NManish Rangankar <manish.rangankar@cavium.com> Signed-off-by: NMichal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maciej Żenczykowski 提交于
Tested: 'git grep tw_timeout' comes up empty and it builds :-) Signed-off-by: NMaciej Żenczykowski <maze@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Magnus Karlsson 提交于
Here we add the functionality required to support zero-copy Tx, and also exposes various zero-copy related functions for the netdevs. Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Magnus Karlsson 提交于
Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start to pull userland AF_XDP Tx frames from a NAPI context. Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
Extend the xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and wireup ndo_bpf call in bind. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
Here, a new type of allocator support is added to the XDP return API. A zero-copy allocated xdp_buff cannot be converted to an xdp_frame. Instead is the buff has to be copied. This is not supported at all in this commit. Also, an opaque "handle" is added to xdp_buff. This can be used as a context for the zero-copy allocator implementation. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
Extend ndo_bpf with two new commands used for query zero-copy support and register an UMEM to a queue_id of a netdev. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
The xdp_umem_page holds the address for a page. Trade memory for faster lookup. Later, we'll add DMA address here as well. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Björn Töpel 提交于
Moved struct xdp_umem to xdp_sock.h, in order to prepare for zero-copy support. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Kun Yi 提交于
BCM54612E have 4 multi-functional LED pins that can be configured through register setting; the LED4 pin can be configured to a 125MHz reference clock output by setting the spare register. Since the dedicated CLK125 reference clock pin is not brought out on the 48-Pin MLP, the LED4 pin is the only pin to provide such function in this package, and therefore it is beneficial to just enable the reference clock by default. Signed-off-by: NKun Yi <kunyi@google.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Heiner Kallweit 提交于
Current implementation of MDIO bus PM ops doesn't actually implement bus-specific PM ops but just calls PM ops defined on a device level what doesn't seem to be fully in line with the core PM model. When looking e.g. at __device_suspend() the PM core looks for PM ops of a device in a specific order: 1. device PM domain 2. device type 3. device class 4. device bus I think it has good reason that there's no PM ops on device level. Now that a device type representation of PHY's as special type of MDIO devices was added (only user of MDIO bus PM ops), the MDIO bus PM ops can be removed including member pm of struct mdio_device. If for some other type of MDIO device PM ops are needed, it should be modeled as struct device_type as well. Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
All drivers are cleaned up and no references to ndo_xdp_flush are left in drivers, it is time to remove the net_device_ops operation ndo_xdp_flush. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Mikulas Patocka 提交于
The function __builtin_expect returns long type (see the gcc documentation), and so do macros likely and unlikely. Unfortunatelly, when CONFIG_PROFILE_ANNOTATED_BRANCHES is selected, the macros likely and unlikely expand to __branch_check__ and __branch_check__ truncates the long type to int. This unintended truncation may cause bugs in various kernel code (we found a bug in dm-writecache because of it), so it's better to fix __branch_check__ to return long. Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1805300818140.24812@file01.intranet.prod.int.rdu2.redhat.com Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 1f0d69a9 ("tracing: profile likely and unlikely annotations") Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Vasyl Gomonovych 提交于
Fix typo of the word 'been' Link: http://lkml.kernel.org/r/20180518203130.2011-1-gomonovych@gmail.comSigned-off-by: NVasyl Gomonovych <gomonovych@gmail.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Yuval Bason 提交于
This patch adds support for configuring SRQ and provides the necessary APIs for rdma upper layer driver (qedr) to enable the SRQ feature. Signed-off-by: NMichal Kalderon <michal.kalderon@cavium.com> Signed-off-by: NAriel Elior <ariel.elior@cavium.com> Signed-off-by: NYuval Bason <yuval.bason@cavium.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Randy Dunlap 提交于
<linux/skbuff.h> does not use nor need <linux/slab.h>, so drop this header file from skbuff.h. <linux/skbuff.h> is currently #included in around 1200 C source and header files, making it the 31st most-used header file. Build tested [allmodconfig] on 20 arch-es. Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Howells 提交于
Sometimes an in-progress call will stop responding on the fileserver when the fileserver quietly cancels the call with an internally marked abort (RX_CALL_DEAD), without sending an ABORT to the client. This causes the client's call to eventually expire from lack of incoming packets directed its way, which currently leads to it being cancelled locally with ETIME. Note that it's not currently clear as to why this happens as it's really hard to reproduce. The rotation policy implement by kAFS, however, doesn't differentiate between ETIME meaning we didn't get any response from the server and ETIME meaning the call got cancelled mid-flow. The latter leads to an oops when fetching data as the rotation partially resets the afs_read descriptor, which can result in a cleared page pointer being dereferenced because that page has already been filled. Handle this by the following means: (1) Set a flag on a call when we receive a packet for it. (2) Store the highest packet serial number so far received for a call (bearing in mind this may wrap). (3) If, when the "not received anything recently" timeout expires on a call, we've received at least one packet for a call and the connection as a whole has received packets more recently than that call, then cancel the call locally with ECONNRESET rather than ETIME. This indicates that the call was definitely in progress on the server. (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME, don't try the next server, but rather abort the call. This avoids the oops as we don't try to reuse the afs_read struct. Rather, as-yet ungotten pages will be reread at a later data. Also: (5) Add an rxrpc tracepoint to log detection of the call being reset. Without this, I occasionally see an oops like the following: general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:_copy_to_iter+0x204/0x310 RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206 RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560 RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000 RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400 R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958 R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560 FS: 00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0 Call Trace: skb_copy_datagram_iter+0x14e/0x289 rxrpc_recvmsg_data.isra.0+0x6f3/0xf68 ? trace_buffer_unlock_commit_regs+0x4f/0x89 rxrpc_kernel_recv_data+0x149/0x421 afs_extract_data+0x1e0/0x798 ? afs_wait_for_call_to_complete+0xc9/0x52e afs_deliver_fs_fetch_data+0x33a/0x5ab afs_deliver_to_call+0x1ee/0x5e0 ? afs_wait_for_call_to_complete+0xc9/0x52e afs_wait_for_call_to_complete+0x12b/0x52e ? wake_up_q+0x54/0x54 afs_make_call+0x287/0x462 ? afs_fs_fetch_data+0x3e6/0x3ed ? rcu_read_lock_sched_held+0x5d/0x63 afs_fs_fetch_data+0x3e6/0x3ed afs_fetch_data+0xbb/0x14a afs_readpages+0x317/0x40d __do_page_cache_readahead+0x203/0x2ba ? ondemand_readahead+0x3a7/0x3c1 ondemand_readahead+0x3a7/0x3c1 generic_file_buffered_read+0x18b/0x62f __vfs_read+0xdb/0xfe vfs_read+0xb2/0x137 ksys_read+0x50/0x8c do_syscall_64+0x7d/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Note the weird value in RDI which is a result of trying to kmap() a NULL page pointer. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Linus Torvalds 提交于
We already earlier discouraged people from using this interface in commit 88796e7e ("sched/swait: Document it clearly that the swait facilities are special and shouldn't be used"), but I just got a pull request with a new broken user. So make the comment *really* clear. The swait interfaces are bad, and should not be used unless you have some *very* strong reasons that include tons of hard performance numbers on just why you want to use them, and you show that you actually understand that they aren't at all like the normal wait/wakeup interfaces. So far, every single user has been suspect. The main user is KVM, which is completely pointless (there is only ever one waiter, which avoids the interface subtleties, but also means that having a queue instead of a pointer is counter-productive and certainly not an "optimization"). So make the comments much stronger. Not that anybody likely reads them anyway, but there's always some slight hope that it will cause somebody to think twice. I'd like to remove this interface entirely, but there is the theoretical possibility that it's actually the right thing to use in some situation, most likely some deep RT use. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Kubecek 提交于
Some of the code paths calculating flow hash for IPv6 use flowlabel member of struct flowi6 which, despite its name, encodes both flow label and traffic class. If traffic class changes within a TCP connection (as e.g. ssh does), ECMP route can switch between path. It's also inconsistent with other code paths where ip6_flowlabel() (returning only flow label) is used to feed the key. Use only flow label everywhere, including one place where hash key is set using ip6_flowinfo(). Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)") Fixes: f70ea018 ("net: Add functions to get skb->hash based on flow structures") Signed-off-by: NMichal Kubecek <mkubecek@suse.cz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
This reverts commit 87ae68c8. Applied the wrong version of this fix, correct version coming up. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Michal Kubecek 提交于
Some of the code paths calculating flow hash for IPv6 use flowlabel member of struct flowi6 which, despite its name, encodes both flow label and traffic class. If traffic class changes within a TCP connection (as e.g. ssh does), ECMP route can switch between path. It's also incosistent with other code paths where ip6_flowlabel() (returning only flow label) is used to feed the key. Use only flow label everywhere, including one place where hash key is set using ip6_flowinfo(). Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)") Fixes: f70ea018 ("net: Add functions to get skb->hash based on flow structures") Signed-off-by: NMichal Kubecek <mkubecek@suse.cz> Reviewed-by: NIdo Schimmel <idosch@mellanox.com> Tested-by: NIdo Schimmel <idosch@mellanox.com> Acked-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 6月, 2018 3 次提交
-
-
由 Björn Töpel 提交于
Currently, AF_XDP only supports a fixed frame-size memory scheme where each frame is referenced via an index (idx). A user passes the frame index to the kernel, and the kernel acts upon the data. Some NICs, however, do not have a fixed frame-size model, instead they have a model where a memory window is passed to the hardware and multiple frames are filled into that window (referred to as the "type-writer" model). By changing the descriptor format from the current frame index addressing scheme, AF_XDP can in the future be extended to support these kinds of NICs. In the index-based model, an idx refers to a frame of size frame_size. Addressing a frame in the UMEM is done by offseting the UMEM starting address by a global offset, idx * frame_size + offset. Communicating via the fill- and completion-rings are done by means of idx. In this commit, the idx is removed in favor of an address (addr), which is a relative address ranging over the UMEM. To convert an idx-based address to the new addr is simply: addr = idx * frame_size + offset. We also stop referring to the UMEM "frame" as a frame. Instead it is simply called a chunk. To transfer ownership of a chunk to the kernel, the addr of the chunk is passed in the fill-ring. Note, that the kernel will mask addr to make it chunk aligned, so there is no need for userspace to do that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or 3000 to the fill-ring will refer to the same chunk. On the completion-ring, the addr will match that of the Tx descriptor, passed to the kernel. Changing the descriptor format to use chunks/addr will allow for future changes to move to a type-writer based model, where multiple frames can reside in one chunk. In this model passing one single chunk into the fill-ring, would potentially result in multiple Rx descriptors. This commit changes the uapi of AF_XDP sockets, and updates the documentation. Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 David Ahern 提交于
As Michal noted the flow struct takes both the flow label and priority. Update the bpf_fib_lookup API to note that it is flowinfo and not just the flow label. Cc: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Yonghong Song 提交于
bpf has been used extensively for tracing. For example, bcc contains an almost full set of bpf-based tools to trace kernel and user functions/events. Most tracing tools are currently either filtered based on pid or system-wide. Containers have been used quite extensively in industry and cgroup is often used together to provide resource isolation and protection. Several processes may run inside the same container. It is often desirable to get container-level tracing results as well, e.g. syscall count, function count, I/O activity, etc. This patch implements a new helper, bpf_get_current_cgroup_id(), which will return cgroup id based on the cgroup within which the current task is running. The later patch will provide an example to show that userspace can get the same cgroup id so it could configure a filter or policy in the bpf program based on task cgroup id. The helper is currently implemented for tracing. It can be added to other program types as well when needed. Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NYonghong Song <yhs@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 03 6月, 2018 6 次提交
-
-
由 Jesper Dangaard Brouer 提交于
Removing XDP_XMIT_FLAGS_NONE as all driver now implement a flush operation in their ndo_xdp_xmit call. The compiler will catch if any users of XDP_XMIT_FLAGS_NONE remains. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Jesper Dangaard Brouer 提交于
This patch only change the API and reject any use of flags. This is an intermediate step that allows us to implement the flush flag operation later, for each individual driver in a separate patch. The plan is to implement flush operation via XDP_XMIT_FLUSH flag and then remove XDP_XMIT_FLAGS_NONE when done. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT program type in test_verifier report the following errors on x86_32: 172/p unpriv: spill/fill of different pointers ldx FAIL Unexpected error message! 0: (bf) r6 = r10 1: (07) r6 += -8 2: (15) if r1 == 0x0 goto pc+3 R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1 3: (bf) r2 = r10 4: (07) r2 += -76 5: (7b) *(u64 *)(r6 +0) = r2 6: (55) if r1 != 0x0 goto pc+1 R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp 7: (7b) *(u64 *)(r6 +0) = r1 8: (79) r1 = *(u64 *)(r6 +0) 9: (79) r1 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 378/p check bpf_perf_event_data->sample_period byte load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (71) r0 = *(u8 *)(r1 +68) invalid bpf_context access off=68 size=1 379/p check bpf_perf_event_data->sample_period half load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (69) r0 = *(u16 *)(r1 +68) invalid bpf_context access off=68 size=2 380/p check bpf_perf_event_data->sample_period word load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (61) r0 = *(u32 *)(r1 +68) invalid bpf_context access off=68 size=4 381/p check bpf_perf_event_data->sample_period dword load permitted FAIL Failed to load prog 'Permission denied'! 0: (b7) r0 = 0 1: (79) r0 = *(u64 *)(r1 +68) invalid bpf_context access off=68 size=8 Reason is that struct pt_regs on x86_32 doesn't fully align to 8 byte boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok() will then bail out saying that off & (size_default - 1) which is 68 & 7 doesn't cleanly align in the case of sample_period access from struct bpf_perf_event_data, hence verifier wrongly thinks we might be doing an unaligned access here though underlying arch can handle it just fine. Therefore adjust this down to machine size and check and rewrite the offset for narrow access on that basis. We also need to fix corresponding pe_prog_is_valid_access(), since we hit the check for off % size != 0 (e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs for tracing work on x86_32. Reported-by: NWang YanQing <udknight@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Tested-by: NWang YanQing <udknight@gmail.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Since the remaining bits are not filled in struct bpf_tunnel_key resp. struct bpf_xfrm_state and originate from uninitialized stack space, we should make sure to clear them before handing control back to the program. Also add a padding element to struct bpf_xfrm_state for future use similar as we have in struct bpf_tunnel_key and clear it as well. struct bpf_xfrm_state { __u32 reqid; /* 0 4 */ __u32 spi; /* 4 4 */ __u16 family; /* 8 2 */ /* XXX 2 bytes hole, try to pack */ union { __u32 remote_ipv4; /* 4 */ __u32 remote_ipv6[4]; /* 16 */ }; /* 12 16 */ /* size: 28, cachelines: 1, members: 4 */ /* sum members: 26, holes: 1, sum holes: 2 */ /* last cacheline: 28 bytes */ }; Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NSong Liu <songliubraving@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Daniel Borkmann 提交于
Add a new bpf_skb_cgroup_id() helper that allows to retrieve the cgroup id from the skb's socket. This is useful in particular to enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in cgroup v2 by allowing ID based matching on egress. This can in particular be used in combination with applying policy e.g. from map lookups, and also complements the older bpf_skb_under_cgroup() interface. In user space the cgroup id for a given path can be retrieved through the f_handle as demonstrated in [0] recently. [0] https://lkml.org/lkml/2018/5/22/1190Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-