- 27 March 2018, 9 commits
-
-
Committed by Inbar Karmy
Implement set/get functions to configure the PFC stall prevention timeout via the tunables API in ethtool. By default the stall prevention timeout is configured to 8 sec. The timeout range is 80-8000 msec. Enabling stall prevention with the auto timeout will set the timeout to 100 msec. Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Inbar Karmy
In the event that the device unexpectedly becomes unresponsive for a long period of time, the flow control mechanism may propagate pause frames which will cause congestion to spread to the entire network. To prevent this scenario, when the device is stalled for a period longer than a pre-configured timeout, flow control mechanisms are automatically disabled. This patch adds support for ETHTOOL_PFC_STALL_PREVENTION as a tunable. This API provides support for configuring the flow control storm prevention timeout (msec). Signed-off-by: Inbar Karmy <inbark@mellanox.com> Cc: Michal Kubecek <mkubecek@suse.cz> Cc: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
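A minimal user-space sketch of driving this tunable through the generic SIOCETHTOOL ioctl; the tunable id ETHTOOL_PFC_PREVENTION_TOUT is an assumption here, so check your kernel's linux/ethtool.h for the identifier it actually exposes:

```c
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Set the PFC storm prevention timeout (msec) on one interface.
 * ETHTOOL_PFC_PREVENTION_TOUT is assumed to be the tunable id this
 * series adds; a value of 100 matches the "auto" behaviour described
 * above. */
static int set_pfc_prevention_tout(const char *ifname, __u16 timeout_ms)
{
	struct {
		struct ethtool_tunable hdr;
		__u16 timeout;	/* payload lives right after the header */
	} req = { 0 };
	struct ifreq ifr = { 0 };
	int fd, ret;

	req.hdr.cmd = ETHTOOL_STUNABLE;
	req.hdr.id = ETHTOOL_PFC_PREVENTION_TOUT;	/* assumed id */
	req.hdr.type_id = ETHTOOL_TUNABLE_U16;
	req.hdr.len = sizeof(req.timeout);
	req.timeout = timeout_ms;

	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&req;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;
	ret = ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
	return ret;
}
```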
-
Committed by Inbar Karmy
Add the needed capability bit and counters to the device spec description. Expose the following two counters in ethtool: tx_pause_storm_warning_events: when the device is stalled for a period longer than a pre-configured watermark, the counter increases, giving a debug utility insight into the current device status. tx_pause_storm_error_events: when the device is stalled for a period longer than a pre-configured timeout, pause transmission is disabled and the counter increases. Signed-off-by: Inbar Karmy <inbark@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Committed by Yuval Mintz
Since ipmr and ip6mr are using the same mr_mfc struct at their core, we can now refactor the ipmr_cache_{hold,put} logic and apply refcounting to both ipmr and ip6mr. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuval Mintz
Add the ability to discern whether a given FIB rule notification relates to the default rule inserted when registering ip6mr or to a different one. This would later be used by drivers wishing to offload IPv6 multicast routes but unable to offload rules other than the default one. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuval Mintz
In similar fashion to ipmr, support fib notifications for ip6mr mfc and vif related events. This would later allow drivers to react to said notifications and offload the IPv6 mroutes. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuval Mintz
Since all the primitive elements used for the notification done by ipmr are now common [mr_table, mr_mfc, vif_device], we can refactor the logic for dumping them to a common file. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuval Mintz
Like the vif notifications, move the notifier struct for MFC, as well as its helpers, into a common file; currently they're only used by ipmr. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Yuval Mintz
The fib-notifiers are tightly coupled with the vif_device, which is already common. Move the notifier struct definition and helpers to the common file; currently they're only used by ipmr. Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 26 March 2018, 2 commits
-
-
Committed by Kirill Tkhai
The last user is gone after bdf5bd7f ("rds: tcp: remove register_netdevice_notifier infrastructure."), so we can remove this netdevice command. This allows us to delete rtnl_lock() in netdev_run_todo(), which is a hot path for net namespace unregistration. dev_change_net_namespace() and netdev_wait_allrefs() have rcu_barrier() before the NETDEV_UNREGISTER_FINAL call, and the source commits say they were introduced to delimit the call from NETDEV_UNREGISTER, but this patch leaves them in place, since additional analysis is required on whether they are needed for something else. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Kirill Tkhai
This patch is preparation for dropping NETDEV_UNREGISTER_FINAL. Since the cmd is used in usnic_ib_netdev_event_to_string() to get the cmd name, after plainly removing NETDEV_UNREGISTER_FINAL from everywhere we'd have holes in event2str[] in this function. Instead, let's make the NETDEV_XXX command names available to everyone, and define netdev_cmd_to_name() in a way that won't require shuffling names when their numbers change. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
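For reference, one way to satisfy that requirement is the stringifying-macro pattern sketched below; the event list is abbreviated and the exact in-tree shape may differ:

```c
/* Map NETDEV_* commands to their names without a positional table,
 * so renumbering the enum cannot leave holes. Abbreviated sketch. */
const char *netdev_cmd_to_name(unsigned long cmd)
{
#define N(val)					\
	case NETDEV_##val:			\
		return "NETDEV_" #val;

	switch (cmd) {
	N(UP) N(DOWN) N(REBOOT) N(CHANGE) N(REGISTER) N(UNREGISTER)
	N(CHANGEMTU) N(CHANGEADDR) N(GOING_DOWN) N(CHANGENAME)
	/* ... remaining events elided ... */
	}
#undef N
	return "UNKNOWN_NETDEV_EVENT";
}
```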
-
- 24 March 2018, 7 commits
-
-
Committed by Davide Caratti
tcf_idr_cleanup() is no longer used, so remove it. Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jon Maloy
We add a 128-bit node identity as an alternative to the currently used 32-bit node address. For the sake of compatibility, and to minimize message header changes, we retain the existing 32-bit address field. When not set explicitly by the user, this field will be filled with a hash value generated from the much longer node identity and used as a shorthand value for the latter. We permit either the address or the identity to be set by configuration, but not both, so when the address value is set by a legacy user the corresponding 128-bit node identity is generated based on that value. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
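Purely to illustrate the shorthand idea (this is not TIPC's actual hash function), deriving the 32-bit address from the 128-bit identity could look like:

```c
#include <stdint.h>

/* Illustration only: derive a 32-bit shorthand address from a 128-bit
 * node identity. TIPC's real hash differs; treating 0 as "unset" is
 * also an assumption of this sketch. */
static uint32_t node_id_to_addr(const uint8_t id[16])
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */

	for (int i = 0; i < 16; i++) {
		h ^= id[i];
		h *= 16777619u;		/* FNV-1a prime */
	}
	return h ? h : 1;		/* keep 0 free to mean "unset" */
}
```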
-
Committed by Dave Watson
Add the rx path for the tls software implementation: recvmsg, splice_read, and poll are implemented. An additional sockopt TLS_RX is added, with the same interface as TLS_TX. Either TLS_RX or TLS_TX may be provided separately, or together (with two different setsockopt calls with appropriate keys). Control messages are passed via CMSG in a similar way to transmit. If no cmsg buffer is passed, then only application data records will be passed to userspace, and EIO is returned for other types of alerts. EBADMSG is returned for decryption errors, EMSGSIZE for framing that is too big, and EBADMSG for framing that is too small (matching openssl semantics). EINVAL is returned for TLS versions that do not match the original setsockopt call. All are unrecoverable. strparser is used to parse TLS framing. Decryption is done directly into userspace buffers if they are large enough to support it; otherwise sk_cow_data is called (similar to ipsec) and buffers are decrypted in place and copied. splice_read always decrypts in place, since no buffers are provided to decrypt into. sk_poll is overridden, and only returns POLLIN if a full TLS message is received; otherwise we wait for strparser to finish reading a full frame. Actual decryption is only done during recvmsg or splice_read calls. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
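A hedged sketch of enabling the new receive path next to transmit on a connected TCP socket; the key material is left as zeroed placeholders, and SOL_TLS/TCP_ULP are defined locally in case the libc headers predate them:

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282	/* assumption: kernel value if libc lacks it */
#endif
#ifndef TCP_ULP
#define TCP_ULP 31	/* assumption: kernel value if libc lacks it */
#endif

/* Enable kernel TLS in both directions on an already-connected TCP
 * socket. Real key/IV/record-sequence material must come from the TLS
 * handshake; here the fields are left zeroed as placeholders. */
static int enable_ktls_rx_tx(int sock)
{
	struct tls12_crypto_info_aes_gcm_128 ci;

	if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")))
		return -1;

	memset(&ci, 0, sizeof(ci));	/* placeholder key material */
	ci.info.version = TLS_1_2_VERSION;
	ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;

	if (setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci)))
		return -1;
	/* New with this series: the same interface, receive direction. */
	return setsockopt(sock, SOL_TLS, TLS_RX, &ci, sizeof(ci));
}
```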
-
Committed by Dave Watson
Several config variables are prefixed with tx; drop the prefix, since these will be used for both tx and rx. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Dave Watson
Pass EBADMSG explicitly to tls_err_abort. The receive path will pass additional codes: EMSGSIZE if framing is larger than the max TLS record size, and EINVAL on a TLS version mismatch. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Dave Watson
Move the tx crypto parameters into a separate cipher_context struct. The same parameters will be used for rx using the same struct. tls_advance_record_sn is modified to take only the cipher info. Signed-off-by: Dave Watson <davejwatson@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by David Ahern
The earlier change missed the path where CONFIG_NET_DEVLINK is disabled. Thanks to Jiri for spotting it. Fixes: 14530746 ("devlink: Remove top_hierarchy arg to devlink_resource_register") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 23 March 2018, 8 commits
-
-
Committed by Daniel Vacek
This reverts commit b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible"). The commit is meant to be a boot-init speedup, skipping the loop in memmap_init_zone() for invalid pfns. But given some specific memory mapping on x86_64 (or more generally, theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the implementation also skips valid pfns, which is plain wrong and causes 'kernel BUG at mm/page_alloc.c:1389!':

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010: move_freepages+0x15e/0x160
    --
    Call Trace:
      move_freepages_block+0x73/0x80
      __rmqueue+0x263/0x460
      get_page_from_freelist+0x7e1/0x9e0
      __alloc_pages_nodemask+0x176/0x420
    --

    crash> page_init_bug -v | grep RAM
    <struct resource 0xffff88067fffd2f8>       1000 -     9bfff  System RAM (620.00 KiB)
    <struct resource 0xffff88067fffd3a0>     100000 -  430bffff  System RAM (  1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    <struct resource 0xffff88067fffd410>   4b0c8000 -  4bf9cfff  System RAM ( 14.83 MiB = 15188.00 KiB)
    <struct resource 0xffff88067fffd480>   4bfac000 -  646b1fff  System RAM (391.02 MiB = 400408.00 KiB)
    <struct resource 0xffff88067fffd560>   7b788000 -  7b7fffff  System RAM (480.00 KiB)
    <struct resource 0xffff88067fffd640>  100000000 - 67fffffff  System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    <struct resource 0xffff88067fffd560>   7b788000 -  7b7fffff  System RAM (480.00 KiB)
    <struct page 0xffffea0001ede200> 1fffff00000000 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
    <struct page 0xffffea0001ede200> 505736 505344 <struct page 0xffffea0001ed8000> 505855 <struct page 0xffffea0001edffc0>
    <struct page 0xffffea0001ed8000> 0 0 <struct pglist_data 0xffff88047ffd9000> 0 <struct zone 0xffff88047ffd9000> DMA 1 4095
    <struct page 0xffffea0001edffc0> 1fffff00000400 0 <struct pglist_data 0xffff88047ffd9000> 1 <struct zone 0xffff88047ffd9800> DMA32 4096 1048575
    BUG, zones differ!

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
          PAGE       PHYSICAL  MAPPING  INDEX  CNT  FLAGS
    ffffea0001e00000  78000000        0      0    0  0
    ffffea0001ed7fc0  7b5ff000        0      0    0  0
    ffffea0001ed8000  7b600000        0      0    0  0  <<<<
    ffffea0001ede1c0  7b787000        0      0    0  0
    ffffea0001ede200  7b788000        0      0    1  1fffff00000000

Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com Fixes: b92df1de ("mm: page_alloc: skip over regions of invalid pfns where possible") Signed-off-by: Daniel Vacek <neelx@redhat.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Toshi Kani
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may create pud/pmd mappings. A kernel panic was observed on arm64 systems with Cortex-A75 in the following steps, as described by Hanjun Guo: 1. ioremap a 4K size; a valid page table will be built. 2. iounmap it; pte0 will be set to 0. 3. ioremap the same address with a 2M size; pgd/pmd is unchanged, then a new value is set for the pmd. 4. pte0 is leaked. 5. The CPU may take an exception because the old pmd is still in the TLB, which will lead to a kernel panic. This panic is not reproducible on x86: INVLPG, called from iounmap, purges all levels of entries associated with the purged address on x86. x86 still has the memory leak. The patch changes the ioremap path to free the unmapped page table(s), since doing so in the unmap path has the following issues: - The iounmap() path is shared with vunmap(). Since vmap() only supports pte mappings, making vunmap() free a pte page is an overhead for regular vmap users, as they do not need a pte page freed up. - Checking if all entries in a pte page are cleared in the unmap path is racy, and serializing this check is expensive. - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges. Clearing a pud/pmd entry before the lazy TLB purges needs an extra TLB purge. Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which clear a given pud/pmd entry and free up a page for the lower level entries. This patch implements their stub functions on x86 and arm64, which work as a workaround. [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub] Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings") Reported-by: Lei Li <lious.lilei@hisilicon.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Wang Xuefeng <wxf.wang@hisilicon.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Borislav Petkov <bp@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Chintan Pandya <cpandya@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
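A hedged sketch of what one of the new hooks does, modeled on the x86 stub described above; the in-tree x86/arm64 versions handle more corner cases:

```c
/* Clear a pmd entry that ioremap() wants to replace with a huge
 * mapping, and free the pte page it pointed to. Sketch only. */
int pmd_free_pte_page(pmd_t *pmd)
{
	pte_t *pte;

	if (pmd_none(*pmd))
		return 1;	/* nothing mapped: safe to install a huge entry */

	pte = (pte_t *)pmd_page_vaddr(*pmd);
	pmd_clear(pmd);			/* unhook the pte page... */
	free_page((unsigned long)pte);	/* ...then release it */

	return 1;	/* caller may now set up the pmd mapping */
}
```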
-
Committed by Kirill Tkhai
Since ra_chain is per-net, we may use a per-net mutex to protect it in ip_ra_control(). This improves scalability. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Kirill Tkhai
This is an optimization which makes ip_call_ra_chain() iterate over fewer sockets to find the ones it's looking for. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Subash Abhinov Kasiviswanathan
Define new netlink attributes for the rmnet mux_id and flags. These flags/mux_id were earlier using the vlan flags/id respectively. The flag bits are also moved to uapi and renamed with the prefix RMNET_FLAG_*. Also add the rmnet policy to handle the new netlink attributes. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by GhantaKrishnamurthy MohanKrishna
Currently, when tipc is unable to queue a received message on a socket, the message is rejected back to the sender with the error TIPC_ERR_OVERLOAD. However, the application on this socket has no knowledge of these discards. In this commit, we try to step the sk_drops counter when tipc is unable to queue a received message, and export sk_drops using tipc socket diagnostics. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by GhantaKrishnamurthy MohanKrishna
This commit adds socket diagnostics capability for AF_TIPC in the netlink family NETLINK_SOCK_DIAG, in a new kernel module (diag.ko). The following are key design considerations: - config TIPC_DIAG defaults to y, like INET_DIAG. - only requests with the NLM_F_DUMP flag are supported (dump all). - a tipc_sock_diag_req message is introduced to send filter parameters. - the response attributes are TLVs, some nested. To avoid exposing data structures between the diag and tipc modules and to avoid code duplication, the following additions are required: - export the tipc_nl_sk_walk function to reuse the socket iterator. - export tipc_sk_fill_sock_diag to fill the tipc diag attributes. - create a sock_diag response message in __tipc_add_sock_diag, defined in diag.c, and use the above exported tipc_sk_fill_sock_diag to fill the response. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by David Ahern
The top_hierarchy arg can be determined by comparing parent_resource_id to DEVLINK_RESOURCE_ID_PARENT_TOP, so it does not need to be a separate argument. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 March 2018, 2 commits
-
-
Committed by Kevin Hao
For some phy devices, even though they don't support the MMD extended register access, reading/writing the MMD registers via the indirect method does have side effects. So introduce general dummy stubs for MMD register access which these devices can use to avoid such side effects. Fixes: b6b5e8a6 ("gianfar: Disable EEE autoneg by default") Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
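The stubs are presumably as small as they sound; a sketch matching the described intent, with names assumed from the genphy_* convention:

```c
#include <linux/errno.h>
#include <linux/phy.h>

/* Dummy MMD accessors for PHYs without MMD support: report
 * -EOPNOTSUPP instead of poking indirect registers that may have
 * side effects on such devices. */
int genphy_read_mmd_unsupported(struct phy_device *phydev, int devad,
				u16 regnum)
{
	return -EOPNOTSUPP;
}

int genphy_write_mmd_unsupported(struct phy_device *phydev, int devad,
				 u16 regnum, u16 val)
{
	return -EOPNOTSUPP;
}
```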
-
Committed by Christian Brauner
This commit adds struct uevent_sock to struct net. Since struct uevent_sock records the position of the uevent socket in the uevent socket list, we can trivially remove it from the uevent socket list during cleanup. This speeds up the old removal codepath. Note, list_del() will hit __list_del_entry_valid() in its call chain, which will validate that the element is a member of the list. If it isn't, it will take care that the list is not modified. Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 21 March 2018, 1 commit
-
-
Committed by Ben Caradoc-Davies
Commit 7b6ddeaf ("mac80211: use QoS NDP for AP probing") added an argument qos_ok to ieee80211_nullfunc_get to support QoS NDP. Despite the claim in the commit log "Change all the drivers to *not* allow QoS NDP for now, even though it looks like most of them should be OK with that", this commit enables QoS NDP in response to beacons (see the change to mlme.c:ieee80211_send_nullfunc), causing ath9k_htc to lose IP connectivity. See: https://patchwork.kernel.org/patch/10241109/ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=891060 Introduce a hardware flag to allow such buggy drivers to override the correct default behaviour of mac80211 of sending QoS NDP packets. Signed-off-by: Ben Caradoc-Davies <ben@transient.nz> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
-
- 20 March 2018, 9 commits
-
-
Committed by John Fastabend
Currently, if a bpf sk msg program is run, the program can only parse data that the (start, end) pointers have already consumed. For sendmsg hooks this is likely the first scatterlist element. For sendpage this will be the range (0,0), because the data is shared with userspace and by default we want to avoid allowing userspace to modify data while (or after) the BPF verdict is being decided. To support pulling in additional bytes for parsing, use a new helper bpf_sk_msg_pull(start, end, flags), which works similarly to the cls tc logic. This helper will attempt to point the data start pointer at 'start' bytes offset into the msg and the data end pointer at 'end' bytes offset into the message. After basic sanity checks to ensure 'start' <= 'end' and 'end' <= msg_length, there are a few cases we need to handle. First, the sendmsg hook has already copied the data from userspace and has exclusive access to it. Therefore, it is not necessary to copy the data; however, it may be required. After finding the scatterlist element with the 'start' offset byte in it, there are two cases. In one, the range (start, end) is entirely contained in the sg element and is already linear. All that is needed is to update the data pointers; no allocate/copy is needed. The other case is that (start, end) crosses sg element boundaries. In this case we allocate a block of size 'end - start' and copy the data to linearize it. Next, the sendpage hook has not copied any data in its initial state, so the data pointers are (0,0). In this case we handle it similarly to the above sendmsg case, except the allocation/copy must always happen. Then, when sending the data, we have possibly three memory regions that need to be sent: (0, start - 1), (start, end), and (end + 1, msg_length). This is required to ensure any writes by the BPF program are correctly transmitted. Lastly, this operation will invalidate any previous data checks, so BPF programs will have to revalidate pointers after making this BPF call. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
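A hedged sketch of a verdict program using the pull helper; the helper name bpf_msg_pull_data() is assumed for the merged form of the bpf_sk_msg_pull() call described above:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sk_msg")
int msg_pull_header(struct sk_msg_md *msg)
{
	/* Ask the infrastructure to make bytes [0, 10) linear and
	 * addressable through msg->data / msg->data_end. */
	if (bpf_msg_pull_data(msg, 0, 10, 0))
		return SK_DROP;

	void *data = (void *)(long)msg->data;
	void *data_end = (void *)(long)msg->data_end;

	/* The pull invalidates earlier bounds checks, so revalidate
	 * before touching the payload. */
	if (data + 10 > data_end)
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```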
-
Committed by John Fastabend
In the case where we need a specific number of bytes before a verdict can be assigned, even if the data spans multiple sendmsg or sendfile calls, the BPF program may use msg_cork_bytes(). The extreme case is a user who calls sendmsg repeatedly with 1-byte msg segments. Obviously, this is bad for performance but is still valid. If the BPF program needs N bytes to validate a header, it can use msg_cork_bytes to specify N bytes, and the BPF program will not be called again until N bytes have been accumulated. The infrastructure will attempt to coalesce data if possible, so in many cases (most of my use cases at least) the data will be in a single scatterlist element with data pointers pointing to the start/end of the element. However, this is dependent on available memory, so it is not guaranteed. BPF programs must therefore validate data pointer ranges, but this is the case anyway to convince the verifier the accesses are valid. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
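For example, a sketch (under assumed names) of a program that refuses to give a verdict until an 8-byte header has accumulated:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sk_msg")
int msg_wait_for_header(struct sk_msg_md *msg)
{
	void *data = (void *)(long)msg->data;
	void *data_end = (void *)(long)msg->data_end;

	/* Less than a full header so far: cork until 8 bytes have
	 * accumulated; the program is re-run once they have. */
	if (data + 8 > data_end) {
		bpf_msg_cork_bytes(msg, 8);
		return SK_PASS;
	}

	/* Full header visible; a real program would parse it here. */
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```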
-
Committed by John Fastabend
A single sendmsg or sendfile system call can contain multiple logical messages that a BPF program may want to read and apply a verdict to. But without an apply_bytes helper, any verdict on the data applies to all bytes in the sendmsg/sendfile. Alternatively, a BPF program may only care to read the first N bytes of a msg. If the payload is large, say MBs or even GBs, setting up and calling the BPF program repeatedly for all bytes, even though the verdict is already known, creates unnecessary overhead. To allow BPF programs to control how many bytes a given verdict applies to, we implement a bpf_msg_apply_bytes() helper. When called from within a BPF program, this sets a counter, internal to the BPF infrastructure, that applies the last verdict to the next N bytes. If N is smaller than the current data being processed from a sendmsg/sendfile call, the first N bytes will be sent and the BPF program will be re-run with start_data pointing to the N+1 byte. If N is larger than the current data being processed, the BPF verdict will be applied to multiple sendmsg/sendfile calls until N bytes are consumed. Note: if a socket closes with the apply_bytes counter non-zero, this is not a problem, because data is not being buffered for N bytes and is sent as it is received. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
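A sketch of the intended usage: once the first bytes are enough to decide, stretch the verdict over the next chunk of the stream so the program is not re-run for every byte. The 4096-byte figure is an arbitrary example:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("sk_msg")
int msg_verdict_once(struct sk_msg_md *msg)
{
	/* Apply this SK_PASS verdict to the next 4096 bytes, across
	 * however many sendmsg/sendfile calls it takes to consume them. */
	bpf_msg_apply_bytes(msg, 4096);
	return SK_PASS;
}

char _license[] SEC("license") = "GPL";
```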
-
Committed by John Fastabend
This implements a BPF ULP layer to allow policy enforcement and monitoring at the socket layer. In order to support this, a new program type BPF_PROG_TYPE_SK_MSG is used to run the policy at the sendmsg/sendpage hook. To attach the policy to sockets, a sockmap is used with a new program attach type BPF_SK_MSG_VERDICT. Similar to previous sockmap usages, when a sock is added to a sockmap via a map update, if the map has a BPF_SK_MSG_VERDICT program type attached, then the BPF ULP layer is created on the socket and the attached BPF_PROG_TYPE_SK_MSG program is run for every msg in the sendmsg case and page/offset in the sendpage case.

BPF_PROG_TYPE_SK_MSG semantics/API: BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and SK_DROP. Returning SK_DROP frees the copied data in the sendmsg case and in the sendpage case leaves the data untouched. Both cases return -EACCES to the user. Returning SK_PASS will allow the msg to be sent.

In the sendmsg case, data is copied into kernel space buffers before running the BPF program. The kernel space buffers are stored in a scatterlist object where each element is a kernel memory buffer. Some effort is made to coalesce data from the sendmsg call here. For example, a sendmsg call with many one byte iov entries will likely be pushed into a single entry. The BPF program is run with data pointers (start/end) pointing to the first sg element.

In the sendpage case, data is not copied. We opt not to copy the data by default here, because the BPF infrastructure does not know what bytes will be needed nor when they will be needed, so copying all bytes may be wasteful. Because of this, the initial start/end data pointers are (0,0), meaning no data can be read or written. This avoids reading data that may be modified by the user. A new helper is added later in this series if reading and writing the data is needed. The helper call will do a copy by default, so that the page is exclusively owned by the BPF call.

The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg in the sendmsg() case and the entire page/offset in the sendpage case. This avoids ambiguity on how to handle mixed return codes in the sendmsg case. Again, a helper is added later in the series if a verdict needs to apply to multiple system calls and/or only a subpart of the currently processed message.

The helper msg_redirect_map() can be used to select the socket to send the data on. This is used similarly to existing redirect use cases and allows policy to redirect msgs.

A simple pseudo-code example: the basic logic to attach a program to a socket is as follows,

    // load the programs
    bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
                  &obj, &msg_prog);
    // lookup the sockmap
    bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
    // get fd for sockmap
    map_fd_msg = bpf_map__fd(bpf_map_msg);
    // attach program to sockmap
    bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);

Adding sockets to the map is done in the normal way,

    // Add a socket 'fd' to sockmap at location 'i'
    bpf_map_update_elem(map_fd_msg, &i, fd, BPF_ANY);

After the above, any socket attached to "my_sock_map", in this case 'fd', will run the BPF msg verdict program (msg_prog) on every sendmsg and sendpage system call. For a complete example see BPF selftests or sockmap samples.

Implementation notes: it seemed simplest, to me at least, to use a refcnt to ensure the psock is not lost across the sendmsg copy into the sg, the BPF program running on the data in sg_data, and the final pass to the TCP stack. Some performance testing may show a better method to do this and avoid the refcnt cost, but for now use the simpler method.

Another item that will come after basic support is in place is supporting the MSG_MORE flag. At the moment we call sendpages even if the MSG_MORE flag is set. An enhancement would be to collect the pages into a larger scatterlist and pass it down the stack. Notice that bpf_tcp_sendmsg() could support this with some additional state saved across sendmsg calls. I built the code to support this without having to do refactoring work.

Other features TBD include ZEROCOPY and the TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow the initial series shortly. Future work could improve the size limits on the scatterlist rings used here. Currently, we use MAX_SKB_FRAGS simply because this was already being used in the TLS case. Future work could extend the kernel sk APIs to tune this depending on workload. This is a trade-off between memory usage and throughput performance.

Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Committed by John Fastabend
The current implementation of sk_alloc_sg expects the scatterlist to always start at entry 0 and complete at entry MAX_SKB_FRAGS. Future patches will want to support starting at an arbitrary offset into the scatterlist, so add an additional sg_start parameter and default to the current values in the TLS code paths. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Committed by John Fastabend
When calling do_tcp_sendpages() from within the kernel, where we know the data has no references from the user side, we can omit the SKBTX_SHARED_FRAG flag. This patch adds an internal flag, NO_SKBTX_SHARED_FRAG, that can be used to omit setting SKBTX_SHARED_FRAG. The flag is not exposed to userspace because the sendpage call from the splice logic masks out all bits except MSG_MORE. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Committed by John Fastabend
The TLS ULP module builds scatterlists from a sock using page_frag_refill(). This is going to be useful for other ULPs, so move it into the sock file for more general use. In the process, remove a useless goto at the end of the while loop. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Committed by Leon Romanovsky
The XRCD object is not implemented in restrack, so let's remove it. Fixes: 02d8883f ("RDMA/restrack: Add general infrastructure to track RDMA resources") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
-
Committed by Tejun Heo
percpu_ref internally uses sched-RCU to implement the percpu -> atomic mode switching, and the documentation suggested that this could be depended upon. This doesn't seem like a good idea. * percpu_ref uses sched-RCU, which has different grace periods than regular RCU. Users may combine percpu_ref with regular RCU usage and incorrectly believe that regular RCU grace periods are performed by percpu_ref. This can lead to, for example, use-after-free due to premature freeing. * percpu_ref has a grace period when switching from percpu to atomic mode. It doesn't have one between the last put and release. This distinction is subtle and can lead to surprising bugs. * percpu_ref allows starting in and switching to atomic mode manually for debugging and other purposes. This means that there may not be any grace periods from kill to release. This patch makes it clear that the grace periods are percpu_ref's internal implementation detail and can't be depended upon by the users. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Tejun Heo <tj@kernel.org>
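As an illustration (not from the patch): a release callback whose object is also dereferenced under rcu_read_lock() must arrange its own grace period rather than assume percpu_ref provides one:

```c
#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct obj {
	struct percpu_ref ref;
	struct rcu_head rcu;
	/* payload ... */
};

static void obj_release(struct percpu_ref *ref)
{
	struct obj *o = container_of(ref, struct obj, ref);

	/* Wrong: kfree(o) here would assume a grace period between the
	 * last percpu_ref_put() and this callback, which percpu_ref
	 * does not promise. */
	kfree_rcu(o, rcu);	/* provide an explicit RCU grace period */
}
```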
-
- 18 March 2018, 2 commits
-
-
Committed by Al Viro
use proc_remove_subtree() for subtree removal, both on setup failure halfway through and on teardown. No need to make simple things complex... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Committed by Jon Maloy
Publications for TIPC_CLUSTER_SCOPE and TIPC_ZONE_SCOPE are in all aspects handled the same way, both on the publishing node and on the receiving nodes. Despite previous ambitions to the contrary, this is never going to change, so we take the consequence of this and obsolete TIPC_ZONE_SCOPE and related macros/functions. Whenever a user does a bind() or a sendmsg() attempt using ZONE_SCOPE, we translate this internally to CLUSTER_SCOPE, while remaining compatible with users and remote nodes still using ZONE_SCOPE. Furthermore, the non-formalized scope value 0 has always been permitted for use during lookup, with the same meaning as ZONE_SCOPE/CLUSTER_SCOPE. We now permit it even as a binding scope, but for compatibility reasons we choose not to change the value of TIPC_CLUSTER_SCOPE. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
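For reference, a hedged user-space sketch of a bind() with CLUSTER scope; the service type and instance range are made-up values:

```c
#include <sys/socket.h>
#include <linux/tipc.h>

/* Bind a TIPC socket with CLUSTER scope; after this change,
 * TIPC_ZONE_SCOPE requests are treated the same way internally. */
static int bind_cluster_scope(int sd)
{
	struct sockaddr_tipc addr = {
		.family   = AF_TIPC,
		.addrtype = TIPC_ADDR_NAMESEQ,
		.scope    = TIPC_CLUSTER_SCOPE,
	};

	addr.addr.nameseq.type  = 1000;	/* hypothetical service type */
	addr.addr.nameseq.lower = 0;
	addr.addr.nameseq.upper = 99;

	return bind(sd, (struct sockaddr *)&addr, sizeof(addr));
}
```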
-