- 07 12月, 2019 2 次提交
-
-
由 Guillaume Nault 提交于
If no synflood happens for a long enough period of time, then the synflood timestamp isn't refreshed and jiffies can advance so much that time_after32() can't accurately compare them any more. Therefore, we can end up in a situation where time_after32(now, last_overflow + HZ) returns false, just because these two values are too far apart. In that case, the synflood timestamp isn't updated as it should be, which can trick tcp_synq_no_recent_overflow() into rejecting valid syncookies. For example, let's consider the following scenario on a system with HZ=1000: * The synflood timestamp is 0, either because that's the timestamp of the last synflood or, more commonly, because we're working with a freshly created socket. * We receive a new SYN, which triggers synflood protection. Let's say that this happens when jiffies == 2147484649 (that is, 'synflood timestamp' + HZ + 2^31 + 1). * Then tcp_synq_overflow() doesn't update the synflood timestamp, because time_after32(2147484649, 1000) returns false. With: - 2147484649: the value of jiffies, aka. 'now'. - 1000: the value of 'last_overflow' + HZ. * A bit later, we receive the ACK completing the 3WHS. But cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow() says that we're not under synflood. That's because time_after32(2147484649, 120000) returns false. With: - 2147484649: the value of jiffies, aka. 'now'. - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID. Of course, in reality jiffies would have increased a bit, but this condition will last for the next 119 seconds, which is far enough to accommodate for jiffie's growth. Fix this by updating the overflow timestamp whenever jiffies isn't within the [last_overflow, last_overflow + HZ] range. That shouldn't have any performance impact since the update still happens at most once per second. Now we're guaranteed to have fresh timestamps while under synflood, so tcp_synq_no_recent_overflow() can safely use it with time_after32() in such situations. Stale timestamps can still make tcp_synq_no_recent_overflow() return the wrong verdict when not under synflood. This will be handled in the next patch. For 64 bits architectures, the problem was introduced with the conversion of ->tw_ts_recent_stamp to 32 bits integer by commit cca9bab1 ("tcp: use monotonic timestamps for PAWS"). The problem has always been there on 32 bits architectures. Fixes: cca9bab1 ("tcp: use monotonic timestamps for PAWS") Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 John Hurley 提交于
With indirect blocks, a driver can register for callbacks from a device that is does not 'own', for example, a tunnel device. When registering to or unregistering from a new device, a callback is triggered to generate a bind/unbind event. This, in turn, allows the driver to receive any existing rules or to properly clean up installed rules. When first added, it was assumed that all indirect block registrations would be for ingress offloads. However, the NFP driver can, in some instances, support clsact qdisc binds for egress offload. Change the name of the indirect block callback command in flow_offload to remove the 'ingress' identifier from it. While this does not change functionality, a follow up patch will implement a more more generic callback than just those currently just supporting ingress offload. Fixes: 4d12ba42 ("nfp: flower: allow offloading of matches on 'internal' ports") Signed-off-by: NJohn Hurley <john.hurley@netronome.com> Acked-by: NJakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 12月, 2019 2 次提交
-
-
由 Sabrina Dubroca 提交于
ipv6_stub uses the ip6_dst_lookup function to allow other modules to perform IPv6 lookups. However, this function skips the XFRM layer entirely. All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the ip_route_output_key and ip_route_output helpers) for their IPv4 lookups, which calls xfrm_lookup_route(). This patch fixes this inconsistent behavior by switching the stub to ip6_dst_lookup_flow, which also calls xfrm_lookup_route(). This requires some changes in all the callers, as these two functions take different arguments and have different return types. Fixes: 5f81bd2e ("ipv6: export a stub for IPv6 symbols used by vxlan") Reported-by: NXiumei Mu <xmu@redhat.com> Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Sabrina Dubroca 提交于
This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow, as some modules currently pass a net argument without a socket to ip6_dst_lookup. This is equivalent to commit 343d60aa ("ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument"). Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 12月, 2019 1 次提交
-
-
由 Yoshiki Komachi 提交于
The recent commit 5c72299f ("net: sched: cls_flower: Classify packets using port ranges") had added filtering based on port ranges to tc flower. However the commit missed necessary changes in hw-offload code, so the feature gave rise to generating incorrect offloaded flow keys in NIC. One more detailed example is below: $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \ dst_port 100-200 action drop With the setup above, an exact match filter with dst_port == 0 will be installed in NIC by hw-offload. IOW, the NIC will have a rule which is equivalent to the following one. $ tc qdisc add dev eth0 ingress $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \ dst_port 0 action drop The behavior was caused by the flow dissector which extracts packet data into the flow key in the tc flower. More specifically, regardless of exact match or specified port ranges, fl_init_dissector() set the FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port numbers from skb in skb_flow_dissect() called by fl_classify(). Note that device drivers received the same struct flow_dissector object as used in skb_flow_dissect(). Thus, offloaded drivers could not identify which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was set to struct flow_dissector in either case. This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new tp_range field in struct fl_flow_key to recognize which filters are applied to offloaded drivers. At this point, when filters based on port ranges passed to drivers, drivers return the EOPNOTSUPP error because they do not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE flag). Fixes: 5c72299f ("net: sched: cls_flower: Classify packets using port ranges") Signed-off-by: NYoshiki Komachi <komachi.yoshiki@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 11月, 2019 2 次提交
-
-
由 Jakub Kicinski 提交于
Partially sent record cleanup path increments an SG entry directly instead of using sg_next(). This should not be a problem today, as encrypted messages should be always allocated as arrays. But given this is a cleanup path it's easy to miss was this ever to change. Use sg_next(), and simplify the code. Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: NSimon Horman <simon.horman@netronome.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jakub Kicinski 提交于
Looks like when BPF support was added by commit d3b18ad3 ("tls: add bpf support to sk_msg handling") and commit d829e9c4 ("tls: convert to generic sk_msg interface") it broke/removed the support for in-place crypto as added by commit 4e6d4720 ("tls: Add support for inplace records encryption"). The inplace_crypto member of struct tls_rec is dead, inited to zero, and sometimes set to zero again. It used to be set to 1 when record was allocated, but the skmsg code doesn't seem to have been written with the idea of in-place crypto in mind. Since non trivial effort is required to bring the feature back and we don't really have the HW to measure the benefit just remove the left over support for now to avoid confusing readers. Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: NSimon Horman <simon.horman@netronome.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 11月, 2019 2 次提交
-
-
由 Maciej Żenczykowski 提交于
Any argument outside of that range would result in an out of bound memory access, since the accessed array is 65536 bits long. Signed-off-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Maciej Żenczykowski 提交于
Note that the sysctl write accessor functions guarantee that: net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0] invariant is maintained, and as such the max() in selinux hooks is actually spurious. ie. even though if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) { per logic is the same as if ((snum < inet_prot_sock(sock_net(sk)) && snum < low) || snum > high) { it is actually functionally equivalent to: if (snum < low || snum > high) { which is equivalent to: if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) { even though the first clause is spurious. But we want to hold on to it in case we ever want to change what what inet_port_requires_bind_service() means (for example by changing it from a, by default, [0..1024) range to some sort of set). Test: builds, git 'grep inet_prot_sock' finds no other references Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 11月, 2019 1 次提交
-
-
由 Xin Long 提交于
This patch is to fix a data-race reported by syzbot: BUG: KCSAN: data-race in sctp_assoc_migrate / sctp_hash_obj write to 0xffff8880b67c0020 of 8 bytes by task 18908 on cpu 1: sctp_assoc_migrate+0x1a6/0x290 net/sctp/associola.c:1091 sctp_sock_migrate+0x8aa/0x9b0 net/sctp/socket.c:9465 sctp_accept+0x3c8/0x470 net/sctp/socket.c:4916 inet_accept+0x7f/0x360 net/ipv4/af_inet.c:734 __sys_accept4+0x224/0x430 net/socket.c:1754 __do_sys_accept net/socket.c:1795 [inline] __se_sys_accept net/socket.c:1792 [inline] __x64_sys_accept+0x4e/0x60 net/socket.c:1792 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9 read to 0xffff8880b67c0020 of 8 bytes by task 12003 on cpu 0: sctp_hash_obj+0x4f/0x2d0 net/sctp/input.c:894 rht_key_get_hash include/linux/rhashtable.h:133 [inline] rht_key_hashfn include/linux/rhashtable.h:159 [inline] rht_head_hashfn include/linux/rhashtable.h:174 [inline] head_hashfn lib/rhashtable.c:41 [inline] rhashtable_rehash_one lib/rhashtable.c:245 [inline] rhashtable_rehash_chain lib/rhashtable.c:276 [inline] rhashtable_rehash_table lib/rhashtable.c:316 [inline] rht_deferred_worker+0x468/0xab0 lib/rhashtable.c:420 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269 worker_thread+0xa0/0x800 kernel/workqueue.c:2415 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352 It was caused by rhashtable access asoc->base.sk when sctp_assoc_migrate is changing its value. However, what rhashtable wants is netns from asoc base.sk, and for an asoc, its netns won't change once set. So we can simply fix it by caching netns since created. Fixes: d6c0256a ("sctp: add the rhashtable apis for sctp global transport hashtable") Reported-by: syzbot+e3b35fe7918ff0ee474e@syzkaller.appspotmail.com Signed-off-by: NXin Long <lucien.xin@gmail.com> Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
-
- 23 11月, 2019 2 次提交
-
-
由 Maciej Żenczykowski 提交于
Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
-
由 Petr Machata 提交于
In commit a82055af ("netfilter: nft_payload: add VLAN offload support"), VLAN fields in struct flow_dissector_key_vlan were unionized with the intention of introducing another field that covered the whole TCI header. However without a wrapping struct the subfields end up sharing the same bits. As a result, "tc filter add ... flower vlan_id 14" specifies not only vlan_id, but also vlan_priority. Fix by wrapping the individual VLAN fields in a struct. Fixes: a82055af ("netfilter: nft_payload: add VLAN offload support") Signed-off-by: NPetr Machata <petrm@mellanox.com> Reviewed-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 11月, 2019 7 次提交
-
-
由 Toke Høiland-Jørgensen 提交于
The previous commit added the ability to throttle stations when they queue too much airtime in the hardware. This commit enables the functionality by calculating the expected airtime usage of each packet that is dequeued from the TXQs in mac80211, and accounting that as pending airtime. The estimated airtime for each skb is stored in the tx_info, so we can subtract the same amount from the running total when the skb is freed or recycled. The throttling mechanism relies on this accounting to be accurate (i.e., that we are not freeing skbs without subtracting any airtime they were accounted for), so we put the subtraction into ieee80211_report_used_skb(). As an optimisation, we also subtract the airtime on regular TX completion, zeroing out the value stored in the packet afterwards, to avoid having to do an expensive lookup of the station from the packet data on every packet. This patch does *not* include any mechanism to wake a throttled TXQ again, on the assumption that this will happen anyway as a side effect of whatever freed the skb (most commonly a TX completion). Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20191119060610.76681-5-kyan@google.comSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Kan Yan 提交于
In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate effectively to control excessive queueing latency, the CoDel algorithm requires an accurate measure of how long packets stays in the queue, AKA sojourn time. The sojourn time measured at the mac80211 layer doesn't include queueing latency in the lower layer (firmware/hardware) and CoDel expects lower layer to have a short queue. However, most 802.11ac chipsets offload tasks such TX aggregation to firmware or hardware, thus have a deep lower layer queue. Without a mechanism to control the lower layer queue size, packets only stay in mac80211 layer transiently before being sent to firmware queue. As a result, the sojourn time measured by CoDel in the mac80211 layer is almost always lower than the CoDel latency target, hence CoDel does little to control the latency, even when the lower layer queue causes excessive latency. The Byte Queue Limits (BQL) mechanism is commonly used to address the similar issue with wired network interface. However, this method cannot be applied directly to the wireless network interface. "Bytes" is not a suitable measure of queue depth in the wireless network, as the data rate can vary dramatically from station to station in the same network, from a few Mbps to over Gbps. This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work effectively with wireless drivers that utilized firmware/hardware offloading. AQL allows each txq to release just enough packets to the lower layer to form 1-2 large aggregations to keep hardware fully utilized and retains the rest of the frames in mac80211 layer to be controlled by the CoDel algorithm. Signed-off-by: NKan Yan <kyan@google.com> [ Toke: Keep API to set pending airtime internal, fix nits in commit msg ] Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.comSigned-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Toke Høiland-Jørgensen 提交于
Felix recently added code to calculate airtime of packets to the mt76 driver. Import this into mac80211 so we can use it for airtime queue limit calculations. The airtime.c file is copied verbatim from the mt76 driver, and adjusted to be usable in mac80211. This involves: - Switching to mac80211 data structures. - Adding support for 160 MHz channels and HE mode. - Moving the symbol and duration calculations around a bit to avoid rounding with the higher rates and longer symbol times used for HE rates. The per-rate TX rate calculation is also split out to its own function so it can be used directly for the AQL calculations later. Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20191119060610.76681-3-kyan@google.com [fix HE_GROUP_IDX() to use 3 * bw, since there are 3 _gi values] Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
-
由 Paolo Abeni 提交于
This is alike the previous change, with some additional ipv4 specific quirk. Even when using the route hint we still have to do perform additional per packet checks about source address validity: a new helper is added to wrap them. Hints are explicitly disabled if the destination is a local broadcast, that keeps the code simple and local broadcast are a slower path anyway. UDP flood performances vs recvmmsg() receiver: vanilla patched delta Kpps Kpps % 1683 1871 +11 In the worst case scenario - each packet has a different destination address - the performance delta is within noise range. v3 -> v4: - re-enable hints for forward v2 -> v3: - really fix build (sic) and hint usage check - use fib4_has_custom_rules() helpers (David A.) - add ip_extract_route_hint() helper (Edward C.) - use prev skb as hint instead of copying data (Willem) v1 -> v2: - fix build issue with !CONFIG_IP_MULTIPLE_TABLES Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
So that we can use it in the next patch. Additionally constify the helper argument. Suggested-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
Use a per namespace counter, increment it on successful creation of any route using the source address, decrement it on deletion of such routes. This allows us to check easily if the routing decision in the current namespace depends on the packet source. Will be used by the next patch. Suggested-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
It wraps the namespace field with the same name, to easily access it regardless of build options. Suggested-by: NDavid Ahern <dsahern@gmail.com> Suggested-by: NEric Dumazet <eric.dumazet@gmail.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 11月, 2019 5 次提交
-
-
由 Lorenzo Bianconi 提交于
Introduce the following parameters in order to add the possibility to sync DMA memory for device before putting allocated pages in the page_pool caches: - PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that the driver gets from page_pool will be DMA-synced-for-device according to the length provided by the device driver. Please note DMA-sync-for-CPU is still device driver responsibility - offset: DMA address offset where the DMA engine starts copying rx data - max_len: maximum DMA memory size page_pool is allowed to flush. This is currently used in __page_pool_alloc_pages_slow routine when pages are allocated from page allocator These parameters are supposed to be set by device drivers. This optimization reduces the length of the DMA-sync-for-device. The optimization is valid because pages are initially DMA-synced-for-device as defined via max_len. At RX time, the driver will perform a DMA-sync-for-CPU on the memory for the packet length. What is important is the memory occupied by packet payload, because this is the area CPU is allowed to read and modify. As we don't track cache-lines written into by the CPU, simply use the packet payload length as dma_sync_size at page_pool recycle time. This also take into account any tail-extend. Tested-by: NMatteo Croce <mcroce@redhat.com> Signed-off-by: NLorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NIlias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Saeed Mahameed 提交于
Add page_pool_update_nid() to be called by page pool consumers when they detect numa node changes. It will update the page pool nid value to start allocating from the new effective numa node. This is to mitigate page pool allocating pages from a wrong numa node, where the pool was originally allocated, and holding on to pages that belong to a different numa node, which causes performance degradation. For pages that are already being consumed and could be returned to the pool by the consumer, in next patch we will add a check per page to avoid recycling them back to the pool and return them to the page allocator. Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com> Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: NIlias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
Match on ethertype and set up protocol dependency. Check for protocol dependency before accessing the tci field. Allow to match on the encapsulated ethertype too. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
Hardware offload support at this stage assumes an ethernet device in place. The flow dissector provides the intermediate representation to express this selector, so extend it to allow to store the interface type. Flower does not uses this, so skb_flow_dissect_meta() is not extended to match on this new field. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pablo Neira Ayuso 提交于
This patch constifies the pointer to source register data that is passed as an input parameter. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 11月, 2019 1 次提交
-
-
由 Willem de Bruijn 提交于
Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket with TLS_TX takes the following path: tcp_bpf_sendmsg_redir tcp_bpf_push_locked tcp_bpf_push kernel_sendpage_locked sock->ops->sendpage_locked Also update the flags test in tls_sw_sendpage_locked to allow flag MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this. Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t Link: https://github.com/wdebruij/kerneltools/commits/icept.2 Fixes: 0608c69c ("bpf: sk_msg, sock{map|hash} redirect through ULP") Fixes: f3de19af ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"") Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 11月, 2019 1 次提交
-
-
由 Jesper Dangaard Brouer 提交于
When Jonathan change the page_pool to become responsible to its own shutdown via deferred work queue, then the disconnect_cnt counter was removed from xdp memory model tracepoint. This patch change the page_pool_inflight tracepoint name to page_pool_release, because it reflects the new responsability better. And it reintroduces a counter that reflect the number of times page_pool_release have been tried. The counter is also used by the code, to only empty the alloc cache once. With a stuck work queue running every second and counter being 64-bit, it will overrun in approx 584 billion years. For comparison, Earth lifetime expectancy is 7.5 billion years, before the Sun will engulf, and destroy, the Earth. Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NToke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 11月, 2019 1 次提交
-
-
由 Jonathan Lemon 提交于
The page pool keeps track of the number of pages in flight, and it isn't safe to remove the pool until all pages are returned. Disallow removing the pool until all pages are back, so the pool is always available for page producers. Make the page pool responsible for its own delayed destruction instead of relying on XDP, so the page pool can be used without the xdp memory model. When all pages are returned, free the pool and notify xdp if the pool is registered with the xdp memory system. Have the callback perform a table walk since some drivers (cpsw) may share the pool among multiple xdp_rxq_info. Note that the increment of pages_state_release_cnt may result in inflight == 0, resulting in the pool being released. Fixes: d956a048 ("xdp: force mem allocator removal and periodic warning") Signed-off-by: NJonathan Lemon <jonathan.lemon@gmail.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NIlias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 11月, 2019 4 次提交
-
-
由 Pablo Neira Ayuso 提交于
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet flowtable type definitions. Rename the nf_flow_rule_route() function to nf_flow_rule_route_ipv4(). Adjust maximum number of actions, which now becomes 16 to leave sufficient room for the IPv6 address mangling for NAT. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Vladimir Oltean 提交于
While it is entirely possible that this tagger format is in fact more generic than just these 2 switch families, I don't have that knowledge. The Seville switch in NXP T1040 has a similar frame format, but there are enough differences (e.g. DEST field starts at bit 57 instead of 56) that calling this file tag_vitesse.c is a bit of a stretch at the moment. The frame format has been listed in a comment so that people who add support for further Vitesse switches can rework this tagger while keeping compatibility with Felix. The "ocelot" name was chosen instead of "felix" because even the Ocelot switch can act as a DSA device when it is used in NPI mode, and the Felix tagger format is almost identical. Currently it is only used for the Felix switch embedded in the NXP LS1028A chip. The ABI for this tagger should be considered "not stable" at the moment. The DSA tag is always placed before the Ethernet header and therefore, we are using the long prefix for RX tags to avoid putting the DSA master port in promiscuous mode. Once there will be an API in DSA for drivers to request DSA masters to be in promiscuous mode unconditionally, we will switch to the "no prefix" extraction frame header, which will save 16 padding bytes for each RX frame. Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ursula Braun 提交于
If the ism module is unloaded return control from exit routine only, if all link groups are freed. If an IB device is thrown away return control from device removal only, if all link groups belonging to this device are freed. A counters for the total number of SMCD link groups per ISM device is introduced. ism module unloading continues only if the total number of SMCD link groups for all ISM devices is zero. ISM device removal continues only it the total number of SMCD link groups per ISM device has decreased to zero. Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com> Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ursula Braun 提交于
SMCD link group termination is called when peer signals its shutdown of its corresponding link group. For regular shutdowns no connections exist anymore. For abnormal shutdowns connections must be killed and their DMBs must be unregistered immediately. That means the SMCR method to delay the link group freeing several seconds does not fit. This patch adds immediate termination of a link group and its SMCD connections and makes sure all SMCD link group related cleanup steps are finished. Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com> Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 11月, 2019 7 次提交
-
-
由 Stefano Garzarella 提交于
This patch adds 'module' member in the 'struct vsock_transport' in order to get/put the transport module. This prevents the module unloading while sockets are assigned to it. We increase the module refcnt when a socket is assigned to a transport, and we decrease the module refcnt when the socket is destructed. Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NJorgen Hansen <jhansen@vmware.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefano Garzarella 提交于
This patch adds the support of multiple transports in the VSOCK core. With the multi-transports support, we can use vsock with nested VMs (using also different hypervisors) loading both guest->host and host->guest transports at the same time. Major changes: - vsock core module can be loaded regardless of the transports - vsock_core_init() and vsock_core_exit() are renamed to vsock_core_register() and vsock_core_unregister() - vsock_core_register() has a feature parameter (H2G, G2H, DGRAM) to identify which directions the transport can handle and if it's support DGRAM (only vmci) - each stream socket is assigned to a transport when the remote CID is set (during the connect() or when we receive a connection request on a listener socket). The remote CID is used to decide which transport to use: - remote CID <= VMADDR_CID_HOST will use guest->host transport; - remote CID == local_cid (guest->host transport) will use guest->host transport for loopback (host->guest transports don't support loopback); - remote CID > VMADDR_CID_HOST will use host->guest transport; - listener sockets are not bound to any transports since no transport operations are done on it. In this way we can create a listener socket, also if the transports are not loaded or with VMADDR_CID_ANY to listen on all transports. - DGRAM sockets are handled as before, since only the vmci_transport provides this feature. Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefano Garzarella 提交于
All transports call __vsock_create() with the same parameters, most of them depending on the parent socket. In order to simplify the VSOCK core APIs exposed to the transports, this patch adds the vsock_create_connected() callable from transports to create a new socket when a connection request is received. We also unexported the __vsock_create(). Suggested-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NJorgen Hansen <jhansen@vmware.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefano Garzarella 提交于
virtio_transport and vmci_transport handle the buffer_size sockopts in a very similar way. In order to support multiple transports, this patch moves this handling in the core to allow the user to change the options also if the socket is not yet assigned to any transport. This patch also adds the '.notify_buffer_size' callback in the 'struct virtio_transport' in order to inform the transport, when the buffer_size is changed by the user. It is also useful to limit the 'buffer_size' requested (e.g. virtio transports). Acked-by: NDexuan Cui <decui@microsoft.com> Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NJorgen Hansen <jhansen@vmware.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefano Garzarella 提交于
Since now the 'struct vsock_sock' object contains a pointer to the transport, this patch adds a parameter to the vsock_core_get_transport() to return the right transport assigned to the socket. This patch modifies also the virtio_transport_get_ops(), that uses the vsock_core_get_transport(), adding the 'struct vsock_sock *' parameter. Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NJorgen Hansen <jhansen@vmware.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefano Garzarella 提交于
As a preparation to support multiple transports, this patch adds the 'transport' member at the 'struct vsock_sock'. This new field is initialized during the creation in the __vsock_create() function. This patch also renames the global 'transport' pointer to 'transport_single', since for now we're only supporting a single transport registered at run-time. Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NJorgen Hansen <jhansen@vmware.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Stefano Garzarella 提交于
This header file now only includes the "uapi/linux/vm_sockets.h". We can include directly it when needed. Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com> Reviewed-by: NJorgen Hansen <jhansen@vmware.com> Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 11月, 2019 2 次提交
-
-
由 Pablo Neira Ayuso 提交于
This patch adds support for offloading the NFT_META_IIF selector. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Pablo Neira Ayuso 提交于
This patch adds the dataplane hardware offload to the flowtable infrastructure. Three new flags represent the hardware state of this flow: * FLOW_OFFLOAD_HW: This flow entry resides in the hardware. * FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove from hardware. This might be triggered by either packet path (via TCP RST/FIN packet) or via aging. * FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from the hardware, the software garbage collector can remove it from the software flowtable. This patch supports for: * IPv4 only. * Aging via FLOW_CLS_STATS, no packet and byte counter synchronization at this stage. This patch also adds the action callback that specifies how to convert the flow entry into the flow_rule object that is passed to the driver. Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-