- 28 8月, 2015 6 次提交
-
-
由 Pravin B Shelar 提交于
geneve_core module handles send and receive functionality. This way OVS could use the Geneve API. Now with use of tunnel meatadata mode OVS can directly use Geneve netdevice. So there is no need for separate module for Geneve. Following patch consolidates Geneve protocol processing in single module. Signed-off-by: NPravin B Shelar <pshelar@nicira.com> Reviewed-by: NJesse Gross <jesse@nicira.com> Acked-by: NJohn W. Linville <linville@tuxdriver.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pravin B Shelar 提交于
Following patch create new tunnel flag which enable tunnel metadata collection on given device. These devices can be used by tunnel metadata based routing or by OVS. Geneve Consolidation patch get rid of collect_md_tun to simplify tunnel lookup further. Signed-off-by: NPravin B Shelar <pshelar@nicira.com> Reviewed-by: NJesse Gross <jesse@nicira.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pravin B Shelar 提交于
Introduce function udp_tun_rx_dst() to initialize tunnel dst on receive path. Signed-off-by: NPravin B Shelar <pshelar@nicira.com> Reviewed-by: NJesse Gross <jesse@nicira.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
For classifiers getting invoked via tc_classify(), we always need an extra function call into tc_classify_compat(), as both are being exported as symbols and tc_classify() itself doesn't do much except handling of reclassifications when tp->classify() returned with TC_ACT_RECLASSIFY. CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(), all others use tc_classify(). When tc actions are being configured out in the kernel, tc_classify() effectively does nothing besides delegating. We could spare this layer and consolidate both functions. pktgen on single CPU constantly pushing skbs directly into the netif_receive_skb() path with a dummy classifier on ingress qdisc attached, improves slightly from 22.3Mpps to 23.1Mpps. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Stringer 提交于
Add functions to change connlabel length into nf_conntrack_labels.c so they may be reused by other modules like OVS and nftables without needing to jump through xt_match_check() hoops. Suggested-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NJoe Stringer <joestringer@nicira.com> Acked-by: NFlorian Westphal <fw@strlen.de> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Stringer 提交于
This variation on skb_dst_copy() doesn't require two skbs. Signed-off-by: NJoe Stringer <joestringer@nicira.com> Acked-by: NPravin B Shelar <pshelar@nicira.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 8月, 2015 2 次提交
-
-
由 Alexei Starovoitov 提交于
Similar to act_gact/act_mirred, act_bpf can be lockless in packet processing with extra care taken to free bpf programs after rcu grace period. Replacement of existing act_bpf (very rare) is done with synchronize_rcu() and final destruction is done from tc_action_ops->cleanup() callback that is called from tcf_exts_destroy()->tcf_action_destroy()->__tcf_hash_release() when bind and refcnt reach zero which is only possible when classifier is destroyed. Previous two patches fixed the last two classifiers (tcindex and rsvp) to call tcf_exts_destroy() from rcu callback. Similar to gact/mirred there is a race between prog->filter and prog->tcf_action. Meaning that the program being replaced may use previous default action if it happened to return TC_ACT_UNSPEC. act_mirred race betwen tcf_action and tcfm_dev is similar. In all cases the race is harmless. Long term we may want to improve the situation by replacing the whole tc_action->priv as single pointer instead of updating inner fields one by one. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexei Starovoitov 提交于
tcf_hash_destroy() used once. Make it static. Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 8月, 2015 4 次提交
-
-
由 Jiri Benc 提交于
The vxlan_get_sk_family inline function was added after the last #endif, making multiple inclusion of net/vxlan.h fail. Move it to the proper place. Reported-by: NMark Rustad <mark.d.rustad@intel.com> Fixes: 705cc62f ("vxlan: provide access function for vxlan socket address family") Signed-off-by: NJiri Benc <jbenc@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Remove various inlined functions not referenced in the kernel. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When TCP pacing was added back in linux-3.12, we chose to apply a fixed ratio of 200 % against current rate, to allow probing for optimal throughput even during slow start phase, where cwnd can be doubled every other gRTT. At Google, we found it was better applying a different ratio while in Congestion Avoidance phase. This ratio was set to 120 %. We've used the normal tcp_in_slow_start() helper for a while, then tuned the condition to select the conservative ratio as soon as cwnd >= ssthresh/2 : - After cwnd reduction, it is safer to ramp up more slowly, as we approach optimal cwnd. - Initial ramp up (ssthresh == INFINITY) still allows doubling cwnd every other RTT. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
slow start after idle might reduce cwnd, but we perform this after first packet was cooked and sent. With TSO/GSO, it means that we might send a full TSO packet even if cwnd should have been reduced to IW10. Moving the SSAI check in skb_entail() makes sense, because we slightly reduce number of times this check is done, especially for large send() and TCP Small queue callbacks from softirq context. As Neal pointed out, we also need to perform the check if/when receive window opens. Tested: Following packetdrill test demonstrates the problem // Test of slow start after idle `sysctl -q net.ipv4.tcp_slow_start_after_idle=1` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.100 < . 1:1(0) ack 1 win 511 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0 +0 write(4, ..., 26000) = 26000 +0 > . 1:5001(5000) ack 1 +0 > . 5001:10001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% +.100 < . 1:1(0) ack 10001 win 511 +0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }% +0 > . 10001:20001(10000) ack 1 +0 > P. 20001:26001(6000) ack 1 +.100 < . 1:1(0) ack 26001 win 511 +0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }% +4 write(4, ..., 20000) = 20000 // If slow start after idle works properly, we should send 5 MSS here (cwnd/2) +0 > . 26001:31001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% +0 > . 31001:36001(5000) ack 1 Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 8月, 2015 1 次提交
-
-
由 Tom Herbert 提交于
Add cfg and family arguments to lwt build state functions. cfg is a void pointer and will either be a pointer to a fib_config or fib6_config structure. The family parameter indicates which one (either AF_INET or AF_INET6). LWT encpasulation implementation may use the fib configuration to build the LWT state. Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 8月, 2015 2 次提交
-
-
由 Jiri Benc 提交于
__recnt and related fields need to be in its own cacheline for performance reasons. Commit 61adedf3 ("route: move lwtunnel state to dst_entry") broke that on 32bit archs, causing BUILD_BUG_ON in dst_hold to be triggered. This patch fixes the breakage by moving the lwtunnel state to the end of dst_entry on 32bit archs. Unfortunately, this makes it share the cacheline with __refcnt and may affect performance, thus further patches may be needed. Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Fixes: 61adedf3 ("route: move lwtunnel state to dst_entry") Signed-off-by: NJiri Benc <jbenc@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Tom Herbert 提交于
Add calls to gro_cells infrastructure to do GRO when receiving on a tunnel. Testing: Ran 200 netperf TCP_STREAM instance - With fix (GRO enabled on VXLAN interface) Verify GRO is happening. 9084 MBps tput 3.44% CPU utilization - Without fix (GRO disabled on VXLAN interface) Verified no GRO is happening. 9084 MBps tput 5.54% CPU utilization Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 8月, 2015 12 次提交
-
-
由 Jiri Benc 提交于
Use flowi_tunnel in flowi6 similarly to what is done with IPv4. This complements commit 1b7179d3 ("route: Extend flow representation with tunnel key"). Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
If output device wants to see the dst, inherit the dst of the original skb in the ndisc request. This is an IPv6 counterpart of commit 0accfc26 ("arp: Inherit metadata dst when creating ARP requests"). Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
Currently, the lwtunnel state resides in per-protocol data. This is a problem if we encapsulate ipv6 traffic in an ipv4 tunnel (or vice versa). The xmit function of the tunnel does not know whether the packet has been routed to it by ipv4 or ipv6, yet it needs the lwtstate data. Moving the lwtstate data to dst_entry makes such inter-protocol tunneling possible. As a bonus, this brings a nice diffstat. Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
Rename the ipv4_tos and ipv4_ttl fields to just 'tos' and 'ttl', as they'll be used with IPv6 tunnels, too. Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
Add the IPv6 addresses as an union with IPv4 ones. When using IPv4, the newly introduced padding after the IPv4 addresses needs to be zeroed out. Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Acked-by: NAlexei Starovoitov <ast@plumgrid.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
The ip_tunnels.h include file uses mixture of __u16 and u16 (etc.) types. Unify it to the non-underscore variants. Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
The custom alignment of struct ip_tunnel_key is unnecessary. In struct sw_flow_key, it starts at offset 256, in struct ip_tunnel_info it's the first field. The structure is also packed even without the __packed keyword. Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NThomas Graf <tgraf@suug.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christophe Ricard 提交于
A proprietary vendor command may send back useful data to the user application. For example, the field level applied on the NFC router antenna. Still based on net/wireless/nl80211.c implementation, add nfc_vendor_cmd_alloc_reply_skb and nfc_vendor_cmd_reply in order to send back over netlink data generated by a proprietary command. Signed-off-by: NChristophe Ricard <christophe-h.ricard@st.com> Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
-
由 Robert Baldyga 提交于
Some drivers needs to have ability to reinit NCI core, for example after updating firmware in setup() of post_setup() callback. This patch makes nci_core_reset() and nci_core_init() functions public, to make it possible. Signed-off-by: NRobert Baldyga <r.baldyga@samsung.com> Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
-
由 Robert Baldyga 提交于
Some drivers require non-standard configuration after NCI_CORE_INIT request, because they need to know ndev->manufact_specific_info or ndev->manufact_id. This patch adds post_setup handler allowing to do such custom configuration. Signed-off-by: NRobert Baldyga <r.baldyga@samsung.com> Signed-off-by: NSamuel Ortiz <sameo@linux.intel.com>
-
- 20 8月, 2015 2 次提交
-
-
由 Nikolay Aleksandrov 提交于
While running net-next I hit this: [ 634.073119] =============================== [ 634.073150] [ INFO: suspicious RCU usage. ] [ 634.073182] 4.2.0-rc6+ #45 Not tainted [ 634.073213] ------------------------------- [ 634.073244] include/net/vrf.h:38 suspicious rcu_dereference_check() usage! [ 634.073274] other info that might help us debug this: [ 634.073307] rcu_scheduler_active = 1, debug_locks = 1 [ 634.073338] 2 locks held by swapper/0/0: [ 634.073369] #0: (((&n->timer))){+.-...}, at: [<ffffffff8112bc35>] call_timer_fn+0x5/0x480 [ 634.073412] #1: (slock-AF_INET){+.-...}, at: [<ffffffff8174f0f5>] icmp_send+0x155/0x5f0 [ 634.073450] stack backtrace: [ 634.073483] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6+ #45 [ 634.073514] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 634.073545] 0000000000000000 0593ba8242d9ace4 ffff88002fc03b48 ffffffff81803f1b [ 634.073612] 0000000000000000 ffffffff81e12500 ffff88002fc03b78 ffffffff811003c5 [ 634.073642] 0000000000000000 ffff88002ec4e600 ffffffff81f00f80 ffff88002fc03cf0 [ 634.073669] Call Trace: [ 634.073694] <IRQ> [<ffffffff81803f1b>] dump_stack+0x4c/0x65 [ 634.073728] [<ffffffff811003c5>] lockdep_rcu_suspicious+0xc5/0x100 [ 634.073763] [<ffffffff8174eb56>] icmp_route_lookup+0x176/0x5c0 [ 634.073793] [<ffffffff8174f2fb>] ? icmp_send+0x35b/0x5f0 [ 634.073818] [<ffffffff8174f274>] ? icmp_send+0x2d4/0x5f0 [ 634.073844] [<ffffffff8174f3ce>] icmp_send+0x42e/0x5f0 [ 634.073873] [<ffffffff8170b662>] ipv4_link_failure+0x22/0xa0 [ 634.073899] [<ffffffff8174bdda>] arp_error_report+0x3a/0x80 [ 634.073926] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0 [ 634.073952] [<ffffffff816d396e>] neigh_invalidate+0x8e/0x110 [ 634.073984] [<ffffffff816d62ae>] neigh_timer_handler+0x1ae/0x290 [ 634.074013] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0 [ 634.074013] [<ffffffff8112bce3>] call_timer_fn+0xb3/0x480 [ 634.074013] [<ffffffff8112bc35>] ? call_timer_fn+0x5/0x480 [ 634.074013] [<ffffffff816d6100>] ? neigh_lookup+0x2c0/0x2c0 [ 634.074013] [<ffffffff8112c2bc>] run_timer_softirq+0x20c/0x430 [ 634.074013] [<ffffffff810af50e>] __do_softirq+0xde/0x630 [ 634.074013] [<ffffffff810afc97>] irq_exit+0x117/0x120 [ 634.074013] [<ffffffff81810976>] smp_apic_timer_interrupt+0x46/0x60 [ 634.074013] [<ffffffff8180e950>] apic_timer_interrupt+0x70/0x80 [ 634.074013] <EOI> [<ffffffff8106b9d6>] ? native_safe_halt+0x6/0x10 [ 634.074013] [<ffffffff81101d8d>] ? trace_hardirqs_on+0xd/0x10 [ 634.074013] [<ffffffff81027d43>] default_idle+0x23/0x200 [ 634.074013] [<ffffffff8102852f>] arch_cpu_idle+0xf/0x20 [ 634.074013] [<ffffffff810f89ba>] default_idle_call+0x2a/0x40 [ 634.074013] [<ffffffff810f8dcc>] cpu_startup_entry+0x39c/0x4c0 [ 634.074013] [<ffffffff817f9cad>] rest_init+0x13d/0x150 [ 634.074013] [<ffffffff81f69038>] start_kernel+0x4a8/0x4c9 [ 634.074013] [<ffffffff81f68120>] ? early_idt_handler_array+0x120/0x120 [ 634.074013] [<ffffffff81f68339>] x86_64_start_reservations+0x2a/0x2c [ 634.074013] [<ffffffff81f68485>] x86_64_start_kernel+0x14a/0x16d It would seem vrf_master_ifindex_rcu() can be called without RCU held in other contexts as well so introduce a new helper which acquires rcu and returns the ifindex. Also add curly braces around both the "if" and "else" parts as per the style guide. Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ying Xue 提交于
When CONFIG_LWTUNNEL config is not enabled, the lwtstate_free() is not declared in lwtunnel.h at all. However, even in this case, the function is still referenced in fib_semantics.c so that there appears the following sparse warnings: net/ipv4/fib_semantics.c:553:17: error: undefined identifier 'lwtstate_free' CC net/ipv4/fib_semantics.o net/ipv4/fib_semantics.c: In function ‘fib_encap_match’: net/ipv4/fib_semantics.c:553:3: error: implicit declaration of function ‘lwtstate_free’ [-Werror=implicit-function-declaration] cc1: some warnings being treated as errors make[1]: *** [net/ipv4/fib_semantics.o] Error 1 make: *** [net/ipv4/fib_semantics.o] Error 2 To eliminate the error, we define an empty function for lwtstate_free() in lwtunnel.h when CONFIG_LWTUNNEL is disabled. Fixes: df383e62 ("lwtunnel: fix memory leak") Cc: Jiri Benc <jbenc@redhat.com> Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NYing Xue <ying.xue@windriver.com> Acked-by: NJiri Benc <jbenc@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 8月, 2015 3 次提交
-
-
由 Nikolay Aleksandrov 提交于
slave_queue has a num_slaves member which is unused, drop it. Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Benc 提交于
The built lwtunnel_state struct has to be freed after comparison. Fixes: 571e7226 ("ipv4: support for fib route lwtunnel encap attributes") Signed-off-by: NJiri Benc <jbenc@redhat.com> Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Andrew Lunn 提交于
Add an inline helper for determining is a port is a DSA port. Signed-off-by: NAndrew Lunn <andrew@lunn.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 8月, 2015 6 次提交
-
-
由 Tom Herbert 提交于
This function updates a checksum field value and skb->csum based on a value which is the difference between the old and new checksum. Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Tom Herbert 提交于
inet_proto_csum_replace4,2,16 take a pseudohdr argument which indicates the checksum field carries a pseudo header. This argument should be a boolean instead of an int. Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Tom Herbert 提交于
This patch adds the capability to redirect dst input in the same way that dst output is redirected by LWT. Also, save the original dst.input and and dst.out when setting up lwtunnel redirection. These can be called by the client as a pass- through. Signed-off-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
This work adds the possibility of deriving the zone id from the skb->mark field in a scalable manner. This allows for having only a single template serving hundreds/thousands of different zones, for example, instead of the need to have one match for each zone as an extra CT jump target. Note that we'd need to have this information attached to the template as at the time when we're trying to lookup a possible ct object, we already need to know zone information for a possible match when going into __nf_conntrack_find_get(). This work provides a minimal implementation for a possible mapping. In order to not add/expose an extra ct->status bit, the zone structure has been extended to carry a flag for deriving the mark. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 Daniel Borkmann 提交于
This work adds a direction parameter to netfilter zones, so identity separation can be performed only in original/reply or both directions (default). This basically opens up the possibility of doing NAT with conflicting IP address/port tuples from multiple, isolated tenants on a host (e.g. from a netns) without requiring each tenant to NAT twice resp. to use its own dedicated IP address to SNAT to, meaning overlapping tuples can be made unique with the zone identifier in original direction, where the NAT engine will then allocate a unique tuple in the commonly shared default zone for the reply direction. In some restricted, local DNAT cases, also port redirection could be used for making the reply traffic unique w/o requiring SNAT. The consensus we've reached and discussed at NFWS and since the initial implementation [1] was to directly integrate the direction meta data into the existing zones infrastructure, as opposed to the ct->mark approach we proposed initially. As we pass the nf_conntrack_zone object directly around, we don't have to touch all call-sites, but only those, that contain equality checks of zones. Thus, based on the current direction (original or reply), we either return the actual id, or the default NF_CT_DEFAULT_ZONE_ID. CT expectations are direction-agnostic entities when expectations are being compared among themselves, so we can only use the identifier in this case. Note that zone identifiers can not be included into the hash mix anymore as they don't contain a "stable" value that would be equal for both directions at all times, f.e. if only zone->id would unconditionally be xor'ed into the table slot hash, then replies won't find the corresponding conntracking entry anymore. If no particular direction is specified when configuring zones, the behaviour is exactly as we expect currently (both directions). Support has been added for the CT netlink interface as well as the x_tables raw CT target, which both already offer existing interfaces to user space for the configuration of zones. Below a minimal, simplified collision example (script in [2]) with netperf sessions: +--- tenant-1 ---+ mark := 1 | netperf |--+ +----------------+ | CT zone := mark [ORIGINAL] [ip,sport] := X +--------------+ +--- gateway ---+ | mark routing |--| SNAT |-- ... + +--------------+ +---------------+ | +--- tenant-2 ---+ | ~~~|~~~ | netperf |--+ +-----------+ | +----------------+ mark := 2 | netserver |------ ... + [ip,sport] := X +-----------+ [ip,port] := Y On the gateway netns, example: iptables -t raw -A PREROUTING -j CT --zone mark --zone-dir ORIGINAL iptables -t nat -A POSTROUTING -o <dev> -j SNAT --to-source <ip> --random-fully iptables -t mangle -A PREROUTING -m conntrack --ctdir ORIGINAL -j CONNMARK --save-mark iptables -t mangle -A POSTROUTING -m conntrack --ctdir REPLY -j CONNMARK --restore-mark conntrack dump from gateway netns: netperf -H 10.1.1.2 -t TCP_STREAM -l60 -p12865,5555 from each tenant netns tcp 6 431995 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=1 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=1024 [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 431994 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=5555 dport=12865 zone-orig=2 src=10.1.1.2 dst=10.1.1.1 sport=12865 dport=5555 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 299 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=39438 dport=33768 zone-orig=1 src=10.1.1.2 dst=10.1.1.1 sport=33768 dport=39438 [ASSURED] mark=1 secctx=system_u:object_r:unlabeled_t:s0 use=1 tcp 6 300 ESTABLISHED src=40.1.1.1 dst=10.1.1.2 sport=32889 dport=40206 zone-orig=2 src=10.1.1.2 dst=10.1.1.1 sport=40206 dport=32889 [ASSURED] mark=2 secctx=system_u:object_r:unlabeled_t:s0 use=2 Taking this further, test script in [2] creates 200 tenants and runs original-tuple colliding netperf sessions each. A conntrack -L dump in the gateway netns also confirms 200 overlapping entries, all in ESTABLISHED state as expected. I also did run various other tests with some permutations of the script, to mention some: SNAT in random/random-fully/persistent mode, no zones (no overlaps), static zones (original, reply, both directions), etc. [1] http://thread.gmane.org/gmane.comp.security.firewalls.netfilter.devel/57412/ [2] https://paste.fedoraproject.org/242835/65657871/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
由 David Ahern 提交于
Table lookup compiles out when VRF is not enabled. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 8月, 2015 2 次提交
-
-
由 David Ahern 提交于
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. inet_addr_type_dev_table keeps the same semantics as inet_addr_type but if the passed in device is enslaved to a VRF then the table for that VRF is used for the lookup. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Currently inet_addr_type and inet_dev_addr_type expect local addresses to be in the local table. With the VRF device local routes for devices associated with a VRF will be in the table associated with the VRF. Provide an alternate inet_addr lookup to use a specific table rather than defaulting to the local table. Signed-off-by: NShrijeet Mukherjee <shm@cumulusnetworks.com> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-