- 30 9月, 2015 21 次提交
-
-
由 David Ahern 提交于
Replace calls to vrf_master_ifindex_rcu and vrf_master_ifindex with either l3mdev_master_ifindex_rcu or l3mdev_master_ifindex. The pattern: oif = vrf_master_ifindex(dev) ? : dev->ifindex; is replaced with oif = l3mdev_fib_oif(dev); And remove the now unused vrf macros. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
L3 master devices allow users of the abstraction to influence FIB lookups for enslaved devices. Current API provides a means for the master device to return a specific FIB table for an enslaved device, to return an rtable/custom dst and influence the OIF used for fib lookups. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Rename IFF_VRF_MASTER to IFF_L3MDEV_MASTER and update the name of the netif_is_vrf and netif_index_is_vrf macros. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
While auditing TCP stack for upcoming 'lockless' listener changes, I found I had to change fastopen_init_queue() to properly init the object before publishing it. Otherwise an other cpu could try to lock the spinlock before it gets properly initialized. Instead of adding appropriate barriers, just remove dynamic memory allocations : - Structure is 28 bytes on 64bit arches. Using additional 8 bytes for holding a pointer seems overkill. - Two listeners can share same cache line and performance would suffer. If we really want to save few bytes, we would instead dynamically allocate whole struct request_sock_queue in the future. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
tcp_syn_flood_action() will soon be called with unlocked socket. In order to avoid SYN flood warning being emitted multiple times, use xchg(). Extend max_qlen_log and synflood_warned fields in struct listen_sock to u32 Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
These functions do not change the listener socket. Goal is to make sure tcp_conn_request() is not messing with listener in a racy way. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Some common IPv4/IPv6 code can be factorized. Also constify cookie_init_sequence() socket argument. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We'll soon no longer hold listener socket lock, these functions do not modify the socket in any way. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This method does not touch the listener socket. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
socket no longer needs to be read/write Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
socket is not touched, make it const. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
The socket points to the (shared) listener. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Before changing dccp_v6_request_recv_sock() sock argument to const, we need to get rid of security_sk_classify_flow(), and it seems doable by reusing inet6_csk_route_req() helper. We need to add a proto parameter to inet6_csk_route_req(), not assume it is TCP. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Factorize code to get tcp header from skb. It makes no sense to duplicate code in callers. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Once we realize tcp_rcv_synsent_state_process() does not use its 'len' argument and we get rid of it, then it becomes clear this argument is no longer used in tcp_rcv_state_process() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
None of these functions need to change the socket, make it const. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
err is initialized to -EINVAL when it is declared. It is not reset until fib_lookup which is well after the 3 users of the martian_source jump. So resetting err to -EINVAL at martian_source label is not needed. Removing that line obviates the need for the martian_source_keep_err label so delete it. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This patch just swaps the ordering of one of the conditional tests in ip_route_input_mc. Specifically it swaps the testing for the source address to see if it is loopback, and the test to see if we allow a loopback source address. The reason for swapping these two tests is because it is much faster to test if an address is loopback than it is to dereference several pointers to get at the net structure to see if the use of loopback is allowed. Signed-off-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This patch updates ip_check_mc_rcu so that protocol is passed as a u8 instead of a u16. The motivation is just to avoid any unneeded type transitions since some systems will require an instruction to zero extend a u8 field to a u16. Also it makes it a bit more readable as to the fact that protocol is a u8 so there are no byte ordering changes needed to pass it. Signed-off-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
For some reason we were carrying the budget value around between the various calls to napi->poll. If for example one of the drivers called had a bug in which it returned a non-zero value for work this could result in the budget value becoming negative. Rather than carry around a value of budget that is 0 or less we can instead just loop through and pass 0 to each napi->poll call. If any driver returns a value for work done that is non-zero then we can report that driver and continue rather than allowing a bad actor to make the budget value negative and pass that negative value to napi->poll. Note, the only actual change here is that instead of letting budget become negative we are keeping it at 0 regardless of the value returned for work since it should not be possible for the polling routine to do any actual work with a budget of 0. So if the polling routine returns a non-0 value we are just reporting it and continuing with a budget of 0 rather than letting that work value be subtracted from the budget of 0. Signed-off-by: NAlexander Duyck <aduyck@mirantis.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
This patch changes the bridge vlan implementation to use rhashtables instead of bitmaps. The main motivation behind this change is that we need extensible per-vlan structures (both per-port and global) so more advanced features can be introduced and the vlan support can be extended. I've tried to break this up but the moment net_port_vlans is changed and the whole API goes away, thus this is a larger patch. A few short goals of this patch are: - Extensible per-vlan structs stored in rhashtables and a sorted list - Keep user-visible behaviour (compressed vlans etc) - Keep fastpath ingress/egress logic the same (optimizations to come later) Here's a brief list of some of the new features we'd like to introduce: - per-vlan counters - vlan ingress/egress mapping - per-vlan igmp configuration - vlan priorities - avoid fdb entries replication (e.g. local fdb scaling issues) The structure is kept single for both global and per-port entries so to avoid code duplication where possible and also because we'll soon introduce "port0 / aka bridge as port" which should simplify things further (thanks to Vlad for the suggestion!). Now we have per-vlan global rhashtable (bridge-wide) and per-vlan port rhashtable, if an entry is added to a port it'll get a pointer to its global context so it can be quickly accessed later. There's also a sorted vlan list which is used for stable walks and some user-visible behaviour such as the vlan ranges, also for error paths. VLANs are stored in a "vlan group" which currently contains the rhashtable, sorted vlan list and the number of "real" vlan entries. A good side-effect of this change is that it resembles how hw keeps per-vlan data. One important note after this change is that if a VLAN is being looked up in the bridge's rhashtable for filtering purposes (or to check if it's an existing usable entry, not just a global context) then the new helper br_vlan_should_use() needs to be used if the vlan is found. In case the lookup is done only with a port's vlan group, then this check can be skipped. Things tested so far: - basic vlan ingress/egress - pvids - untagged vlans - undef CONFIG_BRIDGE_VLAN_FILTERING - adding/deleting vlans in different scenarios (with/without global ctx, while transmitting traffic, in ranges etc) - loading/removing the module while having/adding/deleting vlans - extracting bridge vlan information (user ABI), compressed requests - adding/deleting fdbs on vlans - bridge mac change, promisc mode - default pvid change - kmemleak ON during the whole time Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 9月, 2015 4 次提交
-
-
由 Jesper Dangaard Brouer 提交于
Noticed that the compiler (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)) generated suboptimal assembler code in eth_get_headlen(). This early return coding style is usually not an issue, on super scalar CPUs, but the compiler choose to put the return statement after this very unlikely branch, thus creating larger jump down to the likely code path. Performance wise, I could measure slightly less L1-icache-load-misses and less branch-misses, and an improvement of 1 nanosec with an IP-forwarding use-case with 257 bytes packets with ixgbe (CPU i7-4790K @ 4.00GHz). Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Bendik Rønning Opstad 提交于
Application limited streams such as thin streams, that transmit small amounts of payload in relatively few packets per RTT, can be prevented from growing the CWND when in congestion avoidance. This leads to increased sojourn times for data segments in streams that often transmit time-dependent data. Currently, a connection is considered CWND limited only after having successfully transmitted at least one packet with new data, while at the same time failing to transmit some unsent data from the output queue because the CWND is full. Applications that produce small amounts of data may be left in a state where it is never considered to be CWND limited, because all unsent data is successfully transmitted each time an incoming ACK opens up for more data to be transmitted in the send window. Fix by always testing whether the CWND is fully used after successful packet transmissions, such that a connection is considered CWND limited whenever the CWND has been filled. This is the correct behavior as specified in RFC2861 (section 3.1). Cc: Andreas Petlund <apetlund@simula.no> Cc: Carsten Griwodz <griff@simula.no> Cc: Jonas Markussen <jonassm@ifi.uio.no> Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Cc: Mads Johannessen <madsjoh@ifi.uio.no> Signed-off-by: NBendik Rønning Opstad <bro.devel+kernel@gmail.com> Acked-by: NEric Dumazet <edumazet@google.com> Tested-by: NEric Dumazet <edumazet@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Tested-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
The oif has already been checked that it is non-zero; the 2 additional checks on oif within that if (oif) {...} block are redundant. CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We found that a TCP Fast Open passive connection was vulnerable to reorders, as the exchange might look like [1] C -> S S <FO ...> <request> [2] S -> C S. ack request <options> [3] S -> C . <answer> packets [2] and [3] can be generated at almost the same time. If C receives the 3rd packet before the 2nd, it will drop it as the socket is in SYN_SENT state and expects a SYNACK. S will have to retransmit the answer. Current OOO avoidance in linux is defeated because SYNACK packets are attached to the LISTEN socket, while DATA packets are attached to the children. They might be sent by different cpus, and different TX queues might be selected. It turns out that for TFO, we created a child, which is a full blown socket in TCP_SYN_RECV state, and we simply can attach the SYNACK packet to this socket. This means that at the time tcp_sendmsg() pushes DATA packet, skb->ooo_okay will be set iff the SYNACK packet had been sent and TX completed. This removes the reorder source at the host level. We also removed the export of tcp_try_fastopen(), as it is no longer called from IPv6. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 9月, 2015 1 次提交
-
-
由 Ian Wilson 提交于
Allow bridge forward delay to be configured when Spanning Tree is enabled. Signed-off-by: NIan Wilson <iwilson@brocade.com> Signed-off-by: NStephen Hemminger <stephen@networkplumber.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 9月, 2015 1 次提交
-
-
由 Jiri Benc 提交于
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is much more user friendly: it's not necessary to create two vxlan interfaces and pay attention to using the right one in routing rules. Signed-off-by: NJiri Benc <jbenc@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 9月, 2015 13 次提交
-
-
由 David Ahern 提交于
Andrey reported a panic: [ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4 [ 7249.865559] IP: [<c16afeca>] icmp_route_lookup+0xaa/0x320 [ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000 [ 7249.865637] Oops: 0000 [#1] ... [ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.3.0-999-generic #201509220155 [ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014 08/02/2006 [ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000 [ 7249.866949] EIP: 0060:[<c16afeca>] EFLAGS: 00210246 CPU: 0 [ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320 [ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00 [ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974 [ 7249.867077] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0 [ 7249.867141] Stack: [ 7249.867165] 320310ee 00000000 00000042 320310ee 00000000 c1aeca00 f3920240 f0c69180 [ 7249.867268] f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000 e659266c f483ba54 [ 7249.867361] 8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0 c16b0e00 00000064 [ 7249.867448] Call Trace: [ 7249.867494] [<f855058b>] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e] [ 7249.867534] [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack] [ 7249.867576] [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack] [ 7249.867615] [<c16b0e00>] ? icmp_send+0xa0/0x380 [ 7249.867648] [<c16b102f>] icmp_send+0x2cf/0x380 [ 7249.867681] [<f89c8126>] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4] [ 7249.867714] [<f89cd0da>] reject_tg+0x7a/0x9f [ipt_REJECT] [ 7249.867746] [<f88c29a7>] ipt_do_table+0x317/0x70c [ip_tables] [ 7249.867780] [<f895e0a6>] ? __nf_conntrack_find_get+0x166/0x3b0 [nf_conntrack] [ 7249.867838] [<f895eea8>] ? nf_conntrack_in+0x398/0x600 [nf_conntrack] [ 7249.867889] [<f84c0035>] iptable_filter_hook+0x35/0x80 [iptable_filter] [ 7249.867933] [<c16776a1>] nf_iterate+0x71/0x80 [ 7249.867970] [<c1677715>] nf_hook_slow+0x65/0xc0 [ 7249.868002] [<c1681811>] __ip_local_out_sk+0xc1/0xd0 [ 7249.868034] [<c1680f30>] ? ip_forward_options+0x1a0/0x1a0 [ 7249.868066] [<c1681836>] ip_local_out_sk+0x16/0x30 [ 7249.868097] [<c1684054>] ip_send_skb+0x14/0x80 [ 7249.868129] [<c16840f4>] ip_push_pending_frames+0x34/0x40 [ 7249.868163] [<c16844a2>] ip_send_unicast_reply+0x282/0x310 [ 7249.868196] [<c16a0863>] tcp_v4_send_reset+0x1b3/0x380 [ 7249.868227] [<c16a1b63>] tcp_v4_rcv+0x323/0x990 [ 7249.868257] [<c16776a1>] ? nf_iterate+0x71/0x80 [ 7249.868289] [<c167dc2b>] ip_local_deliver_finish+0x8b/0x230 [ 7249.868322] [<c167df4c>] ip_local_deliver+0x4c/0xa0 [ 7249.868353] [<c167dba0>] ? ip_rcv_finish+0x390/0x390 [ 7249.868384] [<c167d88c>] ip_rcv_finish+0x7c/0x390 [ 7249.868415] [<c167e280>] ip_rcv+0x2e0/0x420 ... Prior to the VRF change the oif was not set in the flow struct, so the VRF support should really have only added the vrf_master_ifindex lookup. Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX") Cc: Andrey Melnikov <temnota.am@gmail.com> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
SYNACK packets are sent on behalf on unlocked listeners or fastopen sockets. Mark socket as const to catch future changes that might break the assumption. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This is done to make sure we do not change listener socket while sending SYNACK packets while socket lock is not held. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Like tcp_make_synack() the only time we might change the socket is when calling sock_wmalloc(), which is using atomic operation to update sk->sk_wmem_alloc Also use MAX_DCCP_HEADER as both IPv4/IPv6 use this value for max_header. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This documents fact that listener lock might not be held at the time SYNACK are sent. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This is to document that socket lock might not be held at this point. skb_set_owner_w() and ipv6_local_error() are using proper atomic ops or spinlocks, so we promote the socket to non const when calling them. netfilter hooks should never assume socket lock is held, we also promote the socket to non const. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
listener socket is not locked when tcp_make_synack() is called. We better make sure no field is written. There is one exception : Since SYNACK packets are attached to the listener at this moment (or SYN_RECV child in case of Fast Open), sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using atomic operations so this is safe. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
SYNACK packets might be sent without holding socket lock. For DCTCP/ECN sake, we should call INET_ECN_xmit() while socket lock is owned, and only when we init/change congestion control. This also fixies a bug if congestion module is changed from dctcp to another one on a listener : we now clear ECN bits properly. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We do not use the socket in this function. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This function is used to build and send SYNACK packets, possibly on behalf of unlocked listener socket. Make sure we did not miss a write by making this socket const. We no longer can use ip_select_ident() and have to either set iph->id to 0 or directly call __ip_select_ident() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When TCP new listener is done, these functions will be called without socket lock being held. Make sure they don't change anything. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
socket is not modified, make it const so that callers can do the same if they need. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
ip6_dst_lookup_flow() and ip6_dst_lookup_tail() do not touch socket, lets add a const qualifier. This will permit the same change in inet6_csk_route_req() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-