1. 04 Sep 2013, 4 commits
  2. 03 Sep 2013, 1 commit
  3. 01 Sep 2013, 1 commit
  4. 31 Aug 2013, 1 commit
    • Y
      tcp: do not use cached RTT for RTT estimation · 1b7fdd2a
      Committed by Yuchung Cheng
      RTT cached in the TCP metrics is valuable for the initial timeout
      because the SYN RTT usually does not account for serialization delays
      on low BW paths.
      
      However, using it to seed the RTT estimator may be disruptive because
      other components (e.g., pacing) require the smoothed RTT to be obtained
      from the actual connection.
      
      The solution is to use the higher cached RTT to set the first RTO
      conservatively, like tcp_rtt_estimator(), but avoid seeding the other
      RTT estimator variables such as srtt.  It is also a good idea to
      keep the RTO conservative to obtain the first RTT sample, and the
      performance is ensured by TCP loss probe if SYN RTT is available.
      
      To keep the seeding formula consistent across SYN RTT and cached RTT,
      the rttvar is twice the cached RTT instead of cached RTTVAR value. The
      reason is because cached variation may be too small (near min RTO)
      which defeats the purpose of being conservative on first RTO. However
      the metrics still keep the RTT variations as they might be useful for
      user applications (through ip).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b7fdd2a
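A minimal sketch of the seeding idea described above. The function name, the millisecond units, and the exact clamping floor are illustrative assumptions, not the kernel's actual code; only the principle (leave srtt unseeded, derive a conservative first RTO with a variance term of twice the cached RTT) comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define ILLUSTRATIVE_RTO_MIN_MS 200u  /* hypothetical stand-in for tcp_rto_min() */

/* Seed only the first RTO from the cached RTT; srtt stays zero so the
 * real estimator starts from the first live RTT sample.  The variance
 * term is twice the cached RTT rather than the cached RTTVAR, which
 * may be too small (near min RTO) to be conservative. */
static uint32_t seed_first_rto_ms(uint32_t cached_rtt_ms)
{
    uint32_t rttvar_ms = 2u * cached_rtt_ms;
    uint32_t rto_ms = cached_rtt_ms + rttvar_ms;

    return rto_ms < ILLUSTRATIVE_RTO_MIN_MS ? ILLUSTRATIVE_RTO_MIN_MS : rto_ms;
}
```

With a 100 ms cached RTT this yields a 300 ms first RTO, while a very small cached RTT falls back to the minimum floor.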
  5. 30 Aug 2013, 1 commit
    • E
      tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense at low rates, as it tends to create
      micro bursts on the network, and the general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      The patch has no impact on high-speed flows, where having large TSO
      packets makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspicious deferring of the last two segments
      on an initial write of 10 MSS; I had to change tcp_tso_should_defer() to
      take tp->xmit_size_goal_segs into account.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
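The pacing formula and the "one packet per ms" sizing rule above can be sketched as follows. This is an illustrative condensation under assumed units (srtt in microseconds, rate in bytes per second), not the kernel's fixed-point implementation.

```c
#include <assert.h>
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, here in bytes per second,
 * with srtt given in microseconds.  The factor of two allows a
 * 'slow start' style ramp up of the rate. */
static uint64_t pacing_rate_Bps(uint32_t cwnd, uint32_t mss, uint32_t srtt_us)
{
    return (2ULL * cwnd * mss * 1000000ULL) / srtt_us;
}

/* Size TSO packets so that roughly one packet is sent every 1 ms, but
 * never below the tcp_min_tso_segs floor (default 2). */
static uint32_t tso_autosize_segs(uint64_t rate_Bps, uint32_t mss,
                                  uint32_t min_tso_segs)
{
    uint32_t segs = (uint32_t)(rate_Bps / 1000u / mss); /* bytes sent in 1 ms */

    return segs < min_tso_segs ? min_tso_segs : segs;
}
```

For a fast flow (cwnd 1000, mss 1448, srtt 10 ms) this keeps ~200-segment TSO packets; for a slow flow the per-ms budget collapses and the min_tso_segs floor of 2 applies, avoiding micro bursts.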
  6. 28 Aug 2013, 5 commits
    • P
      netfilter: add SYNPROXY core/target · 48b1de4c
      Committed by Patrick McHardy
      Add a SYNPROXY for netfilter. The code is split into two parts: the synproxy
      core with common functions, and an address-family-specific target.
      
      The SYNPROXY receives the connection request from the client, responds with
      a SYN/ACK containing a SYN cookie and announcing a zero window and checks
      whether the final ACK from the client contains a valid cookie.
      
      It then establishes a connection to the original destination and, if
      successful, sends a window update to the client with the window size
      announced by the server.
      
      Support for timestamps, SACK, window scaling and MSS options can be
      statically configured as target parameters if the features of the server
      are known. If timestamps are used, the timestamp value sent back to
      the client in the SYN/ACK will be different from the real timestamp of
      the server. In order not to break PAWS, the timestamps are translated in
      the direction server->client.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      48b1de4c
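The proxied handshake described above can be sketched as a tiny state machine. The state and function names here are hypothetical for illustration; the real logic lives in the SYNPROXY core/target with different structure and many more details (option translation, sequence adjustment, timeouts).

```c
#include <assert.h>
#include <stdbool.h>

enum sp_state {
    SP_IDLE,        /* waiting for the client SYN */
    SP_COOKIE_SENT, /* SYN/ACK with SYN cookie and zero window sent */
    SP_VALIDATED,   /* client ACK carried a valid cookie */
    SP_ESTABLISHED, /* backend connected; window update sent to client */
    SP_DROPPED,
};

/* Client SYN: answer with a cookie-bearing SYN/ACK, announce zero window. */
static enum sp_state sp_client_syn(enum sp_state s)
{
    return s == SP_IDLE ? SP_COOKIE_SENT : SP_DROPPED;
}

/* Final client ACK: only proceed if the embedded cookie validates. */
static enum sp_state sp_client_ack(enum sp_state s, bool cookie_valid)
{
    if (s != SP_COOKIE_SENT || !cookie_valid)
        return SP_DROPPED;
    return SP_VALIDATED; /* now open a connection to the real server */
}

/* Server handshake completed: relay its window to the client. */
static enum sp_state sp_server_synack(enum sp_state s)
{
    return s == SP_VALIDATED ? SP_ESTABLISHED : SP_DROPPED;
}
```

The key property the zero window buys is that the client sends no data until the backend connection exists and a window update re-opens the flow.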
    • P
      net: syncookies: export cookie_v4_init_sequence/cookie_v4_check · 0198230b
      Committed by Patrick McHardy
      Extract the local TCP stack independent parts of tcp_v4_init_sequence()
      and cookie_v4_check() and export them for use by the upcoming SYNPROXY
      target.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0198230b
    • P
      netfilter: nf_conntrack: make sequence number adjustments usable without NAT · 41d73ec0
      Committed by Patrick McHardy
      Split out sequence number adjustments from NAT and move them to the conntrack
      core to make them usable for SYN proxying. The sequence number adjustment
      information is moved to a separate extension. The extension is added to new
      conntracks when a NAT mapping is set up for a connection using a helper.
      
      As a side effect, this saves 24 bytes per connection with NAT in the common
      case that a connection does not have a helper assigned.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      41d73ec0
    • P
      netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged · affe759d
      Committed by Phil Oester
      As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
      with the tcp-reset option sends out reset packets with the src MAC address
      of the local bridge interface, instead of the MAC address of the intended
      destination.  This causes some routers/firewalls to drop the reset packet
      as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
      setting the MAC of the sender in the tcp reset packet.
      
      This closes netfilter bugzilla #531.
      Signed-off-by: Phil Oester <kernel@linuxace.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      affe759d
    • D
      net: tcp_probe: allow more advanced ingress filtering by mark · b1dcdc68
      Committed by Daniel Borkmann
      Currently, the tcp_probe snooper can either filter packets by a given
      port (handed to the module via module parameter e.g. port=80) or lets
      all TCP traffic pass (port=0, default). When a port is specified, the
      port number is tested against the sk's source/destination port. Thus,
      if one of them matches, the information will be further processed for
      the log.
      
      As this is quite limited, allow for more advanced filtering possibilities
      which can facilitate debugging/analysis with the help of the tcp_probe
      snooper. Therefore, similarly as added to BPF machine in commit 7e75f93e
      ("pkt_sched: ingress socket filter by mark"), add the possibility to
      use skb->mark as a filter.
      
      If the mark is not being used otherwise, this allows ingress filtering
      by flow (e.g. in order to track updates from only a single flow, or a
      subset of all flows for a given port) and other things such as dynamic
      logging and reconfiguration without removing/re-inserting the tcp_probe
      module, etc. Simple example:
      
        insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
        ...
        iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
        [... sampling interval ...]
        iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
      
      The current option to filter by a given port is still being preserved. A
      similar approach could be done for the sctp_probe module as a follow-up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1dcdc68
  7. 26 Aug 2013, 1 commit
  8. 23 Aug 2013, 4 commits
    • D
      net: tcp_probe: add IPv6 support · f925d0a6
      Committed by Daniel Borkmann
      The tcp_probe currently only supports analysis of IPv4 connections.
      Therefore, it would be nice to have IPv6 supported as well. Since we
      have the recently added %pISpc specifier that is IPv4/IPv6 generic,
      build the related sockaddr structures from the flow information and
      pass them to our format string. Tested with SSH and HTTP sessions
      on IPv4 and IPv6.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f925d0a6
    • D
      net: tcp_probe: kprobes: adapt jtcp_rcv_established signature · d8cdeda6
      Committed by Daniel Borkmann
      This patch fixes a rather unproblematic function signature mismatch,
      as the const specifier was missing for the th variable; next to
      that, it adds a build-time assertion so that future function signature
      mismatches for kprobes will not end badly, similarly to what commit 22222997
      ("net: sctp: add build check for sctp_sf_eat_sack_6_2/jsctp_sf_eat_sack")
      did for SCTP.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d8cdeda6
    • D
      net: tcp_probe: also include rcv_wnd next to snd_wnd · b4c1c1d0
      Committed by Daniel Borkmann
      It is sometimes helpful to know the TCP window sizes of an established
      socket, e.g. to confirm that window scaling is working or to tweak the
      window size to improve high-latency connections. Currently the
      TCP snooper only exports the send window size, but not the receive window
      size. Therefore, also add the receive window size to the end of the
      output line.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4c1c1d0
    • Y
      tcp: increase throughput when reordering is high · 0f7cc9a3
      Committed by Yuchung Cheng
      The stack currently detects reordering and avoids spurious
      retransmission very well. However, the throughput is sub-optimal under
      high reordering because cwnd is increased only if the data is delivered
      in order, i.e., the FLAG_DATA_ACKED check in tcp_ack().  The more packets
      are reordered, the worse the throughput is.
      
      Therefore when reordering is proven high, cwnd should advance whenever
      the data is delivered regardless of its ordering. If reordering is low,
      conservatively advance cwnd only on ordered deliveries in Open state,
      and retain cwnd in Disordered state (RFC5681).
      
      In netperf tests on a qdisc setup of 20Mbps BW and random RTT from 45ms
      to 55ms (to induce reordering), this change increases TCP throughput
      by 20 - 25% to near the bottleneck BW.
      
      A special case is the stretched ACK with new SACK and/or ECE mark.
      For example, a receiver may receive an out-of-order or ECN packet with
      unacked data buffered because of LRO or delayed ACK. The principle on
      such an ACK is to advance cwnd on the cumulatively acked part first,
      then reduce cwnd in tcp_fastretrans_alert().
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f7cc9a3
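The cwnd-advance rule above can be condensed into a predicate. The flag values and the function name are illustrative (the kernel's actual decision involves more state than this), but the branch structure mirrors the described policy.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_DATA_ACKED  0x04  /* cumulative ACK advanced snd_una */
#define FLAG_DATA_SACKED 0x20  /* this ACK reported newly SACKed data */

/* With a high reordering estimate, any delivered data (cumulative or
 * SACKed) may grow cwnd; with a low estimate, only in-order deliveries
 * in the Open state do, and Disordered state retains cwnd (RFC 5681). */
static bool may_raise_cwnd(int ack_flags, int reordering, int reord_thresh,
                           bool state_open)
{
    if (reordering > reord_thresh)
        return ack_flags & (FLAG_DATA_ACKED | FLAG_DATA_SACKED);

    return state_open && (ack_flags & FLAG_DATA_ACKED);
}
```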
  9. 21 Aug 2013, 4 commits
  10. 16 Aug 2013, 1 commit
  11. 15 Aug 2013, 3 commits
  12. 14 Aug 2013, 2 commits
    • P
      ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id. · 4221f405
      Committed by Pravin B Shelar
      Using the inner id for the tunnel id is not safe in some rare cases.
      E.g. packets coming from multiple sources entering the same tunnel
      can have the same id. Therefore, on tunnel packet receive we
      could have packets from two different streams, but with the same
      source and dst IP and the same ip-id, which could confuse IP packet
      reassembly.
      
      The following patch reverts the optimization from commit
      490ab081 ("IP_GRE: Fix IP-Identification.").
      
      CC: Jarno Rajahalme <jrajahalme@nicira.com>
      CC: Ansis Atteka <aatteka@nicira.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4221f405
    • Y
      tcp: reset reordering est. selectively on timeout · 74c181d5
      Committed by Yuchung Cheng
      On timeout the TCP sender unconditionally resets the estimated degree
      of network reordering (tp->reordering). The idea behind this is that
      the estimate is too large to trigger fast recovery (e.g., due to an IP
      path change).
      
      But for example if the sender only had 2 packets outstanding, then a
      timeout doesn't tell much about reordering. A sender that learns about
      reordering on big writes and loses packets on small writes will end up
      falsely retransmitting again and again, especially when reordering is
      more likely on big writes.
      
      Therefore the sender should only suspect that tp->reordering is too
      high if it could have gone into fast recovery with the (lower) default
      estimate.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74c181d5
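A sketch of the guard this commit describes. The function name, parameters, and the exact condition for "could have gone into fast recovery" are illustrative assumptions; only the selective-reset principle comes from the message above.

```c
#include <assert.h>

/* Keep the learned reordering estimate on timeout unless the flight was
 * large enough that the (lower) default estimate could have let the
 * sender enter fast recovery instead of timing out. */
static int reordering_after_timeout(int cur_reordering, int packets_out,
                                    int default_reordering /* typically 3 */)
{
    if (packets_out > default_reordering)
        return default_reordering;  /* estimate was plausibly too high */

    return cur_reordering;          /* too few packets to judge anything */
}
```

With only 2 packets outstanding a timeout leaves the estimate alone, so a sender that learned reordering on big writes no longer forgets it after losses on small writes.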
  13. 10 Aug 2013, 6 commits
  14. 09 Aug 2013, 2 commits
  15. 08 Aug 2013, 3 commits
    • S
      ip_tunnel: embed hash list head · 6261d983
      Committed by stephen hemminger
      The IP tunnel hash heads can be embedded in the per-net structure
      since they are a fixed size. Reduce the size so that the total structure
      fits in a page. The original size was overly large; even NETDEV_HASHBITS
      is only 8 bits!
      
      Also, add some white space for readability.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6261d983
    • E
      tcp: cubic: fix bug in bictcp_acked() · cd6b423a
      Committed by Eric Dumazet
      While investigating a strange increase of retransmit rates
      on hosts ~24 days after boot, Van found hystart was disabled
      if ca->epoch_start was 0, as the following condition is true
      when the tcp_time_stamp high-order bit is set.
      
      (s32)(tcp_time_stamp - ca->epoch_start) < HZ
      
      Quoting Van :
      
       At initialization & after every loss ca->epoch_start is set to zero so
       I believe that the above line will turn off hystart as soon as the 2^31
       bit is set in tcp_time_stamp & hystart will stay off for 24 days.
       I think we've observed that cubic's restart is too aggressive without
       hystart so this might account for the higher drop rate we observe.
      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd6b423a
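The wraparound can be demonstrated in isolation. The helper names below are illustrative; the condition and the shape of the fix (test epoch_start before the signed comparison) follow the commit.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000

/* With epoch_start == 0, the (s32) difference goes negative once
 * tcp_time_stamp crosses 2^31 (~24 days after boot at HZ=1000), so
 * "< HZ" stays true and hystart's delay sampling is skipped forever. */
static int delay_sample_skipped_buggy(uint32_t now, uint32_t epoch_start)
{
    return (int32_t)(now - epoch_start) < HZ;
}

/* The fix: a zero epoch_start means "no epoch", so never skip on it. */
static int delay_sample_skipped_fixed(uint32_t now, uint32_t epoch_start)
{
    return epoch_start && (int32_t)(now - epoch_start) < HZ;
}
```

Once the timestamp's high-order bit is set, the buggy form skips sampling even though no epoch is active, while the fixed form only skips within a real epoch's first HZ jiffies.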
    • E
      tcp: cubic: fix overflow error in bictcp_update() · 2ed0edf9
      Committed by Eric Dumazet
      commit 17a6e9f1 ("tcp_cubic: fix clock dependency") added an
      overflow error in bictcp_update() in the following code:
      
      /* change the unit from HZ to bictcp_HZ */
      t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) -
            ca->epoch_start) << BICTCP_HZ) / HZ;
      
      Because msecs_to_jiffies() returns an unsigned long, the compiler does
      implicit type promotion.
      
      We really want to constrain (tcp_time_stamp - ca->epoch_start)
      to a signed 32bit value, or else 't' has unexpected high values.
      
      This bug triggers an increase of retransmit rates ~24 days after
      boot [1], as the high-order bit of tcp_time_stamp flips.
      
      [1] for hosts with HZ=1000
      
      Big thanks to Van Jacobson for spotting this problem.
      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ed0edf9
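The promotion bug can be reproduced standalone. The helper names are illustrative; the buggy expression mirrors the quoted code shape (u32 timestamp plus an unsigned long jiffies term), and the fixed form mirrors the commit's approach of constraining the timestamp difference to a signed 32-bit value first. The huge-value behavior assumes a 64-bit unsigned long (LP64).

```c
#include <assert.h>
#include <stdint.h>

/* msecs_to_jiffies() returns unsigned long, so on LP64 the whole
 * expression is evaluated in 64 bits: a wrapped u32 timestamp
 * difference becomes a huge value instead of a small positive one. */
static unsigned long delta_buggy(uint32_t now, unsigned long delay_j,
                                 uint32_t epoch_start)
{
    return now + delay_j - epoch_start;  /* promoted to unsigned long */
}

/* The fix: take the u32 difference first, constrained through (s32),
 * then add the jiffies delay. */
static long delta_fixed(uint32_t now, unsigned long delay_j,
                        uint32_t epoch_start)
{
    long t = (int32_t)(now - epoch_start);  /* wraps correctly in 32 bits */

    return t + (long)delay_j;
}
```

With a timestamp that has wrapped 5 jiffies past epoch_start = 0xfffffffb and a 3-jiffy delay term, the fixed form yields 13, while the buggy form on LP64 yields a value far beyond 2^32.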
  16. 06 Aug 2013, 1 commit