1. 29 Sep 2013, 3 commits
    • net: introduce SO_MAX_PACING_RATE · 62748f32
      Committed by Eric Dumazet
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by the transport layer. The value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use the FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
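      
      A minimal userspace sketch of using the option (hedged: SO_MAX_PACING_RATE
      may be missing from older headers; the fallback value 47 matches
      asm-generic/socket.h but differs on a few architectures):
      
      #include <stdio.h>
      #include <sys/socket.h>
      
      #ifndef SO_MAX_PACING_RATE
      #define SO_MAX_PACING_RATE 47   /* asm-generic value; arch headers may differ */
      #endif
      
      /* Cap the kernel-computed pacing rate of sockfd to 1 MB/s. */
      static int cap_pacing_rate(int sockfd)
      {
          unsigned int val = 1000000;   /* bytes per second */
      
          if (setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val)) < 0) {
              perror("setsockopt(SO_MAX_PACING_RATE)");   /* e.g. pre-3.13 kernel */
              return -1;
          }
          return 0;
      }
      
      Without the FQ qdisc attached to the egress device, the cap is recorded on
      the socket but the flow is not actually paced.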
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62748f32
    • ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Committed by Francesco Fusco
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In the case of TOS, the priority
      is also changed accordingly.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
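      
      A hedged userspace sketch of the new capability (the helper name is made up,
      and it assumes int-sized cmsg payloads for both IP_TTL and IP_TOS):
      
      #include <string.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      
      /* Send one datagram with a per-call TTL and TOS, overriding the socket
       * defaults for this message only. */
      static int send_with_ttl_tos(int sockfd, const struct sockaddr_in *dst,
                                   const void *buf, size_t len, int ttl, int tos)
      {
          union {
              char raw[CMSG_SPACE(sizeof(int)) * 2];
              struct cmsghdr align;
          } cbuf;
          struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
          struct msghdr msg = {
              .msg_name = (void *)dst, .msg_namelen = sizeof(*dst),
              .msg_iov = &iov, .msg_iovlen = 1,
              .msg_control = cbuf.raw, .msg_controllen = sizeof(cbuf.raw),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      
          cmsg->cmsg_level = IPPROTO_IP;      /* per-call TTL */
          cmsg->cmsg_type  = IP_TTL;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &ttl, sizeof(int));
      
          cmsg = CMSG_NXTHDR(&msg, cmsg);
          cmsg->cmsg_level = IPPROTO_IP;      /* per-call TOS */
          cmsg->cmsg_type  = IP_TOS;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &tos, sizeof(int));
      
          return sendmsg(sockfd, &msg, 0) < 0 ? -1 : 0;
      }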
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa661581
    • ipv4: IP_TOS and IP_TTL can be specified as ancillary data · f02db315
      Committed by Francesco Fusco
      This patch enables the IP_TTL and IP_TOS values passed from userspace to
      be stored in the ipcm_cookie struct. Three fields are added to the struct:
      
      - the TTL, expressed as __u8.
        The allowed values are in the range [1, 255].
        A value of 0 means that the TTL is not specified.
      
      - the TOS, expressed as __s16.
        The allowed values are in the range [0,255].
        A value of -1 means that the TOS is not specified.
      
      - the priority, expressed as a char and computed when
        handling the ancillary data.
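      
      Paraphrasing the additions as a standalone fragment (illustrative only;
      the rest of the real struct ipcm_cookie is omitted):
      
      /* The three fields described above, with the sentinel values spelled out. */
      struct ipcm_cookie_new_fields {
          unsigned char ttl;      /* __u8:  0 means "not specified", else 1..255  */
          short         tos;      /* __s16: -1 means "not specified", else 0..255 */
          char          priority; /* computed from tos while parsing cmsg data    */
      };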
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f02db315
  2. 24 Sep 2013, 3 commits
    • tcp: fix dynamic right sizing · b0983d3c
      Committed by Eric Dumazet
      Dynamic Right Sizing (DRS) is supposed to open the TCP receive window
      automatically, but suffers from two bugs, presented in order
      of importance.
      
      1) tcp_rcv_space_adjust() fix :
      
      Using twice the last received amount is very pessimistic,
      because it doesn't allow fast recovery or a proper slow start
      ramp up if the sender wants to increase cwnd by 100% every RTT.
      
      copied = bytes received in previous RTT
      
      2*copied = bytes we expect to receive in next RTT
      
      4*copied = bytes we need to advertise in rwin at end of next RTT
      
      DRS is one RTT late, so it needs a 4x factor.
      
      If the sender is not using ABC and increases cwnd by 50% every RTT,
      then a 1.5*1.5 = 2.25 factor was enough.
      This is probably why this bug was not really noticed.
      
      2) There is no window adjustment after the first RTT. DRS triggers only
        after the second RTT.
        DRS needs two RTTs to initialize, so tcp_fixup_rcvbuf() should set up
        sk_rcvbuf to allow proper window growth during the first two RTTs.
      
      This patch increases TCP efficiency particularly for large-RTT flows
      when autotuning is used at the receiver, and more particularly
      in the presence of packet losses.
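      
      A tiny standalone illustration of the 4x factor (plain arithmetic, not the
      kernel code; the byte count is made up):
      
      #include <stdio.h>
      
      int main(void)
      {
          unsigned int copied   = 100000;         /* bytes received in previous RTT   */
          unsigned int next_rtt = 2 * copied;     /* sender may double cwnd every RTT */
          unsigned int rcvwin   = 2 * next_rtt;   /* DRS reacts one RTT late -> 4x    */
      
          printf("advertise at least %u bytes (4 * %u)\n", rcvwin, copied);
          return 0;
      }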
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0983d3c
    • tcp: syncookies: reduce mss table to four values · 08629354
      Committed by Florian Westphal
      Halve mss table size to make blind cookie guessing more difficult.
      This is sad since the tables were already small, but there
      is little alternative except perhaps adding more precise mss information
      in the tcp timestamp.  Timestamps are unfortunately not ubiquitous.
      
      Guessing all possible cookie values still gives only an 8-in-2**32 chance.
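      
      For illustration, a four-entry table of plausible MSS values and the
      resulting odds (the concrete values are placeholders, not copied from the
      kernel source):
      
      /* Illustrative only: four common MSS values. */
      static const unsigned short msstab_example[4] = { 536, 1300, 1440, 1460 };
      
      /* With 4 MSS slots and 2 acceptable counter values (see the 128 s lifetime
       * patch below), a blind guess matches one of 4 * 2 = 8 cookies out of 2**32. */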
      Reported-by: Jakob Lell <jakob@jakoblell.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08629354
    • tcp: syncookies: reduce cookie lifetime to 128 seconds · 8c27bd75
      Committed by Florian Westphal
      We currently accept cookies that were created less than 4 minutes ago
      (ie, cookies with counter delta 0-3).  Combined with the 8 mss table
      values, this yields 32 possible values (out of 2**32) that will be valid.
      
      Reducing the lifetime to < 2 minutes halves the guessing chance while
      still providing a large enough period.
      
      While at it, get rid of the jiffies value -- it overflows too quickly on
      32-bit platforms.
      
      getnstimeofday is used to create a counter that increments every 64s.
      perf shows the getnstimeofday cost is negligible compared to sha_transform;
      normal tcp initial sequence number generation uses getnstimeofday, too.
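      
      A userspace approximation of that counter (illustrative; the kernel helper
      itself is not reproduced here):
      
      #include <stdint.h>
      #include <time.h>
      
      /* A counter that increments every 64 seconds, derived from wall-clock
       * seconds rather than jiffies, so it does not wrap quickly on 32-bit. */
      static uint32_t cookie_time_bucket(void)
      {
          struct timespec now;
      
          clock_gettime(CLOCK_REALTIME, &now);
          return (uint32_t)(now.tv_sec >> 6);   /* 64-second granularity */
      }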
      Reported-by: Jakob Lell <jakob@jakoblell.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c27bd75
  3. 20 Sep 2013, 2 commits
  4. 18 Sep 2013, 1 commit
  5. 13 Sep 2013, 1 commit
  6. 07 Sep 2013, 2 commits
    • tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Committed by Eric Dumazet
      TCP receive window handling is multi-staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2
      
      This heuristic takes place in tcp_grow_window()
      
      The problem is that we currently call tcp_grow_window() only for in-order
      packets.
      
      This means that reordering or packet losses stop proper growth of
      rcv_win, and senders are unable to benefit from fast recovery
      or proper reordering level detection.
      
      Really, a packet stored in the OFO queue is not a bad citizen.
      It should be part of the game just like in-order packets.
      
      In our traces, we very often see the sender limited by small Linux
      receive windows, even though the Linux hosts use autotuning (DRS) and
      should allow rcv_win to grow to ~3MB.
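      
      A rough illustration of the "good citizen" idea (not the kernel heuristic;
      the 1/2 threshold is an assumption of this sketch):
      
      /* Treat a segment as a good citizen when its payload accounts for at
       * least half of its total memory footprint (truesize); the patch applies
       * the same kind of credit to segments queued out of order. */
      static int segment_is_good_citizen(unsigned int payload_len, unsigned int truesize)
      {
          return 2 * payload_len >= truesize;
      }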
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e4f1fc2
    • tcp: fix no cwnd growth after timeout · 16edfe7e
      Committed by Yuchung Cheng
      Commit 0f7cc9a3 ("tcp: increase throughput when reordering is high")
      only allows cwnd to increase in the Open state. This mistakenly disables
      slow start after a timeout (CA_Loss). Moreover, cwnd won't grow if the
      state moves from Disorder to Open later in tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
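      
      A condensed sketch of that rule (state names mirror the description above;
      this is not the kernel function):
      
      enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };
      
      /* Grow cwnd whenever data is ACKed in order, unless the connection is in
       * a window-reduction state (CWR or Recovery). */
      static int may_raise_cwnd(enum ca_state state, int data_acked_in_order)
      {
          if (!data_acked_in_order)
              return 0;
          return state != CA_CWR && state != CA_RECOVERY;
      }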
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16edfe7e
  7. 06 Sep 2013, 1 commit
  8. 05 Sep 2013, 1 commit
  9. 04 Sep 2013, 9 commits
  10. 03 Sep 2013, 1 commit
  11. 01 Sep 2013, 1 commit
  12. 31 Aug 2013, 3 commits
  13. 30 Aug 2013, 4 commits
    • ipv4: sendto/hdrincl: don't use destination address found in header · c27c9322
      Committed by Chris Clark
      ipv4: raw_sendmsg: don't use header's destination address
      
      A sendto() regression was bisected and found to start with commit
      f8126f1d (ipv4: Adjust semantics of rt->rt_gateway.)
      
      The problem is that it tries to ARP-lookup the constructed packet's
      destination address rather than the explicitly provided address.
      
      Fix this using FLOWI_FLAG_KNOWN_NH so that the given nexthop is used.
      
      cf. commit 2ad5b9e4
      Reported-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Tested-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Suggested-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c27c9322
    • tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and the general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per-socket sk_pacing_rate, which approximates
      the current sending rate and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of the last two segments on
      an initial write of 10 MSS; I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs.
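      
      A worked example of the sizing rule (plain arithmetic, not kernel code;
      the sample cwnd, MSS and RTT are made up):
      
      #include <stdio.h>
      #include <stdint.h>
      
      int main(void)
      {
          uint64_t cwnd = 40, mss = 1448;     /* example values                 */
          double   srtt = 0.050;              /* 50 ms smoothed RTT, in seconds */
      
          double pacing_rate = 2.0 * cwnd * mss / srtt;   /* bytes per second   */
          double budget_1ms  = pacing_rate / 1000.0;      /* aim for 1 pkt/ms   */
          unsigned int tso_segs = (unsigned int)(budget_1ms / mss);
      
          if (tso_segs < 2)                   /* tcp_min_tso_segs default       */
              tso_segs = 2;
      
          printf("pacing %.0f B/s -> about %u MSS per TSO packet\n",
                 pacing_rate, tso_segs);
          return 0;
      }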
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
    • tcp: don't apply tsoffset if rcv_tsecr is zero · e3e12028
      Committed by Andrew Vagin
      The zero value means that tsecr is not valid, so it's a special case.
      
      tsoffset is used to customize tcp_time_stamp for one socket.
      tsoffset is usually zero; it's used when a socket was moved from one
      host to another.
      
      Currently this issue affects the logic of tcp_rcv_rtt_measure_ts. Due to
      the incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
      TCP_RTO_MAX.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3e12028
    • tcp: initialize rcv_tstamp for restored sockets · c7781a6e
      Committed by Andrew Vagin
      u32 rcv_tstamp;     /* timestamp of last received ACK */
      
      Its value is used in tcp_retransmit_timer, which closes the socket
      if the last ACK was received more than TCP_RTO_MAX ago.
      
      Currently rcv_tstamp is initialized to zero, and if tcp_retransmit_timer
      is called before the first ACK is received, the connection is closed.
      
      This patch initializes rcv_tstamp to the current timestamp when a socket
      is restored.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7781a6e
  14. 28 Aug 2013, 5 commits
    • netfilter: add SYNPROXY core/target · 48b1de4c
      Committed by Patrick McHardy
      Add a SYNPROXY for netfilter. The code is split into two parts: the synproxy
      core with common functions and an address-family-specific target.
      
      The SYNPROXY receives the connection request from the client, responds with
      a SYN/ACK containing a SYN cookie and announcing a zero window and checks
      whether the final ACK from the client contains a valid cookie.
      
      It then establishes a connection to the original destination and, if
      successful, sends a window update to the client with the window size
      announced by the server.
      
      Support for timestamps, SACK, window scaling and MSS options can be
      statically configured as target parameters if the features of the server
      are known. If timestamps are used, the timestamp value sent back to
      the client in the SYN/ACK will be different from the real timestamp of
      the server. In order not to break PAWS, the timestamps are translated in
      the server->client direction.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      48b1de4c
    • net: syncookies: export cookie_v4_init_sequence/cookie_v4_check · 0198230b
      Committed by Patrick McHardy
      Extract the parts of tcp_v4_init_sequence() and cookie_v4_check() that are
      independent of the local TCP stack and export them for use by the upcoming
      SYNPROXY target.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0198230b
    • netfilter: nf_conntrack: make sequence number adjustments usable without NAT · 41d73ec0
      Committed by Patrick McHardy
      Split out sequence number adjustments from NAT and move them to the conntrack
      core to make them usable for SYN proxying. The sequence number adjustment
      information is moved to a separate extension. The extension is added to new
      conntracks when a NAT mapping is set up for a connection using a helper.
      
      As a side effect, this saves 24 bytes per connection with NAT in the common
      case that a connection does not have a helper assigned.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      41d73ec0
    • netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged · affe759d
      Committed by Phil Oester
      As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
      with the tcp-reset option sends out reset packets with the src MAC address
      of the local bridge interface, instead of the MAC address of the intended
      destination.  This causes some routers/firewalls to drop the reset packet
      as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
      setting the MAC of the sender in the tcp reset packet.
      
      This closes netfilter bugzilla #531.
      Signed-off-by: Phil Oester <kernel@linuxace.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      affe759d
    • net: tcp_probe: allow more advanced ingress filtering by mark · b1dcdc68
      Committed by Daniel Borkmann
      Currently, the tcp_probe snooper can either filter packets by a given
      port (handed to the module via module parameter e.g. port=80) or lets
      all TCP traffic pass (port=0, default). When a port is specified, the
      port number is tested against the sk's source/destination port. Thus,
      if one of them matches, the information will be further processed for
      the log.
      
      As this is quite limited, allow for more advanced filtering possibilities
      which can facilitate debugging/analysis with the help of the tcp_probe
      snooper. Therefore, similarly to what was added to the BPF machine in commit
      7e75f93e ("pkt_sched: ingress socket filter by mark"), add the possibility to
      use skb->mark as a filter.
      
      If the mark is not being used otherwise, this allows ingress filtering
      by flow (e.g. in order to track updates from only a single flow, or a
      subset of all flows for a given port) and other things such as dynamic
      logging and reconfiguration without removing/re-inserting the tcp_probe
      module, etc. Simple example:
      
        insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
        ...
        iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
        [... sampling interval ...]
        iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
      
      The current option to filter by a given port is still being preserved. A
      similar approach could be done for the sctp_probe module as a follow-up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1dcdc68
  15. 26 Aug 2013, 2 commits
  16. 23 Aug 2013, 1 commit