Commits · 80d0a69fc57715dc9080c0567df1ed911b78abea · openanolis / cloud-kernel

16 Jul, 2012 1 commit

ipv4: Add helper inet_csk_update_pmtu(). · 80d0a69f

Committed by David S. Miller on Jul 16, 2012

This abstracts away the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).

So we try to rebuild the socket cached route after the method
invocation if necessary.

This isn't used by SCTP because it needs to cache dsts per-transport,
and thus will need its own local version of this helper.
Signed-off-by: David S. Miller <davem@davemloft.net>

80d0a69f
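The pattern this commit describes (call the PMTU update method, then re-check the cached route and rebuild it if the update invalidated it) can be sketched in user-space C. This is a toy model under stated assumptions, not the kernel helper: `struct toy_dst`, `struct toy_sock`, `toy_update_pmtu()`, `toy_rebuild_route()` and `toy_csk_update_pmtu()` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel types; every name here is hypothetical. */
struct toy_dst {
    unsigned int mtu;
    int obsolete;               /* non-zero once the route is invalidated */
};

struct toy_sock {
    struct toy_dst *cached;     /* route cached in the socket */
    struct toy_dst *fresh;      /* what a new route lookup would return */
};

/* Models the dst_ops->update_pmtu() method. In this model the update
 * may also invalidate the dst it was called on. */
void toy_update_pmtu(struct toy_dst *dst, unsigned int mtu, int invalidate)
{
    dst->mtu = mtu;
    if (invalidate)
        dst->obsolete = 1;
}

/* Models a route rebuild: point the socket at a freshly looked-up route. */
struct toy_dst *toy_rebuild_route(struct toy_sock *sk)
{
    sk->cached = sk->fresh;
    return sk->cached;
}

/* The helper pattern: update the PMTU, then rebuild the socket's cached
 * route if the update left it invalid. Returns the dst to use. */
struct toy_dst *toy_csk_update_pmtu(struct toy_sock *sk, unsigned int mtu,
                                    int invalidate)
{
    struct toy_dst *dst;

    assert(sk != NULL);
    dst = sk->cached;
    if (dst == NULL)
        return NULL;
    toy_update_pmtu(dst, mtu, invalidate);
    if (dst->obsolete)
        dst = toy_rebuild_route(sk);
    return dst;
}
```

Callers only see the returned dst, so they do not need to know whether the update invalidated the original route.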

12 Jul, 2012 3 commits

net: Remove checks for dst_ops->redirect being NULL. · 1ed5c48f

Committed by David S. Miller on Jul 12, 2012

No longer necessary.
Signed-off-by: David S. Miller <davem@davemloft.net>

1ed5c48f

ipv4: Add redirect support to all protocol icmp error handlers. · 55be7a9c

Committed by David S. Miller on Jul 11, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

55be7a9c

tcp: TCP Small Queues · 46d3ceab

Committed by Eric Dumazet on Jul 11, 2012

This introduces TSQ (TCP Small Queues).

The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), to reduce RTT and cwnd bias, part of the
bufferbloat problem.

sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.

As the skb destructor cannot restart xmit itself (as the qdisc lock might
be taken at this point), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46d3ceab
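The byte budget described in this commit can be illustrated with a small user-space model. It is only a sketch under stated assumptions: `struct toy_tsq_sock`, `toy_tso_cap()`, `toy_tsq_xmit()` and `toy_tsq_wfree()` are hypothetical names, and the model ignores the tasklet and the TSQ_OWNED handling.

```c
#include <assert.h>

/* Hypothetical model of the per-socket TSQ budget. */
struct toy_tsq_sock {
    unsigned int wmem_alloc;    /* bytes currently in qdisc/device queues */
    unsigned int limit;         /* models tcp_limit_output_bytes */
};

/* TSO packets are capped to half the limit so that two fit in the budget. */
unsigned int toy_tso_cap(unsigned int limit, unsigned int gso_max)
{
    unsigned int half = limit / 2;

    return half < gso_max ? half : gso_max;
}

/* Returns 1 and charges the socket if the packet goes to the qdisc,
 * 0 if the sender is throttled until a queued packet is freed. */
int toy_tsq_xmit(struct toy_tsq_sock *sk, unsigned int len)
{
    if (sk->wmem_alloc >= sk->limit)
        return 0;
    sk->wmem_alloc += len;
    return 1;
}

/* Models the destructor that runs when a queued skb is freed. */
void toy_tsq_wfree(struct toy_tsq_sock *sk, unsigned int len)
{
    assert(sk->wmem_alloc >= len);
    sk->wmem_alloc -= len;
}
```

With a limit of 40000 and the standard gso max of 65536, `toy_tso_cap()` yields 20000, so two capped packets fill the budget and a third has to wait for a completion.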

11 Jul, 2012 3 commits

inet: Remove ->get_peer() method. · 16d18399

Committed by David S. Miller on Jul 10, 2012

No longer used.
Signed-off-by: David S. Miller <davem@davemloft.net>

16d18399

tcp: Move timestamps from inetpeer to metrics cache. · 81166dd6

Committed by David S. Miller on Jul 10, 2012

With help from Lin Ming.
Signed-off-by: David S. Miller <davem@davemloft.net>

81166dd6

tcp: Abstract back handling peer aliveness test into helper function. · ab92bb2f

Committed by David S. Miller on Jul 09, 2012

Signed-off-by: David S. Miller <davem@davemloft.net>

ab92bb2f

28 Jun, 2012 4 commits

ipv4: Show that ip_send_reply() is purely unicast routine. · 70e73416

Committed by David S. Miller on Jun 28, 2012

Rename it to ip_send_unicast_reply() and add explicit 'saddr'
argument.

This removed one of the few users of rt->rt_spec_dst.
Signed-off-by: David S. Miller <davem@davemloft.net>

70e73416

ipv4: Kill early demux method return value. · 160eb5a6

Committed by David S. Miller on Jun 27, 2012

It's completely unnecessary.
Signed-off-by: David S. Miller <davem@davemloft.net>

160eb5a6

Revert "ipv4: tcp: dont cache unconfirmed intput dst" · c10237e0

Committed by David S. Miller on Jun 27, 2012

This reverts commit c074da28.

This change has several unwanted side effects:

1) Sockets will cache the DST_NOCACHE route in sk->sk_rx_dst and we'll
   thus never create a real cached route.

2) All TCP traffic will use DST_NOCACHE and never use the routing
   cache at all.
Signed-off-by: David S. Miller <davem@davemloft.net>

c10237e0

ipv4: tcp: dont cache unconfirmed intput dst · c074da28

Committed by Eric Dumazet on Jun 26, 2012

DDoS synflood attacks hit the IP route cache badly.

On typical machines, this cache is allowed to hold up to 8 million dst
entries, 256 bytes each, for a total of 2GB of memory.

rt_garbage_collect() triggers and tries to clean things up.

Eventually the route cache is disabled, but the machine is under fire and
might OOM and crash.

This patch exploits the new TCP early demux to set a nocache
boolean when an incoming TCP frame is for a socket that is not yet
ESTABLISHED or TIMEWAIT.

This 'nocache' boolean is then used, when the dst entry is not found in
the route cache, to create an unhashed dst entry (DST_NOCACHE).

Sending the SYN-cookie-ACK uses a similar mechanism (ipv4: tcp: dont cache
output dst for syncookies), so after this patch, a machine is able to
absorb a DDoS synflood attack without polluting its IP route cache.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c074da28

24 Jun, 2012 1 commit

tcp: Fix bug in tcp socket early demux · 7011d085

Committed by Vijay Subramanian on Jun 23, 2012

The dest port for the call to __inet_lookup_established() in TCP early demux
code is passed with the wrong endianness. This causes the lookup to fail,
leading to early demux not being used.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7011d085
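The class of bug fixed here can be reproduced in a few lines of user-space C. The sketch below is not the kernel code: `struct toy_tcphdr`, `toy_lookup_established()` and the two `toy_early_demux_*()` callers are hypothetical, and the "table" holds a single port. It only assumes what the commit message states, namely that the lookup expects the destination port in host byte order while the header carries it in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ESTABLISHED_PORT 8080   /* host byte order */

/* Only the field needed here; stored in network byte order, as on the wire. */
struct toy_tcphdr {
    uint16_t dest;
};

/* Toy lookup: expects the destination port in host byte order. */
int toy_lookup_established(uint16_t hnum)
{
    return hnum == TOY_ESTABLISHED_PORT;
}

/* Buggy caller: passes the on-wire port unchanged. */
int toy_early_demux_buggy(const struct toy_tcphdr *th)
{
    assert(th != NULL);
    return toy_lookup_established(th->dest);
}

/* Fixed caller: converts to host byte order before the lookup. */
int toy_early_demux_fixed(const struct toy_tcphdr *th)
{
    assert(th != NULL);
    return toy_lookup_established(ntohs(th->dest));
}
```

On a little-endian machine the buggy caller never finds the socket, so the optimization silently does nothing; on a big-endian machine both callers behave the same, which is how such a bug can go unnoticed.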

23 Jun, 2012 1 commit

ipv4: tcp: dont cache output dst for syncookies · 7586eceb

Committed by Eric Dumazet on Jun 20, 2012

Don't cache output dst for syncookies, as this adds pressure on IP route
cache and rcu subsystem for no gain.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7586eceb

22 Jun, 2012 1 commit

tcp: Validate route interface in early demux. · fd62e09b

Committed by David S. Miller on Jun 21, 2012

Otherwise we might violate reverse path filtering.
Signed-off-by: David S. Miller <davem@davemloft.net>

fd62e09b

20 Jun, 2012 1 commit

ipv4: Early TCP socket demux. · 41063e9d

Committed by David S. Miller on Jun 19, 2012

Input packet processing for local sockets involves two major demuxes.
One for the route and one for the socket.

But we can optimize this down to one demux for certain kinds of local
sockets.

Currently we only do this for established TCP sockets, but it could
at least in theory be expanded to other kinds of connections.

If a TCP socket is established then its identity is fully specified.

This means that whatever input route was used during the three-way
handshake must work equally well for the rest of the connection since
the keys will not change.

Once we move to established state, we cache the receive packet's input
route to use later.

Like the existing cached route in sk->sk_dst_cache used for output
packets, we have to check for route invalidations using dst->obsolete
and dst->ops->check().

Early demux occurs outside of a socket locked section, so when a route
invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
actually inside of established state packet processing and thus have
the socket locked.
Signed-off-by: David S. Miller <davem@davemloft.net>

41063e9d
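The split described in this commit (use the cached input route early, but defer fixing up the socket until it is locked) can be sketched as a user-space model. All names below (`struct toy_dst`, `struct toy_sock`, `toy_dst_check()`, `toy_early_demux()`, `toy_rcv_established_fixup()`) are hypothetical, and the validity flag stands in for whatever dst->ops->check() decides.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dst {
    int obsolete;       /* set when the route may have been invalidated */
    int still_valid;    /* what the check method would conclude */
};

struct toy_sock {
    int established;
    struct toy_dst *rx_dst;     /* cached input route */
};

/* Models dst->ops->check(): returns the dst if usable, else NULL. */
struct toy_dst *toy_dst_check(struct toy_dst *dst)
{
    return dst->still_valid ? dst : NULL;
}

/* Early demux, outside the socket lock: returns the cached input route,
 * or NULL meaning "fall back to the normal route lookup".
 * It never modifies the socket. */
struct toy_dst *toy_early_demux(const struct toy_sock *sk)
{
    struct toy_dst *dst;

    if (sk == NULL || !sk->established)
        return NULL;
    dst = sk->rx_dst;
    if (dst != NULL && dst->obsolete)
        dst = toy_dst_check(dst);
    return dst;
}

/* Established-state processing, socket locked: drop a stale cached route. */
void toy_rcv_established_fixup(struct toy_sock *sk)
{
    struct toy_dst *dst;

    assert(sk != NULL);
    dst = sk->rx_dst;
    if (dst != NULL && dst->obsolete && toy_dst_check(dst) == NULL)
        sk->rx_dst = NULL;
}
```

The early path is read-only by design in this model, which is why the stale pointer survives until the locked path clears it.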

10 Jun, 2012 1 commit

[PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. · 2397849b

Committed by David S. Miller on Jun 09, 2012

Since it's guaranteed that we will access the inetpeer if we're trying
to do timewait recycling and TCP options were enabled on the
connection, just cache the peer in the timewait socket.

In the future, inetpeer lookups will be context dependent (per routing
realm), and this helps facilitate that as well.
Signed-off-by: David S. Miller <davem@davemloft.net>

2397849b

09 Jun, 2012 3 commits

tcp: Get rid of inetpeer special cases. · 4670fd81

Committed by David S. Miller on Jun 09, 2012

The get_peer method TCP uses is full of special cases that make no
sense accommodating, and it also gets in the way of doing more
reasonable things here.

First of all, if the socket doesn't have a usable cached route, there
is no sense in trying to optimize timewait recycling.

Likewise for the case where we have IP options, such as SRR enabled,
that make the IP header destination address (and thus the destination
address of the route key) differ from that of the connection's
destination address.

Just return a NULL peer in these cases, and thus we're also able to
get rid of the clumsy inetpeer release logic.
Signed-off-by: David S. Miller <davem@davemloft.net>

4670fd81

inet: Create and use rt{,6}_get_peer_create(). · fbfe95a4

Committed by David S. Miller on Jun 08, 2012

There's a lot of places that open-code rt{,6}_get_peer() only because
they want to set 'create' to one.  So add an rt{,6}_get_peer_create()
for their sake.

There were also a few spots open-coding plain rt{,6}_get_peer() and
those are transformed here as well.
Signed-off-by: David S. Miller <davem@davemloft.net>

fbfe95a4

inetpeer: add parameter net for inet_getpeer_v4,v6 · 54db0cc2

Committed by Gao feng on Jun 08, 2012

Add struct net as a parameter of inet_getpeer_v[4,6] and
use net to replace &init_net.

Also modify some call sites to provide net for inet_getpeer_v[4,6].
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54db0cc2

04 Jun, 2012 1 commit

tcp: tcp_make_synack() consumes dst parameter · 4aea39c1

Committed by Eric Dumazet on Jun 03, 2012

tcp_make_synack() clones the dst, and callers release it.

We can avoid two atomic operations per SYNACK if tcp_make_synack()
consumes dst instead of cloning it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4aea39c1
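The saving comes from transferring ownership of the caller's reference instead of taking a second one. A minimal user-space model of the two conventions follows; the names (`struct toy_dst`, `struct toy_skb`, `toy_dst_hold()`, `toy_dst_release()`, and the two `toy_make_synack_*()` variants) are hypothetical, and a plain counter stands in for the atomic operations.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dst {
    int refcnt;
    int atomic_ops;     /* counts operations that would be atomic */
};

struct toy_skb {
    struct toy_dst *dst;
};

void toy_dst_hold(struct toy_dst *dst)
{
    dst->refcnt++;
    dst->atomic_ops++;
}

void toy_dst_release(struct toy_dst *dst)
{
    assert(dst->refcnt > 0);
    dst->refcnt--;
    dst->atomic_ops++;
}

/* Old convention: the callee takes its own reference; the caller must
 * still release the one it holds. */
void toy_make_synack_clone(struct toy_skb *skb, struct toy_dst *dst)
{
    toy_dst_hold(dst);
    skb->dst = dst;
}

/* New convention: the callee consumes the caller's reference, so the
 * caller must not release it afterwards. */
void toy_make_synack_consume(struct toy_skb *skb, struct toy_dst *dst)
{
    skb->dst = dst;
}
```

Both conventions end with the skb owning exactly one reference; the consuming variant gets there without the hold/release pair.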

02 Jun, 2012 1 commit

tcp: reflect SYN queue_mapping into SYNACK packets · fff32699

Committed by Eric Dumazet on Jun 01, 2012

While testing how Linux behaves under a SYNFLOOD attack on a multiqueue
device (ixgbe), I found that SYNACK messages were dropped at Qdisc level
because we send them all on a single queue.

The obvious choice is to reflect the incoming SYN packet @queue_mapping to
the SYNACK packet.

Under stress, my machine could only send 25,000 SYNACK per second (for
200,000 incoming SYN per second). NIC: ixgbe with 16 rx/tx queues.

After the patch, not a single SYNACK is dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fff32699

18 May, 2012 1 commit

tcp: bool conversions · a2a385d6

Committed by Eric Dumazet on May 16, 2012

bool conversions where possible.

__inline__ -> inline

space cleanups
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2a385d6

16 May, 2012 1 commit

net: Convert net_ratelimit uses to net_<level>_ratelimited · e87cc472

Committed by Joe Perches on May 13, 2012

Standardize the net core ratelimited logging functions.

Coalesce formats, align arguments.
Change a printk then vprintk sequence to use printf extension %pV.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e87cc472

05 May, 2012 1 commit

tcp: be more strict before accepting ECN negociation · bd14b1b2

Committed by Eric Dumazet on May 04, 2012

It appears some networks play bad games with the two bits reserved for
ECN. This can trigger false congestion notifications and very slow
transfers.

Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
disable TCP ECN negotiation if we receive mangled CT bits in
the SYN packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Perry Lorier <perryl@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Wilmer van der Gaast <wilmer@google.com>
Cc: Ankur Jain <jankur@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Dave Täht <dave.taht@bufferbloat.net>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bd14b1b2
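The stricter acceptance rule can be expressed as a small predicate. The sketch below is a user-space approximation, not the kernel function: `struct toy_syn`, `TOY_ECN_MASK` and `toy_ecn_ok()` are hypothetical names, and it assumes the "two bits reserved for ECN" are the low two bits of the IP TOS byte, as RFC 3168 defines them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ECN_MASK 0x03   /* low two bits of the IP TOS byte */

/* The parts of an incoming SYN that matter for this decision. */
struct toy_syn {
    int ece;        /* ECE TCP flag */
    int cwr;        /* CWR TCP flag */
    uint8_t tos;    /* IP TOS byte as received */
};

/* Returns 1 if ECN may be negotiated for this SYN, 0 otherwise. */
int toy_ecn_ok(const struct toy_syn *syn)
{
    assert(syn != NULL);
    if (!syn->ece || !syn->cwr)
        return 0;   /* the peer did not ask for ECN */
    if ((syn->tos & TOY_ECN_MASK) != 0)
        return 0;   /* ECN bits set on a SYN: treat them as mangled */
    return 1;
}
```

DSCP bits in the upper six bits of the TOS byte do not affect the result; only the two ECN bits are inspected.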

24 Apr, 2012 2 commits

tcp: sk_add_backlog() is too agressive for TCP · da882c1f

Committed by Eric Dumazet on Apr 22, 2012

While investigating TCP performance problems on 10Gb+ links, we found a
tcp sender was dropping a lot of incoming ACKs because of the sk_rcvbuf
limit in sk_add_backlog(), especially if the receiver doesn't use GRO/LRO
and sends one ACK every two MSS segments.

A sender usually tweaks sk_sndbuf, but sk_rcvbuf stays at its default
value (87380), allowing a too small backlog.

A TCP ACK, even being small, can consume nearly the same truesize space as
outgoing packets. Using sk_rcvbuf + sk_sndbuf as a limit makes sense and
is fast to compute.

Performance results on netperf, single flow, receiver with disabled
GRO/LRO : 7500 Mbits instead of 6050 Mbits, no more TCPBacklogDrop
increments at sender.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

da882c1f

net: add a limit parameter to sk_add_backlog() · f545a38f

Committed by Eric Dumazet on Apr 22, 2012

sk_add_backlog() & sk_rcvqueues_full() hard coded sk_rcvbuf as the
memory limit. We need to make this limit a parameter for TCP use.

No functional change expected in this patch, all callers still using the
old sk_rcvbuf limit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Rick Jones <rick.jones2@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f545a38f
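The two backlog commits above (make the limit a parameter, then let TCP pass sk_rcvbuf + sk_sndbuf) can be modelled in a few lines. This is an approximation with hypothetical names (`struct toy_sock`, `toy_rcvqueues_full()`, `toy_add_backlog()`); the exact comparison the kernel performs may differ.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
    unsigned int rcvbuf;        /* models sk_rcvbuf */
    unsigned int sndbuf;        /* models sk_sndbuf */
    unsigned int backlog_len;   /* bytes already in the backlog */
    unsigned int rmem_alloc;    /* bytes already in the receive queue */
};

/* Returns 1 if the receive queues already exceed the caller's limit. */
int toy_rcvqueues_full(const struct toy_sock *sk, unsigned int limit)
{
    assert(sk != NULL);
    return sk->backlog_len + sk->rmem_alloc > limit;
}

/* Returns 0 if the packet was queued, -1 if it was dropped.
 * The limit is chosen by the caller instead of being hard coded. */
int toy_add_backlog(struct toy_sock *sk, unsigned int truesize,
                    unsigned int limit)
{
    if (toy_rcvqueues_full(sk, limit))
        return -1;
    sk->backlog_len += truesize;
    return 0;
}
```

A caller that passes `sk->rcvbuf` keeps the old behavior, while a TCP-like caller passes `sk->rcvbuf + sk->sndbuf` and so tolerates a larger backlog of ACKs.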

23 Apr, 2012 1 commit

tcp: Fix build warning after tcp_{v4,v6}_init_sock consolidation. · ac807fa8

Committed by David S. Miller on Apr 23, 2012

net/ipv4/tcp_ipv4.c: In function 'tcp_v4_init_sock':
net/ipv4/tcp_ipv4.c:1891:19: warning: unused variable 'tp' [-Wunused-variable]
net/ipv6/tcp_ipv6.c: In function 'tcp_v6_init_sock':
net/ipv6/tcp_ipv6.c:1836:19: warning: unused variable 'tp' [-Wunused-variable]
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac807fa8

22 Apr, 2012 2 commits

tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() · 900f65d3

Committed by Neal Cardwell on Apr 19, 2012

This commit moves the (substantial) common code shared between
tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
independent function, tcp_init_sock().

Centralizing this functionality should help avoid drift issues,
e.g. where the IPv4 side is updated without a corresponding update to
IPv6. There was already some drift: IPv4 initialized snd_cwnd to
TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
2 (in this case it should not matter, since snd_cwnd is also
initialized in tcp_init_metrics(), but the general risks and
maintenance overhead remain).

When diffing the old and new code, note that new tcp_init_sock()
function uses the order of steps from the tcp_v4_init_sock()
implementation (the order is slightly different in
tcp_v6_init_sock()).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

900f65d3

tcp: Initial repair mode · ee995283

Committed by Pavel Emelyanov on Apr 19, 2012

This includes (according to the previous description):

* TCP_REPAIR sockoption

This one just puts the socket in/out of the repair mode.
Allowed for CAP_NET_ADMIN and for closed/established sockets only.
When repair mode is turned off and the socket happens to be in
the established state the window probe is sent to the peer to
'unlock' the connection.

* TCP_REPAIR_QUEUE sockoption

This one sets the queue which we're about to repair. The
'no-queue' is set by default.

* TCP_QUEUE_SEQ sockoption

Sets the write_seq/rcv_nxt of a selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state the other seq-s are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).

* Ability to forcibly bind a socket to a port

The sk->sk_reuse is set to SK_FORCE_REUSE.

* Immediate connect modification

The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.

* Silent close modification

The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to peer.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee995283
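From user space, the three socket options listed above would be used roughly as follows. This is a hedged sketch rather than a tested checkpoint/restore tool: `toy_restore_send_seq()` and `TOY_SEND_QUEUE` are hypothetical names, the fallback option numbers are the ones I understand Linux's linux/tcp.h to define, and the call needs CAP_NET_ADMIN and a closed socket to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Fallbacks for libc headers that lack these options; the values are
 * assumed to match Linux's linux/tcp.h. */
#ifndef TCP_REPAIR
#define TCP_REPAIR 19
#endif
#ifndef TCP_REPAIR_QUEUE
#define TCP_REPAIR_QUEUE 20
#endif
#ifndef TCP_QUEUE_SEQ
#define TCP_QUEUE_SEQ 21
#endif

/* Queue selector for TCP_REPAIR_QUEUE; 2 is assumed to be TCP_SEND_QUEUE. */
#define TOY_SEND_QUEUE 2

/* Enters repair mode on a closed TCP socket, sets the send queue's
 * sequence number, and leaves repair mode again.
 * Returns 0 on success or a negative errno value. On failure the socket
 * may be left in repair mode; a real tool would clean up. */
int toy_restore_send_seq(int fd, unsigned int seq)
{
    int on = 1, off = 0, queue = TOY_SEND_QUEUE;

    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)) < 0)
        return -errno;  /* e.g. EPERM without CAP_NET_ADMIN */
    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &queue, sizeof(queue)) < 0)
        return -errno;
    if (setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &seq, sizeof(seq)) < 0)
        return -errno;
    if (setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off)) < 0)
        return -errno;
    return 0;
}
```

An unprivileged caller should expect a negative return value from the first `setsockopt()` call, since the option is restricted to CAP_NET_ADMIN.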

06 Apr, 2012 1 commit

netdma: adding alignment check for NETDMA ops · a2bd1140

Committed by Dave Jiang on Apr 04, 2012

This is the fallout from adding a memcpy alignment workaround for certain
IOATDMA hardware. NetDMA will only use a DMA engine that can handle
byte-aligned ops.
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a2bd1140

13 Mar, 2012 1 commit

net: ipv4: Standardize prefixes for message logging · afd46503

Committed by Joe Perches on Mar 12, 2012

Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

afd46503
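How a per-file `pr_fmt()` definition ends up prefixing every message can be shown with a user-space stand-in. In the sketch below `pr_fmt` follows the form quoted in these commit messages, while `toy_pr_info()` and `toy_report_drop()` are hypothetical and write to a buffer instead of the kernel log.

```c
#include <assert.h>
#include <stdio.h>

/* Must be defined before the logging macros that expand it. */
#define pr_fmt(fmt) "TCP: " fmt

/* User-space stand-in for pr_info(): formats into a buffer.
 * ##__VA_ARGS__ is a GNU extension that gcc and clang accept. */
#define toy_pr_info(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* The call site carries no prefix of its own; pr_fmt() supplies it.
 * Returns the length of the formatted message, as snprintf() does. */
int toy_report_drop(char *buf, size_t size, int port)
{
    assert(buf != NULL && size > 0);
    return toy_pr_info(buf, size, "dropping SYN on port %d\n", port);
}
```

Because the prefix is pasted onto the format string at compile time, changing it for a whole file means editing one `#define` rather than every message.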

12 Mar, 2012 2 commits

net: Convert printks to pr_<level> · 058bd4d2

Committed by Joe Perches on Mar 11, 2012

Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini.  Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
  #define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
 887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

058bd4d2

tcp: fix syncookie regression · dfd25fff

Committed by Eric Dumazet on Mar 10, 2012

commit ea4fc0d6 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
added a serious regression on synflood handling.

Simon Kirby discovered a successful connection was delayed by 20 seconds
before being responsive.

In my tests, I discovered that xmit frames were lost, and needed ~4
retransmits and a socket dst rebuild before being really sent.

In case of syncookie initiated connection, we use a different path to
initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

As ip_queue_xmit() now depends on inet flow being setup, fix this by
copying the temp flowi4 we use in cookie_v4_check().
Reported-by: Simon Kirby <sim@netnation.com>
Bisected-by: Simon Kirby <sim@netnation.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dfd25fff

08 Mar, 2012 1 commit

tcp: md5: correct a RCU lockdep splat · b4fb05ea

Committed by Eric Dumazet on Mar 07, 2012

commit a8afca03 (tcp: md5: protects md5sig_info with RCU) added a
lockdep splat in tcp_md5_do_lookup() in case a timer fires a tcp
retransmit.

At this point, the socket lock is owned by the softirq handler, not the user,
so we should adjust the lockdep condition a bit, as we don't hold
rcu_read_lock().
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>

b4fb05ea

13 Feb, 2012 1 commit

net: implement IP_RECVTOS for IP_PKTOPTIONS · 4c507d28

Committed by Jiri Benc on Feb 09, 2012

Currently, it is not easily possible to get TOS/DSCP value of packets from
an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
not actually implemented for TOS, though.

This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
(IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
the TOS field is stored from the other party's ACK.

This is needed for proxies which require DSCP transparency. One such example
is at http://zph.bratcheda.org/.
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c507d28

05 Feb, 2012 1 commit

tcp_v4_send_reset: binding oif to iif in no sock case · e2446eaa

Committed by Shawn Lu on Feb 04, 2012

Bind the outgoing interface of a RST packet to the incoming interface
for TCP v4 when there is no socket associated with it.
When sk is not NULL, use sk->sk_bound_dev_if instead
(suggested by Eric Dumazet).

This has a few benefits:
1. tcp_v6_send_reset already does that.
2. This helps TCP connections with SO_BINDTODEVICE set. When the
connection is lost, we are still able to send out the RST using
the same interface.
3. We are sending a reply, which is most likely to succeed
if iif is used.
Signed-off-by: Shawn Lu <shawn.lu@ericsson.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2446eaa
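The interface selection rule described in this commit fits in one expression. A user-space sketch with hypothetical types (`struct toy_sock`, `struct toy_skb`) and a hypothetical function name (`toy_rst_oif()`):

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock {
    int bound_dev_if;   /* models sk->sk_bound_dev_if; 0 means not bound */
};

struct toy_skb {
    int iif;            /* index of the interface the packet arrived on */
};

/* Picks the outgoing interface for a RST sent in reply to skb.
 * sk may be NULL when no socket matches the packet. */
int toy_rst_oif(const struct toy_sock *sk, const struct toy_skb *skb)
{
    assert(skb != NULL);
    return sk != NULL ? sk->bound_dev_if : skb->iif;
}
```

In this model a socket that exists but is not bound to a device yields 0, which leaves the choice of interface to the normal route lookup.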

02 Feb, 2012 1 commit

tcp: md5: RST: getting md5 key from listener · 658ddaaf

Committed by Shawn Lu on Jan 31, 2012

The TCP RST mechanism is broken in TCP MD5 (RFC 2385). When a
connection is gone, its md5 key is lost, and a RST sent
without an md5 hash will be ignored by the peer. This can
be a problem since RSTs help protocols like BGP to recover
quickly from a peer crash.

In most cases, users of TCP MD5, such as BGP and LDP,
have listeners on both sides to accept connections from the peer.
The md5 keys for peers are saved in the listening socket.

There are two cases in finding the md5 key when a connection is
lost:
1. Passive receive RST: The message is sent to a well-known port, and
tcp will associate it with the listener. The md5 key is taken from the
listener.

2. Active receive RST (no sock): The message is sent to the active
side, and there is no socket associated with the message. In this
case, find the listener from the source port, then find the md5 key from
the listener.

We are not losing security here:
the packet is checked against the md5 hash. No RST is generated
if the md5 hash doesn't match or no md5 key can be found.
Signed-off-by: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

658ddaaf

01 Feb, 2012 3 commits

tcp: md5: protects md5sig_info with RCU · a8afca03

Committed by Eric Dumazet on Jan 31, 2012

This patch makes sure we use appropriate memory barriers before
publishing tp->md5sig_info, allowing tcp_md5_do_lookup() to be used from
tcp_v4_send_reset() without holding the socket lock (upcoming patch from
Shawn Lu).

Note we also need to respect an rcu grace period before freeing it, since
we can free the socket without this grace period thanks to
SLAB_DESTROY_BY_RCU.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a8afca03

tcp: md5: use sock_kmalloc() to limit md5 keys · 5f3d9cb2

Committed by Eric Dumazet on Jan 31, 2012

There is no limit on number of MD5 keys an application can attach to a
tcp socket.

This patch adds a per tcp socket limit based
on /proc/sys/net/core/optmem_max

With current default optmem_max values, this allows about 150 keys on
64bit arches, and 88 keys on 32bit arches.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5f3d9cb2

tcp: md5: rcu conversion · a915da9b

Committed by Eric Dumazet on Jan 31, 2012

In order to be able to support proper RST messages for TCP MD5 flows, we
need to allow access to MD5 keys without locking listener socket.

This conversion is a nice cleanup, and shrinks size of timewait sockets
by 80 bytes.

IPv6 code reuses generic code found in IPv4 instead of duplicating it.

Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a915da9b

openanolis / cloud-kernel synced successfully about 1 year ago
