提交 · 900f65d361d333c949ef76a828343075f4fdf523 · _Walt / cloud-kernel

22 4月, 2012 2 次提交

tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() · 900f65d3

由 Neal Cardwell 提交于 4月 19, 2012

This commit moves the (substantial) common code shared between
tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
independent function, tcp_init_sock().

Centralizing this functionality should help avoid drift issues,
e.g. where the IPv4 side is updated without a corresponding update to
IPv6. There was already some drift: IPv4 initialized snd_cwnd to
TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
2 (in this case it should not matter, since snd_cwnd is also
initialized in tcp_init_metrics(), but the general risks and
maintenance overhead remain).

When diffing the old and new code, note that new tcp_init_sock()
function uses the order of steps from the tcp_v4_init_sock()
implementation (the order is slightly different in
tcp_v6_init_sock()).
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

900f65d3

tcp: Initial repair mode · ee995283

由 Pavel Emelyanov 提交于 4月 19, 2012

This includes (according the the previous description):

* TCP_REPAIR sockoption

This one just puts the socket in/out of the repair mode.
Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
When repair mode is turned off and the socket happens to be in
the established state the window probe is sent to the peer to
'unlock' the connection.

* TCP_REPAIR_QUEUE sockoption

This one sets the queue which we're about to repair. The
'no-queue' is set by default.

* TCP_QUEUE_SEQ socoption

Sets the write_seq/rcv_nxt of a selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state the other seq-s are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).

* Ability to forcibly bind a socket to a port

The sk->sk_reuse is set to SK_FORCE_REUSE.

* Immediate connect modification

The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.

* Silent close modification

The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to peer.
Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ee995283

06 4月, 2012 1 次提交

netdma: adding alignment check for NETDMA ops · a2bd1140

由 Dave Jiang 提交于 4月 04, 2012

This is the fallout from adding memcpy alignment workaround for certain
IOATDMA hardware. NetDMA will only use DMA engine that can handle byte align
ops.
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

a2bd1140

13 3月, 2012 1 次提交

net: ipv4: Standardize prefixes for message logging · afd46503

由 Joe Perches 提交于 3月 12, 2012

Add #define pr_fmt(fmt) as appropriate.

Add "IPv4: ", "TCP: ", and "IPsec: " to appropriate files.
Standardize on "UDPLite: " for appropriate uses.
Some prefixes were previously "UDPLITE: " and "UDP-Lite: ".

Add KBUILD_MODNAME ": " to icmp and gre.
Remove embedded prefixes as appropriate.

Add missing "\n" to pr_info in gre.c.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

afd46503

12 3月, 2012 2 次提交

net: Convert printks to pr_<level> · 058bd4d2

由 Joe Perches 提交于 3月 11, 2012

Use a more current kernel messaging style.

Convert a printk block to print_hex_dump.
Coalesce formats, align arguments.
Use %s, __func__ instead of embedding function names.

Some messages that were prefixed with <foo>_close are
now prefixed with <foo>_fini.  Some ah4 and esp messages
are now not prefixed with "ip ".

The intent of this patch is to later add something like
  #define pr_fmt(fmt) "IPv4: " fmt.
to standardize the output messages.

Text size is trivially reduced. (x86-32 allyesconfig)

$ size net/ipv4/built-in.o*
   text	   data	    bss	    dec	    hex	filename
 887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
 887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

058bd4d2

tcp: fix syncookie regression · dfd25fff

由 Eric Dumazet 提交于 3月 10, 2012

commit ea4fc0d6 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
added a serious regression on synflood handling.

Simon Kirby discovered a successful connection was delayed by 20 seconds
before being responsive.

In my tests, I discovered that xmit frames were lost, and needed ~4
retransmits and a socket dst rebuild before being really sent.

In case of syncookie initiated connection, we use a different path to
initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

As ip_queue_xmit() now depends on inet flow being setup, fix this by
copying the temp flowi4 we use in cookie_v4_check().
Reported-by: NSimon Kirby <sim@netnation.com>
Bisected-by: NSimon Kirby <sim@netnation.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dfd25fff

08 3月, 2012 1 次提交

tcp: md5: correct a RCU lockdep splat · b4fb05ea

由 Eric Dumazet 提交于 3月 07, 2012

commit a8afca03 (tcp: md5: protects md5sig_info with RCU) added a
lockdep splat in tcp_md5_do_lookup() in case a timer fires a tcp
retransmit.

At this point, socket lock is owned by the sofirq handler, not the user,
so we should adjust a bit the lockdep condition, as we dont hold
rcu_read_lock().
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Reported-by: NValdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b4fb05ea

13 2月, 2012 1 次提交

net: implement IP_RECVTOS for IP_PKTOPTIONS · 4c507d28

由 Jiri Benc 提交于 2月 09, 2012

Currently, it is not easily possible to get TOS/DSCP value of packets from
an incoming TCP stream. The mechanism is there, IP_PKTOPTIONS getsockopt
with IP_RECVTOS set, the same way as incoming TTL can be queried. This is
not actually implemented for TOS, though.

This patch adds this functionality, both for IPv4 (IP_PKTOPTIONS) and IPv6
(IPV6_2292PKTOPTIONS). For IPv4, like in the IP_RECVTTL case, the value of
the TOS field is stored from the other party's ACK.

This is needed for proxies which require DSCP transparency. One such example
is at http://zph.bratcheda.org/.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c507d28

05 2月, 2012 1 次提交

tcp_v4_send_reset: binding oif to iif in no sock case · e2446eaa

由 Shawn Lu 提交于 2月 04, 2012

Binding RST packet outgoing interface to incoming interface
for tcp v4 when there is no socket associate with it.
when sk is not NULL, using sk->sk_bound_dev_if instead.
(suggested by Eric Dumazet).

This has few benefits:
1. tcp_v6_send_reset already did that.
2. This helps tcp connect with SO_BINDTODEVICE set. When
connection is lost, we still able to sending out RST using
same interface.
3. we are sending reply, it is most likely to be succeed
if iif is used
Signed-off-by: NShawn Lu <shawn.lu@ericsson.com>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e2446eaa

02 2月, 2012 1 次提交

tcp: md5: RST: getting md5 key from listener · 658ddaaf

由 Shawn Lu 提交于 1月 31, 2012

TCP RST mechanism is broken in TCP md5(RFC2385). When
connection is gone, md5 key is lost, sending RST
without md5 hash is deem to ignored by peer. This can
be a problem since RST help protocal like bgp to fast
recove from peer crash.

In most case, users of tcp md5, such as bgp and ldp,
have listener on both sides to accept connection from peer.
md5 keys for peers are saved in listening socket.

There are two cases in finding md5 key when connection is
lost:
1.Passive receive RST: The message is send to well known port,
tcp will associate it with listner. md5 key is gotten from
listener.

2.Active receive RST (no sock): The message is send to ative
side, there is no socket associated with the message. In this
case, finding listener from source port, then find md5 key from
listener.

we are not loosing sercuriy here:
packet is checked with md5 hash. No RST is generated
if md5 hash doesn't match or no md5 key can be found.
Signed-off-by: NShawn Lu <shawn.lu@ericsson.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

658ddaaf

01 2月, 2012 4 次提交

tcp: md5: protects md5sig_info with RCU · a8afca03

由 Eric Dumazet 提交于 1月 31, 2012

This patch makes sure we use appropriate memory barriers before
publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from
tcp_v4_send_reset() without holding socket lock (upcoming patch from
Shawn Lu)

Note we also need to respect rcu grace period before its freeing, since
we can free socket without this grace period thanks to
SLAB_DESTROY_BY_RCU
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a8afca03

tcp: md5: use sock_kmalloc() to limit md5 keys · 5f3d9cb2

由 Eric Dumazet 提交于 1月 31, 2012

There is no limit on number of MD5 keys an application can attach to a
tcp socket.

This patch adds a per tcp socket limit based
on /proc/sys/net/core/optmem_max

With current default optmem_max values, this allows about 150 keys on
64bit arches, and 88 keys on 32bit arches.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5f3d9cb2

tcp: md5: rcu conversion · a915da9b

由 Eric Dumazet 提交于 1月 31, 2012

In order to be able to support proper RST messages for TCP MD5 flows, we
need to allow access to MD5 keys without locking listener socket.

This conversion is a nice cleanup, and shrinks size of timewait sockets
by 80 bytes.

IPv6 code reuses generic code found in IPv4 instead of duplicating it.

Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Cc: Shawn Lu <shawn.lu@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a915da9b

tcp: md5: remove obsolete md5_add() method · a2d91241

由 Eric Dumazet 提交于 1月 31, 2012

We no longer use md5_add() method from struct tcp_sock_af_ops
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a2d91241

23 1月, 2012 1 次提交

tcp: md5: using remote adress for md5 lookup in rst packet · 8a622e71

由 shawnlu 提交于 1月 20, 2012

md5 key is added in socket through remote address.
remote address should be used in finding md5 key when
sending out reset packet.
Signed-off-by: Nshawnlu <shawn.lu@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a622e71

13 12月, 2011 3 次提交

per-netns ipv4 sysctl_tcp_mem · 3dc43e3e

由 Glauber Costa 提交于 12月 11, 2011

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make this values
per group of process somehow. This is achieved in the
patches that follows in this patchset.
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3dc43e3e

tcp memory pressure controls · d1a4c0b3

由 Glauber Costa 提交于 12月 11, 2011

This patch introduces memory pressure controls for the tcp
protocol. It uses the generic socket memory pressure code
introduced in earlier patches, and fills in the
necessary data in cg_proto struct.
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d1a4c0b3

foundations of per-cgroup memory pressure controlling. · 180d8cd9

由 Glauber Costa 提交于 12月 11, 2011

This patch replaces all uses of struct sock fields' memory_pressure,
memory_allocated, sockets_allocated, and sysctl_mem to acessor
macros. Those macros can either receive a socket argument, or a mem_cgroup
argument, depending on the context they live in.

Since we're only doing a macro wrapping here, no performance impact at all is
expected in the case where we don't have cgroups disabled.
Signed-off-by: NGlauber Costa <glommer@parallels.com>
Reviewed-by: NHiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

180d8cd9

01 12月, 2011 1 次提交

tcp: inherit listener congestion control for passive cnx · d8a6e65f

由 Eric Dumazet 提交于 11月 30, 2011

Rick Jones reported that TCP_CONGESTION sockopt performed on a listener
was ignored for its children sockets : right after accept() the
congestion control for new socket is the system default one.

This seems an oversight of the initial design (quoted from Stephen)

Based on prior investigation and patch from Rick.
Reported-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Stephen Hemminger <shemminger@vyatta.com>
CC: Yuchung Cheng <ycheng@google.com>
Tested-by: NRick Jones <rick.jones2@hp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d8a6e65f

17 11月, 2011 1 次提交

tcp: clear xmit timers in tcp_v4_syn_recv_sock() · 709e8697

由 Eric Dumazet 提交于 11月 14, 2011

Simon Kirby reported divides by zero errors in __tcp_select_window()

This happens when inet_csk_route_child_sock() returns a NULL pointer :

We free new socket while we eventually armed keepalive timer in
tcp_create_openreq_child()

Fix this by a call to tcp_clear_xmit_timers()

[ This is a followup to commit 918eb399 (net: add missing
bh_unlock_sock() calls) ]
Reported-by: NSimon Kirby <sim@hostway.ca>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Tested-by: NSimon Kirby <sim@hostway.ca>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

709e8697

04 11月, 2011 1 次提交

net: add missing bh_unlock_sock() calls · 918eb399

由 Eric Dumazet 提交于 11月 02, 2011

Simon Kirby reported lockdep warnings and following messages :

[104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

[104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740
preempt_count 00000101, exited with 00000102?

Problem comes from commit 0e734419
(ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.)

If inet_csk_route_child_sock() returns NULL, we should release socket
lock before freeing it.

Another lock imbalance exists if __inet_inherit_port() returns an error
since commit 093d2823 ( tproxy: fix hash locking issue when using
port redirection in __inet_inherit_port()) a backport is also needed for
>= 2.6.37 kernels.
Reported-by: NSimon Kirby <sim@hostway.ca>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Balazs Scheidler <bazsi@balabit.hu>
CC: KOVACS Krisztian <hidden@balabit.hu>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NSimon Kirby <sim@hostway.ca>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

918eb399

02 11月, 2011 1 次提交

net: make the tcp and udp file_operations for the /proc stuff const · 73cb88ec

由 Arjan van de Ven 提交于 10月 30, 2011

the tcp and udp code creates a set of struct file_operations at runtime
while it can also be done at compile time, with the added benefit of then
having these file operations be const.

the trickiest part was to get the "THIS_MODULE" reference right; the naive
method of declaring a struct in the place of registration would not work
for this reason.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

73cb88ec

24 10月, 2011 2 次提交

ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAIT · 66b13d99

由 Eric Dumazet 提交于 10月 24, 2011

There is a long standing bug in linux tcp stack, about ACK messages sent
on behalf of TIME_WAIT sockets.

In the IP header of the ACK message, we choose to reflect TOS field of
incoming message, and this might break some setups.

Example of things that were broken :
  - Routing using TOS as a selector
  - Firewalls
  - Trafic classification / shaping

We now remember in timewait structure the inet tos field and use it in
ACK generation, and route lookup.

Notes :
 - We still reflect incoming TOS in RST messages.
 - We could extend MuraliRaja Muniraju patch to report TOS value in
netlink messages for TIME_WAIT sockets.
 - A patch is needed for IPv6
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

66b13d99

tcp: md5: add more const attributes · 318cf7aa

由 Eric Dumazet 提交于 10月 24, 2011

Now tcp_md5_hash_header() has a const tcphdr argument, we can add more
const attributes to callers.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

318cf7aa

21 10月, 2011 1 次提交

tcp: add const qualifiers where possible · cf533ea5

由 Eric Dumazet 提交于 10月 21, 2011

Adding const qualifiers to pointers can ease code review, and spot some
bugs. It might allow compiler to optimize code further.

For example, is it legal to temporary write a null cksum into tcphdr
in tcp_md5_hash_header() ? I am afraid a sniffer could catch the
temporary null value...
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf533ea5

05 10月, 2011 1 次提交

tcp: properly handle md5sig_pool references · 260fcbeb

由 Yan, Zheng 提交于 9月 29, 2011

tcp_v4_clear_md5_list() assumes that multiple tcp md5sig peers
only hold one reference to md5sig_pool. but tcp_v4_md5_do_add()
increases use count of md5sig_pool for each peer. This patch
makes tcp_v4_md5_do_add() only increases use count for the first
tcp md5sig peer.
Signed-off-by: NZheng Yan <zheng.z.yan@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

260fcbeb

27 9月, 2011 1 次提交

tcp: unalias tcp_skb_cb flags and ip_dsfield · b82d1bb4

由 Eric Dumazet 提交于 9月 27, 2011

struct tcp_skb_cb contains a "flags" field containing either tcp flags
or IP dsfield depending on context (input or output path)

Introduce ip_dsfield to make the difference clear and ease maintenance.
If later we want to save space, we can union flags/ip_dsfield
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b82d1bb4

16 9月, 2011 1 次提交

tcp: Change possible SYN flooding messages · 946cedcc

由 Eric Dumazet 提交于 8月 30, 2011

"Possible SYN flooding on port xxxx " messages can fill logs on servers.

Change logic to log the message only once per listener, and add two new
SNMP counters to track :

TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client

TCPReqQFullDrop : number of times a SYN request was dropped because
syncookies were not enabled.

Based on a prior patch from Tom Herbert, and suggestions from David.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

946cedcc

18 8月, 2011 1 次提交

rps: Add flag to skb to indicate rxhash is based on L4 tuple · bdeab991

由 Tom Herbert 提交于 8月 14, 2011

The l4_rxhash flag was added to the skb structure to indicate
that the rxhash value was computed over the 4 tuple for the
packet which includes the port information in the encapsulated
transport packet.  This is used by the stack to preserve the
rxhash value in __skb_rx_tunnel.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bdeab991

07 8月, 2011 1 次提交

net: Compute protocol sequence numbers and fragment IDs using MD5. · 6e5714ea

由 David S. Miller 提交于 8月 03, 2011

Computers have become a lot faster since we compromised on the
partial MD4 hash which we use currently for performance reasons.

MD5 is a much safer choice, and is inline with both RFC1948 and
other ISS generators (OpenBSD, Solaris, etc.)

Furthermore, only having 24-bits of the sequence number be truly
unpredictable is a very serious limitation.  So the periodic
regeneration and 8-bit counter have been removed.  We compute and
use a full 32-bit sequence number.

For ipv6, DCCP was found to use a 32-bit truncated initial sequence
number (it needs 43-bits) and that is fixed here as well.
Reported-by: NDan Kaminsky <dan@doxpara.com>
Tested-by: NWilly Tarreau <w@1wt.eu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6e5714ea

18 6月, 2011 1 次提交

net: rfs: enable RFS before first data packet is received · 1eddcead

由 Eric Dumazet 提交于 6月 17, 2011

Le jeudi 16 juin 2011 à 23:38 -0400, David Miller a écrit :
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Fri, 17 Jun 2011 00:50:46 +0100
>
> > On Wed, 2011-06-15 at 04:15 +0200, Eric Dumazet wrote:
> >> @@ -1594,6 +1594,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
> >>  			goto discard;
> >>
> >>  		if (nsk != sk) {
> >> +			sock_rps_save_rxhash(nsk, skb->rxhash);
> >>  			if (tcp_child_process(sk, nsk, skb)) {
> >>  				rsk = nsk;
> >>  				goto reset;
> >>
> >
> > I haven't tried this, but it looks reasonable to me.
> >
> > What about IPv6?  The logic in tcp_v6_do_rcv() looks very similar.
>
> Indeed ipv6 side needs the same fix.
>
> Eric please add that part and resubmit.  And in fact I might stick
> this into net-2.6 instead of net-next-2.6
>

OK, here is the net-2.6 based one then, thanks !

[PATCH v2] net: rfs: enable RFS before first data packet is received

First packet received on a passive tcp flow is not correctly RFS
steered.

One sock_rps_record_flow() call is missing in inet_accept()

But before that, we also must record rxhash when child socket is setup.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Tom Herbert <therbert@google.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: NDavid S. Miller <davem@conan.davemloft.net>

1eddcead

09 6月, 2011 1 次提交

tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049

由 Jerry Chu 提交于 6月 08, 2011

This patch lowers the default initRTO from 3secs to 1sec per
RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
has been retransmitted, AND the TCP timestamp option is not on.

It also adds support to take RTT sample during 3WHS on the passive
open side, just like its active open counterpart, and uses it, if
valid, to seed the initRTO for the data transmission phase.

The patch also resets ssthresh to its initial default at the
beginning of the data transmission phase, and reduces cwnd to 1 if
there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9ad7c049

24 5月, 2011 1 次提交

net: convert %p usage to %pK · 71338aa7

由 Dan Rosenberg 提交于 5月 23, 2011

The %pK format specifier is designed to hide exposed kernel pointers,
specifically via /proc interfaces.  Exposing these pointers provides an
easy target for kernel write vulnerabilities, since they reveal the
locations of writable structures containing easily triggerable function
pointers.  The behavior of %pK depends on the kptr_restrict sysctl.

If kptr_restrict is set to 0, no deviation from the standard %p behavior
occurs.  If kptr_restrict is set to 1, the default, if the current user
(intended to be a reader via seq_printf(), etc.) does not have CAP_SYSLOG
(currently in the LSM tree), kernel pointers using %pK are printed as 0's.
 If kptr_restrict is set to 2, kernel pointers using %pK are printed as
0's regardless of privileges.  Replacing with 0's was chosen over the
default "(null)", which cannot be parsed by userland %p, which expects
"(nil)".

The supporting code for kptr_restrict and %pK are currently in the -mm
tree.  This patch converts users of %p in net/ to %pK.  Cases of printing
pointers to the syslog are not covered, since this would eliminate useful
information for postmortem debugging and the reading of the syslog is
already optionally protected by the dmesg_restrict sysctl.
Signed-off-by: NDan Rosenberg <drosenberg@vsecurity.com>
Cc: James Morris <jmorris@namei.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Thomas Graf <tgraf@infradead.org>
Cc: Eugene Teo <eugeneteo@kernel.org>
Cc: Kees Cook <kees.cook@canonical.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

71338aa7

19 5月, 2011 3 次提交
- D
  ipv4: Pass explicit destination address to rt_bind_peer(). · a48eff12
  由 David S. Miller 提交于 5月 18, 2011
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  a48eff12
- D
  ipv4: Pass explicit destination address to rt_get_peer(). · ed2361e6
  由 David S. Miller 提交于 5月 18, 2011
```
This will next trickle down to rt_bind_peer().
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  ed2361e6
- D
  ipv4: Make caller provide flowi4 key to inet_csk_route_req(). · 6bd023f3
  由 David S. Miller 提交于 5月 18, 2011
```
This way the caller can get at the fully resolved fl4->{daddr,saddr}
etc.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  6bd023f3
11 5月, 2011 1 次提交
- D
  ipv4: Pass explicit daddr arg to ip_send_reply(). · 0a5ebb80
  由 David S. Miller 提交于 5月 09, 2011
```
This eliminates an access to rt->rt_src.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  0a5ebb80
09 5月, 2011 3 次提交

D
tcp: Use cork flow info instead of rt->rt_dst in tcp_v4_get_peer() · c5216cc7
由 David S. Miller 提交于 5月 06, 2011
```
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
c5216cc7

ipv4: Use inet_csk_route_child_sock() in DCCP and TCP. · 0e734419

由 David S. Miller 提交于 5月 08, 2011

Operation order is now transposed, we first create the child
socket then we try to hook up the route.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e734419

tcp: Use cork flow in tcp_v4_connect() · da905bd1

由 David S. Miller 提交于 5月 06, 2011

Since this is invoked from inet_stream_connect() the socket is locked
and therefore this usage is safe.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

da905bd1

_Walt / cloud-kernel 与 Fork 源项目一致

_Walt / cloud-kernel
与 Fork 源项目一致