提交 · a5ad24be728d4352b71a81fba471aa41eb71f83a · bug2833 / cloud-kernel

30 1月, 2009 1 次提交

gro: Avoid copying headers of unmerged packets · 86911732

由 Herbert Xu 提交于 1月 29, 2009

Unfortunately simplicity isn't always the best.  The fraginfo
interface turned out to be suboptimal.  The problem was quite
obvious.  For every packet, we have to copy the headers from
the frags structure into skb->head, even though for 99% of the
packets this part is immediately thrown away after the merge.

LRO didn't have this problem because it directly read the headers
from the frags structure.

This patch attempts to address this by creating an interface
that allows GRO to access the headers in the first frag without
having to copy it.  Because all drivers that use frags place the
headers in the first frag this optimisation should be enough.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

86911732

27 1月, 2009 1 次提交

tcp: Fix length tcp_splice_data_recv passes to skb_splice_bits. · 9fa5fdf2

由 Dimitris Michailidis 提交于 1月 26, 2009

tcp_splice_data_recv has two lengths to consider: the len parameter it
gets from tcp_read_sock, which specifies the amount of data in the skb,
and rd_desc->count, which is the amount of data the splice caller still
wants. Currently it passes just the latter to skb_splice_bits, which then
splices min(rd_desc->count, skb->len - offset) bytes.

Most of the time this is fine, except when the skb contains urgent data.
In that case len goes only up to the urgent byte and is less than
skb->len - offset. By ignoring len tcp_splice_data_recv may a) splice
data tcp_read_sock told it not to, b) return to tcp_read_sock a value > len.

Now, tcp_read_sock doesn't handle used > len and leaves the socket in a
bad state (both sk_receive_queue and copied_seq are bad at that point)
resulting in duplicated data and corruption.

Fix by passing min(rd_desc->count, len) to skb_splice_bits.
Signed-off-by: NDimitris Michailidis <dm@chelsio.com>
Acked-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9fa5fdf2

15 1月, 2009 1 次提交

gso: Ensure that the packet is long enough · 4e704ee3

由 Herbert Xu 提交于 1月 14, 2009

When we get a GSO packet from an untrusted source, we need to
ensure that it is sufficiently long so that we don't end up
crashing.

Based on discovery and patch by Ian Campbell.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Tested-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e704ee3

14 1月, 2009 1 次提交

tcp: splice as many packets as possible at once · 33966dd0

由 Willy Tarreau 提交于 1月 13, 2009

As spotted by Willy Tarreau, current splice() from tcp socket to pipe is not
optimal. It processes at most one segment per call.
This results in low performance and very high overhead due to syscall rate
when splicing from interfaces which do not support LRO.

Willy provided a patch inside tcp_splice_read(), but a better fix
is to let tcp_read_sock() process as many segments as possible, so
that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
often.

With this change, splice() behaves like tcp_recvmsg(), being able
to consume many skbs in one system call. With typical 1460 bytes
of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
16*1460 = 23360 bytes.
Signed-off-by: NWilly Tarreau <w@1wt.eu>
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33966dd0

09 1月, 2009 1 次提交

tcp6: Add GRO support · 684f2176

由 Herbert Xu 提交于 1月 08, 2009

This patch adds GRO support for TCP over IPv6.  The code is exactly
the same as the IPv4 version except for the pseudo-header checksum
computation.

Note that I've removed the unused tcphdr argument from tcp_v6_check
rather than invent a bogus value for GRO.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

684f2176

07 1月, 2009 2 次提交

net_dma: convert to dma_find_channel · f67b4599

由 Dan Williams 提交于 1月 06, 2009

Use the general-purpose channel allocation provided by dmaengine.
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

f67b4599

dmaengine: up-level reference counting to the module level · 6f49a57a

由 Dan Williams 提交于 1月 06, 2009

Simply, if a client wants any dmaengine channel then prevent all dmaengine
modules from being removed.  Once the clients are done re-enable module
removal.

Why?, beyond reducing complication:
1/ Tracking reference counts per-transaction in an efficient manner, as
   is currently done, requires a complicated scheme to avoid cache-line
   bouncing effects.
2/ Per-transaction ref-counting gives the false impression that a
   dma-driver can be gracefully removed ahead of its user (net, md, or
   dma-slave)
3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but
   if such an engine were built one day we still would not need to notify
   clients of remove events.  The driver can simply return NULL to a
   ->prep() request, something that is much easier for a client to handle.
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

6f49a57a

05 1月, 2009 3 次提交

tcp: Kill extraneous SPLICE_F_NONBLOCK checks. · 7945cc64

由 David S. Miller 提交于 1月 05, 2009

In splice TCP receive, the SPLICE_F_NONBLOCK flag is used
to compute the "timeo" value.  So checking it again inside
of the main receive loop to trigger -EAGAIN processing is
entirely unnecessary.

Noticed by Jarek P. and Lennert Buytenhek.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7945cc64

tcp: don't mask EOF and socket errors on nonblocking splice receive · 4f7d54f5

由 Lennert Buytenhek 提交于 1月 05, 2009

Currently, setting SPLICE_F_NONBLOCK on splice from a TCP socket
results in masking of EOF (RDHUP) and error conditions on the socket
by an -EAGAIN return.  Move the NONBLOCK check in tcp_splice_read()
to be after the EOF and error checks to fix this.
Signed-off-by: NLennert Buytenhek <buytenh@marvell.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f7d54f5

gro: Use gso_size to store MSS · b530256d

由 Herbert Xu 提交于 1月 04, 2009

In order to allow GRO packets without frag_list at all, we need to
store the MSS in the packet itself.  The obvious place is gso_size.
The only thing to watch out for is if the packet ends up not being
GRO then we need to clear gso_size before pushing the packet into
the stack.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b530256d

30 12月, 2008 1 次提交

net: Fix percpu counters deadlock · eb4dea58

由 Herbert Xu 提交于 12月 29, 2008

When we converted the protocol atomic counters such as the orphan
count and the total socket count deadlocks were introduced due to
the mismatch in BH status of the spots that used the percpu counter
operations.

Based on the diagnosis and patch by Peter Zijlstra, this patch
fixes these issues by disabling BH where we may be in process
context.
Reported-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb4dea58

16 12月, 2008 1 次提交

tcp: Add GRO support · bf296b12

由 Herbert Xu 提交于 12月 15, 2008

This patch adds the TCP-specific portion of GRO.  The criterion for
merging is extremely strict (the TCP header must match exactly apart
from the checksum) so as to allow refragmentation.  Otherwise this
is pretty much identical to LRO, except that we support the merging
of ECN packets.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf296b12

26 11月, 2008 2 次提交

net: Use a percpu_counter for orphan_count · dd24c001

由 Eric Dumazet 提交于 11月 25, 2008

Instead of using one atomic_t per protocol, use a percpu_counter
for "orphan_count", to reduce cache line contention on
heavy duty network servers. 
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dd24c001

net: Use a percpu_counter for sockets_allocated · 1748376b

由 Eric Dumazet 提交于 11月 25, 2008

Instead of using one atomic_t per protocol, use a percpu_counter
for "sockets_allocated", to reduce cache line contention on
heavy duty network servers. 

Note : We revert commit (248969ae
net: af_unix can make unix_nr_socks visbile in /proc),
since it is not anymore used after sock_prot_inuse_add() addition
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1748376b

17 11月, 2008 1 次提交

net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7

由 Eric Dumazet 提交于 11月 16, 2008

RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
  price of call_rcu() at freeing time.
- hlist_nulls permits to use few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speedup this case.

__inet_lookup_established() is pretty fast now we dont have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3ab5aee7

05 11月, 2008 1 次提交

tcp: Fix recvmsg MSG_PEEK influence of blocking behavior. · 518a09ef

由 David S. Miller 提交于 11月 05, 2008

Vito Caputo noticed that tcp_recvmsg() returns immediately from
partial reads when MSG_PEEK is used.  In particular, this means that
SO_RCVLOWAT is not respected.

Simply remove the test.  And this matches the behavior of several
other systems, including BSD.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

518a09ef

03 11月, 2008 1 次提交
- J
  net: clean up net/ipv4/ipip.c raw.c tcp.c tcp_minisocks.c tcp_yeah.c xfrm4_policy.c · 5a5f3a8d
  由 Jianjun Kong 提交于 11月 03, 2008
```
Signed-off-by: NJianjun Kong <jianjun@zeuux.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  5a5f3a8d
08 10月, 2008 2 次提交

tcp: kill pointless urg_mode · 33f5f57e

由 Ilpo Järvinen 提交于 10月 07, 2008

It all started from me noticing that this urgent check in
tcp_clean_rtx_queue is unnecessarily inside the loop. Then
I took a longer look to it and found out that the users of
urg_mode can trivially do without, well almost, there was
one gotcha.

Bonus: those funny people who use urg with >= 2^31 write_seq -
snd_una could now rejoice too (that's the only purpose for the
between being there, otherwise a simple compare would have done
the thing). Not that I assume that the rest of the tcp code
happily lives with such mind-boggling numbers :-). Alas, it
turned out to be impossible to set wmem to such numbers anyway,
yes I really tried a big sendfile after setting some wmem but
nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
So I hacked a bit variable to long and found out that it seems
to work... :-)
Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33f5f57e

net: wrap sk->sk_backlog_rcv() · c57943a1

由 Peter Zijlstra 提交于 10月 07, 2008

Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c57943a1

07 10月, 2008 1 次提交
- D
  tcp: Respect SO_RCVLOWAT in tcp_poll(). · c7004482
  由 David S. Miller 提交于 10月 06, 2008
```
Based upon a report by Vito Caputo.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
  c7004482
26 7月, 2008 1 次提交

net: convert BUG_TRAP to generic WARN_ON · 547b792c

由 Ilpo Järvinen 提交于 7月 25, 2008

Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.

I could make at least one BUILD_BUG_ON conversion.
Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

547b792c

19 7月, 2008 1 次提交

tcp: Fix MD5 signatures for non-linear skbs · 49a72dfb

由 Adam Langley 提交于 7月 19, 2008

Currently, the MD5 code assumes that the SKBs are linear and, in the case
that they aren't, happily goes off and hashes off the end of the SKB and
into random memory.

Reported by Stephen Hemminger in [1]. Advice thanks to Stephen and Evgeniy
Polyakov. Also includes a couple of missed route_caps from Stephen's patch
in [2].

[1] http://marc.info/?l=linux-netdev&m=121445989106145&w=2
[2] http://marc.info/?l=linux-netdev&m=121459157816964&w=2Signed-off-by: NAdam Langley <agl@imperialviolet.org>
Acked-by: NStephen Hemminger <shemminger@vyatta.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49a72dfb

18 7月, 2008 1 次提交

mib: put tcp statistics on struct net · 57ef42d5

由 Pavel Emelyanov 提交于 7月 18, 2008

Proc temporary uses stats from init_net.

BTW, TCP_XXX_STATS are beautiful (w/o do { } while (0) facing) again :)
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

57ef42d5

17 7月, 2008 9 次提交

mib: add net to NET_ADD_STATS_USER · ed88098e

由 Pavel Emelyanov 提交于 7月 16, 2008

Done with NET_XXX_STATS macros :)

To be continued...
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ed88098e

mib: add net to NET_INC_STATS_USER · 6f67c817

由 Pavel Emelyanov 提交于 7月 16, 2008

Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6f67c817

mib: add net to NET_INC_STATS_BH · de0744af

由 Pavel Emelyanov 提交于 7月 16, 2008

Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de0744af

mib: add net to NET_INC_STATS · 4e673444

由 Pavel Emelyanov 提交于 7月 16, 2008

Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e673444

sock: add net to prot->enter_memory_pressure callback · 5c52ba17

由 Pavel Emelyanov 提交于 7月 16, 2008

The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't
have where to get the net from.

I decided to add a sk argument, not the net itself, only to factor
all the required sock_net(sk) calls inside the enter_memory_pressure 
callback itself.
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5c52ba17

mib: add net to TCP_DEC_STATS · 74688e48

由 Pavel Emelyanov 提交于 7月 16, 2008

Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

74688e48

mib: add net to TCP_INC_STATS_BH · 63231bdd

由 Pavel Emelyanov 提交于 7月 16, 2008

Same as before - the sock is always there to get the net from,
but there are also some places with the net already saved on 
the stack.
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

63231bdd

mib: add net to TCP_INC_STATS · 81cc8a75

由 Pavel Emelyanov 提交于 7月 16, 2008

Fortunately (almost) all the TCP code has a sock to get the net from :)
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

81cc8a75

net/ipv4/tcp.c: Fix use of PULLHUP instead of POLLHUP in comments. · 70efce27

由 Will Newton 提交于 7月 16, 2008

Change PULLHUP to POLLHUP in tcp_poll comments and clean up another
comment for grammar and coding style.
Signed-off-by: NWill Newton <will.newton@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

70efce27

03 7月, 2008 2 次提交

tcp: fix a size_t < 0 comparison in tcp_read_sock · 374e7b59

由 Octavian Purdila 提交于 7月 03, 2008

<used> should be of type int (not size_t) since recv_actor can return
negative values and it is also used in a < 0 comparison.
Signed-off-by: NOctavian Purdila <opurdila@ixiacom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

374e7b59

tcp: net/ipv4/tcp.c needs linux/scatterlist.h · 81b23b4a

由 Andrew Morton 提交于 7月 03, 2008

alpha:

net/ipv4/tcp.c: In function 'tcp_calc_md5_hash':
net/ipv4/tcp.c:2479: error: implicit declaration of function 'sg_init_table' net/ipv4/tcp.c:2482: error: implicit declaration of function 'sg_set_buf'
net/ipv4/tcp.c:2507: error: implicit declaration of function 'sg_mark_end'
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

81b23b4a

28 6月, 2008 1 次提交

tcp: calculate tcp_mem based on low memory instead of all memory · 57413ebc

由 Miquel van Smoorenburg 提交于 6月 27, 2008

The tcp_mem array which contains limits on the total amount of memory
used by TCP sockets is calculated based on nr_all_pages. On a 32 bits
x86 system, we should base this on the number of lowmem pages.
Signed-off-by: NMiquel van Smoorenburg <miquels@cistron.nl>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

57413ebc

13 6月, 2008 1 次提交

tcp: Revert 'process defer accept as established' changes. · ec0a1966

由 David S. Miller 提交于 6月 12, 2008

This reverts two changesets, ec3c0982
("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
the follow-on bug fix 9ae27e0a
("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

This change causes several problems, first reported by Ingo Molnar
as a distcc-over-loopback regression where connections were getting
stuck.

Ilpo Järvinen first spotted the locking problems.  The new function
added by this code, tcp_defer_accept_check(), only has the
child socket locked, yet it is modifying state of the parent
listening socket.

Fixing that is non-trivial at best, because we can't simply just grab
the parent listening socket lock at this point, because it would
create an ABBA deadlock.  The normal ordering is parent listening
socket --> child socket, but this code path would require the
reverse lock ordering.

Next is a problem noticed by Vitaliy Gusev, he noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
> 		goto death;
> 	}
>
>+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+		tcp_send_active_reset(sk, GFP_ATOMIC);
>+		goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done()
will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
release this sk) as socket is not DEAD. Therefore socket sk will be lost for
freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any
real value or advantage to these new semantics even if we fix all
of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only
is the only thing which is impossible with old approach. Is this really
so valuable? My opinion: no, this is nothing but a new loophole
to consume memory without control.
----------------------------------------

So revert this thing for now.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ec0a1966

12 6月, 2008 2 次提交

net: remove CVS keywords · 0b040829

由 Adrian Bunk 提交于 6月 10, 2008

This patch removes CVS keywords that weren't updated for a long time
from comments.
Signed-off-by: NAdrian Bunk <bunk@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0b040829

tcp md5sig: Share most of hash calcucaltion bits between IPv4 and IPv6. · 8d26d76d

由 YOSHIFUJI Hideaki 提交于 4月 17, 2008

We can share most part of the hash calculation code because
the only difference between IPv4 and IPv6 is their pseudo headers.
Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>

8d26d76d

05 6月, 2008 1 次提交

tcp: Fix for race due to temporary drop of the socket lock in skb_splice_bits. · 293ad604

由 Octavian Purdila 提交于 6月 04, 2008

skb_splice_bits temporary drops the socket lock while iterating over
the socket queue in order to break a reverse locking condition which
happens with sendfile. This, however, opens a window of opportunity
for tcp_collapse() to aggregate skbs and thus potentially free the
current skb used in skb_splice_bits and tcp_read_sock.

This patch fixes the problem by (re-)getting the same "logical skb"
after the lock has been temporary dropped.

Based on idea and initial patch from Evgeniy Polyakov.
Signed-off-by: NOctavian Purdila <opurdila@ixiacom.com>
Acked-by: NEvgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

293ad604

21 4月, 2008 1 次提交

tcp: Trivial fix to correct function name in a comment in net/ipv4/tcp.c · 1f29b058

由 Satoru SATOH 提交于 4月 21, 2008

This is a trivial fix to correct function name in a comment in
net/ipv4/tcp.c.
Signed-off-by: NSatoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1f29b058

bug2833 / cloud-kernel 与 Fork 源项目一致

bug2833 / cloud-kernel
与 Fork 源项目一致