Commits · adf30907d63893e4208dfe3f5c88ae12bc2f25d5 · openanolis / cloud-kernel

03 Jun, 2009 2 commits

Committed by Eric Dumazet on Jun 02, 2009

Define three accessors to get/set dst attached to a skb

struct dst_entry *skb_dst(const struct sk_buff *skb)

void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;

Delete skb->dst field
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

adf30907
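The accessor pattern described in this commit can be sketched in userspace as follows. This is only an illustration: the struct layouts, the plain `int` reference count, and the `_skb_dst` storage field are simplifications assumed here, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types (assumed layouts). */
struct dst_entry {
    int refcnt;             /* plain int here; the kernel uses an atomic */
};

struct sk_buff {
    uintptr_t _skb_dst;     /* hypothetical private storage for the dst */
};

/* Drop one reference; plays the role of dst_release(). */
static void dst_release(struct dst_entry *dst)
{
    if (dst)
        dst->refcnt--;
}

/* Read the dst attached to an skb. */
static struct dst_entry *skb_dst(const struct sk_buff *skb)
{
    return (struct dst_entry *)skb->_skb_dst;
}

/* Attach a dst to an skb (the caller already holds a reference). */
static void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
    skb->_skb_dst = (uintptr_t)dst;
}

/* Replaces the open-coded "dst_release(skb->dst); skb->dst = NULL;" pair. */
static void skb_dst_drop(struct sk_buff *skb)
{
    dst_release(skb_dst(skb));
    skb_dst_set(skb, NULL);
}
```

Hiding the field behind accessors means callers no longer depend on how the pointer is stored, which is what allows the `skb->dst` field itself to be deleted.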

net: skb->rtable accessor · 511c3f92

Committed by Eric Dumazet on Jun 02, 2009

Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb

Delete skb->rtable field

Setting rtable is not allowed, just set dst instead as rtable is an alias.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

511c3f92

02 Jun, 2009 2 commits

ipv4: New multicast-all socket option · f771bef9

Committed by Nivedita Singhvi on May 28, 2009

After some discussion offline with Christoph Lameter and David Stevens
regarding multicast behaviour in Linux, I'm submitting a slightly
modified patch from the one Christoph submitted earlier.

This patch provides a new socket option IP_MULTICAST_ALL.

In this case, default behaviour is _unchanged_ from the current
Linux standard. The socket option is set by default to provide
original behaviour. Sockets wishing to receive data only from
multicast groups they join explicitly will need to clear this
socket option.
Signed-off-by: Nivedita Singhvi <niv@us.ibm.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: David Stevens <dlstevens@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f771bef9

net: ipv4/ip_sockglue.c cleanups · 4d52cfbe

Committed by Eric Dumazet on Jun 02, 2009

Pure cleanups
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4d52cfbe

30 May, 2009 1 commit

tcp: fix loop in ofo handling code and reduce its complexity · 2df9001e

Committed by Ilpo Järvinen on May 29, 2009

Somewhat luckily, I was looking into these parts with very fine
comb because I've made somewhat similar changes on the same
area (conflicts that arose weren't that lucky though). The loop
was very much overengineered recently in commit 91521944
(tcp: Use SKB queue and list helpers instead of doing it
by-hand), while it basically just wants to know if there are
skbs after 'skb'.

It also got broken because skb1 = skb->next was improperly
translated into skb1 = skb1->next (though abstracted). Note that
'skb1' points to the sk_buff preceding 'skb', or is NULL if 'skb'
is at the head. Two things went wrong:
- We'll kfree 'skb' on the first iteration instead of the
  skbuff following 'skb' (recovering from that would require SACK
  reneging, I think).
- The list head case where 'skb1' is NULL is checked too early
  and the loop won't execute whereas it previously did.

Conclusion: mostly revert the recent changes, which makes the
cset look very messy, but use the proper accessor in the
previous-like version.

The effective changes against the original can be viewed with:
  git-diff 91521944^ \
		net/ipv4/tcp_input.c | sed -n -e '57,70 p'
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

2df9001e

29 May, 2009 3 commits

net: unset IFF_XMIT_DST_RELEASE in ipgre_tunnel_setup() · 108bfa89

Committed by Eric Dumazet on May 28, 2009

ipgre_tunnel_xmit() might need skb->dst, so tell dev_hard_start_xmit()
to not release it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

108bfa89

net: unset IFF_XMIT_DST_RELEASE in ipip_tunnel_setup() · 28e72216

Committed by Eric Dumazet on May 28, 2009

ipip_tunnel_xmit() might need skb->dst, so tell dev_hard_start_xmit()
to not release it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28e72216

tcp: Use SKB queue and list helpers instead of doing it by-hand. · 91521944

Committed by David S. Miller on May 28, 2009

Signed-off-by: David S. Miller <davem@davemloft.net>

91521944

27 May, 2009 6 commits

tcp: Do not check flush when comparing options for GRO · a2a804cd

Committed by Herbert Xu on May 26, 2009

There is no need to repeatedly check flush when comparing TCP
options for GRO as it will be false 99% of the time where it
matters.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2a804cd

ipv4: Use 32-bit loads for ID and length in GRO · 1075f3f6

Committed by Herbert Xu on May 26, 2009

This patch optimises the IPv4 GRO code by using 32-bit loads
(instead of 16-bit ones) on the ID and length checks in the receive
function.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

1075f3f6

gro: Avoid unnecessary comparison after skb_gro_header · a5b1cf28

Committed by Herbert Xu on May 26, 2009

For the overwhelming majority of cases, skb_gro_header's return
value cannot be NULL.  Yet we must check it because of its current
form.  This patch splits it up into multiple functions in order
to avoid this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5b1cf28

tcp: Optimise len/mss comparison · 30a3ae30

Committed by Herbert Xu on May 26, 2009

Instead of checking len > mss || len == 0, we can accomplish
both by checking (len - 1) >= mss using the unsigned wraparound.
At nearly a million times a second, this might just help.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

30a3ae30
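The wraparound trick in this commit can be checked with a small standalone sketch. The function names are hypothetical. Note that exact equivalence with `len > mss || len == 0` needs a non-strict comparison, `(len - 1) >= mss`: with a strict `>`, the case `len == mss + 1` would not be flagged.

```c
#include <stdbool.h>

/* Original form: two comparisons. */
static bool flush_two_tests(unsigned int len, unsigned int mss)
{
    return len > mss || len == 0;
}

/*
 * Single comparison. For len == 0, len - 1 wraps to UINT_MAX, which is
 * >= any mss; for len >= 1, len - 1 >= mss is the same as len > mss.
 */
static bool flush_one_test(unsigned int len, unsigned int mss)
{
    return (len - 1) >= mss;
}
```

The trick relies on `len` being unsigned; with a signed type, `len - 1` for `len == 0` would be -1 and the comparison would not behave this way.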

tcp: Remove unnecessary window comparisons for GRO · 4a9a2968

Committed by Herbert Xu on May 26, 2009

The window has already been checked as part of the flag word
so there is no need to check it explicitly.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a9a2968

tcp: Optimise GRO port comparisons · 745898ea

Committed by Herbert Xu on May 26, 2009

Instead of doing two 16-bit operations for the source/destination
ports, we can do one 32-bit operation to take care of both.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

745898ea
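The idea of covering both ports with one 32-bit operation can be illustrated in userspace as below. The `struct ports` layout and the function names are assumptions made for the sketch; the kernel code operates on the TCP header directly.

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the first four bytes of a TCP header. */
struct ports {
    uint16_t source;
    uint16_t dest;
};

/* Two 16-bit comparisons. */
static int ports_differ_16(const struct ports *a, const struct ports *b)
{
    return a->source != b->source || a->dest != b->dest;
}

/* One 32-bit comparison covering both ports. */
static int ports_differ_32(const struct ports *a, const struct ports *b)
{
    uint32_t x, y;

    /* memcpy sidesteps aliasing rules; it copies exactly the two ports. */
    memcpy(&x, a, sizeof(x));
    memcpy(&y, b, sizeof(y));
    return (x ^ y) != 0;
}
```

The two ports are adjacent and together occupy four bytes, so a single 32-bit XOR is non-zero exactly when at least one port differs.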

26 May, 2009 1 commit

tcp: tcp_vegas ssthresh bugfix · c80a5cdf

Committed by Doug Leith on May 25, 2009

This patch fixes ssthresh accounting issues in tcp_vegas when cwnd decreases.
Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>

c80a5cdf

22 May, 2009 1 commit

ipv4: Fix oops with FIB_TRIE · 3ed18d76

Committed by Robert Olsson on May 21, 2009

It seems we can fix this by disabling preemption while we re-balance the
trie. This is with the CONFIG_CLASSIC_RCU. It's been stress-tested at high
loads, continuously taking a full BGP table up/down via iproute -batch.

Note: fib_trie is not updated for CONFIG_PREEMPT_RCU.

Reported-by: Andrei Popa
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ed18d76

21 May, 2009 3 commits

net: Remove unused parameter from fill method in fib_rules_ops. · 04af8cf6

Committed by Rami Rosen on May 20, 2009

The netlink message header (struct nlmsghdr) is an unused parameter in
the fill method of the fib_rules_ops struct.  This patch removes this
parameter from this method and fixes the places where this method is
called.

(include/net/fib_rules.h)
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04af8cf6

net: fix rtable leak in net/ipv4/route.c · 1ddbcb00

Committed by Eric Dumazet on May 19, 2009

Alexander V. Lukyanov found a regression in 2.6.29 and posted a complete
analysis at http://bugzilla.kernel.org/show_bug.cgi?id=13339
It is quoted here because it is a perfect one:

begin_of_quotation
 2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
 patch has at least one critical flaw, and another problem.

 rt_intern_hash calculates rthi pointer, which is later used for new entry
 insertion. The same loop calculates cand pointer which is used to clean the
 list. If the pointers are the same, rtable leak occurs, as first the cand is
 removed then the new entry is appended to it.

 This leak leads to unregister_netdevice problem (usage count > 0).

 Another problem of the patch is that it tries to insert the entries in certain
 order, to facilitate counting of entries distinct by all but QoS parameters.
 Unfortunately, referencing an existing rtable entry moves it to list beginning,
 to speed up further lookups, so the carefully built order is destroyed.

 For the first problem the simplest patch is to set rthi=0 when rthi==cand, but
 it will also destroy the ordering.
end_of_quotation

Problematic commit is 1080d709
(net: implement emergency route cache rebulds when gc_elasticity is exceeded)

Trying to keep dst_entries ordered is too complex and breaks the fact that
order should depend on the frequency of use for garbage collection.

A possible fix is to make rt_intern_hash() simpler and only make
rt_check_expire() a little bit smarter, able to cope with an arbitrary
order of entries. The added loop runs on cache-hot data while the CPU
is prefetching the next object, so it should go unnoticed.
Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1ddbcb00

net: fix length computation in rt_check_expire() · cf8da764

Committed by Eric Dumazet on May 19, 2009

rt_check_expire() computes the average and standard deviation of chain
lengths, but does not correctly reset length to 0 at the beginning of each
chain. This probably gives overflows for sum2 (and sum) on loaded machines
instead of meaningful results.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf8da764
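The effect of the missing reset can be reproduced with a simplified model. In this sketch each array element stands for the number of entries in one hash chain; the `scan_chains` function and its `reset_per_chain` switch are hypothetical and exist only to contrast the fixed and buggy behaviour.

```c
#include <stddef.h>

struct chain_stats {
    unsigned long sum;      /* sum of chain lengths */
    unsigned long sum2;     /* sum of squared chain lengths */
};

/*
 * chains[i] holds the number of entries in hash chain i (a stand-in for
 * walking the chain). `length` must restart at 0 for every chain.
 */
static struct chain_stats scan_chains(const unsigned int *chains, size_t n,
                                      int reset_per_chain)
{
    struct chain_stats st = { 0, 0 };
    unsigned long length = 0;

    for (size_t i = 0; i < n; i++) {
        if (reset_per_chain)
            length = 0;                 /* the fix: reset per chain */
        for (unsigned int e = 0; e < chains[i]; e++)
            length++;                   /* one step per chain entry */
        st.sum += length;
        st.sum2 += length * length;
    }
    return st;
}
```

For chains of length 1, 2 and 3, the fixed version accumulates sum = 6 and sum2 = 14, while the version without the reset sees lengths 1, 3 and 6 and accumulates sum = 10 and sum2 = 46. Over many chains the squared term grows quickly, which is consistent with the overflow the commit message mentions.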

20 May, 2009 1 commit

ipv4: teach ipconfig about the MTU option in DHCP · 9643f455

Committed by Chris Friesen on May 19, 2009

The DHCP spec allows the server to specify the MTU. This can be useful
for netbooting with UDP-based NFS-root on a network using jumbo frames.
This patch allows the kernel IP autoconfiguration to handle this option
correctly.

It would be possible to use initramfs and add a script to set the MTU,
but that seems like a complicated solution if no initramfs is otherwise
necessary, and would bloat the kernel image more than this code would.

This patch was originally submitted to LKML in 2003 by Hans-Peter Jansen.
Signed-off-by: Chris Friesen <cfriesen@nortel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9643f455
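For readers unfamiliar with the DHCP options encoding, the sketch below scans an options area for the Interface MTU option (code 26 in RFC 2132, carrying a 16-bit value in network byte order). It is a standalone illustration, not the kernel's ipconfig code; the function name `dhcp_find_mtu` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DHCP_OPT_PAD  0
#define DHCP_OPT_MTU  26    /* Interface MTU option, RFC 2132 */
#define DHCP_OPT_END  255

/*
 * Scan a DHCP options area (after the magic cookie) for the MTU option.
 * Returns the MTU, or 0 if the option is absent or malformed.
 */
static unsigned int dhcp_find_mtu(const uint8_t *opt, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t code = opt[i++];

        if (code == DHCP_OPT_PAD)
            continue;                   /* pad has no length byte */
        if (code == DHCP_OPT_END || i >= len)
            break;

        uint8_t olen = opt[i++];
        if (olen > len - i)
            break;                      /* truncated option */
        if (code == DHCP_OPT_MTU && olen == 2)
            return ((unsigned int)opt[i] << 8) | opt[i + 1];
        i += olen;                      /* skip options we do not handle */
    }
    return 0;
}
```

A server on a jumbo-frame network would typically send the bytes `26, 2, 0x23, 0x28` for an MTU of 9000.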

19 May, 2009 5 commits

net: Fix devinet_sysctl_forward · 9b8adb5e

Committed by Eric W. Biederman on May 13, 2009

sysctls are unregistered with the rtnl_lock held, making
it unsafe to unconditionally grab the rtnl_lock.  Instead
we need to call rtnl_trylock and restart the system call
if we cannot grab it.  Otherwise we could deadlock at unregistration
time.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b8adb5e
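The trylock-and-restart idea can be modelled in userspace as follows. Everything here is a stand-in chosen for the sketch: `big_lock` plays the role of the rtnl lock, `RESTART_CALL` is a hypothetical return code meaning "retry the call", and `update_forwarding` is a hypothetical handler.

```c
#include <stdatomic.h>

#define RESTART_CALL (-1)   /* hypothetical: tells the caller to retry */

static atomic_flag big_lock = ATOMIC_FLAG_INIT;   /* stand-in for the rtnl lock */

/* Returns 1 if we took the lock, 0 if someone else holds it. */
static int big_trylock(void)
{
    return !atomic_flag_test_and_set(&big_lock);
}

static void big_unlock(void)
{
    atomic_flag_clear(&big_lock);
}

/*
 * Handler that may run while the unregistration path holds the lock and
 * waits for handlers to finish. Blocking on the lock here could deadlock,
 * so back off and let the caller retry instead.
 */
static int update_forwarding(int *setting, int value)
{
    if (!big_trylock())
        return RESTART_CALL;

    *setting = value;
    big_unlock();
    return 0;
}
```

The handler never sleeps on the lock, so the path that holds the lock while tearing the handler down can always make progress.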

ipv4: make default for INET_LRO consistent with help text · bc8a5397

Committed by Frans Pop on May 18, 2009

Commit e81963b1 ("ipv4: Make INET_LRO a bool instead of tristate.")
changed this config from tristate to bool.  Add default so that it is
consistent with the help text.
Signed-off-by: Frans Pop <elendil@planet.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc8a5397

ipv4: cleanup: remove unnecessary include. · d23a9b5b

Committed by Rami Rosen on May 18, 2009

There is no need for the net/icmp.h header in net/ipv4/fib_frontend.c.
This patch removes the #include of net/icmp.h from it.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d23a9b5b

ipv4: cleanup - remove two unused parameters from fib_semantic_match(). · e204a345

Committed by Rami Rosen on May 18, 2009

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e204a345

tcp: fix MSG_PEEK race check · 77527313

Committed by Ilpo Järvinen on May 10, 2009

Commit 518a09ef (tcp: Fix recvmsg MSG_PEEK influence of
blocking behavior) lets the loop run longer than the race check
previously expected, so we need to be more careful with this
check and consider the work we have been doing.

I tried my best to deal with urg hole madness too which happens
here:
	if (!sock_flag(sk, SOCK_URGINLINE)) {
		++*seq;
		...
by using additional offset by one but I certainly have very
little interest in testing that part.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Tested-by: Frans Pop <elendil@planet.nl>
Tested-by: Ian Zimmermann <itz@buug.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

77527313

18 May, 2009 2 commits

ipconfig: handle case of delayed DHCP server · 2513dfb8

Committed by Chris Friesen on May 17, 2009

If a DHCP server is delayed, it's possible for the client to receive the 
DHCPOFFER after it has already sent out a new DHCPDISCOVER message from 
a second interface.  The client then sends out a DHCPREQUEST from the 
second interface, but the server doesn't recognize the device and 
rejects the request.

This patch simply tracks the current device being configured and throws 
away the OFFER if it is not intended for the current device.  A more 
sophisticated approach would be to put the OFFER information into the 
struct ic_device rather than storing it globally.
Signed-off-by: Chris Friesen <cfriesen@nortel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2513dfb8

ipv4: remove an unused parameter from configure method of fib_rules_ops. · 8b3521ee

Committed by Rami Rosen on May 11, 2009

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b3521ee

09 May, 2009 1 commit

ipv4: Make INET_LRO a bool instead of tristate. · e81963b1

Committed by David S. Miller on May 08, 2009

This code is used as a library by several device drivers,
which select INET_LRO.

If some are modules and some are statically built into the
kernel, we get build failures if INET_LRO is modular.
Signed-off-by: David S. Miller <davem@davemloft.net>

e81963b1

07 May, 2009 1 commit

net: Make inet_twsk_put similar to sock_put · 4dbc8ef7

Committed by Arnaldo Carvalho de Melo on May 06, 2009

By separating the freeing code from the refcount decrement.
This probably reduces icache pressure when we still have reference
counts to go.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4dbc8ef7
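The split between a small "put" fast path and an out-of-line free routine can be sketched as follows. The `struct obj` type, its `freed_flag` hook, and the function names are hypothetical; they only illustrate the shape of the pattern, not the timewait socket code.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct obj {
    atomic_int refcnt;
    int *freed_flag;    /* hypothetical hook so callers can observe the free */
};

/* Out-of-line slow path: only reached when the last reference goes away. */
static void obj_free(struct obj *o)
{
    if (o->freed_flag)
        *o->freed_flag = 1;
    free(o);
}

/* Small fast path: decrement, and call the slow path only at zero. */
static inline void obj_put(struct obj *o)
{
    if (atomic_fetch_sub(&o->refcnt, 1) == 1)
        obj_free(o);
}

/* Allocate an object holding one reference. */
static struct obj *obj_new(int *freed_flag)
{
    struct obj *o = malloc(sizeof(*o));

    if (!o)
        return NULL;
    atomic_init(&o->refcnt, 1);
    o->freed_flag = freed_flag;
    return o;
}
```

Callers that drop a non-final reference execute only the decrement and the test, so the teardown code does not need to be pulled into the instruction cache on that path.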

06 May, 2009 1 commit

tcp:fix the code indent · ae8d7f88

Committed by Shan Wei on May 05, 2009

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae8d7f88

05 May, 2009 2 commits

tcp: Fix tcp_prequeue() to get correct rto_min value · 0c266898

Committed by Satoru SATOH on May 04, 2009

tcp_prequeue() refers to the constant value (TCP_RTO_MIN) even though
the actual value might have been tuned. The following patches fix this and
make tcp_prequeue() get the actual value returned from tcp_rto_min().
Signed-off-by: Satoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c266898

tcp: extend ECN sysctl to allow server-side only ECN · 255cac91

Committed by Ilpo Järvinen on May 04, 2009

This should be very safe compared with full enabled, so I see
no reason why it shouldn't be done right away. As ECN can only
be negotiated if the SYN sending party is also supporting it,
somebody in the loop probably knows what he/she is doing. If
SYN does not ask for ECN, the server side SYN-ACK is identical
to what it is without ECN. Thus it's quite safe.

The chosen value is safe w.r.t. existing configs which
currently choose to set either 0 or 1 manually, but
silently upgrades those who have not explicitly requested
ECN off.

Whether to just enable both sides comes up from time to time, but
unless that gets done now we can at least make the servers
aware of ECN already. As there are some known problems to occur
if ECN is enabled, it's currently questionable whether there's
any real gain from enabling clients as servers mostly won't
support it anyway (so we'd hit just the negative sides). After
enabling the servers and getting that deployed, the client end
enable really has some potential gain too.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

255cac91

29 Apr, 2009 1 commit

netfilter: revised locking for x_tables · 942e4a2b

Committed by Stephen Hemminger on Apr 28, 2009

The x_tables are organized with a table structure and per-cpu copies
of the counters and rules. On older kernels there was a reader/writer
lock per table, which was a performance bottleneck. In 2.6.30-rc, this
was converted to use RCU and the counters/rules, which solved the performance
problems for do_table but made replacing rules much slower because of
the necessary RCU grace period.

This version uses a per-cpu set of spinlocks and counters to allow
table processing to proceed without the cache thrashing of a global
reader lock, and keeps the same performance for table updates.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

942e4a2b

28 Apr, 2009 1 commit

inet_diag: Remove dup assignments · ac5978e7

Committed by Arnaldo Carvalho de Melo on Apr 28, 2009

These are later assigned to other values without being used meanwhile.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac5978e7

27 Apr, 2009 3 commits

gro: Fix COMPLETE checksum handling · 36e7b1b8

Committed by Herbert Xu on Apr 27, 2009

On a brand new GRO skb, we cannot call ip_hdr since the header
may lie in the non-linear area.  This patch adds the helper
skb_gro_network_header to handle this.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

36e7b1b8

ipv4: Limit size of route cache hash table · c9503e0f

Committed by Anton Blanchard on Apr 27, 2009

Right now we have no upper limit on the size of the route cache hash table.
On a 128GB POWER6 box it ends up as 32MB:

IP route cache hash table entries: 4194304 (order: 9, 33554432 bytes)

It would be nice to cap this for memory consumption reasons, but a massive
hashtable also causes a significant spike when measuring OS jitter.

With a 32MB hashtable and 4 million entries, rt_worker_func is taking
5 ms to complete. On another system with more memory it's taking 14 ms.
Even though rt_worker_func does call cond_resched() to limit its impact,
in an HPC environment we want to keep all sources of OS jitter to a minimum.

With the patch applied we limit the number of entries to 512k, which
can still be overridden by using the rhash_entries boot option:

IP route cache hash table entries: 524288 (order: 6, 4194304 bytes)

With this patch rt_worker_func now takes 0.460 ms on the same system.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c9503e0f
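The numbers in the two boot log lines of this commit can be cross-checked with a little arithmetic. The helper names below are hypothetical; the sketch only derives what the quoted figures imply and makes no claim about how the kernel sizes the table.

```c
/* Bytes per hash bucket implied by a log line: bytes / entries. */
static unsigned long bucket_size(unsigned long bytes, unsigned long entries)
{
    return bytes / entries;
}

/* Page size implied by an allocation of 2^order pages totalling `bytes`. */
static unsigned long page_size(unsigned long bytes, unsigned int order)
{
    return bytes >> order;
}
```

Both lines work out to 8 bytes per bucket (33554432 / 4194304 and 4194304 / 524288) and to 64 KB pages (33554432 >> 9 and 4194304 >> 6), so the capped table is one eighth of the original size.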

snmp: add missing counters for RFC 4293 · edf391ff

Committed by Neil Horman on Apr 27, 2009

The IP MIB (RFC 4293) defines stats for InOctets, OutOctets, InMcastOctets and
OutMcastOctets:
http://tools.ietf.org/html/rfc4293
But it seems we don't track those in any way that is easy to separate from
other protocols.  This patch adds those missing counters to the stats file.
Tested successfully by me.

With help from Eric Dumazet.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

edf391ff

20 Apr, 2009 2 commits

syncookies: remove last_synq_overflow from struct tcp_sock · a0f82f64

Committed by Florian Westphal on Apr 19, 2009

last_synq_overflow eats 4 or 8 bytes in struct tcp_sock, even
though it is only used when a listening socket's syn queue
is full.

We can (ab)use rx_opt.ts_recent_stamp to store the same information;
it is not used otherwise as long as a socket is in listen state.

Move linger2 around to avoid splitting struct mtu_probe
across a cacheline boundary on 32 bit arches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0f82f64

tcp: fix mid-wq adjustment helper · 52cf3cc8

Committed by Ilpo Järvinen on Apr 18, 2009

Just noticed while doing some new work that the recent
mid-wq adjustment logic will misbehave when FACK is not
in use (happens either due to being sysctl'ed off or auto-detected
reordering) because I forgot the relevant TCPCB tagbit.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

52cf3cc8

17 Apr, 2009 1 commit

[PATCH] net: remove superfluous call to synchronize_net() · 573636cb

Committed by Eric Dumazet on Apr 17, 2009

inet_register_protosw() function is responsible for adding a new
inet protocol into a global table (inetsw[]) that is used with RCU rules.

As soon as the store of the pointer is done, other cpus might see
this new protocol in inetsw[], so we have to make sure new protocol
is ready for use. All pending memory updates should thus be committed
to memory before setting the pointer.
This is correctly done using rcu_assign_pointer().

synchronize_net() is typically used at unregister time, after
unsetting the pointer, to make sure no other cpu is still using
the object we want to dismantle. Using it at register time
only adds an artificial delay that could hide a real bug,
and this bug could pop up if/when synchronize_rcu() can proceed
faster than now.

This saves about 13 ms on boot time on a HZ=1000 8 cpus machine ;)
(4 calls to inet_register_protosw(), and about 3200 us per call)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

573636cb
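The publish step described in this commit, initialise first and then store the pointer with the right ordering, can be modelled with C11 atomics. This is a userspace analogy under stated assumptions: a release store stands in for rcu_assign_pointer(), an acquire load stands in for the reader side, and the `proto_entry` type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct proto_entry {
    int type;
    int protocol;
};

/* Stand-in for one slot of a table that readers scan without locks. */
static _Atomic(struct proto_entry *) slot = NULL;

/*
 * Publish: initialise the entry fully, then store the pointer with
 * release semantics so the initialisation is visible before the pointer.
 * No waiting is needed at registration time.
 */
static void register_entry(struct proto_entry *e, int type, int protocol)
{
    e->type = type;
    e->protocol = protocol;
    atomic_store_explicit(&slot, e, memory_order_release);
}

/* Reader side: an acquire load pairs with the release store above. */
static const struct proto_entry *lookup_entry(void)
{
    return atomic_load_explicit(&slot, memory_order_acquire);
}
```

Waiting for readers only matters in the opposite direction, when an entry is removed and its memory is about to be reused, which is why the grace period belongs at unregister time.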

openanolis / cloud-kernel synced successfully more than 1 year ago