提交 · 9c55e01c0cc835818475a6ce8c4d684df9949ac8 · openeuler / Kernel

29 1月, 2008 1 次提交

[TCP]: Splice receive support. · 9c55e01c

由 Jens Axboe 提交于 11月 06, 2007

Support for network splice receive.
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9c55e01c

26 1月, 2008 1 次提交

IPoIB: improve IPv4/IPv6 to IB mcast mapping functions · a9e527e3

由 Rolf Manderscheid 提交于 12月 10, 2007

An IPoIB subnet on an IB fabric that spans multiple IB subnets can't
use link-local scope in multicast GIDs.  The existing routines that
map IP/IPv6 multicast addresses into IB link-level addresses hard-code
the scope to link-local, and they also leave the partition key field
uninitialised.  This patch adds a parameter (the link-level broadcast
address) to the mapping routines, allowing them to initialise both the
scope and the P_Key appropriately, and fixes up the call sites.

The next step will be to add a way to configure the scope for an IPoIB
interface.
Signed-off-by: NRolf Manderscheid <rvm@obsidianresearch.com>
Signed-off-by: NRoland Dreier <rolandd@cisco.com>

a9e527e3

23 1月, 2008 2 次提交

[INET]: Fix truesize setting in ip_append_data · f945fa7a

由 Herbert Xu 提交于 1月 22, 2008

As it is ip_append_data only counts page fragments to the skb that
allocated it.  As such it means that the first skb gets hit with a
4K charge even though it might have only used a fraction of it while
all subsequent skb's that use the same page gets away with no charge
at all.

This bug was exposed by the UDP accounting patch.

[ The wmem_alloc bumping needs to be moved with the truesize,
  noticed by Takahiro Yasui.  -DaveM ]
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f945fa7a

[IPV4]: Add missing skb->truesize increment in ip_append_page(). · 1e34a11d

由 David S. Miller 提交于 1月 22, 2008

And as noted by Takahiro Yasui, we thus need to bump the
sk->sk_wmem_alloc at this spot as well.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e34a11d

21 1月, 2008 4 次提交

[ICMP]: ICMP_MIB_OUTMSGS increment duplicated · 5b4d383a

由 Wang Chen 提交于 1月 21, 2008

Commit "96793b48" (Add ICMPMsgStats
MIB (RFC 4293)) made a mistake.

In that patch, David L added a icmp_out_count() in
ip_push_pending_frames(), remove icmp_out_count() from
icmp_reply(). But he forgot to remove icmp_out_count() from
icmp_send() too.  Since icmp_send and icmp_reply will call
icmp_push_reply, which will call ip_push_pending_frames, a duplicated
increment happened in icmp_send.

This patch remove the icmp_out_count from icmp_send too.
Signed-off-by: NWang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5b4d383a

[IPV4] FIB_HASH : Avoid unecessary loop in fn_hash_dump_zone() · 8d3f099a

由 Eric Dumazet 提交于 1月 18, 2008

I noticed "ip route list" was slower than "cat /proc/net/route" on a
machine with a full Internet routing table (214392 entries : Special
thanks to Robert ;) )

This is similar to problem reported in commit
d8c92830 ("[IPV4] ROUTE: ip_rt_dump()
is unecessary slow")

Fix is to avoid scanning the begining of fz_hash table, but directly
seek to the right offset.

Before patch :

time ip route >/tmp/ROUTE

real    0m1.285s
user    0m0.712s
sys     0m0.436s

After patch

# time ip route >/tmp/ROUTE

real    0m0.835s
user    0m0.692s
sys     0m0.124s
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d3f099a

[IPV4] fib_trie: fix duplicated route issue · 6725033f

由 Joonwoo Park 提交于 1月 18, 2008

http://bugzilla.kernel.org/show_bug.cgi?id=9493

The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.
Signed-off-by: NJoonwoo Park <joonwpark81@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6725033f

[IPV4] fib_hash: fix duplicated route issue · bd566e75

由 Joonwoo Park 提交于 1月 18, 2008

http://bugzilla.kernel.org/show_bug.cgi?id=9493

The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.
Signed-off-by: NJoonwoo Park <joonwpark81@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bd566e75

10 1月, 2008 1 次提交

[IPV4] ROUTE: fix rcu_dereference() uses in /proc/net/rt_cache · 0bcceadc

由 Eric Dumazet 提交于 1月 10, 2008

In rt_cache_get_next(), no need to guard seq->private by a
rcu_dereference() since seq is private to the thread running this
function. Reading seq.private once (as guaranted bu rcu_dereference())
or several time if compiler really is dumb enough wont change the
result.

But we miss real spots where rcu_dereference() are needed, both in
rt_cache_get_first() and rt_cache_get_next()
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0bcceadc

09 1月, 2008 4 次提交

[LRO] Fix lro_mgr->features checks · 877364e6

由 Brice Goglin 提交于 1月 07, 2008

lro_mgr->features contains a bitmask of LRO_F_* values which are
defined as power of two, not as bit indexes.
They must be checked with x&LRO_F_FOO, not with test_bit(LRO_F_FOO,&x).
Signed-off-by: NBrice Goglin <Brice.Goglin@inria.fr>
Acked-by: NAndrew Gallatin <gallatin@myri.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

877364e6

[IPV4] ROUTE: ip_rt_dump() is unecessary slow · d8c92830

由 Eric Dumazet 提交于 1月 07, 2008

I noticed "ip route list cache x.y.z.t" can be *very* slow.

While strace-ing -T it I also noticed that first part of route cache
is fetched quite fast :

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.000047>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.000042>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3740 <0.000055>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.000043>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\
202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3732 <0.000053>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3708 <0.000052>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202
GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3680 <0.000041>

while the part at the end of the table is more expensive:

recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3656 <0.003857>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\204\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3772 <0.003891>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3712 <0.003765>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3700 <0.003879>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3676 <0.003797>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"p\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\2\0\2\0"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3724 <0.003856>
recvmsg(3, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\234\0\0\0\30\0\2\0\254i\202GXm\0\0\2 \0\376\0\0\1\0\2"..., 16384}], msg_controllen=0, msg_flags=0}, 0) = 3736 <0.003848>

The following patch corrects this performance/latency problem,
removing quadratic behavior.
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d8c92830

[IPV4] ipconfig: Fix regression in ip command line processing · 92ffb85d

由 Amos Waterland 提交于 1月 05, 2008

The recent changes for ip command line processing fixed some problems
but unfortunately broke some common usage scenarios.  In current
2.6.24-rc6 the following command line results in no IP address
assignment, which is surely a regression:

 ip=10.0.2.15::10.0.2.2:255.255.255.0::eth0:off

Please find below a patch that works for all cases I can find.
Signed-off-by: NAmos Waterland <apw@us.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

92ffb85d

[IPV4] raw: Strengthen check on validity of iph->ihl · f844c74f

由 Herbert Xu 提交于 1月 05, 2008

We currently check that iph->ihl is bounded by the real length and that
the real length is greater than the minimum IP header length.  However,
we did not check the caes where iph->ihl is less than the minimum IP
header length.

This breaks because some ip_fast_csum implementations assume that which
is quite reasonable.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f844c74f

04 1月, 2008 1 次提交

[INET]: Fix netdev renaming and inet address labels · 44344b2a

由 Mark McLoughlin 提交于 1月 04, 2008

When re-naming an interface, the previous secondary address
labels get lost e.g.

  $> brctl addbr foo
  $> ip addr add 192.168.0.1 dev foo
  $> ip addr add 192.168.0.2 dev foo label foo:00
  $> ip addr show dev foo | grep inet
    inet 192.168.0.1/32 scope global foo
    inet 192.168.0.2/32 scope global foo:00
  $> ip link set foo name bar
  $> ip addr show dev bar | grep inet
    inet 192.168.0.1/32 scope global bar
    inet 192.168.0.2/32 scope global bar:2

Turns out to be a simple thinko in inetdev_changename() - clearly we
want to look at the address label, rather than the device name, for
a suffix to retain.
Signed-off-by: NMark McLoughlin <markmc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

44344b2a

30 12月, 2007 1 次提交

[TCP]: use non-delayed ACK for congestion control RTT · 2072c228

由 Gavin McCullagh 提交于 12月 29, 2007

When a delayed ACK representing two packets arrives, there are two RTT
samples available, one for each packet.  The first (in order of seq
number) will be artificially long due to the delay waiting for the
second packet, the second will trigger the ACK and so will not itself
be delayed.

According to rfc1323, the SRTT used for RTO calculation should use the
first rtt, so receivers echo the timestamp from the first packet in
the delayed ack.  For congestion control however, it seems measuring
delayed ack delay is not desirable as it varies independently of
congestion.

The patch below causes seq_rtt and last_ackt to be updated with any
available later packet rtts which should have less (and hopefully
zero) delack delay.  The rtt value then gets passed to
ca_ops->pkts_acked().

Where TCP_CONG_RTT_STAMP was set, effort was made to supress RTTs from
within a TSO chunk (!fully_acked), using only the final ACK (which
includes any TSO delay) to generate RTTs.  This patch removes these
checks so RTTs are passed for each ACK to ca_ops->pkts_acked().

For non-delay based congestion control (cubic, h-tcp), rtt is
sometimes used for rtt-scaling.  In shortening the RTT, this may make
them a little less aggressive.  Delay-based schemes (eg vegas, veno,
illinois) should get a cleaner, more accurate congestion signal,
particularly for small cwnds.  The congestion control module can
potentially also filter out bad RTTs due to the delayed ack alarm by
looking at the associated cnt which (where delayed acking is in use)
should probably be 1 if the alarm went off or greater if the ACK was
triggered by a packet.
Signed-off-by: NGavin McCullagh <gavin.mccullagh@nuim.ie>
Acked-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2072c228

29 12月, 2007 1 次提交

[IPV4] Fix ip=dhcp regression · 9cecd07c

由 Simon Horman 提交于 12月 27, 2007

David Brownell pointed out a regression in my recent "Fix ip command
line processing" patch. It turns out to be a fairly blatant oversight on
my part whereby ic_enable is never set, and thus autoconfiguration is
never enabled. Clearly my testing was broken :-(

The solution that I have is to set ic_enable to 1 if we hit
ip_auto_config_setup(), which basically means that autoconfiguration is
activated unless told otherwise. I then flip ic_enable to 0 if ip=off,
ip=none, ip=::::::off or ip=::::::none using ic_proto_name();

The incremental patch is below, let me know if a non-incremental version
is prepared, as I did as for the original patch to be reverted pending a
fix.
Signed-off-by: NSimon Horman <horms@verge.net.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9cecd07c

27 12月, 2007 2 次提交

[IPV4]: Fix ip command line processing. · a6c05c3d

由 Simon Horman 提交于 12月 25, 2007

Recently the documentation in Documentation/nfsroot.txt was
update to note that in fact ip=off and ip=::::::off as the
latter is ignored and the default (on) is used.

This was certainly a step in the direction of reducing confusion.
But it seems to me that the code ought to be fixed up so that
ip=::::::off actually turns off ip autoconfiguration.

This patch also notes more specifically that ip=on (aka ip=::::::on)
is the default.
Signed-off-by: NSimon Horman <horms@verge.net.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a6c05c3d

[NETFILTER]: nf_conntrack_ipv4: fix module parameter compatibility · fae718dd

由 Patrick McHardy 提交于 12月 24, 2007

Some users do "modprobe ip_conntrack hashsize=...". Since we have the
module aliases this loads nf_conntrack_ipv4 and nf_conntrack, the
hashsize parameter is unknown for nf_conntrack_ipv4 however and makes
it fail.

Allow to specify hashsize= for both nf_conntrack and nf_conntrack_ipv4.

Note: the nf_conntrack message in the ringbuffer will display an
incorrect hashsize since nf_conntrack is first pulled in as a
dependency and calculates the size itself, then it gets changed
through a call to nf_conntrack_set_hashsize().
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fae718dd

21 12月, 2007 2 次提交

[IPV4]: OOPS with NETLINK_FIB_LOOKUP netlink socket · d883a036

由 Denis V. Lunev 提交于 12月 21, 2007

[ Regression added by changeset:
	cd40b7d3
	[NET]: make netlink user -> kernel interface synchronious
  -DaveM ]

nl_fib_input re-reuses incoming skb to send the reply. This means that this
packet will be freed twice, namely in:
- netlink_unicast_kernel
- on receive path
Use clone to send as a cure, the caller is responsible for kfree_skb on error.

Thanks to Alexey Dobryan, who originally found the problem.
Signed-off-by: NDenis V. Lunev <den@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d883a036

[NETFILTER] ipv4: Spelling fixes · e00ccd4a

由 Joe Perches 提交于 12月 20, 2007

Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e00ccd4a

20 12月, 2007 2 次提交

[IPV4] ip_gre: set mac_header correctly in receive path · 1d069167

由 Timo Teras 提交于 12月 20, 2007

mac_header update in ipgre_recv() was incorrectly changed to
skb_reset_mac_header() when it was introduced.
Signed-off-by: NTimo Teras <timo.teras@iki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1d069167

[IPV4] ARP: Remove not used code · e0260fed

由 Mark Ryden 提交于 12月 19, 2007

In arp_process() (net/ipv4/arp.c), there is unused code: definition
and assignment of tha (target hw address ).
Signed-off-by: NMark Ryden <markryde@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e0260fed

17 12月, 2007 1 次提交

[IPV4]: Make tcp_input_metrics() get minimum RTO via tcp_rto_min() · 488faa2a

由 Satoru SATOH 提交于 12月 16, 2007

tcp_input_metrics() refers to the built-time constant TCP_RTO_MIN
regardless of configured minimum RTO with iproute2.
Signed-off-by: NSatoru SATOH <satoru.satoh@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

488faa2a

15 12月, 2007 2 次提交

[IPV4]: Updates to nfsroot documentation · f33e1d9f

由 Amos Waterland 提交于 12月 14, 2007

The difference between ip=off and ip=::::::off has been a cause of much
confusion.  Document how each behaves, and do not contradict ourselves by
saying that "off" is the default when in fact "any" is the default and is
descibed as being so lower in the file.
Signed-off-by: NAmos Waterland <apw@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f33e1d9f

[NETFILTER]: ip_tables: fix compat copy race · a18aa31b

由 Patrick McHardy 提交于 12月 12, 2007

When copying entries to user, the kernel makes two passes through the
data, first copying all the entries, then fixing up names and counters.
On the second pass it copies the kernel and match data from userspace
to the kernel again to find the corresponding structures, expecting
that kernel pointers contained in the data are still valid.

This is obviously broken, fix by avoiding the second pass completely
and fixing names and counters while dumping the ruleset, using the
kernel-internal data structures.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a18aa31b

11 12月, 2007 2 次提交

[IPv4] ESP: Discard dummy packets introduced in rfc4303 · 2017a72c

由 Thomas Graf 提交于 12月 10, 2007

RFC4303 introduces dummy packets with a nexthdr value of 59
to implement traffic confidentiality. Such packets need to
be dropped silently and the payload may not be attempted to
be parsed as it consists of random chunk.
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2017a72c

[IPV4]: Swap the ifa allocation with the"ipv4_devconf_setall" call · a4e65d36

由 Pavel Emelyanov 提交于 12月 07, 2007

According to Herbert, the ipv4_devconf_setall should be called
only when the ifa is added to the device. However, failed
ifa allocation may bring things into inconsistent state.

Move the call to ipv4_devconf_setall after the ifa allocation.

Fits both net-2.6 (with offsets) and net-2.6.25 (cleanly).
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4e65d36

07 12月, 2007 2 次提交

[IPV4]: Remove prototype of ip_rt_advice · 56c99d04

由 Denis V. Lunev 提交于 12月 06, 2007

ip_rt_advice has been gone, so no need to keep prototype and debug message.
Signed-off-by: NDenis V. Lunev <den@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

56c99d04

[IPv4]: Reply net unreachable ICMP message · 7f53878d

由 Mitsuru Chinen 提交于 12月 07, 2007

IPv4 stack doesn't reply any ICMP destination unreachable message
with net unreachable code when IP detagrams are being discarded
because of no route could be found in the forwarding path.
Incidentally, IPv6 stack replies such ICMPv6 message in the similar
situation.
Signed-off-by: NMitsuru Chinen <mitch@linux.vnet.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7f53878d

05 12月, 2007 6 次提交

[LRO]: fix lro_gen_skb() alignment · 621544eb

由 Andrew Gallatin 提交于 12月 05, 2007

Add a field to the lro_mgr struct so that drivers can specify how much
padding is required to align layer 3 headers when a packet is copied
into a freshly allocated skb by inet_lro.c:lro_gen_skb().  Without
padding, skbs generated by LRO will cause alignment warnings on
architectures which require strict alignment (seen on sparc64).

Myri10GE is updated to use this field.
Signed-off-by: NAndrew Gallatin <gallatin@myri.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

621544eb

[TCP]: NAGLE_PUSH seems to be a wrong way around · 4e67d876

由 Ilpo Jrvinen 提交于 12月 05, 2007

The comment in tcp_nagle_test suggests that. This bug is very
very old, even 2.4.0 seems to have it.
Signed-off-by: NIlpo Jrvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e67d876

[TCP]: Move prior_in_flight collect to more robust place · 52d34081

由 Ilpo Jrvinen 提交于 12月 05, 2007

The previous location is after sacktag processing, which affects
counters tcp_packets_in_flight depends on. This may manifest as
wrong behavior if new SACK blocks are present and all is clear
for call to tcp_cong_avoid, which in the case of
tcp_reno_cong_avoid bails out early because it thinks that
TCP is not limited by cwnd.
Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52d34081

[TCP] FRTO: Use of existing funcs make code more obvious & robust · 3e6f049e

由 Ilpo Jrvinen 提交于 12月 05, 2007

Though there's little need for everything that tcp_may_send_now
does (actually, even the state had to be adjusted to pass some
checks FRTO does not want to occur), it's more robust to let it
make the decision if sending is allowed. State adjustments
needed:
- Make sure snd_cwnd limit is not hit in there
- Disable nagle (if necessary) through the frto_counter == 2

The result of check for frto_counter in argument to call for
tcp_enter_frto_loss can just be open coded, therefore there
isn't need to store the previous frto_counter past
tcp_may_send_now.

In addition, returns can then be combined.
Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3e6f049e

[IPVS]: Fix sched registration race when checking for name collision. · 4ac63ad6

由 Pavel Emelyanov 提交于 12月 04, 2007

The register_ip_vs_scheduler() checks for the scheduler with the
same name under the read-locked __ip_vs_sched_lock, then drops,
takes it for writing and puts the scheduler in list.

This is racy, since we can have a race window between the lock
being re-locked for writing.

The fix is to search the scheduler with the given name right under
the write-locked __ip_vs_sched_lock.
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Acked-by: NSimon Horman <horms@verge.net.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4ac63ad6

[IPVS]: Don't leak sysctl tables if the scheduler registration fails. · a014bc8f

由 Pavel Emelyanov 提交于 12月 04, 2007

In case we load lblc or lblcr module we can leak some sysctl
tables if the call to register_ip_vs_scheduler() fails.

I've looked at the register_ip_vs_scheduler() code and saw, that
the only reason to fail is the name collision, so I think that
with some 3rd party schedulers this becomes a relevant issue. No?
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Acked-by: NSimon Horman <horms@verge.net.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a014bc8f

03 12月, 2007 1 次提交

[INET]: Fix inet_diag dead-lock regression · d523a328

由 Herbert Xu 提交于 12月 03, 2007

The inet_diag register fix broke inet_diag module loading because the
loaded module had to take the same mutex that's already held by the
loader in order to register the new handler.

This patch fixes it by introducing a separate mutex to protect the
handling of handlers.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

d523a328

29 11月, 2007 2 次提交

[TCP] illinois: Incorrect beta usage · a357dde9

由 Stephen Hemminger 提交于 11月 30, 2007

Lachlan Andrew observed that my TCP-Illinois implementation uses the
beta value incorrectly:
  The parameter  beta  in the paper specifies the amount to decrease
  *by*:  that is, on loss,
     W <-  W -  beta*W
  but in   tcp_illinois_ssthresh() uses  beta  as the amount
  to decrease  *to*: W <- beta*W

This bug makes the Linux TCP-Illinois get less-aggressive on uncongested network,
hurting performance. Note: since the base beta value is .5, it has no
impact on a congested network.
Signed-off-by: NStephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

a357dde9

[INET]: Fix inet_diag register vs rcv race · 07693198

由 Pavel Emelyanov 提交于 11月 30, 2007

The following race is possible when one cpu unregisters the handler
while other one is trying to receive a message and call this one:

CPU1:                                                 CPU2:
inet_diag_rcv()                                       inet_diag_unregister()
  mutex_lock(&inet_diag_mutex);
  netlink_rcv_skb(skb, &inet_diag_rcv_msg);
    if (inet_diag_table[nlh->nlmsg_type] == 
                               NULL) /* false handler is still registered */
    ...
    netlink_dump_start(idiagnl, skb, nlh,
                           inet_diag_dump, NULL);
           cb = kzalloc(sizeof(*cb), GFP_KERNEL);
                   /* sleep here freeing memory 
                    * or preempt
                    * or sleep later on nlk->cb_mutex
                    */
                                                         spin_lock(&inet_diag_register_lock);
                                                         inet_diag_table[type] = NULL;
    ...                                                  spin_unlock(&inet_diag_register_lock);
                                                         synchronize_rcu();
                                                         /* CPU1 is sleeping - RCU quiescent
                                                          * state is passed
                                                          */
                                                         return;
    /* inet_diag_dump is finally called: */
    inet_diag_dump()
      handler = inet_diag_table[cb->nlh->nlmsg_type];
      BUG_ON(handler == NULL); 
      /* OOPS! While we slept the unregister has set
       * handler to NULL :(
       */

Grep showed, that the register/unregister functions are called
from init/fini module callbacks for tcp_/dccp_diag, so it's OK
to use the inet_diag_mutex to synchronize manipulations with the
inet_diag_table and the access to it.

Besides, as Herbert pointed out, asynchronous dumps should hold 
this mutex as well, and thus, we provide the mutex as cb_mutex one.
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

07693198

26 11月, 2007 1 次提交

[IPV4]: Remove bogus ifdef mess in arp_process · 3660019e

由 Adrian Bunk 提交于 11月 26, 2007

The #ifdef's in arp_process() were not only a mess, they were also wrong 
in the CONFIG_NET_ETHERNET=n and (CONFIG_NETDEV_1000=y or 
CONFIG_NETDEV_10000=y) cases.

Since they are not required this patch removes them.

Also removed are some #ifdef's around #include's that caused compile 
errors after this change.
Signed-off-by: NAdrian Bunk <bunk@kernel.org>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

3660019e

23 11月, 2007 1 次提交

[TCP] MTUprobe: Cleanup send queue check (no need to loop) · 7f9c33e5

由 Ilpo Järvinen 提交于 11月 23, 2007

The original code has striking complexity to perform a query
which can be reduced to a very simple compare.

FIN seqno may be included to write_seq but it should not make
any significant difference here compared to skb->len which was
used previously. One won't end up there with SYN still queued.

Use of write_seq check guarantees that there's a valid skb in
send_head so I removed the extra check.
Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-by: NJohn Heffner <jheffner@psc.edu>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

7f9c33e5

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功