提交 · b8dad61cc74b9ec71052e2a0e1c5119c65d166da · openeuler / raspberrypi-kernel

29 1月, 2011 3 次提交

ipv4: If fib metrics are default, no need to grab ref to FIB info. · b8dad61c

由 David S. Miller 提交于 1月 28, 2011

The fib metric memory in this case is static in the kernel image,
so we don't need to reference count it since it's never going
to go away on us.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8dad61c

ipv4: Attach FIB info to dst_default_metrics when possible · 725d1e1b

由 David S. Miller 提交于 1月 28, 2011

If there are no explicit metrics attached to a route, hook
fi->fib_info up to dst_default_metrics.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

725d1e1b

ipv4: Allocate fib metrics dynamically. · 9c150e82

由 David S. Miller 提交于 1月 28, 2011

This is the initial gateway towards super-sharing metrics
if they are all set to zero for a route.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9c150e82

28 1月, 2011 3 次提交

net: Pre-COW metrics for TCP. · a4daad6b

由 David S. Miller 提交于 1月 27, 2011

TCP is going to record metrics for the connection,
so pre-COW the route metrics at route cache entry
creation time.

This avoids several atomic operations that have to
occur if we COW the metrics after the entry reaches
global visibility.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4daad6b

net: Store ipv4/ipv6 COW'd metrics in inetpeer cache. · 06582540

由 David S. Miller 提交于 1月 27, 2011

Please note that the IPSEC dst entry metrics keep using
the generic metrics COW'ing mechanism using kmalloc/kfree.

This gives the IPSEC routes an opportunity to use metrics
which are unique to their encapsulated paths.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06582540

inetpeer: Mark metrics as "new" in fresh inetpeer entries. · 144001bd

由 David S. Miller 提交于 1月 27, 2011

Set the RTAX_LOCKED metric to INETPEER_METRICS_NEW (basically,
all ones) on fresh inetpeer entries.

This way code can determine if default metrics have been loaded
in from a routing table entry already.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

144001bd

27 1月, 2011 1 次提交

net: Implement read-only protection and COW'ing of metrics. · 62fa8a84

由 David S. Miller 提交于 1月 26, 2011

Routing metrics are now copy-on-write.

Initially a route entry points it's metrics at a read-only location.
If a routing table entry exists, it will point there.  Else it will
point at the all zero metric place-holder called 'dst_default_metrics'.

The writeability state of the metrics is stored in the low bits of the
metrics pointer, we have two bits left to spare if we want to store
more states.

For the initial implementation, COW is implemented simply via kmalloc.
However future enhancements will change this to place the writable
metrics somewhere else, in order to increase sharing.  Very likely
this "somewhere else" will be the inetpeer cache.

Note also that this means that metrics updates may transiently fail
if we cannot COW the metrics successfully.

But even by itself, this patch should decrease memory usage and
increase cache locality especially for routing workloads.  In those
cases the read-only metric copies stay in place and never get written
to.

TCP workloads where metrics get updated, and those rare cases where
PMTU triggers occur, will take a very slight performance hit.  But
that hit will be alleviated when the long-term writable metrics
move to a more sharable location.

Since the metrics storage went from a u32 array of RTAX_MAX entries to
what is essentially a pointer, some retooling of the dst_entry layout
was necessary.

Most importantly, we need to preserve the alignment of the reference
count so that it doesn't share cache lines with the read-mostly state,
as per Eric Dumazet's alignment assertion checks.

The only non-trivial bit here is the move of the 'flags' member into
the writeable cacheline.  This is OK since we are always accessing the
flags around the same moment when we made a modification to the
reference count.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

62fa8a84

26 1月, 2011 1 次提交

TCP: fix a bug that triggers large number of TCP RST by mistake · 44f5324b

由 Jerry Chu 提交于 1月 25, 2011

This patch fixes a bug that causes TCP RST packets to be generated
on otherwise correctly behaved applications, e.g., no unread data
on close,..., etc. To trigger the bug, at least two conditions must
be met:

1. The FIN flag is set on the last data packet, i.e., it's not on a
separate, FIN only packet.
2. The size of the last data chunk on the receive side matches
exactly with the size of buffer posted by the receiver, and the
receiver closes the socket without any further read attempt.

This bug was first noticed on our netperf based testbed for our IW10
proposal to IETF where a large number of RST packets were observed.
netperf's read side code meets the condition 2 above 100%.

Before the fix, tcp_data_queue() will queue the last skb that meets
condition 1 to sk_receive_queue even though it has fully copied out
(skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
tcp_recvmsg() often returns all the copied out data successfully
without actually consuming the skb, due to a check
"if ((chunk = len - tp->ucopy.len) != 0) {"
and
"len -= chunk;"
after tcp_prequeue_process() that causes "len" to become 0 and an
early exit from the big while loop.

I don't see any reason not to free the skb whose data have been fully
consumed in tcp_data_queue(), regardless of the FIN flag.  We won't
get there if MSG_PEEK is on. Am I missing some arcane cases related
to urgent data?
Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

44f5324b

25 1月, 2011 4 次提交

net: change netdev->features to u32 · 04ed3e74

由 Michał Mirosław 提交于 1月 24, 2011

Quoting Ben Hutchings: we presumably won't be defining features that
can only be enabled on 64-bit architectures.

Occurences found by `grep -r` on net/, drivers/net, include/

[ Move features and vlan_features next to each other in
  struct netdev, as per Eric Dumazet's suggestion -DaveM ]
Signed-off-by: NMichał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

04ed3e74

tcp: fix bug in listening_get_next() · fd0273c5

由 Eric Dumazet 提交于 1月 24, 2011

commit a8b690f9 (tcp: Fix slowness in read /proc/net/tcp)
introduced a bug in handling of SYN_RECV sockets.

st->offset represents number of sockets found since beginning of
listening_hash[st->bucket].

We should not reset st->offset when iterating through
syn_table[st->sbucket], or else if more than ~25 sockets (if
PAGE_SIZE=4096) are in SYN_RECV state, we exit from listening_get_next()
with a too small st->offset

Next time we enter tcp_seek_last_pos(), we are not able to seek past
already found sockets.
Reported-by: NPK <runningdoglackey@yahoo.com>
CC: Tom Herbert <therbert@google.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd0273c5

inetpeer: Use correct AVL tree base pointer in inet_getpeer(). · 3408404a

由 David S. Miller 提交于 1月 24, 2011

Family was hard-coded to AF_INET but should be daddr->family.

This fixes crashes when unlinking ipv6 peer entries, since the
unlink code was looking up the base pointer properly.
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3408404a

net: arp_ioctl() must hold RTNL · c506653d

由 Eric Dumazet 提交于 1月 24, 2011

Commit 941666c2 "net: RCU conversion of dev_getbyhwaddr() and
arp_ioctl()" introduced a regression, reported by Jamie Heilman.
"arp -Ds 192.168.2.41 eth0 pub" triggered the ASSERT_RTNL() assert
in pneigh_lookup()

Removing RTNL requirement from arp_ioctl() was a mistake, just revert
that part.
Reported-by: NJamie Heilman <jamie@audible.transient.net>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c506653d

20 1月, 2011 3 次提交

netfilter: nf_nat: place conntrack in source hash after SNAT is done · 41a7cab6

由 Changli Gao 提交于 1月 20, 2011

If SNAT isn't done, the wrong info maybe got by the other cts.

As the filter table is after DNAT table, the packets dropped in filter
table also bother bysource hash table.
Signed-off-by: NChangli Gao <xiaosuo@gmail.com>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

41a7cab6

netfilter: do not omit re-route check on NF_QUEUE verdict · 28a51ba5

由 Florian Westphal 提交于 1月 20, 2011

ret != NF_QUEUE only works in the "--queue-num 0" case; for
queues > 0 the test should be '(ret & NF_VERDICT_MASK) != NF_QUEUE'.

However, NF_QUEUE no longer DROPs the skb unconditionally if queueing
fails (due to NF_VERDICT_FLAG_QUEUE_BYPASS verdict flag), so the
re-route test should also be performed if this flag is set in the
verdict.

The full test would then look something like

&& ((ret & NF_VERDICT_MASK) == NF_QUEUE && (ret & NF_VERDICT_FLAG_QUEUE_BYPASS))

This is rather ugly, so just remove the NF_QUEUE test altogether.

The only effect is that we might perform an unnecessary route lookup
in the NF_QUEUE case.

ip6table_mangle did not have such a check.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

28a51ba5

Revert "netlink: test for all flags of the NLM_F_DUMP composite" · b8f3ab42

由 David S. Miller 提交于 1月 18, 2011

This reverts commit 0ab03c2b.

It breaks several things including the avahi daemon.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8f3ab42

19 1月, 2011 1 次提交

netfilter: nf_conntrack: nf_conntrack snmp helper · 93557f53

由 Jiri Olsa 提交于 1月 18, 2011

Adding support for SNMP broadcast connection tracking. The SNMP
broadcast requests are now paired with the SNMP responses.
Thus allowing using SNMP broadcasts with firewall enabled.

Please refer to the following conversation:
http://marc.info/?l=netfilter-devel&m=125992205006600&w=2

Patrick McHardy wrote:
> > The best solution would be to add generic broadcast tracking, the
> > use of expectations for this is a bit of abuse.
> > The second best choice I guess would be to move the help() function
> > to a shared module and generalize it so it can be used for both.
This patch implements the "second best choice".

Since the netbios-ns conntrack module uses the same helper
functionality as the snmp, only one helper function is added
for both snmp and netbios-ns modules into the new object -
nf_conntrack_broadcast.
Signed-off-by: NJiri Olsa <jolsa@redhat.com>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

93557f53

18 1月, 2011 2 次提交

netfilter: ipt_CLUSTERIP: remove "no conntrack!" · 94d117a1

由 Eric Dumazet 提交于 1月 18, 2011

When a packet is meant to be handled by another node of the cluster,
silently drop it instead of flooding kernel log.

Note : INVALID packets are also dropped without notice.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

94d117a1

netfilter: nf_nat: fix conversion to non-atomic bit ops · a7c2f4d7

由 Changli Gao 提交于 1月 18, 2011

My previous patch (netfilter: nf_nat: don't use atomic bit operation)
made a mistake when converting atomic_set to a normal bit 'or'.
IPS_*_BIT should be replaced with IPS_*.
Signed-off-by: NChangli Gao <xiaosuo@gmail.com>
Cc: Tim Gardner <tim.gardner@canonical.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

a7c2f4d7

14 1月, 2011 2 次提交

netfilter: nf_conntrack: use is_vmalloc_addr() · d862a662

由 Patrick McHardy 提交于 1月 14, 2011

Use is_vmalloc_addr() in nf_ct_free_hashtable() and get rid of
the vmalloc flags to indicate that a hash table has been allocated
using vmalloc().
Signed-off-by: NPatrick McHardy <kaber@trash.net>

d862a662

netfilter: fix Kconfig dependencies · c7066f70

由 Patrick McHardy 提交于 1月 14, 2011

Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
which itself depends on NET_SCHED; this dependency is missing from netfilter.

Since matching on realms is also useful without having NET_SCHED enabled and
the option really only controls whether the tclassid member is included in
route and dst entries, rename the config option to IP_ROUTE_CLASSID and move
it outside of traffic scheduling context to get rid of the NET_SCHED dependeny.
Reported-by: NVladis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: NPatrick McHardy <kaber@trash.net>

c7066f70

13 1月, 2011 1 次提交

netfilter: x_table: speedup compat operations · 255d0dc3

由 Eric Dumazet 提交于 12月 18, 2010

One iptables invocation with 135000 rules takes 35 seconds of cpu time
on a recent server, using a 32bit distro and a 64bit kernel.

We eventually trigger NMI/RCU watchdog.

INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)

COMPAT mode has quadratic behavior and consume 16 bytes of memory per
rule.

Switch the xt_compat algos to use an array instead of list, and use a
binary search to locate an offset in the sorted array.

This halves memory need (8 bytes per rule), and removes quadratic
behavior [ O(N*N) -> O(N*log2(N)) ]

Time of iptables goes from 35 s to 150 ms.
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

255d0dc3

12 1月, 2011 2 次提交

ah: reload pointers to skb data after calling skb_cow_data() · 4b0ef1f2

由 Dang Hongwu 提交于 1月 11, 2011

skb_cow_data() may allocate a new data buffer, so pointers on
skb should be set after this function.

Bug was introduced by commit dff3bb06 ("ah4: convert to ahash")
and 8631e9bd ("ah6: convert to ahash").
Signed-off-by: NWang Xuefu <xuefu.wang@6wind.com>
Acked-by: NKrzysztof Witek <krzysztof.witek@6wind.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4b0ef1f2

tcp: disallow bind() to reuse addr/port · c191a836

由 Eric Dumazet 提交于 1月 11, 2011

inet_csk_bind_conflict() logic currently disallows a bind() if
it finds a friend socket (a socket bound on same address/port)
satisfying a set of conditions :

1) Current (to be bound) socket doesnt have sk_reuse set
OR
2) other socket doesnt have sk_reuse set
OR
3) other socket is in LISTEN state

We should add the CLOSE state in the 3) condition, in order to avoid two
REUSEADDR sockets in CLOSE state with same local address/port, since
this can deny further operations.

Note : a prior patch tried to address the problem in a different (and
buggy) way. (commit fda48a0d tcp: bind() fix when many ports
are bound).
Reported-by: NGaspar Chilingarov <gasparch@gmail.com>
Reported-by: NDaniel Baluta <daniel.baluta@gmail.com>
Tested-by: NDaniel Baluta <daniel.baluta@gmail.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c191a836

11 1月, 2011 2 次提交

arp: allow to invalidate specific ARP entries · 545ecdc3

由 Maxim Levitsky 提交于 1月 08, 2011

IPv4 over firewire needs to be able to remove ARP entries
from the ARP cache that belong to nodes that are removed, because
IPv4 over firewire uses ARP packets for private information
about nodes.

This information becomes invalid as soon as node drops
off the bus and when it reconnects, its only possible
to start talking to it after it responded to an ARP packet.
But ARP cache prevents such packets from being sent.
Signed-off-by: NMaxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

545ecdc3

netfilter: x_tables: dont block BH while reading counters · 83723d60

由 Eric Dumazet 提交于 1月 10, 2011

Using "iptables -L" with a lot of rules have a too big BH latency.
Jesper mentioned ~6 ms and worried of frame drops.

Switch to a per_cpu seqlock scheme, so that taking a snapshot of
counters doesnt need to block BH (for this cpu, but also other cpus).

This adds two increments on seqlock sequence per ipt_do_table() call,
its a reasonable cost for allowing "iptables -L" not block BH
processing.
Reported-by: NJesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
Acked-by: NStephen Hemminger <shemminger@vyatta.com>
Acked-by: NJesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

83723d60

10 1月, 2011 1 次提交

netlink: test for all flags of the NLM_F_DUMP composite · 0ab03c2b

由 Jan Engelhardt 提交于 1月 07, 2011

Due to NLM_F_DUMP is composed of two bits, NLM_F_ROOT | NLM_F_MATCH,
when doing "if (x & NLM_F_DUMP)", it tests for _either_ of the bits
being set. Because NLM_F_MATCH's value overlaps with NLM_F_EXCL,
non-dump requests with NLM_F_EXCL set are mistaken as dump requests.

Substitute the condition to test for _all_ bits being set.
Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
Acked-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0ab03c2b

07 1月, 2011 2 次提交

netfilter: fix export secctx error handling · cba85b53

由 Pablo Neira Ayuso 提交于 1月 06, 2011

In 1ae4de0c, the secctx was exported
via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
instead of the secmark.

That patch introduced the use of security_secid_to_secctx() which may
return a non-zero value on error.

In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
security modules. Thus, security_secid_to_secctx() returns a negative
value that results in the breakage of the /proc and `conntrack -L'
outputs. To fix this, we skip the inclusion of secctx if the
aforementioned function fails.

This patch also fixes the dynamic netlink message size calculation
if security_secid_to_secctx() returns an error, since its logic is
also wrong.

This problem exists in Linux kernel >= 2.6.37.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cba85b53

ipv4: IP defragmentation must be ECN aware · 6623e3b2

由 Eric Dumazet 提交于 1月 05, 2011

RFC3168 (The Addition of Explicit Congestion Notification to IP)
states :

5.3.  Fragmentation

   ECN-capable packets MAY have the DF (Don't Fragment) bit set.
   Reassembly of a fragmented packet MUST NOT lose indications of
   congestion.  In other words, if any fragment of an IP packet to be
   reassembled has the CE codepoint set, then one of two actions MUST be
   taken:

      * Set the CE codepoint on the reassembled packet.  However, this
        MUST NOT occur if any of the other fragments contributing to
        this reassembly carries the Not-ECT codepoint.

      * The packet is dropped, instead of being reassembled, for any
        other reason.

This patch implements this requirement for IPv4, choosing the first
action :

If one fragment had NO-ECT codepoint
        reassembled frame has NO-ECT
ElIf one fragment had CE codepoint
        reassembled frame has CE
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6623e3b2

05 1月, 2011 1 次提交

ipv4/route.c: respect prefsrc for local routes · 9fc3bbb4

由 Joel Sing 提交于 1月 03, 2011

The preferred source address is currently ignored for local routes,
which results in all local connections having a src address that is the
same as the local dst address. Fix this by respecting the preferred source
address when it is provided for local routes.

This bug can be demonstrated as follows:

 # ifconfig dummy0 192.168.0.1
 # ip route show table local | grep local.*dummy0
 local 192.168.0.1 dev dummy0  proto kernel  scope host  src 192.168.0.1
 # ip route change table local local 192.168.0.1 dev dummy0 \
     proto kernel scope host src 127.0.0.1
 # ip route show table local | grep local.*dummy0
 local 192.168.0.1 dev dummy0  proto kernel  scope host  src 127.0.0.1

We now establish a local connection and verify the source IP
address selection:

 # nc -l 192.168.0.1 3128 &
 # nc 192.168.0.1 3128 &
 # netstat -ant | grep 192.168.0.1:3128.*EST
 tcp        0      0 192.168.0.1:3128        192.168.0.1:33228 ESTABLISHED
 tcp        0      0 192.168.0.1:33228       192.168.0.1:3128  ESTABLISHED
Signed-off-by: NJoel Sing <jsing@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9fc3bbb4

26 12月, 2010 1 次提交

ipv4: dont create routes on down devices · fc75fc83

由 Eric Dumazet 提交于 12月 22, 2010

In ip_route_output_slow(), instead of allowing a route to be created on
a not UPed device, report -ENETUNREACH immediately.

# ip tunnel add mode ipip remote 10.16.0.164 local
10.16.0.72 dev eth0
# (Note : tunl1 is down)
# ping -I tunl1 10.1.2.3
PING 10.1.2.3 (10.1.2.3) from 192.168.18.5 tunl1: 56(84) bytes of data.
(nothing)
# ./a.out tunl1
# ip tunnel del tunl1
Message from syslogd@shelby at Dec 22 10:12:08 ...
  kernel: unregister_netdevice: waiting for tunl1 to become free.
Usage count = 3

After patch:
# ping -I tunl1 10.1.2.3
connect: Network is unreachable
Reported-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: NOctavian Purdila <opurdila@ixiacom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fc75fc83

24 12月, 2010 3 次提交

Revert "ipv4: Allow configuring subnets as local addresses" · e0584649

由 David S. Miller 提交于 12月 23, 2010

This reverts commit 4465b469.

Conflicts:

	net/ipv4/fib_frontend.c

As reported by Ben Greear, this causes regressions:

> Change 4465b469 caused rules
> to stop matching the input device properly because the
> FLOWI_FLAG_MATCH_ANY_IIF is always defined in ip_dev_find().
>
> This breaks rules such as:
>
> ip rule add pref 512 lookup local
> ip rule del pref 0 lookup local
> ip link set eth2 up
> ip -4 addr add 172.16.0.102/24 broadcast 172.16.0.255 dev eth2
> ip rule add to 172.16.0.102 iif eth2 lookup local pref 10
> ip rule add iif eth2 lookup 10001 pref 20
> ip route add 172.16.0.0/24 dev eth2 table 10001
> ip route add unreachable 0/0 table 10001
>
> If you had a second interface 'eth0' that was on a different
> subnet, pinging a system on that interface would fail:
>
>   [root@ct503-60 ~]# ping 192.168.100.1
>   connect: Invalid argument
Reported-by: NBen Greear <greearb@candelatech.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e0584649

tcp: cleanup of cwnd initialization in tcp_init_metrics() · d9f4fbaf

由 Jiri Kosina 提交于 12月 22, 2010

Commit 86bcebaf ("tcp: fix >2 iw selection") fixed a case
when congestion window initialization has been mistakenly omitted
by introducing cwnd label and putting backwards goto from the
end of the function.

This makes the code unnecessarily tricky to read and understand
on a first sight.

Shuffle the code around a little bit to make it more obvious.
Signed-off-by: NJiri Kosina <jkosina@suse.cz>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d9f4fbaf

tcp: fix listening_get_next() · 1bde5ac4

由 Eric Dumazet 提交于 12月 23, 2010

Alexey Vlasov found /proc/net/tcp could sometime loop and display
millions of sockets in LISTEN state.

In 2.6.29, when we converted TCP hash tables to RCU, we left two
sk_next() calls in listening_get_next().

We must instead use sk_nulls_next() to properly detect an end of chain.
Reported-by: NAlexey Vlasov <renton@renton.name>
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1bde5ac4

21 12月, 2010 2 次提交

TCP: increase default initial receive window. · 356f0398

由 Nandita Dukkipati 提交于 12月 20, 2010

This patch changes the default initial receive window to 10 mss
(defined constant). The default window is limited to the maximum
of 10*1460 and 2*mss (when mss > 1460).

draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
increasing TCP's initial congestion window to 10 mss or about 15KB.
Leading up to this proposal were several large-scale live Internet
experiments with an initial congestion window of 10 mss (IW10), where
we showed that the average latency of HTTP responses improved by
approximately 10%. This was accompanied by a slight increase in
retransmission rate (0.5%), most of which is coming from applications
opening multiple simultaneous connections. To understand the extreme
worst case scenarios, and fairness issues (IW10 versus IW3), we further
conducted controlled testbed experiments. We came away finding minimal
negative impact even under low link bandwidths (dial-ups) and small
buffers. These results are extremely encouraging to adopting IW10.

However, an initial congestion window of 10 mss is useless unless a TCP
receiver advertises an initial receive window of at least 10 mss.
Fortunately, in the large-scale Internet experiments we found that most
widely used operating systems advertised large initial receive windows
of 64KB, allowing us to experiment with a wide range of initial
congestion windows. Linux systems were among the few exceptions that
advertised a small receive window of 6KB. The purpose of this patch is
to fix this shortcoming.

References:
1. A comprehensive list of all IW10 references to date.
http://code.google.com/speed/protocols/tcpm-IW10.html

2. Paper describing results from large-scale Internet experiments with IW10.
http://ccr.sigcomm.org/drupal/?q=node/621

3. Controlled testbed experiments under worst case scenarios and a
fairness study.
http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf

4. Raw test data from testbed experiments (Linux senders/receivers)
with initial congestion and receive windows of both 10 mss.
http://research.csc.ncsu.edu/netsrv/?q=content/iw10

5. Internet-Draft. Increasing TCP's Initial Window.
https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/Signed-off-by: NNandita Dukkipati <nanditad@google.com>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

356f0398

ipv4: Flush per-ns routing cache more sanely. · 6561a3b1

由 David S. Miller 提交于 12月 19, 2010

Flush the routing cache only of entries that match the
network namespace in which the purge event occurred.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>

6561a3b1

17 12月, 2010 2 次提交

net: Use skb_checksum_start_offset() · 55508d60

由 Michał Mirosław 提交于 12月 14, 2010

Replace skb->csum_start - skb_headroom(skb) with skb_checksum_start_offset().

Note for usb/smsc95xx: skb->data - skb->head == skb_headroom(skb).
Signed-off-by: NMichał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55508d60

net: fix nulls list corruptions in sk_prot_alloc · fcbdf09d

由 Octavian Purdila 提交于 12月 16, 2010

Special care is taken inside sk_port_alloc to avoid overwriting
skc_node/skc_nulls_node. We should also avoid overwriting
skc_bind_node/skc_portaddr_node.

The patch fixes the following crash:

 BUG: unable to handle kernel paging request at fffffffffffffff0
 IP: [<ffffffff812ec6dd>] udp4_lib_lookup2+0xad/0x370
 [<ffffffff812ecc22>] __udp4_lib_lookup+0x282/0x360
 [<ffffffff812ed63e>] __udp4_lib_rcv+0x31e/0x700
 [<ffffffff812bba45>] ? ip_local_deliver_finish+0x65/0x190
 [<ffffffff812bbbf8>] ? ip_local_deliver+0x88/0xa0
 [<ffffffff812eda35>] udp_rcv+0x15/0x20
 [<ffffffff812bba45>] ip_local_deliver_finish+0x65/0x190
 [<ffffffff812bbbf8>] ip_local_deliver+0x88/0xa0
 [<ffffffff812bb2cd>] ip_rcv_finish+0x32d/0x6f0
 [<ffffffff8128c14c>] ? netif_receive_skb+0x99c/0x11c0
 [<ffffffff812bb94b>] ip_rcv+0x2bb/0x350
 [<ffffffff8128c14c>] netif_receive_skb+0x99c/0x11c0
Signed-off-by: NLeonard Crestez <lcrestez@ixiacom.com>
Signed-off-by: NOctavian Purdila <opurdila@ixiacom.com>
Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fcbdf09d

15 12月, 2010 1 次提交

net: Abstract default MTU metric calculation behind an accessor. · d33e4553

由 David S. Miller 提交于 12月 14, 2010

Like RTAX_ADVMSS, make the default calculation go through a dst_ops
method rather than caching the computation in the routing cache
entries.

Now dst metrics are pretty much left as-is when new entries are
created, thus optimizing metric sharing becomes a real possibility.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d33e4553

14 12月, 2010 2 次提交

net: Abstract default ADVMSS behind an accessor. · 0dbaee3b

由 David S. Miller 提交于 12月 13, 2010

Make all RTAX_ADVMSS metric accesses go through a new helper function,
dst_metric_advmss().

Leave the actual default metric as "zero" in the real metric slot,
and compute the actual default value dynamically via a new dst_ops
AF specific callback.

For stacked IPSEC routes, we use the advmss of the path which
preserves existing behavior.

Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
advmss on pmtu updates.  This inconsistency in advmss handling
results in more raw metric accesses than I wish we ended up with.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0dbaee3b

net: add limits to ip_default_ttl · 249fab77

由 Eric Dumazet 提交于 12月 13, 2010

ip_default_ttl should be between 1 and 255
Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

249fab77