提交 · 3e6b3b2e34ac2c72fa05b5e4b20bd88d64c298dc · openeuler / raspberrypi-kernel

27 2月, 2007 1 次提交

[TCP]: Fix MD5 signature pool locking. · 2c4f6219

由 David S. Miller 提交于 2月 20, 2007

The locking calls assumed that these code paths were only
invoked in software interrupt context, but that isn't true.

Therefore we need to use spin_{lock,unlock}_bh() throughout.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c4f6219

11 2月, 2007 1 次提交

[NET] IPV4: Fix whitespace errors. · e905a9ed

由 YOSHIFUJI Hideaki 提交于 2月 09, 2007

Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e905a9ed

09 2月, 2007 1 次提交

[NET]: change layout of ehash table · dbca9b27

由 Eric Dumazet 提交于 2月 08, 2007

ehash table layout is currently this one :

First half of this table is used by sockets not in TIME_WAIT state
Second half of it is used by sockets in TIME_WAIT state.

This is non optimal because of for a given hash or socket, the two chain heads
are located in separate cache lines.
Moreover the locks of the second half are never used.

If instead of this halving, we use two list heads in inet_ehash_bucket instead
of only one, we probably can avoid one cache miss, and reduce ram usage,
particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK,
CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep
together related chains to speedup lookups and socket state change.

In this patch I did not try to align struct inet_ehash_bucket, but a future
patch could try to make this structure have a convenient size (a power of two
or a multiple of L1_CACHE_SIZE).
I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so
maybe we dont need to scratch our heads to align the bucket...

Note : In case struct inet_ehash_bucket is not a power of two, we could
probably change alloc_large_system_hash() (in case it use __get_free_pages())
to free the unused space. It currently allocates a big zone, but the last
quarter of it could be freed. Again, this should be a temporary 'problem'.

Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dbca9b27

14 12月, 2006 1 次提交

[TCP]: Fix oops caused by __tcp_put_md5sig_pool() · 6931ba7c

由 David S. Miller 提交于 12月 13, 2006

It should call tcp_free_md5sig_pool() not __tcp_free_md5sig_pool()
so that it does proper refcounting.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6931ba7c

03 12月, 2006 4 次提交

[NET]: Possible cleanups. · f5b99bcd

由 Adrian Bunk 提交于 11月 30, 2006

This patch contains the following possible cleanups:
- make the following needlessly global functions statis:
  - ipv4/tcp.c: __tcp_alloc_md5sig_pool()
  - ipv4/tcp_ipv4.c: tcp_v4_reqsk_md5_lookup()
  - ipv4/udplite.c: udplite_rcv()
  - ipv4/udplite.c: udplite_err()
- make the following needlessly global structs static:
  - ipv4/tcp_ipv4.c: tcp_request_sock_ipv4_ops
  - ipv4/tcp_ipv4.c: tcp_sock_ipv4_specific
  - ipv6/tcp_ipv6.c: tcp_request_sock_ipv6_ops
- net/ipv{4,6}/udplite.c: remove inline's from static functions
                          (gcc should know best when to inline them)
Signed-off-by: NAdrian Bunk <bunk@stusta.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5b99bcd

[TCP]: Tidy up skb_entail · 352d4800

由 Arnaldo Carvalho de Melo 提交于 11月 17, 2006

Heck, it even saves us some few bytes:

[acme@newtoy net-2.6.20]$ codiff -f /tmp/tcp.o.before ../OUTPUT/qemu/net-2.6.20/net/ipv4/tcp.o
/pub/scm/linux/kernel/git/acme/net-2.6.20/net/ipv4/tcp.c:
  tcp_sendpage |   -7
  tcp_sendmsg  |   -5
 2 functions changed, 12 bytes removed
[acme@newtoy net-2.6.20]$
Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>

352d4800

[NET]: Annotate callers of csum_fold() in net/* · d3bc23e7

由 Al Viro 提交于 11月 14, 2006

Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d3bc23e7

[TCP]: MD5 Signature Option (RFC2385) support. · cfb6eeb4

由 YOSHIFUJI Hideaki 提交于 11月 14, 2006

Based on implementation by Rick Payne.
Signed-off-by: NYOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cfb6eeb4

16 11月, 2006 1 次提交

[TCP]: Fix up sysctl_tcp_mem initialization. · 52bf376c

由 John Heffner 提交于 11月 14, 2006

Fix up tcp_mem initial settings to take into account the size of the
hash entries (different on SMP and non-SMP systems).
Signed-off-by: NJohn Heffner <jheffner@psc.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52bf376c

08 11月, 2006 1 次提交

[TCP]: Don't use highmem in tcp hash size calculation. · 9e950efa

由 John Heffner 提交于 11月 06, 2006

This patch removes consideration of high memory when determining TCP
hash table sizes.  Taking into account high memory results in tcp_mem
values that are too large.
Signed-off-by: NJohn Heffner <jheffner@psc.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9e950efa

23 9月, 2006 4 次提交

[TCP]: Send ACKs each 2nd received segment. · 1ef9696c

由 Alexey Kuznetsov 提交于 9月 19, 2006

It does not affect either mss-sized connections (obviously) or
connections controlled by Nagle (because there is only one small
segment in flight).

The idea is to record the fact that a small segment arrives on a
connection, where one small segment has already been received and
still not-ACKed. In this case ACK is forced after tcp_recvmsg() drains
receive buffer.

In other words, it is a "soft" each-2nd-segment ACK, which is enough
to preserve ACK clock even when ABC is enabled.
Signed-off-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1ef9696c

[NET]: Use SLAB_PANIC · e5d679f3

由 Alexey Dobriyan 提交于 8月 26, 2006

Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e5d679f3

[NET/IPV4/IPV6]: Change some sysctl variables to __read_mostly · ab32ea5d

由 Brian Haley 提交于 9月 22, 2006

Change net/core, ipv4 and ipv6 sysctl variables to __read_mostly.

Couldn't actually measure any performance increase while testing (.3%
I consider noise), but seems like the right thing to do.
Signed-off-by: NBrian Haley <brian.haley@hp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab32ea5d

[NET]: Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE · 84fa7933

由 Patrick McHardy 提交于 8月 29, 2006

Replace CHECKSUM_HW by CHECKSUM_PARTIAL (for outgoing packets, whose
checksum still needs to be completed) and CHECKSUM_COMPLETE (for
incoming packets, device supplied full checksum).

Patch originally from Herbert Xu, updated by myself for 2.6.18-rc3.
Signed-off-by: NPatrick McHardy <kaber@trash.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

84fa7933

03 8月, 2006 2 次提交

[NET]: Fix more per-cpu typos · 29bbd72d

由 Alexey Dobriyan 提交于 8月 02, 2006

Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29bbd72d

[TCP]: Process linger2 timeout consistently. · 52499afe

由 David S. Miller 提交于 7月 31, 2006

Based upon guidance from Alexey Kuznetsov.

When linger2 is active, we check to see if the fin_wait2
timeout is longer than the timewait.  If it is, we schedule
the keepalive timer for the difference between the timewait
timeout and the fin_wait2 timeout.

When this orphan socket is seen by tcp_keepalive_timer()
it will try to transform this fin_wait2 socket into a
fin_wait2 mini-socket, again if linger2 is active.

Not all paths were setting this initial keepalive timer correctly.
The tcp input path was doing it correctly, but tcp_close() wasn't,
potentially making the socket linger longer than it really needs to.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52499afe

04 7月, 2006 1 次提交

[NET]: Verify gso_type too in gso_segment · bbcf467d

由 Herbert Xu 提交于 7月 03, 2006

We don't want nasty Xen guests to pass a TCPv6 packet in with gso_type set
to TCPv4 or even UDP (or a packet that's both TCP and UDP).
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bbcf467d

01 7月, 2006 4 次提交

[NET]: Generalise TSO-specific bits from skb_setup_caps · bcd76111

由 Herbert Xu 提交于 6月 30, 2006

This patch generalises the TSO-specific bits from sk_setup_caps by adding
the sk_gso_type member to struct sock.  This makes sk_setup_caps generic
so that it can be used by TCPv6 or UFO.

The only catch is that whoever uses this must provide a GSO implementation
for their protocol which I think is a fair deal :) For now UFO continues to
live without a GSO implementation which is OK since it doesn't use the sock
caps field at the moment.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bcd76111

[IPV6]: Added GSO support for TCPv6 · adcfc7d0

由 Herbert Xu 提交于 6月 30, 2006

This patch adds GSO support for IPv6 and TCPv6.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

adcfc7d0

[TCP]: Reset gso_segs if packet is dodgy · 3820c3f3

由 Herbert Xu 提交于 6月 29, 2006

I wasn't paranoid enough in verifying GSO information.  A bogus gso_segs
could upset drivers as much as a bogus header would.  Let's reset it in
the per-protocol gso_segment functions.

I didn't verify gso_size because that can be verified by the source of
the dodgy packets.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3820c3f3

Remove obsolete #include <linux/config.h> · 6ab3d562

由 Jörn Engel 提交于 6月 30, 2006

Signed-off-by: NJörn Engel <joern@wohnheim.fh-wedel.de>
Signed-off-by: NAdrian Bunk <bunk@stusta.de>

6ab3d562

30 6月, 2006 1 次提交

[NET]: Added GSO header verification · 576a30eb

由 Herbert Xu 提交于 6月 27, 2006

When GSO packets come from an untrusted source (e.g., a Xen guest domain),
we need to verify the header integrity before passing it to the hardware.

Since the first step in GSO is to verify the header, we can reuse that
code by adding a new bit to gso_type: SKB_GSO_DODGY. Packets with this
bit set can only be fed directly to devices with the corresponding bit
NETIF_F_GSO_ROBUST. If the device doesn't have that bit, then the skb
is fed to the GSO engine which will allow the packet to be sent to the
hardware if it passes the header check.

This patch changes the sg flag to a full features flag. The same method
can be used to implement TSO ECN support. We simply have to mark packets
with CWR set with SKB_GSO_ECN so that only hardware with a corresponding
NETIF_F_TSO_ECN can accept them. The GSO engine can either fully segment
the packet, or segment the first MTU and pass the rest to the hardware for
further segmentation.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

576a30eb

26 6月, 2006 1 次提交

[NET]: Fix CHECKSUM_HW GSO problems. · 0718bcc0

由 Herbert Xu 提交于 6月 25, 2006

Fix checksum problems in the GSO code path for CHECKSUM_HW packets.

The ipv4 TCP pseudo header checksum has to be adjusted for GSO
segmented packets.

The adjustment is needed because the length field in the pseudo-header
changes.  However, because we have the inequality oldlen > newlen, we
know that delta = (u16)~oldlen + newlen is still a 16-bit quantity.
This also means that htonl(delta) + th->check still fits in 32 bits.
Therefore we don't have to use csum_add on this operations.

This is based on a patch by Michael Chan <mchan@broadcom.com>.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NMichael Chan <mchan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0718bcc0

23 6月, 2006 2 次提交

[NET]: Add software TSOv4 · f4c50d99

由 Herbert Xu 提交于 6月 22, 2006

This patch adds the GSO implementation for IPv4 TCP.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f4c50d99

[NET]: Merge TSO/UFO fields in sk_buff · 7967168c

由 Herbert Xu 提交于 6月 22, 2006

Having separate fields in sk_buff for TSO/UFO (tso_size/ufo_size) is not
going to scale if we add any more segmentation methods (e.g., DCCP).  So
let's merge them.

They were used to tell the protocol of a packet.  This function has been
subsumed by the new gso_type field.  This is essentially a set of netdev
feature bits (shifted by 16 bits) that are required to process a specific
skb.  As such it's easy to tell whether a given device can process a GSO
skb: you just have to and the gso_type field and the netdev's features
field.

I've made gso_type a conjunction.  The idea is that you have a base type
(e.g., SKB_GSO_TCPV4) that can be modified further to support new features.
For example, if we add a hardware TSO type that supports ECN, they would
declare NETIF_F_TSO | NETIF_F_TSO_ECN.  All TSO packets with CWR set would
have a gso_type of SKB_GSO_TCPV4 | SKB_GSO_TCPV4_ECN while all other TSO
packets would be SKB_GSO_TCPV4.  This means that only the CWR packets need
to be emulated in software.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7967168c

18 6月, 2006 4 次提交

[NET]: Add NETIF_F_GEN_CSUM and NETIF_F_ALL_CSUM · 8648b305

由 Herbert Xu 提交于 6月 17, 2006

The current stack treats NETIF_F_HW_CSUM and NETIF_F_NO_CSUM
identically so we test for them in quite a few places. For the sake
of brevity, I'm adding the macro NETIF_F_GEN_CSUM for these two. We
also test the disjunct of NETIF_F_IP_CSUM and the other two in various
places, for that purpose I've added NETIF_F_ALL_CSUM.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8648b305

[I/OAT]: TCP recv offload to I/OAT · 1a2449a8

由 Chris Leech 提交于 5月 23, 2006

Locks down user pages and sets up for DMA in tcp_recvmsg, then calls
dma_async_try_early_copy in tcp_v4_do_rcv
Signed-off-by: NChris Leech <christopher.leech@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a2449a8

[I/OAT]: Make sk_eat_skb I/OAT aware. · 624d1164

由 Chris Leech 提交于 5月 23, 2006

Add an extra argument to sk_eat_skb, and make it move early copied
packets to the async_wait_queue instead of freeing them.
Signed-off-by: NChris Leech <christopher.leech@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

624d1164

[I/OAT]: Rename cleanup_rbuf to tcp_cleanup_rbuf and make non-static · 0e4b4992

由 Chris Leech 提交于 5月 23, 2006

Needed to be able to call tcp_cleanup_rbuf in tcp_input.c for I/OAT
Signed-off-by: NChris Leech <christopher.leech@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e4b4992

04 5月, 2006 1 次提交

[TCP]: Fix sock_orphan dead lock · 75c2d907

由 Herbert Xu 提交于 5月 03, 2006

Calling sock_orphan inside bh_lock_sock in tcp_close can lead to dead
locks.  For example, the inet_diag code holds sk_callback_lock without
disabling BH.  If an inbound packet arrives during that admittedly tiny
window, it will cause a dead lock on bh_lock_sock.  Another possible
path would be through sock_wfree if the network device driver frees the
tx skb in process context with BH enabled.

We can fix this by moving sock_orphan out of bh_lock_sock.

The tricky bit is to work out when we need to destroy the socket
ourselves and when it has already been destroyed by someone else.

By moving sock_orphan before the release_sock we can solve this
problem.  This is because as long as we own the socket lock its
state cannot change.

So we simply record the socket state before the release_sock
and then check the state again after we regain the socket lock.
If the socket state has transitioned to TCP_CLOSE in the time being,
we know that the socket has been destroyed.  Otherwise the socket is
still ours to keep.

Note that I've also moved the increment on the orphan count forward.
This may look like a problem as we're increasing it even if the socket
is just about to be destroyed where it'll be decreased again.  However,
this simply enlarges a window that already exists.  This also changes
the orphan count test by one.

Considering what the orphan count is meant to do this is no big deal.

This problem was discoverd by Ingo Molnar using his lock validator.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75c2d907

26 3月, 2006 1 次提交

[PATCH] POLLRDHUP/EPOLLRDHUP handling for half-closed devices notifications · f348d70a

由 Davide Libenzi 提交于 3月 25, 2006

Implement the half-closed devices notifiation, by adding a new POLLRDHUP
(and its alias EPOLLRDHUP) bit to the existing poll/select sets.  Since the
existing POLLHUP handling, that does not report correctly half-closed
devices, was feared to be changed, this implementation leaves the current
POLLHUP reporting unchanged and simply add a new bit that is set in the few
places where it makes sense.  The same thing was discussed and conceptually
agreed quite some time ago:

http://lkml.org/lkml/2003/7/12/116

Since this new event bit is added to the existing Linux poll infrastruture,
even the existing poll/select system calls will be able to use it.  As far
as the existing POLLHUP handling, the patch leaves it as is.  The
pollrdhup-2.6.16.rc5-0.10.diff defines the POLLRDHUP for all the existing
archs and sets the bit in the six relevant files.  The other attached diff
is the simple change required to sys/epoll.h to add the EPOLLRDHUP
definition.

There is "a stupid program" to test POLLRDHUP delivery here:

 http://www.xmailserver.org/pollrdhup-test.c

It tests poll(2), but since the delivery is same epoll(2) will work equally.
Signed-off-by: NDavide Libenzi <davidel@xmailserver.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Signed-off-by: NAndrew Morton <akpm@osdl.org>
Signed-off-by: NLinus Torvalds <torvalds@osdl.org>

f348d70a

25 3月, 2006 2 次提交

D
[TCP]: Mark tcp_*mem[] __read_mostly. · b8059ead
由 David S. Miller 提交于 3月 25, 2006
```
Suggested by Stephen Hemminger.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
b8059ead

[TCP]: Set default max buffers from memory pool size · 7b4f4b5e

由 John Heffner 提交于 3月 25, 2006

This patch sets the maximum TCP buffer sizes (available to automatic
buffer tuning, not to setsockopt) based on the TCP memory pool size.
The maximum sndbuf and rcvbuf each will be up to 4 MB, but no more
than 1/128 of the memory pressure threshold.
Signed-off-by: NJohn Heffner <jheffner@psc.edu>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7b4f4b5e

21 3月, 2006 3 次提交

[NET]: Identation & other cleanups related to compat_[gs]etsockopt cset · 543d9cfe

由 Arnaldo Carvalho de Melo 提交于 3月 20, 2006

No code changes, just tidying up, in some cases moving EXPORT_SYMBOLs
to just after the function exported, etc.
Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

543d9cfe

A
[ICSK] compat: Introduce inet_csk_compat_[gs]etsockopt · dec73ff0
由 Arnaldo Carvalho de Melo 提交于 3月 20, 2006
```
Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
dec73ff0

[NET]: {get|set}sockopt compatibility layer · 3fdadf7d

由 Dmitry Mishin 提交于 3月 20, 2006

This patch extends {get|set}sockopt compatibility layer in order to
move protocol specific parts to their place and avoid huge universal
net/compat.c file in the future.
Signed-off-by: NDmitry Mishin <dim@openvz.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3fdadf7d

04 1月, 2006 2 次提交

[IP_SOCKGLUE]: Remove most of the tcp specific calls · d83d8461

由 Arnaldo Carvalho de Melo 提交于 12月 13, 2005

As DCCP needs to be called in the same spots.

Now we have a member in inet_sock (is_icsk), set at sock creation time from
struct inet_protosw->flags (if INET_PROTOSW_ICSK is set, like for TCP and
DCCP) to see if a struct sock instance is a inet_connection_sock for places
like the ones in ip_sockglue.c (v4 and v6) where we previously were looking if
sk_type was SOCK_STREAM, that is insufficient because we now use the same code
for DCCP, that has sk_type SOCK_DCCP.
Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d83d8461

[ICSK]: Rename struct tcp_func to struct inet_connection_sock_af_ops · 8292a17a

由 Arnaldo Carvalho de Melo 提交于 12月 13, 2005

And move it to struct inet_connection_sock. DCCP will use it in the
upcoming changesets.
Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8292a17a

30 11月, 2005 2 次提交

[NET]: Add const markers to various variables. · 9b5b5cff

由 Arjan van de Ven 提交于 11月 29, 2005

the patch below marks various variables const in net/; the goal is to
move them to the .rodata section so that they can't false-share
cachelines with things that get written to, as well as potentially
helping gcc a bit with optimisations.  (these were found using a gcc
patch to warn about such variables)
Signed-off-by: NArjan van de Ven <arjan@infradead.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b5b5cff

[IPV4] tcp/route: Another look at hash table sizes · 18955cfc

由 Mike Stroyan 提交于 11月 29, 2005

  The tcp_ehash hash table gets too big on systems with really big memory.
It is worse on systems with pages larger than 4KB.  It wastes memory that
could be better used.  It also makes the netstat command slow because reading
/proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.

  The default value should not be larger for larger page sizes.  It seems
that the effect of page size is an unintended error dating back a long
time.  I also wonder if the default value really should be a larger
fraction of memory for systems with more memory.  While systems with
really big ram can afford more space for hash tables, it is not clear to
me that they benefit from increasing the allocation ratio for this table.

  The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
mm/page_alloc.c:alloc_large_system_hash.

tcp_init calls alloc_large_system_hash passing parameters-
    bucketsize=sizeof(struct tcp_ehash_bucket)
    numentries=thash_entries
    scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
    limit=0

On i386, PAGE_SHIFT is 12 for a page size of 4K
On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K

The num_physpages test above makes the allocation take a larger fraction
of the total memory on systems with larger memory.  The threshold size
for a i386 system is 512MB.  For an ia64 system with 16KB pages the
threshold is 2GB.

For smaller memory systems-
On i386, scale = (27 - 12) = 15
On ia64, scale = (27 - 14) = 13
For larger memory systems-
On i386, scale = (25 - 12) = 13
On ia64, scale = (25 - 14) = 11

  For the rest of this discussion, I'll just track the larger memory case.

  The default behavior has numentries=thash_entries=0, so the allocated
size is determined by either scale or by the default limit of 1/16 of
total memory.

In alloc_large_system_hash-
|	numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
|	numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
|	numentries >>= 20 - PAGE_SHIFT;
|	numentries <<= 20 - PAGE_SHIFT;

  At this point, numentries is pages for all of memory, rounded up to the
nearest megabyte boundary.

|	/* limit to 1 bucket per 2^scale bytes of low memory */
|	if (scale > PAGE_SHIFT)
|		numentries >>= (scale - PAGE_SHIFT);
|	else
|		numentries <<= (PAGE_SHIFT - scale);

On i386, numentries >>= (13 - 12), so numentries is 1/8196 of
bytes of total memory.
On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
bytes of total memory.

|        log2qty = long_log2(numentries);
|
|        do {
|                size = bucketsize << log2qty;

bucketsize is 16, so size is 16 times numentries, rounded
down to a power of two.

On i386, size is 1/512 of bytes of total memory.
On ia64, size is 1/128 of bytes of total memory.

For smaller systems the results are
On i386, size is 1/2048 of bytes of total memory.
On ia64, size is 1/512 of bytes of total memory.

  The large page effect can be removed by just replacing
the use of PAGE_SHIFT with a constant of 12 in the calls to
alloc_large_system_hash.  That makes them more like the other uses of
that function from fs/inode.c and fs/dcache.c
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

18955cfc