Commits · e022f0b4a03f4fff9323b509df023b8af635716e · openanolis / cloud-kernel

21 Oct, 2009 1 commit

net: Introduce sk_tx_queue_mapping · e022f0b4

Authored by Krishna Kumar on Oct 19, 2009

Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
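
The -1 sentinel described in this message can be sketched in user space as follows. This is only an illustration: struct sock_sketch is a hypothetical stand-in for the relevant part of struct sock, and the helper names are chosen for the example rather than taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant part of struct sock. */
struct sock_sketch {
	int sk_tx_queue_mapping;	/* -1 means "no queue recorded" */
};

/* Record the tx queue chosen for this socket (valid range: 0..n-1). */
void sk_tx_queue_set(struct sock_sketch *sk, int tx_queue)
{
	sk->sk_tx_queue_mapping = tx_queue;
}

/* Forget the mapping, e.g. when the dst cache is set or reset. */
void sk_tx_queue_clear(struct sock_sketch *sk)
{
	sk->sk_tx_queue_mapping = -1;
}

/* Return the stored index as is; no "- 1" adjustment on the tx path. */
int sk_tx_queue_get(const struct sock_sketch *sk)
{
	return sk->sk_tx_queue_mapping;
}

/* True when a valid queue index has been recorded. */
bool sk_tx_queue_recorded(const struct sock_sketch *sk)
{
	return sk->sk_tx_queue_mapping >= 0;
}
```

Because "unset" is -1 rather than 0, queue 0 stays a valid value and the stored number can be used directly as an index.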

15 Oct, 2009 1 commit

net: sk_drops consolidation · 766e9037

Authored by Eric Dumazet on Oct 14, 2009

sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.

This adds sk_drops management to many protocols that did not handle it yet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

13 Oct, 2009 1 commit

net: Generalize socket rx gap / receive queue overflow cmsg · 3b885787

Authored by Neil Horman on Oct 12, 2009

Create a new socket level option to report number of queue overflows

Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. After I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch. It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the number of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.

Notes:

1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.

2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if that's
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.

3) This applies cleanly to net-next assuming that commit
97775007 (my af packet cmsg patch) is reverted.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
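
A user-space consumer of this option might look roughly like the sketch below. It assumes the cmsg payload is a 32-bit unsigned cumulative counter, as the notes above describe, and computes the per-frame delta itself. The function names are hypothetical, and the fallback value for SO_RXQ_OVFL is an assumption for headers that lack it.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SO_RXQ_OVFL
#define SO_RXQ_OVFL 40	/* assumption: value on most Linux architectures */
#endif

/* Ask the kernel to attach the drop counter to every received datagram. */
int enable_rxq_ovfl(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof(on));
}

/*
 * Scan the control messages of a received msghdr for the drop counter.
 * Returns 1 and stores the cumulative count in *total if present, else 0.
 */
int read_rxq_ovfl(struct msghdr *msg, uint32_t *total)
{
	struct cmsghdr *c;

	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
		if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SO_RXQ_OVFL) {
			memcpy(total, CMSG_DATA(c), sizeof(*total));
			return 1;
		}
	}
	return 0;
}

/* Drops between two frames; unsigned arithmetic tolerates counter wrap. */
uint32_t rxq_ovfl_delta(uint32_t prev_total, uint32_t cur_total)
{
	return cur_total - prev_total;
}
```

A caller would pass a control buffer to recvmsg(), call read_rxq_ovfl() on the result, and keep the previous total around to feed rxq_ovfl_delta().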

01 Oct, 2009 2 commits

net: Fix sock_wfree() race · d99927f4

Authored by Eric Dumazet on Sep 24, 2009

Commit 2b85a34e
(net: No more expensive sock_hold()/sock_put() on each tx)
opens a window in sock_wfree() where another cpu
might free the socket we are working on.

A fix is to call sk->sk_write_space(sk) while still
holding a reference on sk.
Reported-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Make setsockopt() optlen be unsigned. · b7058842

Authored by David S. Miller on Sep 30, 2009

This provides safety against negative optlen at the type
level instead of depending upon (sometimes non-trivial)
checks against this sprinkled all over the place, in
each and every implementation.

Based upon work done by Arjan van de Ven and feedback
from Linus Torvalds.
Signed-off-by: David S. Miller <davem@davemloft.net>
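
The kind of check this change makes unnecessary can be illustrated with a small sketch. The functions and the buffer size are hypothetical; they only show why an upper-bound check on a signed length is not enough, while the same check on an unsigned length is.

```c
#define OPT_BUF_SIZE 16

/*
 * Signed length: a negative value slips past an upper-bound-only check
 * and would later be converted to a huge size when used as a copy length.
 */
int accepts_len_signed(int optlen)
{
	return optlen <= OPT_BUF_SIZE;
}

/* Unsigned length: the same check rejects what used to be "negative". */
int accepts_len_unsigned(unsigned int optlen)
{
	return optlen <= OPT_BUF_SIZE;
}
```

With the unsigned variant a caller passing -1 is seen as passing a very large length, which the ordinary bound check already refuses.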

22 Sep, 2009 1 commit

mm: replace various uses of num_physpages by totalram_pages · 4481374c

Authored by Jan Beulich on Sep 21, 2009

Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages.  The amount of what actually is usable as storage
should instead be used as a basis here.

Some of the calculations (i.e.  those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

02 Sep, 2009 1 commit

net: sk_free() should be allowed right after sk_alloc() · d66ee058

Authored by Jarek Poplawski on Aug 30, 2009

After commit 2b85a34e
(net: No more expensive sock_hold()/sock_put() on each tx)
sk_free() frees socks conditionally and depends
on sk_wmem_alloc being set e.g. in sock_init_data(). But in some
cases sk_free() is called earlier, usually after other alloc errors.

Fix is to move sk_wmem_alloc initialization from sock_init_data()
to sk_alloc() itself.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06 Aug, 2009 3 commits

net: implement a SO_DOMAIN getsockoption · 0d6038ee

Authored by Jan Engelhardt on Aug 04, 2009

This sockopt goes in line with SO_TYPE and SO_PROTOCOL. It makes it
possible for userspace programs to pass around file descriptors — I
am referring to arguments-to-functions, but it may even work for the
fd passing over UNIX sockets — without needing to also pass the
auxiliary information (PF_INET6/IPPROTO_TCP).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: implement a SO_PROTOCOL getsockoption · 49c794e9

Authored by Jan Engelhardt on Aug 04, 2009

Similar to SO_TYPE returning the socket type, SO_PROTOCOL allows to
retrieve the protocol used with a given socket.

I am not quite sure why we have that-many copies of socket.h, and why
the values are not the same on all arches either, but for where hex
numbers dominate, I use 0x1029 for SO_PROTOCOL as that seems to be
the next free unused number across a bunch of operating systems, or
so Google results make me want to believe. SO_PROTOCOL for others
just uses the next free Linux number, 38.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mark read-only arrays as const · 36cbd3dc

Authored by Jan Engelhardt on Aug 05, 2009

String literals are constant, and usually, we can also tag the array
of pointers const too, moving it to the .rodata section.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
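
The difference is in the second const. A minimal sketch, with a hypothetical table and lookup function:

```c
#include <stddef.h>

/*
 * "const char *" makes the strings read-only; the extra "const" after
 * the "*" makes the array of pointers read-only as well, which lets the
 * toolchain place the whole table in a read-only section.
 */
static const char *const family_names[] = {
	"AF_UNSPEC", "AF_UNIX", "AF_INET",
};

const char *family_name(size_t idx)
{
	size_t n = sizeof(family_names) / sizeof(family_names[0]);

	return idx < n ? family_names[idx] : "AF_UNKNOWN";
}
```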

20 Jul, 2009 1 commit

Fix error return for setsockopt(SO_TIMESTAMPING) · f249fb78

Authored by Rémi Denis-Courmont on Jul 20, 2009

I guess it should be -EINVAL rather than EINVAL. I have not checked
when the bug came in. Perhaps a candidate for -stable?
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Jul, 2009 1 commit

net: sock_copy() fixes · 4dc6dc71

Authored by Eric Dumazet on Jul 15, 2009

Commit e912b114
(net: sk_prot_alloc() should not blindly overwrite memory)
took care of not zeroing whole new socket at allocation time.

sock_copy() is another spot where we should be very careful.
We should not set refcnt to a non null value, until
we are sure other fields are correctly setup, or
a lockless reader could catch this socket by mistake,
while not fully (re)initialized.

This patch puts sk_node & sk_refcnt to the very beginning
of struct sock to ease sock_copy() & sk_prot_alloc() job.

We add appropriate smp_wmb() before sk_refcnt initializations
to match our RCU requirements (changes to sock keys should
be committed to memory before sk_refcnt setting)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
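
The ordering requirement can be sketched in user space with C11 atomics. This is a simplification under stated assumptions: struct sock_sketch is a hypothetical layout, and a release store stands in for the smp_wmb() plus refcount initialization mentioned above.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: fields a lockless reader relies on come first. */
struct sock_sketch {
	void *node_next;	/* list linkage, must stay valid */
	atomic_int refcnt;	/* 0 means "not ready for readers" */
	int key;		/* first of the ordinary fields */
	int state;
};

/* Copy everything after refcnt, then publish by setting refcnt last. */
void sock_copy_sketch(struct sock_sketch *nsk, const struct sock_sketch *osk)
{
	const size_t skip = offsetof(struct sock_sketch, key);

	memcpy((char *)nsk + skip, (const char *)osk + skip,
	       sizeof(*osk) - skip);
	/* Release ordering: the copied fields are visible before refcnt. */
	atomic_store_explicit(&nsk->refcnt, 1, memory_order_release);
}
```

Placing the linkage and the refcount at the very beginning of the structure is what allows a single offset to separate "leave alone" from "copy".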

12 Jul, 2009 1 commit

net: sk_prot_alloc() should not blindly overwrite memory · e912b114

Authored by Eric Dumazet on Jul 08, 2009

Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness
depends on sk->sk_nulls_node.next being always valid. A NULL
value is not allowed as it might fault a lockless reader.

Current sk_prot_alloc() implementation doesn't respect this hypothesis,
calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around
the forbidden field.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
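
"memset() around the forbidden field" can be pictured as two memset() calls, one on each side of the field. The structure below is a hypothetical stand-in, not the real struct sock layout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object with one field that must never become NULL. */
struct obj_sketch {
	int a;
	void *nulls_next;	/* lockless readers may dereference this */
	int b;
};

/* Zero the object in two pieces, leaving nulls_next untouched. */
void zero_around_nulls_next(struct obj_sketch *o)
{
	const size_t start = offsetof(struct obj_sketch, nulls_next);
	const size_t end = start + sizeof(o->nulls_next);

	memset(o, 0, start);
	memset((char *)o + end, 0, sizeof(*o) - end);
}
```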

10 Jul, 2009 1 commit

net: adding memory barrier to the poll and receive callbacks · a57de0b4

Authored by Jiri Olsa on Jul 08, 2009

Adding memory barrier after the poll_wait function, paired with
receive callbacks. Adding functions sock_poll_wait and sk_has_sleeper
to wrap the memory barrier.

Without the memory barrier, the following race can happen.
The race fires when the following code paths meet, and the tp->rcv_nxt
and __add_wait_queue updates stay in CPU caches.

CPU1                         CPU2

sys_select                   receive packet
  ...                        ...
  __add_wait_queue           update tp->rcv_nxt
  ...                        ...
  tp->rcv_nxt check          sock_def_readable
  ...                        {
  schedule                      ...
                                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                        wake_up_interruptible(sk->sk_sleep)
                                ...
                             }

If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposite to each other.

Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
end up calling schedule and sleep forever if there are no more data on the
socket.

Calls to poll_wait in following modules were omitted:
	net/bluetooth/af_bluetooth.c
	net/irda/af_irda.c
	net/irda/irnet/irnet_ppp.c
	net/mac80211/rc80211_pid_debugfs.c
	net/phonet/socket.c
	net/rds/af_rds.c
	net/rfkill/core.c
	net/sunrpc/cache.c
	net/sunrpc/rpc_pipe.c
	net/tipc/socket.c
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
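
The paired barriers can be modelled in user space with C11 atomics. In this sketch two hypothetical flags stand in for the wait queue entry and for tp->rcv_nxt, and a sequentially consistent fence stands in for the kernel's full memory barrier. It illustrates the pattern and is not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state standing in for the wait queue and rcv_nxt. */
static atomic_bool has_sleeper;
static atomic_bool data_ready;

/* Poll side: register as a sleeper, full barrier, then check for data. */
bool poll_side_should_sleep(void)
{
	atomic_store_explicit(&has_sleeper, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with fence below */
	return !atomic_load_explicit(&data_ready, memory_order_relaxed);
}

/* Receive side: publish data, full barrier, then look for a sleeper. */
bool receive_side_must_wake(void)
{
	atomic_store_explicit(&data_ready, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with fence above */
	return atomic_load_explicit(&has_sleeper, memory_order_relaxed);
}
```

With both fences in place, the outcome "poll side decides to sleep and receive side sees no sleeper" is ruled out: at least one side must observe the other's store.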

15 Jun, 2009 1 commit

net: annotate struct sock bitfield · a98b65a3

Authored by Vegard Nossum on Feb 26, 2009

2009/2/24 Ingo Molnar <mingo@elte.hu>:
> ok, this is the last warning i have from today's overnight -tip
> testruns - a 32-bit system warning in sock_init_data():
>
> [    2.610389] NET: Registered protocol family 16
> [    2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs
> [    2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184)
> [    2.624002] 010000000200000000000000604990c000000000000000000000000000000000
> [    2.634076]  i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i
> [    2.641038]          ^
> [    2.643376]
> [    2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885)
> [    2.648003] EIP: 0060:[<c07141a1>] EFLAGS: 00010282 CPU: 0
> [    2.652008] EIP is at sock_init_data+0xa1/0x190
> [    2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0
> [    2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec
> [    2.664003]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [    2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0
> [    2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [    2.676003] DR6: ffff4ff0 DR7: 00000400
> [    2.680002]  [<c07423e5>] __netlink_create+0x35/0xa0
> [    2.684002]  [<c07443cc>] netlink_kernel_create+0x4c/0x140
> [    2.688002]  [<c072755e>] rtnetlink_net_init+0x1e/0x40
> [    2.696002]  [<c071b601>] register_pernet_operations+0x11/0x30
> [    2.700002]  [<c071b72c>] register_pernet_subsys+0x1c/0x30
> [    2.704002]  [<c0bf3c8c>] rtnetlink_init+0x4c/0x100
> [    2.708002]  [<c0bf4669>] netlink_proto_init+0x159/0x170
> [    2.712002]  [<c0101124>] do_one_initcall+0x24/0x150
> [    2.716002]  [<c0bbf3c7>] do_initcalls+0x27/0x40
> [    2.723201]  [<c0bbf3fc>] do_basic_setup+0x1c/0x20
> [    2.728002]  [<c0bbfb8a>] kernel_init+0x5a/0xa0
> [    2.732002]  [<c0103e47>] kernel_thread_helper+0x7/0x10
> [    2.736002]  [<ffffffff>] 0xffffffff

We fix this false positive by annotating the bitfield in struct
sock.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>

11 Jun, 2009 1 commit

net: No more expensive sock_hold()/sock_put() on each tx · 2b85a34e

Authored by Eric Dumazet on Jun 11, 2009

One of the problems with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.

This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.

We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.

We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.

As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomically check if any packets
are in flight.

skb_set_owner_w() doesn't call sock_hold() anymore

sock_wfree() doesn't call sock_put() anymore, but checks if sk_wmem_alloc
reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
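
The offset-by-one scheme can be sketched in user space with C11 atomics. All names carry a _sketch suffix because they are illustrative stand-ins; the freed flag exists only so the example can show when the final free would happen.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sock_sketch {
	atomic_int wmem_alloc;	/* bytes in flight, plus 1 while alive */
	bool freed;		/* set by the final free, illustration only */
};

/* The bias of 1 is set at allocation time and stands in for a reference. */
void sk_alloc_sketch(struct sock_sketch *sk)
{
	atomic_init(&sk->wmem_alloc, 1);
	sk->freed = false;
}

/* Charge a packet to the socket; no separate refcount is taken. */
void skb_set_owner_w_sketch(struct sock_sketch *sk, int truesize)
{
	atomic_fetch_add(&sk->wmem_alloc, truesize);
}

/* Tx completion: uncharge, and free if this was the last user. */
void sock_wfree_sketch(struct sock_sketch *sk, int truesize)
{
	if (atomic_fetch_sub(&sk->wmem_alloc, truesize) == truesize)
		sk->freed = true;
}

/* Drop the bias; the socket is freed now only if nothing is in flight. */
void sk_free_sketch(struct sock_sketch *sk)
{
	if (atomic_fetch_sub(&sk->wmem_alloc, 1) == 1)
		sk->freed = true;
}
```

If truesize were wrong on either the charge or the uncharge, the counter would never reach zero or would reach it early, which is the drawback the message describes.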

09 Jun, 2009 1 commit

Add constants for the ieee 802.15.4 stack · fcb94e42

Authored by Sergey Lapin on Jun 08, 2009

IEEE 802.15.4 stack requires several constants to be defined/adjusted.
Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Signed-off-by: Sergey Lapin <slapin@ossfans.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

28 May, 2009 1 commit

net: net/core/sock.c cleanup · 2a91525c

Authored by Eric Dumazet on May 27, 2009

Pure style cleanup patch.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

01 Apr, 2009 1 commit

epoll keyed wakeups: make sockets use keyed wakeups · 37e5540b

Authored by Davide Libenzi on Mar 31, 2009

Add support for event-aware wakeups to the sockets code.  Events are
delivered to the wakeup target, so that epoll can avoid spurious wakeups
for non-interesting events.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: William Lee Irwin III <wli@movementarian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
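
The idea of an event-aware wakeup can be sketched as a filter on an event mask. The types, constants and function below are hypothetical and much simpler than the kernel's wait queue machinery. They only show why passing the event to the wakeup target avoids waking uninterested waiters.

```c
#include <stddef.h>

#define EV_IN  0x1u	/* hypothetical "readable" event bit */
#define EV_OUT 0x4u	/* hypothetical "writable" event bit */

struct waiter_sketch {
	unsigned int interest;	/* events this waiter cares about */
	int woken;		/* number of times it was woken */
};

/* Wake only the waiters whose interest mask intersects the event key. */
size_t wake_up_keyed(struct waiter_sketch *w, size_t n, unsigned int key)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++) {
		if (w[i].interest & key) {
			w[i].woken++;
			count++;
		}
	}
	return count;
}
```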

27 Feb, 2009 1 commit

RDS: Add RDS to AF key strings · cbd151bf

Authored by Andy Grover on Feb 26, 2009

Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Feb, 2009 1 commit

net: amend the fix for SO_BSDCOMPAT gsopt infoleak · 50fee1de

Authored by Eugene Teo on Feb 23, 2009

The fix for CVE-2009-0676 (upstream commit df0bca04) is incomplete. Note
that the same problem of leaking kernel memory will reappear if someone
on some architecture uses struct timeval with some internal padding (for
example tv_sec 64-bit and tv_usec 32-bit) --- then, you are going to
leak the padded bytes to userspace.
Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
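
One common way to avoid leaking padding is to clear the whole object before filling in its members. The sketch below uses a hypothetical structure with the 64-bit/32-bit layout mentioned above, and memcpy() stands in for the copy to user space. It is not the code of this patch.

```c
#include <string.h>

/* Hypothetical layout; many 64-bit ABIs add 4 bytes of tail padding. */
struct timeval_sketch {
	long long tv_sec;
	int tv_usec;
};

/* Fill a reply buffer without exposing stale stack contents. */
void fill_reply(unsigned char *out, long long sec, int usec)
{
	struct timeval_sketch tv;

	memset(&tv, 0, sizeof(tv));	/* covers the padding bytes too */
	tv.tv_sec = sec;
	tv.tv_usec = usec;
	memcpy(out, &tv, sizeof(tv));	/* stands in for copy_to_user() */
}
```

Initializing only the named members would leave the padding bytes holding whatever was on the stack, and a whole-object copy would then hand them to the caller.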

18 Feb, 2009 1 commit

net: Kill skb_truesize_check(), it only catches false-positives. · 92a0acce

Authored by David S. Miller on Feb 17, 2009

A long time ago we had bugs, primarily in TCP, where we would modify
skb->truesize (for TSO queue collapsing) in ways which would corrupt
the socket memory accounting.

skb_truesize_check() was added in order to try and catch this error
more systematically.

However this debugging check has morphed into a Frankenstein of sorts
and these days it does nothing other than catch false-positives.
Signed-off-by: David S. Miller <davem@davemloft.net>

16 Feb, 2009 1 commit

net: socket infrastructure for SO_TIMESTAMPING · 20d49473

Authored by Patrick Ohly on Feb 12, 2009

The overlap with the old SO_TIMESTAMP[NS] options is handled so
that time stamping in software (net_enable_timestamp()) is
enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE
is set.  It's disabled if all of these are off.
Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

13 Feb, 2009 1 commit

net: 4 bytes kernel memory disclosure in SO_BSDCOMPAT gsopt try #2 · df0bca04

Authored by Clément Lecigne on Feb 12, 2009

In function sock_getsockopt() located in net/core/sock.c, optval v.val
is not correctly initialized and directly returned in userland in case
we have SO_BSDCOMPAT option set.

This dummy code should trigger the bug:

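#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
	unsigned char buf[4] = { 0, 0, 0, 0 };
	socklen_t len = sizeof(buf);	/* must be initialized before the call */
	int sock;

	sock = socket(33, 2, 2);	/* magic numbers kept from the report */
	getsockopt(sock, SOL_SOCKET, SO_BSDCOMPAT, buf, &len);
	printf("%x%x%x%x\n", buf[0], buf[1], buf[2], buf[3]);
	close(sock);
	return 0;
}

Here is a patch that fixes this bug by initializing v.val just after its
declaration.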
Signed-off-by: Clément Lecigne <clement.lecigne@netasq.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Feb, 2009 1 commit

net: Reexport sock_alloc_send_pskb · 4cc7f68d

Authored by Herbert Xu on Feb 04, 2009

The function sock_alloc_send_pskb is completely useless if not
exported since most of the code in it won't be used as is.  In
fact, this code has already been duplicated in the tun driver.

Now that we need accounting in the tun driver, we can in fact
use this function as is.  So this patch marks it for export again.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Dec, 2008 1 commit

Revert "net: release skb->dst in sock_queue_rcv_skb()" · 49ad9599

Authored by David S. Miller on Dec 17, 2008

This reverts commit 70355602.

As pointed out by Mark McLoughlin IP_PKTINFO cmsg data is one
post-queueing user, so this optimization is not valid right
now.
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Nov, 2008 2 commits

net: release skb->dst in sock_queue_rcv_skb() · 70355602

Authored by Eric Dumazet on Nov 26, 2008

When queuing a skb to sk->sk_receive_queue, we can release its dst,
which is no longer needed.  Since current cpu did the dst_hold(), refcount is
probably still hot in this cpu's caches.

This avoids readers having to access the original dst to decrement its
refcount, possibly a long time after packet reception. This should
speed up UDP and RAW receive path.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Use a percpu_counter for sockets_allocated · 1748376b

Authored by Eric Dumazet on Nov 25, 2008

Instead of using one atomic_t per protocol, use a percpu_counter
for "sockets_allocated", to reduce cache line contention on
heavy duty network servers. 

Note : We revert commit (248969ae
net: af_unix can make unix_nr_socks visbile in /proc),
since it is not anymore used after sock_prot_inuse_add() addition
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
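
The general idea behind a per-CPU counter can be sketched as one rarely touched shared total plus one frequently touched slot per CPU. The sketch is single-threaded, the constants and names are hypothetical, and it does not reproduce the kernel's percpu_counter implementation.

```c
#define NR_SLOTS 4	/* hypothetical number of CPUs */
#define BATCH 8		/* local delta tolerated before folding */

struct pcpu_counter_sketch {
	long global;		/* shared, touched rarely */
	long local[NR_SLOTS];	/* one slot per CPU, touched often */
};

/* Add to the caller's slot; fold into the shared total once it grows. */
void pcpu_add(struct pcpu_counter_sketch *c, int cpu, long delta)
{
	long v = c->local[cpu] + delta;

	if (v >= BATCH || v <= -BATCH) {
		c->global += v;	/* a real version would serialize this */
		v = 0;
	}
	c->local[cpu] = v;
}

/* Cheap, approximate read of the shared total only. */
long pcpu_read(const struct pcpu_counter_sketch *c)
{
	return c->global;
}

/* Exact read: shared total plus every per-CPU remainder. */
long pcpu_sum(const struct pcpu_counter_sketch *c)
{
	long sum = c->global;
	int i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum;
}
```

Most updates touch only the caller's own slot, so the shared cache line is written once per batch instead of once per socket.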

22 Nov, 2008 1 commit

net: Fix memory leak in the proto_register function · 7e56b5d6

Authored by Catalin Marinas on Nov 21, 2008

If the slub allocator is used, kmem_cache_create() may merge two or more
kmem_cache's into one but the cache name pointer is not updated and
kmem_cache_name() is no longer guaranteed to return the pointer passed
to the former function. This patch stores the kmalloc'ed pointers in the
corresponding request_sock_ops and timewait_sock_ops structures.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

20 Nov, 2008 1 commit

net: make /proc/net/protocols namespace aware · 14e943db

Authored by Eric Dumazet on Nov 19, 2008

Converting /proc/net/protocols to be namespace aware is quite easy
and permits us to use sock_prot_inuse_get().

This provides separate counters for each protocol. For example
we can really count TCPv6 sockets and TCPv4 sockets, while previously,
we had the same value, and this value was not namespace aware.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Nov, 2008 1 commit

net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7

Authored by Eric Dumazet on Nov 16, 2008

RCU was added to UDP lookups, using a fast infrastructure :
- sockets kmem_cache use SLAB_DESTROY_BY_RCU and don't pay the
  price of call_rcu() at freeing time.
- hlist_nulls permits the use of few memory barriers.

This patch uses same infrastructure for TCP/DCCP established
and timewait sockets.

Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
using short lived TCP connections. A followup patch, converting
rwlocks to spinlocks will even speed up this case.

__inet_lookup_established() is pretty fast now that we don't have to
dirty a contended cache line (read_lock/read_unlock)

Only established and timewait hashtable are converted to RCU
(bind table and listen table are still using traditional locking)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14 Nov, 2008 1 commit

lockdep: include/linux/lockdep.h - fix warning in net/bluetooth/af_bluetooth.c · e8f6fbf6

Authored by Ingo Molnar on Nov 12, 2008

fix this warning:

  net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used
  net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used

this is a lockdep macro problem in the !LOCKDEP case.

We cannot convert it to an inline because the macro works on multiple types,
but we can mark the parameter used.

[ also clean up a misaligned tab in sock_lock_init_class_and_name() ]

[ also remove #ifdefs from around af_family_clock_key strings - which
  were certainly added to get rid of the ugly build warnings. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David S. Miller <davem@davemloft.net>
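
"Marking the parameter used" in a no-op macro can be sketched as follows. The macro, the config switch and the table are hypothetical; the point is only the (void) cast in the disabled variant.

```c
/* Hypothetical switch: defined when the debugging facility is built in. */
#ifdef DEBUG_LOCK_SKETCH
#define lock_set_name(lock, name)	((lock)->name_ptr = (name))
#else
/*
 * No-op variant. Evaluating (void)(name) keeps the compiler from warning
 * that a static table passed as `name` is "defined but not used", and,
 * being a macro, it still accepts arguments of any type.
 */
#define lock_set_name(lock, name)	do { (void)(name); } while (0)
#endif

static const char *const key_strings[] = { "sk_lock-AF_SKETCH" };

int init_lock_sketch(void)
{
	int lock = 0;

	lock_set_name(&lock, key_strings[0]);
	return lock;
}
```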

12 Nov, 2008 1 commit

lockdep: include/linux/lockdep.h - fix warning in net/bluetooth/af_bluetooth.c · e25cf3db

Authored by Ingo Molnar on Oct 17, 2008

fix this warning:

  net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used
  net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used

this is a lockdep macro problem in the !LOCKDEP case.

We cannot convert it to an inline because the macro works on multiple types,
but we can mark the parameter used.

[ also clean up a misaligned tab in sock_lock_init_class_and_name() ]

[ also remove #ifdefs from around af_family_clock_key strings - which
  were certainly added to get rid of the ugly build warnings. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

29 Oct, 2008 1 commit

udp: RCU handling for Unicast packets. · 271b72c7

Authored by Eric Dumazet on Oct 29, 2008

Goals are :

1) Optimizing handling of incoming Unicast UDP frames, so that no memory
 writes should happen in the fast path.

 Note: Multicasts and broadcasts still will need to take a lock,
 because doing a full lockless lookup in this case is difficult.

2) No expensive operations in the socket bind/unhash phases :
  - No expensive synchronize_rcu() calls.

  - No added rcu_head in socket structure, increasing memory needs,
  but more important, forcing us to use call_rcu() calls,
  that have the bad property of making sockets structure cold.
  (rcu grace period between socket freeing and its potential reuse
   make this socket being cold in CPU cache).
  David did a previous patch using call_rcu() and noticed a 20%
  impact on TCP connection rates.
  Quoting Christoph Lameter :
   "Right. That results in cacheline cooldown. You'd want to recycle
    the object as they are cache hot on a per cpu basis. That is screwed
    up by the delayed regular rcu processing. We have seen multiple
    regressions due to cacheline cooldown.
    The only choice in cacheline hot sensitive areas is to deal with the
    complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."

  - Because udp sockets are allocated from dedicated kmem_cache,
  use of SLAB_DESTROY_BY_RCU can help here.

Theory of operation :
---------------------

As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
special attention must be taken by readers and writers.

Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
reused, inserted in a different chain or in worst case in the same chain
while readers could do lookups at the same time.

In order to avoid loops, a reader must check each socket found in a chain
really belongs to the chain the reader was traversing. If it finds a
mismatch, lookup must start again at the beginning. This *restart* loop
is the reason we had to use rdlock for the multicast case, because
we don't want to send same message several times to the same socket.

We use RCU only for fast path.
Thus, /proc/net/udp still takes spinlocks.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
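
One way to detect that a walk has strayed onto another chain is to end every chain with a marker that encodes the chain number, in the spirit of the hlist_nulls lists mentioned in the TCP/DCCP commit above. The sketch below is single-threaded and uses hypothetical names. The max_restarts bound exists only so the example terminates, and the re-validation a real lockless reader does after taking a reference is left out.

```c
#include <stddef.h>
#include <stdint.h>

struct node_sketch {
	int key;
	uintptr_t next;	/* pointer to next node, or an odd "nulls" marker */
};

/* End-of-chain marker: low bit set, chain number in the upper bits. */
#define MAKE_NULLS(slot)	(((uintptr_t)(slot) << 1) | 1u)
#define IS_NULLS(p)		((p) & 1u)
#define NULLS_VALUE(p)		((p) >> 1)

/*
 * Look up `key` in the chain starting at *head, which should end with the
 * marker for `slot`. *restarts reports how often the walk started over.
 */
struct node_sketch *lookup_sketch(const uintptr_t *head, unsigned int slot,
				  int key, unsigned int max_restarts,
				  unsigned int *restarts)
{
	uintptr_t p;

	*restarts = 0;
begin:
	for (p = *head; !IS_NULLS(p); p = ((struct node_sketch *)p)->next) {
		struct node_sketch *n = (struct node_sketch *)p;

		if (n->key == key)
			return n;
	}
	/* Ended on another chain's marker: a node moved under us, retry. */
	if (NULLS_VALUE(p) != slot && *restarts < max_restarts) {
		(*restarts)++;
		goto begin;
	}
	return NULL;
}
```

Restarting is harmless for a unicast lookup that returns at most one socket, but it would deliver duplicates if every visited socket received a copy, which matches the reason given above for keeping a lock in the multicast case.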

08 Oct, 2008 1 commit

net: wrap sk->sk_backlog_rcv() · c57943a1

Authored by Peter Zijlstra on Oct 07, 2008

Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the
generic sk_backlog_rcv behaviour.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>

23 Sep, 2008 1 commit

Phonet: global definitions · bce7b154

Authored by Remi Denis-Courmont on Sep 22, 2008

Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Sep, 2008 1 commit

ISDN sockets: add missing lockdep strings · 821c92f2

Authored by Rémi Denis-Courmont on Sep 18, 2008

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Jul, 2008 1 commit

net: Update entry in af_family_clock_key_strings · b4942af6

Authored by Oliver Hartkopp on Jul 23, 2008

In the merge phase of the CAN subsystem the 
af_family_clock_key_strings[] have been added to sock.c in commit 
443aef0e 
(lockdep: fixup sk_callback_lock annotation). This trivial patch adds 
the missing name for address family 29 (AF_CAN).
Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Jul, 2008 1 commit

sock: add net to prot->enter_memory_pressure callback · 5c52ba17

Authored by Pavel Emelyanov on Jul 16, 2008

The tcp_enter_memory_pressure calls NET_INC_STATS, but has
nowhere to get the net from.

I decided to add a sk argument, not the net itself, only to factor
all the required sock_net(sk) calls inside the enter_memory_pressure 
callback itself.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Jun, 2008 1 commit

net: Add sk_set_socket() helper. · 972692e0

Authored by David S. Miller on Jun 17, 2008

In order to more easily grep for all things that set
sk->sk_socket, add sk_set_socket() helper inline function.

Suggested (although only half-seriously) by Evgeniy Polyakov.
Signed-off-by: David S. Miller <davem@davemloft.net>
