Commits · a4e1adb98f7376dccbfaee7c375e6378b6c21985 · openanolis / cloud-kernel

22 Jun, 2015 1 commit

netlink: add API to retrieve all group memberships · b42be38b

Committed by David Herrmann 9 years ago

This patch adds getsockopt(SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS) to
retrieve all groups a socket is a member of. Currently, we have to use
getsockname() and look at the nl.nl_groups bitmask. However, this mask is
limited to 32 groups. Hence, similar to NETLINK_ADD_MEMBERSHIP and
NETLINK_DROP_MEMBERSHIP, this adds a separate sockopt to manage group IDs
higher than 32.

This new NETLINK_LIST_MEMBERSHIPS option takes a pointer to __u32 and the
size of the array. The array is filled with the full membership-set of the
socket, and the required array size is returned in optlen. Hence,
user-space can retry with a properly sized array in case it was too small.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
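
The retry protocol described above can be sketched from user space roughly as follows. This is a minimal sketch, not code from the patch: the helper names nl_list_memberships() and nl_group_is_member() are hypothetical, and the fallback #defines assume the UAPI values SOL_NETLINK = 270 and NETLINK_LIST_MEMBERSHIPS = 9 for older headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Fallbacks for older userspace headers (assumed UAPI values). */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_LIST_MEMBERSHIPS
#define NETLINK_LIST_MEMBERSHIPS 9
#endif

/*
 * Hypothetical helper: fetch the full membership set of netlink socket fd.
 * On success returns 0, stores a malloc()ed array in *set (NULL when the
 * set is empty) and its length in 32-bit words in *words.
 */
int nl_list_memberships(int fd, uint32_t **set, size_t *words)
{
	uint32_t *buf = NULL;
	socklen_t cap = 0;

	for (;;) {
		socklen_t len = cap;
		uint32_t *tmp;

		if (getsockopt(fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS,
			       buf, &len) < 0) {
			free(buf);
			return -1;
		}
		if (len <= cap) {	/* the buffer was large enough */
			*set = buf;
			*words = len / sizeof(uint32_t);
			return 0;
		}
		/* Too small: optlen now holds the required size, so retry. */
		tmp = realloc(buf, len);
		if (!tmp) {
			free(buf);
			return -1;
		}
		buf = tmp;
		cap = len;
	}
}

/* Group IDs start at 1; group g is bit (g - 1) of the returned bitmap. */
int nl_group_is_member(const uint32_t *set, size_t words, unsigned int group)
{
	assert(set != NULL || words == 0);
	if (group == 0 || (group - 1) / 32 >= words)
		return 0;
	return (set[(group - 1) / 32] >> ((group - 1) % 32)) & 1u;
}
```

A caller would typically invoke nl_list_memberships() once after joining groups, test individual groups with nl_group_is_member(), and free() the array afterwards.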

18 May, 2015 1 commit

netlink: Use random autobind rover · b9fbe709

Committed by Herbert Xu 9 years ago

Currently we use a global rover to select a port ID that is unique.
This used to work consistently when it was protected with a global
lock.  However as we're now lockless, the global rover can exhibit
pathological behaviour should multiple threads all stomp on it at
the same time.

Granted this will eventually resolve itself but the process is
suboptimal.

This patch replaces the global rover with a pseudorandom starting
point to avoid this issue.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
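
As a rough user-space model of the idea (not the kernel code), each caller can try its preferred port ID first and, on collision, walk downward from its own pseudorandom starting point instead of a shared global rover. The names rover_random_start() and autobind_portid(), the lookup over a plain array, the use of rand(), and the retry bound are all assumptions made for illustration; the range [INT32_MIN, -4097] for fallback IDs is likewise assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ROVER_MAX (-4097)	/* fallback IDs lie in [INT32_MIN, ROVER_MAX] */

/* Pseudorandom starting point in [INT32_MIN, ROVER_MAX]. */
int32_t rover_random_start(void)
{
	/* Number of values in the fallback range. */
	uint32_t span = (uint32_t)((int64_t)ROVER_MAX - INT32_MIN + 1);

	return (int32_t)((int64_t)INT32_MIN +
			 (int64_t)((uint32_t)rand() % span));
}

/* Stand-in for the hash table lookup: linear search over bound IDs. */
static int portid_in_use(int32_t portid, const int32_t *used, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (used[i] == portid)
			return 1;
	return 0;
}

/*
 * Try the preferred ID first; on collision fall back to a per-call rover.
 * Returns 0 and stores the chosen ID in *out, or -1 after too many tries.
 */
int autobind_portid(int32_t preferred, const int32_t *used, size_t n,
		    int32_t *out)
{
	int32_t portid = preferred;
	int32_t rover = rover_random_start();
	int tries;

	assert(out != NULL);
	for (tries = 0; tries < 1024; tries++) {	/* bound is arbitrary */
		if (!portid_in_use(portid, used, n)) {
			*out = portid;
			return 0;
		}
		/* Collision: continue from this caller's own rover. */
		portid = rover;
		rover = (rover == INT32_MIN) ? ROVER_MAX : rover - 1;
	}
	return -1;
}
```

Because every caller starts from a different point, concurrent callers are unlikely to keep colliding on the same sequence of candidate IDs.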

17 May, 2015 1 commit

netlink: Reset portid after netlink_insert failure · c0bb07df

Committed by Herbert Xu 9 years ago

The commit c5adde94 ("netlink:
eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
because it doesn't reset portid after a failed netlink_insert.

This means that should autobind fail the first time around, then
the socket will be stuck in limbo as it can never be bound again
since it already has a non-zero portid.

Fixes: c5adde94 ("netlink: eliminate nl_sk_hash_lock")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

15 May, 2015 1 commit

netlink: move nl_table in read_mostly section · 91dd93f9

Committed by Eric Dumazet 9 years ago

netlink sockets creation and deletion heavily modify nl_table_users
and nl_table_lock.

If nl_table is sharing one cache line with one of them, netlink
performance is really bad on SMP.

ffffffff81ff5f00 B nl_table
ffffffff81ff5f0c b nl_table_users

Putting nl_table in read_mostly section increased performance
of my open/delete netlink sockets test by about 80%.

This came up while diagnosing a getaddrinfo() problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

11 May, 2015 2 commits

netlink: Create kernel netlink sockets in the proper network namespace · 13d3078e

Committed by Eric W. Biederman 9 years ago

Utilize the new functionality of sk_alloc so that nothing needs to be
done to suppress the reference counting on kernel sockets.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Pass kern from net_proto_family.create to sk_alloc · 11aa9c28

Committed by Eric W. Biederman 9 years ago

In preparation for changing how struct net is refcounted
on kernel sockets, pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10 May, 2015 2 commits

netlink: allow to listen "all" netns · 59324cf3

Committed by Nicolas Dichtel 9 years ago

More accurately, listen to all netns that have a nsid assigned into the netns
where the netlink socket is opened.
For this purpose, a netlink socket option is added:
NETLINK_LISTEN_ALL_NSID. When this option is set on a netlink socket, this
socket will receive netlink notifications from all netns that have a nsid
assigned into the netns where the socket has been opened. The nsid is sent
to userland via ancillary data.

With this patch, a daemon needs only one socket to listen to many netns. This
is useful when the number of netns is high.

Because 0 is a valid value for a nsid, the field nsid_is_set indicates if
the field nsid is valid or not. skb->cb is initialized to 0 on skb
allocation, thus we are sure that we will never send a nsid 0 by error to
the userland.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
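
From user space, enabling the option and reading the nsid out of the ancillary data might look like the sketch below. The helper names nl_listen_all_nsid() and nl_msg_nsid() are hypothetical, the control message is assumed to use level SOL_NETLINK and type NETLINK_LISTEN_ALL_NSID with an int payload, and the fallback #defines assume the UAPI values 270 and 8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Fallbacks for older userspace headers (assumed UAPI values). */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_LISTEN_ALL_NSID
#define NETLINK_LISTEN_ALL_NSID 8
#endif

/* Ask for notifications from every netns that has an nsid assigned. */
int nl_listen_all_nsid(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_NETLINK, NETLINK_LISTEN_ALL_NSID,
			  &one, sizeof(one));
}

/*
 * Scan the control data of a message filled in by recvmsg().
 * Returns 1 and stores the nsid if one was attached, 0 otherwise.
 * A return of 0 must not be confused with nsid 0, which is valid.
 */
int nl_msg_nsid(struct msghdr *msg, int *nsid)
{
	struct cmsghdr *cmsg;

	assert(msg != NULL && nsid != NULL);
	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_NETLINK &&
		    cmsg->cmsg_type == NETLINK_LISTEN_ALL_NSID &&
		    cmsg->cmsg_len >= CMSG_LEN(sizeof(int))) {
			memcpy(nsid, CMSG_DATA(cmsg), sizeof(int));
			return 1;
		}
	}
	return 0;
}
```

Returning presence separately from the value mirrors the nsid_is_set idea in the commit message: 0 is a legitimate nsid, so the value alone cannot signal absence.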

netlink: rename private flags and states · cc3a572f

Committed by Nicolas Dichtel 9 years ago

These flags and states have the same prefix (NETLINK_) as netlink socket
options. To avoid confusion and to be able to name a flag like a socket
option, let's use another prefix: NETLINK_[S|F]_.

Note: a comment has been fixed, it was talking about
NETLINK_RECV_NO_ENOBUFS socket option instead of NETLINK_NO_ENOBUFS.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 May, 2015 1 commit

netlink: Remove max_size setting · edac450d

Committed by Herbert Xu 9 years ago

We currently limit the hash table size to 64K which is very bad
as even 10 years ago it was relatively easy to generate millions
of sockets.

Since the hash table is naturally limited by memory allocation
failure, we don't really need an explicit limit so this patch
removes it.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Apr, 2015 1 commit

net: fix crash in build_skb() · 2ea2f62c

Committed by Eric Dumazet 9 years ago

When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.

In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemalloc.

This means netlink no longer has to hack skb->head_frag.

[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067]    (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Mar, 2015 1 commit

rhashtable: provide len to obj_hashfn · 49f7b33e

Committed by Patrick McHardy 9 years ago

nftables sets will be converted to use so called setextensions, moving
the key to a non-fixed position. To hash it, the obj_hashfn must be used,
however it so far doesn't receive the length parameter.

Pass the key length to obj_hashfn() and convert existing users.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

25 Mar, 2015 1 commit

rhashtable: Disable automatic shrinking by default · b5e2c150

Committed by Thomas Graf 9 years ago

Introduce a new bool automatic_shrinking to require the
user to explicitly opt-in to automatic shrinking of tables.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Mar, 2015 1 commit

netlink: Use default rhashtable hashfn · 11b58ba1

Committed by Herbert Xu 9 years ago

This patch removes the explicit jhash value for the hashfn parameter
of rhashtable.  As the key length is a multiple of 4, this means that
we will actually end up using jhash2.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

21 Mar, 2015 2 commits

netlink: Remove netlink_compare_arg.trailer · 8f2ddaac

Committed by Herbert Xu 9 years ago

Instead of computing the offset from trailer, this patch computes
netlink_compare_arg_len from the offset of portid and then adds 4
to it.  This allows trailer to be removed.

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
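
The length computation can be illustrated with a simplified stand-in for the key structure. This is only a sketch: struct compare_arg and compare_arg_len() are hypothetical names, and the namespace is modelled as an opaque pointer rather than the kernel's own type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the lookup key (hypothetical layout). */
struct compare_arg {
	const void *net;	/* network namespace, opaque in this sketch */
	uint32_t portid;	/* netlink port ID */
	/* the compiler may add tail padding after portid */
};

/* Key length: everything up to and including portid, no tail padding. */
#define COMPARE_ARG_LEN \
	(offsetof(struct compare_arg, portid) + sizeof(uint32_t))

size_t compare_arg_len(void)
{
	/* The hashed prefix can never exceed the structure itself. */
	assert(COMPARE_ARG_LEN <= sizeof(struct compare_arg));
	return COMPARE_ARG_LEN;
}
```

Deriving the length from the offset of the last real member means no marker field is needed, and any tail padding stays outside the hashed bytes.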

netlink: Move namespace into hash key · c428ecd1

Committed by Herbert Xu 9 years ago

Currently the name space is a de facto key because it has to match
before we find an object in the hash table.  However, it isn't in
the hash value so all objects from different name spaces with the
same port ID hash to the same bucket.

This is bad as the number of name spaces is unbounded.

This patch fixes this by using the namespace when doing the hash.

Because the namespace field doesn't lie next to the portid field
in the netlink socket, this patch switches over to the rhashtable
interface without a fixed key.

This patch also uses the new inlined rhashtable interface where
possible.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Mar, 2015 1 commit

netlink: Use rhashtable max_size instead of max_shift · b06eee59

Committed by Herbert Xu 9 years ago

This patch converts netlink to use rhashtable max_size instead
of the obsolete max_shift.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 Mar, 2015 1 commit

net: Remove iocb argument from sendmsg and recvmsg · 1b784140

Committed by Ying Xue 10 years ago

Now that TIPC no longer depends on the iocb argument in its internal
implementations of the sendmsg() and recvmsg() hooks defined in the proto
structure, no user of the iocb argument remains.
We can therefore drop the redundant iocb argument completely from all
implementations of both sendmsg() and recvmsg() in the entire
networking stack.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

28 Feb, 2015 1 commit

rhashtable: remove indirection for grow/shrink decision functions · 4c4b52d9

Committed by Daniel Borkmann 10 years ago

Currently, all real users of rhashtable default their grow and shrink
decision functions to rht_grow_above_75() and rht_shrink_below_30(),
so that there's currently no need to have this explicitly selectable.

It can/should be generic and private inside rhashtable until a real
use case pops up. Since we can make this private, we'll save us this
additional indirection layer and can improve insertion/deletion time
as well.

Reference: http://patchwork.ozlabs.org/patch/443040/

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
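
The two thresholds named above can be modelled as plain predicates, which is roughly what becomes possible once the decision no longer goes through a function pointer. The names should_grow() and should_shrink() and the exact integer arithmetic are assumptions for illustration, not the kernel's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Grow once the table holds more than 75% of its bucket count. */
bool should_grow(size_t nelems, size_t size)
{
	assert(size > 0);
	return nelems > size / 4 * 3;
}

/* Shrink once the table holds fewer than 30% of its bucket count. */
bool should_shrink(size_t nelems, size_t size)
{
	assert(size > 0);
	return nelems < size * 3 / 10;
}
```

With direct calls like these the compiler can inline the check into the insert and remove paths, which is the saving the commit message refers to.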

05 Feb, 2015 1 commit

netlink: Use rhashtable walk iterator · 56d28b1e

Committed by Herbert Xu 10 years ago

This patch gets rid of the manual rhashtable walk in netlink
which touches rhashtable internals that should not be exposed.
It does so by using the rhashtable iterator primitives.

In fact the existing code was very buggy.  Some sockets weren't
shown at all while others were shown more than once.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Feb, 2015 1 commit

netlink: make the check for "send from tx_ring" deterministic · a8866ff6

Committed by Al Viro 10 years ago

As it is, zero msg_iovlen means that the first iovec in the kernel
array of iovecs is left uninitialized, so checking if its ->iov_base
is NULL is random.  Since the real users of that thing are doing
sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and
msg_iov[0] = {NULL, 0}, which is what this test is trying to catch.
As suggested by davem, let's just check that msg_iovlen was 1 and
msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches
what we want to catch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
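
The well-defined form of the test can be written as a small predicate over a struct msghdr. This is a user-space illustration with a hypothetical name, is_tx_ring_kick(), not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * True only for the "sendto(fd, NULL, 0, ...)" shape: exactly one iovec
 * whose base is NULL. Checking msg_iovlen first means msg_iov[0] is never
 * read when the array is empty.
 */
bool is_tx_ring_kick(const struct msghdr *msg)
{
	assert(msg != NULL);
	return msg->msg_iovlen == 1 && msg->msg_iov[0].iov_base == NULL;
}
```

The short-circuit order matters: with msg_iovlen == 0 the first element is not inspected at all, which removes the dependence on uninitialized data described above.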

31 Jan, 2015 1 commit

netlink: fix wrong subscription bitmask to group mapping in · 8b7c36d8

Committed by Pablo Neira 10 years ago

The subscription bitmask passed via struct sockaddr_nl is converted to
the group number when calling the netlink_bind() and netlink_unbind()
callbacks.

The conversion is however incorrect since bitmask (1 << 0) needs to be
mapped to group number 1. Note that you cannot specify the group number 0
(usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP
since this is rejected through -EINVAL.

This problem became noticeable since 97840cb6 ("netfilter: nfnetlink:
fix insufficient validation in nfnetlink_bind") when binding to bitmask
(1 << 0) in ctnetlink.

Reported-by: Andre Tomt <andre@tomt.net>
Reported-by: Ivan Delalande <colona@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
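
The corrected mapping is an off-by-one between bit positions and group numbers, which a short sketch makes concrete. The function name nl_mask_to_groups() is hypothetical and the code only models the conversion, not the bind callbacks themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Expand a 32-bit subscription bitmask into group numbers.
 * Bit n of the mask corresponds to group n + 1, so (1 << 0) is group 1.
 * Returns the number of groups written to groups[].
 */
size_t nl_mask_to_groups(uint32_t mask, unsigned int groups[32])
{
	size_t n = 0;
	unsigned int bit;

	assert(groups != NULL);
	for (bit = 0; bit < 32; bit++) {
		if (mask & (UINT32_C(1) << bit))
			groups[n++] = bit + 1;	/* never yields group 0 */
	}
	return n;
}
```

For example, a mask of (1 << 0) | (1 << 4) expands to groups 1 and 5.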

29 Jan, 2015 1 commit

net: remove sock_iocb · 7cc05662

Committed by Christoph Hellwig 10 years ago

The sock_iocb structure is allocated on the stack for each read/write-like
operation on sockets, and contains various fields of which only the
embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
Get rid of the sock_iocb and put a msghdr directly on the stack and pass
the scm_cookie explicitly to netlink_mmap_sendmsg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

27 Jan, 2015 1 commit

netlink: Kill redundant net argument in netlink_insert · 8ea65f4a

Committed by Herbert Xu 10 years ago

The socket already carries the net namespace with it so there is
no need to be passing another net around.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Jan, 2015 1 commit

netlink: make nlmsg_end() and genlmsg_end() void · 053c095a

Committed by Johannes Berg 10 years ago

Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17 Jan, 2015 2 commits

genetlink: synchronize socket closing and family removal · ee1c2442

Committed by Johannes Berg 10 years ago

In addition to the problem Jeff Layton reported, I looked at the code
and reproduced the same warning by subscribing and removing the genl
family with a socket still open. This is a fairly tricky race which
originates in the fact that generic netlink allows the family to go
away while sockets are still open - unlike regular netlink which has
a module refcount for every open socket so in general this cannot be
triggered.

Trying to resolve this issue by the obvious locking isn't possible as
it will result in deadlocks between unregistration and group unbind
notification (which incidentally lockdep doesn't find due to the home
grown locking in the netlink table.)

To really resolve this, introduce a "closing socket" reference counter
(for generic netlink only, as it's the only affected family) in the
core netlink code and use that in generic netlink to wait for all the
sockets that are being closed at the same time as a generic netlink
family is removed.

This fixes the race that when a socket is closed, it should call
the unbind, but if the family is removed at the same time the unbind
will not find it, leading to the warning. The real problem though is
that in this case the unbind could actually find a new family that is
registered to have a multicast group with the same ID, and call its
mcast_unbind() leading to confusion.

Also remove the warning since it would still trigger, but is now no
longer a problem.

This also moves the code in af_netlink.c to before unreferencing the
module to avoid having the same problem in the normal non-genl case.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

genetlink: disallow subscribing to unknown mcast groups · 5ad63005

Committed by Johannes Berg 10 years ago

Jeff Layton reported that he could trigger the multicast unbind warning
in generic netlink using trinity. I originally thought it was a race
condition between unregistering the generic netlink family and closing
the socket, but there's a far simpler explanation: genetlink currently
allows subscribing to groups that don't (yet) exist, and the warning is
triggered when unsubscribing again while the group still doesn't exist.

Originally, I had a warning in the subscribe case and accepted it out of
userspace API concerns, but the warning was of course wrong and removed
later.

However, I now think that allowing userspace to subscribe to groups that
don't exist is wrong and could possibly become a security problem:
Consider a (new) genetlink family implementing a permission check in
the mcast_bind() function similar to what the audit code does today;
it would be possible to bypass the permission check by guessing the ID
and subscribing to the group before it exists. This is only possible in case a
family like that would be dynamically loaded, but it doesn't seem like a
huge stretch, for example wireless may be loaded when you plug in a USB
device.

To avoid this, reject such subscription attempts.

If this ends up causing userspace issues we may need to add a workaround
in af_netlink to deny such requests but not return an error.

Reported-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16 Jan, 2015 1 commit

netlink: Fix netlink_insert EADDRINUSE error · 919d9db9

Committed by Herbert Xu 10 years ago

The patch c5adde94 ("netlink:
eliminate nl_sk_hash_lock") introduced a bug where the EADDRINUSE
error has been replaced by ENOMEM.  This patch rectifies that
problem.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14 Jan, 2015 1 commit

netlink: eliminate nl_sk_hash_lock · c5adde94

Committed by Ying Xue 10 years ago

As rhashtable_lookup_compare_insert() can guarantee the process
of search and insertion is atomic, it's safe to eliminate the
nl_sk_hash_lock. After this, object insertion or removal will
be protected with per bucket lock on write side while object
lookup is guarded with rcu read lock on read side.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Jan, 2015 4 commits

netlink: Lockless lookup with RCU grace period in socket release · 21e4902a

Committed by Thomas Graf 10 years ago

Defers the release of the socket reference using call_rcu() to
allow using an RCU read-side protected call to rhashtable_lookup()

This restores behaviour and performance gains as previously
introduced by e341694e ("netlink: Convert netlink_lookup() to use
RCU protected hash table") without the side effect of severely
delayed socket destruction.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

rhashtable: Per bucket locks & deferred expansion/shrinking · 97defe1e

Committed by Thomas Graf 10 years ago

Introduces an array of spinlocks to protect bucket mutations. The number
of spinlocks per CPU is configurable and selected based on the hash of
the bucket. This allows for parallel insertions and removals of entries
which do not share a lock.

The patch also defers expansion and shrinking to a worker queue which
allows insertion and removal from atomic context. Insertions and
deletions may occur in parallel to it and are only held up briefly
while the particular bucket is linked or unzipped.

Mutations of the bucket table pointer are protected by a new mutex; read
access is RCU protected.

In the event of an expansion or shrinking, the new bucket table allocated
is exposed as a so called future table as soon as the resize process
starts.  Lookups, deletions, and insertions will briefly use both tables.
The future table becomes the main table after an RCU grace period and
initial linking of the old to the new table was performed. Optimization
of the chains to make use of the new number of buckets follows only once the
new table is in use.

The side effect of this is that during that RCU grace period, a bucket
traversal using any rht_for_each() variant on the main table will not see
any insertions performed during the RCU grace period which would at that
point land in the future table. The lookup will see them as it searches
both tables if needed.

Having multiple insertions and removals occur in parallel requires nelems
to become an atomic counter.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

rhashtable: Convert bucket iterators to take table and index · 88d6ed15

Committed by Thomas Graf 10 years ago

This patch is in preparation to introduce per bucket spinlocks. It
extends all iterator macros to take the bucket table and bucket
index. It also introduces a new rht_dereference_bucket() to
handle protected accesses to buckets.

It introduces a barrier() to the RCU iterators to prevent
the compiler from caching the first element.

The lockdep verifier is introduced as a stub which always succeeds
and is properly implemented in the next patch when the locks are
introduced.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

rhashtable: Do hashing inside of rhashtable_lookup_compare() · 8d24c0b4

Committed by Thomas Graf 10 years ago

Hash the key inside of rhashtable_lookup_compare() like
rhashtable_lookup() does. This allows simplifying the hashing
functions and keeping them private.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Dec, 2014 1 commit

genetlink: A genl_bind() to an out-of-range multicast group should not WARN(). · dc97a1a9

Committed by David S. Miller 10 years ago

Users can request to bind to arbitrary multicast groups, so warning
when the requested group number is out of range is not appropriate.

And with the warning removed, and the 'err' variable properly given
an initial value, we can remove 'found' altogether.

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

27 Dec, 2014 5 commits

netlink/genetlink: pass network namespace to bind/unbind · 023e2cfa

Committed by Johannes Berg 10 years ago

Netlink families can exist in multiple namespaces, and for the most
part multicast subscriptions are per network namespace. Thus it only
makes sense to have bind/unbind notifications per network namespace.

To achieve this, pass the network namespace of a given client socket
to the bind/unbind functions.

Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
This isn't really a problem if it is accepted since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected (it's also
possible to silently (without telling the family) accept it, but it
would also have to be ignored on unbind so families that take any
kind of action on bind/unbind won't do unnecessary work for invalid
clients like that).

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

genetlink: pass multicast bind/unbind to families · c380d9a7

Committed by Johannes Berg 10 years ago

In order to make use of the newly fixed multicast bind/unbind
functionality in generic netlink, pass the calls down to the
appropriate family.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: call unbind when releasing socket · 7d68536b

Committed by Johannes Berg 10 years ago

Currently, netlink_unbind() is only called when the socket
explicitly unbinds, which limits its usefulness (luckily
there are no users of it yet anyway.)

Call netlink_unbind() also when a socket is released, so it
becomes possible to track listeners with this callback and
without also implementing a netlink notifier (and checking
netlink_has_listeners() in there.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: update listeners directly when removing socket · b10dcb3b

Committed by Johannes Berg 10 years ago

The code is now confusing to read - first in one function down
(netlink_remove) any group subscriptions are implicitly removed
by calling __sk_del_bind_node(), but the subscriber database is
only updated far later by calling netlink_update_listeners().

Move the latter call to just after removal from the list so it
is easier to follow the code.

This also enables moving the locking inside the kernel-socket
conditional, which improves the normal socket destruction path.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: rename netlink_unbind() to netlink_undo_bind() · 02c81ab9

Committed by Johannes Berg 10 years ago

The new name is more expressive - this isn't a generic unbind
function but rather only a little undo helper for use only in
netlink_bind().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Dec, 2014 2 commits

netlink: Don't reorder loads/stores before marking mmap netlink frame as available · a18e6a18

Committed by Thomas Graf 10 years ago

Each mmap Netlink frame contains a status field which indicates
whether the frame is unused, reserved, contains data or needs to
be skipped. Both loads and stores may not be reordered and must
complete before the status field is changed and another CPU might
pick up the frame for use. Use an smp_mb() to cover needs of both
types of callers to netlink_set_status(), callers which have been
reading data from the frame, and callers which have been
filling or releasing and thus writing to the frame.

- Example code path requiring a smp_rmb():
  memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len);
  netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED);

- Example code path requiring a smp_wmb():
  hdr->nm_uid	= from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid);
  hdr->nm_gid	= from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid);
  netlink_frame_flush_dcache(hdr);
  netlink_set_status(hdr, NL_MMAP_STATUS_VALID);

Fixes: f9c228 ("netlink: implement memory mapped recvmsg()")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
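
A user-space analogue of the ordering requirement, using C11 atomics instead of the kernel's smp_mb(), might look like the sketch below. The struct frame layout and the functions frame_set_status(), frame_publish() and frame_consume() are hypothetical; a sequentially consistent fence stands in for the full barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum frame_status { FRAME_UNUSED, FRAME_RESERVED, FRAME_VALID, FRAME_SKIP };

/* Hypothetical shared frame: a status word followed by the payload. */
struct frame {
	_Atomic uint32_t status;
	uint32_t len;
	unsigned char data[64];
};

/*
 * Full fence first, so that all earlier loads from and stores to the
 * frame are ordered before the status change that hands it over.
 */
void frame_set_status(struct frame *f, enum frame_status s)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&f->status, (uint32_t)s, memory_order_relaxed);
}

/* Writer path: fill the frame, then mark it valid. */
void frame_publish(struct frame *f, const void *src, uint32_t len)
{
	assert(len <= sizeof(f->data));
	memcpy(f->data, src, len);
	f->len = len;
	frame_set_status(f, FRAME_VALID);
}

/* Reader path: copy the payload out, then release the frame. */
uint32_t frame_consume(struct frame *f, void *dst)
{
	uint32_t len;

	if (atomic_load_explicit(&f->status, memory_order_acquire) !=
	    FRAME_VALID)
		return 0;
	len = f->len;
	memcpy(dst, f->data, len);
	frame_set_status(f, FRAME_UNUSED);
	return len;
}
```

One fence in the single status setter covers both paths: the writer's stores and the reader's loads are each ordered before the status word changes.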

netlink: Always copy on mmap TX. · 4682a035

Committed by David Miller 10 years ago

Checking the file f_count and the nlk->mapped count is not completely
sufficient to prevent the mmap'd area contents from changing from
under us during netlink mmap sendmsg() operations.

Be careful to sample the header's length field only once, because this
could change from under us as well.

Fixes: 5fd96123 ("netlink: implement memory mapped sendmsg()")
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
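
The "sample once, then copy" pattern can be sketched in user space as follows. The function copy_frame() and its parameters are hypothetical; the volatile qualifier is used here only to force a single read of a length field that another party may change concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy a payload out of memory that another party can modify at any time.
 * The length is read exactly once into a local variable; the bounds check
 * and the copy both use that local value, never the shared field again.
 * Returns the number of bytes copied, or -1 if the length is out of range.
 */
long copy_frame(const volatile uint32_t *shared_len,
		const unsigned char *shared_data,
		unsigned char *dst, size_t dst_size, size_t max_len)
{
	uint32_t len;

	assert(shared_len != NULL && shared_data != NULL && dst != NULL);
	len = *shared_len;	/* the only read of the shared length */
	if (len > max_len || len > dst_size)
		return -1;
	memcpy(dst, shared_data, len);	/* private copy from here on */
	return (long)len;
}
```

Re-reading the shared length after the check would reopen the window in which it can grow past the validated bound, which is why the local copy is the only value used.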
