Commits · e341694e3eb57fcda9f1adc7bfea42fe080d8d7a · openanolis / cloud-kernel

03 Aug, 2014 1 commit

netlink: Convert netlink_lookup() to use RCU protected hash table · e341694e

Authored by Thomas Graf on Aug 02, 2014

Heavy Netlink users such as Open vSwitch spend a considerable amount of
time in netlink_lookup() due to the read-lock on nl_table_lock. Use of
RCU relieves the lock contention.

Makes use of the new resizable hash table to avoid locking on the
lookup.

The hash table will grow if entries exceed 75% of table size up to a
total table size of 64K. It will automatically shrink if usage falls
below 30%.

Also splits nl_table_lock into a separate mutex to protect hash table
mutations and allow synchronize_rcu() to sleep while waiting for readers
during expansion and shrinking.

Before:
   9.16%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   6.42%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   6.26%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   6.23%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.79%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
   4.37%  kpktgend_0  [kernel.kallsyms]  [k] memcpy
   3.60%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   2.69%  kpktgend_0  [kernel.kallsyms]  [k] jhash2

After:
  15.26%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   8.12%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   7.92%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   5.11%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.11%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   4.06%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
   3.90%  kpktgend_0  [kernel.kallsyms]  [k] jhash2
   [...]
   0.67%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e341694e
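The grow and shrink thresholds described in this commit can be pictured with a small userspace sketch. The names `sketch_resize_decision`, `SKETCH_MAX_SIZE` and `SKETCH_MIN_SIZE` are hypothetical, and the minimum size is an assumption; the kernel's resizable hash table has its own helpers, which this does not reproduce.

```c
#include <stddef.h>

enum sketch_resize { SKETCH_KEEP, SKETCH_GROW, SKETCH_SHRINK };

#define SKETCH_MAX_SIZE (64 * 1024)	/* 64K cap from the commit message */
#define SKETCH_MIN_SIZE 4		/* assumed lower bound, not from the source */

/* Decide whether a table holding `entries` items in `size` buckets
 * should be resized: grow above 75% usage, shrink below 30%. */
static enum sketch_resize sketch_resize_decision(size_t entries, size_t size)
{
	if (entries * 4 > size * 3 && size < SKETCH_MAX_SIZE)
		return SKETCH_GROW;
	if (entries * 10 < size * 3 && size > SKETCH_MIN_SIZE)
		return SKETCH_SHRINK;
	return SKETCH_KEEP;
}
```

Integer cross-multiplication avoids floating point, which is the usual choice for this kind of check.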

01 Aug, 2014 1 commit

netlink: Use PAGE_ALIGNED macro · 74e83b23

Authored by Tobias Klauser on Jul 31, 2014

Use PAGE_ALIGNED(...) instead of IS_ALIGNED(..., PAGE_SIZE).
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

74e83b23
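To illustrate why the two spellings are interchangeable, here is a userspace sketch with stand-in macros. The `SKETCH_` names and the 4096-byte page size are assumptions for illustration; the kernel supplies its own PAGE_SIZE, IS_ALIGNED and PAGE_ALIGNED definitions.

```c
#include <stdint.h>

/* Userspace stand-ins, not the kernel definitions. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_IS_ALIGNED(x, a) \
	((((uintptr_t)(x)) & ((uintptr_t)(a) - 1)) == 0)
/* The page-specific helper simply fixes the alignment argument. */
#define SKETCH_PAGE_ALIGNED(addr) \
	SKETCH_IS_ALIGNED((addr), SKETCH_PAGE_SIZE)
```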

17 Jul, 2014 1 commit

netlink: remove bool varible · 498044bb

Authored by Varka Bhadram on Jul 16, 2014

This patch removes the bool variable 'pass'.
If the switch case exists, return true; otherwise return false.
Signed-off-by: Varka Bhadram <varkab@cdac.in>
Signed-off-by: David S. Miller <davem@davemloft.net>

498044bb

10 Jul, 2014 1 commit

netlink: Fix handling of error from netlink_dump(). · ac30ef83

Authored by Ben Pfaff on Jul 09, 2014

netlink_dump() returns a negative errno value on error.  Until now,
netlink_recvmsg() directly recorded that negative value in sk->sk_err, but
that's wrong since sk_err takes positive errno values.  (This manifests as
userspace receiving a positive return value from the recv() system call,
falsely indicating success.) This bug was introduced in the commit that
started checking the netlink_dump() return value, commit b44d211e (netlink:
handle errors from netlink_dump()).

Multithreaded Netlink dumps are one way to trigger this behavior in
practice, as described in the commit message for the userspace workaround
posted here:
    http://openvswitch.org/pipermail/dev/2014-June/042339.html

This commit also fixes the same bug in netlink_poll(), introduced in commit
cd1df525 (netlink: add flow control for memory mapped I/O).
Signed-off-by: Ben Pfaff <blp@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac30ef83
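The sign convention at the heart of this fix can be shown with a userspace sketch. `struct sketch_sock`, `sketch_dump` and `sketch_record_dump_result` are hypothetical stand-ins, not kernel code; the only point illustrated is that a negative errno return must be negated before it is stored in an sk_err-style field.

```c
#include <errno.h>

/* Hypothetical stand-in for the socket; only the error field matters here. */
struct sketch_sock {
	int sk_err;	/* holds a positive errno value, or 0 */
};

/* Stand-in for a dump routine: returns 0 or a negative errno. */
static int sketch_dump(int fail)
{
	return fail ? -ENOBUFS : 0;
}

/* Record a failure with the sign flipped, so readers of sk_err
 * see a positive errno as they expect. */
static void sketch_record_dump_result(struct sketch_sock *sk, int ret)
{
	if (ret < 0)
		sk->sk_err = -ret;
}
```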

08 Jul, 2014 1 commit

netlink: Fix do_one_broadcast() prototype. · 46c9521f

Authored by Rami Rosen on Jul 01, 2014

This patch changes the prototype of the do_one_broadcast() method so that it will return void.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46c9521f

03 Jun, 2014 1 commit

netlink: Only check file credentials for implicit destinations · 2d7a85f4

Authored by Eric W. Biederman on May 30, 2014

It was possible to get a setuid root or setcap executable to write to
its stdout or stderr (which had been made a netlink socket) and
inadvertently reconfigure the networking stack.

To prevent this we check that both the creator of the socket and
the current application have permission to reconfigure the network
stack.

Unfortunately this breaks Zebra which always uses sendto/sendmsg
and creates its socket without any privileges.

To keep Zebra working don't bother checking if the creator of the
socket has privilege when a destination address is specified.  Instead
rely exclusively on the privileges of the sender of the socket.

Note from Andy: This is exactly Eric's code except for some comment
clarifications and formatting fixes.  Neither I nor, I think, anyone
else is thrilled with this approach, but I'm hesitant to wait on a
better fix since 3.15 is almost here.

Note to stable maintainers: This is a mess.  An earlier series of
patches in 3.15 fix a rather serious security issue (CVE-2014-0181),
but they did so in a way that breaks Zebra.  The offending series
includes:

    commit aa4cf945
    Author: Eric W. Biederman <ebiederm@xmission.com>
    Date:   Wed Apr 23 14:28:03 2014 -0700

        net: Add variants of capable for use on netlink messages

If a given kernel version is missing that series of fixes, it's
probably worth backporting it and this patch.  If that series is
present, then this fix is critical if you care about Zebra.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

2d7a85f4
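The permission rule described in this commit reduces to a small boolean predicate. The sketch below is a userspace illustration under that reading; `struct sketch_msg` and `sketch_netlink_allowed` are hypothetical names, and the real kernel check works on credentials and capabilities rather than plain flags.

```c
#include <stdbool.h>

/* Hypothetical summary of the facts the check depends on. */
struct sketch_msg {
	bool has_explicit_dst;	/* sender passed a destination address */
	bool opener_capable;	/* creator of the socket had the privilege */
	bool sender_capable;	/* current sender has the privilege */
};

/* The sender must always be privileged. The creator's privilege is
 * only consulted when the destination is implicit (plain write()). */
static bool sketch_netlink_allowed(const struct sketch_msg *m)
{
	return (m->has_explicit_dst || m->opener_capable) &&
	       m->sender_capable;
}
```

With this shape, a setuid program tricked into writing to an inherited netlink socket is still refused, while a Zebra-style sender that uses sendto() on a socket it opened without privileges is accepted.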

25 Apr, 2014 2 commits

net: Add variants of capable for use on netlink messages · aa4cf945

Authored by Eric W. Biederman on Apr 23, 2014

netlink_net_capable - The common case use, for operations that are safe on a network namespace
netlink_capable - For operations that are only known to be safe for the global root
netlink_ns_capable - The general case of capable used to handle special cases

__netlink_ns_capable - Same as netlink_ns_capable except taking a netlink_skb_parms instead of
		       the skbuff of a netlink message.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa4cf945

netlink: Rename netlink_capable netlink_allowed · 5187cd05

Authored by Eric W. Biederman on Apr 23, 2014

netlink_capable is a static internal function in af_netlink.c and we
have better uses for the name netlink_capable.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5187cd05

23 Apr, 2014 2 commits

netlink: implement unbind to netlink_setsockopt NETLINK_DROP_MEMBERSHIP · 7774d5e0

Authored by Richard Guy Briggs on Apr 22, 2014

Call the per-protocol unbind function rather than bind function on
NETLINK_DROP_MEMBERSHIP in netlink_setsockopt().
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7774d5e0

netlink: have netlink per-protocol bind function return an error code. · 4f520900

Authored by Richard Guy Briggs on Apr 22, 2014

Have the netlink per-protocol optional bind function return an int error code
rather than void to signal a failure.

This will enable netlink protocols to perform extra checks including
capabilities and permissions verifications when updating memberships in
multicast groups.

In netlink_bind() and netlink_setsockopt() the call to the per-protocol bind
function was moved above the multicast group update to prevent any access to
the multicast socket groups before checking with the per-protocol bind
function.  This will enable the per-protocol bind function to be used to check
permissions which could be denied before making them available, and to avoid
the messy job of undoing the addition should the per-protocol bind function
fail.

The netfilter subsystem seems to be the only one currently using the
per-protocol bind function.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4f520900

12 Apr, 2014 1 commit

net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369

Authored by David S. Miller on Apr 11, 2014

Several spots in the kernel perform a sequence like:

	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it
can be consumed and freed up.  So this skb->len access is potentially
to freed up memory.

Furthermore, the skb->len can be modified by the consumer so it is
possible that the value isn't accurate.

And finally, no actual implementation of this callback actually uses
the length argument.  And since nobody actually cared about its
value, lots of call sites pass arbitrary values in such as '0' and
even '1'.

So just remove the length argument from the callback, that way there
is no confusion whatsoever and all of these use-after-free cases get
fixed as a side effect.

Based upon a patch by Eric Dumazet and his suggestion to audit this
issue tree-wide.
Signed-off-by: David S. Miller <davem@davemloft.net>

676d2369

11 Mar, 2014 1 commit

netlink: autosize skb lengthes · 9063e21f

Authored by Eric Dumazet on Mar 07, 2014

One known problem with netlink is the fact that NLMSG_GOODSIZE is
really small on PAGE_SIZE==4096 architectures, and it is difficult
to know in advance what buffer size is used by the application.

This patch adds an automatic learning of the size.

The first netlink message will still be limited to ~4K, but if the user
supplied bigger buffers, subsequent messages will be able to use up to 16KB.

This speeds up dump() operations by a large factor and should be safe
for legacy applications.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

9063e21f
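The learning behaviour described here can be sketched in userspace as follows. All names are hypothetical, and `SKETCH_GOODSIZE` is a rounded stand-in for NLMSG_GOODSIZE (the commit only says "~4K"); the 16KB cap comes from the commit message.

```c
#include <stddef.h>

#define SKETCH_GOODSIZE   4096u		/* rounded stand-in for NLMSG_GOODSIZE */
#define SKETCH_MAX_LEARN 16384u		/* 16KB cap from the commit message */

/* Hypothetical per-socket state: largest receive buffer seen so far. */
struct sketch_sock {
	size_t max_recvmsg_len;
};

/* Called on each receive: remember the biggest user buffer, capped. */
static void sketch_note_recvmsg(struct sketch_sock *sk, size_t user_buf_len)
{
	if (user_buf_len > sk->max_recvmsg_len)
		sk->max_recvmsg_len = user_buf_len;
	if (sk->max_recvmsg_len > SKETCH_MAX_LEARN)
		sk->max_recvmsg_len = SKETCH_MAX_LEARN;
}

/* Size to use for the next dump message. */
static size_t sketch_dump_alloc_size(const struct sketch_sock *sk)
{
	return sk->max_recvmsg_len > SKETCH_GOODSIZE ?
	       sk->max_recvmsg_len : SKETCH_GOODSIZE;
}
```

Because the learned size never decreases and never drops below the old default, an application that keeps using small buffers sees no change.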

26 Feb, 2014 1 commit

net: Fix permission check in netlink_connect() · 46833a86

Authored by Mike Pecovnik on Feb 24, 2014

netlink_sendmsg() was changed to prevent non-root processes from sending
messages with dst_pid != 0.
netlink_connect() however still only checks if nladdr->nl_groups is set.
This patch modifies netlink_connect() to check for the same condition.
Signed-off-by: Mike Pecovnik <mike.pecovnik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46833a86

18 Feb, 2014 1 commit

netlink: fix checkpatch errors space and "foo *bar" · 23b45672

Authored by Wang Yufen on Feb 17, 2014

ERROR: spaces required and "(foo*)" should be "(foo *)"
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

23b45672

19 Jan, 2014 1 commit

net: add build-time checks for msg->msg_name size · 342dfc30

Authored by Steffen Hurrle on Jan 17, 2014

This is a follow-up patch to f3d33426 ("net: rework recvmsg
handler msg_name and msg_namelen logic").

DECLARE_SOCKADDR validates that the structure we use for writing the
name information to is not larger than the buffer which is reserved
for msg->msg_name (which is 128 bytes). Also use DECLARE_SOCKADDR
consistently in sendmsg code paths.
Signed-off-by: Steffen Hurrle <steffen@hurrle.net>
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

342dfc30
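A build-time size check of this kind can be imitated in userspace with C11 `_Static_assert`. This is only a sketch: `SKETCH_DECLARE_SOCKADDR` and `sketch_port_of` are hypothetical and are not the kernel's DECLARE_SOCKADDR, and POSIX `struct sockaddr_storage` is assumed to stand in for the buffer reserved for msg_name.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical analogue: refuse to compile when the address type is
 * larger than the reserved buffer, then declare a typed pointer. */
#define SKETCH_DECLARE_SOCKADDR(type, dst, src)				\
	_Static_assert(sizeof(type) <= sizeof(struct sockaddr_storage),	\
		       "address type larger than msg_name buffer");	\
	type *dst = (type *)(src)

/* Example user: read the port field through the checked pointer. */
static in_port_t sketch_port_of(void *msg_name)
{
	SKETCH_DECLARE_SOCKADDR(struct sockaddr_in, addr, msg_name);
	return addr->sin_port;
}
```

The check costs nothing at run time; an oversized address type is rejected by the compiler instead of overflowing the buffer later.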

07 Jan, 2014 1 commit

netlink: Avoid netlink mmap alloc if msg size exceeds frame size · aae9f0e2

Authored by Thomas Graf on Nov 30, 2013

An insufficient ring frame size configuration can lead to an
unnecessary skb allocation for every Netlink message. Check frame
size before taking the queue lock and allocating the skb and
re-check with lock to be safe.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>

aae9f0e2

02 Jan, 2014 1 commit

netlink: cleanup tap related functions · 2173f8d9

Authored by stephen hemminger on Dec 30, 2013

Cleanups in netlink_tap code
 * remove unused function netlink_clear_multicast_users
 * make local function static
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

2173f8d9

01 Jan, 2014 2 commits

netlink: specify netlink packet direction for nlmon · 604d13c9

Authored by Daniel Borkmann on Dec 23, 2013

In order to facilitate development for netlink protocol dissector,
fill the unused field skb->pkt_type of the cloned skb with a hint
of the address space of the new owner (receiver) socket in the
notion of "to kernel" resp. "to user".

At the time we invoke __netlink_deliver_tap_skb(), we already have
set the new skb owner via netlink_skb_set_owner_r(), so we can use
that for netlink_is_kernel() probing.

In normal PF_PACKET network traffic, this field denotes if the
packet is destined for us (PACKET_HOST), if it's broadcast
(PACKET_BROADCAST), etc.

As we only have 3 bit reserved, we can use the value (= 6) of
PACKET_FASTROUTE as it's _not used_ anywhere in the whole kernel
and not supported anywhere, and packets of such type were never
exposed to user space, so there are no overlapping users of such
kind. Thus, as wished, that seems the only way to make both
PACKET_* values non-overlapping and therefore device agnostic.

By using those two flags for netlink skbs on nlmon devices, they
can be made available and picked up via sll_pkttype (previously
unused in netlink context) in struct sockaddr_ll. We now have
these two directions:

 - PACKET_USER (= 6)    ->  to user space
 - PACKET_KERNEL (= 7)  ->  to kernel space

Partial `ip a` example strace for sa_family=AF_NETLINK with
detected nl msg direction:

syscall:                     direction:
sendto(3,  ...) = 40         /* to kernel */
recvmsg(3, ...) = 3404       /* to user */
recvmsg(3, ...) = 1120       /* to user */
recvmsg(3, ...) = 20         /* to user */
sendto(3,  ...) = 40         /* to kernel */
recvmsg(3, ...) = 168        /* to user */
recvmsg(3, ...) = 144        /* to user */
recvmsg(3, ...) = 20         /* to user */
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

604d13c9
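The direction marking described in this commit amounts to choosing one of two values for a 3-bit field. The following userspace sketch uses hypothetical `SKETCH_` names and a hypothetical `struct sketch_skb`; only the values 6 and 7 and the 3-bit width are taken from the commit message.

```c
#include <stdbool.h>

#define SKETCH_PACKET_USER   6	/* to user space */
#define SKETCH_PACKET_KERNEL 7	/* to kernel space */

/* Hypothetical stand-in for the cloned skb: a 3-bit type field,
 * so only values 0..7 fit. */
struct sketch_skb {
	unsigned int pkt_type : 3;
};

/* Tag the clone with the address space of its receiving socket. */
static void sketch_mark_direction(struct sketch_skb *skb,
				  bool receiver_is_kernel)
{
	skb->pkt_type = receiver_is_kernel ? SKETCH_PACKET_KERNEL
					   : SKETCH_PACKET_USER;
}
```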

netlink: only do not deliver to tap when both sides are kernel sks · 73bfd370

Authored by Daniel Borkmann on Dec 23, 2013

We should also deliver packets to nlmon devices when we are in
netlink_unicast_kernel(), and only one of the {src,dst} sockets
is user sk and the other one kernel sk. That's e.g. the case in
netlink diag, netlink route, etc. Still, forbid to deliver messages
from kernel to kernel sks.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>

73bfd370

21 Nov, 2013 1 commit

net: rework recvmsg handler msg_name and msg_namelen logic · f3d33426

Authored by Hannes Frederic Sowa on Nov 21, 2013

This patch now always passes msg->msg_namelen as 0. recvmsg handlers must
set msg_namelen to the proper size <= sizeof(struct sockaddr_storage)
to return msg_name to the user.

This prevents numerous uninitialized memory leaks we had in the
recvmsg handlers and makes it harder for new code to accidentally leak
uninitialized memory.

Optimize for the case recvfrom is called with NULL as address. We don't
need to copy the address at all, so set it to NULL before invoking the
recvmsg handler. We can do so, because all the recvmsg handlers must
cope with the case a plain read() is called on them. read() also sets
msg_name to NULL.

Also document these changes in include/linux/net.h as suggested by David
Miller.

Changes since RFC:

Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
affect sendto as it would bail out earlier while trying to copy-in the
address. It also more naturally reflects the logic by the callers of
verify_iovec.

With this change in place I could remove "
if (!uaddr || msg_sys->msg_namelen == 0)
	msg->msg_name = NULL
".

This change does not alter the user visible error logic as we ignore
msg_namelen as long as msg_name is NULL.

Also remove two unnecessary curly brackets in ___sys_recvmsg and change
comments to netdev style.

Cc: David Miller <davem@davemloft.net>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3d33426

20 Nov, 2013 1 commit

netlink: fix documentation typo in netlink_set_err() · 840e93f2

Authored by Johannes Berg on Nov 19, 2013

The parameter is just 'group', not 'groups', fix the documentation typo.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

840e93f2

07 Sep, 2013 1 commit

net: netlink: filter particular protocols from analyzers · 5ffd5cdd

Authored by Daniel Borkmann on Sep 05, 2013

Fix finer-grained control and let only a whitelist of allowed netlink
protocols pass, in our case related to networking. If later on, other
subsystems decide they want to add their protocol as well to the list
of allowed protocols they shall simply add it. While at it, we also
need to tell what protocol is in use otherwise BPF_S_ANC_PROTOCOL can
not pick it up (as it's not filled out).
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5ffd5cdd

16 Aug, 2013 1 commit

netlink: Eliminate kmalloc in netlink dump operation. · 16b304f3

Authored by Pravin B Shelar on Aug 15, 2013

Following patch stores struct netlink_callback in netlink_sock
to avoid allocating and freeing it on every netlink dump msg.
Only one dump operation is allowed for a given socket at a time
therefore we can safely convert cb pointer to cb struct inside
netlink_sock.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16b304f3

03 Aug, 2013 1 commit

net: netlink: minor: remove unused pointer in alloc_pg_vec · 8a849bb7

Authored by Daniel Borkmann on Aug 02, 2013

Variable ptr is being assigned, but never used, so just remove it.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a849bb7

28 Jun, 2013 1 commit

netlink: fix splat in skb_clone with large messages · 3a36515f

Authored by Pablo Neira on Jun 28, 2013

Since (c05cdb1b netlink: allow large data transfers from user-space),
netlink splats if it invokes skb_clone on large netlink skbs since:

* skb_shared_info was not correctly initialized.
* skb->destructor is not set in the cloned skb.

This was spotted by trinity:

[  894.990671] BUG: unable to handle kernel paging request at ffffc9000047b001
[  894.991034] IP: [<ffffffff81a212c4>] skb_clone+0x24/0xc0
[...]
[  894.991034] Call Trace:
[  894.991034]  [<ffffffff81ad299a>] nl_fib_input+0x6a/0x240
[  894.991034]  [<ffffffff81c3b7e6>] ? _raw_read_unlock+0x26/0x40
[  894.991034]  [<ffffffff81a5f189>] netlink_unicast+0x169/0x1e0
[  894.991034]  [<ffffffff81a601e1>] netlink_sendmsg+0x251/0x3d0

Fix it by:

1) introducing a new netlink_skb_clone function that is used in nl_fib_input,
   that sets our special skb->destructor in the cloned skb. Moreover, handle
   the release of the large cloned skb head area in the destructor path.

2) not allowing large skbuffs in the netlink broadcast path. I cannot find
   any reasonable use of the large data transfer using netlink in that path,
   moreover this helps to skip extra skb_clone handling.

I found two more netlink clients that are cloning the skbs, but they are
not in the sendmsg path. Therefore, the sole client cloning that I found
seems to be the fib frontend.

Thanks to Eric Dumazet for helping to address this issue.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a36515f

25 Jun, 2013 1 commit

net: netlink: virtual tap device management · bcbde0d4

Authored by Daniel Borkmann on Jun 21, 2013

Similarly to the networking receive path with ptype_all taps, we add
the possibility to register netdevices that are for ARPHRD_NETLINK to
the netlink subsystem, so that those can be used for netlink analyzers
resp. debuggers. We do not offer a direct callback function as out-of-tree
modules could do crap with it. Instead, a netdevice must be registered
properly and only receives a clone, managed by the netlink layer. Symbols
are exported as GPL-only.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bcbde0d4

13 Jun, 2013 1 commit

netlink: make compare exist all the time · ca15febf

Authored by Gao feng on Jun 13, 2013

Commit da12c90e
"netlink: Add compare function for netlink_table"
only sets compare at the time we create a kernel netlink socket,
and resets compare to NULL at the time we finally
release the netlink socket, but netlink_lookup needs
compare to exist at all times.

So we should set compare after we allocate nl_table,
and never reset it. Make compare exist all the time.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca15febf

11 Jun, 2013 2 commits

netlink: fix error propagation in netlink_mmap() · 7cdbac71

Authored by Patrick McHardy on Jun 11, 2013

Return the error if something went wrong instead of unconditionally
returning 0.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

7cdbac71

netlink: Add compare function for netlink_table · da12c90e

Authored by Gao feng on Jun 06, 2013

As we know, netlink sockets are a private resource of a
net namespace; they can communicate with each other
only when they are in the same net namespace. This works
well until we try to add namespace support for other
subsystems which use netlink.

Unlike ipv4 and the route table, some subsystems are not
suited to belong to a net namespace. The audit and crypto
subsystems, for example, are better suited to a
user namespace.

So we must make it possible for netlink sockets in the
same user namespace to communicate with each other.

This patch adds a new function pointer "compare" to
netlink_table. Through this self-defined compare function,
each netlink_table can decide whether netlink sockets can
communicate with each other.

The behavior isn't changed if we don't provide the compare
function for netlink_table.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

da12c90e
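The idea of a per-table compare hook can be sketched in userspace with a function pointer. Every name below is hypothetical, and the linear search is only for brevity; the sketch shows how swapping the hook changes which sockets a lookup may return, not how the kernel stores its sockets.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_ns { int id; };		/* stand-in for a namespace */

struct sketch_sock {
	unsigned int portid;
	const struct sketch_ns *net;
};

typedef bool (*sketch_compare_fn)(const struct sketch_ns *net,
				  const struct sketch_sock *sk);

struct sketch_table {
	sketch_compare_fn compare;	/* per-table visibility policy */
};

/* Default policy: sockets only match within the same net namespace. */
static bool sketch_compare_net(const struct sketch_ns *net,
			       const struct sketch_sock *sk)
{
	return sk->net == net;
}

/* Alternative policy for a table that ignores the net namespace. */
static bool sketch_compare_any(const struct sketch_ns *net,
			       const struct sketch_sock *sk)
{
	(void)net;
	(void)sk;
	return true;
}

/* Find a socket by port ID, filtered through the table's hook. */
static const struct sketch_sock *
sketch_lookup(const struct sketch_table *t, const struct sketch_sock *socks,
	      size_t n, const struct sketch_ns *net, unsigned int portid)
{
	for (size_t i = 0; i < n; i++)
		if (socks[i].portid == portid && t->compare(net, &socks[i]))
			return &socks[i];
	return NULL;
}
```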

08 Jun, 2013 1 commit

netlink: allow large data transfers from user-space · c05cdb1b

Authored by Pablo Neira Ayuso on Jun 03, 2013

I can hit ENOBUFS in the sendmsg() path with a large batch that is
composed of many netlink messages. Here that limit is 8 MBytes of
skbuff data area as kmalloc does not manage to get more than that.

While discussing atomic rule-set for nftables with Patrick McHardy,
we decided to put all rule-set updates that need to be applied
atomically in one single batch to simplify the existing approach.
However, as explained above, the existing netlink code limits us
to a maximum of ~20000 rules that fit in one single batch without
hitting ENOBUFS. iptables does not have such limitation as it is
using vmalloc.

This patch adds netlink_alloc_large_skb() which is only used in
the netlink_sendmsg() path. It uses alloc_skb if the memory
requested is <= one memory page, that should be the common case
for most subsystems, else vmalloc for higher memory allocations.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c05cdb1b

05 Jun, 2013 1 commit

net: fix sk_buff head without data area · 5e71d9d7

Authored by Pablo Neira on Jun 03, 2013

Eric Dumazet spotted that we have to check skb->head instead
of skb->data as skb->head points to the beginning of the
data area of the skbuff. Similarly, we have to initialize the
skb->head pointer, not skb->data in __alloc_skb_head.

After this fix, netlink crashes in the release path of the
sk_buff, so let's fix that as well.

This bug was introduced in (0ebd0ac5 net: add function to
allocate sk_buff head without data area).
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e71d9d7

02 May, 2013 1 commit

netlink: Fix skb ref counting. · ae6164ad

Authored by Pravin B Shelar on Apr 29, 2013

Commit f9c22888 (netlink:
implement memory mapped recvmsg) incremented skb->users
ref count twice for a dump op which does not look right.

Following patch fixes that.

CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae6164ad

25 Apr, 2013 1 commit

netlink: fix compilation after memory mapped patches · 1bf9310a

Authored by Nicolas Dichtel on Apr 24, 2013

Depending on the kernel configuration (CONFIG_UIDGID_STRICT_TYPE_CHECKS), we can
get the following errors:

net/netlink/af_netlink.c: In function ‘netlink_queue_mmaped_skb’:
net/netlink/af_netlink.c:663:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kuid_t’
net/netlink/af_netlink.c:664:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kgid_t’
net/netlink/af_netlink.c: In function ‘netlink_ring_set_copied’:
net/netlink/af_netlink.c:693:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kuid_t’
net/netlink/af_netlink.c:694:14: error: incompatible types when assigning to type ‘__u32’ from type ‘kgid_t’

We must use the helpers to get the uid and gid, and also take care of user_ns.

Fix suggested by Eric W. Biederman <ebiederm@xmission.com>.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1bf9310a
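The compile errors quoted in this commit come from assigning a wrapped ID type to a plain integer field. A userspace sketch of that pattern follows; `sketch_kuid_t`, `sketch_from_kuid` and `struct sketch_hdr` are hypothetical, and the identity conversion is an assumption, since the kernel helpers additionally map the ID through a user namespace.

```c
#include <stdint.h>

/* Hypothetical strict wrapper: a struct cannot be assigned to an
 * integer, so accidental mixing becomes a compile error. */
typedef struct {
	uint32_t val;
} sketch_kuid_t;

/* Explicit conversion helper. The real helper also takes a user
 * namespace and translates the ID; this sketch uses the identity. */
static uint32_t sketch_from_kuid(sketch_kuid_t uid)
{
	return uid.val;
}

/* Hypothetical frame header with a plain 32-bit uid field. */
struct sketch_hdr {
	uint32_t uid;
};

static void sketch_fill_uid(struct sketch_hdr *hdr, sketch_kuid_t uid)
{
	/* "hdr->uid = uid;" would not compile: incompatible types. */
	hdr->uid = sketch_from_kuid(uid);
}
```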

24 Apr, 2013 1 commit

netlink: fix typo in net/netlink/af_netlink.c · 1d5085cb

Authored by Stephen Rothwell on Apr 23, 2013

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d5085cb

20 Apr, 2013 6 commits

netlink: add flow control for memory mapped I/O · cd1df525

Authored by Patrick McHardy on Apr 17, 2013

Add flow control for memory mapped RX. Since user-space usually doesn't
invoke recvmsg() when using memory mapped I/O, flow control is performed
in netlink_poll(). Dumps are allowed to continue if at least half of the
ring frames are unused.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd1df525
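The "at least half of the ring frames are unused" rule can be written as a one-line predicate. This is a simplified userspace sketch with a hypothetical function name; it counts unused frames directly, which may differ from how the kernel inspects the ring.

```c
#include <stdbool.h>

/* A dump may continue when at least half of the ring frames are
 * unused. Rounding up keeps odd-sized rings on the strict side. */
static bool sketch_dump_may_continue(unsigned int frames_total,
				     unsigned int frames_unused)
{
	return frames_unused >= (frames_total + 1) / 2;
}
```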

netlink: implement memory mapped recvmsg() · f9c22888

Authored by Patrick McHardy on Apr 17, 2013

Add support for mmap'ed recvmsg(). To allow the kernel to construct messages
into the mapped area, a dataless skb is allocated and the data pointer is
set to point into the ring frame. This means frames will be delivered to
userspace in order of allocation instead of order of transmission. This
usually doesn't matter since the order is either not determinable by
userspace or message creation/transmission is serialized. The only case
where this can have a visible difference is nfnetlink_queue. Userspace
can't assume mmap'ed messages have ordered IDs anymore and needs to check
this if using batched verdicts.

For non-mapped sockets, nothing changes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9c22888

netlink: implement memory mapped sendmsg() · 5fd96123

Authored by Patrick McHardy on Apr 17, 2013

Add support for mmap'ed sendmsg() to netlink. Since the kernel validates
received messages before processing them, the code makes sure userspace
can't modify the message contents after invoking sendmsg(). To do that
only a single mapping of the TX ring is allowed to exist and the socket
must not be shared. If either of these two conditions does not hold, it
falls back to copying.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

5fd96123

netlink: add mmap'ed netlink helper functions · 9652e931

Authored by Patrick McHardy on Apr 17, 2013

Add helper functions for looking up mmap'ed frame headers, reading and
writing their status, allocating skbs with mmap'ed data areas and a poll
function.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

9652e931

netlink: mmaped netlink: ring setup · ccdfcc39

Authored by Patrick McHardy on Apr 17, 2013

Add support for mmap'ed RX and TX ring setup and teardown based on the
af_packet.c code. The following patches will use this to add the real
mmap'ed receive and transmit functionality.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

ccdfcc39

netlink: add netlink_skb_set_owner_r() · cf0a018a

Authored by Patrick McHardy on Apr 17, 2013

For mmap'ed I/O a netlink specific skb destructor needs to be invoked
after the final kfree_skb() to clean up state. This doesn't work currently
since the skb's ownership is transferred to the receiving socket using
skb_set_owner_r(), which orphans the skb, thereby invoking the destructor
prematurely.

Since netlink doesn't account skbs to the originating socket, there's no
need to orphan the skb. Add a netlink specific skb_set_owner_r() variant
that does not orphan the skb and use a netlink specific destructor to
call sock_rfree().
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf0a018a

openanolis / cloud-kernel synced successfully more than 1 year ago