提交 · e341694e3eb57fcda9f1adc7bfea42fe080d8d7a · openanolis / cloud-kernel

03 8月, 2014 12 次提交

netlink: Convert netlink_lookup() to use RCU protected hash table · e341694e

由 Thomas Graf 提交于 8月 02, 2014

Heavy Netlink users such as Open vSwitch spend a considerable amount of
time in netlink_lookup() due to the read-lock on nl_table_lock. Use of
RCU relieves the lock contention.

Makes use of the new resizable hash table to avoid locking on the
lookup.

The hash table will grow if entries exceeds 75% of table size up to a
total table size of 64K. It will automatically shrink if usage falls
below 30%.

Also splits nl_table_lock into a separate mutex to protect hash table
mutations and allow synchronize_rcu() to sleep while waiting for readers
during expansion and shrinking.

Before:
   9.16%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   6.42%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   6.26%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   6.23%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.79%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
   4.37%  kpktgend_0  [kernel.kallsyms]  [k] memcpy
   3.60%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   2.69%  kpktgend_0  [kernel.kallsyms]  [k] jhash2

After:
  15.26%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   8.12%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   7.92%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   5.11%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.11%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   4.06%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
   3.90%  kpktgend_0  [kernel.kallsyms]  [k] jhash2
   [...]
   0.67%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Reviewed-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e341694e

inet: frags: use kmem_cache for inet_frag_queue · d4ad4d22

由 Nikolay Aleksandrov 提交于 8月 01, 2014

Use kmem_cache to allocate/free inet_frag_queue objects since they're
all the same size per inet_frags user and are alloced/freed in high volumes
thus making it a perfect case for kmem_cache.
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Acked-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4ad4d22

inet: frags: use INET_FRAG_EVICTED to prevent icmp messages · 2e404f63

由 Nikolay Aleksandrov 提交于 8月 01, 2014

Now that we have INET_FRAG_EVICTED we might as well use it to stop
sending icmp messages in the "frag_expire" functions instead of
stripping INET_FRAG_FIRST_IN from their flags when evicting.
Also fix the comment style in ip6_expire_frag_queue().
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Reviewed-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2e404f63

inet: frags: fix function declaration alignments in inet_fragment · f926e236

由 Nikolay Aleksandrov 提交于 8月 01, 2014

Fix a couple of functions' declaration alignments.
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f926e236

inet: frags: rename last_in to flags · 06aa8b8a

由 Nikolay Aleksandrov 提交于 8月 01, 2014

The last_in field has been used to store various flags different from
first/last frag in so give it a more descriptive name: flags.
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06aa8b8a

inet: frags: use INC_STATS_BH in the ipv6 reassembly code · d2373862

由 Nikolay Aleksandrov 提交于 8月 01, 2014

Softirqs are already disabled so no need to do it again, thus let's be
consistent and use the IP6_INC_STATS_BH variant.
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d2373862

ipv4: remove nested rcu_read_lock/unlock · 188b1210

由 Duan Jiong 提交于 8月 01, 2014

ip_local_deliver_finish() already have a rcu_read_lock/unlock, so
the rcu_read_lock/unlock is unnecessary.

See the stack below:
ip_local_deliver_finish
	|
	|
	->icmp_rcv
		|
		|
		->icmp_socket_deliver
Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

188b1210

net: filter: split 'struct sk_filter' into socket and bpf parts · 7ae457c1

由 Alexei Starovoitov 提交于 7月 30, 2014

clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into
struct sk_filter {
	atomic_t        refcnt;
	struct rcu_head rcu;
	struct bpf_prog *prog;
};
and
struct bpf_prog {
        u32                     jited:1,
                                len:31;
        struct sock_fprog_kern  *orig_prog;
        unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                            const struct bpf_insn *filter);
        union {
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[0];
                struct work_struct      work;
        };
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases

split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:

sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ae457c1

net: filter: rename sk_convert_filter() -> bpf_convert_filter() · 8fb575ca

由 Alexei Starovoitov 提交于 7月 30, 2014

to indicate that this function is converting classic BPF into eBPF
and not related to sockets
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8fb575ca

net: filter: rename sk_chk_filter() -> bpf_check_classic() · 4df95ff4

由 Alexei Starovoitov 提交于 7月 30, 2014

trivial rename to indicate that this functions performs classic BPF checking
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4df95ff4

net: filter: rename sk_filter_proglen -> bpf_classic_proglen · 009937e7

由 Alexei Starovoitov 提交于 7月 30, 2014

trivial rename to better match semantics of macro
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

009937e7

net: filter: simplify socket charging · 278571ba

由 Alexei Starovoitov 提交于 7月 30, 2014

attaching bpf program to a socket involves multiple socket memory arithmetic,
since size of 'sk_filter' is changing when classic BPF is converted to eBPF.
Also common path of program creation has to deal with two ways of freeing
the memory.

Simplify the code by delaying socket charging until program is ready and
its size is known
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

278571ba

01 8月, 2014 7 次提交

net: use inet6_iif instead of IP6CB()->iif · 4330487a

由 Duan Jiong 提交于 8月 01, 2014

Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4330487a

netlink: Use PAGE_ALIGNED macro · 74e83b23

由 Tobias Klauser 提交于 7月 31, 2014

Use PAGE_ALIGNED(...) instead of IS_ALIGNED(..., PAGE_SIZE).
Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

74e83b23

net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS · 7304fe46

由 Duan Jiong 提交于 7月 31, 2014

When dealing with ICMPv[46] Error Message, function icmp_socket_deliver()
and icmpv6_notify() do some valid checks on packet's length, but then some
protocols check packet's length redaudantly. So remove those duplicated
statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in
function icmp_socket_deliver() and icmpv6_notify() respectively.

In addition, add missed counter in udp6/udplite6 when socket is NULL.
Signed-off-by: NDuan Jiong <duanj.fnst@cn.fujitsu.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7304fe46

sctp: Fixup v4mapped behaviour to comply with Sock API · 299ee123

由 Jason Gunthorpe 提交于 7月 30, 2014

The SCTP socket extensions API document describes the v4mapping option as
follows:

8.1.15.  Set/Clear IPv4 Mapped Addresses (SCTP_I_WANT_MAPPED_V4_ADDR)

   This socket option is a Boolean flag which turns on or off the
   mapping of IPv4 addresses.  If this option is turned on, then IPv4
   addresses will be mapped to V6 representation.  If this option is
   turned off, then no mapping will be done of V4 addresses and a user
   will receive both PF_INET6 and PF_INET type addresses on the socket.
   See [RFC3542] for more details on mapped V6 addresses.

This description isn't really in line with what the code does though.

Introduce addr_to_user (renamed addr_v4map), which should be called
before any sockaddr is passed back to user space. The new function
places the sockaddr into the correct format depending on the
SCTP_I_WANT_MAPPED_V4_ADDR option.

Audit all places that touched v4mapped and either sanely construct
a v4 or v6 address then call addr_to_user, or drop the
unnecessary v4mapped check entirely.

Audit all places that call addr_to_user and verify they are on a sycall
return path.

Add a custom getname that formats the address properly.

Several bugs are addressed:
 - SCTP_I_WANT_MAPPED_V4_ADDR=0 often returned garbage for
   addresses to user space
 - The addr_len returned from recvmsg was not correct when
   returning AF_INET on a v6 socket
 - flowlabel and scope_id were not zerod when promoting
   a v4 to v6
 - Some syscalls like bind and connect behaved differently
   depending on v4mapped

Tested bind, getpeername, getsockname, connect, and recvmsg for proper
behaviour in v4mapped = 1 and 0 cases.
Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
Tested-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

299ee123

tcp: don't require root to read tcp_metrics · 388070fa

由 Banerjee, Debabrata 提交于 7月 30, 2014

commit d23ff701 (tcp: add generic netlink support for tcp_metrics) introduced
netlink support for the new tcp_metrics, however it restricted getting of
tcp_metrics to root user only. This is a change from how these values could
have been fetched when in the old route cache. Unless there's a legitimate
reason to restrict the reading of these values it would be better if normal
users could fetch them.

Cc: Julian Anastasov <ja@ssi.bg>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NDebabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

388070fa

netfilter: nf_tables: check for unset NFTA_SET_ELEM_LIST_ELEMENTS attribute · 7d5570ca

由 Pablo Neira Ayuso 提交于 7月 25, 2014

Otherwise, the kernel oopses in nla_for_each_nested when iterating over
the unset attribute NFTA_SET_ELEM_LIST_ELEMENTS in the
nf_tables_{new,del}setelem() path.

netlink: 65524 bytes leftover after parsing attributes in process `nft'.
[...]
Oops: 0000 [#1] SMP
[...]
CPU: 2 PID: 6287 Comm: nft Not tainted 3.16.0-rc2+ #169
RIP: 0010:[<ffffffffa0526e61>]  [<ffffffffa0526e61>] nf_tables_newsetelem+0x82/0xec [nf_tables]
[...]
Call Trace:
 [<ffffffffa05178c4>] nfnetlink_rcv+0x2e7/0x3d7 [nfnetlink]
 [<ffffffffa0517939>] ? nfnetlink_rcv+0x35c/0x3d7 [nfnetlink]
 [<ffffffff8137d300>] netlink_unicast+0xf8/0x17a
 [<ffffffff8137d6a5>] netlink_sendmsg+0x323/0x351
[...]

Fix this by returning -EINVAL if this attribute is not set, which
doesn't make sense at all since those commands are there to add and to
delete elements from the set.
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

7d5570ca

netfilter: nfnetlink_acct: avoid using NFACCT_F_OVERQUOTA with bit helper functions · b6d04688

由 Alexey Perevalov 提交于 7月 31, 2014

Bit helper functions were used for manipulation with NFACCT_F_OVERQUOTA,
but they are accepting pit position, but not a bit mask. As a result
not a third bit for NFACCT_F_OVERQUOTA was set, but forth. Such
behaviour was dangarous and could lead to unexpected overquota report
result.
Signed-off-by: NAlexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

b6d04688

31 7月, 2014 4 次提交

net: filter: don't release unattached filter through call_rcu() · 34c5bd66

由 Pablo Neira 提交于 7月 29, 2014

sk_unattached_filter_destroy() does not always need to release the
filter object via rcu. Since this filter is never attached to the
socket, the caller should be responsible for releasing the filter
in a safe way, which may not necessarily imply rcu.

This is a short summary of clients of this function:

1) xt_bpf.c and cls_bpf.c use the bpf matchers from rules, these rules
are removed from the packet path before the filter is released. Thus,
the framework makes sure the filter is safely removed.

2) In the ppp driver, the ppp_lock ensures serialization between the
xmit and filter attachment/detachment path. This doesn't use rcu
so deferred release via rcu makes no sense.

3) In the isdn/ppp driver, it is called from isdn_ppp_release()
the isdn_ppp_ioctl(). This driver uses mutex and spinlocks, no rcu.
Thus, deferred rcu makes no sense to me either, the deferred releases
may be just masking the effects of wrong locking strategy, which
should be fixed in the driver itself.

4) In the team driver, this is the only place where the rcu
synchronization with unattached filter is used. Therefore, this
patch introduces synchronize_rcu() which is called from the
genetlink path to make sure the filter doesn't go away while packets
are still walking over it. I think we can revisit this once struct
bpf_prog (that only wraps specific bpf code bits) is in place, then
add some specific struct rcu_head in the scope of the team driver if
Jiri thinks this is needed.

Deferred rcu release for unattached filters was originally introduced
in 302d6637 ("filter: Allow to create sk-unattached filters").
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34c5bd66

net: Remove unlikely() for WARN_ON() conditions · 80019d31

由 Thomas Graf 提交于 7月 30, 2014

No need for the unlikely(), WARN_ON() and BUG_ON() internally use
unlikely() on the condition.
Signed-off-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

80019d31

dcbnl : Fix misleading dcb_app->priority explanation · 16eecd9b

由 Anish Bhatt 提交于 7月 28, 2014

Current explanation of dcb_app->priority is wrong. It says priority is
expected to be a 3-bit unsigned integer which is only true when working with
DCBx-IEEE. Use of dcb_app->priority by DCBx-CEE expects it to be 802.1p user
priority bitmap. Updated accordingly

This affects the cxgb4 driver, but I will post those changes as part of a
larger changeset shortly.

Fixes: 3e29027a ("dcbnl: add support for ieee8021Qaz attributes")
Signed-off-by: NAnish Bhatt <anish@chelsio.com>
Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

16eecd9b

netfilter: nfnetlink_acct: dump unmodified nfacct flags · d24675cb

由 Alexey Perevalov 提交于 7月 30, 2014

NFNL_MSG_ACCT_GET_CTRZERO modifies dumped flags, in this case
client see unmodified (uncleared) counter value and cleared
overquota state - end user doesn't know anything about overquota state,
unless end user subscribed on overquota report.
Signed-off-by: NAlexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

d24675cb

30 7月, 2014 11 次提交

ipv4: clean up cast warning in do_ip_getsockopt · c54a5e02

由 Karoly Kemeny 提交于 7月 27, 2014

Sparse warns because of implicit pointer cast.

v2: subject line correction, space between "void" and "*"
Signed-off-by: NKaroly Kemeny <karoly.kemeny@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c54a5e02

tipc: remove duplicated include from socket.c · ad025a56

由 Wei Yongjun 提交于 7月 28, 2014

Remove duplicated include.
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad025a56

net/udp_offload: Use IS_ERR_OR_NULL · 27446442

由 Himangi Saraogi 提交于 7月 27, 2014

This patch introduces the use of the macro IS_ERR_OR_NULL in place of
tests for NULL and IS_ERR.

The following Coccinelle semantic patch was used for making the change:

@@
expression e;
@@

- e == NULL || IS_ERR(e)
+ IS_ERR_OR_NULL(e)
 || ...
Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

27446442

openvswitch: Use IS_ERR_OR_NULL · d0e992aa

由 Himangi Saraogi 提交于 7月 27, 2014

This patch introduces the use of the macro IS_ERR_OR_NULL in place of
tests for NULL and IS_ERR.

The following Coccinelle semantic patch was used for making the change:

@@
expression e;
@@

- e == NULL || IS_ERR(e)
+ IS_ERR_OR_NULL(e)
 || ...
Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
Acked-by: NPravin B Shelar <pshelar@nicira.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d0e992aa

net/ipv4: Use IS_ERR_OR_NULL · 5a8dbf03

由 Himangi Saraogi 提交于 7月 27, 2014

This patch introduces the use of the macro IS_ERR_OR_NULL in place of
tests for NULL and IS_ERR.

The following Coccinelle semantic patch was used for making the change:

@@
expression e;
@@

- e == NULL || IS_ERR(e)
+ IS_ERR_OR_NULL(e)
 || ...
Signed-off-by: NHimangi Saraogi <himangi774@gmail.com>
Acked-by: NJulia Lawall <julia.lawall@lip6.fr>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5a8dbf03

net: sendmsg: fix NULL pointer dereference · 40eea803

由 Andrey Ryabinin 提交于 7月 26, 2014

Sasha's report:
	> While fuzzing with trinity inside a KVM tools guest running the latest -next
	> kernel with the KASAN patchset, I've stumbled on the following spew:
	>
	> [ 4448.949424] ==================================================================
	> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
	> [ 4448.952988] Read of size 2 by thread T19638:
	> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
	> [ 4448.956823]  ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
	> [ 4448.958233]  ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
	> [ 4448.959552]  0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
	> [ 4448.961266] Call Trace:
	> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
	> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
	> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
	> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
	> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
	> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
	> [ 4448.970103] sock_sendmsg (net/socket.c:654)
	> [ 4448.971584] ? might_fault (mm/memory.c:3741)
	> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
	> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
	> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
	> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
	> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
	> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
	> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
	> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
	> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
	> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
	> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
	> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
	> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
	> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
	> [ 4448.988929] ==================================================================

This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.

After this report there was no usual "Unable to handle kernel NULL pointer dereference"
and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.

This bug was introduced in f3d33426
(net: rework recvmsg handler msg_name and msg_namelen logic).
Commit message states that:
	"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
	 non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
	 affect sendto as it would bail out earlier while trying to copy-in the
	 address."
But in fact this affects sendto when address 0 is mapped and contains
socket address structure in it. In such case copy-in address will succeed,
verify_iovec() function will successfully exit with msg->msg_namelen > 0
and msg->msg_name == NULL.

This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org>
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

40eea803

vlan: fail early when creating netdev named config · 9c5ff24f

由 WANG Cong 提交于 7月 25, 2014

Similarly, vlan will create  /proc/net/vlan/<dev>, so when we
create dev with name "config", it will confict with
/proc/net/vlan/config.
Reported-by: NStephane Chazelas <stephane.chazelas@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9c5ff24f

ipv6: fail early when creating netdev named all or default · a317a2f1

由 WANG Cong 提交于 7月 25, 2014

We create a proc dir for each network device, this will cause
conflicts when the devices have name "all" or "default".

Rather than emitting an ugly kernel warning, we could just
fail earlier by checking the device name.
Reported-by: NStephane Chazelas <stephane.chazelas@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a317a2f1

ipv4: fail early when creating netdev named all or default · 20e61da7

由 WANG Cong 提交于 7月 25, 2014

We create a proc dir for each network device, this will cause
conflicts when the devices have name "all" or "default".

Rather than emitting an ugly kernel warning, we could just
fail earlier by checking the device name.
Reported-by: NStephane Chazelas <stephane.chazelas@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20e61da7

net: remove deprecated syststamp timestamp · 4d276eb6

由 Willem de Bruijn 提交于 7月 25, 2014

The SO_TIMESTAMPING API defines three types of timestamps: software,
hardware in raw format (hwtstamp) and hardware converted to system
format (syststamp). The last has been deprecated in favor of combining
hwtstamp with a PTP clock driver. There are no active users in the
kernel.

The option was device driver dependent. If set, but without hardware
support, the correct behavior is to return zero in the relevant field
in the SCM_TIMESTAMPING ancillary message. Without device drivers
implementing the option, this field is effectively always zero.

Remove the internal plumbing to dissuage new drivers from implementing
the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
avoid breaking existing applications that request the timestamp.
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4d276eb6

packet: remove deprecated syststamp timestamp · 68a360e8

由 Willem de Bruijn 提交于 7月 25, 2014

No device driver will ever return an skb_shared_info structure with
syststamp non-zero, so remove the branch that tests for this and
optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE.

Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes
may refer to it.
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

68a360e8

29 7月, 2014 3 次提交

ip: make IP identifiers less predictable · 04ca6973

由 Eric Dumazet 提交于 7月 26, 2014

In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NJeffrey Knockel <jeffk@cs.unm.edu>
Reported-by: NJedidiah R. Crandall <crandall@cs.unm.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

04ca6973

tipc: make tipc_buf_append() more robust · 13e9b997

由 Jon Paul Maloy 提交于 7月 25, 2014

As per comment from David Miller, we try to make the buffer reassembly
function more resilient to user errors than it is today.

- We check that the "*buf" parameter always is set, since this is
  mandatory input.

- We ensure that *buf->next always is set to NULL before linking in
  the buffer, instead of relying of the caller to have done this.

- We ensure that the "tail" pointer in the head buffer's control
  block is initialized to NULL when the first fragment arrives.
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

13e9b997

neighbour : fix ndm_type type error issue · 545469f7

由 Jun Zhao 提交于 7月 26, 2014

ndm_type means L3 address type, in neighbour proxy and vxlan, it's RTN_UNICAST.
NDA_DST is for netlink TLV type, hence it's not right value in this context.
Signed-off-by: NJun Zhao <mypopydev@gmail.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

545469f7

28 7月, 2014 3 次提交

inet: frag: set limits and make init_net's high_thresh limit global · 1bab4c75

由 Nikolay Aleksandrov 提交于 7月 24, 2014

This patch makes init_net's high_thresh limit to be the maximum for all
namespaces, thus introducing a global memory limit threshold equal to the
sum of the individual high_thresh limits which are capped.
It also introduces some sane minimums for low_thresh as it shouldn't be
able to drop below 0 (or > high_thresh in the unsigned case), and
overall low_thresh should not ever be above high_thresh, so we make the
following relations for a namespace:
init_net:
 high_thresh - max(not capped), min(init_net low_thresh)
 low_thresh - max(init_net high_thresh), min (0)

all other namespaces:
 high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
 low_thresh = max(namespace's high_thresh), min(0)

The major issue with having low_thresh > high_thresh is that we'll
schedule eviction but never evict anything and thus rely only on the
timers.
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1bab4c75

inet: frag: use seqlock for hash rebuild · ab1c724f

由 Florian Westphal 提交于 7月 24, 2014

rehash is rare operation, don't force readers to take
the read-side rwlock.

Instead, we only have to detect the (rare) case where
the secret was altered while we are trying to insert
a new inetfrag queue into the table.

If it was changed, drop the bucket lock and recompute
the hash to get the 'new' chain bucket that we have to
insert into.

Joint work with Nikolay Aleksandrov.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab1c724f

inet: frag: remove periodic secret rebuild timer · e3a57d18

由 Florian Westphal 提交于 7月 24, 2014

merge functionality into the eviction workqueue.

Instead of rebuilding every n seconds, take advantage of the upper
hash chain length limit.

If we hit it, mark table for rebuild and schedule workqueue.
To prevent frequent rebuilds when we're completely overloaded,
don't rebuild more than once every 5 seconds.

ipfrag_secret_interval sysctl is now obsolete and has been marked as
deprecated, it still can be changed so scripts won't be broken but it
won't have any effect. A comment is left above each unused secret_timer
variable to avoid confusion.

Joint work with Nikolay Aleksandrov.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NNikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e3a57d18

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功