提交 · a516993f0ac1694673412eb2d16a091eafa77d2a · openeuler / raspberrypi-kernel

14 8月, 2015 1 次提交

net: fix wrong skb_get() usage / crash in IGMP/MLD parsing code · a516993f

由 Linus Lüssing 提交于 8月 13, 2015

The recent refactoring of the IGMP and MLD parsing code into
ipv6_mc_check_mld() / ip_mc_check_igmp() introduced a potential crash /
BUG() invocation for bridges:

I wrongly assumed that skb_get() could be used as a simple reference
counter for an skb which is not the case. skb_get() bears additional
semantics, a user count. This leads to a BUG() invocation in
pskb_expand_head() / kernel panic if pskb_may_pull() is called on an skb
with a user count greater than one - unfortunately the refactoring did
just that.

Fixing this by removing the skb_get() call and changing the API: The
caller of ipv6_mc_check_mld() / ip_mc_check_igmp() now needs to
additionally check whether the returned skb_trimmed is a clone.

Fixes: 9afd85c9 ("net: Export IGMP/MLD message validation code")
Reported-by: NBrenden Blanco <bblanco@plumgrid.com>
Signed-off-by: NLinus Lüssing <linus.luessing@c0d3.blue>
Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a516993f

11 8月, 2015 1 次提交

inet: fix races with reqsk timers · 2235f2ac

由 Eric Dumazet 提交于 8月 10, 2015

reqsk_queue_destroy() and reqsk_queue_unlink() should use
del_timer_sync() instead of del_timer() before calling reqsk_put(),
otherwise we could free a req still used by another cpu.

But before doing so, reqsk_queue_destroy() must release syn_wait_lock
spinlock or risk a dead lock, as reqsk_timer_handler() might
need to take this same spinlock from reqsk_queue_unlink() (called from
inet_csk_reqsk_queue_drop())

Fixes: fa76ce73 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2235f2ac

07 8月, 2015 2 次提交

net: pktgen: don't abuse current->state in pktgen_thread_worker() · 7ba8bd75

由 Oleg Nesterov 提交于 8月 04, 2015

Commit 1fbe4b46 "net: pktgen: kill the Wait for kthread_stop
code in pktgen_thread_worker()" removed (in particular) the final
__set_current_state(TASK_RUNNING) and I didn't notice the previous
set_current_state(TASK_INTERRUPTIBLE). This triggers the warning
in __might_sleep() after return.

Afaics, we can simply remove both set_current_state()'s, and we
could do this a long ago right after ef87979c "pktgen: better
scheduler friendliness" which changed pktgen_thread_worker() to
use wait_event_interruptible_timeout().
Reported-by: NHuang Ying <ying.huang@intel.com>
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7ba8bd75

net: Fix skb_set_peeked use-after-free bug · a0a2a660

由 Herbert Xu 提交于 8月 04, 2015

The commit 738ac1eb ("net: Clone
skb before setting peeked flag") introduced a use-after-free bug
in skb_recv_datagram.  This is because skb_set_peeked may create
a new skb and free the existing one.  As it stands the caller will
continue to use the old freed skb.

This patch fixes it by making skb_set_peeked return the new skb
(or the old one if unchanged).

Fixes: 738ac1eb ("net: Clone skb before setting peeked flag")
Reported-by: NBrenden Blanco <bblanco@plumgrid.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Tested-by: NBrenden Blanco <bblanco@plumgrid.com>
Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a0a2a660

31 7月, 2015 1 次提交

net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket · 8a681736

由 Sowmini Varadhan 提交于 7月 30, 2015

The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.

Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the
      netns of kernel sockets.")
Acked-by: NEric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a681736

27 7月, 2015 1 次提交

tcp: fix recv with flags MSG_WAITALL | MSG_PEEK · dfbafc99

由 Sabrina Dubroca 提交于 7月 24, 2015

Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called
with flags = MSG_WAITALL | MSG_PEEK.

sk_wait_data waits for sk_receive_queue not empty, but in this case,
the receive queue is not empty, but does not contain any skb that we
can use.

Add a "last skb seen on receive queue" argument to sk_wait_data, so
that it sleeps until the receive queue has new skbs.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461
Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258Reported-by: NEnrico Scholz <rh-bugzilla@ensc.de>
Reported-by: NDan Searle <dan@censornet.com>
Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

dfbafc99

25 7月, 2015 1 次提交

cgroup: net_cls: fix false-positive "suspicious RCU usage" · cc9f4daa

由 Konstantin Khlebnikov 提交于 7月 22, 2015

In dev_queue_xmit() net_cls protected with rcu-bh.

[  270.730026] ===============================
[  270.730029] [ INFO: suspicious RCU usage. ]
[  270.730033] 4.2.0-rc3+ #2 Not tainted
[  270.730036] -------------------------------
[  270.730040] include/linux/cgroup.h:353 suspicious rcu_dereference_check() usage!
[  270.730041] other info that might help us debug this:
[  270.730043] rcu_scheduler_active = 1, debug_locks = 1
[  270.730045] 2 locks held by dhclient/748:
[  270.730046]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff81682b70>] __dev_queue_xmit+0x50/0x960
[  270.730085]  #1:  (&qdisc_tx_lock){+.....}, at: [<ffffffff81682d60>] __dev_queue_xmit+0x240/0x960
[  270.730090] stack backtrace:
[  270.730096] CPU: 0 PID: 748 Comm: dhclient Not tainted 4.2.0-rc3+ #2
[  270.730098] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
[  270.730100]  0000000000000001 ffff8800bafeba58 ffffffff817ad487 0000000000000007
[  270.730103]  ffff880232a0a780 ffff8800bafeba88 ffffffff810ca4f2 ffff88022fb23e00
[  270.730105]  ffff880232a0a780 ffff8800bafebb68 ffff8800bafebb68 ffff8800bafebaa8
[  270.730108] Call Trace:
[  270.730121]  [<ffffffff817ad487>] dump_stack+0x4c/0x65
[  270.730148]  [<ffffffff810ca4f2>] lockdep_rcu_suspicious+0xe2/0x120
[  270.730153]  [<ffffffff816a62d2>] task_cls_state+0x92/0xa0
[  270.730158]  [<ffffffffa00b534f>] cls_cgroup_classify+0x4f/0x120 [cls_cgroup]
[  270.730164]  [<ffffffff816aac74>] tc_classify_compat+0x74/0xc0
[  270.730166]  [<ffffffff816ab573>] tc_classify+0x33/0x90
[  270.730170]  [<ffffffffa00bcb0a>] htb_enqueue+0xaa/0x4a0 [sch_htb]
[  270.730172]  [<ffffffff81682e26>] __dev_queue_xmit+0x306/0x960
[  270.730174]  [<ffffffff81682b70>] ? __dev_queue_xmit+0x50/0x960
[  270.730176]  [<ffffffff816834a3>] dev_queue_xmit_sk+0x13/0x20
[  270.730185]  [<ffffffff81787770>] dev_queue_xmit+0x10/0x20
[  270.730187]  [<ffffffff8178b91c>] packet_snd.isra.62+0x54c/0x760
[  270.730190]  [<ffffffff8178be25>] packet_sendmsg+0x2f5/0x3f0
[  270.730203]  [<ffffffff81665245>] ? sock_def_readable+0x5/0x190
[  270.730210]  [<ffffffff817b64bb>] ? _raw_spin_unlock+0x2b/0x40
[  270.730216]  [<ffffffff8173bcbc>] ? unix_dgram_sendmsg+0x5cc/0x640
[  270.730219]  [<ffffffff8165f367>] sock_sendmsg+0x47/0x50
[  270.730221]  [<ffffffff8165f42f>] sock_write_iter+0x7f/0xd0
[  270.730232]  [<ffffffff811fd4c7>] __vfs_write+0xa7/0xf0
[  270.730234]  [<ffffffff811fe5b8>] vfs_write+0xb8/0x190
[  270.730236]  [<ffffffff811fe8c2>] SyS_write+0x52/0xb0
[  270.730239]  [<ffffffff817b6bae>] entry_SYSCALL_64_fastpath+0x12/0x76
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cc9f4daa

21 7月, 2015 1 次提交

net: ratelimit warnings about dst entry refcount underflow or overflow · 8bf4ada2

由 Konstantin Khlebnikov 提交于 7月 17, 2015

Kernel generates a lot of warnings when dst entry reference counter
overflows and becomes negative. That bug was seen several times at
machines with outdated 3.10.y kernels. Most like it's already fixed
in upstream. Anyway that flood completely kills machine and makes
further debugging impossible.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8bf4ada2

16 7月, 2015 3 次提交

net: Fix skb csum races when peeking · 89c22d8c

由 Herbert Xu 提交于 7月 13, 2015

When we calculate the checksum on the recv path, we store the
result in the skb as an optimisation in case we need the checksum
again down the line.

This is in fact bogus for the MSG_PEEK case as this is done without
any locking.  So multiple threads can peek and then store the result
to the same skb, potentially resulting in bogus skb states.

This patch fixes this by only storing the result if the skb is not
shared.  This preserves the optimisations for the few cases where
it can be done safely due to locking or other reasons, e.g., SIOCINQ.
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89c22d8c

net: Clone skb before setting peeked flag · 738ac1eb

由 Herbert Xu 提交于 7月 13, 2015

Shared skbs must not be modified and this is crucial for broadcast
and/or multicast paths where we use it as an optimisation to avoid
unnecessary cloning.

The function skb_recv_datagram breaks this rule by setting peeked
without cloning the skb first.  This causes funky races which leads
to double-free.

This patch fixes this by cloning the skb and replacing the skb
in the list when setting skb->peeked.

Fixes: a59322be ("[UDP]: Only increment counter on first peek/recv")
Reported-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

738ac1eb

rtnetlink: reject non-IFLA_VF_PORT attributes inside IFLA_VF_PORTS · 035d210f

由 Daniel Borkmann 提交于 7月 13, 2015

Similarly as in commit 4f7d2cdf ("rtnetlink: verify IFLA_VF_INFO
attributes before passing them to driver"), we have a double nesting
of netlink attributes, i.e. IFLA_VF_PORTS only contains IFLA_VF_PORT
that is nested itself. While IFLA_VF_PORTS is a verified attribute
from ifla_policy[], we only check if the IFLA_VF_PORTS container has
IFLA_VF_PORT attributes and then pass the attribute's content itself
via nla_parse_nested(). It would be more correct to reject inner types
other than IFLA_VF_PORT instead of continuing parsing and also similarly
as in commit 4f7d2cdf, to check for a minimum of NLA_HDRLEN.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: Scott Feldman <sfeldma@gmail.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

035d210f

11 7月, 2015 2 次提交

net: call rcu_read_lock early in process_backlog · 2c17d27c

由 Julian Anastasov 提交于 7月 09, 2015

Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb->dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =>       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c17d27c

net: do not process device backlog during unregistration · e9e4dd32

由 Julian Anastasov 提交于 7月 09, 2015

commit 381c759d ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev->ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!
Reported-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
Fixes: 6e583ce5 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e9e4dd32

10 7月, 2015 2 次提交

net: pktgen: kill the "Wait for kthread_stop" code in pktgen_thread_worker() · 1fbe4b46

由 Oleg Nesterov 提交于 7月 08, 2015

pktgen_thread_worker() doesn't need to wait for kthread_stop(), it
can simply exit. Just pktgen_create_thread() and pg_net_exit() should
do get_task_struct()/put_task_struct(). kthread_stop(dead_thread) is
fine.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1fbe4b46

net: pktgen: fix race between pktgen_thread_worker() and kthread_stop() · fecdf8be

由 Oleg Nesterov 提交于 7月 08, 2015

pktgen_thread_worker() is obviously racy, kthread_stop() can come
between the kthread_should_stop() check and set_current_state().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Reported-by: NJan Stancek <jstancek@redhat.com>
Reported-by: NMarcelo Leitner <mleitner@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fecdf8be

09 7月, 2015 4 次提交

rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver · 4f7d2cdf

由 Daniel Borkmann 提交于 7月 07, 2015

Jason Gunthorpe reported that since commit c02db8c6 ("rtnetlink: make
SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
anymore with respect to their policy, that is, ifla_vfinfo_policy[].

Before, they were part of ifla_policy[], but they have been nested since
placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
which is another nested attribute for the actual VF attributes such as
IFLA_VF_MAC, IFLA_VF_VLAN, etc.

Despite the policy being split out from ifla_policy[] in this commit,
it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
testing for struct nlattr, but it doesn't know about the data context and
their requirements.

Fix, on top of Jason's initial work, does 1) parsing of the attributes
with the right policy, and 2) using the resulting parsed attribute table
from 1) instead of the nla_for_each_nested() loop (just like we used to
do when still part of ifla_policy[]).

Reference: http://thread.gmane.org/gmane.linux.network/368913
Fixes: c02db8c6 ("rtnetlink: make SR-IOV VF interface symmetric")
Reported-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com>
Cc: Greg Rose <gregory.v.rose@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Rony Efraim <ronye@mellanox.com>
Cc: Vlad Zolotarov <vladz@cloudius-systems.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: NJason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NVlad Zolotarov <vladz@cloudius-systems.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f7d2cdf

Revert "dev: set iflink to 0 for virtual interfaces" · 95ec655b

由 Nicolas Dichtel 提交于 7月 06, 2015

This reverts commit e1622baf.

The side effect of this commit is to add a '@NONE' after each virtual
interface name with a 'ip link'. It may break existing scripts.
Reported-by: NOlivier Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: NOliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

95ec655b

net: graceful exit from netif_alloc_netdev_queues() · d339727c

由 Eric Dumazet 提交于 7月 06, 2015

User space can crash kernel with

ip link add ifb10 numtxqueues 100000 type ifb

We must replace a BUG_ON() by proper test and return -EINVAL for
crazy values.

Fixes: 60877a32 ("net: allow large number of tx queues")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d339727c

net_sched: gen_estimator: extend pps limit · 32f675bb

由 Eric Dumazet 提交于 7月 02, 2015

rate estimators are limited to 4 Mpps, which was fine years ago, but
too small with current hardware generation.

Lets use 2^5 scaling instead of 2^10 to get 128 Mpps new limit.

On 64bit arch, use an "unsigned long" for temp storage and remove limit.
(We do not expect 32bit arches to be able to reach this point)

Tested:

tc -s -d filter sh dev eth0 parent ffff:

filter protocol ip pref 1 u32
filter protocol ip pref 1 u32 fh 800: ht divisor 1
filter protocol ip pref 1 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:15
  match 07000000/ff000000 at 12
	action order 1: gact action drop
	 random type none pass val 0
	 index 1 ref 1 bind 1 installed 166 sec
 	Action statistics:
	Sent 39734251496 bytes 863788076 pkt (dropped 863788117, overlimits 0 requeues 0)
	rate 4067Mbit 11053596pps backlog 0b 0p requeues 0
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

32f675bb

01 7月, 2015 1 次提交

sock_diag: don't broadcast kernel sockets · b922622e

由 Craig Gallek 提交于 6月 30, 2015

Kernel sockets do not hold a reference for the network namespace to
which they point.  Socket destruction broadcasting relies on the
network namespace and will cause the splat below when a kernel socket
is destroyed.

This fix simply ignores kernel sockets when they are destroyed.

Reported as:
general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9130 Comm: kworker/1:1 Not tainted 4.1.0-gelk-debug+ #1
Workqueue: sock_diag_events sock_diag_broadcast_destroy_work
Stack:
 ffff8800b9c586c0 ffff8800b9c586c0 ffff8800ac4692c0 ffff8800936d4a90
 ffff8800352efd38 ffffffff8469a93e ffff8800352efd98 ffffffffc09b9b90
 ffff8800352efd78 ffff8800ac4692c0 ffff8800b9c586c0 ffff8800831b6ab8
Call Trace:
 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
 [<ffffffffc09b9b90>] ? inet_diag_handler_get_info+0x110/0x1fb [inet_diag]
 [<ffffffff845c868d>] netlink_broadcast+0x1d/0x20
 [<ffffffff8469a93e>] ? mutex_unlock+0xe/0x10
 [<ffffffff845b2bf5>] sock_diag_broadcast_destroy_work+0xd5/0x160
 [<ffffffff8408ea97>] process_one_work+0x147/0x420
 [<ffffffff8408f0f9>] worker_thread+0x69/0x470
 [<ffffffff8409fda3>] ? preempt_count_sub+0xa3/0xf0
 [<ffffffff8408f090>] ? rescuer_thread+0x320/0x320
 [<ffffffff84093cd7>] kthread+0x107/0x120
 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0
 [<ffffffff8469d31f>] ret_from_fork+0x3f/0x70
 [<ffffffff84093bd0>] ? kthread_create_on_node+0x1b0/0x1b0

Tested:
  Using a debug kernel while 'ss -E' is running:
  ip netns add test-ns
  ip netns delete test-ns

Fixes: eb4cb008 sock_diag: define destruction multicast groups
Fixes: 26abe143 net: Modify sk_alloc to not reference count the
  netns of kernel sockets.
Reported-by: NDave Jones <davej@codemonkey.org.uk>
Suggested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NCraig Gallek <kraig@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b922622e

29 6月, 2015 2 次提交

net: Kill sock->sk_protinfo · 1830fcea

由 David Miller 提交于 6月 25, 2015

No more users, so it can now be removed.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1830fcea

flow_dissector: Pre-initialize ip_proto in __skb_flow_dissect() · 8e690ffd

由 Geert Uytterhoeven 提交于 6月 25, 2015

net/core/flow_dissector.c: In function ‘__skb_flow_dissect’:
net/core/flow_dissector.c:132: warning: ‘ip_proto’ may be used uninitialized in this function
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Acked-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8e690ffd

23 6月, 2015 1 次提交

switchdev; add VLAN support for port's bridge_getlink · 7d4f8d87

由 Scott Feldman 提交于 6月 22, 2015

One more missing piece of the puzzle.  Add vlan dump support to switchdev
port's bridge_getlink.  iproute2 "bridge vlan show" cmd already knows how
to show the vlans installed on the bridge and the device , but (until now)
no one implemented the port vlan part of the netlink PF_BRIDGE:RTM_GETLINK
msg.  Before this patch, "bridge vlan show":

	$ bridge -c vlan show
	port    vlan ids
	sw1p1    30-34			<< bridge side vlans
		 57

	sw1p1				<< device side vlans (missing)

	sw1p2    57

	sw1p2

	sw1p3

	sw1p4

	br0     None

(When the port is bridged, the output repeats the vlan list for the vlans
on the bridge side of the port and the vlans on the device side of the
port.  The listing above show no vlans for the device side even though they
are installed).

After this patch:

	$ bridge -c vlan show
	port    vlan ids
	sw1p1    30-34			<< bridge side vlan
		 57

	sw1p1    30-34			<< device side vlans
		 57
		 3840 PVID

	sw1p2    57

	sw1p2    57
		 3840 PVID

	sw1p3    3842 PVID

	sw1p4    3843 PVID

	br0     None

I re-used ndo_dflt_bridge_getlink to add vlan fill call-back func.
switchdev support adds an obj dump for VLAN objects, using the same
call-back scheme as FDB dump.  Support included for both compressed and
un-compressed vlan dumps.
Signed-off-by: NScott Feldman <sfeldma@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7d4f8d87

22 6月, 2015 1 次提交

neigh: do not modify unlinked entries · 2c51a97f

由 Julian Anastasov 提交于 6月 16, 2015

The lockless lookups can return entry that is unlinked.
Sometimes they get reference before last neigh_cleanup_and_release,
sometimes they do not need reference. Later, any
modification attempts may result in the following problems:

1. entry is not destroyed immediately because neigh_update
can start the timer for dead entry, eg. on change to NUD_REACHABLE
state. As result, entry lives for some time but is invisible
and out of control.

2. __neigh_event_send can run in parallel with neigh_destroy
while refcnt=0 but if timer is started and expired refcnt can
reach 0 for second time leading to second neigh_destroy and
possible crash.

Thanks to Eric Dumazet and Ying Xue for their work and analyze
on the __neigh_event_send change.

Fixes: 767e97e1 ("neigh: RCU conversion of struct neighbour")
Fixes: a263b309 ("ipv4: Make neigh lookups directly in output packet path.")
Fixes: 6fd6ce20 ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: NJulian Anastasov <ja@ssi.bg>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c51a97f

16 6月, 2015 5 次提交

bpf: disallow bpf tc programs access current->pid,uid · 89d256bb

由 Alexei Starovoitov 提交于 6月 15, 2015

Accessing current->pid/uid from cls_bpf may lead to misleading results and
should not be used when TC classifiers need accurate information about pid/uid.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

89d256bb

sock_diag: define destruction multicast groups · eb4cb008

由 Craig Gallek 提交于 6月 15, 2015

These groups will contain socket-destruction events for
AF_INET/AF_INET6, IPPROTO_TCP/IPPROTO_UDP.

Near the end of socket destruction, a check for listeners is
performed.  In the presence of a listener, rather than completely
cleanup the socket, a unit of work will be added to a private
work queue which will first broadcast information about the socket
and then finish the cleanup operation.
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb4cb008

net/core: Add reading VF statistics through the PF netdevice · 3b766cd8

由 Eran Ben Elisha 提交于 6月 15, 2015

Add ndo_get_vf_stats where the PF retrieves and fills the VFs traffic
statistics. We encode the VF stats in a nested manner to allow for
future extensions.
Signed-off-by: NEran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b766cd8

bpf: allow networking programs to use bpf_trace_printk() for debugging · 0756ea3e

由 Alexei Starovoitov 提交于 6月 12, 2015

bpf_trace_printk() is a helper function used to debug eBPF programs.
Let socket and TC programs use it as well.
Note, it's DEBUG ONLY helper. If it's used in the program,
the kernel will print warning banner to make sure users don't use
it in production.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0756ea3e

bpf: introduce current->pid, tgid, uid, gid, comm accessors · ffeedafb

由 Alexei Starovoitov 提交于 6月 12, 2015

eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:

u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid

u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid

bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf

They can be used from the programs attached to TC as well to classify packets
based on current task fields.

Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ffeedafb

13 6月, 2015 3 次提交

flow_dissector: fix ipv6 dst, hop-by-hop and routing ext hdrs · 1e98a0f0

由 Eric Dumazet 提交于 6月 12, 2015

__skb_header_pointer() returns a pointer that must be checked.

Fixes infinite loop reported by Alexei, and add __must_check to
catch these errors earlier.

Fixes: 6a74fcf4 ("flow_dissector: add support for dst, hop-by-hop and routing ext hdrs")
Reported-by: NAlexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: NAlexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e98a0f0

flow_dissector: add support for dst, hop-by-hop and routing ext hdrs · 6a74fcf4

由 Tom Herbert 提交于 6月 12, 2015

If dst, hop-by-hop or routing extension headers are present determine
length of the options and skip over them in flow dissection.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6a74fcf4

flow_dissector: Fix MPLS entropy label handling in flow dissector · 611d23c5

由 Tom Herbert 提交于 6月 12, 2015

Need to shift after masking to get label value for comparison.

Fixes: b3baa0fb ("mpls: Add MPLS entropy label in flow_keys")
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

611d23c5

12 6月, 2015 1 次提交

net: don't wait for order-3 page allocation · fb05e7a8

由 Shaohua Li 提交于 6月 11, 2015

We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f768
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.

This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.

alloc_skb_with_frags is the same.

The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.

V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer

Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb05e7a8

11 6月, 2015 2 次提交

net/ethtool: Add current supported tunable options · a4244b0c

由 Hadar Hen Zion 提交于 6月 11, 2015

Add strings array of the current supported tunable options.
Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
Reviewed-by: NAmir Vadai <amirv@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4244b0c

net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap · 5d753610

由 Mel Gorman 提交于 6月 10, 2015

Jeff Layton reported the following;

 [   74.232485] ------------[ cut here ]------------
 [   74.233354] WARNING: CPU: 2 PID: 754 at net/core/sock.c:364 sk_clear_memalloc+0x51/0x80()
 [   74.234790] Modules linked in: cts rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs libcrc32c snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device nfsd snd_pcm snd_timer snd e1000 ppdev parport_pc joydev parport pvpanic soundcore floppy serio_raw i2c_piix4 pcspkr nfs_acl lockd virtio_balloon acpi_cpufreq auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci ata_generic virtio_ring pata_acpi virtio
 [   74.243599] CPU: 2 PID: 754 Comm: swapoff Not tainted 4.1.0-rc6+ #5
 [   74.244635] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 [   74.245546]  0000000000000000 0000000079e69e31 ffff8800d066bde8 ffffffff8179263d
 [   74.246786]  0000000000000000 0000000000000000 ffff8800d066be28 ffffffff8109e6fa
 [   74.248175]  0000000000000000 ffff880118d48000 ffff8800d58f5c08 ffff880036e380a8
 [   74.249483] Call Trace:
 [   74.249872]  [<ffffffff8179263d>] dump_stack+0x45/0x57
 [   74.250703]  [<ffffffff8109e6fa>] warn_slowpath_common+0x8a/0xc0
 [   74.251655]  [<ffffffff8109e82a>] warn_slowpath_null+0x1a/0x20
 [   74.252585]  [<ffffffff81661241>] sk_clear_memalloc+0x51/0x80
 [   74.253519]  [<ffffffffa0116c72>] xs_disable_swap+0x42/0x80 [sunrpc]
 [   74.254537]  [<ffffffffa01109de>] rpc_clnt_swap_deactivate+0x7e/0xc0 [sunrpc]
 [   74.255610]  [<ffffffffa03e4fd7>] nfs_swap_deactivate+0x27/0x30 [nfs]
 [   74.256582]  [<ffffffff811e99d4>] destroy_swap_extents+0x74/0x80
 [   74.257496]  [<ffffffff811ecb52>] SyS_swapoff+0x222/0x5c0
 [   74.258318]  [<ffffffff81023f27>] ? syscall_trace_leave+0xc7/0x140
 [   74.259253]  [<ffffffff81798dae>] system_call_fastpath+0x12/0x71
 [   74.260158] ---[ end trace 2530722966429f10 ]---

The warning in question was unnecessary but with Jeff's series the rules
are also clearer.  This patch removes the warning and updates the comment
to explain why sk_mem_reclaim() may still be called.

[jlayton: remove if (sk->sk_forward_alloc) conditional. As Leon
          points out that it's not needed.]

Cc: Leon Romanovsky <leon@leon.nu>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5d753610

09 6月, 2015 1 次提交

net: replace last open coded skb_orphan_frags with function call · bbbf2df0

由 Willem de Bruijn 提交于 6月 08, 2015

Commit 70008aa5 ("skbuff: convert to skb_orphan_frags") replaced
open coded tests of SKBTX_DEV_ZEROCOPY and skb_copy_ubufs with calls
to helper function skb_orphan_frags. Apply that to the last remaining
open coded site.
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bbbf2df0

07 6月, 2015 2 次提交

bpf: allow programs to write to certain skb fields · d691f9e8

由 Alexei Starovoitov 提交于 6月 04, 2015

allow programs read/write skb->mark, tc_index fields and
((struct qdisc_skb_cb *)cb)->data.

mark and tc_index are generically useful in TC.
cb[0]-cb[4] are primarily used to pass arguments from one
program to another called via bpf_tail_call() which can
be seen in sockex3_kern.c example.

All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
mark, tc_index are writeable from tc_cls_act only.
cb[0]-cb[4] are writeable by both sockets and tc_cls_act.

Add verifier tests and improve sample code.
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d691f9e8

bpf: make programs see skb->data == L2 for ingress and egress · 3431205e

由 Alexei Starovoitov 提交于 6月 04, 2015

eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
For ingress L2 header is already pulled, whereas for egress it's present.
This is known to program writers which are currently forced to use
BPF_LL_OFF workaround.
Since programs don't change skb internal pointers it is safe to do
pull/push right around invocation of the program and earlier taps and
later pt->func() will not be affected.
Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
around run_filter/BPF_PROG_RUN even if skb_shared.

This fix finally allows programs to use optimized LD_ABS/IND instructions
without BPF_LL_OFF for higher performance.
tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
       w/o JIT   w/JIT
before  20.5     23.6 Mpps
after   21.8     26.6 Mpps

Old programs with BPF_LL_OFF will still work as-is.

We can now undo most of the earlier workaround commit:
a166151c ("bpf: fix bpf helpers to use skb->mac_header relative offsets")
Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3431205e

05 6月, 2015 2 次提交

mpls: Add MPLS entropy label in flow_keys · b3baa0fb

由 Tom Herbert 提交于 6月 04, 2015

In flow dissector if an MPLS header contains an entropy label this is
saved in the new keyid field of flow_keys. The entropy label is
then represented in the flow hash function input.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b3baa0fb

net: Add GRE keyid in flow_keys · 1fdd512c

由 Tom Herbert 提交于 6月 04, 2015

In flow dissector if a GRE header contains a keyid this is saved in the
new keyid field of flow_keys. The GRE keyid is then represented
in the flow hash function input.
Signed-off-by: NTom Herbert <tom@herbertland.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1fdd512c