提交 · d0c294c53a771ae7e84506dfbd8c18c30f078735 · openeuler / raspberrypi-kernel

12 3月, 2015 3 次提交

sock: fix possible NULL sk dereference in __skb_tstamp_tx · 3a8dd971

由 Willem de Bruijn 提交于 3月 11, 2015

Test that sk != NULL before reading sk->sk_tsflags.

Fixes: 49ca0d8b ("net-timestamp: no-payload option")
Reported-by: NOne Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3a8dd971

xps: must clear sender_cpu before forwarding · c29390c6

由 Eric Dumazet 提交于 3月 11, 2015

John reported that my previous commit added a regression
on his router.

This is because sender_cpu & napi_id share a common location,
so get_xps_queue() can see garbage and perform an out of bound access.

We need to make sure sender_cpu is cleared before doing the transmit,
otherwise any NIC busy poll enabled (skb_mark_napi_id()) can trigger
this bug.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NJohn <jw@nuclearfallout.net>
Bisected-by: NJohn <jw@nuclearfallout.net>
Fixes: 2bd82484 ("xps: fix xps for stacked devices")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c29390c6

net: sysctl_net_core: check SNDBUF and RCVBUF for min length · b1cb59cf

由 Alexey Kodanev 提交于 3月 11, 2015

sysctl has sysctl.net.core.rmem_*/wmem_* parameters which can be
set to incorrect values. Given that 'struct sk_buff' allocates from
rcvbuf, incorrectly set buffer length could result to memory
allocation failures. For example, set them as follows:

    # sysctl net.core.rmem_default=64
      net.core.wmem_default = 64
    # sysctl net.core.wmem_default=64
      net.core.wmem_default = 64
    # ping localhost -s 1024 -i 0 > /dev/null

This could result to the following failure:

skbuff: skb_over_panic: text:ffffffff81628db4 len:-32 put:-32
head:ffff88003a1cc200 data:ffff88003a1cc200 tail:0xffffffe0 end:0xc0 dev:<NULL>
kernel BUG at net/core/skbuff.c:102!
invalid opcode: 0000 [#1] SMP
...
task: ffff88003b7f5550 ti: ffff88003ae88000 task.ti: ffff88003ae88000
RIP: 0010:[<ffffffff8155fbd1>]  [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
RSP: 0018:ffff88003ae8bc68  EFLAGS: 00010296
RAX: 000000000000008d RBX: 00000000ffffffe0 RCX: 0000000000000000
RDX: ffff88003fdcf598 RSI: ffff88003fdcd9c8 RDI: ffff88003fdcd9c8
RBP: ffff88003ae8bc88 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 00000000000002b2 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88003d3f7300 R15: ffff88000012a900
FS:  00007fa0e2b4a840(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000d0f7e0 CR3: 000000003b8fb000 CR4: 00000000000006f0
Stack:
 ffff88003a1cc200 00000000ffffffe0 00000000000000c0 ffffffff818cab1d
 ffff88003ae8bd68 ffffffff81628db4 ffff88003ae8bd48 ffff88003b7f5550
 ffff880031a09408 ffff88003b7f5550 ffff88000012aa48 ffff88000012ab00
Call Trace:
 [<ffffffff81628db4>] unix_stream_sendmsg+0x2c4/0x470
 [<ffffffff81556f56>] sock_write_iter+0x146/0x160
 [<ffffffff811d9612>] new_sync_write+0x92/0xd0
 [<ffffffff811d9cd6>] vfs_write+0xd6/0x180
 [<ffffffff811da499>] SyS_write+0x59/0xd0
 [<ffffffff81651532>] system_call_fastpath+0x12/0x17
Code: 00 00 48 89 44 24 10 8b 87 c8 00 00 00 48 89 44 24 08 48 8b 87 d8 00
      00 00 48 c7 c7 30 db 91 81 48 89 04 24 31 c0 e8 4f a8 0e 00 <0f> 0b
      eb fe 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83
RIP  [<ffffffff8155fbd1>] skb_put+0xa1/0xb0
RSP <ffff88003ae8bc68>
Kernel panic - not syncing: Fatal exception

Moreover, the possible minimum is 1, so we can get another kernel panic:
...
BUG: unable to handle kernel paging request at ffff88013caee5c0
IP: [<ffffffff815604cf>] __alloc_skb+0x12f/0x1f0
...
Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1cb59cf

11 3月, 2015 2 次提交

net: Handle unregister properly when netdev namespace change fails. · 43638900

由 David S. Miller 提交于 3月 10, 2015

If rtnl_newlink() fails on it's call to dev_change_net_namespace(), we
have to make use of the ->dellink() method, if present, just like we
do when rtnl_configure_link() fails.

Fixes: 317f4810 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

43638900

net: add comment for sock_efree() usage · 7768eed8

由 Oliver Hartkopp 提交于 3月 10, 2015

Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
Acked-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7768eed8

01 3月, 2015 3 次提交

net: do not use rcu in rtnl_dump_ifinfo() · cac5e65e

由 Eric Dumazet 提交于 2月 27, 2015

We did a failed attempt in the past to only use rcu in rtnl dump
operations (commit e67f88dd "net: dont hold rtnl mutex during
netlink dump callbacks")

Now that dumps are holding RTNL anyway, there is no need to also
use rcu locking, as it forbids any scheduling ability, like
GFP_KERNEL allocations that controlling path should use instead
of GFP_ATOMIC whenever possible.

This should fix following splat Cong Wang reported :

 [ INFO: suspicious RCU usage. ]
 3.19.0+ #805 Tainted: G        W

 include/linux/rcupdate.h:538 Illegal context switch in RCU read-side critical section!

 other info that might help us debug this:

 rcu_scheduler_active = 1, debug_locks = 0
 2 locks held by ip/771:
  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff8182b8f4>] netlink_dump+0x21/0x26c
  #1:  (rcu_read_lock){......}, at: [<ffffffff817d785b>] rcu_read_lock+0x0/0x6e

 stack backtrace:
 CPU: 3 PID: 771 Comm: ip Tainted: G        W       3.19.0+ #805
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000001 ffff8800d51e7718 ffffffff81a27457 0000000029e729e6
  ffff8800d6108000 ffff8800d51e7748 ffffffff810b539b ffffffff820013dd
  00000000000001c8 0000000000000000 ffff8800d7448088 ffff8800d51e7758
 Call Trace:
  [<ffffffff81a27457>] dump_stack+0x4c/0x65
  [<ffffffff810b539b>] lockdep_rcu_suspicious+0x107/0x110
  [<ffffffff8109796f>] rcu_preempt_sleep_check+0x45/0x47
  [<ffffffff8109e457>] ___might_sleep+0x1d/0x1cb
  [<ffffffff8109e67d>] __might_sleep+0x78/0x80
  [<ffffffff814b9b1f>] idr_alloc+0x45/0xd1
  [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
  [<ffffffff814b9f9d>] ? idr_for_each+0x53/0x101
  [<ffffffff817c1383>] alloc_netid+0x61/0x69
  [<ffffffff817c14c3>] __peernet2id+0x79/0x8d
  [<ffffffff817c1ab7>] peernet2id+0x13/0x1f
  [<ffffffff817d8673>] rtnl_fill_ifinfo+0xa8d/0xc20
  [<ffffffff810b17d9>] ? __lock_is_held+0x39/0x52
  [<ffffffff817d894f>] rtnl_dump_ifinfo+0x149/0x213
  [<ffffffff8182b9c2>] netlink_dump+0xef/0x26c
  [<ffffffff8182bcba>] netlink_recvmsg+0x17b/0x2c5
  [<ffffffff817b0adc>] __sock_recvmsg+0x4e/0x59
  [<ffffffff817b1b40>] sock_recvmsg+0x3f/0x51
  [<ffffffff817b1f9a>] ___sys_recvmsg+0xf6/0x1d9
  [<ffffffff8115dc67>] ? handle_pte_fault+0x6e1/0xd3d
  [<ffffffff8100a3a0>] ? native_sched_clock+0x35/0x37
  [<ffffffff8109f45b>] ? sched_clock_local+0x12/0x72
  [<ffffffff8109f6ac>] ? sched_clock_cpu+0x9e/0xb7
  [<ffffffff810cb7ab>] ? rcu_read_lock_held+0x3b/0x3d
  [<ffffffff811abde8>] ? __fcheck_files+0x4c/0x58
  [<ffffffff811ac556>] ? __fget_light+0x2d/0x52
  [<ffffffff817b376f>] __sys_recvmsg+0x42/0x60
  [<ffffffff817b379f>] SyS_recvmsg+0x12/0x1c
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 0c7aecd4 ("netns: add rtnl cmd to add and get peer netns ids")
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cac5e65e

net: Verify permission to link_net in newlink · 06615bed

由 Eric W. Biederman 提交于 2月 26, 2015

When applicable verify that the caller has permisson to the underlying
network namespace for a newly created network device.

Similary checks exist for the network namespace a network device will
be created in.

Fixes: 317f4810 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

06615bed

net: Verify permission to dest_net in newlink · 505ce415

由 Eric W. Biederman 提交于 2月 26, 2015

When applicable verify that the caller has permision to create a
network device in another network namespace. This check is already
present when moving a network device between network namespaces in
setlink so all that is needed is to duplicate that check in newlink.

This change almost backports cleanly, but there are context conflicts
as the code that follows was added in v4.0-rc1

Fixes: b51642f6 net: Enable a userns root rtnl calls that are safe for unprivilged users
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

505ce415

25 2月, 2015 1 次提交

rtnetlink: avoid 0 sized arrays · 4e10fd5b

由 Sasha Levin 提交于 2月 24, 2015

Arrays (when not in a struct) "shall have a value greater than zero".

GCC complains when it's not the case here.

Fixes: ba7d49b1 ("rtnetlink: provide api for getting and setting slave info")
Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e10fd5b

23 2月, 2015 1 次提交

net: pktgen: disable xmit_clone on virtual devices · 52d6c8c6

由 Eric Dumazet 提交于 2月 22, 2015

Trying to use burst capability (aka xmit_more) on a virtual device
like bonding is not supported.

For example, skb might be queued multiple times on a qdisc, with
various list corruptions.

Fixes: 38b2cf29 ("net: pktgen: packet bursting via skb->xmit_more")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52d6c8c6

22 2月, 2015 1 次提交

net: reject creation of netdev names with colons · a4176a93

由 Matthew Thode 提交于 2月 17, 2015

colons are used as a separator in netdev device lookup in dev_ioctl.c

Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME
Signed-off-by: NMatthew Thode <mthode@mthode.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a4176a93

21 2月, 2015 2 次提交

ethtool: Add hw-switch-offload to netdev_features_strings. · 65e9256c

由 Rami Rosen 提交于 2月 20, 2015

commit aafb3e98 (netdev: introduce new NETIF_F_HW_SWITCH_OFFLOAD feature
flag for switch device offloads) add a new feature without adding it to
netdev_features_strings array; this patch fixes this.
Signed-off-by: NRami Rosen <ramirose@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

65e9256c

sock: sock_dequeue_err_skb() needs hard irq safety · 997d5c3f

由 Eric Dumazet 提交于 2月 18, 2015

Non NAPI drivers can call skb_tstamp_tx() and then sock_queue_err_skb()
from hard IRQ context.

Therefore, sock_dequeue_err_skb() needs to block hard irq or
corruptions or hangs can happen.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 364a9e93 ("sock: deduplicate errqueue dequeue")
Fixes: cb820f8e ("net: Provide a generic socket error queue delivery method for Tx time stamps.")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

997d5c3f

20 2月, 2015 1 次提交

gen_stats.c: Duplicate xstats buffer for later use · 1c4cff0c

由 Ignacy Gawędzki 提交于 2月 13, 2015

The gnet_stats_copy_app() function gets called, more often than not, with its
second argument a pointer to an automatic variable in the caller's stack.
Therefore, to avoid copying garbage afterwards when calling
gnet_stats_finish_copy(), this data is better copied to a dynamically allocated
memory that gets freed after use.

[xiyou.wangcong@gmail.com: remove a useless kfree()]
Signed-off-by: NIgnacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c4cff0c

16 2月, 2015 1 次提交

rtnetlink: call ->dellink on failure when ->newlink exists · 7afb8886

由 WANG Cong 提交于 2月 13, 2015

Ignacy reported that when eth0 is down and add a vlan device
on top of it like:

  ip link add link eth0 name eth0.1 up type vlan id 1

We will get a refcount leak:

  unregister_netdevice: waiting for eth0.1 to become free. Usage count = 2

The problem is when rtnl_configure_link() fails in rtnl_newlink(),
we simply call unregister_device(), but for stacked device like vlan,
we almost do nothing when we unregister the upper device, more work
is done when we unregister the lower device, so call its ->dellink().
Reported-by: NIgnacy Gawedzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7afb8886

15 2月, 2015 2 次提交

net: spelling fixes · ca9f1fd2

由 Stephen Hemminger 提交于 2月 14, 2015

Spelling errors caught by codespell.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca9f1fd2

net/core: Fix warning while make xmldocs caused by dev.c · 4a26e453

由 Masanari Iida 提交于 2月 14, 2015

This patch fix following warning wile make xmldocs.

  Warning(.//net/core/dev.c:5345): No description found
  for parameter 'bonding_info'
  Warning(.//net/core/dev.c:5345): Excess function parameter
  'netdev_bonding_info' description in 'netdev_bonding_info_change'

This warning starts to appear after following patch was added
into Linus's tree during merger period.

commit 61bd3857
net/core: Add event for a change in slave state
Signed-off-by: NMasanari Iida <standby24x7@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4a26e453

14 2月, 2015 1 次提交

net: use %*pb[l] to print bitmaps including cpumasks and nodemasks · f0906827

由 Tejun Heo 提交于 2月 13, 2015

printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f0906827

12 2月, 2015 1 次提交

net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload · 15e2396d

由 Tom Herbert 提交于 2月 10, 2015

This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.

Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload.  Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.
Signed-off-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

15e2396d

09 2月, 2015 2 次提交

net:rfs: adjust table size checking · 93c1af6c

由 Eric Dumazet 提交于 2月 08, 2015

Make sure root user does not try something stupid.

Also make sure mask field in struct rps_sock_flow_table
does not share a cache line with the potentially often dirtied
flow table.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Fixes: 567e4b79 ("net: rfs: add hash collision detection")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

93c1af6c

net: rfs: add hash collision detection · 567e4b79

由 Eric Dumazet 提交于 2月 06, 2015

Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.

Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)

This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.

I use a 32bit value, and dynamically split it in two parts.

For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.

Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.

If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).

This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.

This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NTom Herbert <therbert@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

567e4b79

08 2月, 2015 2 次提交

net: use netif_rx_ni() from process context · 91e83133

由 Eric Dumazet 提交于 2月 05, 2015

Hotpluging a cpu might be rare, yet we have to use proper
handlers when taking over packets found in backlog queues.

dev_cpu_callback() runs from process context, thus we should
call netif_rx_ni() to properly invoke softirq handler.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91e83133

rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY · 364d5716

由 Daniel Borkmann 提交于 2月 05, 2015

ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].

The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.

The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.

Fixes: ebc08a6f ("rtnetlink: Add VF config code to rtnetlink")
Cc: Mitch Williams <mitch.a.williams@intel.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

364d5716

06 2月, 2015 2 次提交

pktgen: fix UDP checksum computation · 7744b5f3

由 Sabrina Dubroca 提交于 2月 04, 2015

This patch fixes two issues in UDP checksum computation in pktgen.

First, the pseudo-header uses the source and destination IP
addresses. Currently, the ports are used for IPv4.

Second, the UDP checksum covers both header and data.  So we need to
generate the data earlier (move pktgen_finalize_skb up), and compute
the checksum for UDP header + data.

Fixes: c26bf4a5 ("pktgen: Add UDPCSUM flag to support UDP checksums")
Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
Acked-by: NThomas Graf <tgraf@suug.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7744b5f3

flowcache: Fix kernel panic in flow_cache_flush_task · 233c96fc

由 Miroslav Urbanek 提交于 2月 05, 2015

flow_cache_flush_task references a structure member flow_cache_gc_work
where it should reference flow_cache_flush_task instead.

Kernel panic occurs on kernels using IPsec during XFRM garbage
collection. The garbage collection interval can be shortened using the
following sysctl settings:

net.ipv4.xfrm4_gc_thresh=4
net.ipv6.xfrm6_gc_thresh=4

With the default settings, our productions servers crash approximately
once a week. With the settings above, they crash immediately.

Fixes: ca925cf1 ("flowcache: Make flow cache name space aware")
Reported-by: NTomáš Charvát <tc@excello.cz>
Tested-by: NJan Hejl <jh@excello.cz>
Signed-off-by: NMiroslav Urbanek <mu@miroslavurbanek.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

233c96fc

05 2月, 2015 3 次提交

net: remove some sparse warnings · 2ce1ee17

由 Eric Dumazet 提交于 2月 04, 2015

netdev_adjacent_add_links() and netdev_adjacent_del_links()
are static.

queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER()
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2ce1ee17

net/core: Add event for a change in slave state · 61bd3857

由 Moni Shoua 提交于 2月 03, 2015

Add event which provides an indication on a change in the state
of a bonding slave. The event handler should cast the pointer to the
appropriate type (struct netdev_bonding_info) in order to get the
full info about the slave.
Signed-off-by: NMoni Shoua <monis@mellanox.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61bd3857

xps: fix xps for stacked devices · 2bd82484

由 Eric Dumazet 提交于 2月 03, 2015

A typical qdisc setup is the following :

bond0 : bonding device, using HTB hierarchy
eth1/eth2 : slaves, multiqueue NIC, using MQ + FQ qdisc

XPS allows to spread packets on specific tx queues, based on the cpu
doing the send.

Problem is that dequeues from bond0 qdisc can happen on random cpus,
due to the fact that qdisc_run() can dequeue a batch of packets.

CPUA -> queue packet P1 on bond0 qdisc, P1->ooo_okay=1
CPUA -> queue packet P2 on bond0 qdisc, P2->ooo_okay=0

CPUB -> dequeue packet P1 from bond0
        enqueue packet on eth1/eth2
CPUC -> dequeue packet P2 from bond0
        enqueue packet on eth1/eth2 using sk cache (ooo_okay is 0)

get_xps_queue() then might select wrong queue for P1, since current cpu
might be different than CPUA.

P2 might be sent on the old queue (stored in sk->sk_tx_queue_mapping),
if CPUC runs a bit faster (or CPUB spins a bit on qdisc lock)

Effect of this bug is TCP reorders, and more generally not optimal
TX queue placement. (A victim bulk flow can be migrated to the wrong TX
queue for a while)

To fix this, we have to record sender cpu number the first time
dev_queue_xmit() is called for one tx skb.

We can union napi_id (used on receive path) and sender_cpu,
granted we clear sender_cpu in skb_scrub_packet() (credit to Willem for
this union idea)
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2bd82484

04 2月, 2015 1 次提交
- A
  net: bury net/core/iovec.c - nothing in there is used anymore · 31a25fae
  由 Al Viro 提交于 11月 28, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  31a25fae
03 2月, 2015 2 次提交

net-timestamp: no-payload only sysctl · b245be1f

由 Willem de Bruijn 提交于 1月 30, 2015

Tx timestamps are looped onto the error queue on top of an skb. This
mechanism leaks packet headers to processes unless the no-payload
options SOF_TIMESTAMPING_OPT_TSONLY is set.

Add a sysctl that optionally drops looped timestamp with data. This
only affects processes without CAP_NET_RAW.

The policy is checked when timestamps are generated in the stack.
It is possible for timestamps with data to be reported after the
sysctl is set, if these were queued internally earlier.

No vulnerability is immediately known that exploits knowledge
gleaned from packet headers, but it may still be preferable to allow
administrators to lock down this path at the cost of possible
breakage of legacy applications.
Signed-off-by: NWillem de Bruijn <willemb@google.com>

----

Changes
  (v1 -> v2)
  - test socket CAP_NET_RAW instead of capable(CAP_NET_RAW)
  (rfc -> v1)
  - document the sysctl in Documentation/sysctl/net.txt
  - fix access control race: read .._OPT_TSONLY only once,
        use same value for permission check and skb generation.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b245be1f

net-timestamp: no-payload option · 49ca0d8b

由 Willem de Bruijn 提交于 1月 30, 2015

Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
timestamps, this loops timestamps on top of empty packets.

Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
cmsg reception (aside from timestamps) are no longer possible. This
works together with a follow on patch that allows administrators to
only allow tx timestamping if it does not loop payload or metadata.
Signed-off-by: NWillem de Bruijn <willemb@google.com>

----

Changes (rfc -> v1)
  - add documentation
  - remove unnecessary skb->len test (thanks to Richard Cochran)
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49ca0d8b

02 2月, 2015 1 次提交

bridge: add flags argument to ndo_bridge_setlink and ndo_bridge_dellink · add511b3

由 Roopa Prabhu 提交于 1月 29, 2015

bridge flags are needed inside ndo_bridge_setlink/dellink handlers to
avoid another call to parse IFLA_AF_SPEC inside these handlers

This is used later in this series
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

add511b3

31 1月, 2015 1 次提交

net: Fix vlan_get_protocol for stacked vlan · d4bcef3f

由 Toshiaki Makita 提交于 1月 29, 2015

vlan_get_protocol() could not get network protocol if a skb has a 802.1ad
vlan tag or multiple vlans, which caused incorrect checksum calculation
in several drivers.

Fix vlan_get_protocol() to retrieve network protocol instead of incorrect
vlan protocol.

As the logic is the same as skb_network_protocol(), create a common helper
function __vlan_get_protocol() and call it from existing functions.
Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4bcef3f

30 1月, 2015 2 次提交

dev: add per net_device packet type chains · 7866a621

由 Salam Noureddine 提交于 1月 27, 2015

When many pf_packet listeners are created on a lot of interfaces the
current implementation using global packet type lists scales poorly.
This patch adds per net_device packet type lists to fix this problem.

The patch was originally written by Eric Biederman for linux-2.6.29.
Tested on linux-3.16.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NSalam Noureddine <noureddine@arista.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7866a621

rtnetlink: pass link_net to the newlink handler · 7b4ce694

由 Nicolas Dichtel 提交于 1月 27, 2015

When IFLA_LINK_NETNSID is used, the netdevice should be built in this link netns
and moved at the end to another netns (pointed by the socket netns or
IFLA_NET_NS_[PID|FD]).

Existing user of the newlink handler will use the netns argument (src_net) to
find a link netdevice or to check some other information into the link netns.
For example, to find a netdevice, two information are required: an ifindex
(usually from IFLA_LINK) and a netns (this link netns).

Note: when using IFLA_LINK_NETNSID and IFLA_NET_NS_[PID|FD], a user may create a
netdevice that stands in netnsX and with its link part in netnsY, by sending a
rtnl message from netnsZ.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7b4ce694

29 1月, 2015 1 次提交

bridge: dont send notification when skb->len == 0 in rtnl_bridge_notify · 59ccaaaa

由 Roopa Prabhu 提交于 1月 28, 2015

Reported in: https://bugzilla.kernel.org/show_bug.cgi?id=92081

This patch avoids calling rtnl_notify if the device ndo_bridge_getlink
handler does not return any bytes in the skb.

Alternately, the skb->len check can be moved inside rtnl_notify.

For the bridge vlan case described in 92081, there is also a fix needed
in bridge driver to generate a proper notification. Will fix that in
subsequent patch.

v2: rebase patch on net tree
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

59ccaaaa

27 1月, 2015 1 次提交

flow_dissector: add tipc support · 08bfc9cb

由 Erik Hugne 提交于 1月 22, 2015

The flows are hashed on the sending node address, which allows us
to spread out the TIPC link processing to RPS enabled cores. There
is no point to include the destination address in the hash as that
will always be the same for all inbound links. We have experimented
with a 3-tuple hash over [srcnode, sport, dport], but this showed to
give slightly lower performance because of increased lock contention
when the same link was handled by multiple cores.
Signed-off-by: NYing Xue <ying.xue@windriver.com>
Signed-off-by: NErik Hugne <erik.hugne@ericsson.com>
Reviewed-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

08bfc9cb

24 1月, 2015 2 次提交

vxlan: advertise netns of vxlan dev in fdb msg · 193523bf

由 Nicolas Dichtel 提交于 1月 20, 2015

Netlink FDB messages are sent in the link netns. The header of these messages
contains the ifindex (ndm_ifindex) of the netdevice, but this ifindex is
unusable in case of x-netns vxlan.
I named the new attribute NDA_NDM_IFINDEX_NETNSID, to avoid confusion with
NDA_IFINDEX.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

193523bf

rtnl: fix error path when adding an iface with a link net · bdef279b

由 Nicolas Dichtel 提交于 1月 20, 2015

If an error occurs when the netdevice is moved to the link netns, a full cleanup
must be done.

Fixes: 317f4810 ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bdef279b

23 1月, 2015 1 次提交

nl80211: Allow set network namespace by fd · 4b681c82

由 Vadim Kochan 提交于 1月 12, 2015

Added new NL80211_ATTR_NETNS_FD which allows to
set namespace via nl80211 by fd.
Signed-off-by: NVadim Kochan <vadim4j@gmail.com>
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

4b681c82