提交 · ad744b223c521b1e01752a826774545c3e3acd8e · openanolis / cloud-kernel

08 6月, 2017 1 次提交

net: Fix inconsistent teardown and release of private netdev state. · cf124db5

由 David S. Miller 提交于 5月 08, 2017

Network devices can allocate reasources and private memory using
netdev_ops->ndo_init().  However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf124db5

07 6月, 2017 1 次提交

tun: use symmetric hash · feec084a

由 Jason Wang 提交于 6月 06, 2017

Tun actually expects a symmetric hash for queue selecting to work
correctly, otherwise packets belongs to a single flow may be
redirected to the wrong queue. So this patch switch to use
__skb_get_hash_symmetric().
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

feec084a

18 5月, 2017 2 次提交

tun: support receiving skb through msg_control · ac77cfd4

由 Jason Wang 提交于 5月 17, 2017

This patch makes tun_recvmsg() can receive from skb from its caller
through msg_control. Vhost_net will be the first user.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ac77cfd4

tun: export skb_array · 83339c6b

由 Jason Wang 提交于 5月 17, 2017

This patch exports skb_array through tun_get_skb_array(). Caller can
then manipulate skb array directly.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

83339c6b

22 3月, 2017 1 次提交

tun: fix inability to set offloads after disabling them via ethtool · 09050957

由 Yaroslav Isakov 提交于 3月 16, 2017

Added missing logic in tun driver, which prevents apps to set
offloads using tun ioctl, if offloads were previously disabled via ethtool
Signed-off-by: NYaroslav Isakov <yaroslav.isakov@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

09050957

14 3月, 2017 2 次提交

tun: fix premature POLLOUT notification on tun devices · b20e2d54

由 Hannes Frederic Sowa 提交于 3月 13, 2017

aszlig observed failing ssh tunnels (-w) during initialization since
commit cc9da6cc ("ipv6: addrconf: use stable address generator for
ARPHRD_NONE"). We already had reports that the mentioned commit breaks
Juniper VPN connections. I can't clearly say that the Juniper VPN client
has the same problem, but it is worth a try to hint to this patch.

Because of the early generation of link local addresses, the kernel now
can start asking for routers on the local subnet much earlier than usual.
Those router solicitation packets arrive inside the ssh channels and
should be transmitted to the tun fd before the configuration scripts
might have upped the interface and made it ready for transmission.

ssh polls on the interface and receives back a POLL_OUT. It tries to send
the earily router solicitation packet to the tun interface. Unfortunately
it hasn't been up'ed yet by config scripts, thus failing with -EIO. ssh
doesn't retry again and considers the tun interface broken forever.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
Fixes: cc9da6cc ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
Cc: Bjørn Mork <bjorn@mork.no>
Reported-by: NValdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Reported-by: NJonas Lippuner <jonas@lippuner.ca>
Cc: Jonas Lippuner <jonas@lippuner.ca>
Reported-by: Naszlig <aszlig@redmoonstudios.org>
Cc: aszlig <aszlig@redmoonstudios.org>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b20e2d54

net: tun: use new api ethtool_{get|set}_link_ksettings · 29ccc49d

由 Philippe Reynes 提交于 3月 11, 2017

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.
Signed-off-by: NPhilippe Reynes <tremyfr@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

29ccc49d

10 3月, 2017 1 次提交

tun: remove copyright printing · 6cbac982

由 LABBE Corentin 提交于 3月 08, 2017

Printing copyright does not give any useful information on the boot
process.
Furthermore, the email address printed is obsolete since
commit ba57b6f2 ("MAINTAINERS: fix bouncing tun/tap entries")
Signed-off-by: NCorentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6cbac982

02 3月, 2017 1 次提交

sched/headers: Prepare to move signal wakeup & sigpending methods from... · 174cd4b1

由 Ingo Molnar 提交于 2月 02, 2017

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

174cd4b1

07 2月, 2017 1 次提交

tun: read vnet_hdr_sz once · e1edab87

由 Willem de Bruijn 提交于 2月 03, 2017

When IFF_VNET_HDR is enabled, a virtio_net header must precede data.
Data length is verified to be greater than or equal to expected header
length tun->vnet_hdr_sz before copying.

Read this value once and cache locally, as it can be updated between
the test and use (TOCTOU).
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Reported-by: NDmitry Vyukov <dvyukov@google.com>
CC: Eric Dumazet <edumazet@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e1edab87

21 1月, 2017 1 次提交

virtio-net: restore VIRTIO_HDR_F_DATA_VALID on receiving · 6391a448

由 Jason Wang 提交于 1月 20, 2017

Commit 501db511 ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
xmit") in fact disables VIRTIO_HDR_F_DATA_VALID on receiving path too,
fixing this by adding a hint (has_data_valid) and set it only on the
receiving path.

Cc: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NRolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6391a448

19 1月, 2017 1 次提交

tun: rx batching · 5503fcec

由 Jason Wang 提交于 1月 18, 2017

We can only process 1 packet at one time during sendmsg(). This often
lead bad cache utilization under heavy load. So this patch tries to do
some batching during rx before submitting them to host network
stack. This is done through accepting MSG_MORE as a hint from
sendmsg() caller, if it was set, batch the packet temporarily in a
linked list and submit them all once MSG_MORE were cleared.

Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:

                                 Mpps  -+%
    rx-frames = 0                0.91  +0%
    rx-frames = 4                1.00  +9.8%
    rx-frames = 8                1.00  +9.8%
    rx-frames = 16               1.01  +10.9%
    rx-frames = 32               1.07  +17.5%
    rx-frames = 48               1.07  +17.5%
    rx-frames = 64               1.08  +18.6%
    rx-frames = 64 (no MSG_MORE) 0.91  +0%

User were allowed to change per device batched packets through
ethtool -C rx-frames. NAPI_POLL_WEIGHT were used as upper limitation
to prevent bh from being disabled too long.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5503fcec

09 1月, 2017 1 次提交

net: make ndo_get_stats64 a void function · bc1f4470

由 stephen hemminger 提交于 1月 06, 2017

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.
Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bc1f4470

25 12月, 2016 1 次提交

Replace <asm/uaccess.h> with <linux/uaccess.h> globally · 7c0f6ba6

由 Linus Torvalds 提交于 12月 24, 2016

This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.
Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7c0f6ba6

07 12月, 2016 1 次提交

tun: Use netif_receive_skb instead of netif_rx · d4aea20d

由 Andrey Konovalov 提交于 12月 01, 2016

This patch changes tun.c to call netif_receive_skb instead of netif_rx
when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
stack exhaustion). The difference between the two is that netif_rx queues
the packet into the backlog, and netif_receive_skb proccesses the packet
in the current context.

This patch is required for syzkaller [1] to collect coverage from packet
receive paths, when a packet being received through tun (syzkaller collects
coverage per process in the process context).

As mentioned by Eric this change also speeds up tun/tap. As measured by
Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
about 23%, from 700k packets/second to 867k packets/second.

A similar patch was introduced back in 2010 [2, 3], but the author found
out that the patch doesn't help with the task he had in mind (for cgroups
to shape network traffic based on the original process) and decided not to
go further with it. The main concern back then was about possible stack
exhaustion with 4K stacks.

[1] https://github.com/google/syzkaller

[2] https://www.spinics.net/lists/netdev/thrd440.html#130570

[3] https://www.spinics.net/lists/netdev/msg130570.htmlSigned-off-by: NAndrey Konovalov <andreyknvl@google.com>
Acked-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d4aea20d

06 12月, 2016 1 次提交

[iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8

由 Al Viro 提交于 11月 01, 2016

copy_from_iter_full(), copy_from_iter_full_nocache() and
csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
et.al., advancing iterator only in case of successful full copy
and returning whether it had been successful or not.

Convert some obvious users.  *NOTE* - do not blindly assume that
something is a good candidate for those unless you are sure that
not advancing iov_iter in failure case is the right thing in
this case.  Anything that does short read/short write kind of
stuff (or is in a loop, etc.) is unlikely to be a good one.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

cbbd26b8

01 12月, 2016 1 次提交

tun: handle ubuf refcount correctly when meet errors · af1cc7a2

由 Jason Wang 提交于 11月 30, 2016

We trigger uarg->callback() immediately after we decide do datacopy
even if caller want to do zerocopy. This will cause the callback
(vhost_net_zerocopy_callback) decrease the refcount. But when we meet
an error afterwards, the error handling in vhost handle_tx() will try
to decrease it again. This is wrong and fix this by delay the
uarg->callback() until we're sure there's no errors.
Reported-by: Nwangyunjian <wangyunjian@huawei.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

af1cc7a2

25 11月, 2016 1 次提交

tuntap: remove unnecessary sk_receive_queue length check during xmit · 436acceb

由 Jason Wang 提交于 11月 23, 2016

After commit 1576d986 ("tun: switch to use skb array for tx"),
sk_receive_queue was not used any more. So remove the uncessary
sk_receive_queue length check during xmit.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

436acceb

19 11月, 2016 2 次提交

virtio_net: Do not clear memory for struct virtio_net_hdr twice. · 9403cd7c

由 Jarno Rajahalme 提交于 11月 18, 2016

virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.
Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9403cd7c

virtio_net: Simplify call sites for virtio_net_hdr_{from, to}_skb(). · 3e9e40e7

由 Jarno Rajahalme 提交于 11月 18, 2016

No point storing the return value of virtio_net_hdr_to_skb() or
virtio_net_hdr_from_skb() to a variable when the value is used only
once as a boolean in an immediately following if statement.
Signed-off-by: NJarno Rajahalme <jarno@ovn.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3e9e40e7

31 10月, 2016 1 次提交

driver: tun: Use new macro SOCK_IOC_TYPE instead of literal number 0x89 · 20861f26

由 Gao Feng 提交于 10月 27, 2016

The current codes use _IOC_TYPE(cmd) == 0x89 to check if the cmd is one
socket ioctl command like SIOCGIFHWADDR. But the literal number 0x89 may
confuse readers. So create one macro SOCK_IOC_TYPE to enhance the readability.
Signed-off-by: NGao Feng <fgao@ikuai8.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20861f26

30 10月, 2016 1 次提交

driver: tun: Move tun check into the block of TUNSETIFF condition check · 0f16bc13

由 Gao Feng 提交于 10月 25, 2016

When cmd is TUNSETIFF and tun is not null, the original codes go ahead,
then reach the default case of switch(cmd) and set the ret is -EINVAL.
It is not clear for readers.

Now move the tun check into the block of TUNSETIFF condition check, and
return -EEXIST instead of -EINVAL when the tfile already owns one tun.
Signed-off-by: NGao Feng <fgao@ikuai8.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0f16bc13

21 10月, 2016 1 次提交

net: use core MTU range checking in core net infra · 91572088

由 Jarod Wilson 提交于 10月 20, 2016

geneve:
- Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
- This one isn't quite as straight-forward as others, could use some
  closer inspection and testing

macvlan:
- set min/max_mtu

tun:
- set min/max_mtu, remove tun_net_change_mtu

vxlan:
- Merge __vxlan_change_mtu back into vxlan_change_mtu
- Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function
- This one is also not as straight-forward and could use closer inspection
  and testing from vxlan folks

bridge:
- set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
  change_mtu function

openvswitch:
- set min/max_mtu, remove internal_dev_change_mtu
- note: max_mtu wasn't checked previously, it's been set to 65535, which
  is the largest possible size supported

sch_teql:
- set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

macsec:
- min_mtu = 0, max_mtu = 65535

macvlan:
- min_mtu = 0, max_mtu = 65535

ntb_netdev:
- min_mtu = 0, max_mtu = 65535

veth:
- min_mtu = 68, max_mtu = 65535

8021q:
- min_mtu = 0, max_mtu = 65535

CC: netdev@vger.kernel.org
CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
CC: Tom Herbert <tom@herbertland.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Jiri Benc <jbenc@redhat.com>
CC: WANG Cong <xiyou.wangcong@gmail.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Pravin B Shelar <pshelar@ovn.org>
CC: Sabrina Dubroca <sd@queasysnail.net>
CC: Patrick McHardy <kaber@trash.net>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Pravin Shelar <pshelar@nicira.com>
CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Signed-off-by: NJarod Wilson <jarod@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

91572088

24 8月, 2016 1 次提交

tun: fix transmit timestamp support · 7b996243

由 Soheil Hassas Yeganeh 提交于 8月 23, 2016

Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
software transmit timestamp of a packet.

sock_tx_timestamp resets and overrides the tx_flags of the skb.
The function is intended to be called from within the protocol
layer when creating the skb, not from a device driver. This is
inconsistent with other drivers and will cause issues for TCP.

In TCP, we intend to sample the timestamps for the last byte
for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
tcp_tx_timestamp only with the last skb that it generates.
For example, if a 128KB message is split into two 64KB packets
we want to sample the SND timestamp of the last packet. The current
code in the tun driver, however, will result in sampling the SND
timestamp for both packets.

Also, when the last packet is split into smaller packets for
retranmission (see tcp_fragment), the tun driver will record
timestamps for all of the retransmitted packets and not only the
last packet.

Fixes: eda29772 (tun: Support software transmit time stamping.)
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NFrancis Yan <francisyyan@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7b996243

21 8月, 2016 2 次提交

tun: Rename a jump label in update_filter() · 3b8d2a69

由 Markus Elfring 提交于 8月 20, 2016

Adjust a jump target according to the Linux coding style convention.
Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b8d2a69

tun: Use memdup_user() rather than duplicating its implementation · 28e8190d

由 Markus Elfring 提交于 8月 20, 2016

Reuse existing functionality from memdup_user() instead of keeping
duplicate source code.

This issue was detected by using the Coccinelle software.
Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
Reviewed-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

28e8190d

09 7月, 2016 1 次提交

tun: Don't assume type tun in tun_device_event · 86dfb4ac

由 Craig Gallek 提交于 7月 06, 2016

The referenced change added a netlink notifier for processing
device queue size events.  These events are fired for all devices
but the registered callback assumed they only occurred for tun
devices.  This fix adds a check (borrowed from macvtap.c) to discard
non-tun device events.

For reference, this fixes the following splat:
[   71.505935] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[   71.513870] IP: [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
[   71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0
[   71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[   71.529374] gsmi: Log Shutdown Reason 0x03
[   71.533417] Modules linked in:[   71.533826] mlx4_en: eth1: Link Up

[   71.539616]  bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
[   71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 4.7.0-dbx-DEV #8
[   71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 05/03/2016
[   71.564495] task: ffff887f00bb20c0 ti: ffff887f00798000 task.ti: ffff887f00798000
[   71.571894] RIP: 0010:[<ffffffff8153c1a0>]  [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
[   71.580327] RSP: 0018:ffff887f0079bbd8  EFLAGS: 00010202
[   71.585576] RAX: fffffffffffffae8 RBX: ffff887ef6d03378 RCX: 0000000000000000
[   71.592624] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000000000000000
[   71.599675] RBP: ffff887f0079bc48 R08: 0000000000000000 R09: 0000000000000001
[   71.606730] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000010
[   71.613780] R13: 0000000000000000 R14: 0000000000000001 R15: ffff887f0079bd00
[   71.620832] FS:  00007f5cdc581700(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
[   71.628826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.634500] CR2: 0000000000000010 CR3: 0000003f3eb62000 CR4: 00000000001406e0
[   71.641549] Stack:
[   71.643533]  ffff887f0079bc08 0000000000000246 000000000000001e ffff887ef6d00000
[   71.650871]  ffff887f0079bd00 0000000000000000 0000000000000000 ffffffff00000000
[   71.658210]  ffff887f0079bc48 ffffffff81d24070 00000000fffffff9 ffffffff81cec7a0
[   71.665549] Call Trace:
[   71.667975]  [<ffffffff810eeb0d>] notifier_call_chain+0x5d/0x80
[   71.673823]  [<ffffffff816365d0>] ? show_tx_maxrate+0x30/0x30
[   71.679502]  [<ffffffff810eeb3e>] __raw_notifier_call_chain+0xe/0x10
[   71.685778]  [<ffffffff810eeb56>] raw_notifier_call_chain+0x16/0x20
[   71.691976]  [<ffffffff8160eb30>] call_netdevice_notifiers_info+0x40/0x70
[   71.698681]  [<ffffffff8160ec36>] call_netdevice_notifiers+0x16/0x20
[   71.704956]  [<ffffffff81636636>] change_tx_queue_len+0x66/0x90
[   71.710807]  [<ffffffff816381ef>] netdev_store.isra.5+0xbf/0xd0
[   71.716658]  [<ffffffff81638350>] tx_queue_len_store+0x50/0x60
[   71.722431]  [<ffffffff814a6798>] dev_attr_store+0x18/0x30
[   71.727857]  [<ffffffff812ea3ff>] sysfs_kf_write+0x4f/0x70
[   71.733274]  [<ffffffff812e9507>] kernfs_fop_write+0x147/0x1d0
[   71.739045]  [<ffffffff81134a4f>] ? rcu_read_lock_sched_held+0x8f/0xa0
[   71.745499]  [<ffffffff8125a108>] __vfs_write+0x28/0x120
[   71.750748]  [<ffffffff8111b137>] ? percpu_down_read+0x57/0x90
[   71.756516]  [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0
[   71.762278]  [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0
[   71.768038]  [<ffffffff8125bd5e>] vfs_write+0xbe/0x1b0
[   71.773113]  [<ffffffff8125c092>] SyS_write+0x52/0xa0
[   71.778110]  [<ffffffff817528e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
[   71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 <49> 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff
[   71.803655] RIP  [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
[   71.809769]  RSP <ffff887f0079bbd8>
[   71.813213] CR2: 0000000000000010
[   71.816512] ---[ end trace 4db6449606319f73 ]---

Fixes: 1576d986 ("tun: switch to use skb array for tx")
Signed-off-by: NCraig Gallek <kraig@google.com>
Acked-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

86dfb4ac

05 7月, 2016 1 次提交

tun: fix build warnings · f48cc6b2

由 Jason Wang 提交于 7月 04, 2016

Stephen Rothwell reports a build warnings(powerpc ppc64_defconfig)

drivers/net/tun.c: In function 'tun_do_read.part.5':
/home/sfr/next/next/drivers/net/tun.c:1491:6: warning: 'err' may be
used uninitialized in this function [-Wmaybe-uninitialized]
   int err;

This is because tun_ring_recv() may return an uninitialized err, fix this.
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f48cc6b2

01 7月, 2016 1 次提交

tun: switch to use skb array for tx · 1576d986

由 Jason Wang 提交于 6月 30, 2016

We used to queue tx packets in sk_receive_queue, this is less
efficient since it requires spinlocks to synchronize between producer
and consumer.

This patch tries to address this by:

- switch from sk_receive_queue to a skb_array, and resize it when
  tx_queue_len was changed.
- introduce a new proto_ops peek_len which was used for peeking the
  skb length.
- implement a tun version of peek_len for vhost_net to use and convert
  vhost_net to use peek_len if possible.

Pktgen test shows about 15.3% improvement on guest receiving pps for small
buffers:

Before: ~1300000pps
After : ~1500000pps
Signed-off-by: NJason Wang <jasowang@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1576d986

16 6月, 2016 1 次提交

tun: fix csum generation for tap devices · df10db98

由 Paolo Abeni 提交于 6月 14, 2016

The commit 34166093 ("tuntap: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implictly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidate the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.

Fixes: 34166093 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

df10db98

11 6月, 2016 1 次提交

tuntap: use common code for virtio_net_hdr and skb GSO conversion · 34166093

由 Mike Rapoport 提交于 6月 08, 2016

Replace open coded conversion between virtio_net_hdr to skb GSO info with
virtio_net_hdr_{from,to}_skb
Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

34166093

21 5月, 2016 1 次提交

tuntap: correctly wake up process during uninit · addf8fc4

由 Jason Wang 提交于 5月 19, 2016

We used to check dev->reg_state against NETREG_REGISTERED after each
time we are woke up. But after commit 9e641bdc ("net-tun:
restructure tun_do_read for better sleep/wakeup efficiency"), it uses
skb_recv_datagram() which does not check dev->reg_state. This will
result if we delete a tun/tap device after a process is blocked in the
reading. The device will wait for the reference count which was held
by that process for ever.

Fixes this by using RCV_SHUTDOWN which will be checked during
sk_recv_datagram() before trying to wake up the process during uninit.

Fixes: 9e641bdc ("net-tun: restructure tun_do_read for better
sleep/wakeup efficiency")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Xi Wang <xii@google.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

addf8fc4

29 4月, 2016 1 次提交

tuntap: calculate rps hash only when needed · 3df97ba8

由 Jason Wang 提交于 4月 25, 2016

There's no need to calculate rps hash if it was not enabled. So this
patch export rps_needed and check it before trying to get rps
hash. Tests (using pktgen to inject packets to guest) shows this can
improve pps about 13% (when rps is disabled).

Before:
~1150000 pps
After:
~1300000 pps

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: NJason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3df97ba8

19 4月, 2016 1 次提交

tun: don't require serialization lock on tx · 2a2bbf17

由 Paolo Abeni 提交于 4月 14, 2016

The current tun_net_xmit() implementation don't need any external
lock since it relies on rcu protection for the tun data structure
and on socket queue lock for skb queuing.

This patch set the NETIF_F_LLTX feature bit in the tun device, so
that on xmit, in absence of qdisc, no serialization lock is acquired
by the caller.

The user space can remove the default tun qdisc with:

tc qdisc replace dev <tun device name> root noqueue
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2a2bbf17

15 4月, 2016 1 次提交

tun: use per cpu variables for stats accounting · 608b9977

由 Paolo Abeni 提交于 4月 13, 2016

Currently the tun device accounting uses dev->stats without applying any
kind of protection, regardless that accounting happens in preemptible
process context.
This patch move the tun stats to a per cpu data structure, and protect
the updates with  u64_stats_update_begin()/u64_stats_update_end() or
this_cpu_inc according to the stat type. The per cpu stats are
aggregated by the newly added ndo_get_stats64 ops.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

608b9977

09 4月, 2016 1 次提交

tuntap: restore default qdisc · 016adb72

由 Jason Wang 提交于 4月 08, 2016

After commit f84bb1ea ("net: fix IFF_NO_QUEUE for drivers using
alloc_netdev"), default qdisc was changed to noqueue because
tuntap does not set tx_queue_len during .setup(). This patch restores
default qdisc by setting tx_queue_len in tun_setup().

Fixes: f84bb1ea ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev")
Cc: Phil Sutter <phil@nwl.cc>
Signed-off-by: NJason Wang <jasowang@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Acked-by: NPhil Sutter <phil@nwl.cc>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

016adb72

08 4月, 2016 1 次提交

tun: use socket locks for sk_{attach,detatch}_filter · 8ced425e

由 Hannes Frederic Sowa 提交于 4月 05, 2016

This reverts commit 5a5abb1f ("tun, bpf: fix suspicious RCU usage
in tun_{attach, detach}_filter") and replaces it to use lock_sock around
sk_{attach,detach}_filter. The checks inside filter.c are updated with
lockdep_sock_is_held to check for proper socket locks.

It keeps the code cleaner by ensuring that only one lock governs the
socket filter instead of two independent locks.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8ced425e

05 4月, 2016 1 次提交

sock: enable timestamping using control messages · c14ac945

由 Soheil Hassas Yeganeh 提交于 4月 02, 2016

Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.
Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c14ac945

02 4月, 2016 1 次提交

tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter · 5a5abb1f

由 Daniel Borkmann 提交于 3月 31, 2016

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
  [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
  [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
  [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
  [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9 ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in 99405162 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5a5abb1f

02 3月, 2016 1 次提交

net/tun: implement ndo_set_rx_headroom · eaea34b2

由 Paolo Abeni 提交于 2月 26, 2016

ndo_set_rx_headroom controls the align value used by tun devices to
allocate skbs on frame reception.
When the xmit device adds a large encapsulation, this avoids an skb
head reallocation on forwarding.

The measured improvement when forwarding towards a vxlan dev with
frame size below the egress device MTU is as follow:

vxlan over ipv6, bridged: +6%
vxlan over ipv6, ovs: +7%

In case of ipv4 tunnels there is no improvement, since the tun
device default alignment provides enough headroom to avoid the skb
head reallocation.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eaea34b2

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功