提交 · 5a6cd6de76ae78b651e7c36eba8b1da465d65f06 · openeuler / Kernel

18 10月, 2017 2 次提交

ethtool: add ethtool_intersect_link_masks · 5a6cd6de

由 Alan Brady 提交于 10月 05, 2017

This function provides a way to intersect two link masks together to
find the common ground between them. For example in i40e, the driver
first generates link masks for what is supported by the PHY type. The
driver then gets the link masks for what the NVM supports. The
resulting intersection between them yields what can truly be supported.
Signed-off-by: NAlan Brady <alan.brady@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

5a6cd6de

net: export netdev_txq_to_tc to allow sch_mqprio to compile as module · 8a5f2166

由 Henrik Austad 提交于 10月 17, 2017

In commit 32302902 ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[]

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

     ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL() since the other user in the kernel
(netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
sysfs-callback.

Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: NHenrik Austad <haustad@cisco.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8a5f2166

17 10月, 2017 2 次提交

net: core: rcu-ify rtnl af_ops · 5fa85a09

由 Florian Westphal 提交于 10月 16, 2017

rtnl af_ops currently rely on rtnl mutex: unregister (called from module
exit functions) takes the rtnl mutex and all users that do af_ops lookup
also take the rtnl mutex. IOW, parallel rmmod will block until doit()
callback is done.

As none of the af_ops implementation sleep we can use rcu instead.

doit functions that need the af_ops can now use rcu instead of the
rtnl mutex provided the mutex isn't needed for other reasons.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5fa85a09

rtnetlink: place link af dump into own helper · 070cbf5b

由 Florian Westphal 提交于 10月 16, 2017

next patch will rcu-ify rtnl af_ops, i.e. allow af_ops
lookup and function calls with rcu read lock held instead
of rtnl mutex.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

070cbf5b

15 10月, 2017 1 次提交

tcp: add a tracepoint for tcp retransmission · e086101b

由 Cong Wang 提交于 10月 13, 2017

We need a real-time notification for tcp retransmission
for monitoring.

Of course we could use ftrace to dynamically instrument this
kernel function too, however we can't retrieve the connection
information at the same time, for example perf-tools [1] reads
/proc/net/tcp for socket details, which is slow when we have
a lots of connections.

Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
and exposes src/dst IP addresses and ports of the connection.
This also makes it easier to integrate into perf.

Note, I expose both IPv4 and IPv6 addresses at the same time:
for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
Also, add sk and skb pointers as they are useful for BPF.

1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans

Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NBrendan Gregg <bgregg@netflix.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e086101b

11 10月, 2017 4 次提交

net: dst: move cpu inside ifdef to avoid compilation warning · 833e0e2f

由 Jakub Kicinski 提交于 10月 10, 2017

If CONFIG_DST_CACHE is not selected cpu variable
will be unused and we will see a compilation warning.
Move it under the ifdef.
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Fixes: d66f2b91 ("bpf: don't rely on the verifier lock for metadata_dst allocation")
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

833e0e2f

rtnetlink: bridge: use ext_ack instead of printk · b88d12e4

由 Florian Westphal 提交于 10月 10, 2017

We can now piggyback error strings to userspace via extended acks
rather than using printk.

Before:
bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
RTNETLINK answers: Invalid argument

After:
bridge fdb add 01:02:03:04:05:06 dev br0 vlan 4095
Error: invalid vlan id.

v3: drop 'RTM_' prefixes, suggested by David Ahern, they
are not useful, the add/del in bridge command line is enough.

Also reword error in response to malformed/bad vlan id attribute
size.

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b88d12e4

net/core: Fix BUG to BUG_ON conditionals. · 9f77fad3

由 Tim Hansen 提交于 10月 09, 2017

Fix BUG() calls to use BUG_ON(conditional) macros.

This was found using make coccicheck M=net/core on linux next
tag next-2017092
Signed-off-by: NTim Hansen <devtimhansen@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9f77fad3

bpf: don't rely on the verifier lock for metadata_dst allocation · d66f2b91

由 Jakub Kicinski 提交于 10月 09, 2017

bpf_skb_set_tunnel_*() functions require allocation of per-cpu
metadata_dst.  The allocation happens upon verification of the
first program using those helpers.  In preparation for removing
the verifier lock, use cmpxchg() to make sure we only allocate
the metadata_dsts once.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d66f2b91

05 10月, 2017 8 次提交

net: Add extack to upper device linking · 42ab19ee

由 David Ahern 提交于 10月 04, 2017

Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

42ab19ee

net: Add extack to ndo_add_slave · 33eaf2a6

由 David Ahern 提交于 10月 04, 2017

Pass extack to do_set_master and down to ndo_add_slave
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33eaf2a6

net: Add extack to netdev_notifier_info · 51d0c047

由 David Ahern 提交于 10月 04, 2017

Add netlink_ext_ack to netdev_notifier_info to allow notifier
handlers to return errors to userspace.

Clean up the initialization in dev.c such that extack is easily
added in subsequent patches where relevant. Specifically, remove
the init call in call_netdevice_notifiers_info and have callers
initalize on stack when info is declared.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

51d0c047

dev: advertise the new nsid when the netns iface changes · 6621dd29

由 Nicolas Dichtel 提交于 10月 03, 2017

x-netns interfaces are bound to two netns: the link netns and the upper
netns. Usually, this kind of interfaces is created in the link netns and
then moved to the upper netns. At the end, the interface is visible only
in the upper netns. The link nsid is advertised via netlink in the upper
netns, thus the user always knows where is the link part.

There is no such mechanism in the link netns. When the interface is moved
to another netns, the user cannot "follow" it.
This patch adds a new netlink attribute which helps to follow an interface
which moves to another netns. When the interface is unregistered, the new
nsid is advertised. If the interface is a x-netns interface (ie
rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.

CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6621dd29

net: cache skb_shinfo() in skb_try_coalesce() · c818fa9e

由 Eric Dumazet 提交于 10月 04, 2017

Compiler does not really know that skb_shinfo(to|from) are constants
in skb_try_coalesce(), lets cache their values to shrink code.

We might even take care of skb_zcopy() calls later.

$ size net/core/skbuff.o.before net/core/skbuff.o
   text	   data	    bss	    dec	    hex	filename
  40727	   1298	      0	  42025	   a429	net/core/skbuff.o.before
  40631	   1298	      0	  41929	   a3c9	net/core/skbuff.o
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c818fa9e

rtnetlink: remove __rtnl_af_unregister · 5c45121d

由 Florian Westphal 提交于 10月 04, 2017

switch the only caller to rtnl_af_unregister.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5c45121d

rtnetlink: remove slave_validate callback · e774d96b

由 Florian Westphal 提交于 10月 04, 2017

no users in the tree.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e774d96b

net: core: fix kerneldoc comment · 20e88320

由 Florian Westphal 提交于 10月 04, 2017

net/core/dev.c:1306: warning: No description found for parameter 'name'
net/core/dev.c:1306: warning: Excess function parameter 'alias' description in 'dev_get_alias'

Fixes: 6c557001 ("net: core: decouple ifalias get/set from rtnl lock")
Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

20e88320

04 10月, 2017 2 次提交

net: core: decouple ifalias get/set from rtnl lock · 6c557001

由 Florian Westphal 提交于 10月 02, 2017

Device alias can be set by either rtnetlink (rtnl is held) or sysfs.

rtnetlink hold the rtnl mutex, sysfs acquires it for this purpose.
Add an extra mutex for it and use rcu to protect concurrent accesses.

This allows the sysfs path to not take rtnl and would later allow
to not hold it when dumping ifalias.

Based on suggestion from Eric Dumazet.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6c557001

net: rtnetlink: fix info leak in RTM_GETSTATS call · ce024f42

由 Nikolay Aleksandrov 提交于 10月 03, 2017

When RTM_GETSTATS was added the fields of its header struct were not all
initialized when returning the result thus leaking 4 bytes of information
to user-space per rtnl_fill_statsinfo call, so initialize them now. Thanks
to Alexander Potapenko for the detailed report and bisection.
Reported-by: NAlexander Potapenko <glider@google.com>
Fixes: 10c9ead9 ("rtnetlink: add new RTM_GETSTATS message to dump link stats")
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ce024f42

03 10月, 2017 2 次提交

socket, bpf: fix possible use after free · eefca20e

由 Eric Dumazet 提交于 10月 02, 2017

Starting from linux-4.4, 3WHS no longer takes the listener lock.

Since this time, we might hit a use-after-free in sk_filter_charge(),
if the filter we got in the memcpy() of the listener content
just happened to be replaced by a thread changing listener BPF filter.

To fix this, we need to make sure the filter refcount is not already
zero before incrementing it again.

Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eefca20e

flow_dissector: dissect tunnel info · a38402bc

由 Simon Horman 提交于 10月 02, 2017

Move dissection of tunnel info from the flower classifier to the flow
dissector where all other dissection occurs.  This should not have any
behavioural affect on other users of the flow dissector.
Signed-off-by: NSimon Horman <simon.horman@netronome.com>
Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a38402bc

29 9月, 2017 5 次提交

net: Set sk_prot_creator when cloning sockets to the right proto · 9d538fa6

由 Christoph Paasch 提交于 9月 26, 2017

sk->sk_prot and sk->sk_prot_creator can differ when the app uses
IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
Which is why sk_prot_creator is there to make sure that sk_prot_free()
does the kmem_cache_free() on the right kmem_cache slab.

Now, if such a socket gets transformed back to a listening socket (using
connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
sk_clone_lock() when a new connection comes in. But sk_prot_creator will
still point to the IPv6 kmem_cache (as everything got copied in
sk_clone_lock()). When freeing, we will thus put this
memory back into the IPv6 kmem_cache although it was allocated in the
IPv4 cache. I have seen memory corruption happening because of this.

With slub-debugging and MEMCG_KMEM enabled this gives the warning
	"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"

A C-program to trigger this:

void main(void)
{
        int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
        int new_fd, newest_fd, client_fd;
        struct sockaddr_in6 bind_addr;
        struct sockaddr_in bind_addr4, client_addr1, client_addr2;
        struct sockaddr unsp;
        int val;

        memset(&bind_addr, 0, sizeof(bind_addr));
        bind_addr.sin6_family = AF_INET6;
        bind_addr.sin6_port = ntohs(42424);

        memset(&client_addr1, 0, sizeof(client_addr1));
        client_addr1.sin_family = AF_INET;
        client_addr1.sin_port = ntohs(42424);
        client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&client_addr2, 0, sizeof(client_addr2));
        client_addr2.sin_family = AF_INET;
        client_addr2.sin_port = ntohs(42421);
        client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&unsp, 0, sizeof(unsp));
        unsp.sa_family = AF_UNSPEC;

        bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));

        listen(fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
        new_fd = accept(fd, NULL, NULL);
        close(fd);

        val = AF_INET;
        setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));

        connect(new_fd, &unsp, sizeof(unsp));

        memset(&bind_addr4, 0, sizeof(bind_addr4));
        bind_addr4.sin_family = AF_INET;
        bind_addr4.sin_port = ntohs(42421);
        bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));

        listen(new_fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));

        newest_fd = accept(new_fd, NULL, NULL);
        close(new_fd);

        close(client_fd);
        close(new_fd);
}

As far as I can see, this bug has been there since the beginning of the
git-days.
Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9d538fa6

rtnetlink: rtnl_have_link_slave_info doesn't need rtnl · 4c82a95e

由 Florian Westphal 提交于 9月 26, 2017

it can be switched to rcu.
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c82a95e

rtnetlink: add helpers to dump netnsid information · b1e66b9a

由 Florian Westphal 提交于 9月 26, 2017

Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1e66b9a

rtnetlink: add helpers to dump vf information · 250fc3df

由 Florian Westphal 提交于 9月 26, 2017

similar to earlier patches, split out more parts of this function to
better see what is happening and where we assume rtnl is locked.
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

250fc3df

rtnetlink: add helper to put master and link ifindexes · 79110a04

由 Florian Westphal 提交于 9月 26, 2017

rtnl_fill_ifinfo currently requires caller to hold the rtnl mutex.
Unfortunately the function is quite large which makes it harder to see
which spots require the lock, which spots assume it and which ones could
do without.

Add helpers to factor out the ifindex dumping, one can use rcu to avoid
rtnl dependency.
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

79110a04

27 9月, 2017 3 次提交

bpf: add meta pointer for direct access · de8f3a83

由 Daniel Borkmann 提交于 9月 25, 2017

This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.

xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be multiple of 4 byte up to 32 byte large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.

The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from it's
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de8f3a83

bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6

由 Daniel Borkmann 提交于 9月 25, 2017

Just do the rename into bpf_compute_data_pointers() as we'll add
one more pointer here to recompute.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6aaae2b6

datagram: Remove redundant unlikely() · 98e4fcff

由 Tobias Klauser 提交于 9月 26, 2017

IS_ERR() already implies unlikely(), so it can be omitted.
Signed-off-by: NTobias Klauser <tklauser@distanz.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

98e4fcff

26 9月, 2017 2 次提交

neigh: make strucrt neigh_table::entry_size unsigned int · 01ccdf12

由 Alexey Dobriyan 提交于 9月 23, 2017

Key length can't be negative.

Leave comparisons against nla_len() signed just in case truncated attribute
can sneak in there.

Space savings:

	add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-7 (-7)
	function                                     old     new   delta
	pneigh_delete                                273     272      -1
	mlx5e_rep_netevent_event                    1415    1414      -1
	mlx5e_create_encap_header_ipv6              1194    1193      -1
	mlx5e_create_encap_header_ipv4              1071    1070      -1
	cxgb4_l2t_get                               1104    1103      -1
	__pneigh_lookup                               69      68      -1
	__neigh_create                              2452    2451      -1
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

01ccdf12

net: speed up skb_rbtree_purge() · 7c90584c

由 Eric Dumazet 提交于 9月 23, 2017

As measured in my prior patch ("sch_netem: faster rb tree removal"),
rbtree_postorder_for_each_entry_safe() is nice looking but much slower
than using rb_next() directly, except when tree is small enough
to fit in CPU caches (then the cost is the same)

Also note that there is not even an increase of text size :
$ size net/core/skbuff.o.before net/core/skbuff.o
   text	   data	    bss	    dec	    hex	filename
  40711	   1298	      0	  42009	   a419	net/core/skbuff.o.before
  40711	   1298	      0	  42009	   a419	net/core/skbuff.o

From: Eric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7c90584c

23 9月, 2017 2 次提交

net: use 32-bit arithmetic while allocating net device · 52a59bd5

由 Alexey Dobriyan 提交于 9月 21, 2017

Private part of allocation is never big enough to warrant size_t.

Space savings:

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
	function                                     old     new   delta
	alloc_netdev_mqs                            1120    1110     -10
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

52a59bd5

net: orphan frags on stand-alone ptype in dev_queue_xmit_nit · 581fe0ea

由 Willem de Bruijn 提交于 9月 22, 2017

Zerocopy skbs frags are copied when the skb is looped to a local sock.
Commit 1080e512 ("net: orphan frags on receive") introduced calls
to skb_orphan_frags to deliver_skb and __netif_receive_skb for this.

With msg_zerocopy, these skbs can also exist in the tx path and thus
loop from dev_queue_xmit_nit. This already calls deliver_skb in its
loop. But it does not orphan before a separate pt_prev->func().

Add the missing skb_orphan_frags_rx.

Changes
  v1->v2: handle skb_orphan_frags_rx failure

Fixes: 1f8b977a ("sock: enable MSG_ZEROCOPY")
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

581fe0ea

22 9月, 2017 1 次提交

net: ethtool: Add back transceiver type · 19cab887

由 Florian Fainelli 提交于 9月 20, 2017

Commit 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
deprecated the ethtool_cmd::transceiver field, which was fine in
premise, except that the PHY library was actually using it to report the
type of transceiver: internal or external.

Use the first word of the reserved field to put this __u8 transceiver
field back in. It is made read-only, and we don't expect the
ETHTOOL_xLINKSETTINGS API to be doing anything with this anyway, so this
is mostly for the legacy path where we do:

ethtool_get_settings()
-> dev->ethtool_ops->get_link_ksettings()
   -> convert_link_ksettings_to_legacy_settings()

to have no information loss compared to the legacy get_settings API.

Fixes: 3f1ac7a7 ("net: ethtool: add new ETHTOOL_xLINKSETTINGS API")
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19cab887

21 9月, 2017 1 次提交

net: change skb->mac_header when Generic XDP calls adjust_head · 92dd5452

由 Edward Cree 提交于 9月 19, 2017

Since XDP's view of the packet includes the MAC header, moving the start-
 of-packet with bpf_xdp_adjust_head needs to also update the offset of the
 MAC header (which is relative to skb->head, not to the skb->data that was
 changed).
Without this, tcpdump sees packets starting from the old MAC header rather
 than the new one, at least in my tests on the loopback device.

Fixes: b5cdae32 ("net: Generic XDP")
Signed-off-by: NEdward Cree <ecree@solarflare.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

92dd5452

20 9月, 2017 1 次提交

bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131

由 Daniel Borkmann 提交于 9月 20, 2017

Commit 109980b8 ("bpf: don't select potentially stale
ri->map from buggy xdp progs") passed the pointer to the prog
itself to be loaded into r4 prior on bpf_redirect_map() helper
call, so that we can store the owner into ri->map_owner out of
the helper.

Issue with that is that the actual address of the prog is still
subject to change when subsequent rewrites occur that require
slow path in bpf_prog_realloc() to alloc more memory, e.g. from
patching inlining helper functions or constant blinding. Thus,
we really need to take prog->aux as the address we're holding,
which also works with prog clones as they share the same aux
object.

Instead of then fetching aux->prog during runtime, which could
potentially incur cache misses due to false sharing, we are
going to just use aux for comparison on the map owner. This
will also keep the patchlet of the same size, and later check
in xdp_map_invalid() only accesses read-only aux pointer from
the prog, it's also in the same cacheline already from prior
access when calling bpf_func.

Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7c300131

14 9月, 2017 1 次提交

net_sched: gen_estimator: fix scaling error in bytes/packets samples · ca558e18

由 Eric Dumazet 提交于 9月 13, 2017

Denys reported wrong rate estimations with HTB classes.

It appears the bug was added in linux-4.10, since my tests
where using intervals of one second only.

HTB using 4 sec default rate estimators, reported rates
were 4x higher.

We need to properly scale the bytes/packets samples before
integrating them in EWMA.

Tested:
 echo 1 >/sys/module/sch_htb/parameters/htb_rate_est

 Setup HTB with one class with a rate/cail of 5Gbit

 Generate traffic on this class

 tc -s -d cl sh dev eth0 classid 7002:11
class htb 7002:11 parent 7002:1 prio 5 quantum 200000 rate 5Gbit ceil
5Gbit linklayer ethernet burst 80000b/1 mpu 0b cburst 80000b/1 mpu 0b
level 0 rate_handle 1
 Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0
requeues 0)
 rate 5Gbit 412814pps backlog 136260b 2p requeues 0
 TCP pkts/rtx 982969327/45 bytes 1488215557414/68130
 lended: 22732826 borrowed: 0 giants: 0
 tokens: -1684 ctokens: -1684

Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NDenys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ca558e18

12 9月, 2017 1 次提交

xdp: implement xdp_redirect_map for generic XDP · 96c5508e

由 Jesper Dangaard Brouer 提交于 9月 10, 2017

Using bpf_redirect_map is allowed for generic XDP programs, but the
appropriate map lookup was never performed in xdp_do_generic_redirect().

Instead the map-index is directly used as the ifindex. For the
xdp_redirect_map sample in SKB-mode '-S', this resulted in trying
sending on ifindex 0 which isn't valid, resulting in getting SKB
packets dropped. Thus, the reported performance numbers are wrong in
commit 24251c26 ("samples/bpf: add option for native and skb mode
for redirect apps") for the 'xdp_redirect_map -S' case.

Before commit 109980b8 ("bpf: don't select potentially stale
ri->map from buggy xdp progs") it could crash the kernel. Like this
commit also check that the map_owner owner is correct before
dereferencing the map pointer. But make sure that this API misusage
can be caught by a tracepoint. Thus, allowing userspace via
tracepoints to detect misbehaving bpf_progs.

Fixes: 6103aa96 ("net: implement XDP_REDIRECT for xdp generic")
Fixes: 24251c26 ("samples/bpf: add option for native and skb mode for redirect apps")
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

96c5508e

09 9月, 2017 2 次提交

bpf: make error reporting in bpf_warn_invalid_xdp_action more clear · 9beb8bed

由 Daniel Borkmann 提交于 9月 09, 2017

Differ between illegal XDP action code and just driver
unsupported one to provide better feedback when we throw
a one-time warning here. Reason is that with 814abfab
("xdp: add bpf_redirect helper function") not all drivers
support the new XDP return code yet and thus they will
fall into their 'default' case when checking for return
codes after program return, which then triggers a
bpf_warn_invalid_xdp_action() stating that the return
code is illegal, but from XDP perspective it's not.

I decided not to place something like a XDP_ACT_MAX define
into uapi i) given we don't have this either for all other
program types, ii) future action codes could have further
encoding there, which would render such define unsuitable
and we wouldn't be able to rip it out again, and iii) we
rarely add new action codes.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9beb8bed

net: rcu lock and preempt disable missing around generic xdp · bbbe211c

由 John Fastabend 提交于 9月 08, 2017

do_xdp_generic must be called inside rcu critical section with preempt
disabled to ensure BPF programs are valid and per-cpu variables used
for redirect operations are consistent. This patch ensures this is true
and fixes the splat below.

The netif_receive_skb_internal() code path is now broken into two rcu
critical sections. I decided it was better to limit the preempt_enable/disable
block to just the xdp static key portion and the fallout is more
rcu_read_lock/unlock calls. Seems like the best option to me.

[  607.596901] =============================
[  607.596906] WARNING: suspicious RCU usage
[  607.596912] 4.13.0-rc4+ #570 Not tainted
[  607.596917] -----------------------------
[  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
[  607.596927]
[  607.596927] other info that might help us debug this:
[  607.596927]
[  607.596933]
[  607.596933] rcu_scheduler_active = 2, debug_locks = 1
[  607.596938] 2 locks held by pool/14624:
[  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
[  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
[  607.597000]
[  607.597000] stack backtrace:
[  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
[  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
[  607.597016] Call Trace:
[  607.597027]  dump_stack+0x67/0x92
[  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
[  607.597054]  do_xdp_generic+0x313/0xa50
[  607.597068]  ? time_hardirqs_on+0x5b/0x150
[  607.597076]  ? mark_held_locks+0x6b/0xc0
[  607.597088]  ? netdev_pick_tx+0x150/0x150
[  607.597117]  netif_rx_internal+0x205/0x3f0
[  607.597127]  ? do_xdp_generic+0xa50/0xa50
[  607.597144]  ? lock_downgrade+0x2b0/0x2b0
[  607.597158]  ? __lock_is_held+0x93/0x100
[  607.597187]  netif_rx+0x119/0x190
[  607.597202]  loopback_xmit+0xfd/0x1b0
[  607.597214]  dev_hard_start_xmit+0x127/0x4e0

Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
Fixes: b5cdae32 ("net: Generic XDP")
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bbbe211c

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功