- 17 11月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
After commit 4cd13c21 ("softirq: Let ksoftirqd do its job"), sk_busy_loop() needs a bit of care : softirqs might be delayed since we do not allow preemption yet. This patch adds preemptiom points in sk_busy_loop(), and makes sure no unnecessary cache line dirtying or atomic operations are done while looping. A new flag is added into napi->state : NAPI_STATE_IN_BUSY_POLL This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED, so that sk_busy_loop() does not have to grab it again. Similarly, netpoll_poll_lock() is done one time. This gives about 10 to 20 % improvement in various busy polling tests, especially when many threads are busy polling in configurations with large number of NIC queues. This should allow experimenting with bigger delays without hurting overall latencies. Tested: On a 40Gb mlx4 NIC, 32 RX/TX queues. echo 70 >/proc/sys/net/core/busy_read for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done Before: After: 1: 90072 92819 2: 157289 184007 3: 235772 213504 4: 344074 357513 5: 394755 458267 6: 461151 487819 7: 549116 625963 8: 544423 716219 9: 720460 738446 10: 794686 837612 11: 915998 923960 12: 937507 925107 13: 1019677 971506 14: 1046831 1113650 15: 1114154 1148902 16: 1105221 1179263 17: 1266552 1299585 18: 1258454 1383817 19: 1341453 1312194 20: 1363557 1488487 21: 1387979 1501004 22: 1417552 1601683 23: 1550049 1642002 24: 1568876 1601915 25: 1560239 1683607 26: 1640207 1745211 27: 1706540 1723574 28: 1638518 1722036 29: 1734309 1757447 30: 1782007 1855436 31: 1724806 1888539 32: 1717716 1944297 33: 1778716 1869118 34: 1805738 1983466 35: 1815694 2020758 36: 1893059 2035632 37: 1843406 2034653 38: 1888830 2086580 39: 1972827 2143567 40: 1877729 2181851 Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 11月, 2016 1 次提交
-
-
由 WANG Cong 提交于
Similar to commit 14135f30 ("inet: fix sleeping inside inet_wait_for_connect()"), sk_wait_event() needs to fix too, because release_sock() is blocking, it changes the process state back to running after sleep, which breaks the previous prepare_to_wait(). Switch to the new wait API. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 11月, 2016 2 次提交
-
-
由 Eric Dumazet 提交于
After Tom patch, thoff field could point past the end of the buffer, this could fool some callers. If an skb was provided, skb->len should be the upper limit. If not, hlen is supposed to be the upper limit. Fixes: a6e544b0 ("flow_dissector: Jump to exit code in __skb_flow_dissect") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Yibin Yang <yibyang@cisco.com Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com> Acked-by: NWillem de Bruijn <willemb@google.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Martin KaFai Lau 提交于
If the bpf program calls bpf_redirect(dev, 0) and dev is an ipip/ip6tnl, it currently includes the mac header. e.g. If dev is ipip, the end result is IP-EthHdr-IP instead of IP-IP. The fix is to pull the mac header. At ingress, skb_postpull_rcsum() is not needed because the ethhdr should have been pulled once already and then got pushed back just before calling the bpf_prog. At egress, this patch calls skb_postpull_rcsum(). If bpf_redirect(dev, BPF_F_INGRESS) is called, it also fails now because it calls dev_forward_skb() which eventually calls eth_type_trans(skb, dev). The eth_type_trans() will set skb->type = PACKET_OTHERHOST because the mac address does not match the redirecting dev->dev_addr. The PACKET_OTHERHOST will eventually cause the ip_rcv() errors out. To fix this, ____dev_forward_skb() is added. Joint work with Daniel Borkmann. Fixes: cfc7381b ("ip_tunnel: add collect_md mode to IPIP tunnel") Fixes: 8d79266b ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@fb.com> Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 11月, 2016 5 次提交
-
-
由 Eric Dumazet 提交于
There are no more users except from net/core/dev.c napi_hash_add() can now be static. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Lebrun 提交于
This patch creates a new type of interfaceless lightweight tunnel (SEG6), enabling the encapsulation and injection of SRH within locally emitted packets and forwarded packets. >From a configuration viewpoint, a seg6 tunnel would be configured as follows: ip -6 ro ad fc00::1/128 encap seg6 mode encap segs fc42::1,fc42::2,fc42::3 dev eth0 Any packet whose destination address is fc00::1 would thus be encapsulated within an outer IPv6 header containing the SRH with three segments, and would actually be routed to the first segment of the list. If `mode inline' was specified instead of `mode encap', then the SRH would be directly inserted after the IPv6 header without outer encapsulation. The inline mode is only available if CONFIG_IPV6_SEG6_INLINE is enabled. This feature was made configurable because direct header insertion may break several mechanisms such as PMTUD or IPSec AH. Signed-off-by: NDavid Lebrun <david.lebrun@uclouvain.be> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Mathias Krause 提交于
To avoid having dangling function pointers left behind, reset calcit in rtnl_unregister(), too. This is no issue so far, as only the rtnl core registers a netlink handler with a calcit hook which won't be unregistered, but may become one if new code makes use of the calcit hook. Fixes: c7ac8679 ("rtnetlink: Compute and store minimum ifinfo...") Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: NMathias Krause <minipli@googlemail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Receiving a GSO packet in dev_gro_receive() is not uncommon in stacked devices, or devices partially implementing LRO/GRO like bnx2x. GRO is implementing the aggregation the device was not able to do itself. Current code causes reorders, like in following case : For a given flow where sender sent 3 packets P1,P2,P3,P4 Receiver might receive P1 as a single packet, stored in GRO engine. Then P2-P4 are received as a single GSO packet, immediately given to upper stack, while P1 is held in GRO engine. This patch will make sure P1 is given to upper stack, then P2-P4 immediately after. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Lorenzo Colitti 提交于
Without this check, it is not possible to create two rules that are identical except for their UID ranges. For example: root@net-test:/# ip rule add prio 1000 lookup 300 root@net-test:/# ip rule add prio 1000 uidrange 100-200 lookup 300 RTNETLINK answers: File exists root@net-test:/# ip rule add prio 1000 uidrange 100-199 lookup 100 root@net-test:/# ip rule add prio 1000 uidrange 200-299 lookup 200 root@net-test:/# ip rule add prio 1000 uidrange 300-399 lookup 100 RTNETLINK answers: File exists Tested: https://android-review.googlesource.com/#/c/299980/Signed-off-by: NLorenzo Colitti <lorenzo@google.com> Acked-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 11月, 2016 3 次提交
-
-
由 Soheil Hassas Yeganeh 提交于
Do not set sk_err when dequeuing errors from the error queue. Doing so results in: a) Bugs: By overwriting existing sk_err values, it possibly hides legitimate errors. It is also incorrect when local errors are queued with ip_local_error. That happens in the context of a system call, which already returns the error code. b) Inconsistent behavior: When there are pending errors on the error queue, sk_err is sometimes 0 (e.g., for the first timestamp on the error queue) and sometimes set to an error code (after dequeuing the first timestamp). c) Suboptimality: Setting sk_err to ENOMSG on simple TX timestamps can abort parallel reads and writes. Removing this line doesn't break userspace. This is because userspace code cannot rely on sk_err for detecting whether there is something on the error queue. Except for ICMP messages received for UDP and RAW, sk_err is not set at enqueue time, and as a result sk_err can be 0 while there are plenty of errors on the error queue. For ICMP packets in UDP and RAW, sk_err is set when they are enqueued on the error queue, but that does not result in aborting reads and writes. For such cases, sk_err is only readable via getsockopt(SO_ERROR) which will reset the value of sk_err on its own. More importantly, prior to this patch, recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e., sk_err is set by sock_dequeue_err_skb without atomic ops or locks) which can store 0 in sk_err even when we have ICMP messages pending. Removing this line from sock_dequeue_err_skb eliminates that race. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jesper Dangaard Brouer 提交于
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a default qdisc attached, given they will be backed by physical device, that already have a qdisc attached for pushback. It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as this can be useful for difference policy reasons (e.g. bandwidth limiting containers). For this to work, the tx_queue_len need to have a sane value, because some qdiscs inherit/copy the tx_queue_len (namely, pfifo, bfifo, gred, htb, plug and sfb). Commit a813104d ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()") caught situations where some drivers didn't initialize tx_queue_len. The problem with the commit was choosing 1 as the fallback value. A qdisc queue length of 1 causes more harm than good, because it creates hard to debug situations for userspace. It gives userspace a false sense of a working config after attaching a qdisc. As low volume traffic (that doesn't activate the qdisc policy) works, like ping, while traffic that e.g. needs shaping cannot reach the configured policy levels, given the queue length is too small. This patch change the value to DEFAULT_TX_QUEUE_LEN, given other IFF_NO_QUEUE devices (that call ether_setup()) also use this value. Fixes: a813104d ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()") Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
A new argument is added to __skb_recv_datagram to provide an explicit skb destructor, invoked under the receive queue lock. The UDP protocol uses such argument to perform memory reclaiming on dequeue, so that the UDP protocol does not set anymore skb->desctructor. Instead explicit memory reclaiming is performed at close() time and when skbs are removed from the receive queue. The in kernel UDP protocol users now need to call a skb_recv_udp() variant instead of skb_recv_datagram() to properly perform memory accounting on dequeue. Overall, this allows acquiring only once the receive queue lock on dequeue. Tested using pktgen with random src port, 64 bytes packet, wire-speed on a 10G link as sender and udp_sink as the receiver, using an l4 tuple rxhash to stress the contention, and one or more udp_sink instances with reuseport. nr sinks vanilla patched 1 440 560 3 2150 2300 6 3650 3800 9 4450 4600 12 6250 6450 v1 -> v2: - do rmem and allocated memory scheduling under the receive lock - do bulk scheduling in first_packet_length() and in udp_destruct_sock() - avoid the typdef for the dequeue callback Suggested-by: NEric Dumazet <edumazet@google.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 11月, 2016 2 次提交
-
-
由 Lorenzo Colitti 提交于
- Define a new FIB rule attributes, FRA_UID_RANGE, to describe a range of UIDs. - Define a RTA_UID attribute for per-UID route lookups and dumps. - Support passing these attributes to and from userspace via rtnetlink. The value INVALID_UID indicates no UID was specified. - Add a UID field to the flow structures. Signed-off-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Lorenzo Colitti 提交于
Protocol sockets (struct sock) don't have UIDs, but most of the time, they map 1:1 to userspace sockets (struct socket) which do. Various operations such as the iptables xt_owner match need access to the "UID of a socket", and do so by following the backpointer to the struct socket. This involves taking sk_callback_lock and doesn't work when there is no socket because userspace has already called close(). Simplify this by adding a sk_uid field to struct sock whose value matches the UID of the corresponding struct socket. The semantics are as follows: 1. Whenever sk_socket is non-null: sk_uid is the same as the UID in sk_socket, i.e., matches the return value of sock_i_uid. Specifically, the UID is set when userspace calls socket(), fchown(), or accept(). 2. When sk_socket is NULL, sk_uid is defined as follows: - For a socket that no longer has a sk_socket because userspace has called close(): the previous UID. - For a cloned socket (e.g., an incoming connection that is established but on which userspace has not yet called accept): the UID of the socket it was cloned from. - For a socket that has never had an sk_socket: UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. Kernel sockets created by sock_create_kern are a special case of #1 and sk_uid is the user that created them. For kernel sockets created at network namespace creation time, such as the per-processor ICMP and TCP sockets, this is the user that created the network namespace. Signed-off-by: NLorenzo Colitti <lorenzo@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 11月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
Andrey Konovalov reported following error while fuzzing with syzkaller : IPv4: Attempt to release alive inet socket ffff880068e98940 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 1 PID: 3905 Comm: a.out Not tainted 4.9.0-rc3+ #333 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88006b9e0000 task.stack: ffff880068770000 RIP: 0010:[<ffffffff819ead5f>] [<ffffffff819ead5f>] selinux_socket_sock_rcv_skb+0xff/0x6a0 security/selinux/hooks.c:4639 RSP: 0018:ffff8800687771c8 EFLAGS: 00010202 RAX: ffff88006b9e0000 RBX: 1ffff1000d0eee3f RCX: 1ffff1000d1d312a RDX: 1ffff1000d1d31a6 RSI: dffffc0000000000 RDI: 0000000000000010 RBP: ffff880068777360 R08: 0000000000000000 R09: 0000000000000002 R10: dffffc0000000000 R11: 0000000000000006 R12: ffff880068e98940 R13: 0000000000000002 R14: ffff880068777338 R15: 0000000000000000 FS: 00007f00ff760700(0000) GS:ffff88006cd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020008000 CR3: 000000006a308000 CR4: 00000000000006e0 Stack: ffff8800687771e0 ffffffff812508a5 ffff8800686f3168 0000000000000007 ffff88006ac8cdfc ffff8800665ea500 0000000041b58ab3 ffffffff847b5480 ffffffff819eac60 ffff88006b9e0860 ffff88006b9e0868 ffff88006b9e07f0 Call Trace: [<ffffffff819c8dd5>] security_sock_rcv_skb+0x75/0xb0 security/security.c:1317 [<ffffffff82c2a9e7>] sk_filter_trim_cap+0x67/0x10e0 net/core/filter.c:81 [<ffffffff82b81e60>] __sk_receive_skb+0x30/0xa00 net/core/sock.c:460 [<ffffffff838bbf12>] dccp_v4_rcv+0xdb2/0x1910 net/dccp/ipv4.c:873 [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0 net/ipv4/ip_input.c:216 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306abd2>] ip_local_deliver+0x1c2/0x4b0 net/ipv4/ip_input.c:257 [< inline >] dst_input ./include/net/dst.h:507 [<ffffffff83068500>] ip_rcv_finish+0x750/0x1c40 net/ipv4/ip_input.c:396 [< inline >] NF_HOOK_THRESH ./include/linux/netfilter.h:232 [< inline >] NF_HOOK ./include/linux/netfilter.h:255 [<ffffffff8306b82f>] ip_rcv+0x96f/0x12f0 net/ipv4/ip_input.c:487 [<ffffffff82bd9fb7>] __netif_receive_skb_core+0x1897/0x2a50 net/core/dev.c:4213 [<ffffffff82bdb19a>] __netif_receive_skb+0x2a/0x170 net/core/dev.c:4251 [<ffffffff82bdb493>] netif_receive_skb_internal+0x1b3/0x390 net/core/dev.c:4279 [<ffffffff82bdb6b8>] netif_receive_skb+0x48/0x250 net/core/dev.c:4303 [<ffffffff8241fc75>] tun_get_user+0xbd5/0x28a0 drivers/net/tun.c:1308 [<ffffffff82421b5a>] tun_chr_write_iter+0xda/0x190 drivers/net/tun.c:1332 [< inline >] new_sync_write fs/read_write.c:499 [<ffffffff8151bd44>] __vfs_write+0x334/0x570 fs/read_write.c:512 [<ffffffff8151f85b>] vfs_write+0x17b/0x500 fs/read_write.c:560 [< inline >] SYSC_write fs/read_write.c:607 [<ffffffff81523184>] SyS_write+0xd4/0x1a0 fs/read_write.c:599 [<ffffffff83fc02c1>] entry_SYSCALL_64_fastpath+0x1f/0xc2 It turns out DCCP calls __sk_receive_skb(), and this broke when lookups no longer took a reference on listeners. Fix this issue by adding a @refcounted parameter to __sk_receive_skb(), so that sock_put() is used only when needed. Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NAndrey Konovalov <andreyknvl@google.com> Tested-by: NAndrey Konovalov <andreyknvl@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 11月, 2016 6 次提交
-
-
由 Eric Dumazet 提交于
Sending zero checksum is ok for TCP, but not for UDP. UDPv6 receiver should by default drop a frame with a 0 checksum, and UDPv4 would not verify the checksum and might accept a corrupted packet. Simply replace such checksum by 0xffff, regardless of transport. This error was caught on SIT tunnels, but seems generic. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: NMaciej Żenczykowski <maze@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
At accept() time, it is possible the parent has a non zero sk_err_soft, leftover from a prior error. Make sure we do not leave this value in the child, as it makes future getsockopt(SO_ERROR) calls quite unreliable. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This patch adds support for setting and using XPS when QoS via traffic classes is enabled. With this change we will factor in the priority and traffic class mapping of the packet and use that information to correctly select the queue. This allows us to define a set of queues for a given traffic class via mqprio and then configure the XPS mapping for those queues so that the traffic flows can avoid head-of-line blocking between the individual CPUs if so desired. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
This patch updates the code for removing queues from the XPS map and makes it so that we can apply the code any time we change either the number of traffic classes or the mapping of a given block of queues. This way we avoid having queues pulling traffic from a foreign traffic class. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
Add a sysfs attribute for a Tx queue that allows us to determine the traffic class for a given queue. This will allow us to more easily determine this in the future. It is needed as XPS will take the traffic class for a group of queues into account in order to avoid pulling traffic from one traffic class into another. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Duyck 提交于
The functions for configuring the traffic class to queue mappings have other effects that need to be addressed. Instead of trying to export a bunch of new functions just relocate the functions so that we can instrument them directly with the functionality they will need. Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 10月, 2016 3 次提交
-
-
由 David Ahern 提交于
netdev_walk_all_lower_dev is not properly walking the lower device list. Commit 1a3f060c made netdev_walk_all_lower_dev similar to netdev_walk_all_upper_dev_rcu and netdev_walk_all_lower_dev_rcu but failed to update its netdev_next_lower_dev iterator. This patch fixes that. Fixes: 1a3f060c ("net: Introduce new api for walking upper and lower devices") Reported-by: NIdo Schimmel <idosch@mellanox.com> Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Tested-by: NIdo Schimmel <idosch@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Willem de Bruijn 提交于
When transmitting on a packet socket with PACKET_VNET_HDR and PACKET_QDISC_BYPASS, validate device support for features requested in vnet_hdr. Drop TSO packets sent to devices that do not support TSO or have the feature disabled. Note that the latter currently do process those packets correctly, regardless of not advertising the feature. Because of SKB_GSO_DODGY, it is not sufficient to test device features with netif_needs_gso. Full validate_xmit_skb is needed. Switch to software checksum for non-TSO packets that request checksum offload if that device feature is unsupported or disabled. Note that similar to the TSO case, device drivers may perform checksum offload correctly even when not advertising it. When switching to software checksum, packets hit skb_checksum_help, which has two BUG_ON checksum not in linear segment. Packet sockets always allocate at least up to csum_start + csum_off + 2 as linear. Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c ethtool -K eth0 tso off tx on psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N ethtool -K eth0 tx off psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N v2: - add EXPORT_SYMBOL_GPL(validate_xmit_skb_list) Fixes: d346a3fa ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: NWillem de Bruijn <willemb@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 10月, 2016 5 次提交
-
-
由 Andrey Vagin 提交于
No one can see these events, because a network namespace can not be destroyed, if it has sockets. Unlike other devices, uevent-s for network devices are generated only inside their network namespaces. They are filtered in kobj_bcast_filter() My experiments shows that net namespaces are destroyed more 30% faster with this optimization. Here is a perf output for destroying network namespaces without this patch. - 94.76% 0.02% kworker/u48:1 [kernel.kallsyms] [k] cleanup_net - 94.74% cleanup_net - 94.64% ops_exit_list.isra.4 - 41.61% default_device_exit_batch - 41.47% unregister_netdevice_many - rollback_registered_many - 40.36% netdev_unregister_kobject - 14.55% device_del + 13.71% kobject_uevent - 13.04% netdev_queue_update_kobjects + 12.96% kobject_put - 12.72% net_rx_queue_update_kobjects kobject_put - kobject_release + 12.69% kobject_uevent + 0.80% call_netdevice_notifiers_info + 19.57% nfsd_exit_net + 11.15% tcp_net_metrics_exit + 8.25% rpcsec_gss_exit_net It's very critical to optimize the exit path for network namespaces, because they are destroyed under net_mutex and many namespaces can be destroyed for one iteration. v2: use dev_set_uevent_suppress() Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: NAndrei Vagin <avagin@openvz.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Arnd Bergmann 提交于
gcc warns about an uninitialized pointer dereference in the vlan priority handling: net/core/flow_dissector.c: In function '__skb_flow_dissect': net/core/flow_dissector.c:281:61: error: 'vlan' may be used uninitialized in this function [-Werror=maybe-uninitialized] As pointed out by Jiri Pirko, the variable is never actually used without being initialized first as the only way it end up uninitialized is with skb_vlan_tag_present(skb)==true, and that means it does not get accessed. However, the warning hints at some related issues that I'm addressing here: - the second check for the vlan tag is different from the first one that tests the skb for being NULL first, causing both the warning and a possible NULL pointer dereference that was not entirely fixed. - The same patch that introduced the NULL pointer check dropped an earlier optimization that skipped the repeated check of the protocol type - The local '_vlan' variable is referenced through the 'vlan' pointer but the variable has gone out of scope by the time that it is accessed, causing undefined behavior Caching the result of the 'skb && skb_vlan_tag_present(skb)' check in a local variable allows the compiler to further optimize the later check. With those changes, the warning also disappears. Fixes: 3805a938 ("flow_dissector: Check skb for VLAN only if skb specified.") Fixes: d5709f7a ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci") Signed-off-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NEric Garver <e@erig.me> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Johannes Berg 提交于
Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct. In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've add __init annotations to clarifly things), so all can actually be marked __ro_after_init. This protects the data structure from accidental corruption. Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Johannes Berg 提交于
Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Johannes Berg 提交于
Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: NJohannes Berg <johannes.berg@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 10月, 2016 1 次提交
-
-
由 Elad Raz 提交于
When a port_type_set() is been called and the new port type set is the same as the old one, just return success. Signed-off-by: NElad Raz <eladr@mellanox.com> Signed-off-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 10月, 2016 1 次提交
-
-
由 Andrey Vagin 提交于
net_mutex can be locked for a long time. It may be because many namespaces are being destroyed or many processes decide to create a network namespace. Both these operations are heavy, so it is better to have an ability to kill a process which is waiting net_mutex. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: NAndrei Vagin <avagin@openvz.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 10月, 2016 3 次提交
-
-
由 Daniel Borkmann 提交于
Use case is mainly for soreuseport to select sockets for the local numa node, but since generic, lets also add this for other networking and tracing program types. Suggested-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
Basic sock operations that udp code can use with its own memory accounting schema. No functional change is introduced in the existing APIs. v4 -> v5: - avoid whitespace changes v2 -> v4: - avoid exporting __sock_enqueue_skb v1 -> v2: - avoid export sock_rmem_free Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paul Moore 提交于
This reverts commit bc51dddf ("netns: avoid disabling irq for netns id") as it was found to cause problems with systems running SELinux/audit, see the mailing list thread below: * http://marc.info/?t=147694653900002&r=1&w=2 Eventually we should be able to reintroduce this code once we have rewritten the audit multicast code to queue messages much the same way we do for unicast messages. A tracking issue for this can be found below: * https://github.com/linux-audit/audit-kernel/issues/23Reported-by: NStephen Smalley <sds@tycho.nsa.gov> Reported-by: NElad Raz <e@eladraz.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NPaul Moore <paul@paul-moore.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 10月, 2016 1 次提交
-
-
由 Sabrina Dubroca 提交于
Currently, GRO can do unlimited recursion through the gro_receive handlers. This was fixed for tunneling protocols by limiting tunnel GRO to one level with encap_mark, but both VLAN and TEB still have this problem. Thus, the kernel is vulnerable to a stack overflow, if we receive a packet composed entirely of VLAN headers. This patch adds a recursion counter to the GRO layer to prevent stack overflow. When a gro_receive function hits the recursion limit, GRO is aborted for this skb and it is processed normally. This recursion counter is put in the GRO CB, but could be turned into a percpu counter if we run out of space in the CB. Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report. Fixes: CVE-2016-7039 Fixes: 9b174d88 ("net: Add Transparent Ethernet Bridging GRO support.") Fixes: 66e5133f ("vlan: Add GRO support for non hardware accelerated vlan") Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Reviewed-by: NJiri Benc <jbenc@redhat.com> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 10月, 2016 3 次提交
-
-
由 Ido Schimmel 提交于
Tamir reported the following trace when processing ARP requests received via a vlan device on top of a VLAN-aware bridge: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0] [...] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.8.0-rc7 #1 Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 task: ffff88017edfea40 task.stack: ffff88017ee10000 RIP: 0010:[<ffffffff815dcc73>] [<ffffffff815dcc73>] netdev_all_lower_get_next_rcu+0x33/0x60 [...] Call Trace: <IRQ> [<ffffffffa015de0a>] mlxsw_sp_port_lower_dev_hold+0x5a/0xa0 [mlxsw_spectrum] [<ffffffffa016f1b0>] mlxsw_sp_router_netevent_event+0x80/0x150 [mlxsw_spectrum] [<ffffffff810ad07a>] notifier_call_chain+0x4a/0x70 [<ffffffff810ad13a>] atomic_notifier_call_chain+0x1a/0x20 [<ffffffff815ee77b>] call_netevent_notifiers+0x1b/0x20 [<ffffffff815f2eb6>] neigh_update+0x306/0x740 [<ffffffff815f38ce>] neigh_event_ns+0x4e/0xb0 [<ffffffff8165ea3f>] arp_process+0x66f/0x700 [<ffffffff8170214c>] ? common_interrupt+0x8c/0x8c [<ffffffff8165ec29>] arp_rcv+0x139/0x1d0 [<ffffffff816e505a>] ? vlan_do_receive+0xda/0x320 [<ffffffff815e3794>] __netif_receive_skb_core+0x524/0xab0 [<ffffffff815e6830>] ? dev_queue_xmit+0x10/0x20 [<ffffffffa06d612d>] ? br_forward_finish+0x3d/0xc0 [bridge] [<ffffffffa06e5796>] ? br_handle_vlan+0xf6/0x1b0 [bridge] [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60 [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0 [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70 [<ffffffffa06d7856>] br_pass_frame_up+0xc6/0x160 [bridge] [<ffffffffa06d63d7>] ? deliver_clone+0x37/0x50 [bridge] [<ffffffffa06d656c>] ? br_flood+0xcc/0x160 [bridge] [<ffffffffa06d7b14>] br_handle_frame_finish+0x224/0x4f0 [bridge] [<ffffffffa06d7f94>] br_handle_frame+0x174/0x300 [bridge] [<ffffffff815e3599>] __netif_receive_skb_core+0x329/0xab0 [<ffffffff81374815>] ? find_next_bit+0x15/0x20 [<ffffffff8135e802>] ? cpumask_next_and+0x32/0x50 [<ffffffff810c9968>] ? load_balance+0x178/0x9b0 [<ffffffff815e3d38>] __netif_receive_skb+0x18/0x60 [<ffffffff815e3dc0>] netif_receive_skb_internal+0x40/0xb0 [<ffffffff815e3e4c>] netif_receive_skb+0x1c/0x70 [<ffffffffa01544a1>] mlxsw_sp_rx_listener_func+0x61/0xb0 [mlxsw_spectrum] [<ffffffffa005c9f7>] mlxsw_core_skb_receive+0x187/0x200 [mlxsw_core] [<ffffffffa007332a>] mlxsw_pci_cq_tasklet+0x63a/0x9b0 [mlxsw_pci] [<ffffffff81091986>] tasklet_action+0xf6/0x110 [<ffffffff81704556>] __do_softirq+0xf6/0x280 [<ffffffff8109213f>] irq_exit+0xdf/0xf0 [<ffffffff817042b4>] do_IRQ+0x54/0xd0 [<ffffffff8170214c>] common_interrupt+0x8c/0x8c The problem is that netdev_all_lower_get_next_rcu() never advances the iterator, thereby causing the loop over the lower adjacency list to run forever. Fix this by advancing the iterator and avoid the infinite loop. Fixes: 7ce856aa ("mlxsw: spectrum: Add couple of lower device helper functions") Signed-off-by: NIdo Schimmel <idosch@mellanox.com> Reported-by: NTamir Winetroub <tamirw@mellanox.com> Reviewed-by: NJiri Pirko <jiri@mellanox.com> Acked-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Garver 提交于
Fixes a panic when calling eth_get_headlen(). Noticed on i40e driver. Fixes: d5709f7a ("flow_dissector: For stripped vlan, get vlan info from skb->vlan_tci") Signed-off-by: NEric Garver <e@erig.me> Reviewed-by: NJakub Sitnicki <jkbs@redhat.com> Acked-by: NAmir Vadai <amir@vadai.me> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
reuseport_add_sock() is not used from a module, no need to export it. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 10月, 2016 2 次提交
-
-
由 David Ahern 提交于
Adjacency code only has debugs for the insert case. Add debugs for the remove path and make both consistently worded to make it easier to follow the insert and removal with reference counts. In addition, change the BUG to a WARN_ON. A missing adjacency at removal time is not cause for a panic. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
Lower list should be empty just like upper. Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-