Commits · 53869cebce4bc53f71a080e7830600d4ae1ab712 · openeuler / raspberrypi-kernel

01 Jul, 2017 4 commits

net: convert nf_bridge_info.use from atomic_t to refcount_t · 53869ceb

Authored by Reshetova, Elena on Jun 30, 2017

The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
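
The overflow concern behind these conversions can be sketched in userspace. This is a minimal single-threaded model, not the kernel's refcount_t implementation: the names `sketch_refcount`, `sketch_refcount_inc`, `sketch_refcount_dec_and_test`, and `SKETCH_SATURATED` are hypothetical, and no atomics are used. It only shows why saturating at a maximum is safer than wrapping to zero.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's refcount_t. */
#define SKETCH_SATURATED UINT_MAX

struct sketch_refcount {
	unsigned int refs;
};

/* A plain counter wraps to 0 on overflow, so a later put could free a live object. */
unsigned int plain_inc(unsigned int counter)
{
	return counter + 1;	/* UINT_MAX + 1 wraps to 0 */
}

/* The saturating variant sticks at the maximum instead of wrapping. */
void sketch_refcount_inc(struct sketch_refcount *r)
{
	if (r->refs != SKETCH_SATURATED)
		r->refs++;
}

/*
 * Returns true only when the last reference is dropped.
 * A saturated or already-zero counter is left alone and never reports "free".
 */
bool sketch_refcount_dec_and_test(struct sketch_refcount *r)
{
	if (r->refs == 0 || r->refs == SKETCH_SATURATED)
		return false;
	return --r->refs == 0;
}
```

In this model a saturated object is leaked rather than freed early, which trades a memory leak for the absence of a use-after-free.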

net: convert neigh_params.refcnt from atomic_t to refcount_t · 6343944b

Authored by Reshetova, Elena on Jun 30, 2017

The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: convert neighbour.refcnt from atomic_t to refcount_t · 9f237430

Authored by Reshetova, Elena on Jun 30, 2017

The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: convert inet_peer.refcnt from atomic_t to refcount_t · 1cc9a98b

Authored by Reshetova, Elena on Jun 30, 2017

The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.

This conversion requires an overall +1 on the whole
refcounting scheme.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

30 Jun, 2017 5 commits

net: handle NAPI_GRO_FREE_STOLEN_HEAD case also in napi_frags_finish() · e44699d2

Authored by Michal Kubeček on Jun 29, 2017

Recently I started seeing warnings about pages with refcount -1. The
problem was traced to packets being reused after their head was merged into
a GRO packet by skb_gro_receive(). While bisecting pointed to
commit c21b48cc ("net: adjust skb->truesize in ___pskb_trim()") and
I have never seen the issue on a kernel with that commit reverted, I believe
the real problem appeared earlier, when the option to merge head frag in GRO
was implemented.

Handling NAPI_GRO_FREE_STOLEN_HEAD state was only added to GRO_MERGED_FREE
branch of napi_skb_finish() so that if the driver uses napi_gro_frags()
and head is merged (which in my case happens after the skb_condense()
call added by the commit mentioned above), the skb is reused including the
head that has been merged. As a result, we release the page reference
twice and eventually end up with negative page refcount.

To fix the problem, handle NAPI_GRO_FREE_STOLEN_HEAD in napi_frags_finish()
the same way it's done in napi_skb_finish().

Fixes: d7e8883c ("net: make GRO aware of skb->head_frag")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bridge: constify attribute_group structures. · cddbb79f

Authored by Arvind Yadav on Jun 29, 2017

attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   2645	    896	      0	   3541	    dd5	net/bridge/br_sysfs_br.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
   2701	    832	      0	   3533	    dcd	net/bridge/br_sysfs_br.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: constify attribute_group structures. · 38ef00cc

Authored by Arvind Yadav on Jun 29, 2017

attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/device.h> work with const
attribute_group. So mark the non-const structs as const.

File size before:
   text	   data	    bss	    dec	    hex	filename
   9968	   3168	     16	  13152	   3360	net/core/net-sysfs.o

File size After adding 'const':
   text	   data	    bss	    dec	    hex	filename
  10160	   2976	     16	  13152	   3360	net/core/net-sysfs.o

Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipmr: Add ipmr_rtm_getroute · 4f75ba69

Authored by Donald Sharp on Jun 28, 2017

Add to RTNL_FAMILY_IPMR, RTM_GETROUTE the ability
to retrieve one S,G mroute from a specified table.

*,G will return mroute information for just that
particular mroute if it exists. This is because
it is entirely possible to have more S's than
can fit in one skb to return to the requesting
process.

Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: Fix one possible panic when no destroy callback · c1a4872e

Authored by Gao Feng on Jun 28, 2017

When a qdisc fails to init, qdisc_create() invokes the destroy callback
to clean up. But there is no check whether the callback actually exists,
so the kernel panics if the qdisc has no destroy callback, as is the case
for codel, fq, and others.

Take codel as an example:
When a malicious user constructs an invalid netlink msg, nla_parse_nested
fails in the path codel_init->codel_change->nla_parse_nested.
The kernel then invokes the destroy callback directly, but qdisc codel
doesn't define one, which results in a panic.

Add a check for the destroy callback to avoid the possible panic.

Fixes: 87b60cfa ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
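
The guard described in this fix can be sketched outside the kernel. The types and names below (`sketch_qdisc`, `sketch_qdisc_ops`, `sketch_qdisc_create`, `sketch_failing_init`) are hypothetical stand-ins and not the real qdisc structures; the sketch only assumes an ops table whose destroy callback is optional.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for a qdisc and its ops table. */
struct sketch_qdisc;

struct sketch_qdisc_ops {
	int  (*init)(struct sketch_qdisc *q);
	void (*destroy)(struct sketch_qdisc *q);	/* optional, may be NULL */
};

struct sketch_qdisc {
	const struct sketch_qdisc_ops *ops;
	int destroyed;
};

/* Runs init and, on failure, calls destroy only if the qdisc provides one. */
int sketch_qdisc_create(struct sketch_qdisc *q)
{
	int err = q->ops->init(q);

	if (err) {
		if (q->ops->destroy)	/* the added check */
			q->ops->destroy(q);
		return err;
	}
	return 0;
}

/* An init that always fails, mimicking a rejected netlink attribute. */
int sketch_failing_init(struct sketch_qdisc *q)
{
	(void)q;
	return -EINVAL;
}
```

Without the `if (q->ops->destroy)` test, the failing path would call through a NULL function pointer.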

28 Jun, 2017 4 commits

ipv6: udp: leverage scratch area helpers · 67a51780

Authored by Paolo Abeni on Jun 26, 2017

The commit b65ac446 ("udp: try to avoid 2 cache miss on dequeue")
leveraged the scratch area helpers for UDP v4, but I forgot to
update the IPv6 code path accordingly.

This change extends the scratch area usage to the IPv6 code, syncing
the two implementations and giving some performance benefit.
IPv6 is again almost on the same level as IPv4, performance-wise.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: move scratch area helpers into the include file · b26bbdae

Authored by Paolo Abeni on Jun 26, 2017

So that they can be later used by the IPv6 code, too.
Also lift the comments a bit.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix null ptr deref in getsockopt(..., TCP_ULP, ...) · d97af30f

Authored by Dave Watson on Jun 26, 2017

If icsk_ulp_ops is unset, it dereferences a null ptr.
Add a null ptr check.

BUG: KASAN: null-ptr-deref in copy_to_user include/linux/uaccess.h:168 [inline]
BUG: KASAN: null-ptr-deref in do_tcp_getsockopt.isra.33+0x24f/0x1e30 net/ipv4/tcp.c:3057
Read of size 4 at addr 0000000000000020 by task syz-executor1/15452

Signed-off-by: Dave Watson <davejwatson@fb.com>
Reported-by: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: prevent sign extension in dev_get_stats() · 6f64ec74

Authored by Eric Dumazet on Jun 27, 2017

Similar to the fix provided by Dominik Heidler in commit
9b3dc0a1 ("l2tp: cast l2tp traffic counter to unsigned")
we need to take care of 32bit kernels in dev_get_stats().

When using atomic_long_read(), we add a 'long' to u64 and
might misinterpret high order bit, unless we cast to unsigned.

Fixes: caf586e5 ("net: add a core netdev->rx_dropped counter")
Fixes: 015f0688 ("net: net: add a core netdev->tx_dropped counter")
Fixes: 6e7333d3 ("net: add rx_nohandler stat counter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
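
The sign-extension hazard described here can be reproduced in portable C. In this sketch `int32_t` stands in for `long` on a 32-bit kernel, and the function names `add_sign_extended` and `add_zero_extended` are hypothetical; it is an illustration of the arithmetic, not the code in dev_get_stats().

```c
#include <stdint.h>

/* int32_t models a 32-bit 'long' as returned by atomic_long_read(). */

/* Adding the signed value directly: it is sign-extended to 64 bits first. */
uint64_t add_sign_extended(uint64_t total, int32_t counter)
{
	return total + counter;
}

/* Casting to unsigned before widening keeps the high-order bit as data. */
uint64_t add_zero_extended(uint64_t total, int32_t counter)
{
	return total + (uint32_t)counter;
}
```

Once a counter passes 2^31 its high bit is set, so the first function adds a huge value to the u64 total while the second adds the intended 32-bit count.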

27 Jun, 2017 5 commits

net: add netlink_ext_ack argument to rtnl_link_ops.slave_validate · d116ffc7

Authored by Matthias Schiffer on Jun 25, 2017

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add netlink_ext_ack argument to rtnl_link_ops.slave_changelink · 17dd0ec4

Authored by Matthias Schiffer on Jun 25, 2017

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add netlink_ext_ack argument to rtnl_link_ops.validate · a8b8a889

Authored by Matthias Schiffer on Jun 25, 2017

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add netlink_ext_ack argument to rtnl_link_ops.changelink · ad744b22

Authored by Matthias Schiffer on Jun 25, 2017

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add netlink_ext_ack argument to rtnl_link_ops.newlink · 7a3f4a18

Authored by Matthias Schiffer on Jun 25, 2017

Add support for extended error reporting.

Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26 Jun, 2017 5 commits

sctp: adjust ssthresh when transport is idle · a02d036c

Authored by Marcelo Ricardo Leitner on Jun 23, 2017

RFC 4960 Errata 3.27 identifies that ssthresh should be adjusted to cwnd
because otherwise it could cause the transport to lock into the congestion
avoidance phase, especially if ssthresh was previously reduced by some
packet drop, leading to poor performance.

The Errata says to adjust ssthresh to cwnd only once, though the same
goal is achieved by updating it every time we update cwnd too. The
caveat is that we could take longer to get back up to speed but that
should be compensated by the fact that we don't adjust on RTO basis (as
RFC says) but based on Heartbeats, which are usually way longer.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.27
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: adjust cwnd increase in Congestion Avoidance phase · 4ccbd0b0

Authored by Marcelo Ricardo Leitner on Jun 23, 2017

RFC4960 Errata 3.26 identified that, while RFC4960 states that
cwnd should never grow by more than 1*MTU per RTT, Section 7.2.2 was
underspecified and, as described, could allow increasing cwnd by more than
that.

This patch updates it so partial_bytes_acked is capped at cwnd if
flight_size doesn't reach cwnd, protecting it from such a case.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.26
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: allow increasing cwnd regardless of ctsn moving or not · e56f777a

Authored by Marcelo Ricardo Leitner on Jun 23, 2017

As per RFC4960 Errata 3.22, this condition is not needed anymore as it
could cause the partial_bytes_acked to not consider the TSNs acked in
the Gap Ack Blocks although they were received by the peer successfully.

This patch thus drops the check for new Cumulative TSN Ack Point,
leaving just the flight_size < cwnd one.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.22
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: update order of adjustments of partial_bytes_acked and cwnd · d0b53f40

Authored by Marcelo Ricardo Leitner on Jun 23, 2017

RFC4960 Errata 3.12 says RFC4960 is unclear about the order of
adjustments applied to partial_bytes_acked and cwnd in the congestion
avoidance phase, and that the actual order should be:
partial_bytes_acked is reset to (partial_bytes_acked - cwnd). Next, cwnd
is increased by MTU.

We were first increasing cwnd, and then subtracting the new value from pba,
which leads to a different result, as pba ends up smaller than it should be
and could cause cwnd to not grow as much.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-rfc4960-errata-01#section-3.12
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
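
The effect of the ordering can be checked with a small calculation. The sketch below is a simplified model with hypothetical names (`sketch_transport`, `old_order`, `errata_order`); the clamp to zero in `old_order` is an assumption added to keep the unsigned subtraction from underflowing and is not taken from the commit message.

```c
#include <stdint.h>

/* Hypothetical reduced transport state: congestion window and partial_bytes_acked. */
struct sketch_transport {
	uint32_t cwnd;
	uint32_t pba;
};

/* Previous behaviour: grow cwnd first, then subtract the NEW cwnd from pba. */
void old_order(struct sketch_transport *t, uint32_t mtu)
{
	if (t->pba >= t->cwnd) {
		t->cwnd += mtu;
		/* Assumed clamp: avoid unsigned underflow when pba < new cwnd. */
		t->pba = (t->cwnd < t->pba) ? t->pba - t->cwnd : 0;
	}
}

/* Errata 3.12 order: subtract the OLD cwnd from pba, then grow cwnd by one MTU. */
void errata_order(struct sketch_transport *t, uint32_t mtu)
{
	if (t->pba >= t->cwnd) {
		t->pba -= t->cwnd;
		t->cwnd += mtu;
	}
}
```

With cwnd = 3000, pba = 3500, and MTU = 1500, both orders end with cwnd = 4500, but the old order leaves pba at 0 while the errata order keeps the 500 bytes of credit toward the next increase.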

tcp: reset sk_rx_dst in tcp_disconnect() · d747a7a5

Authored by WANG Cong on Jun 24, 2017

We have to reset the sk->sk_rx_dst when we disconnect a TCP
connection, because otherwise when we re-connect it this
dst reference is simply overridden in tcp_finish_connect().

This fixes a dst leak which leads to a loopback dev refcnt
leak. It is a long-standing bug; Kevin reported a very similar
(if not the same) bug before. Thanks to Andrei for providing such
a reliable reproducer, which greatly narrows down the problem.

Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
Reported-by: Andrei Vagin <avagin@gmail.com>
Reported-by: Kevin Xu <kaiwen.xu@hulu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

25 Jun, 2017 4 commits

net: ipv6: reset daddr and dport in sk if connect() fails · 85cb73ff

Authored by Wei Wang on Jun 23, 2017

In __ip6_datagram_connect(), reset sk->sk_v6_daddr and inet->dport if
an error occurs.
In udp_v6_early_demux(), check sk_state to make sure it is in the
TCP_ESTABLISHED state.
Together, these make sure an unconnected UDP socket won't be considered a
valid candidate for early demux.

v3: add TCP_ESTABLISHED state check in udp_v6_early_demux()
v2: fix compilation error

Fixes: 5425077d ("net: ipv6: Add early demux handler for UDP unicast")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_iucv: Move sockaddr length checks to before accessing sa_family in bind and connect handlers · e3c42b61

Authored by Mateusz Jurczyk on Jun 23, 2017

Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the AF_IUCV socket. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or
one byte long) result in operating on uninitialized memory while
referencing .sa_family.

Fixes: 52a82e23 ("af_iucv: Validate socket address length in iucv_sock_bind()")
Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
[jwi: removed unneeded null-check for addr]
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
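
The ordering of the checks can be illustrated with a userspace sketch. The helper `sketch_check_family` is hypothetical and much simpler than the AF_IUCV handlers; it only shows the idea of validating the caller-supplied length before reading `sa_family`.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical bind/connect-style check: validate addr_len BEFORE
 * touching addr->sa_family, so short buffers are never read.
 */
int sketch_check_family(const struct sockaddr *addr, int addr_len,
			sa_family_t expected)
{
	/* Smallest length that fully contains the sa_family field. */
	size_t need = offsetof(struct sockaddr, sa_family) +
		      sizeof(addr->sa_family);

	if (addr_len < 0 || (size_t)addr_len < need)
		return -EINVAL;	/* too short to contain sa_family */
	if (addr->sa_family != expected)
		return -EINVAL;
	return 0;
}
```

A zero-byte or one-byte sockaddr is rejected by the first test, so the comparison on `sa_family` only ever runs on memory the caller actually provided.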

net/iucv: improve endianness handling · 2e56c26b

Authored by Hans Wippel on Jun 23, 2017

Use proper endianness conversion for an skb protocol assignment. Given
that IUCV is only available on big endian systems (s390), this simply
avoids an endianness warning reported by sparse.

Signed-off-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: store port/representator id in metadata_dst · 3fcece12

Authored by Jakub Kicinski on Jun 23, 2017

Switches and modern SR-IOV enabled NICs may multiplex traffic from port
representators and control messages over a single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because a single set of queues is used for many netdevs, reliably stopping
the TC/sched queues of all of them is impossible, and the lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id. This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will have
metadata attached and will subsequently be queued to the lower device for
transmission. The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower netdevs
today.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

24 Jun, 2017 8 commits

tls: return -EFAULT if copy_to_user() fails · ac55cd61

Authored by Dan Carpenter on Jun 23, 2017

The copy_to_user() function returns the number of bytes remaining but we
want to return -EFAULT here.

Fixes: 3c4d7559 ("tls: kernel TLS support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
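
The return-value convention behind this fix can be modelled in userspace. `sketch_copy_to_user` below is a hypothetical stand-in that, like copy_to_user() as described above, returns the number of bytes it could not copy; `sketch_getsockopt` and the `writable` parameter are likewise invented for the illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): copies at most 'writable'
 * bytes and returns the number of bytes NOT copied (0 on full success).
 */
size_t sketch_copy_to_user(void *dst, const void *src, size_t len,
			   size_t writable)
{
	size_t n = len < writable ? len : writable;

	memcpy(dst, src, n);
	return len - n;
}

/* Translate any shortfall into -EFAULT instead of leaking the byte count. */
int sketch_getsockopt(void *optval, size_t writable, const void *info,
		      size_t len)
{
	if (sketch_copy_to_user(optval, info, len, writable))
		return -EFAULT;
	return 0;
}
```

Returning the raw result would hand the caller a positive number that looks like success or a length, which is why the shortfall is mapped to an error code.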

tcp: fix out-of-bounds access in ULP sysctl · 926f38e9

Authored by Jakub Kicinski on Jun 22, 2017

KASAN reports an out-of-bounds access in proc_dostring() coming from
proc_tcp_available_ulp() because, when the TCP ULP list is empty,
the buffer allocated for the response will not have anything
printed into it. Set the first byte to zero to avoid strlen()
going out-of-bounds.

Fixes: 734942cc ("tcp: ULP infrastructure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
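
The empty-list case can be reproduced with an ordinary string builder. `sketch_list_names` is a hypothetical helper, not the kernel function; it only shows why writing a terminator up front makes the buffer a valid string even when nothing is printed into it.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical list builder: writes space-separated names into buf. */
void sketch_list_names(char *buf, size_t maxlen,
		       const char *const *names, size_t count)
{
	size_t offs = 0;

	if (maxlen == 0)
		return;
	buf[0] = '\0';	/* the fix: an empty list still yields a valid string */

	for (size_t i = 0; i < count && offs < maxlen; i++) {
		int n = snprintf(buf + offs, maxlen - offs, "%s%s",
				 offs ? " " : "", names[i]);
		if (n < 0)
			break;
		offs += (size_t)n;
	}
}
```

If the first byte were left uninitialized and `count` were zero, a later strlen() on the buffer would scan past whatever happened to be in memory.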

sit: use __GFP_NOWARN for user controlled allocation · 0ccc22f4

Authored by WANG Cong on Jun 22, 2017

The memory allocation size is controlled by user-space.
If it is too large, just fail silently and return NULL,
not to mention there is a fallback allocation later.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: possibly avoid extra masking for narrower load in verifier · 23994631

Authored by Yonghong Song on Jun 22, 2017

Commit 31fd8581 ("bpf: permits narrower load from bpf program
context fields") permits narrower loads for certain ctx fields.
That commit, however, generates masking even if
the prog-specific ctx conversion already produces a result of
narrower size.

For example, for __sk_buff->protocol, the ctx conversion
loads the data into a register with a 2-byte load.
A narrower 2-byte load should not generate masking.
For __sk_buff->vlan_present, the conversion function
sets the result to either 0 or 1, essentially a byte.
A narrower 2-byte or 1-byte load should not generate masking.

To avoid unnecessary masking, prog-specific *_is_valid_access
now passes converted_op_size back to the verifier, which indicates
the valid data width after the perceived future conversion.
Based on this information, the verifier is able to avoid
unnecessary masking.

Since we want more information back from prog-specific
*_is_valid_access checking, all of them are packed into
one data structure for more clarity.

Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: add reporting of offload mode · ce158e58

Authored by Jakub Kicinski on Jun 21, 2017

Extend the XDP_ATTACHED_* values to include offloaded mode.
Let drivers report whether the program is installed in the driver
or the HW by changing the prog_attached field from bool to
u8 (the type of the netlink attribute).

Exploit the fact that the value of XDP_ATTACHED_DRV is 1,
therefore since all drivers currently assign the mode with
double negation:
       mode = !!xdp_prog;
no drivers have to be modified.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
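
The double-negation trick this commit relies on can be shown in isolation. The enum below is a local illustration with hypothetical `SKETCH_` names: only the value 1 for the driver mode is stated in the commit message, and 0 for "none" is an assumption. `sketch_prog_attached` is likewise invented.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror for illustration; DRV == 1 per the commit, NONE == 0 assumed. */
enum {
	SKETCH_XDP_ATTACHED_NONE = 0,
	SKETCH_XDP_ATTACHED_DRV  = 1
};

/* '!!' collapses any non-NULL pointer to 1 and NULL to 0. */
uint8_t sketch_prog_attached(const void *xdp_prog)
{
	return !!xdp_prog;
}
```

Because the result is always exactly 0 or 1, widening the field from bool to u8 keeps existing drivers reporting the driver mode without any source change.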

xdp: add HW offload mode flag for installing programs · ee5d032f

Authored by Jakub Kicinski on Jun 21, 2017

Add an installation-time flag for requesting that the program
be installed only if it can be offloaded to HW.

Internally, a new command for ndo_xdp is added; this way we avoid
putting checks into drivers since they all return -EINVAL on
an unknown command.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

xdp: pass XDP flags into install handlers · 32d60277

Authored by Jakub Kicinski on Jun 21, 2017

Pass XDP flags to the xdp ndo. This will allow drivers to look
at the mode flags and make decisions about offload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: account for current skb length when deciding about UFO · a5cb659b

Authored by Michal Kubeček on Jun 19, 2017

Our customer encountered stuck NFS writes for blocks starting at specific
offsets w.r.t. page boundary caused by networking stack sending packets via
UFO enabled device with wrong checksum. The problem can be reproduced by
composing a long UDP datagram from multiple parts using MSG_MORE flag:

sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 1000, MSG_MORE, ...);
sendto(sd, buff, 3000, 0, ...);

Assume this packet is to be routed via a device with MTU 1500 and
NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
this condition is tested (among others) to decide whether to call
ip_ufo_append_data():

((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))

At the moment, we already have skb with 1028 bytes of data which is not
marked for GSO so that the test is false (fragheaderlen is usually 20).
Thus we append second 1000 bytes to this skb without invoking UFO. Third
sendto(), however, has sufficient length to trigger the UFO path so that we
end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
uses udp_csum() to calculate the checksum but that assumes all fragments
have correct checksum in skb->csum which is not true for UFO fragments.

When checking against MTU, we need to add skb->len to length of new segment
if we already have a partially filled skb and fragheaderlen only if there
isn't one.

In the IPv6 case, skb can only be null if this is the first segment so that
we have to use headersize (length of the first IPv6 header) rather than
fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.

Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Fixes: e4c5e13a ("ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
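
The length test discussed in this commit can be checked with the numbers from the message. The sketch models only the quoted sub-condition (the real code tests other things as well), and `sketch_skb`, `old_wants_ufo`, and `new_wants_ufo` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced skb: only the fields this test needs. */
struct sketch_skb {
	unsigned int len;
	bool is_gso;
};

/* Old test: ignores data already queued in a partially filled skb. */
bool old_wants_ufo(unsigned int length, unsigned int fragheaderlen,
		   unsigned int mtu, const struct sketch_skb *skb)
{
	return (length + fragheaderlen > mtu) || (skb && skb->is_gso);
}

/* Fixed test: add skb->len if an skb exists, fragheaderlen only if not. */
bool new_wants_ufo(unsigned int length, unsigned int fragheaderlen,
		   unsigned int mtu, const struct sketch_skb *skb)
{
	unsigned int pending = skb ? skb->len : fragheaderlen;

	return (length + pending > mtu) || (skb && skb->is_gso);
}
```

For the second sendto() in the reproducer (1000 new bytes, 1028 bytes already queued, MTU 1500), the old test sees 1020 and stays on the non-UFO path, while the fixed test sees 2028 and selects UFO before the skb types get mixed.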

23 Jun, 2017 3 commits

udp: fix poll() · 9bd780f5

Authored by Paolo Abeni on Jun 23, 2017

Michael reported a UDP breakage caused by the commit b65ac446
("udp: try to avoid 2 cache miss on dequeue").
The function __first_packet_length() can update the checksum bits
of the pending skb, making the scratch area out-of-sync, and
setting skb->csum, if the skb was previously in need of checksum
validation.

On later recvmsg() for such skb, checksum validation will be
invoked again - due to the wrong udp_skb_csum_unnecessary()
value - and will fail, causing the valid skb to be dropped.

This change addresses the issue by refreshing the scratch area in
__first_packet_length() after the possible checksum update.

Fixes: b65ac446 ("udp: try to avoid 2 cache miss on dequeue")
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp/v6: prefetch rmem_alloc in udp6_queue_rcv_skb() · 4b943fae

Authored by Paolo Abeni on Jun 22, 2017

Very similar to commit dd99e425 ("udp: prefetch
rmem_alloc in udp_queue_rcv_skb()"), this allows saving a cache
miss when the BH is the bottleneck for UDP over IPv6 packet
processing, e.g. for small packets when a single RX NIC ingress
queue is in use.

Performance under flood is unaffected when multiple NIC RX queues
are used, but when a single NIC RX queue is in use, this
gives a ~8% performance improvement.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: avoid unregistering inet6_dev for loopback · 60abc0be

Authored by WANG Cong on Jun 21, 2017

The per-netns loopback_dev->ip6_ptr is unregistered and set to
NULL when its MTU is set smaller than IPV6_MIN_MTU. As a
result, rt->rt6i_idev can be set to NULL after a
rt6_uncached_list_flush_dev() call, and a later call then
crashes.

In this case we should just bring its inet6_dev down rather
than unregistering it; at least prior to commit 176c39af
("netns: fix addrconf_ifdown kernel panic") we always
overrode the case for loopback.

Thanks a lot to Andrey for finding a reliable reproducer.

Fixes: 176c39af ("netns: fix addrconf_ifdown kernel panic")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22 Jun, 2017 2 commits

rds: tcp: set linger to 1 when unloading a rds-tcp · c14b0366

Authored by Sowmini Varadhan on Jun 21, 2017

If we are unloading the rds_tcp module, we can set linger to 1
and drop pending packets to accelerate reconnect. The peer will
end up resetting the connection based on new generation numbers
of the new incarnation, so hanging on to unsent TCP packets via
linger is mostly pointless in this case.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rds: tcp: send handshake ping-probe from passive endpoint · 69b92b5b

Authored by Sowmini Varadhan on Jun 21, 2017

The RDS handshake ping probe added by commit 5916e2c1
("RDS: TCP: Enable multipath RDS for TCP") is sent from rds_sendmsg()
before the first data packet is sent to a peer. If the conversation
is not bidirectional (i.e., one side is always passive and never
invokes rds_sendmsg()) and the passive side restarts its rds_tcp
module, a new HS ping probe needs to be sent, so that the number
of paths can be re-established.

This patch achieves that by sending a HS ping probe from
rds_tcp_accept_one() when c_npaths is 0 (i.e., we have not done
a handshake probe with this peer yet).

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Jenny Xu <jenny.x.xu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>