提交 · 1df94c3c5dadbce3df6cc0e989d8c85d43a903d6 · openeuler / raspberrypi-kernel

20 12月, 2017 1 次提交

net_sched: properly check for empty skb array on error path · 1df94c3c

由 Cong Wang 提交于 12月 18, 2017

First, the check of &q->ring.queue against NULL is wrong, it
is always false. We should check the value rather than the address.

Secondly, we need the same check in pfifo_fast_reset() too,
as both ->reset() and ->destroy() are called in qdisc_destroy().

Fixes: c5ad119f ("net: sched: pfifo_fast use skb_array")
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1df94c3c

19 12月, 2017 7 次提交

net: Disable GRO_HW when generic XDP is installed on a device. · 56f5aa77

由 Michael Chan 提交于 12月 16, 2017

Hardware should not aggregate any packets when generic XDP is installed.

Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

56f5aa77

net: Introduce NETIF_F_GRO_HW. · fb1f5f79

由 Michael Chan 提交于 12月 16, 2017

Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware
GRO.  With this flag, we can now independently turn on or off hardware
GRO when GRO is on.  Previously, drivers were using NETIF_F_GRO to
control hardware GRO and so it cannot be independently turned on or
off without affecting GRO.

Hardware GRO (just like GRO) guarantees that packets can be re-segmented
by TSO/GSO to reconstruct the original packet stream.  Logically,
GRO_HW should depend on GRO since it a subset, but we will let
individual drivers enforce this dependency as they see fit.

Since NETIF_F_GRO is not propagated between upper and lower devices,
NETIF_F_GRO_HW should follow suit since it is a subset of GRO.  In other
words, a lower device can independent have GRO/GRO_HW enabled or disabled
and no feature propagation is required.  This will preserve the current
GRO behavior.  This can be changed later if we decide to propagate GRO/
GRO_HW/RXCSUM from upper to lower devices.

Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb1f5f79

sock: Move the socket inuse to namespace. · 648845ab

由 Tonghao Zhang 提交于 12月 14, 2017

In some case, we want to know how many sockets are in use in
different _net_ namespaces. It's a key resource metric.

This patch add a member in struct netns_core. This is a counter
for socket-inuse in the _net_ namespace. The patch will add/sub
counter in the sk_alloc, sk_clone_lock and __sk_free.

This patch will not counter the socket created in kernel.
It's not very useful for userspace to know how many kernel
sockets we created.

The main reasons for doing this are that:

1. When linux calls the 'do_exit' for process to exit, the functions
'exit_task_namespaces' and 'exit_task_work' will be called sequentially.
'exit_task_namespaces' may have destroyed the _net_ namespace, but
'sock_release' called in 'exit_task_work' may use the _net_ namespace
if we counter the socket-inuse in sock_release.

2. socket and sock are in pair. More important, sock holds the _net_
namespace. We counter the socket-inuse in sock, for avoiding holding
_net_ namespace again in socket. It's a easy way to maintain the code.
Signed-off-by: NMartin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: NTonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

648845ab

sock: Change the netns_core member name. · 08fc7f81

由 Tonghao Zhang 提交于 12月 14, 2017

Change the member name will make the code more readable.
This patch will be used in next patch.
Signed-off-by: NMartin Zhang <zhangjunweimartin@didichuxing.com>
Signed-off-by: NTonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

08fc7f81

net: erspan: reload pointer after pskb_may_pull · d91e8db5

由 William Tu 提交于 12月 15, 2017

pskb_may_pull() can change skb->data, so we need to re-load pkt_md
and ershdr at the right place.

Fixes: 94d7d8f2 ("ip6_gre: add erspan v2 support")
Fixes: f551c91d ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d91e8db5

net: erspan: fix wrong return value · ae3e1337

由 William Tu 提交于 12月 15, 2017

If pskb_may_pull return failed, return PACKET_REJECT
instead of -ENOMEM.

Fixes: 94d7d8f2 ("ip6_gre: add erspan v2 support")
Fixes: f551c91d ("net: erspan: introduce erspan v2 for ip_gre")
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ae3e1337

net/ncsi: Don't take any action on HNCDSC AEN · 75e8e156

由 Samuel Mendoza-Jonas 提交于 12月 15, 2017

The current HNCDSC handler takes the status flag from the AEN packet and
will update or change the current channel based on this flag and the
current channel status.

However the flag from the HNCDSC packet merely represents the host link
state. While the state of the host interface is potentially interesting
information it should not affect the state of the NCSI link. Indeed the
NCSI specification makes no mention of any recommended action related to
the host network controller driver state.

Update the HNCDSC handler to record the host network driver status but
take no other action.
Signed-off-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com>
Acked-by: NJeremy Kerr <jk@ozlabs.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

75e8e156

16 12月, 2017 19 次提交

net: sched: fix static key imbalance in case of ingress/clsact_init error · b59e6979

由 Jiri Pirko 提交于 12月 15, 2017

Move static key increments to the beginning of the init function
so they pair 1:1 with decrements in ingress/clsact_destroy,
which is called in case ingress/clsact_init fails.

Fixes: 6529eaba ("net: sched: introduce tcf block infractructure")
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b59e6979

net: sched: fix clsact init error path · 343723dd

由 Jiri Pirko 提交于 12月 15, 2017

Since in qdisc_create, the destroy op is called when init fails, we
don't do cleanup in init and leave it up to destroy.
This fixes use-after-free when trying to put already freed block.

Fixes: 6e40cf2d ("net: sched: use extended variants of block_get/put in ingress and clsact qdiscs")
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

343723dd

SUNRPC: Fix a race in the receive code path · 90d91b0c

由 Trond Myklebust 提交于 12月 14, 2017

We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
race with the call to xprt_complete_rqst().
Reported-by: NChuck Lever <chuck.lever@oracle.com>
Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
Fixes: ce7c252a ("SUNRPC: Add a separate spinlock to protect..")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

90d91b0c

xprtrdma: Spread reply processing over more CPUs · ccede759

由 Chuck Lever 提交于 12月 04, 2017

Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
directly from RECV completion") introduced a performance regression
for NFS I/O small enough to not need memory registration. In multi-
threaded benchmarks that generate primarily small I/O requests,
IOPS throughput is reduced by nearly a third. This patch restores
the previous level of throughput.

Because workqueues are typically BOUND (in particular ib_comp_wq,
nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
to aggregate on the CPU that is handling Receive completions.

The usual approach to addressing this problem is to create a QP
and CQ for each CPU, and then schedule transactions on the QP
for the CPU where you want the transaction to complete. The
transaction then does not require an extra context switch during
completion to end up on the same CPU where the transaction was
started.

This approach doesn't work for the Linux NFS/RDMA client because
currently the Linux NFS client does not support multiple connections
per client-server pair, and the RDMA core API does not make it
straightforward for ULPs to determine which CPU is responsible for
handling Receive completions for a CQ.

So for the moment, record the CPU number in the rpcrdma_req before
the transport sends each RPC Call. Then during Receive completion,
queue the RPC completion on that same CPU.

Additionally, move all RPC completion processing to the deferred
handler so that even RPCs with simple small replies complete on
the CPU that sent the corresponding RPC Call.

Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

ccede759

ip_gre: fix wrong return value of erspan_rcv · c05fad57

由 Haishuang Yan 提交于 12月 15, 2017

If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM.

Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c05fad57

sctp: support sysctl to allow users to use stream interleave · 463118c3

由 Xin Long 提交于 12月 15, 2017

This is the last patch for support of stream interleave, after this patch,
users could enable stream interleave by systcl -w net.sctp.intl_enable=1.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

463118c3

sctp: update mid instead of ssn when doing stream and asoc reset · 107e2425

由 Xin Long 提交于 12月 15, 2017

When using idata and doing stream and asoc reset, setting ssn with
0 could only clear the 1st 16 bits of mid.

So to make this work for both data and idata, it sets mid with 0
instead of ssn, and also mid_uo for unordered idata also need to
be cleared, as said in section 2.3.2 of RFC8260.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

107e2425

sctp: add stream interleave support in stream scheduler · ef4775e3

由 Xin Long 提交于 12月 15, 2017

As Marcelo said in the stream scheduler patch:

  Support for I-DATA chunks, also described in RFC8260, with user message
  interleaving is straightforward as it just requires the schedulers to
  probe for the feature and ignore datamsg boundaries when dequeueing.

All needs to do is just to ignore datamsg boundaries when dequeueing.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ef4775e3

sctp: implement handle_ftsn for sctp_stream_interleave · de60fe91

由 Xin Long 提交于 12月 15, 2017

handle_ftsn is added as a member of sctp_stream_interleave, used to skip
ssn for data or mid for idata, called for SCTP_CMD_PROCESS_FWDTSN cmd.

sctp_handle_iftsn works for ifwdtsn, and sctp_handle_fwdtsn works for
fwdtsn. Note that different from sctp_handle_fwdtsn, sctp_handle_iftsn
could do stream abort pd.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de60fe91

sctp: implement report_ftsn for sctp_stream_interleave · 47b20a88

由 Xin Long 提交于 12月 15, 2017

report_ftsn is added as a member of sctp_stream_interleave, used to
skip tsn from tsnmap, remove old events from reasm or lobby queue,
and abort pd for data or idata, called for SCTP_CMD_REPORT_FWDTSN
cmd and asoc reset.

sctp_report_iftsn works for ifwdtsn, and sctp_report_fwdtsn works
for fwdtsn. Note that sctp_report_iftsn doesn't do asoc abort_pd,
as stream abort_pd will be done when handling ifwdtsn. But when
ftsn is equal with ftsn, which means asoc reset, asoc abort_pd has
to be done.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

47b20a88

sctp: implement validate_ftsn for sctp_stream_interleave · 0fc2ea92

由 Xin Long 提交于 12月 15, 2017

validate_ftsn is added as a member of sctp_stream_interleave, used to
validate ssn/chunk type for fwdtsn or mid (message id)/chunk type for
ifwdtsn, called in sctp_sf_eat_fwd_tsn, just as validate_data.

If this check fails, an abort packet will be sent, as said in section
2.3.1 of RFC8260.

As ifwdtsn and fwdtsn chunks have different length, it also defines
ftsn_chunk_len for sctp_stream_interleave to describe the chunk size.
Then it replaces all sizeof(struct sctp_fwdtsn_chunk) with
sctp_ftsnchk_len.

It also adds the process for ifwdtsn in rx path. As Marcelo pointed
out, there's no need to add event table for ifwdtsn, but just share
prsctp_chunk_event_table with fwdtsn's. It would drop fwdtsn chunk
for ifwdtsn and drop ifwdtsn chunk for fwdtsn by calling validate_ftsn
in sctp_sf_eat_fwd_tsn.

After this patch, the ifwdtsn can be accepted.

Note that this patch also removes the sctp.intl_enable check for
idata chunks in sctp_chunk_event_lookup, as it will do this check
in validate_data later.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0fc2ea92

sctp: implement generate_ftsn for sctp_stream_interleave · 8e0c3b73

由 Xin Long 提交于 12月 15, 2017

generate_ftsn is added as a member of sctp_stream_interleave, used to
create fwdtsn or ifwdtsn chunk according to abandoned chunks, called
in sctp_retransmit and sctp_outq_sack.

sctp_generate_iftsn works for ifwdtsn, and sctp_generate_fwdtsn is
still used for making fwdtsn.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8e0c3b73

sctp: add basic structures and make chunk function for ifwdtsn · 2d07a49a

由 Xin Long 提交于 12月 15, 2017

sctp_ifwdtsn_skip, sctp_ifwdtsn_hdr and sctp_ifwdtsn_chunk are used to
define and parse I-FWD TSN chunk format, and sctp_make_ifwdtsn is a
function to build the chunk.

The I-FORWARD-TSN Chunk Format is defined in section 2.3.1 of RFC8260.
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2d07a49a

net: sched: Move to new offload indication in RED · 428a68af

由 Yuval Mintz 提交于 12月 14, 2017

Let RED utilize the new internal flag, TCQ_F_OFFLOADED,
to mark a given qdisc as offloaded instead of using a dedicated
indication.

Also, change internal logic into looking at said flag when possible.

Fixes: 602f3baf ("net_sch: red: Add offload ability to RED qdisc")
Signed-off-by: NYuval Mintz <yuvalm@mellanox.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

428a68af

net: sched: Add TCA_HW_OFFLOAD · 7a4fa291

由 Yuval Mintz 提交于 12月 14, 2017

Qdiscs can be offloaded to HW, but current implementation isn't uniform.
Instead, qdiscs either pass information about offload status via their
TCA_OPTIONS or omit it altogether.

Introduce a new attribute - TCA_HW_OFFLOAD that would form a uniform
uAPI for the offloading status of qdiscs.
Signed-off-by: NYuval Mintz <yuvalm@mellanox.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a4fa291

ip6_gre: add erspan v2 support · 94d7d8f2

由 William Tu 提交于 12月 13, 2017

Similar to support for ipv4 erspan, this patch adds
erspan v2 to ip6erspan tunnel.
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

94d7d8f2

net: erspan: introduce erspan v2 for ip_gre · f551c91d

由 William Tu 提交于 12月 13, 2017

The patch adds support for erspan version 2.  Not all features are
supported in this patch.  The SGT (security group tag), GRA (timestamp
granularity), FT (frame type) are set to fixed value.  Only hardware
ID and direction are configurable.  Optional subheader is also not
supported.
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f551c91d

net: erspan: refactor existing erspan code · 1d7e2ed2

由 William Tu 提交于 12月 13, 2017

The patch refactors the existing erspan implementation in order
to support erspan version 2, which has additional metadata.  So, in
stead of having one 'struct erspanhdr' holding erspan version 1,
breaks it into 'struct erspan_base_hdr' and 'struct erspan_metadata'.
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1d7e2ed2

sock: free skb in skb_complete_tx_timestamp on error · 35b99dff

由 Willem de Bruijn 提交于 12月 13, 2017

skb_complete_tx_timestamp must ingest the skb it is passed. Call
kfree_skb if the skb cannot be enqueued.

Fixes: b245be1f ("net-timestamp: no-payload only sysctl")
Fixes: 9ac25fc0 ("net: fix socket refcounting in skb_complete_tx_timestamp()")
Reported-by: NRichard Cochran <richardcochran@gmail.com>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

35b99dff

15 12月, 2017 2 次提交

net: dsa: mediatek: combine MediaTek tag with VLAN tag · f0af3431

由 Sean Wang 提交于 12月 15, 2017

In order to let MT7530 switch can recognize well those egress packets
having both special tag and VLAN tag, the information about the special
tag should be carried on the existing VLAN tag. On the other hand, it's
unnecessary for extra handling for ingress packets when VLAN tag is
present since it is able to put the VLAN tag after the special tag and
then follow the existing way to parse.
Signed-off-by: NSean Wang <sean.wang@mediatek.com>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f0af3431

kernel: make groups_sort calling a responsibility group_info allocators · bdcf0a42

由 Thiago Rafael Becker 提交于 12月 14, 2017

In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.comSigned-off-by: NThiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NNeilBrown <neilb@suse.com>
Acked-by: N"J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bdcf0a42

14 12月, 2017 11 次提交

tcp: refresh tcp_mstamp from timers callbacks · 4688eb7c

由 Eric Dumazet 提交于 12月 12, 2017

Only the retransmit timer currently refreshes tcp_mstamp

We should do the same for delayed acks and keepalives.

Even if RFC 7323 does not request it, this is consistent to what linux
did in the past, when TS values were based on jiffies.

Fixes: 385e2070 ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Mike Maloney <maloney@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NMike Maloney <maloney@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4688eb7c

tcp: fix potential underestimation on rcv_rtt · 9ee11bd0

由 Wei Wang 提交于 12月 12, 2017

When ms timestamp is used, current logic uses 1us in
tcp_rcv_rtt_update() when the real rcv_rtt is within 1 - 999us.
This could cause rcv_rtt underestimation.
Fix it by always using a min value of 1ms if ms timestamp is used.

Fixes: 645f4c6f ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps")
Signed-off-by: NWei Wang <weiwan@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9ee11bd0

tcp: pause Fast Open globally after third consecutive timeout · 7268586b

由 Yuchung Cheng 提交于 12月 12, 2017

Prior to this patch, active Fast Open is paused on a specific
destination IP address if the previous connections to the
IP address have experienced recurring timeouts . But recent
experiments by Microsoft (https://goo.gl/cykmn7) and Mozilla
browsers indicate the isssue is often caused by broken middle-boxes
sitting close to the client. Therefore it is much better user
experience if Fast Open is disabled out-right globally to avoid
experiencing further timeouts on connections toward other
destinations.

This patch changes the destination-IP disablement to global
disablement if a connection experiencing recurring timeouts
or aborts due to timeout.  Repeated incidents would still
exponentially increase the pause time, starting from an hour.
This is extremely conservative but an unfortunate compromise to
minimize bad experience due to broken middle-boxes.
Reported-by: NDragana Damjanovic <ddamjanovic@mozilla.com>
Reported-by: NPatrick McManus <mcmanus@ducksong.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Reviewed-by: NWei Wang <weiwan@google.com>
Reviewed-by: NNeal Cardwell <ncardwell@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7268586b

net: avoid skb_warn_bad_offload on IS_ERR · 8d74e9f8

由 Willem de Bruijn 提交于 12月 12, 2017

skb_warn_bad_offload warns when packets enter the GSO stack that
require skb_checksum_help or vice versa. Do not warn on arbitrary
bad packets. Packet sockets can craft many. Syzkaller was able to
demonstrate another one with eth_type games.

In particular, suppress the warning when segmentation returns an
error, which is for reasons other than checksum offload.

See also commit 36c92474 ("net: WARN if skb_checksum_help() is
called on skb requiring segmentation") for context on this warning.
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d74e9f8

net: bridge: use rhashtable for fdbs · eb793583

由 Nikolay Aleksandrov 提交于 12月 12, 2017

Before this patch the bridge used a fixed 256 element hash table which
was fine for small use cases (in my tests it starts to degrade
above 1000 entries), but it wasn't enough for medium or large
scale deployments. Modern setups have thousands of participants in a
single bridge, even only enabling vlans and adding a few thousand vlan
entries will cause a few thousand fdbs to be automatically inserted per
participating port. So we need to scale the fdb table considerably to
cope with modern workloads, and this patch converts it to use a
rhashtable for its operations thus improving the bridge scalability.
Tests show the following results (10 runs each), at up to 1000 entries
rhashtable is ~3% slower, at 2000 rhashtable is 30% faster, at 3000 it
is 2 times faster and at 30000 it is 50 times faster.
Obviously this happens because of the properties of the two constructs
and is expected, rhashtable keeps pretty much a constant time even with
10000000 entries (tested), while the fixed hash table struggles
considerably even above 10000.
As a side effect this also reduces the net_bridge struct size from 3248
bytes to 1344 bytes. Also note that the key struct is 8 bytes.
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eb793583

tcp/dccp: avoid one atomic operation for timewait hashdance · ec94c269

由 Eric Dumazet 提交于 12月 11, 2017

First, rename __inet_twsk_hashdance() to inet_twsk_hashdance()

Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead
of 4, but adding a fat warning that we do not have the right to access
tw anymore after inet_twsk_hashdance()
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ec94c269

Bluetooth: introduce DEFINE_SHOW_ATTRIBUTE() macro · 22b371cb

由 Andy Shevchenko 提交于 11月 22, 2017

This macro deduplicates a lot of similar code across the hci_debugfs.c
module. Targeting to be moved to seq_file.h eventually.
Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

22b371cb

tcp: allow TLP in ECN CWR · b4f70c3d

由 Neal Cardwell 提交于 12月 11, 2017

This patch enables tail loss probe in cwnd reduction (CWR) state
to detect potential losses. Prior to this patch, since the sender
uses PRR to determine the cwnd in CWR state, the combination of
CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls
upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer()
defers sending wait for more ACKs. The ACKs may not come due to
packet losses.

Disallowing TLP when there is unused cwnd had the primary effect
of disallowing TLP when there is TSO deferral, Nagle deferral,
or we hit the rwin limit. Because basically every application
write() or incoming ACK will cause us to run tcp_write_xmit()
to see if we can send more, and then if we sent something we call
tcp_schedule_loss_probe() to see if we should schedule a TLP. At
that point, there are a few common reasons why some cwnd budget
could still be unused:

(a) rwin limit
(b) nagle check
(c) TSO deferral
(d) TSQ

For (d), after the next packet tx completion the TSQ mechanism
will allow us to send more packets, so we don't really need a
TLP (in practice it shouldn't matter whether we schedule one
or not). But for (a), (b), (c) the sender won't send any more
packets until it gets another ACK. But if the whole flight was
lost, or all the ACKs were lost, then we won't get any more ACKs,
and ideally we should schedule and send a TLP to get more feedback.
In particular for a long time we have wanted some kind of timer for
TSO deferral, and at least this would give us some kind of timer
Reported-by: NSteve Ibanez <sibanez@stanford.edu>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Reviewed-by: NNandita Dukkipati <nanditad@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b4f70c3d

net_sched: switch to exit_batch for action pernet ops · 039af9c6

由 Cong Wang 提交于 12月 11, 2017

Since we now hold RTNL lock in tc_action_net_exit(), it is good to
batch them to speedup tc action dismantle.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

039af9c6

net: igmp: Use correct source address on IGMPv3 reports · a46182b0

由 Kevin Cernekee 提交于 12月 11, 2017

Closing a multicast socket after the final IPv4 address is deleted
from an interface can generate a membership report that uses the
source IP from a different interface.  The following test script, run
from an isolated netns, reproduces the issue:

    #!/bin/bash

    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link set dummy0 up
    ip link set dummy1 up
    ip addr add 10.1.1.1/24 dev dummy0
    ip addr add 192.168.99.99/24 dev dummy1

    tcpdump -U -i dummy0 &
    socat EXEC:"sleep 2" \
        UDP4-DATAGRAM:239.101.1.68:8889,ip-add-membership=239.0.1.68:10.1.1.1 &

    sleep 1
    ip addr del 10.1.1.1/24 dev dummy0
    sleep 5
    kill %tcpdump

RFC 3376 specifies that the report must be sent with a valid IP source
address from the destination subnet, or from address 0.0.0.0.  Add an
extra check to make sure this is the case.
Signed-off-by: NKevin Cernekee <cernekee@chromium.org>
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a46182b0

tipc: eliminate potential memory leak · c545a945

由 Jon Maloy 提交于 12月 11, 2017

In the function tipc_sk_mcast_rcv() we call refcount_dec(&skb->users)
on received sk_buffers. Since the reference counter might hit zero at
this point, we have a potential memory leak.

We fix this by replacing refcount_dec() with kfree_skb().
Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c545a945