提交 · 64bc17811b72758753e2b64cd8f2a63812c61fe1 · openeuler / raspberrypi-kernel

20 9月, 2017 5 次提交

ipv4: speedup ipv6 tunnels dismantle · 64bc1781

由 Eric Dumazet 提交于 9月 19, 2017

Implement exit_batch() method to dismantle more devices
per round.

(rtnl_lock() ...
 unregister_netdevice_many() ...
 rtnl_unlock())

Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch :
$ time ./add_del_unshare.sh
net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0

real    1m38.965s
user    0m0.688s
sys     0m37.017s

After patch:
$ time ./add_del_unshare.sh
net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0

real	0m22.117s
user	0m0.728s
sys	0m35.328s
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

64bc1781

ipv6: addrlabel: per netns list · a90c9347

由 Eric Dumazet 提交于 9月 19, 2017

Having a global list of labels do not scale to thousands of
netns in the cloud era. This causes quadratic behavior on
netns creation and deletion.

This is time having a per netns list of ~10 labels.

Tested:

$ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]

real    0m20.837s # instead of 0m24.227s
user    0m0.328s
sys     0m20.338s # instead of 0m23.753s

    16.17%       ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
    12.30%       ip  [kernel.kallsyms]  [k] netlink_has_listeners
     6.76%       ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     5.78%       ip  [kernel.kallsyms]  [k] memset_erms
     5.77%       ip  [kernel.kallsyms]  [k] kobject_uevent_env
     5.18%       ip  [kernel.kallsyms]  [k] refcount_sub_and_test
     4.96%       ip  [kernel.kallsyms]  [k] _raw_read_lock
     3.82%       ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
     3.33%       ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     2.11%       ip  [kernel.kallsyms]  [k] unmap_page_range
     1.77%       ip  [kernel.kallsyms]  [k] __wake_up
     1.69%       ip  [kernel.kallsyms]  [k] strlen
     1.17%       ip  [kernel.kallsyms]  [k] __wake_up_common
     1.09%       ip  [kernel.kallsyms]  [k] insert_header
     1.04%       ip  [kernel.kallsyms]  [k] page_remove_rmap
     1.01%       ip  [kernel.kallsyms]  [k] consume_skb
     0.98%       ip  [kernel.kallsyms]  [k] netlink_trim
     0.51%       ip  [kernel.kallsyms]  [k] kernfs_link_sibling
     0.51%       ip  [kernel.kallsyms]  [k] filemap_map_pages
     0.46%       ip  [kernel.kallsyms]  [k] memcpy_erms
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a90c9347

net_sched: no need to free qdisc in RCU callback · 752fbcc3

由 Cong Wang 提交于 9月 19, 2017

gen estimator has been rewritten in commit 1c0d32fd
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller no longer needs to wait for a grace period. So this
patch gets rid of it.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

752fbcc3

net: dsa: remove copy of master ethtool_ops · f5619866

由 Vivien Didelot 提交于 9月 19, 2017

There is no need to store a copy of the master ethtool ops, storing the
original pointer in DSA and the new one in the master netdev itself is
enough.

In the meantime, set orig_ethtool_ops to NULL when restoring the master
ethtool ops and check the presence of the master original ethtool ops as
well as its needed functions before calling them.
Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f5619866

net: sk_buff rbnode reorg · bffa72cf

由 Eric Dumazet 提交于 9月 19, 2017

skb->rbnode shares space with skb->next, skb->prev and skb->tstamp

Current uses (TCP receive ofo queue and netem) need to save/restore
tstamp, while skb->dev is either NULL (TCP) or a constant for a given
queue (netem).

Since we plan using an RB tree for TCP retransmit queue to speedup SACK
processing with large BDP, this patch exchanges skb->dev and
skb->tstamp.

This saves some overhead in both TCP and netem.

v2: removes the swtstamp field from struct tcp_skb_cb
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bffa72cf

19 9月, 2017 1 次提交

net: Convert int functions to bool · 7016e062

由 Joe Perches 提交于 9月 13, 2017

Global function ipv6_rcv_saddr_equal and static functions
ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int.

bool is slightly more descriptive for these functions so change
their return type from int to bool.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7016e062

16 9月, 2017 1 次提交

sctp: fix an use-after-free issue in sctp_sock_dump · d25adbeb

由 Xin Long 提交于 9月 15, 2017

Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the
dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep
with holding sock and sock lock.

But Paolo noticed that endpoint could be destroyed in sctp_rcv without
sock lock protection. It means the use-after-free issue still could be
triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks
!ep, although it's pretty hard to reproduce.

I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close
and sctp_sock_dump long time.

This patch is to add another param cb_done to sctp_for_each_transport
and dump ep->assocs with holding tsp after jumping out of transport's
traversal in it to avoid this issue.

It can also improve sctp diag dump to make it run faster, as no need
to save sk into cb->args[5] and keep calling sctp_for_each_transport
any more.

This patch is also to use int * instead of int for the pos argument
in sctp_for_each_transport, which could make postion increment only
in sctp_for_each_transport and no need to keep changing cb->args[2]
in sctp_sock_filter and sctp_sock_dump any more.

Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump")
Reported-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NXin Long <lucien.xin@gmail.com>
Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d25adbeb

14 9月, 2017 1 次提交

sctp: potential read out of bounds in sctp_ulpevent_type_enabled() · fa5f7b51

由 Dan Carpenter 提交于 9月 14, 2017

This code causes a static checker warning because Smatch doesn't trust
anything that comes from skb->data.  I've reviewed this code and I do
think skb->data can be controlled by the user here.

The sctp_event_subscribe struct has 13 __u8 fields and we want to see
if ours is non-zero.  sn_type can be any value in the 0-USHRT_MAX range.
We're subtracting SCTP_SN_TYPE_BASE which is 1 << 15 so we could read
either before the start of the struct or after the end.

This is a very old bug and it's surprising that it would go undetected
for so long but my theory is that it just doesn't have a big impact so
it would be hard to notice.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa5f7b51

13 9月, 2017 1 次提交

net_sched: get rid of tcfa_rcu · d7fb60b9

由 Cong Wang 提交于 9月 11, 2017

gen estimator has been rewritten in commit 1c0d32fd
("net_sched: gen_estimator: complete rewrite of rate estimators"),
the caller is no longer needed to wait for a grace period.
So this patch gets rid of it.

This also completely closes a race condition between action free
path and filter chain add/remove path for the following patch.
Because otherwise the nested RCU callback can't be caught by
rcu_barrier().

Please see also the comments in code.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d7fb60b9

09 9月, 2017 1 次提交

netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to rhashtable" · e1bf1687

由 Florian Westphal 提交于 9月 06, 2017

This reverts commit 870190a9.

It was not a good idea. The custom hash table was a much better
fit for this purpose.

A fast lookup is not essential, in fact for most cases there is no lookup
at all because original tuple is not taken and can be used as-is.
What needs to be fast is insertion and deletion.

rhlist removal however requires a rhlist walk.
We can have thousands of entries in such a list if source port/addresses
are reused for multiple flows, if this happens removal requests are so
expensive that deletions of a few thousand flows can take several
seconds(!).

The advantages that we got from rhashtable are:
1) table auto-sizing
2) multiple locks

1) would be nice to have, but it is not essential as we have at
most one lookup per new flow, so even a million flows in the bysource
table are not a problem compared to current deletion cost.
2) is easy to add to custom hash table.

I tried to add hlist_node to rhlist to speed up rhltable_remove but this
isn't doable without changing semantics.  rhltable_remove_fast will
check that the to-be-deleted object is part of the table and that
requires a list walk that we want to avoid.

Furthermore, using hlist_node increases size of struct rhlist_head, which
in turn increases nf_conn size.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=196821Reported-by: NIvan Babrou <ibobrik@gmail.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

e1bf1687

06 9月, 2017 3 次提交

net: dsa: Allow switch drivers to indicate number of TX queues · 55199df6

由 Florian Fainelli 提交于 9月 03, 2017

Let switch drivers indicate how many TX queues they support. Some
switches, such as Broadcom Starfighter 2 are designed with 8 egress
queues. Future changes will allow us to leverage the queue mapping and
direct the transmission towards a particular queue.
Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

55199df6

flow_dissector: Cleanup control flow · 3a1214e8

由 Tom Herbert 提交于 9月 01, 2017

__skb_flow_dissect is riddled with gotos that make discerning the flow,
debugging, and extending the capability difficult. This patch
reorganizes things so that we only perform goto's after the two main
switch statements (no gotos within the cases now). It also eliminates
several goto labels so that there are only two labels that can be target
for goto.
Reported-by: NAlexander Popov <alex.popov@linux.com>
Signed-off-by: NTom Herbert <tom@quantonium.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3a1214e8

net/ncsi: fix ncsi_vlan_rx_{add,kill}_vid references · fd0c88b7

由 Arnd Bergmann 提交于 9月 05, 2017

We get a new link error in allmodconfig kernels after ftgmac100
started using the ncsi helpers:

ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!
ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!

Related to that, we get another error when CONFIG_NET_NCSI is disabled:

drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean 'ncsi_start_dev'?
drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean 'ncsi_vlan_rx_add_vid'?

This fixes both problems at once, using a 'static inline' stub helper
for the disabled case, and exporting the functions when they are present.

Fixes: 51564585 ("ftgmac100: Support NCSI VLAN filtering when available")
Fixes: 21acf630 ("net/ncsi: Configure VLAN tag filter")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fd0c88b7

05 9月, 2017 1 次提交

mac80211: fix VLAN handling with TXQs · 53168215

由 Johannes Berg 提交于 6月 22, 2017

With TXQs, the AP_VLAN interfaces are resolved to their owner AP
interface when enqueuing the frame, which makes sense since the
frame really goes out on that as far as the driver is concerned.

However, this introduces a problem: frames to be encrypted with
a VLAN-specific GTK will now be encrypted with the AP GTK, since
the information about which virtual interface to use to select
the key is taken from the TXQ.

Fix this by preserving info->control.vif and using that in the
dequeue function. This now requires doing the driver-mapping
in the dequeue as well.

Since there's no way to filter the frames that are sitting on a
TXQ, drop all frames, which may affect other interfaces, when an
AP_VLAN is removed.

Cc: stable@vger.kernel.org
Signed-off-by: NJohannes Berg <johannes.berg@intel.com>

53168215

04 9月, 2017 7 次提交

net: Remove CONFIG_NETFILTER_DEBUG and _ASSERT() macros. · 9efdb14f

由 Varsha Rao 提交于 8月 30, 2017

This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they
are no longer required. Replace _ASSERT() macros with WARN_ON().
Signed-off-by: NVarsha Rao <rvarsha016@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

9efdb14f

net: Replace NF_CT_ASSERT() with WARN_ON(). · 44d6e2f2

由 Varsha Rao 提交于 8月 30, 2017

This patch removes NF_CT_ASSERT() and instead uses WARN_ON().
Signed-off-by: NVarsha Rao <rvarsha016@gmail.com>

44d6e2f2

F
netfilter: remove unused hooknum arg from packet functions · d1c1e39d
由 Florian Westphal 提交于 8月 29, 2017
```
tested with allmodconfig build.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
```
d1c1e39d

netfilter: nf_tables: add select_ops for stateful objects · dfc46034

由 Pablo M. Bermudo Garay 提交于 8月 23, 2017

This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.

This change is needed for upcoming additions to the stateful objects
infrastructure.
Signed-off-by: NPablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

dfc46034

Revert "net: fix percpu memory leaks" · 5a63643e

由 Jesper Dangaard Brouer 提交于 9月 01, 2017

This reverts commit 1d6119ba.

After reverting commit 6d7b857d ("net: use lib/percpu_counter API
for fragmentation mem accounting") then here is no need for this
fix-up patch.  As percpu_counter is no longer used, it cannot
memory leak it any-longer.

Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119ba ("net: fix percpu memory leaks")
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5a63643e

Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · fb452a1a

由 Jesper Dangaard Brouer 提交于 9月 01, 2017

This reverts commit 6d7b857d.

There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.

The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).

The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.

We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:

1) On systems with CPUs > 24, the heavier fully locked
   __percpu_counter_sum() is always invoked, which will be more
   expensive than the atomic_t that is reverted to.

Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option.  To mitigate this, the batch size could be
decreased and thresh be increased.

2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
   CPU, before SKBs are pushed into sockets on remote CPUs.  Given
   NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
   likely be limited.  Thus, a fair chance that atomic add+dec happen
   on the same CPU.

Revert note that commit 1d6119ba ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.

Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119ba ("net: fix percpu memory leaks")
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb452a1a

ipv4: Don't override return code from ip_route_input_noref() · 64327fc8

由 Stefano Brivio 提交于 8月 31, 2017

After ip_route_input() calls ip_route_input_noref(), another
check on skb_dst() is done, but if this fails, we shouldn't
override the return code from ip_route_input_noref(), as it
could have been more specific (i.e. -EHOSTUNREACH).

This also saves one call to skb_dst_force_safe() and one to
skb_dst() in case the ip_route_input_noref() check fails.
Reported-by: NSabrina Dubroca <sdubroca@redhat.com>
Fixes: 9df16efa ("ipv4: call dst_hold_safe() properly")
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Acked-by: NWei Wang <weiwan@google.com>
Acked-by: NSabrina Dubroca <sd@queasysnail.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

64327fc8

02 9月, 2017 2 次提交

net: Add module reference to FIB notifiers · 864150df

由 Ido Schimmel 提交于 9月 01, 2017

When a listener registers to the FIB notification chain it receives a
dump of the FIB entries and rules from existing address families by
invoking their dump operations.

While we call into these modules we need to make sure they aren't
removed. Do that by increasing their reference count before invoking
their dump operations and decrease it afterwards.

Fixes: 04b1d4e5 ("net: core: Make the FIB notification chain generic")
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

864150df

Bluetooth: make baswap src const · 65bce462

由 Loic Poulain 提交于 9月 01, 2017

Signed-off-by: NLoic Poulain <loic.poulain@linaro.org>
Signed-off-by: NBjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>

65bce462

01 9月, 2017 2 次提交

devlink: Add IPv6 header for dpipe · 1797f5b3

由 Arkadi Sharshevsky 提交于 8月 31, 2017

This will be used by the IPv6 host table which will be introduced in the
following patches. The fields in the header are added per-use. This header
is global and can be reused by many drivers.
Signed-off-by: NArkadi Sharshevsky <arkadis@mellanox.com>
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1797f5b3

net_sched: add reverse binding for tc class · 07d79fc7

由 Cong Wang 提交于 8月 30, 2017

TC filters when used as classifiers are bound to TC classes.
However, there is a hidden difference when adding them in different
orders:

1. If we add tc classes before its filters, everything is fine.
   Logically, the classes exist before we specify their ID's in
   filters, it is easy to bind them together, just as in the current
   code base.

2. If we add tc filters before the tc classes they bind, we have to
   do dynamic lookup in fast path. What's worse, this happens all
   the time not just once, because on fast path tcf_result is passed
   on stack, there is no way to propagate back to the one in tc filters.

This hidden difference hurts performance silently if we have many tc
classes in hierarchy.

This patch intends to close this gap by doing the reverse binding when
we create a new class, in this case we can actually search all the
filters in its parent, match and fixup by classid. And because
tcf_result is specific to each type of tc filter, we have to introduce
a new ops for each filter to tell how to bind the class.

Note, we still can NOT totally get rid of those class lookup in
->enqueue() because cgroup and flow filters have no way to determine
the classid at setup time, they still have to go through dynamic lookup.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

07d79fc7

31 8月, 2017 4 次提交

xfrm: Add support for network devices capable of removing the ESP trailer · 47ebcc0b

由 Yossi Kuperman 提交于 8月 30, 2017

In conjunction with crypto offload [1], removing the ESP trailer by
hardware can potentially improve the performance by avoiding (1) a
cache miss incurred by reading the nexthdr field and (2) the necessity
to calculate the csum value of the trailer in order to keep skb->csum
valid.

This patch introduces the changes to the xfrm stack and merely serves
as an infrastructure. Subsequent patch to mlx5 driver will put this to
a good use.

[1] https://www.mail-archive.com/netdev@vger.kernel.org/msg175733.htmlSigned-off-by: NYossi Kuperman <yossiku@mellanox.com>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>

47ebcc0b

net/sched: Change act_api and act_xxx modules to use IDR · 65a206c0

由 Chris Mi 提交于 8月 30, 2017

Typically, each TC filter has its own action. All the actions of the
same type are saved in its hash table. But the hash buckets are too
small that it degrades to a list. And the performance is greatly
affected. For example, it takes about 0m11.914s to insert 64K rules.
If we convert the hash table to IDR, it only takes about 0m1.500s.
The improvement is huge.

But please note that the test result is based on previous patch that
cls_flower uses IDR.
Signed-off-by: NChris Mi <chrism@mellanox.com>
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Acked-by: NJamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

65a206c0

tcp: Revert "tcp: remove header prediction" · 31770e34

由 Florian Westphal 提交于 8月 30, 2017

This reverts commit 45f119bf.

Eric Dumazet says:
  We found at Google a significant regression caused by
  45f119bf tcp: remove header prediction

  In typical RPC  (TCP_RR), when a TCP socket receives data, we now call
  tcp_ack() while we used to not call it.

  This touches enough cache lines to cause a slowdown.

so problem does not seem to be HP removal itself but the tcp_ack()
call.  Therefore, it might be possible to remove HP after all, provided
one finds a way to elide tcp_ack for most cases.
Reported-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

31770e34

tcp: Revert "tcp: remove CA_ACK_SLOWPATH" · c1d2b4c3

由 Florian Westphal 提交于 8月 30, 2017

This change was a followup to the header prediction removal,
so first revert this as a prerequisite to back out hp removal.
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c1d2b4c3

30 8月, 2017 4 次提交

neigh: increase queue_len_bytes to match wmem_default · eaa72dc4

由 Eric Dumazet 提交于 8月 29, 2017

Florian reported UDP xmit drops that could be root caused to the
too small neigh limit.

Current limit is 64 KB, meaning that even a single UDP socket would hit
it, since its default sk_sndbuf comes from net.core.wmem_default
(~212992 bytes on 64bit arches).

Once ARP/ND resolution is in progress, we should allow a little more
packets to be queued, at least for one producer.

Once neigh arp_queue is filled, a rogue socket should hit its sk_sndbuf
limit and either block in sendmsg() or return -EAGAIN.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

eaa72dc4

ipv6: Use rt6i_idev index for echo replies to a local address · 1b70d792

由 David Ahern 提交于 8月 28, 2017

Tariq repored local pings to linklocal address is failing:
$ ifconfig ens8
ens8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 11.141.16.6  netmask 255.255.0.0  broadcast 11.141.255.255
        inet6 fe80::7efe:90ff:fecb:7502  prefixlen 64  scopeid 0x20<link>
        ether 7c:fe:90:cb:75:02  txqueuelen 1000  (Ethernet)
        RX packets 12  bytes 1164 (1.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 30  bytes 2484 (2.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

$  /bin/ping6 -c 3 fe80::7efe:90ff:fecb:7502%ens8
PING fe80::7efe:90ff:fecb:7502%ens8(fe80::7efe:90ff:fecb:7502) 56 data bytes
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1b70d792

net: add NSH header structures and helpers · 1f0b7744

由 Yi Yang 提交于 8月 28, 2017

NSH (Network Service Header)[1] is a new protocol for service
function chaining, it can be handled as a L3 protocol like
IPv4 and IPv6, Eth + NSH + Inner packet or VxLAN-gpe + NSH +
Inner packet are two typical use cases.

This patch adds NSH header structures and helpers for NSH GSO
support and Open vSwitch NSH support.

[1] https://datatracker.ietf.org/doc/draft-ietf-sfc-nsh/

[Jiri: added nsh_hdr() helper and renamed the header struct to "struct
nshhdr" to match the usual pattern. Removed packet type defines, these are
now shared with VXLAN-GPE.]
Signed-off-by: NYi Yang <yi.y.yang@intel.com>
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1f0b7744

vxlan: factor out VXLAN-GPE next protocol · fa20e0e3

由 Jiri Benc 提交于 8月 28, 2017

The values are shared between VXLAN-GPE and NSH. Originally probably by
coincidence but I notified both working groups about this last year and they
seem to keep the values in sync since then.

Hopefully they'll get a single IANA registry for the values, too. (I asked
them for that.)

Factor out the code to be shared by the NSH implementation.

NSH and MPLS values are added in this patch, too. For MPLS, the drafts
incorrectly assign only a single value, while we have two MPLS ethertypes.
I raised the problem with both groups. For now, I assume the value is for
unicast.
Signed-off-by: NJiri Benc <jbenc@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fa20e0e3

29 8月, 2017 6 次提交

rxrpc: Allow failed client calls to be retried · c038a58c

由 David Howells 提交于 8月 29, 2017

Allow a client call that failed on network error to be retried, provided
that the Tx queue still holds DATA packet 1.  This allows an operation to
be submitted to another server or another address for the same server
without having to repackage and re-encrypt the data so far processed.

Two new functions are provided:

 (1) rxrpc_kernel_check_call() - This is used to find out the completion
     state of a call to guess whether it can be retried and whether it
     should be retried.

 (2) rxrpc_kernel_retry_call() - Disconnect the call from its current
     connection, reset the state and submit it as a new client call to a
     new address.  The new address need not match the previous address.

A call may be retried even if all the data hasn't been loaded into it yet;
a partially constructed will be retained at the same point it was at when
an error condition was detected.  msg_data_left() can be used to find out
how much data was packaged before the error occurred.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

c038a58c

rxrpc: Add notification of end-of-Tx phase · e833251a

由 David Howells 提交于 8月 29, 2017

Add a callback to rxrpc_kernel_send_data() so that a kernel service can get
a notification that the AF_RXRPC call has transitioned out the Tx phase and
is now waiting for a reply or a final ACK.

This is called from AF_RXRPC with the call state lock held so the
notification is guaranteed to come before any reply is passed back.

Further, modify the AFS filesystem to make use of this so that we don't have
to change the afs_call state before sending the last bit of data.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

e833251a

net/ncsi: Configure VLAN tag filter · 21acf630

由 Samuel Mendoza-Jonas 提交于 8月 28, 2017

Make use of the ndo_vlan_rx_{add,kill}_vid callbacks to have the NCSI
stack process new VLAN tags and configure the channel VLAN filter
appropriately.
Several VLAN tags can be set and a "Set VLAN Filter" packet must be sent
for each one, meaning the ncsi_dev_state_config_svf state must be
repeated. An internal list of VLAN tags is maintained, and compared
against the current channel's ncsi_channel_filter in order to keep track
within the state. VLAN filters are removed in a similar manner, with the
introduction of the ncsi_dev_state_config_clear_vids state. The maximum
number of VLAN tag filters is determined by the "Get Capabilities"
response from the channel.
Signed-off-by: NSamuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

21acf630

irda: move include/net/irda into staging subdirectory · 5bf916ee

由 Greg Kroah-Hartman 提交于 8月 27, 2017

And finally, move the irda include files into
drivers/staging/irda/include/net/irda.  Yes, it's a long path, but it
makes it easy for us to just add a Makefile directory path addition and
all of the net and drivers code "just works".
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5bf916ee

ipv6: fix sparse warning on rt6i_node · 4e587ea7

由 Wei Wang 提交于 8月 25, 2017

Commit c5cff856 adds rcu grace period before freeing fib6_node. This
generates a new sparse warning on rt->rt6i_node related code:
  net/ipv6/route.c:1394:30: error: incompatible types in comparison
  expression (different address spaces)
  ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
  expression (different address spaces)

This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
rcu API is used for it.
After this fix, sparse no longer generates the above warning.

Fixes: c5cff856 ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e587ea7

gre: add collect_md mode to ERSPAN tunnel · 1a66a836

由 William Tu 提交于 8月 25, 2017

Similar to gre, vxlan, geneve, ipip tunnels, allow ERSPAN tunnels to
operate in 'collect metadata' mode. bpf_skb_[gs]et_tunnel_key() helpers
can make use of it right away. OVS can use it as well in the future.
Signed-off-by: NWilliam Tu <u9012063@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a66a836

28 8月, 2017 1 次提交

netfilter: convert hook list to an array · 960632ec

由 Aaron Conole 提交于 8月 24, 2017

This converts the storage and layout of netfilter hook entries from a
linked list to an array.  After this commit, hook entries will be
stored adjacent in memory.  The next pointer is no longer required.

The ops pointers are stored at the end of the array as they are only
used in the register/unregister path and in the legacy br_netfilter code.

nf_unregister_net_hooks() is slower than needed as it just calls
nf_unregister_net_hook in a loop (i.e. at least n synchronize_net()
calls), this will be addressed in followup patch.

Test setup:
 - ixgbe 10gbit
 - netperf UDP_STREAM, 64 byte packets
 - 5 hooks: (raw + mangle prerouting, mangle+filter input, inet filter):
empty mangle and raw prerouting, mangle and filter input hooks:
353.9
this patch:
364.2
Signed-off-by: NAaron Conole <aconole@bytheb.org>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

960632ec