提交 · 22e6c58b8c2843337ec4e8464b1ce6e869ca5bf4 · openeuler / Kernel

16 10月, 2018 40 次提交

netlink: Add answer_flags to netlink_callback · 22e6c58b

由 David Ahern 提交于 10月 15, 2018

With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
flag is set on a message back to the user if the data returned is
influenced by some input attributes. Normally this can be done as
messages are added to the skb, but if the filter results in no data
being returned, the user could be confused as to why.

This patch adds answer_flags to the netlink_callback allowing dump
handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
NLMSG_DONE message ensuring the flag gets back to the user.

The netlink_callback space is initialized to 0 via a memset in
__netlink_dump_start, so init of the new answer_flags is covered.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

22e6c58b

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · e8567951

由 David S. Miller 提交于 10月 15, 2018

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-10-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Convert BPF sockmap and kTLS to both use a new sk_msg API and enable
   sk_msg BPF integration for the latter, from Daniel and John.

2) Enable BPF syscall side to indicate for maps that they do not support
   a map lookup operation as opposed to just missing key, from Prashant.

3) Add bpftool map create command which after map creation pins the
   map into bpf fs for further processing, from Jakub.

4) Add bpftool support for attaching programs to maps allowing sock_map
   and sock_hash to be used from bpftool, from John.

5) Improve syscall BPF map update/delete path for map-in-map types to
   wait a RCU grace period for pending references to complete, from Daniel.

6) Couple of follow-up fixes for the BPF socket lookup to get it
   enabled also when IPv6 is compiled as a module, from Joe.

7) Fix a generic-XDP bug to handle the case when the Ethernet header
   was mangled and thus update skb's protocol and data, from Jesper.

8) Add a missing BTF header length check between header copies from
   user space, from Wenwen.

9) Minor fixups in libbpf to use __u32 instead u32 types and include
   proper perf_event.h uapi header instead of perf internal one, from Yonghong.

10) Allow to pass user-defined flags through EXTRA_CFLAGS and EXTRA_LDFLAGS
    to bpftool's build, from Jiri.

11) BPF kselftest tweaks to add LWTUNNEL to config fragment and to install
    with_addr.sh script from flow dissector selftest, from Anders.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e8567951

net: phy: merge phy_start_aneg and phy_start_aneg_priv · c45d7150

由 Heiner Kallweit 提交于 10月 15, 2018

After commit 9f2959b6 ("net: phy: improve handling delayed work")
the sync parameter isn't needed any longer in phy_start_aneg_priv().
This allows to merge phy_start_aneg() and phy_start_aneg_priv().
Signed-off-by: NHeiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c45d7150

hv_netvsc: fix vf serial matching with pci slot info · 00547955

由 Haiyang Zhang 提交于 10月 15, 2018

The VF device's serial number is saved as a string in PCI slot's
kobj name, not the slot->number. This patch corrects the netvsc
driver, so the VF device can be successfully paired with synthetic
NIC.

Fixes: 00d7ddba ("hv_netvsc: pair VF based on serial number")
Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NHaiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: NStephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00547955

Merge branch 'tcp-second-round-for-EDT-conversion' · b1394967

由 David S. Miller 提交于 10月 15, 2018

Eric Dumazet says:

====================
tcp: second round for EDT conversion

First round of EDT patches left TCP stack in a non optimal state.

- High speed flows suffered from loss of performance, addressed
  by the first patch of this series.

- Second patch brings pacing to the current state of networking,
  since we now reach ~100 Gbit on a single TCP flow.

- Third patch implements a mitigation for scheduling delays,
  like the one we did in sch_fq in the past.

- Fourth patch removes one special case in sch_fq for ACK packets.

- Fifth patch removes a serious perfomance cost for TCP internal
  pacing. We should setup the high resolution timer only if
  really needed.

- Sixth patch fixes a typo in BBR.

- Last patch is one minor change in cdg congestion control.

Neal Cardwell also has a patch series fixing BBR after
EDT adoption.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b1394967

tcp: cdg: use tcp high resolution clock cache · 825e1c52

由 Eric Dumazet 提交于 10月 15, 2018

We store in tcp socket a cache of most recent high resolution
clock, there is no need to call local_clock() again, since
this cache is good enough.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

825e1c52

tcp_bbr: fix typo in bbr_pacing_margin_percent · 97ec3eb3

由 Neal Cardwell 提交于 10月 15, 2018

There was a typo in this parameter name.
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97ec3eb3

tcp: optimize tcp internal pacing · 864e5c09

由 Eric Dumazet 提交于 10月 15, 2018

When TCP implements its own pacing (when no fq packet scheduler is used),
it is arming high resolution timer after a packet is sent.

But in many cases (like TCP_RR kind of workloads), this high resolution
timer expires before the application attempts to write the following
packet. This overhead also happens when the flow is ACK clocked and
cwnd limited instead of being limited by the pacing rate.

This leads to extra overhead (high number of IRQ)

Now tcp_wstamp_ns is reserved for the pacing timer only
(after commit "tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh"),
we can setup the timer only when a packet is about to be sent,
and if tcp_wstamp_ns is in the future.

This leads to a ~10% performance increase in TCP_RR workloads.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

864e5c09

net_sched: sch_fq: no longer use skb_is_tcp_pure_ack() · 7baf33bd

由 Eric Dumazet 提交于 10月 15, 2018

With the new EDT model, sch_fq no longer has to special
case TCP pure acks, since their skb->tstamp will allow them
being sent without pacing delay.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7baf33bd

tcp: mitigate scheduling jitter in EDT pacing model · a7a25630

由 Eric Dumazet 提交于 10月 15, 2018

In commit fefa569a ("net_sched: sch_fq: account for schedule/timers
drifts") we added a mitigation for scheduling jitter in fq packet scheduler.

This patch does the same in TCP stack, now it is using EDT model.

Note that this mitigation is valid for both external (fq packet scheduler)
or internal TCP pacing.

This uses the same strategy than the above commit, allowing
a time credit of half the packet currently sent.

Consider following case :

An skb is sent, after an idle period of 300 usec.
The air-time (skb->len/pacing_rate) is 500 usec
Instead of setting the pacing timer to now+500 usec,
it will use now+min(500/2, 300) -> now+250usec

This is like having a token bucket with a depth of half
an skb.

Tested:

tc qdisc replace dev eth0 root pfifo_fast

Before
netperf -P0 -H remote -- -q 1000000000 # 8000Mbit
540000 262144 262144    10.00    7710.43

After :
netperf -P0 -H remote -- -q 1000000000 # 8000 Mbit
540000 262144 262144    10.00    7999.75   # Much closer to 8000Mbit target
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a7a25630

net: extend sk_pacing_rate to unsigned long · 76a9ebe8

由 Eric Dumazet 提交于 10月 15, 2018

sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.

We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.

This patch adds no cost for 32bit kernels.

The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.

Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.

State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                 timer:(on,003ms,0) ino:91863 sk:2 <->
 skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
 ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
 rcvmss:536 advmss:1448
 cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
 segs_in:3916318 data_segs_out:177279175
 bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
 send 28045.5Mbps lastrcv:73333
 pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
 busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
 notsent:2085120 minrtt:0.013
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

76a9ebe8

tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh · 5f6188a8

由 Eric Dumazet 提交于 10月 15, 2018

In EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.

This causes major regressions at high speed flows.

Introduce tcp_clock_cache to store last tcp_clock_ns().
This is needed because some arches have slow high-resolution
kernel time service.

tcp_wstamp_ns is only updated when a packet is sent.

Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.

Fixes: 9799ccb0 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5f6188a8

net: bridge: fix a possible memory leak in __vlan_add · 1a3aea25

由 Li RongQing 提交于 10月 15, 2018

After per-port vlan stats, vlan stats should be released
when fail to add vlan

Fixes: 9163a0fc ("net: bridge: add support for per-port vlan stats")
CC: bridge@lists.linux-foundation.org
cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NZhang Yu <zhangyu31@baidu.com>
Signed-off-by: NLi RongQing <lirongqing@baidu.com>
Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1a3aea25

rxrpc: Add /proc/net/rxrpc/peers to display peer list · bc0e7cf4

由 David Howells 提交于 10月 15, 2018

Add /proc/net/rxrpc/peers to display the list of peers currently active.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bc0e7cf4

fore200e: fix missing unlock on error in bsq_audit() · d275444c

由 Wei Yongjun 提交于 10月 15, 2018

Add the missing unlock before return from function bsq_audit()
in the error handling case.

Fixes: 1d9d8be9 ("fore200e: check for dma mapping failures")
Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d275444c

Merge branch 'bnxt_en-Add-support-for-new-57500-chips' · 65f2247d

由 David S. Miller 提交于 10月 15, 2018

Michael Chan says:

====================
bnxt_en: Add support for new 57500 chips.

This patch-set is larger than normal because I wanted a complete series
to add basic support for the new 57500 chips.  The new chips have the
following main differences compared to legacy chips:

1. Requires the PF driver to allocate DMA context memory as a backing
store.
2. New NQ (notification queue) for interrupt events.
3. One or more CP rings can be associated with an NQ.
4. 64-bit doorbells.

Most other structures and firmware APIs are compatible with legacy
devices with some exceptions.  For example, ring groups are no longer
used and RSS table format has changed.

The patch-set includes the usual firmware spec. update, some refactoring
and restructuring, and adding the new code to add basic support for the
new class of devices.
====================
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

65f2247d

bnxt_en: Add PCI ID for BCM57508 device. · 1ab968d2